NlpTools

Natural language processing in PHP

Introducing Latent Dirichlet Allocation Apr 30th, 2013

In this article I will introduce work that is still in progress in NlpTools: Latent Dirichlet Allocation (LDA) with batch Gibbs sampling. I have not updated the API docs yet, so for now the only way to see the API is to look at the code.
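To give an idea of what the Gibbs sampler does at each step, here is a standalone sketch (illustrative only, not the NlpTools internals; the function and array names are my own) of the collapsed Gibbs update for a single token, following Griffiths and Steyvers: a token's topic is resampled proportionally to how often its word already appears in each topic and how prevalent each topic already is in its document.

```php
<?php
// Illustrative sketch of one collapsed Gibbs update for a single token.
// The count arrays are assumed to already exclude the token being resampled.
//   $wordTopic[$w][$j] : times word $w is assigned to topic $j (corpus-wide)
//   $topicTotal[$j]    : total tokens assigned to topic $j
//   $docTopic[$j]      : tokens of the current document assigned to topic $j
//   $W, $T             : vocabulary size and number of topics
//   $a, $b             : the Dirichlet priors alpha and beta
function sample_topic($w, $wordTopic, $topicTotal, $docTopic, $W, $T, $a, $b) {
    $p = array();
    $sum = 0.0;
    for ($j = 0; $j < $T; $j++) {
        // p(z_i = j | z_-i, w)  up to a constant factor
        $p[$j] = ($wordTopic[$w][$j] + $b) / ($topicTotal[$j] + $W * $b)
               * ($docTopic[$j] + $a);
        $sum += $p[$j];
    }
    // draw a topic from the unnormalized discrete distribution
    $u = $sum * mt_rand() / mt_getrandmax();
    for ($j = 0; $j < $T; $j++) {
        $u -= $p[$j];
        if ($u <= 0) {
            return $j;
        }
    }
    return $T - 1;
}
```

A full batch sweep simply applies this update to every token of every document, updating the counts as it goes.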

The implementation is based on the paper by Griffiths and Steyvers. With this new model, a new namespace is also introduced in NlpTools (although it is arguable whether it will stay): NlpTools\Random. It provides useful constructs for dealing with random distributions. As always, it is coded on an as-needed basis, so only the Gamma, Dirichlet and Normal distributions are implemented (based largely on http://www.johndcook.com/SimpleRNG.cpp).
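The Dirichlet is the one distribution LDA leans on most, and it reduces to the Gamma: normalizing independent Gamma draws yields a Dirichlet sample. The sketch below is standalone code to illustrate that idea (it is not the NlpTools\Random API, whose class names I won't guess here); for simplicity it assumes an integer shape parameter, so each Gamma(k, 1) draw is just a sum of k exponentials.

```php
<?php
// Standalone sketch: sampling from a symmetric Dirichlet by
// normalizing independent Gamma draws (integer shape only).

// Gamma(k, 1) for integer k, as a sum of k exponential variates
function sample_gamma_int($k) {
    $sum = 0.0;
    for ($i = 0; $i < $k; $i++) {
        $sum += -log(mt_rand(1, mt_getrandmax()) / mt_getrandmax());
    }
    return $sum;
}

// a point on the ($dim - 1)-simplex from Dirichlet(alpha, ..., alpha)
function sample_dirichlet($alpha, $dim) {
    $g = array();
    for ($i = 0; $i < $dim; $i++) {
        $g[] = sample_gamma_int($alpha);
    }
    $norm = array_sum($g);
    return array_map(function ($x) use ($norm) {
        return $x / $norm;
    }, $g);
}

// five non-negative numbers summing to 1
$p = sample_dirichlet(1, 5);
```

With alpha = 1 (the prior used in the usage example below) this is a uniform draw from the simplex; larger alphas concentrate the mass toward the uniform distribution.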

I would also like to mention Mathieu Blondel's Python implementation, which helped me a great deal in debugging mine and from which I took the log likelihood implementation.

A graphical example

The implementation comes with an accompanying test: the beautiful graphical example that Griffiths and Steyvers use in their own paper.

We create a set of documents (images) while sticking to the model's assumptions, sampling each topic from a Dirichlet distribution.
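Generating such documents follows the LDA generative story directly: each document draws a topic mixture from a Dirichlet, then each word position draws a topic from that mixture and a word from the chosen topic's word distribution. A hedged sketch (names are illustrative, not the actual test code):

```php
<?php
// Sketch of the LDA generative process used to create the example documents.

// draw an index from a discrete probability distribution
function draw_discrete($probs) {
    $u = mt_rand() / mt_getrandmax();
    foreach ($probs as $k => $p) {
        $u -= $p;
        if ($u <= 0) {
            return $k;
        }
    }
    return count($probs) - 1;
}

// $phi[$j] is topic $j's distribution over the vocabulary,
// $theta is this document's distribution over topics
function generate_document($phi, $theta, $len) {
    $words = array();
    for ($i = 0; $i < $len; $i++) {
        $topic = draw_discrete($theta);   // pick a topic for this position
        $words[] = draw_discrete($phi[$topic]); // pick a word from that topic
    }
    return $words;
}

// two toy "topics" over a 4-word vocabulary
$phi = array(
    array(0.5, 0.5, 0.0, 0.0),
    array(0.0, 0.0, 0.5, 0.5),
);
$doc = generate_document($phi, array(0.9, 0.1), 100);
```

In the graphical test the "words" are pixels and each topic is an image, which is what makes the recovered topics so easy to inspect visually.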

Topics:

Examples of generated documents:

Topics found in 1st iteration (log likelihood -278203.915):

Topics found in 5th iteration (log likelihood -205116.986):

Topics found in 50th iteration (log likelihood -133652.225):
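For reference, the log likelihood reported above is, as I read the paper, the quantity $\log P(\mathbf{w} \mid \mathbf{z})$ from Griffiths and Steyvers (with $T$ topics, vocabulary size $W$, $n_j^{(w)}$ the number of times word $w$ is assigned to topic $j$, and $n_j^{(\cdot)}$ the total count for topic $j$):

```latex
\log P(\mathbf{w} \mid \mathbf{z}) =
T \left( \log \Gamma(W\beta) - W \log \Gamma(\beta) \right)
+ \sum_{j=1}^{T} \left( \sum_{w=1}^{W} \log \Gamma\!\left(n_j^{(w)} + \beta\right)
- \log \Gamma\!\left(n_j^{(\cdot)} + W\beta\right) \right)
```

It increases (toward zero) as the sampler converges, which is exactly the trend visible across the iterations above.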

Usage

For example, the following code finds the topics of an array of text documents.

use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
use NlpTools\Models\Lda;

// the list of documents is assumed to be in a file that is passed
// as the first parameter
if (file_exists($argv[1]))
    $docs = file($argv[1]);
else
    die(1);

$tok = new WhitespaceTokenizer();
$tset = new TrainingSet();
foreach ($docs as $f) {
    $tset->addDocument(
        '', // the class is not used by the lda model
        new TokensDocument(
            $tok->tokenize($f)
        )
    );
}

$lda = new Lda(
    new DataAsFeatures(), // a feature factory to transform the document data
    5, // the number of topics we want
    1, // the dirichlet prior assumed for the per document topic distribution
    1  // the dirichlet prior assumed for the per word topic distribution
);

// run the sampler 50 times
$lda->train($tset, 50);

// synonymous to calling getPhi() as per Griffiths and Steyvers
// it returns a mapping of words to probabilities for each topic
// ex.:
// Array(
//     [0] => Array(
//         [word1] => 0.0013...
//         ....................
//         [wordn] => 0.0001...
//     ),
//     [1] => Array(
//         ....
//     )
// )
// $lda->getPhi(10)
// just the 10 largest probabilities
print_r(
    $lda->getWordsPerTopicsProbabilities(10)
);