Introducing Latent Dirichlet Allocation
Apr 30th, 2013
In this article I will introduce work that is still in progress in NlpTools: Latent Dirichlet Allocation with batch Gibbs sampling. I haven't updated the API docs yet, so for now the only way to see the API is to look at the code.
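For reference, a collapsed Gibbs sampler for LDA repeatedly resamples the topic assignment of every word position from the conditional given in the Griffiths and Steyvers paper (the notation below is the paper's, not necessarily the code's):

\[
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
\]

where \(n^{(w_i)}_{-i,j}\) is the number of times word \(w_i\) is assigned to topic \(j\) (excluding position \(i\)), \(n^{(d_i)}_{-i,j}\) is the number of words in document \(d_i\) assigned to topic \(j\), \(W\) is the vocabulary size and \(T\) the number of topics.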
The implementation is based on the paper by Griffiths and Steyvers. This new model also introduces a new namespace to NlpTools (although it is arguable whether it will stay): NlpTools\Random. It provides useful constructs for dealing with random distributions. As always, it is coded on an as-needed basis, so only the Gamma, Dirichlet and Normal distributions are implemented (largely based on http://www.johndcook.com/SimpleRNG.cpp).
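The Random namespace's API is not settled yet, so rather than quote it, here is a minimal standalone sketch of the standard trick such classes rely on: a Dirichlet sample is a vector of independent Gamma(alpha_i, 1) draws normalized to sum to one. All helper names below are hypothetical, not the NlpTools\Random API.

// Hypothetical helpers, NOT the NlpTools\Random API.

// Standard normal via the Box-Muller transform.
function sampleNormal() {
    $u1 = (mt_rand() + 1) / (mt_getrandmax() + 1); // avoid log(0)
    $u2 = mt_rand() / mt_getrandmax();
    return sqrt(-2.0 * log($u1)) * cos(2.0 * M_PI * $u2);
}

// Gamma(shape, 1) via Marsaglia & Tsang (2000), the method SimpleRNG uses.
function sampleGamma($shape) {
    if ($shape < 1) { // boost trick for shape < 1
        $u = (mt_rand() + 1) / (mt_getrandmax() + 1);
        return sampleGamma($shape + 1) * pow($u, 1.0 / $shape);
    }
    $d = $shape - 1.0 / 3.0;
    $c = 1.0 / sqrt(9.0 * $d);
    while (true) {
        do {
            $x = sampleNormal();
            $v = 1.0 + $c * $x;
        } while ($v <= 0);
        $v = $v * $v * $v;
        $u = (mt_rand() + 1) / (mt_getrandmax() + 1);
        if (log($u) < 0.5 * $x * $x + $d * (1.0 - $v + log($v))) {
            return $d * $v;
        }
    }
}

// Dirichlet(alpha) = normalized vector of Gamma(alpha_i, 1) draws.
function sampleDirichlet(array $alpha) {
    $g = array_map('sampleGamma', $alpha);
    $sum = array_sum($g);
    return array_map(function ($x) use ($sum) {
        return $x / $sum;
    }, $g);
}

print_r(sampleDirichlet(array(1, 1, 1, 1, 1))); // a point on the 4-simplex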
I would also like to mention Mathieu Blondel's Python implementation, which helped a lot in debugging mine and from which I took the log likelihood implementation.
A graphical example
The implementation comes with an accompanying test: the beautiful graphical example used by Griffiths and Steyvers in their own paper.
We create a set of documents (images) sticking to the model's assumptions, sampling each document's topic distribution from a Dirichlet distribution (a sketch of the generative process follows the example images below).
Topics:
Examples of generated documents:
Topics found in 1st iteration (log likelihood -278203.915):
Topics found in 5th iteration (log likelihood -205116.986):
Topics found in 50th iteration (log likelihood -133652.225):
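For the curious, here is roughly how such documents can be generated. The 10 topics are the 5 horizontal and 5 vertical bars of a 5x5 pixel grid, so the "vocabulary" has 25 words (pixels). This is an illustrative sketch reusing the sampleDirichlet() helper from the earlier snippet, not the actual test code; the document length and priors are arbitrary choices.

// Draw one "image" document under the LDA generative model.
// Illustrative only; uses sampleDirichlet() from the sketch above.
function sampleBarsDocument($nWords = 100, $alpha = 1.0) {
    // per-document topic mixture theta ~ Dirichlet(alpha)
    $theta = sampleDirichlet(array_fill(0, 10, $alpha));
    $counts = array_fill(0, 25, 0); // 5x5 grid flattened to 25 pixel counts
    for ($i = 0; $i < $nWords; $i++) {
        $z = sampleCategorical($theta); // pick a topic for this word
        $offset = mt_rand(0, 4);        // each bar is uniform over its 5 pixels
        $w = ($z < 5)
            ? 5 * $z + $offset        // topics 0-4: horizontal bars (rows)
            : ($z - 5) + 5 * $offset; // topics 5-9: vertical bars (columns)
        $counts[$w]++;
    }
    return $counts;
}

// Draw an index from a discrete probability vector.
function sampleCategorical(array $p) {
    $u = mt_rand() / mt_getrandmax();
    $cum = 0.0;
    foreach ($p as $i => $pi) {
        $cum += $pi;
        if ($u <= $cum) {
            return $i;
        }
    }
    return count($p) - 1; // guard against floating point round-off
}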
Usage
For example, the following code finds the topics of an array of text documents.
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
use NlpTools\Models\Lda;

// supposedly the list of documents is in a file that is passed
// as the first parameter (one document per line)
$docs = file($argv[1], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$tok = new WhitespaceTokenizer();
$tset = new TrainingSet();
foreach ($docs as $d) {
    $tset->addDocument(
        '', // the class is not used by the lda model
        new TokensDocument(
            $tok->tokenize($d)
        )
    );
}

$lda = new Lda(
    new DataAsFeatures(), // a feature factory to transform the document data
    5, // the number of topics we want
    1, // the dirichlet prior assumed for the per document topic distribution
    1  // the dirichlet prior assumed for the per word topic distribution
);

// run the sampler 50 times
$lda->train($tset, 50);

// synonymous to calling getPhi() as per Griffiths and Steyvers;
// it returns a mapping of words to probabilities for each topic, ex.:
// Array(
//     [0] => Array(
//         [word1] => 0.0013...
//         ....................
//         [wordn] => 0.0001...
//     ),
//     [1] => Array(
//         ....
//     )
// )
// $lda->getPhi(10) // synonymous call
print_r(
    $lda->getWordsPerTopicsProbabilities(10) // just the 10 largest probabilities
);
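Assuming the return structure shown in the comments above, printing the ten most probable words of each topic is then a simple nested loop:

// Pretty-print the top words of every topic.
foreach ($lda->getWordsPerTopicsProbabilities(10) as $topic => $words) {
    echo "Topic {$topic}\n";
    foreach ($words as $word => $p) {
        printf("    %-20s %.4f\n", $word, $p);
    }
}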