NlpTools API

NlpTools\Models\Lda

class Lda

Topic discovery with latent Dirichlet allocation (LDA) using Gibbs sampling.

The implementation is based on the paper by Griffiths and Steyvers,
available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC387300/.

It is also heavily influenced (especially in the implementation and
debugging of the online Gibbs sampler) by the Python implementation
by Mathieu Blondel at https://gist.github.com/mblondel/542786.

Methods

__construct(FeatureFactoryInterface $ff, integer $ntopics, float $a = 1, float $b = 1)

generateDocs(TrainingSet $tset)

Generate an array suitable for use with Lda::initialize and Lda::gibbsSample from a training set.

initialize(array $docs)

Count the initial co-occurrences of (document, topic) and (topic, word) pairs and cache them so that Gibbs sampling runs faster.

train(TrainingSet $tset, $it)

Run the Gibbs sampler $it times.

gibbsSample(array $docs)

Generate one Gibbs sample.

array getWordsPerTopicsProbabilities($limit_words = -1)

Get the probability of a word given a topic (phi according to Griffiths and Steyvers)

getPhi($limit_words = -1)

Shortcut to getWordsPerTopicsProbabilities

array getDocumentsPerTopicsProbabilities($limit_docs = -1)

Get the probability of a document given a topic (theta according to Griffiths and Steyvers)

getTheta($limit_docs = -1)

Shortcut to getDocumentsPerTopicsProbabilities

getLogLikelihood()

Log likelihood of the model having generated the data as implemented by M. Blondel.

Details

at line 45
public __construct(FeatureFactoryInterface $ff, integer $ntopics, float $a = 1, float $b = 1)

Parameters

FeatureFactoryInterface $ff The feature factory will be applied to each document and the resulting feature array will be considered as a document for LDA
integer $ntopics The number of topics assumed by the model
float $a The Dirichlet prior assumed for the per-document topic distribution
float $b The Dirichlet prior assumed for the per-topic word distribution

at line 60
public generateDocs(TrainingSet $tset)

Generate an array suitable for use with Lda::initialize and Lda::gibbsSample from a training set.

Parameters

TrainingSet $tset

at line 75
public initialize(array $docs)

Count the initial co-occurrences of (document, topic) and (topic, word) pairs and cache them so that Gibbs sampling runs faster.

Parameters

array $docs The docs that we will use to generate the sample
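The counting step described above can be sketched in plain PHP. This is a minimal illustration, not the NlpTools source: the helper name `initCounts`, the layout of the `$state` array, and the random initial topic assignment (the approach taken in Blondel's Python gist) are all assumptions made for the example.

```php
<?php
// Hypothetical helper, not part of NlpTools: assigns a random topic to every
// token and builds the count tables that a Gibbs sampler reads and updates.
function initCounts(array $docs, int $ntopics): array
{
    $state = [
        'z'          => [],                          // z[doc][position] = topic of that token
        'docTopic'   => [],                          // docTopic[doc][topic] = tokens of doc in topic
        'topicWord'  => array_fill(0, $ntopics, []), // topicWord[topic][word] = occurrences
        'topicTotal' => array_fill(0, $ntopics, 0),  // topicTotal[topic] = tokens in topic
    ];
    foreach ($docs as $d => $words) {
        $state['docTopic'][$d] = array_fill(0, $ntopics, 0);
        $state['z'][$d] = [];
        foreach ($words as $i => $w) {
            $t = mt_rand(0, $ntopics - 1);           // random initial assignment (assumption)
            $state['z'][$d][$i] = $t;
            $state['docTopic'][$d][$t]++;
            $state['topicWord'][$t][$w] = ($state['topicWord'][$t][$w] ?? 0) + 1;
            $state['topicTotal'][$t]++;
        }
    }
    return $state;
}

// Two toy documents, already reduced to arrays of features (words).
$docs = [['apple', 'banana', 'apple'], ['cpu', 'gpu', 'cpu', 'ram']];
$state = initCounts($docs, 2);
```

Caching these tables means each sampling step only has to decrement and increment a few counters instead of recounting the whole corpus.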

at line 135
public train(TrainingSet $tset, $it)

Run the Gibbs sampler $it times.

Parameters

TrainingSet $tset The docs to run LDA on
$it The number of iterations to run
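A possible end-to-end use of the constructor, `train`, and `getPhi` is sketched below. The wrapper `discoverTopics` is hypothetical, and the helper classes (`WhitespaceTokenizer`, `TokensDocument`, `DataAsFeatures`) are not described in this section; they are assumed to be available from the same library with the usage shown. The prior values of 1 are simply the documented defaults.

```php
<?php
// Assumes NlpTools is installed and autoloaded (e.g. through Composer)
// before discoverTopics() is called.
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Models\Lda;
use NlpTools\Tokenizers\WhitespaceTokenizer;

// Hypothetical wrapper (not part of NlpTools): trains LDA on raw strings
// and returns the top words of every topic.
function discoverTopics(array $texts, int $ntopics, int $iterations, int $topWords = 10): array
{
    $tokenizer = new WhitespaceTokenizer();
    $tset = new TrainingSet();
    foreach ($texts as $text) {
        // The class label is left empty on the assumption that LDA,
        // being unsupervised, does not use it.
        $tset->addDocument('', new TokensDocument($tokenizer->tokenize($text)));
    }

    // DataAsFeatures is assumed to pass each document's tokens through unchanged.
    $lda = new Lda(new DataAsFeatures(), $ntopics, 1, 1);
    $lda->train($tset, $iterations);

    return $lda->getPhi($topWords);
}
```

If finer control is needed (for example, logging the likelihood between iterations), the documented alternative is to call `generateDocs`, `initialize`, and then `gibbsSample` in a loop yourself.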

at line 153
public gibbsSample(array $docs)

Generate one Gibbs sample.

The docs must have been passed to initialize before calling
this function.

Parameters

array $docs The docs that we will use to generate the sample
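To make "one Gibbs sample" concrete, the sketch below performs a single collapsed Gibbs sweep over all tokens, following the conditional distribution given by Griffiths and Steyvers. It is an illustration under assumptions, not the library's code: the function `gibbsSweep`, the `$s` state layout, and the hand-built toy counts are hypothetical.

```php
<?php
// Hypothetical sketch, not NlpTools code: one collapsed Gibbs sweep.
// $s holds z[doc][pos], docTopic[doc][topic], topicWord[topic][word], topicTotal[topic].
function gibbsSweep(array $docs, array &$s, int $ntopics, int $vocabSize, float $a, float $b): void
{
    foreach ($docs as $d => $words) {
        foreach ($words as $i => $w) {
            // Remove the token's current assignment from the counts.
            $old = $s['z'][$d][$i];
            $s['docTopic'][$d][$old]--;
            $s['topicWord'][$old][$w]--;
            $s['topicTotal'][$old]--;

            // Unnormalised conditional p(z = t | all other assignments).
            // The per-document denominator is the same for every topic, so it is dropped.
            $p = [];
            $sum = 0.0;
            for ($t = 0; $t < $ntopics; $t++) {
                $nw = $s['topicWord'][$t][$w] ?? 0;
                $p[$t] = ($nw + $b) / ($s['topicTotal'][$t] + $vocabSize * $b)
                       * ($s['docTopic'][$d][$t] + $a);
                $sum += $p[$t];
            }

            // Draw the new topic in proportion to those weights.
            $r = mt_rand() / mt_getrandmax() * $sum;
            $new = $ntopics - 1;
            for ($t = 0; $t < $ntopics; $t++) {
                $r -= $p[$t];
                if ($r <= 0) {
                    $new = $t;
                    break;
                }
            }

            // Record the new assignment.
            $s['z'][$d][$i] = $new;
            $s['docTopic'][$d][$new]++;
            $s['topicWord'][$new][$w] = ($s['topicWord'][$new][$w] ?? 0) + 1;
            $s['topicTotal'][$new]++;
        }
    }
}

// Toy corpus with hand-built initial counts (2 topics, 3 distinct words).
$docs = [['apple', 'banana'], ['cpu', 'apple']];
$s = [
    'z'          => [[0, 1], [1, 0]],
    'docTopic'   => [[1, 1], [1, 1]],
    'topicWord'  => [['apple' => 2], ['banana' => 1, 'cpu' => 1]],
    'topicTotal' => [2, 2],
];
for ($it = 0; $it < 5; $it++) {
    gibbsSweep($docs, $s, 2, 3, 1.0, 1.0);
}
```

The sweep changes the topic assignments but never the number of tokens, so the count tables stay consistent with each other after every call.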

at line 192
public array getWordsPerTopicsProbabilities($limit_words = -1)

Get the probability of a word given a topic (phi according to Griffiths and Steyvers)

Parameters

$limit_words Limit the results to the top n words

Return Value

array A two dimensional array that contains the probabilities for each topic
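As a rough illustration of what such a two-dimensional array might contain, the sketch below estimates phi from topic/word counts with the smoothed ratio used by Griffiths and Steyvers, (count + b) / (topic total + W * b), where W is the vocabulary size. The function `phiFromCounts`, its parameters, and the toy counts are hypothetical; the exact array layout returned by this class may differ.

```php
<?php
// Hypothetical helper, not part of NlpTools:
// phi[topic][word] = (n_topic_word + b) / (n_topic + W * b)
function phiFromCounts(array $topicWord, array $vocab, float $b, int $limitWords = -1): array
{
    $W = count($vocab);
    $phi = [];
    foreach ($topicWord as $t => $counts) {
        $total = array_sum($counts);
        $phi[$t] = [];
        foreach ($vocab as $w) {
            $phi[$t][$w] = (($counts[$w] ?? 0) + $b) / ($total + $W * $b);
        }
        if ($limitWords > 0) {
            arsort($phi[$t]);                                      // most probable words first
            $phi[$t] = array_slice($phi[$t], 0, $limitWords, true); // keep the top n
        }
    }
    return $phi;
}

$topicWord = [['apple' => 3, 'banana' => 1], ['cpu' => 2]];
$vocab = ['apple', 'banana', 'cpu'];
$phi = phiFromCounts($topicWord, $vocab, 1.0);    // full distributions
$top = phiFromCounts($topicWord, $vocab, 1.0, 1); // only the top word per topic
```

Without a limit each topic's probabilities sum to 1. With a limit the rows are truncated, not renormalised, so they sum to less than 1.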

at line 219
public getPhi($limit_words = -1)

Shortcut to getWordsPerTopicsProbabilities

Parameters

$limit_words

at line 231
public array getDocumentsPerTopicsProbabilities($limit_docs = -1)

Get the probability of a document given a topic (theta according to Griffiths and Steyvers)

Parameters

$limit_docs Limit the results to the top n docs

Return Value

array A two dimensional array that contains the probabilities for each document

at line 262
public getTheta($limit_docs = -1)

Shortcut to getDocumentsPerTopicsProbabilities

Parameters

$limit_docs

at line 271
public getLogLikelihood()

Log likelihood of the model having generated the data as implemented by M. Blondel.
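For orientation, Blondel's Python gist computes the joint log likelihood of the words and topic assignments from the count tables. Assuming this method mirrors that computation, the quantity can be written as below. The notation is introduced here for illustration: n_k is the vector of word counts for topic k (length W), n_d is the vector of topic counts for document d (length T), a and b are the constructor's priors, and B is the multivariate beta function.

```latex
\log p(\mathbf{w}, \mathbf{z}) =
    \sum_{k=1}^{T} \left[ \log B(\mathbf{n}_k + b) - \log B(b\,\mathbf{1}_W) \right]
  + \sum_{d=1}^{D} \left[ \log B(\mathbf{n}_d + a) - \log B(a\,\mathbf{1}_T) \right],
\qquad
\log B(\mathbf{x}) = \sum_i \log \Gamma(x_i) - \log \Gamma\Big( \sum_i x_i \Big)
```

The value is always negative, and in practice it tends to rise and then level off as sampling proceeds, which makes it a convenient, if rough, convergence check.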