NlpTools\Models\Lda

class Lda

Topic discovery with latent Dirichlet allocation (LDA) using Gibbs sampling.

The implementation is based on the paper by Griffiths and Steyvers,
available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC387300/.

It is also heavily influenced (especially in the implementation and
debugging of the online Gibbs sampler) by the Python implementation
by Mathieu Blondel at https://gist.github.com/mblondel/542786.

Methods

	__construct(FeatureFactoryInterface $ff, integer $ntopics, float $a = 1, float $b = 1)
	generateDocs(TrainingSet $tset) Generate an array suitable for use with Lda::initialize and Lda::gibbsSample from a training set.
	initialize(array $docs) Count the initial co-occurrences of documents/topics and topics/words, and cache them so that Gibbs sampling runs faster.
	train(TrainingSet $tset, $it) Run the Gibbs sampler $it times.
	gibbsSample(array $docs) Generate one Gibbs sample.
array	getWordsPerTopicsProbabilities($limit_words = -1) Get the probability of a word given a topic (phi according to Griffiths and Steyvers)
	getPhi($limit_words = -1) Shortcut to getWordsPerTopicsProbabilities
array	getDocumentsPerTopicsProbabilities($limit_docs = -1) Get the probability of a document given a topic (theta according to Griffiths and Steyvers)
	getTheta($limit_docs = -1) Shortcut to getDocumentsPerTopicsProbabilities
	getLogLikelihood() Log likelihood of the model having generated the data, as implemented by M. Blondel.

Details

at line 45
`public __construct(FeatureFactoryInterface $ff, integer $ntopics, float $a = 1, float $b = 1)`

Parameters

FeatureFactoryInterface	$ff	The feature factory will be applied to each document and the resulting feature array will be considered as a document for LDA
integer	$ntopics	The number of topics assumed by the model
float	$a	The Dirichlet prior assumed for the per-document topic distribution
float	$b	The Dirichlet prior assumed for the per-topic word distribution

at line 60
`public generateDocs(TrainingSet $tset)`

Generate an array suitable for use with Lda::initialize and Lda::gibbsSample from a training set.

Parameters

TrainingSet	$tset

at line 75
`public initialize(array $docs)`

Count the initial co-occurrences of documents/topics and topics/words, and cache them so that Gibbs sampling runs faster.

Parameters

array	$docs	The docs that we will use to generate the sample
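
The counting step described above can be illustrated with a minimal standalone sketch. It is not the library's own code: the function name `countCooccurrences` and the array layouts are hypothetical, and it assumes each document is a plain list of word tokens and that every token starts with a uniformly random topic.

```php
<?php
// Hypothetical standalone sketch; not the library's implementation.
// $docs: list of documents, each a list of word tokens.
function countCooccurrences(array $docs, int $ntopics): array
{
    $assignments = [];                        // $assignments[$d][$i]: topic of token $i in document $d
    $docTopic = [];                           // $docTopic[$d][$t]: tokens of document $d in topic $t
    $topicWord = [];                          // $topicWord[$t][$w]: times word $w is assigned to topic $t
    $topicTotal = array_fill(0, $ntopics, 0); // tokens assigned to each topic

    foreach ($docs as $d => $words) {
        $assignments[$d] = [];
        $docTopic[$d] = array_fill(0, $ntopics, 0);
        foreach ($words as $i => $w) {
            $t = mt_rand(0, $ntopics - 1);    // random initial topic (an assumption)
            $assignments[$d][$i] = $t;
            $docTopic[$d][$t]++;
            $topicWord[$t][$w] = ($topicWord[$t][$w] ?? 0) + 1;
            $topicTotal[$t]++;
        }
    }

    return [$assignments, $docTopic, $topicWord, $topicTotal];
}

$docs = [['apple', 'banana', 'apple'], ['cpu', 'gpu']];
[$z, $docTopic, $topicWord, $topicTotal] = countCooccurrences($docs, 2);
```

Caching these counts means a sampling step only has to decrement and increment a few integers per token instead of rescanning the corpus.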

at line 135
`public train(TrainingSet $tset, $it)`

Run the Gibbs sampler $it times.

Parameters

TrainingSet	$tset	The docs to run lda on
	$it	The number of iterations to run
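
A minimal usage sketch follows. It assumes NlpTools was installed with Composer (the autoload path is an assumption) and that the companion classes `TrainingSet`, `TokensDocument`, `DataAsFeatures` and `WhitespaceTokenizer` are available as in the rest of the library. The sample texts, the topic count and the iteration count are arbitrary illustration values.

```php
<?php
// Assumes a Composer install; adjust the autoload path to your project.
require 'vendor/autoload.php';

use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Models\Lda;
use NlpTools\Tokenizers\WhitespaceTokenizer;

// Illustrative corpus, not real data.
$texts = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'stocks fell as markets closed',
    'markets rallied and stocks rose',
];

$tok = new WhitespaceTokenizer();
$tset = new TrainingSet();
foreach ($texts as $text) {
    // LDA is unsupervised, so the class label is left empty.
    $tset->addDocument('', new TokensDocument($tok->tokenize($text)));
}

// 2 topics, both Dirichlet priors left at 1.
$lda = new Lda(new DataAsFeatures(), 2, 1, 1);
$lda->train($tset, 50);

// Top 3 words per topic with their probabilities.
$phi = $lda->getPhi(3);
print_r($phi);
```

Because the sampler is random, the words reported for each topic will generally differ between runs.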

at line 153
`public gibbsSample(array $docs)`

Generate one Gibbs sample.

The docs must have been passed to initialize before calling
this function.

Parameters

array	$docs	The docs that we will use to generate the sample
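
To make the sampling step concrete, here is a minimal standalone sketch of one collapsed Gibbs sweep in the style of Griffiths and Steyvers. It is not the library's own code: the function `gibbsSweep`, its parameters and the count arrays are hypothetical, and words are assumed to be integer ids in `0..$W-1`. Each token is removed from the counts, a new topic is drawn with probability proportional to (n_topic,word + b) / (n_topic + W·b) · (n_doc,topic + a), and the counts are restored under the new topic.

```php
<?php
// Hypothetical standalone sketch of one collapsed Gibbs sweep; not the library's code.
// $docs[$d][$i] is a word id in 0..$W-1 and $z[$d][$i] is its current topic.
// $nd[$d][$t], $nw[$t][$w] and $nt[$t] are the cached co-occurrence counts.
function gibbsSweep(array $docs, array &$z, array &$nd, array &$nw, array &$nt, int $W, float $a, float $b): void
{
    $T = count($nt);
    foreach ($docs as $d => $words) {
        foreach ($words as $i => $w) {
            // Remove the token's current assignment from the counts.
            $old = $z[$d][$i];
            $nd[$d][$old]--;
            $nw[$old][$w]--;
            $nt[$old]--;

            // Unnormalised conditional probability of each topic.
            // The per-document denominator is the same for every topic, so it is omitted.
            $p = [];
            for ($t = 0; $t < $T; $t++) {
                $p[$t] = ($nw[$t][$w] + $b) / ($nt[$t] + $W * $b) * ($nd[$d][$t] + $a);
            }

            // Draw the new topic proportionally to $p.
            $u = mt_rand() / mt_getrandmax() * array_sum($p);
            $new = $T - 1;
            $acc = 0.0;
            for ($t = 0; $t < $T; $t++) {
                $acc += $p[$t];
                if ($u <= $acc) {
                    $new = $t;
                    break;
                }
            }

            // Add the token back under its new topic.
            $z[$d][$i] = $new;
            $nd[$d][$new]++;
            $nw[$new][$w]++;
            $nt[$new]++;
        }
    }
}

// Tiny corpus: 2 documents, 3 word ids, 2 topics, every token starting in topic 0.
$docs = [[0, 1, 0], [2, 2]];
$z  = [[0, 0, 0], [0, 0]];
$nd = [[3, 0], [2, 0]];
$nw = [[2, 1, 2], [0, 0, 0]];
$nt = [5, 0];
gibbsSweep($docs, $z, $nd, $nw, $nt, 3, 1.0, 1.0);
```

The sweep only moves tokens between topics, so the total of the counts is unchanged afterwards.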

at line 192
`public array getWordsPerTopicsProbabilities($limit_words = -1)`

Get the probability of a word given a topic (phi according to Griffiths and Steyvers)

Parameters

	$limit_words	Limit the results to the top n words

Return Value

array	A two-dimensional array that contains the probabilities for each topic
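
For reference, Griffiths and Steyvers estimate phi from a Gibbs sample as shown below. Here n_j^{(w)} is the number of times word w is assigned to topic j, n_j^{(\cdot)} is the total number of words assigned to topic j, and W is the vocabulary size. Mapping \beta to the constructor's $b is an assumption based on the parameter descriptions, not something this page states.

```latex
\hat{\phi}_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta}
```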

at line 219
`public getPhi($limit_words = -1)`

Shortcut to getWordsPerTopicsProbabilities

Parameters

$limit_words

at line 231
`public array getDocumentsPerTopicsProbabilities($limit_docs = -1)`

Get the probability of a document given a topic (theta according to Griffiths and Steyvers)

Parameters

	$limit_docs	Limit the results to the top n docs

Return Value

array	A two-dimensional array that contains the probabilities for each document
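
For reference, in the notation of Griffiths and Steyvers theta is the proportion of topic j in document d, estimated from a Gibbs sample as shown below. Here n_j^{(d)} is the number of words in document d assigned to topic j, n_\cdot^{(d)} is the length of document d, and T is the number of topics. Mapping \alpha to the constructor's $a and T to $ntopics is an assumption based on the parameter descriptions.

```latex
\hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_\cdot^{(d)} + T\alpha}
```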

at line 262
`public getTheta($limit_docs = -1)`

Shortcut to getDocumentsPerTopicsProbabilities

Parameters

$limit_docs

at line 271
`public getLogLikelihood()`

Log likelihood of the model having generated the data, as implemented by M. Blondel.
