NlpTools

Natural language processing in PHP

Other

This page collects documentation that is either incomplete or covers parts of the library that do not need a separate section of their own.

Inverse document frequency

NlpTools\Analysis\Idf is a class that implements the inverse document frequency measure. The inverse document frequency is a measure of how common or rare a term T is inside a collection of documents. Terms that are rare tend to be more important for some tasks, especially for information retrieval.
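To make the measure concrete, the following is a minimal standalone sketch of one common formulation, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain the term t. The function name `inverseDocumentFrequency` and the sample documents are hypothetical and not part of NlpTools; the natural logarithm and the absence of smoothing are assumptions, so the library's own values may differ.

```php
<?php
// Hypothetical helper, not part of NlpTools.
// Takes a list of documents (each an array of tokens) and returns
// an array mapping each term to log(N / df).
function inverseDocumentFrequency(array $documents): array
{
    $documentCount = count($documents);
    $documentFrequency = [];

    foreach ($documents as $tokens) {
        // array_unique() makes a term count at most once per document
        foreach (array_unique($tokens) as $term) {
            $documentFrequency[$term] = ($documentFrequency[$term] ?? 0) + 1;
        }
    }

    $idf = [];
    foreach ($documentFrequency as $term => $df) {
        $idf[$term] = log($documentCount / $df);
    }

    return $idf;
}

// Made-up documents, already tokenized on spaces
$documents = [
    explode(" ", "the world owes you nothing"),
    explode(" ", "the truth is rare"),
    explode(" ", "the climate is fine"),
];

$idf = inverseDocumentFrequency($documents);

printf("the:   %.3f\n", $idf["the"]);   // in 3 of 3 documents: log(3/3) = 0
printf("is:    %.3f\n", $idf["is"]);    // in 2 of 3 documents: log(3/2)
printf("world: %.3f\n", $idf["world"]); // in 1 of 3 documents: log(3)
```

A term that appears in every document gets a weight of zero, while a term that appears in a single document gets the largest weight, which is why rare terms dominate in retrieval tasks.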

Idf requires a TrainingSet (and a FeatureFactory to translate the document data to feature vectors) upon construction; afterwards it can be used like an array through PHP's ArrayAccess interface. Indexing it with a term (feature) returns that term's inverse document frequency. It can be used inside another feature factory to create weighted feature vectors.

    use NlpTools\Documents\DocumentInterface;
    use NlpTools\Documents\TrainingSet;
    use NlpTools\Documents\TokensDocument;
    use NlpTools\FeatureFactories\FunctionFeatures;
    use NlpTools\Analysis\Idf;

    // Feature factory that weights each term frequency by the term's idf
    class TfIdfFeatureFactory extends FunctionFeatures
    {
        protected $idf;

        public function __construct(Idf $idf, array $functions)
        {
            parent::__construct($functions);
            // Count how often each feature occurs, not just whether it is present
            $this->modelFrequency();
            $this->idf = $idf;
        }

        public function getFeatureArray($class, DocumentInterface $doc)
        {
            $frequencies = parent::getFeatureArray($class, $doc);
            foreach ($frequencies as $term => &$value) {
                $value = $value * $this->idf[$term];
            }
            unset($value); // break the reference left over from the loop

            return $frequencies;
        }
    }

    // Build a small training set; the class label is left empty
    // and each document is tokenized by splitting on spaces
    $tset = new TrainingSet();
    $tset->addDocument(
        "",
        new TokensDocument(
            explode(
                " ",
                "Don't go around saying the world owes you a living . The world owes you nothing . It was here first ."
            )
        )
    );
    $tset->addDocument(
        "",
        new TokensDocument(
            explode(
                " ",
                "Go to Heaven for the climate , Hell for the company ."
            )
        )
    );
    $tset->addDocument(
        "",
        new TokensDocument(
            explode(
                " ",
                "If you tell the truth , you don't have to remember anything ."
            )
        )
    );

    // Compute the idf of every term in the training set
    $idf = new Idf($tset);

    // Use the document's tokens themselves as features
    $ff = new TfIdfFeatureFactory(
        $idf,
        array(
            function ($c, $d) {
                return $d->getDocumentData();
            }
        )
    );

    // Print the tf-idf weighted feature vector of the first document
    print_r($ff->getFeatureArray("", $tset[0]));

Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a topic model that represents topics as distributions over words and documents as distributions over topics. It is a very useful and relatively new model that deserves a separate mention in the documentation, like the Bayesian and Maxent models.
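The two kinds of distribution can be illustrated with a small standalone sketch. It performs no inference and does not use the NlpTools LDA implementation; the topic names, the probabilities, and the function `wordProbability` are all hypothetical and hand-picked for illustration. It only shows how, under such a model, the probability of a word in a document is the sum over topics of p(word | topic) times p(topic | document).

```php
<?php
// Hypothetical, hand-picked numbers; a real model would estimate them from data.

// Each topic is a probability distribution over words
$topics = [
    "weather"  => ["rain" => 0.5, "sun" => 0.4, "goal" => 0.1],
    "football" => ["rain" => 0.1, "sun" => 0.1, "goal" => 0.8],
];

// Each document is a probability distribution over topics
$documentTopics = ["weather" => 0.25, "football" => 0.75];

// p(word | document) = sum over topics of p(word | topic) * p(topic | document)
function wordProbability(string $word, array $topics, array $documentTopics): float
{
    $probability = 0.0;
    foreach ($documentTopics as $topic => $topicWeight) {
        // A word that a topic does not list contributes probability 0
        $probability += ($topics[$topic][$word] ?? 0.0) * $topicWeight;
    }

    return $probability;
}

foreach (["rain", "sun", "goal"] as $word) {
    printf("%s: %.3f\n", $word, wordProbability($word, $topics, $documentTopics));
}
```

Because this document leans mostly on the "football" topic, "goal" ends up as its most probable word even though the "weather" topic rarely produces it.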

The best resources for examples are the functional test and the blog post written when the model was first implemented.