NlpTools

Natural language processing in php

Maximum Entropy

"The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with largest entropy." (Wikipedia)

So in this maximum entropy model we try to find a set of weights that correctly predicts the training data without making any additional assumptions. For more information on the underlying math of the model, I suggest reading this very informative PDF by Stanford.
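
For orientation, the conditional model that such weights usually parameterise can be written as below. This is the standard log-linear form of a maximum entropy classifier, not a formula taken from the NlpTools source; the symbols are my own notation: c is a class, d a document, f_i(c, d) the i-th feature (1 when it fires for that class and document, 0 otherwise) and w_i its weight.

```latex
P(c \mid d) = \frac{\exp\left(\sum_i w_i \, f_i(c, d)\right)}
                   {\sum_{c'} \exp\left(\sum_i w_i \, f_i(c', d)\right)}
```

Training then amounts to choosing the weights w_i that maximise the probability of the observed classes in the training set.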

Optimizers

Because training a maximum entropy model is not as trivial as training a Naive Bayes model, Maxent's train method also takes a MaxentOptimizer as its third parameter.
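
To give a feel for what such an optimizer has to do, here is a minimal, self-contained sketch of plain gradient ascent on the log-likelihood. It is not the code of NlpTools' MaxentGradientDescent; the toy data and the helper names probabilities() and gradientStep() are hypothetical, and the sketch has no stopping criterion other than a fixed number of iterations.

```php
<?php
// Toy data (hypothetical): each row is [true class, active feature names].
$classes = ['sentence', 'other'];
$data = [
    ['sentence', ['startsUpper', 'prev_ends_with(.)']],
    ['other',    ['startsUpper']],
    ['other',    ['isLower']],
    ['other',    ['isLower']],
];

// P(class | features) under the current weights (softmax of summed weights).
function probabilities(array $w, array $classes, array $features)
{
    $p = [];
    foreach ($classes as $c) {
        $score = 0.0;
        foreach ($features as $f) {
            $score += isset($w["$c ^ $f"]) ? $w["$c ^ $f"] : 0.0;
        }
        $p[$c] = exp($score);
    }
    $z = array_sum($p);
    foreach ($p as $c => $v) {
        $p[$c] = $v / $z;
    }
    return $p;
}

// One gradient ascent step on the log-likelihood:
// gradient = observed feature count - expected feature count.
function gradientStep(array $w, array $classes, array $data, $rate)
{
    $grad = [];
    foreach ($data as $row) {
        list($label, $features) = $row;
        $p = probabilities($w, $classes, $features);
        foreach ($classes as $c) {
            foreach ($features as $f) {
                $key = "$c ^ $f";
                $observed = ($c === $label) ? 1.0 : 0.0;
                $grad[$key] = (isset($grad[$key]) ? $grad[$key] : 0.0) + $observed - $p[$c];
            }
        }
    }
    foreach ($grad as $key => $g) {
        $w[$key] = (isset($w[$key]) ? $w[$key] : 0.0) + $rate * $g;
    }
    return $w;
}

// Start from all-zero weights and take ten steps with learning rate 0.1.
$w = [];
for ($i = 0; $i < 10; $i++) {
    $w = gradientStep($w, $classes, $data, 0.1);
}
print_r($w);
```

With only two classes, every update to a "sentence" weight is mirrored by an equal and opposite update to the matching "other" weight, so the sketch ends up with weights that come in pairs of opposite sign.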

Example use

In the following example we train a classifier to recognise the start of a sentence. I purposefully use very few features (not the data itself, for example) so that it is easy to see what happens to the weights assigned by the optimizer.

Specifically, we have two classes and three features, which results in 6 different (class, feature) pairs. The dump of the weights is the following.

Array
(
    [sentence ^ startsUpper] => 0.055798773871977
    [other ^ startsUpper] => -0.055798773871977
    [sentence ^ isLower] => -36.5
    [other ^ isLower] => 36.5
    [sentence ^ prev_ends_with(.)] => 1.7698599992222
    [other ^ prev_ends_with(.)] => -1.7698599992222
)

We see that a word that is all lowercase is really unlikely to start a sentence: the pair sentence ^ isLower is given a weight of -36.5.

include('vendor/autoload.php');

use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Documents\WordDocument;
use NlpTools\Documents\Document;
use NlpTools\Documents\TrainingSet;
use NlpTools\Models\Maxent;
use NlpTools\Optimizers\MaxentGradientDescent;
use NlpTools\FeatureFactories\FunctionFeatures;

$s = "When the objects of an inquiry, in any department, have principles, conditions, or elements, it is through acquaintance with these that knowledge, that is to say scientific knowledge, is attained.
For we do not think that we know a thing until we are acquainted with its primary conditions or first principles, and have carried our analysis as far as its simplest elements.
Plainly therefore in the science of Nature, as in other branches of study, our first task will be to try to determine what relates to its principles.";

// tokens 0,30,62 are the start of a new sentence
// the rest will be said to have the class other
$tok = new WhitespaceTokenizer();
$tokens = $tok->tokenize($s);

$tset = new TrainingSet();
// array_walk passes each token and its index to the callback
array_walk(
    $tokens,
    function ($t,$i) use($tset,$tokens) {
        if (!in_array($i,array(0,30,62))) {
            $tset->addDocument(
                'other',
                new WordDocument($tokens,$i,1) // get word and the previous/next
            );
        } else {
            $tset->addDocument(
                'sentence',
                new WordDocument($tokens,$i,1) // get word and the previous/next
            );
        }
    }
);

// Remember that in maxent a feature should also target the class
// thus we prepend each feature name with the class name
$ff = new FunctionFeatures(
    array(
        function ($class, Document $d) {
            // $data[0] is the current word
            // $data[1] is an array of previous words
            // $data[2] is an array of following words
            $data = $d->getDocumentData();
            // check if the previous word ends with '.'
            if (isset($data[1][0])) {
                return (substr($data[1][0],-1)=='.') ? "$class ^ prev_ends_with(.)" : null;
            }
        },
        function ($class, Document $d) {
            $data = $d->getDocumentData();
            // check if this word starts with a capital
            return (ctype_upper($data[0][0])) ? "$class ^ startsUpper" : null;
        },
        function ($class, Document $d) {
            $data = $d->getDocumentData();
            // check if this word is all lowercase
            return (ctype_lower($data[0])) ? "$class ^ isLower" : null;
        }
    )
);

// instantiate a gradient descent optimizer for maximum entropy
$optimizer = new MaxentGradientDescent(
    0.001, // Stop if each weight changes less than 0.001
    0.1, // learning rate
    10 // maximum iterations
);

// an empty maxent model
$maxent = new Maxent(array());

// train
$maxent->train($ff,$tset,$optimizer);

// show the weights
$maxent->dumpWeights();
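
To see what the dumped weights mean in practice, the sketch below turns them into class probabilities by hand. It does not use NlpTools at all: the weights are copied from the dump shown earlier, and classProbabilities() is a hypothetical helper that applies the usual softmax over the summed weights of the features that fire. It assumes that this is also how the trained model scores a document.

```php
<?php
// Weights copied from the dump shown earlier.
$weights = [
    'sentence ^ startsUpper'       => 0.055798773871977,
    'other ^ startsUpper'          => -0.055798773871977,
    'sentence ^ isLower'           => -36.5,
    'other ^ isLower'              => 36.5,
    'sentence ^ prev_ends_with(.)' => 1.7698599992222,
    'other ^ prev_ends_with(.)'    => -1.7698599992222,
];

// Hypothetical helper: P(class | active features) as a softmax over the
// summed weights of the (class, feature) pairs that fire.
function classProbabilities(array $weights, array $classes, array $features)
{
    $scores = [];
    foreach ($classes as $class) {
        $sum = 0.0;
        foreach ($features as $feature) {
            $key = "$class ^ $feature";
            $sum += isset($weights[$key]) ? $weights[$key] : 0.0;
        }
        $scores[$class] = $sum;
    }
    // Subtract the largest score before exponentiating, for numerical stability.
    $max = max($scores);
    $total = 0.0;
    foreach ($scores as $class => $score) {
        $scores[$class] = exp($score - $max);
        $total += $scores[$class];
    }
    foreach ($scores as $class => $value) {
        $scores[$class] = $value / $total;
    }
    return $scores;
}

$classes = ['sentence', 'other'];

// A capitalised word whose previous word ends with '.'
$afterPeriod = classProbabilities($weights, $classes, ['startsUpper', 'prev_ends_with(.)']);
printf("P(sentence | capitalised, after '.') = %.3f\n", $afterPeriod['sentence']);

// An all-lowercase word
$lowercase = classProbabilities($weights, $classes, ['isLower']);
printf("P(sentence | all lowercase) = %.3g\n", $lowercase['sentence']);
```

The first probability comes out at roughly 0.97, while the second is vanishingly small, which is the effect of the -36.5 weight.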
