
Bayesian model

NlpTools, for now at least, implements only the Naive Bayes model. The model's interface is the following.

    interface MultinomialNBModelInterface
    {
        public function getPrior($class);
        public function getCondProb($term, $class);
    }
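
To make the contract concrete, here is a toy implementation of the interface (a hypothetical class written only for illustration, assuming the interface lives in the NlpTools\Models namespace alongside FeatureBasedNB):

    use NlpTools\Models\MultinomialNBModelInterface;

    // Illustration only: a model that considers every class and every term
    // equally likely. It satisfies the interface but is useless in practice.
    class UniformNBModel implements MultinomialNBModelInterface
    {
        public function getPrior($class)
        {
            return 0.5; // equal prior for each of two classes
        }
        public function getCondProb($term, $class)
        {
            return 0.001; // every term equally likely given any class
        }
    }

Any object implementing these two methods can be passed to the classifier in place of FeatureBasedNB.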

There is currently one class in NlpTools that implements the above interface: FeatureBasedNB.

Feature based NB

FeatureBasedNB has a train method that implements supervised learning for Naive Bayes. The train method takes a feature factory and a training set as parameters and computes the necessary probabilities simply by counting the occurrences of each feature in each document.

It also uses additive smoothing to account for features never seen in the training set.
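
In rough terms, the smoothed conditional probability of a feature given a class is computed from the counts as in the following sketch (illustrative code with made-up variable names, not the library's internal implementation):

    // $cnt   = occurrences of the feature in documents of the class
    // $total = total feature occurrences in documents of the class
    // $v     = number of distinct features seen in training (vocabulary size)
    // $a     = additive smoothing parameter (1 gives Laplace smoothing)
    function smoothed_cond_prob($cnt, $total, $v, $a = 1)
    {
        return ($cnt + $a) / ($total + $a * $v);
    }

With $a > 0, a feature that never occurred in a class still gets a small non-zero probability instead of zeroing out the whole product.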

The train method returns a training context that can be used with the train_with_context method for incremental training.
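
A minimal sketch of incremental training, assuming train_with_context accepts the context returned by train along with a feature factory and a new training set (check the class for the exact signature):

    $model = new FeatureBasedNB();
    $ctx = $model->train($ff, $tset); // initial training returns the context
    // ... later, when a new batch of labelled documents ($more_tset) arrives
    $model->train_with_context($ctx, $ff, $more_tset);

Here $ff, $tset and $more_tset are assumed to be a feature factory and two TrainingSet instances like the ones built in the full example below.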

A good example of the use of the FeatureBasedNB class is shown on the documentation index (it is repeated here).

    include('vendor/autoload.php'); // won't include it again in the following examples

    use NlpTools\Tokenizers\WhitespaceTokenizer;
    use NlpTools\Models\FeatureBasedNB;
    use NlpTools\Documents\TrainingSet;
    use NlpTools\Documents\TokensDocument;
    use NlpTools\FeatureFactories\DataAsFeatures;
    use NlpTools\Classifiers\MultinomialNBClassifier;

    // ---------- Data ----------------
    // data is taken from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
    // we use a part for training
    $training = array(
        array('ham','Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
        ...
        array('ham','Fine if that\'s the way u feel. That\'s the way its gota b'),
        array('spam','England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+')
    );
    // and another for evaluating
    $testing = array(
        array('ham','I\'ve been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.'),
        ...
        array('ham','I HAVE A DATE ON SUNDAY WITH WILL!!'),
        array('spam','XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL')
    );

    $tset = new TrainingSet(); // will hold the training documents
    $tok = new WhitespaceTokenizer(); // will split into tokens
    $ff = new DataAsFeatures(); // see features in documentation

    // ---------- Training ----------------
    foreach ($training as $d)
    {
        $tset->addDocument(
            $d[0], // class
            new TokensDocument(
                $tok->tokenize($d[1]) // the actual document
            )
        );
    }
    $model = new FeatureBasedNB(); // train a Naive Bayes model
    $model->train($ff, $tset);

    // ---------- Classification ----------------
    $cls = new MultinomialNBClassifier($ff, $model);
    $correct = 0;
    foreach ($testing as $d)
    {
        // predict if it is spam or ham
        $prediction = $cls->classify(
            array('ham','spam'), // all possible classes
            new TokensDocument(
                $tok->tokenize($d[1]) // the document
            )
        );
        if ($prediction == $d[0])
            $correct++;
    }
    printf("Accuracy: %.2f\n", 100*$correct / count($testing));
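
Once trained, the model can also be queried directly through the interface methods shown earlier, for example (a quick sketch; 'free' is just an arbitrary token that may or may not appear in the training data):

    echo $model->getPrior('spam'), "\n";            // prior probability of the 'spam' class
    echo $model->getCondProb('free', 'spam'), "\n"; // smoothed conditional probability of 'free' given 'spam'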
