NlpTools

Natural language processing in PHP

Getting started with NlpTools

NlpTools provides the building blocks for training classification models for natural language processing. It requires PHP 5.3 or greater and can be installed with Composer: all you have to do is write a composer.json like this

{
    "require": {
        "nlp-tools/nlp-tools": "1.0.*@dev"
    }
}

Once Composer has installed the package, a first example tokenizes a short text:

include('vendor/autoload.php');

use NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;

$text = "Please allow me to introduce myself
I'm a man of wealth and taste";

$tok = new WhitespaceAndPunctuationTokenizer();
print_r($tok->tokenize($text));

// Array
// (
//     [0] => Please
//     [1] => allow
//     [2] => me
//     [3] => to
//     [4] => introduce
//     [5] => myself
//     [6] => I
//     [7] => '
//     [8] => m
//     [9] => a
//     [10] => man
//     [11] => of
//     [12] => wealth
//     [13] => and
//     [14] => taste
// )
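
To make the splitting rule in the output above concrete, here is a minimal plain-PHP sketch that reproduces the same token list with a single regular expression. The function name sketchTokenize is hypothetical and the pattern is an assumption chosen to match the output shown; it is not taken from NlpTools and the library's own tokenizer may behave differently on other input.

```php
<?php
// Hypothetical helper, not part of NlpTools: emits runs of word
// characters and single punctuation marks as separate tokens.
function sketchTokenize($text)
{
    // \w+      matches a run of word characters
    // [^\w\s]  matches one character that is neither word nor whitespace
    preg_match_all('/\w+|[^\w\s]/u', $text, $matches);
    return $matches[0];
}

$tokens = sketchTokenize("Please allow me to introduce myself\nI'm a man of wealth and taste");
echo implode('|', $tokens), "\n";
// Please|allow|me|to|introduce|myself|I|'|m|a|man|of|wealth|and|taste
```

Whitespace, including the line break inside the string, only separates tokens and never becomes one, which is why "I'm" yields three tokens while the newline yields none.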

Classification

NlpTools does not yet contain prebuilt models, so the examples given here also include the training of the model, which makes them a bit lengthy. The tokenizer example above doesn't classify anything and therefore doesn't need any training.

The next example trains a Naive Bayes model on SMS messages and uses it to tell spam from ham.

include('vendor/autoload.php'); // won't include it again in the following examples

use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Models\FeatureBasedNB;
use NlpTools\Documents\TrainingSet;
use NlpTools\Documents\TokensDocument;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Classifiers\MultinomialNBClassifier;

// ---------- Data ----------------
// data is taken from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
// we use a part for training
$training = array(
    array('ham','Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
    // ...
    array('ham','Fine if that\'s the way u feel. That\'s the way its gota b'),
    array('spam','England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+')
);
// and another for evaluating
$testing = array(
    array('ham','I\'ve been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.'),
    // ...
    array('ham','I HAVE A DATE ON SUNDAY WITH WILL!!'),
    array('spam','XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL')
);

$tset = new TrainingSet(); // will hold the training documents
$tok = new WhitespaceTokenizer(); // will split into tokens
$ff = new DataAsFeatures(); // see features in documentation

// ---------- Training ----------------
foreach ($training as $d)
{
    $tset->addDocument(
        $d[0], // class
        new TokensDocument(
            $tok->tokenize($d[1]) // The actual document
        )
    );
}

$model = new FeatureBasedNB(); // train a Naive Bayes model
$model->train($ff,$tset);

// ---------- Classification ----------------
$cls = new MultinomialNBClassifier($ff,$model);
$correct = 0;
foreach ($testing as $d)
{
    // predict if it is spam or ham
    $prediction = $cls->classify(
        array('ham','spam'), // all possible classes
        new TokensDocument(
            $tok->tokenize($d[1]) // The document
        )
    );
    if ($prediction==$d[0])
    {
        $correct++;
    }
}

// accuracy as a percentage of correctly classified test messages
printf("Accuracy: %.2f\n", 100*$correct / count($testing));
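
For readers curious about what the training and classification steps compute, the following is a minimal plain-PHP sketch of multinomial Naive Bayes over whitespace tokens. It is an illustration under stated assumptions, not NlpTools code: the function names sketchTrain and sketchClassify, the four toy messages, and the choice of add-one (Laplace) smoothing are all hypothetical, and the library's FeatureBasedNB and MultinomialNBClassifier may differ in their details.

```php
<?php
// Hypothetical sketch, not NlpTools code: multinomial Naive Bayes
// with add-one (Laplace) smoothing over whitespace tokens.

// Count documents per class, tokens per class, and the vocabulary.
function sketchTrain(array $examples)
{
    $model = array('docs' => array(), 'counts' => array(), 'totals' => array(), 'vocab' => array());
    foreach ($examples as $ex) {
        list($class, $text) = $ex;
        if (!isset($model['docs'][$class])) {
            $model['docs'][$class] = 0;
            $model['counts'][$class] = array();
            $model['totals'][$class] = 0;
        }
        $model['docs'][$class]++;
        foreach (preg_split('/\s+/', trim($text)) as $token) {
            if (!isset($model['counts'][$class][$token])) {
                $model['counts'][$class][$token] = 0;
            }
            $model['counts'][$class][$token]++;
            $model['totals'][$class]++;
            $model['vocab'][$token] = true;
        }
    }
    return $model;
}

// Return the class with the highest log posterior for the given text.
function sketchClassify(array $model, $text)
{
    $nDocs = array_sum($model['docs']);
    $vocabSize = count($model['vocab']);
    $best = null;
    $bestScore = -INF;
    foreach ($model['docs'] as $class => $n) {
        $score = log($n / $nDocs); // log prior of the class
        foreach (preg_split('/\s+/', trim($text)) as $token) {
            $count = isset($model['counts'][$class][$token]) ? $model['counts'][$class][$token] : 0;
            // add-one smoothing keeps unseen tokens from zeroing the product
            $score += log(($count + 1) / ($model['totals'][$class] + $vocabSize));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $class;
        }
    }
    return $best;
}

// Toy data invented for this sketch, not taken from the SMS collection.
$model = sketchTrain(array(
    array('ham', 'see you at lunch'),
    array('ham', 'call me at home'),
    array('spam', 'win a free prize now'),
    array('spam', 'free entry txt now'),
));
echo sketchClassify($model, 'free prize txt'), "\n";  // spam
echo sketchClassify($model, 'see you at home'), "\n"; // ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term means a token never seen in a class lowers that class's score without ruling it out.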
