NlpTools

Natural language processing in PHP

Tokenizers

Tokenization, in computer science, is, to quote Wikipedia, the process of converting a sequence of characters into a sequence of tokens.

In NLP we tokenize a large piece of text to generate tokens, which are smaller pieces of text (words, sentences, etc.) that are easier to work with. For instance, if we want to apply a stop word list, we apply it to the tokens and not to the original text.

Tokenizer interface

The tokenizer interface is very simple, with only one method.

    interface TokenizerInterface
    {
        /**
         * @param string $str The text for tokenization
         * @return array The list of tokens from the string
         */
        public function tokenize($str);
    }
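
Any class with a tokenize() method that returns an array of strings satisfies the interface. As an illustration only, the hypothetical class below (not part of the library, and assuming the interface lives in the \NlpTools\Tokenizers namespace like the classes that follow) splits on commas:

    use \NlpTools\Tokenizers\TokenizerInterface;

    // Hypothetical example: a tokenizer that splits on commas (not part of NlpTools)
    class CommaTokenizer implements TokenizerInterface
    {
        public function tokenize($str)
        {
            // explode on commas and trim surrounding whitespace from each piece
            return array_map('trim', explode(',', $str));
        }
    }

    $commas = new CommaTokenizer();
    $commas->tokenize("wealth, taste, habit");
    // array('wealth','taste','habit')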

NlpTools does not introduce a new data structure for token sequences; they are plain arrays.
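
Since tokens are plain arrays, applying a stop word list (as mentioned above) is ordinary array filtering. The following is a sketch using plain PHP rather than any dedicated NlpTools facility:

    use \NlpTools\Tokenizers\WhitespaceTokenizer;

    $stopwords = array('a', 'of', 'and');  // a tiny illustrative stop word list

    $tok = new WhitespaceTokenizer();
    $tokens = $tok->tokenize("a man of wealth and taste");

    // keep only the tokens that are not in the stop word list
    $filtered = array_values(array_filter($tokens, function ($t) use ($stopwords) {
        return !in_array(strtolower($t), $stopwords);
    }));
    // array('man','wealth','taste')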

Tokenizers provided

The following tokenizers reside in the \NlpTools\Tokenizers namespace.

  1. WhitespaceTokenizer
  2. WhitespaceAndPunctuationTokenizer
  3. ClassifierBasedTokenizer
  4. RegexTokenizer
  5. PennTreeBankTokenizer

    use \NlpTools\Tokenizers\WhitespaceTokenizer;
    use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
    use \NlpTools\Tokenizers\ClassifierBasedTokenizer;

    $s = "Please allow me to introduce myself
    I'm a man of wealth and taste";

    $space = new WhitespaceTokenizer();
    $punct = new WhitespaceAndPunctuationTokenizer();

    // instantiate a binary classifier that chooses among the labels
    // EOW --> End Of Word
    // O   --> Other
    // $cls <-- that classifier
    $clstok = new ClassifierBasedTokenizer($cls);

    $space->tokenize($s);
    // array('Please','allow','me','to','introduce','myself',
    //       'I\'m','a','man','of','wealth','and','taste')

    $punct->tokenize($s);
    // array('Please','allow','me','to','introduce','myself',
    //       'I','\'','m','a','man','of','wealth','and','taste')

    $clstok->tokenize($s);
    // output depends on the classifier

Although WhitespaceTokenizer and WhitespaceAndPunctuationTokenizer are straightforward, the rest deserve a separate mention.

Classifier based tokenization

The constructor of ClassifierBasedTokenizer looks like this:

    class ClassifierBasedTokenizer implements TokenizerInterface
    {
        public function __construct(Classifier $cls, Tokenizer $tok=null, $sep=' ')
        {
            ...
        }
        ...
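
Based on this signature, both the tokenizer that produces the initial tokens and the separator used to join them can be overridden. For instance (a sketch, reusing WhitespaceTokenizer and with $cls standing for any suitable classifier as in the earlier fragment):

    // join sub-tokens with a newline instead of the default single space
    $clstok = new ClassifierBasedTokenizer(
        $cls,                       // any classifier choosing between 'EOW' and 'O'
        new WhitespaceTokenizer(),  // the tokenizer that produces the initial tokens
        "\n"                        // the separator used when joining tokens
    );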

We will see how the classifier-based tokenizer works with a simple example that implements a naive rule-based sentence tokenizer, and then I will briefly explain the simple algorithm that this tokenizer uses.

ClassifierBasedTokenizer passes a document of type WordDocument to the classifier (you might want to check the documentation on documents).

    <?php

    include('vendor/autoload.php');

    use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
    use \NlpTools\Tokenizers\WhitespaceTokenizer;
    use \NlpTools\Classifiers\ClassifierInterface;
    use \NlpTools\Documents\DocumentInterface;

    class EndOfSentence implements ClassifierInterface
    {
        public function classify(array $classes, DocumentInterface $d)
        {
            list($token, $before, $after) = $d->getDocumentData();

            $dotcnt = count(explode('.', $token)) - 1;
            $lastdot = substr($token, -1) == '.';

            if (!$lastdot) // assume that all sentences end in full stops
                return 'O';
            if ($dotcnt > 1) // to catch some naive abbreviations like U.S.A.
                return 'O';

            return 'EOW';
        }
    }

    $tok = new ClassifierBasedTokenizer(
        new EndOfSentence(),
        new WhitespaceTokenizer()
    );

    $text = "We are what we repeatedly do.
    Excellence, then, is not an act, but a habit.";

    print_r($tok->tokenize($text));
    // Array
    // (
    //     [0] => We are what we repeatedly do.
    //     [1] => Excellence, then, is not an act, but a habit.
    // )

ClassifierBasedTokenizer uses the following simple algorithm to combine the tokens produced by another tokenizer into larger tokens (a rough sketch of the joining step is shown after the list).

  1. Break the character sequence into a token sequence using another Tokenizer instance.
  2. Classify each token as either EOW or O. EOW stands for "End of word" and O stands for "Other".
  3. Join all O tokens up to and including the next EOW token using the given separator (any character sequence).
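
The joining step (step 3) could be sketched roughly as follows; this is an illustration of the idea, not the library's actual implementation:

    // Illustration of the joining step: $tokens come from the inner tokenizer,
    // $labels[$i] is the classifier's label ('EOW' or 'O') for $tokens[$i],
    // and $sep is the separator passed to the constructor.
    $result = array();
    $current = array();
    foreach ($tokens as $i => $token) {
        $current[] = $token;
        if ($labels[$i] == 'EOW') {
            // an EOW label closes the current group of tokens
            $result[] = implode($sep, $current);
            $current = array();
        }
    }
    if (!empty($current)) {
        // any trailing O tokens form one last token
        $result[] = implode($sep, $current);
    }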

RegexTokenizer

This tokenizer splits text based on a set of regexes. The regexes are passed as a constructor parameter and can have the following three forms.

  1. A simple string
  2. An array of a string and an integer
  3. An array of two strings

Each case is handled differently by the tokenizer. In the first case the input is split using preg_split, with the provided string as the pattern. In the second case, the string is the pattern to be used with preg_match and the integer denotes the subpattern to keep. Finally, in the third case preg_replace is used and the matches are replaced with the second string.
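
To make the mapping concrete, the three forms correspond roughly to the following plain PHP calls (an illustration of the underlying functions, not the tokenizer's internal code):

    // 1. a simple string: used as the pattern for preg_split
    preg_split('/\s+/', "wealth and taste");
    // array('wealth','and','taste')

    // 2. an array of a string and an integer: the pattern is matched with preg_match
    //    and the given subpattern is kept
    preg_match("/'(m|ve|d|s)/", "I'm", $matches);
    $kept = $matches[1];  // "m"

    // 3. an array of two strings: preg_replace replaces the matches with the second string
    preg_replace("/'(m|ve|d|s)/", " '\$1", "I'm");
    // "I 'm"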

The results of each regex are then passed to the next one, forming a pipeline.

    use NlpTools\Tokenizers\RegexTokenizer;

    $s = "Please allow me to introduce myself
    I'm a man of wealth and taste";

    $rtok = new RegexTokenizer(array(
        array("/\s+/", " "),             // replace many spaces with a single space
        array("/'(m|ve|d|s)/", " '\$1"), // split I've, it's, we've, we'd, ...
        "/ /"                            // split on every space
    ));

    print_r($rtok->tokenize($s));
    // Array
    // (
    //     [0] => Please
    //     [1] => allow
    //     [2] => me
    //     [3] => to
    //     [4] => introduce
    //     [5] => myself
    //     [6] => I
    //     [7] => 'm
    //     [8] => a
    //     [9] => man
    //     [10] => of
    //     [11] => wealth
    //     [12] => and
    //     [13] => taste
    // )

Treebank tokenization

This is a very popular tokenization scheme for the English language. We use the Penn Treebank tokenization, about which you can read more at upenn.edu.
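
A minimal usage sketch (assuming, as with the other simple tokenizers above, that the class takes no constructor arguments):

    use \NlpTools\Tokenizers\PennTreeBankTokenizer;

    $tok = new PennTreeBankTokenizer();
    $tok->tokenize("I'm a man of wealth and taste");
    // expected to split contractions and punctuation following the Penn Treebank
    // conventions, e.g. "I'm" into "I" and "'m"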
