Tokenizers
Tokenizing in regards to computer science, to quote wikipedia, is the process of converting a sequence of characters to a sequence of tokens.
In NLP we tokenize a large piece of text to generate tokens which are smaller pieces of text (words, sentences, etc.) that are easier to work with. For instance we might want to apply a stop word list , we will be applying it to the tokens and not the original text.
Tokenizer interface
The tokenizer interface is a very simple one with only one method.
interface TokenizerInterface { /* * @param string $str The text for tokenization * @return array The list of tokens from the string */ public function tokenize($str); }
NlpTools does not introduce a new data structure for the token sequences they are plain arrays.
Tokenizers provided
The following tokenizers reside in the \NlpTools\Tokenizers namespace.
- WhitespaceTokenizer
- WhitespaceAndPunctuationTokenizer
- ClassifierBasedTokenizer
- RegexTokenizer
- PennTreeBankTokenizer
use \NlpTools\Tokenizers\WhitespaceTokenizer; use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer; use \NlpTools\Tokenizers\ClassifierBasedTokenizer; $s = "Please allow me to introduce myself I'm a man of wealth and taste"; $space = new WhitespaceTokenizer(); $punct = new WhitespaceAndPunctuationTokenizer(); // instantiate a binary classifier that chooses among the labels // EOW --> End Of Word // O --> Other // $cls <-- that classifier $clstok = new ClassifierBasedTokenizer($cls); $space->tokenize($s); // array('Please','allow','me','to','introduce','myself', // 'I\'m','a','man','of','wealth','and','taste') $punct->tokenize($s); // array('Please','allow','me','to','introduce','myself', // 'I','\'','m','a','man','of','wealth','and','taste') $clstok->tokenize($s); // output depends on the classifier
Although WhitespaceTokenizer and WhitespaceAndPunctuationTokenizer are straight forward the rest of them need a bit of separate mention.
Classifier based tokenization
The constructor of the ClassifierBasedTokenizer looks like this
class ClassifierBasedTokenizer implements TokenizerInterface { public function __construct(Classifier $cls, Tokenizer $tok=null,$sep=' ') { ... } ...
We will see how the classifier based tokenizer works with a simple example that implements a naive rule based sentence tokenizer and then I will explain a bit the simple algorithm that this tokenizer uses.
ClassifierBasedTokenizer sends to the classifier a document of type WordDocument (you might want to check the documents documentation).
<?php include ('vendor/autoload.php'); use \NlpTools\Tokenizers\ClassifierBasedTokenizer; use \NlpTools\Tokenizers\WhitespaceTokenizer; use \NlpTools\Classifiers\ClassifierInterface; use \NlpTools\Documents\DocumentInterface; class EndOfSentence implements ClassifierInterface { if (!$lastdot) // assume that all sentences end in full stops return 'O'; if ($dotcnt>1) // to catch some naive abbreviations U.S.A. return 'O'; return 'EOW'; } } $tok = new ClassifierBasedTokenizer( new EndOfSentence(), new WhitespaceTokenizer() ); $text = "We are what we repeatedly do. Excellence, then, is not an act, but a habit."; // Array // ( // [0] => We are what we repeatedly do. // [1] => Excellence, then, is not an act, but a habit. // )
ClassifierBasedTokenizer follows the following simple algorithm to combine tokens from another tokenizer into larger tokens.
- Break the character sequence into a token sequence using another Tokenizer instance
- Classify each token whether it is an EOW or an O. EOW stands for "End of word" and O stands for "Other".
- Join all O tokens up to an EOW token using a given separator (any character sequence)
RegexTokenizer
This tokenizer tokenizes text based on a set of regexes. The regexes are passed as a constructor parameter and can have the following three forms.
- A simple string
- An array of a string and an integer
- An array of two strings
Each case is handled differently by the tokenizer. In the first case the input will be split using preg_split and the provided string as pattern. In the second case the integer denotes the subpattern to keep and the first the pattern to be used with preg_match. Finally in the third case preg_replace is used and the matches are replaced with the second string.
The results of each regex are then passed to the next as a pipeline.
use NlpTools\Tokenizers\RegexTokenizer; $s = "Please allow me to introduce myself I'm a man of wealth and taste"; "/ /" // split on every space )); // Array // ( // [0] => Please // [1] => allow // [2] => me // [3] => to // [4] => introduce // [5] => myself // [6] => I // [7] => 'm // [8] => a // [9] => man // [10] => of // [11] => wealth // [12] => and // [13] => taste // )
Treebank tokenization
This is a very popular tokenization for the english language. We use the Penn treebank tokenization for which you can read more at upenn.edu.