Tokenizers
In computer science, tokenizing is, to quote Wikipedia, the process of converting a sequence of characters to a sequence of tokens.
In NLP we tokenize a large piece of text to generate tokens, which are smaller pieces of text (words, sentences, etc.) that are easier to work with. For instance, when we apply a stop word list, we apply it to the tokens and not to the original text.
Tokenizer interface
The tokenizer interface is a very simple one with only one method.
interface TokenizerInterface
{
    /**
     * @param string $str The text for tokenization
     * @return array The list of tokens from the string
     */
    public function tokenize($str);
}
NlpTools does not introduce a new data structure for token sequences; they are plain arrays.
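As a minimal sketch of what implementing this interface can look like, the snippet below declares a local stand-in for TokenizerInterface so that it runs without the library installed. The class name SimpleWhitespaceTokenizer and the stop word list are hypothetical and exist only for this illustration. It also shows the stop word case mentioned earlier: since tokens are plain arrays, ordinary array functions are enough.

```php
<?php
// Local stand-in for NlpTools' TokenizerInterface, declared here only so
// the sketch is self-contained.
interface TokenizerInterface
{
    public function tokenize($str);
}

// Hypothetical tokenizer for illustration: splits on runs of whitespace.
class SimpleWhitespaceTokenizer implements TokenizerInterface
{
    public function tokenize($str)
    {
        return preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
    }
}

$tok = new SimpleWhitespaceTokenizer();
$tokens = $tok->tokenize("to be or not to be that is the question");

// The stop word list is applied to the tokens, not to the original text.
$stopWords = array('to', 'be', 'or', 'not', 'that', 'is', 'the');
$filtered = array_values(array_diff($tokens, $stopWords));

print_r($filtered);
// Array
// (
//     [0] => question
// )
```

array_values is used only to reindex the result, because array_diff preserves the original keys.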
Tokenizers provided
The following tokenizers reside in the \NlpTools\Tokenizers namespace.
- WhitespaceTokenizer
- WhitespaceAndPunctuationTokenizer
- ClassifierBasedTokenizer
- RegexTokenizer
- PennTreeBankTokenizer
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
$s = "Please allow me to introduce myself
I'm a man of wealth and taste";
$space = new WhitespaceTokenizer();
$punct = new WhitespaceAndPunctuationTokenizer();
// instantiate a binary classifier that chooses among the labels
// EOW --> End Of Word
// O --> Other
// $cls <-- that classifier
$clstok = new ClassifierBasedTokenizer($cls);
$space->tokenize($s);
// array('Please','allow','me','to','introduce','myself',
// 'I\'m','a','man','of','wealth','and','taste')
$punct->tokenize($s);
// array('Please','allow','me','to','introduce','myself',
// 'I','\'','m','a','man','of','wealth','and','taste')
$clstok->tokenize($s);
// output depends on the classifier
Although WhitespaceTokenizer and WhitespaceAndPunctuationTokenizer are straightforward, the rest need to be described separately.
Classifier based tokenization
The constructor of the ClassifierBasedTokenizer looks like this:
class ClassifierBasedTokenizer implements TokenizerInterface
{
    public function __construct(ClassifierInterface $cls, TokenizerInterface $tok=null, $sep=' ') {
...
}
...
We will see how the classifier-based tokenizer works through a simple example that implements a naive rule-based sentence tokenizer, and then I will briefly explain the simple algorithm that this tokenizer uses.
ClassifierBasedTokenizer sends the classifier a document of type WordDocument (you might want to check the documentation on documents).
<?php
include ('vendor/autoload.php');
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
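class EndOfSentence implements ClassifierInterface
{
    // Assumed signature and document layout: the classifier receives a
    // WordDocument whose data is array(token, tokens before, tokens after).
    public function classify(array $classes, DocumentInterface $d)
    {
        list($token, $before, $after) = $d->getDocumentData();
        $dotcnt = substr_count($token, '.'); // number of dots in the token
        $lastdot = substr($token, -1) == '.'; // does the token end with a dot?
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
        if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
            return 'O';
        return 'EOW';
    }
}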
$tok = new ClassifierBasedTokenizer(
new EndOfSentence(),
new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
Excellence, then, is not an act, but a habit.";
print_r($tok->tokenize($text));
// Array
// (
// [0] => We are what we repeatedly do.
// [1] => Excellence, then, is not an act, but a habit.
// )
ClassifierBasedTokenizer uses the following simple algorithm to combine tokens from another tokenizer into larger tokens.
- Break the character sequence into a token sequence using another Tokenizer instance
- Classify each token as either EOW or O. EOW stands for "End of word" and O stands for "Other".
- Join all O tokens up to an EOW token using a given separator (any character sequence)
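The three steps above can be sketched in plain PHP without the library. The function joinTokens and the closure $endsSentence are hypothetical names used only for this illustration, and a callable stands in for a real classifier. One assumption to note: tokens left over after the last EOW are emitted here as a final token, which may differ from what ClassifierBasedTokenizer does.

```php
<?php
// Hypothetical helper illustrating the join step; not the library's code.
// $classify is any callable that returns 'EOW' or 'O' for a token.
function joinTokens(array $tokens, $classify, $sep = ' ')
{
    $result = array();
    $current = array();
    foreach ($tokens as $token) {
        $current[] = $token;
        if (call_user_func($classify, $token) === 'EOW') {
            // An EOW closes the current group of tokens.
            $result[] = implode($sep, $current);
            $current = array();
        }
    }
    // Assumption of this sketch: leftover O tokens form a final token.
    if (count($current) > 0) {
        $result[] = implode($sep, $current);
    }
    return $result;
}

// Step 1: a whitespace split stands in for the inner tokenizer.
$text = "We are what we repeatedly do. Excellence, then, is not an act, but a habit.";
$tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Step 2: a naive rule, a token ending in a full stop is an EOW.
$endsSentence = function ($token) {
    return substr($token, -1) === '.' ? 'EOW' : 'O';
};

// Step 3: join O tokens up to each EOW with a single space.
print_r(joinTokens($tokens, $endsSentence));
// Array
// (
//     [0] => We are what we repeatedly do.
//     [1] => Excellence, then, is not an act, but a habit.
// )
```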
RegexTokenizer
This tokenizer splits text using a list of regexes. The regexes are passed as a constructor parameter, and each can take one of the following three forms.
- A simple string
- An array of a string and an integer
- An array of two strings
Each form is handled differently by the tokenizer. A simple string is used as the pattern for preg_split, which splits the input. In an array of a string and an integer, the string is the pattern to be used with preg_match and the integer denotes the subpattern to keep. In an array of two strings, preg_replace is used and the matches of the first string are replaced with the second.
The output of each regex is then passed as input to the next, forming a pipeline.
use NlpTools\Tokenizers\RegexTokenizer;
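$s = "Please allow me to introduce myself
I'm a man of wealth and taste";
// NOTE: the first two patterns are assumptions, chosen to reproduce the output below
$rtok = new RegexTokenizer(array(
    array("/\s+/", " "), // collapse whitespace runs (including the newline) into one space
    array("/'(m|ve|d|s)/", " '\$1"), // split contractions: I'm becomes I 'm
    "/ /" // split on every space
));
print_r($rtok->tokenize($s));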
// Array
// (
// [0] => Please
// [1] => allow
// [2] => me
// [3] => to
// [4] => introduce
// [5] => myself
// [6] => I
// [7] => 'm
// [8] => a
// [9] => man
// [10] => of
// [11] => wealth
// [12] => and
// [13] => taste
// )
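To make the three pattern forms and the pipeline more concrete, here is a rough re-implementation in plain PHP. The function regexPipeline is a hypothetical name and this is not the library's code. Two assumptions are worth flagging: the sketch uses preg_match_all so that every match in a string is kept, and it drops empty strings when splitting.

```php
<?php
// Hypothetical sketch of the pipeline; not NlpTools' implementation.
function regexPipeline(array $patterns, $str)
{
    $parts = array($str);
    foreach ($patterns as $p) {
        $next = array();
        foreach ($parts as $part) {
            if (is_string($p)) {
                // Form 1: a simple string, split on the pattern.
                $pieces = preg_split($p, $part, -1, PREG_SPLIT_NO_EMPTY);
                $next = array_merge($next, $pieces);
            } elseif (is_int($p[1])) {
                // Form 2: string and integer, keep subpattern $p[1]
                // of every match (assumption: all matches are kept).
                preg_match_all($p[0], $part, $m);
                $next = array_merge($next, $m[$p[1]]);
            } else {
                // Form 3: two strings, replace matches of the first
                // with the second.
                $next[] = preg_replace($p[0], $p[1], $part);
            }
        }
        // The output of this pattern is the input of the next one.
        $parts = $next;
    }
    return $parts;
}

$out = regexPipeline(
    array(
        array('/\s+/', ' '),    // collapse whitespace runs
        '/ /',                  // split on every space
        array('/^(\w+)/', 1),   // keep only the leading word characters
    ),
    "Hello,   world! Bye."
);

print_r($out);
// Array
// (
//     [0] => Hello
//     [1] => world
//     [2] => Bye
// )
```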
Treebank tokenization
This is a very popular tokenization scheme for the English language. We use the Penn Treebank tokenization, about which you can read more at upenn.edu.
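To give a feel for the conventions, the sketch below applies a handful of Treebank-style rules: contractions such as "don't" and "I'm" are split into "do n't" and "I 'm", and punctuation becomes a separate token. It covers only a small subset of the rules, and treebankLikeTokenize is a hypothetical helper written for this illustration, not the PennTreeBankTokenizer class.

```php
<?php
// Hypothetical helper showing a few Treebank-style rules only.
function treebankLikeTokenize($str)
{
    // "don't" becomes "do n't"
    $str = preg_replace("/(\w)n't\b/", "\$1 n't", $str);
    // "I'm" becomes "I 'm", "it's" becomes "it 's", and so on
    $str = preg_replace("/(\w)'(s|m|d|ll|re|ve)\b/", "\$1 '\$2", $str);
    // separate common punctuation and a final full stop
    $str = preg_replace('/([,;:?!]|\.$)/', ' $1', $str);
    // finally split on whitespace
    return preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens = treebankLikeTokenize("I'm sure they don't mind, do they?");
print_r($tokens);
// Array
// (
//     [0] => I
//     [1] => 'm
//     [2] => sure
//     [3] => they
//     [4] => do
//     [5] => n't
//     [6] => mind
//     [7] => ,
//     [8] => do
//     [9] => they
//     [10] => ?
// )
```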