Tokenizers

Tokenizing in regards to computer science, to quote wikipedia, is the process of converting a sequence of characters to a sequence of tokens.

In NLP we tokenize a large piece of text to generate tokens which are smaller pieces of text (words, sentences, etc.) that are easier to work with. For instance we might want to apply a stop word list , we will be applying it to the tokens and not the original text.

Tokenizer interface

The tokenizer interface is a very simple one with only one method.

interface TokenizerInterface
{
    /*
     * @param string $str The text for tokenization
     * @return array The list of tokens from the string
     */
    public function tokenize($str);
}

NlpTools does not introduce a new data structure for the token sequences they are plain arrays.

Tokenizers provided

The following tokenizers reside in the \NlpTools\Tokenizers namespace.

WhitespaceTokenizer
WhitespaceAndPunctuationTokenizer
ClassifierBasedTokenizer
RegexTokenizer
PennTreeBankTokenizer

use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 
$s = "Please allow me to introduce myself 
        I'm a man of wealth and taste";
 
$space = new WhitespaceTokenizer();
$punct = new WhitespaceAndPunctuationTokenizer();
// instantiate a binary classifier that chooses among the labels
// EOW --> End Of Word
// O   --> Other
// $cls <-- that classifier
$clstok = new ClassifierBasedTokenizer($cls);
 
$space->tokenize($s);
// array('Please','allow','me','to','introduce','myself',
//      'I\'m','a','man','of','wealth','and','taste')
 
$punct->tokenize($s);
// array('Please','allow','me','to','introduce','myself',
//      'I','\'','m','a','man','of','wealth','and','taste')
 
$clstok->tokenize($s);
// output depends on the classifier

Although WhitespaceTokenizer and WhitespaceAndPunctuationTokenizer are straight forward the rest of them need a bit of separate mention.

Classifier based tokenization

The constructor of the ClassifierBasedTokenizer looks like this

class ClassifierBasedTokenizer implements TokenizerInterface
{
    public function __construct(Classifier $cls, Tokenizer $tok=null,$sep=' ') {
        ...
    }
    ...

We will see how the classifier based tokenizer works with a simple example that implements a naive rule based sentence tokenizer and then I will explain a bit the simple algorithm that this tokenizer uses.

ClassifierBasedTokenizer sends to the classifier a document of type WordDocument (you might want to check the documents documentation).

<?php
 
include ('vendor/autoload.php');
 
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
 
class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token,$before,$after) = $d->getDocumentData();
 
        $dotcnt = count(explode('.',$token))-1;
        $lastdot = substr($token,-1)=='.';
 
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
 
        if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
            return 'O';
 
        return 'EOW';
    }
}
 
$tok = new ClassifierBasedTokenizer(
    new EndOfSentence(),
    new WhitespaceTokenizer()
);
 
$text = "We are what we repeatedly do.
        Excellence, then, is not an act, but a habit.";
 
print_r($tok->tokenize($text));
 
// Array
// (
//    [0] => We are what we repeatedly do.
//    [1] => Excellence, then, is not an act, but a habit.
// )

ClassifierBasedTokenizer follows the following simple algorithm to combine tokens from another tokenizer into larger tokens.

Break the character sequence into a token sequence using another Tokenizer instance
Classify each token whether it is an EOW or an O. EOW stands for "End of word" and O stands for "Other".
Join all O tokens up to an EOW token using a given separator (any character sequence)

RegexTokenizer

This tokenizer tokenizes text based on a set of regexes. The regexes are passed as a constructor parameter and can have the following three forms.

A simple string
An array of a string and an integer
An array of two strings

Each case is handled differently by the tokenizer. In the first case the input will be split using preg_split and the provided string as pattern. In the second case the integer denotes the subpattern to keep and the first the pattern to be used with preg_match. Finally in the third case preg_replace is used and the matches are replaced with the second string.

The results of each regex are then passed to the next as a pipeline.

use NlpTools\Tokenizers\RegexTokenizer;
 
$s = "Please allow me to introduce myself 
        I'm a man of wealth and taste";
 
$rtok = new RegexTokenizer(array(
    array("/\s+/"," "),              // replace many spaces with a single space
    array("/'(m|ve|d|s)/", " '\$1"), // split I've, it's, we've, we'd, ...
    "/ /"                            // split on every space
));
 
print_r($rtok->tokenize($s));
// Array
// (
//     [0] => Please
//     [1] => allow
//     [2] => me
//     [3] => to
//     [4] => introduce
//     [5] => myself
//     [6] => I
//     [7] => 'm
//     [8] => a
//     [9] => man
//     [10] => of
//     [11] => wealth
//     [12] => and
//     [13] => taste
// )

Treebank tokenization

This is a very popular tokenization for the english language. We use the Penn treebank tokenization for which you can read more at upenn.edu.

« Getting started / Documents »

NlpTools

Natural language processing in php

Api Docs

Documentation

Navigation