NlpTools API


class ClassifierBasedTokenizer implements TokenizerInterface

A tokenizer that uses a classifier (of any type) to determine if there is an "end of word" (EOW).

It takes an initial tokenizer as a parameter and then determines whether
two consecutive tokens should in fact be one token.

These tokenizers can be nested to produce sentence tokenizers.
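As a rough illustration of the nesting idea, the sketch below treats whitespace-separated words as the initial tokens and marks a token as the end of a unit when it ends in sentence-final punctuation, so the merged units are sentences. It is plain PHP and does not use the NlpTools classes: `mergeWithClassifier` and the punctuation rule are hypothetical stand-ins for the tokenizer and a trained classifier.

```php
<?php
// Hypothetical stand-in for a classifier: true when $token ends a sentence.
$endsSentence = function (string $token): bool {
    return preg_match('/[.!?]$/', $token) === 1;
};

// Hypothetical helper: joins consecutive tokens until $isEnd() closes the unit.
function mergeWithClassifier(array $tokens, callable $isEnd, string $sep = ' '): array
{
    $units = [];
    $current = [];
    foreach ($tokens as $token) {
        $current[] = $token;
        if ($isEnd($token)) {
            $units[] = implode($sep, $current);
            $current = [];
        }
    }
    // Design choice of this sketch: keep trailing tokens that were never closed.
    if ($current !== []) {
        $units[] = implode($sep, $current);
    }
    return $units;
}

// Initial tokenization: split on whitespace.
$words = preg_split('/\s+/', 'We will go. Will you come? Yes!', -1, PREG_SPLIT_NO_EMPTY);
$sentences = mergeWithClassifier($words, $endsSentence);
print_r($sentences);
```

Joining with a space keeps the words of each sentence readable. A word-level merge would more likely use an empty separator.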


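If, for example, we tokenize the sentence "Me and O'Brien, we 'll go!"
with a simple space tokenizer, we end up with

array("Me", "and", "O'Brien,", "we", "'ll", "go!")

If we use a space and punctuation tokenizer, we end up with

array("Me", "and", "O", "'", "Brien", ",", "we", "'", "ll", "go", "!")

but we want

array("Me", "and", "O'Brien", ",", "we", "'ll", "go", "!")

so we should train a classifier to label each token as follows, where
EOW ends a word and O means the token is joined with the one after it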

Token | Class
Me | EOW
and | EOW
O | O
' | O
Brien | EOW
, | EOW
we | EOW
' | O
ll | EOW
go | EOW
! | EOW
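The merge implied by the table can be sketched in plain PHP. This is a minimal illustration rather than the library's implementation: the function name `mergeByLabels` is hypothetical, and the hard-coded label list stands in for the output of a trained classifier. An empty separator is assumed so that the merged pieces read O'Brien and 'll.

```php
<?php
// Hypothetical helper, not part of NlpTools: joins consecutive tokens
// until a token labelled EOW closes the current word.
function mergeByLabels(array $tokens, array $labels, string $sep = ''): array
{
    $merged = [];
    $current = [];
    foreach ($tokens as $i => $token) {
        $current[] = $token;
        if ($labels[$i] === 'EOW') {
            $merged[] = implode($sep, $current);
            $current = [];
        }
    }
    // Design choice of this sketch: keep trailing tokens never closed by EOW.
    if ($current !== []) {
        $merged[] = implode($sep, $current);
    }
    return $merged;
}

// Tokens and labels copied from the table above.
$tokens = ['Me', 'and', 'O', "'", 'Brien', ',', 'we', "'", 'll', 'go', '!'];
$labels = ['EOW', 'EOW', 'O', 'O', 'EOW', 'EOW', 'EOW', 'O', 'EOW', 'EOW', 'EOW'];
$result = mergeByLabels($tokens, $labels);
print_r($result);
```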




__construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')

array tokenize(string $str)

Break a character sequence into a token sequence


at line 56
public __construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')


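ClassifierInterface $cls The classifier that determines whether a token is an end of word
TokenizerInterface $tok The initial tokenizer whose tokens are merged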

at line 77
public array tokenize(string $str)

Break a character sequence into a token sequence


string $str The character sequence to be broken into tokens

Return Value

array The token array