NlpTools API
Class

NlpTools\Tokenizers\ClassifierBasedTokenizer

class ClassifierBasedTokenizer implements TokenizerInterface

A tokenizer that uses a classifier (of any type) to decide whether each token marks an "end of word" (EOW).

It takes an initial tokenizer as a parameter and then decides whether
consecutive tokens produced by that tokenizer should in fact be merged
into a single token.

These tokenizers can be nested to produce sentence tokenizers.
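One way to picture the nesting: the output of a word-level tokenizer becomes the input of a second pass that decides where sentences end. The sketch below is standalone PHP and does not use any NlpTools classes. The helper `mergeUntil` and the punctuation rule that stands in for a trained classifier are hypothetical names introduced only for illustration.

```php
<?php
// Hypothetical helper, not part of NlpTools: joins consecutive tokens
// until $isEnd reports that the current token closes a unit.
function mergeUntil(array $tokens, callable $isEnd, string $sep): array
{
    $units = [];
    $current = [];
    foreach ($tokens as $i => $token) {
        $current[] = $token;
        if ($isEnd($tokens, $i)) {
            $units[] = implode($sep, $current);
            $current = [];
        }
    }
    // Choice made by this sketch: keep trailing tokens that were never closed.
    if ($current !== []) {
        $units[] = implode($sep, $current);
    }
    return $units;
}

// Assumed output of an inner, word-level tokenizer.
$words = ['We', 'go', '.', 'You', 'stay', '!'];

// Stand-in for a trained classifier: a sentence ends at ".", "!" or "?".
$endsSentence = function (array $tokens, int $i): bool {
    return in_array($tokens[$i], ['.', '!', '?'], true);
};

// Outer pass: merge word tokens into sentence tokens.
$sentences = mergeUntil($words, $endsSentence, ' ');
```

In a real setup the closure would be replaced by a classifier trained on sentence boundaries, and the inner token list would come from another tokenizer rather than a literal array.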

Example:

For example, suppose we want to tokenize the sentence
"Me and O'Brien, we 'll go!"
A simple space tokenizer produces
["Me","and","O'Brien,","we","'ll","go!"]
A space and punctuation tokenizer produces
["Me","and","O","'","Brien",",","we","'","ll","go","!"]
But what we want is
["Me","and","O'Brien",",","we","'ll","go","!"]
So we train a classifier to label each token from the space and punctuation
tokenizer as either EOW (the token ends a word) or O (the token is joined
with the one that follows):

Token | Class
------|------
Me    | EOW
and   | EOW
O     | O
'     | O
Brien | EOW
,     | EOW
we    | EOW
'     | O
ll    | EOW
go    | EOW
!     | EOW
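The labelling above can be turned back into tokens by accumulating pieces until an EOW label is seen. The following is a minimal standalone sketch of that idea, not the library's implementation. The function `mergeByLabels` is hypothetical, and the empty separator is chosen here so that the pieces of "O'Brien" join without spaces. The constructor's documented default for `$sep` is `' '`, and this page does not say how the class uses it.

```php
<?php
// Hypothetical helper for illustration only; not part of NlpTools.
// Joins each run of tokens that ends at an 'EOW' label into one token.
function mergeByLabels(array $tokens, array $labels, string $sep = ''): array
{
    $merged = [];
    $current = [];
    foreach ($tokens as $i => $token) {
        $current[] = $token;
        if ($labels[$i] === 'EOW') {
            $merged[] = implode($sep, $current);
            $current = [];
        }
    }
    // Choice made by this sketch: keep trailing pieces with no closing EOW.
    if ($current !== []) {
        $merged[] = implode($sep, $current);
    }
    return $merged;
}

// Tokens and labels copied from the table above.
$tokens = ['Me', 'and', 'O', "'", 'Brien', ',', 'we', "'", 'll', 'go', '!'];
$labels = ['EOW', 'EOW', 'O', 'O', 'EOW', 'EOW', 'EOW', 'O', 'EOW', 'EOW', 'EOW'];

$merged = mergeByLabels($tokens, $labels);
// $merged is ["Me", "and", "O'Brien", ",", "we", "'ll", "go", "!"]
```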

Constants

EOW

Methods

__construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')

array tokenize(string $str)

Break a character sequence into a token sequence

Details

at line 56
public __construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')

Parameters

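ClassifierInterface $cls The classifier that decides whether a token is an end of word (EOW)
TokenizerInterface $tok The initial tokenizer whose tokens are classified; defaults to null
$sep Separator string; defaults to ' '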

at line 77
public array tokenize(string $str)

Break a character sequence into a token sequence

Parameters

string $str The character sequence to be broken into tokens

Return Value

array The token array