class ClassifierBasedTokenizer implements TokenizerInterface
A tokenizer that uses a classifier (of any type) to decide whether each
token marks an "end of word" (EOW).
It takes an initial tokenizer as a parameter and then decides whether
any two consecutive tokens should in fact be one token.
Such tokenizers can be nested to build sentence tokenizers.
Example:
Suppose we want to tokenize the sentence
"Me and O'Brien, we 'll go!". With a simple space tokenizer we
would end up with
["Me","and","O'Brien,","we","'ll","go!"]
and with a space and punctuation tokenizer we would end up with
["Me","and","O","'","Brien",",","we","'","ll","go","!"]
but we want
["Me","and","O'Brien",",","we","'ll","go","!"]
so we should train a classifier to label the tokens of the second
result as follows (EOW ends a word, O continues into the next token)
Token | Cls
------------
Me | EOW
and | EOW
O | O
' | O
Brien | EOW
, | EOW
we | EOW
' | O
ll | EOW
go | EOW
! | EOW
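The table above can be turned back into the desired token list with a simple merge pass. The following is a minimal sketch of that idea, not the class's actual code: `mergeByTags` and its `$sep` parameter are hypothetical names introduced for illustration, the tags are hard-coded instead of coming from a trained classifier, and the empty-string separator is an assumption chosen so that the output matches the example.

```php
<?php
// Hypothetical helper (not part of the documented class): joins consecutive
// tokens until a token tagged 'EOW' is reached, then starts a new token.
function mergeByTags(array $tokens, array $tags, string $sep = ''): array
{
    $merged  = [];
    $current = [];
    foreach ($tokens as $i => $token) {
        $current[] = $token;
        if ($tags[$i] === 'EOW') {
            // End of word: emit everything collected so far as one token.
            $merged[]  = implode($sep, $current);
            $current   = [];
        }
    }
    // Assumption: tokens left over without a closing EOW form one last token.
    if ($current !== []) {
        $merged[] = implode($sep, $current);
    }
    return $merged;
}

// Tokens and tags copied from the table above.
$tokens = ["Me", "and", "O", "'", "Brien", ",", "we", "'", "ll", "go", "!"];
$tags   = ["EOW", "EOW", "O", "O", "EOW", "EOW", "EOW", "O", "EOW", "EOW", "EOW"];

$result = mergeByTags($tokens, $tags);
echo json_encode($result), PHP_EOL;
```

In a real setup the tags would come from the classifier passed to the constructor, one decision per token of the initial tokenizer.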
Constants
EOW
Methods
__construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')
array tokenize(string $str) | Break a character sequence to a token sequence
Details
at line 56
public __construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')
at line 77
public array tokenize(string $str)
Break a character sequence to a token sequence