NlpTools\Tokenizers\ClassifierBasedTokenizer

class ClassifierBasedTokenizer implements TokenizerInterface

A tokenizer that uses a classifier (of any type) to decide whether a token marks an "end of word" (EOW).

It takes an initial tokenizer as a parameter and then decides whether
consecutive tokens produced by that tokenizer should in fact be joined
into one token.

These tokenizers can be nested, for example to build sentence tokenizers.

Example:

If, for example, we were to tokenize the sentence
"Me and O'Brien, we 'll go!" with a simple space tokenizer, we
would end up with
["Me","and","O'Brien,","we","'ll","go!"]
With a space and punctuation tokenizer we would end up with
["Me","and","O","'","Brien",",","we","'","ll","go","!"]
but we want
["Me","and","O'Brien",",","we","'ll","go","!"]
so we should train a classifier to label each initial token as follows,
where EOW marks the last piece of a word and O marks a piece that is
joined with the token after it

Token | Cls
------------
Me | EOW
and | EOW
O | O
' | O
Brien | EOW
, | EOW
we | EOW
' | O
ll | EOW
go | EOW
! | EOW
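
The labelling in the table can be turned back into the desired token list with a simple merge loop. The following is a minimal sketch of that idea, not the library's own code: `mergeByLabels` is a hypothetical helper, the labels are typed in by hand instead of coming from a trained classifier, and an empty separator is assumed so that `O`, `'` and `Brien` join into `O'Brien`.

```php
<?php
// Hypothetical helper, not part of NlpTools: merges initial tokens using
// one label per token. 'EOW' closes the current word, 'O' keeps it open.
function mergeByLabels(array $tokens, array $labels, string $sep = ''): array
{
    $words = [];
    $current = [];
    foreach ($tokens as $i => $token) {
        $current[] = $token;
        if ($labels[$i] === 'EOW') {
            $words[] = implode($sep, $current);
            $current = [];
        }
    }
    // Flush any trailing pieces that were never closed by an EOW label.
    if ($current !== []) {
        $words[] = implode($sep, $current);
    }
    return $words;
}

// Tokens and labels copied from the table above.
$tokens = ['Me', 'and', 'O', "'", 'Brien', ',', 'we', "'", 'll', 'go', '!'];
$labels = ['EOW', 'EOW', 'O', 'O', 'EOW', 'EOW', 'EOW', 'O', 'EOW', 'EOW', 'EOW'];

$words = mergeByLabels($tokens, $labels);
echo implode(' | ', $words), PHP_EOL;
```

With the labels from the table, the merged list matches the desired tokenization given in the example.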

Constants

EOW

Methods

	__construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')
array	tokenize(string $str) Break a character sequence into a token sequence

Details

at line 56
`public __construct(ClassifierInterface $cls, TokenizerInterface $tok = null, $sep = ' ')`

Parameters

ClassifierInterface	$cls
TokenizerInterface	$tok
	$sep
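
To illustrate how the three constructor arguments might work together, here is a self-contained sketch of a sentence tokenizer built on the same idea. It is a stand-in and not the NlpTools implementation: `SketchClassifierTokenizer` is a hypothetical class, the classifier and initial tokenizer are plain callables rather than `ClassifierInterface` and `TokenizerInterface` objects, and it assumes that `$sep` is the string placed between merged tokens, which the signature suggests but this page does not state.

```php
<?php
// Stand-in for illustration only; NOT the NlpTools class. The classifier
// and the initial tokenizer are plain callables instead of
// ClassifierInterface / TokenizerInterface objects.
final class SketchClassifierTokenizer
{
    const EOW = 'EOW';

    private $cls;
    private $tok;
    private $sep;

    public function __construct(callable $cls, callable $tok, string $sep = ' ')
    {
        $this->cls = $cls;
        $this->tok = $tok;
        $this->sep = $sep;
    }

    public function tokenize(string $str): array
    {
        $tokens = ($this->tok)($str);
        $result = [];
        $current = [];
        foreach ($tokens as $i => $token) {
            $current[] = $token;
            // The classifier sees the token list and the current position.
            if (($this->cls)($tokens, $i) === self::EOW) {
                $result[] = implode($this->sep, $current);
                $current = [];
            }
        }
        // Keep trailing tokens that were never closed by an EOW label.
        if ($current !== []) {
            $result[] = implode($this->sep, $current);
        }
        return $result;
    }
}

// Initial tokenizer: split on runs of whitespace.
$whitespace = function (string $s): array {
    return preg_split('/\s+/', $s, -1, PREG_SPLIT_NO_EMPTY);
};

// Naive rule standing in for a trained classifier: a token ends a sentence
// when its last character is '.', '!' or '?'.
$endOfSentence = function (array $tokens, int $i): string {
    return preg_match('/[.!?]$/', $tokens[$i]) === 1 ? 'EOW' : 'O';
};

$sentences = (new SketchClassifierTokenizer($endOfSentence, $whitespace, ' '))
    ->tokenize("We will go. Will you come? Yes!");
print_r($sentences);
```

Under that assumption, a space separator suits sentence tokenization, because the words of a sentence are rejoined with spaces, while an empty separator suits cases like `O'Brien`, where the pieces were never separated by whitespace.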

at line 77
`public array tokenize(string $str)`

Break a character sequence into a token sequence

Parameters

string	$str	The character sequence to be broken into tokens

Return Value

array	The token array
