Documents
Documents wrap the actual document that gets classified (whatever that may be). A document can be anything from a single word to a word with context or maybe a whole email with metadata about previous conversations.
The Document interface
interface DocumentInterface
{
    /**
     * Return the data of what is being represented. If it were a word
     * we could return a word. If it were a blog post we could return
     * an array(Title,Body,array(Comments)).
     *
     * @return mixed
     */
    public function getDocumentData();
}
A Document is deliberately generic: it can return anything as its data. It is still useful to design general-purpose document classes that purpose-specific feature factories can rely on.
With the mindset expressed above, NlpTools contains three generic document types that could be used for many applications.
- TokensDocument
- WordDocument
- RawDocument
Documents also get to decide how to apply a transformation. For instance, to apply a stemmer to a document, one need not know how the data is structured inside the document but can simply delegate that to the document instance.
TokensDocument
TokensDocument is useful for feature factories and algorithms that use the "Bag of words" model for representing a large body of text. It wraps an array of tokens and simply returns this array as the data.
When applying a transformation, it applies it to every single token and removes the tokens that are transformed to null.
Continuing the tokenizers' example:
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Documents\TokensDocument;

$s1 = "Please allow me to introduce myself I'm a man of wealth and taste";
$s2 = "Hello, I love you, won't you tell me your name Hello, I love you, let me jump in your game";

$tok = new WhitespaceTokenizer();
$stones = new TokensDocument($tok->tokenize($s1));
// $stones now represents the "Sympathy for the devil" song
$doors = new TokensDocument($tok->tokenize($s2));
// $doors now represents the "Hello, I love you" song
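The "transform every token, drop the nulls" behaviour described for TokensDocument can be illustrated with a small standalone sketch. It does not use NlpTools itself: the class SketchTokensDocument, the use of a plain callable as the transformation, and the toy stop-word list are all assumptions made for illustration, and the library's own classes may be structured differently.

```php
<?php
// Hypothetical stand-in for a tokens document; NOT the NlpTools class.
final class SketchTokensDocument
{
    /** @var string[] */
    private array $tokens;

    public function __construct(array $tokens)
    {
        $this->tokens = array_values($tokens);
    }

    public function getDocumentData(): array
    {
        return $this->tokens;
    }

    // Apply $transform to each token and drop tokens that map to null.
    public function applyTransformation(callable $transform): void
    {
        $mapped = array_map($transform, $this->tokens);
        $kept = array_filter($mapped, static fn ($t) => $t !== null);
        $this->tokens = array_values($kept); // reindex from 0
    }
}

// Assumed toy transformation: lowercase each token, discard a few stop words.
$stopWords = ['a', 'of', 'to', 'and'];
$normalize = static function (string $token) use ($stopWords): ?string {
    $lower = strtolower($token);
    return in_array($lower, $stopWords, true) ? null : $lower;
};

$doc = new SketchTokensDocument(explode(' ', "Please allow me to introduce myself"));
$doc->applyTransformation($normalize);
print_r($doc->getDocumentData());
```

The caller never inspects the token array directly; it only hands the transformation to the document, which is the delegation the section describes.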
WordDocument
WordDocument represents a single token (usually a word) surrounded by a token context of some length. It is useful for representing certain elements, like person names, places, etc, in the context of a larger piece of text.
Like TokensDocument, this document applies a transformation to each token separately and removes the nulls.
To illustrate the difference from the TokensDocument example above:
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Documents\WordDocument;

$s1 = "Please allow me to introduce myself I'm a man of wealth and taste";
$s2 = "Hello, I love you, won't you tell me your name Hello, I love you, let me jump in your game";

$tok = new WhitespaceTokenizer();
$stones = new WordDocument(
    $tok->tokenize($s1),
    2,
    4
);
// $stones now represents the word 'me' in the context of 4
// surrounding words in any direction
$doors = new WordDocument(
    $tok->tokenize($s2),
    7,
    4
);
// $doors now also represents the word 'me' in the context of 4
// surrounding words in any direction
Because we have also captured the context of each word, we can tell the two occurrences of 'me' apart and produce different features for them.
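To make the word-plus-context idea concrete, here is a minimal standalone sketch of how such a window could be cut out of a token list. The helper wordWithContext is hypothetical and not part of NlpTools. It assumes the context length means "up to that many tokens on each side" and returns the word, the tokens before it, and the tokens after it; the data layout that WordDocument actually returns may differ.

```php
<?php
// Hypothetical helper, NOT part of NlpTools: split a token list into the
// word at $index plus up to $context tokens on each side of it.
function wordWithContext(array $tokens, int $index, int $context): array
{
    $start = max(0, $index - $context);              // clamp at the start of the text
    $before = array_slice($tokens, $start, $index - $start);
    $after = array_slice($tokens, $index + 1, $context); // array_slice clamps at the end
    return [$tokens[$index], $before, $after];
}

$s1 = "Please allow me to introduce myself I'm a man of wealth and taste";
$s2 = "Hello, I love you, won't you tell me your name Hello, I love you, let me jump in your game";

// Splitting on single spaces stands in for a whitespace tokenizer here.
[$w1, $before1, $after1] = wordWithContext(explode(' ', $s1), 2, 4);
[$w2, $before2, $after2] = wordWithContext(explode(' ', $s2), 7, 4);

echo $w1, ' | ', implode(' ', $before1), ' | ', implode(' ', $after1), PHP_EOL;
echo $w2, ' | ', implode(' ', $before2), ' | ', implode(' ', $after2), PHP_EOL;
```

Both calls select the same word, but the surrounding tokens differ, which is the information a feature factory could use to separate the two cases.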
RawDocument
RawDocument wraps a single PHP variable. Transformations are, as expected, applied to the variable as a whole.
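As a rough illustration of wrapping a single variable, the sketch below applies each transformation to the whole value rather than to individual tokens. SketchRawDocument is a hypothetical stand-in, not the NlpTools class, and passing built-in functions as callables is an assumption made to keep the example self-contained.

```php
<?php
// Hypothetical stand-in for a raw document; NOT the NlpTools class.
final class SketchRawDocument
{
    /** @var mixed the wrapped variable */
    private $data;

    public function __construct($data)
    {
        $this->data = $data;
    }

    public function getDocumentData()
    {
        return $this->data;
    }

    // The transformation receives the wrapped value as a whole.
    public function applyTransformation(callable $transform): void
    {
        $this->data = $transform($this->data);
    }
}

$raw = new SketchRawDocument("  Hello, I love you  ");
$raw->applyTransformation('trim');       // strip surrounding whitespace
$raw->applyTransformation('strtoupper'); // uppercase the whole string
echo $raw->getDocumentData(), PHP_EOL;
```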
Training document and Training set
In the Documents namespace there exist two more classes created to make working with documents a lot easier.
TrainingDocument is a simple wrapper for any Document class that also contains the label (or class) information regarding that specific Document instance. Thus, TrainingDocument is useful for supervised learning where the model is trained on documents whose labels are known beforehand.
use \NlpTools\Documents\TrainingDocument;

// $d is a Document instance
$td = new TrainingDocument('CLASS 1', $d);
TrainingSet is a container of TrainingDocuments. Besides offering an easier way of instantiating TrainingDocuments, it implements the \Iterator, \ArrayAccess and \Countable interfaces for working with many TrainingDocuments, and it offers an easy way to apply a transformation to every document.
use \NlpTools\Documents\TrainingSet;

// $a is a Document instance
$tset = new TrainingSet();
$tset->addDocument('CLASS 1', $a);
foreach ($tset as $class => $d) {
    // ...
}
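The iteration pattern above, with the class label as the key and the document as the value, can be tried without the library using a small standalone sketch. SketchTrainingSet is a hypothetical stand-in that implements only \Iterator and \Countable (it omits \ArrayAccess and transformations), uses plain token arrays in place of document objects, and assumes PHP 8.0 or later for the mixed type.

```php
<?php
// Hypothetical stand-in; NOT the NlpTools TrainingSet. Requires PHP 8.0+.
final class SketchTrainingSet implements Iterator, Countable
{
    /** @var array<int, array{0: string, 1: mixed}> list of [label, document] pairs */
    private array $pairs = [];
    private int $pos = 0;

    public function addDocument(string $class, mixed $document): void
    {
        $this->pairs[] = [$class, $document];
    }

    public function count(): int { return count($this->pairs); }
    public function rewind(): void { $this->pos = 0; }
    public function valid(): bool { return $this->pos < count($this->pairs); }
    // The key is the class label, so the same key may appear more than once.
    public function key(): mixed { return $this->pairs[$this->pos][0]; }
    public function current(): mixed { return $this->pairs[$this->pos][1]; }
    public function next(): void { $this->pos++; }
}

$set = new SketchTrainingSet();
$set->addDocument('stones', explode(' ', "Please allow me to introduce myself"));
$set->addDocument('doors', explode(' ', "Hello, I love you"));
$set->addDocument('stones', explode(' ', "I'm a man of wealth and taste"));

// Count tokens per class label while iterating label => document.
$tokensPerClass = [];
foreach ($set as $class => $tokens) {
    $tokensPerClass[$class] = ($tokensPerClass[$class] ?? 0) + count($tokens);
}
print_r($tokensPerClass);
```

Because labels repeat across documents, the loop accumulates per label instead of assuming each key is unique.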