NlpTools

Natural language processing in php

Feature Factories

"In machine learning and pattern recognition, a feature is an individual measurable heuristic property of a phenomenon being observed." (Wikipedia)

In NlpTools, features are represented as an array of either values or values paired with a weight. Features can be thought of as a set of functions and their return values, and the feature array lists the functions that fired. If every function returns only 0 or 1, the feature array is a sparse representation that holds only the names of the functions that returned 1.

For example, a common set of features in NLP is the set of words contained in the Document. Each such feature can be thought of as the function "Is 'worda' found in the document?". Together, these features model the presence of words in the Document.
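The idea can be sketched in plain PHP without NlpTools: each vocabulary word acts as one binary function, and the feature array keeps only the names of the functions that returned 1. The helper name `wordPresenceFeatures` and the sample vocabulary are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical helper, not part of NlpTools: treats every vocabulary word
// as the binary function "is this word found in the document?".
function wordPresenceFeatures(array $vocabulary, array $tokens)
{
    $fired = array();
    foreach ($vocabulary as $word) {
        // the function "fires" (returns 1) when the word occurs in the document
        if (in_array($word, $tokens, true)) {
            $fired[] = $word;
        }
    }
    return $fired;
}

$vocabulary = array('wealth', 'taste', 'sympathy');
$tokens = explode(' ', 'a man of wealth and taste');

// only the words of the vocabulary that occur in the document are listed
print_r(wordPresenceFeatures($vocabulary, $tokens));
```

Words of the vocabulary that do not occur in the document simply do not appear in the result, which is what makes the representation sparse.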

Types of Feature Arrays

Both Maxent and Naive Bayes receive feature arrays that are the set of the names of the functions that returned 1. Both feature presence and feature frequency can be modeled using this type of feature array.

Naive Bayes is usually trained with feature frequency, which means that in the feature array one function can exist many times, while Maxent is trained with feature presence.
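As a rough illustration of the difference, the following plain PHP sketch builds both kinds of feature array from the same tokens. The sample tokens are made up, and the sketch does not use any NlpTools class.

```php
<?php
$tokens = array('to', 'be', 'or', 'not', 'to', 'be');

// Frequency: a feature appears once for every time it fired
$frequency = $tokens;

// Presence: each feature that fired appears exactly once
$presence = array_values(array_unique($tokens));

echo implode(',', $frequency), PHP_EOL;
echo implode(',', $presence), PHP_EOL;
```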

Feature Factory Interface

    interface FeatureFactoryInterface
    {
        /**
         * Return an array with unique strings that are the features that
         * "fire" for the specified Document $d and class $class
         *
         * name: getFeatureArray
         * @return array
         */
        public function getFeatureArray($class, Document $d);
    }
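A custom feature factory only has to implement this one method. The sketch below declares minimal stand-ins for `Document` and `FeatureFactoryInterface` so that it runs without the library installed; the classes `SimpleTokens` and `LowercaseWordFeatures` are hypothetical names used for illustration and are not part of NlpTools.

```php
<?php
// Stand-ins so the sketch runs without NlpTools installed; with the library,
// use its own Document and FeatureFactoryInterface types instead.
interface Document
{
    public function getDocumentData();
}

interface FeatureFactoryInterface
{
    public function getFeatureArray($class, Document $d);
}

// Hypothetical document that simply holds a list of tokens
class SimpleTokens implements Document
{
    protected $tokens;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    public function getDocumentData()
    {
        return $this->tokens;
    }
}

// Hypothetical feature factory: one feature per distinct lowercased token
class LowercaseWordFeatures implements FeatureFactoryInterface
{
    public function getFeatureArray($class, Document $d)
    {
        $words = array_map('strtolower', $d->getDocumentData());
        // unique strings, as the interface comment asks for
        return array_values(array_unique($words));
    }
}

$ff = new LowercaseWordFeatures();
$doc = new SimpleTokens(array('The', 'cat', 'the', 'hat'));
print_r($ff->getFeatureArray('0', $doc));
```

This factory ignores `$class`, so the same features fire for every class.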

Callables as Features

FunctionFeatures is a class that receives a number of callbacks. The return values of the callbacks are merged to form the feature array.

The class can be asked to model presence or frequency using the methods modelPresence and modelFrequency. If it models presence, it returns an array of the feature names that "fired". If it models frequency, it returns an array of key-value pairs where the keys are the feature names and the values are the number of times each feature "fired".

The example that follows showcases the use of the class FunctionFeatures. See also Documents and Tokenizers.

    use NlpTools\FeatureFactories\FunctionFeatures;
    use NlpTools\Tokenizers\WhitespaceTokenizer;
    use NlpTools\Documents\Document;
    use NlpTools\Documents\WordDocument;

    // Define your features
    $feats = new FunctionFeatures();
    $feats->add(function ($class, Document $d) {
        // this feature is the presence of the word
        return current($d->getDocumentData());
    });
    $feats->add(function ($class, Document $d) {
        // this feature is the function 'is the word capitalized?'
        $w = current($d->getDocumentData());
        if (ctype_upper($w[0])) {
            return "isCapitalized";
        }
    });

    // tokenize the data and create documents
    $text = "Please allow me to introduce myself
    I'm a man of wealth and taste";
    $tokenizer = new WhitespaceTokenizer();
    $tokens = $tokenizer->tokenize($text);
    $documents = array();
    foreach ($tokens as $index => $token) {
        $documents[$index] = new WordDocument($tokens, $index, 5);
    }

    // print the features that fired for each document given the class '0'
    echo implode(
        PHP_EOL,
        array_map(
            function ($d) use ($feats) {
                return '['.implode(',', $feats->getFeatureArray('0', $d)).']';
            },
            $documents
        )
    );

    // print the features with their frequencies for the first document
    // ($d exists only inside the closure above, so a document is picked explicitly)
    $feats->modelFrequency();
    print_r(
        $feats->getFeatureArray('0', $documents[0])
    );

Data as Features

This simple feature factory returns all the document data as features. It is very useful, for example, for a quick and dirty Naive Bayes classifier.

A similar feature factory could easily be implemented with FunctionFeatures as shown in the following code. Note that the FunctionFeatures version would model presence by default, whereas this feature factory models frequency.

    use NlpTools\FeatureFactories\FunctionFeatures;
    use NlpTools\Documents\TokensDocument;

    $feats = new FunctionFeatures();
    $feats->add(function ($class, TokensDocument $d) {
        // every token of the document is a feature
        return $d->getDocumentData();
    });

What is the class argument?

As you may have noticed, the feature factory takes a class parameter in addition to the Document. This makes it possible to create features that fire only for certain classes (commonly used in Maxent).

An easy way to create features that fire only for a given class is to prepend the class name to the feature name.

    use NlpTools\FeatureFactories\FunctionFeatures;
    use NlpTools\Documents\TokensDocument;

    $feats = new FunctionFeatures();
    $feats->add(function ($class, TokensDocument $d) {
        // prepend the class name to each token
        return array_map(
            function ($token) use ($class) {
                return $class.$token;
            },
            $d->getDocumentData()
        );
    });
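To see what such features look like, the same mapping can be tried in plain PHP without NlpTools types. The helper `classPrefixedFeatures` and the class names 'spam' and 'ham' are hypothetical and serve only as an illustration.

```php
<?php
// Hypothetical helper mirroring the callback above, without NlpTools types
function classPrefixedFeatures($class, array $tokens)
{
    return array_map(
        function ($token) use ($class) {
            // the class name becomes part of the feature name
            return $class.$token;
        },
        $tokens
    );
}

$tokens = array('wealth', 'taste');

// the same tokens produce a different set of features for each class
print_r(classPrefixedFeatures('spam', $tokens));
print_r(classPrefixedFeatures('ham', $tokens));
```

Because the feature names differ per class, a feature such as 'spamwealth' can only fire when the class is 'spam'. Adding a separator between the class and the token is one way to keep the names unambiguous.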
