NlpTools

Natural language processing in PHP

Transformations

Any machine learning task, natural language processing included, requires a large amount of preprocessing in addition to feature engineering.

NlpTools provides Feature Factories for feature engineering and, more recently, a TransformationInterface for general preprocessing work.

    interface TransformationInterface
    {
        /**
         * Return the value transformed.
         * @param mixed $value The value to transform
         * @return mixed
         */
        public function transform($value);
    }

Again, the interface is deliberately generic, which gives the library a lot of flexibility. Below I list several classes that implement this interface and can be applied to documents for easy preprocessing.
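Because the interface has a single method, writing a custom transformation takes very little code. The following is a minimal sketch, not part of NlpTools: the interface is redeclared locally as a stand-in so the snippet runs without the library, and TrimPunctuation is a hypothetical class invented for illustration.

```php
<?php
// Local stand-in for the interface shown above, so the sketch runs
// without NlpTools installed.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical transformation: strips surrounding punctuation from a token.
class TrimPunctuation implements TransformationInterface
{
    public function transform($value)
    {
        return trim($value, ".,;:!?\"'()");
    }
}

$tokens = array("Hello,", "world!", "(again)");
$t = new TrimPunctuation();

// Apply the transformation to every token in turn.
$clean = array_map(array($t, 'transform'), $tokens);
print_r($clean);
```

With the real library, an object like this would presumably be passed to applyTransformation on a document instead of being mapped by hand.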

Stemmers

In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. Wikipedia

Stemmers can be used either through the TransformationInterface above or standalone, by calling stem on a single word or stemAll on an array of words.

Available stemmers in NlpTools

    use NlpTools\Stemmers\PorterStemmer;
    use NlpTools\Documents\TokensDocument;

    $s = "People are strange when you 're a stranger
    faces look ugly when you 're alone";
    // split on any whitespace so the line break also separates tokens
    $tokens = preg_split('/\s+/', $s);

    $stemmer = new PorterStemmer();

    // standalone usage
    print_r($stemmer->stemAll($tokens));

    // usage via TransformationInterface
    $d = new TokensDocument($tokens);
    $d->applyTransformation($stemmer);
    print_r($d->getDocumentData());
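To make the two usage modes concrete, here is a toy stemmer that exposes both a standalone stem method and a transform method that delegates to it. It is only a sketch: the suffix list is made up, the class name ToySuffixStemmer is hypothetical, and the algorithm is a naive suffix stripper, not the Porter algorithm or anything NlpTools ships.

```php
<?php
// Local stand-in for the interface, so the sketch runs without NlpTools.
interface TransformationInterface
{
    public function transform($value);
}

// Toy suffix stripper for illustration only. NOT the Porter algorithm.
class ToySuffixStemmer implements TransformationInterface
{
    // Checked in order; the first matching suffix is removed.
    private $suffixes = array("ing", "ers", "er", "ly", "s");

    // Standalone entry point: stem a single word.
    public function stem($word)
    {
        foreach ($this->suffixes as $suffix) {
            $stemLength = strlen($word) - strlen($suffix);
            // Keep at least three characters so short words survive intact.
            if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
                return substr($word, 0, $stemLength);
            }
        }
        return $word;
    }

    // Standalone entry point for a whole array of words.
    public function stemAll(array $words)
    {
        return array_map(array($this, 'stem'), $words);
    }

    // Transformation entry point: simply delegates to stem().
    public function transform($value)
    {
        return $this->stem($value);
    }
}

$stemmer = new ToySuffixStemmer();
$stems = $stemmer->stemAll(array("stranger", "faces", "look", "ugly"));
print_r($stems);
```

Having transform delegate to stem means the same object works both on its own and anywhere a transformation is expected.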

Stop words

Stop words are words that are filtered out because they add little or no value to the meaning of the text.

    use NlpTools\Utils\StopWords;
    use NlpTools\Documents\TokensDocument;

    $s = "People are strange when you 're a stranger
    faces look ugly when you 're alone";

    // tokens to drop from the document
    $stop = new StopWords(array(
        "are",
        "you",
        "'re",
        "a"
    ));

    // split on any whitespace so the line break also separates tokens
    $d = new TokensDocument(preg_split('/\s+/', $s));
    $d->applyTransformation($stop);
    print_r($d->getDocumentData());
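One way a stop word filter can fit a one-value-in, one-value-out interface is to return null for words that should disappear and let the caller drop the nulls. The sketch below illustrates that idea under that stated assumption. The text above does not say how NlpTools signals a removed token, and SimpleStopWords and applyToTokens are hypothetical names.

```php
<?php
// Local stand-in for the interface, so the sketch runs without NlpTools.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical stop word filter: maps stop words to null (an assumed
// convention for "remove this token"), everything else to itself.
class SimpleStopWords implements TransformationInterface
{
    private $lookup;

    public function __construct(array $stopWords)
    {
        // Use the words as array keys so each membership test is a hash lookup.
        $this->lookup = array_fill_keys($stopWords, true);
    }

    public function transform($value)
    {
        return isset($this->lookup[$value]) ? null : $value;
    }
}

// Hypothetical helper: apply a transformation and discard null results.
function applyToTokens(array $tokens, TransformationInterface $t)
{
    $mapped = array_map(array($t, 'transform'), $tokens);
    $kept = array_filter($mapped, function ($token) {
        return $token !== null;
    });
    // Re-index so the result is a plain list again.
    return array_values($kept);
}

$tokens = array("People", "are", "strange", "when", "you", "'re", "a", "stranger");
$kept = applyToTokens($tokens, new SimpleStopWords(array("are", "you", "'re", "a")));
print_r($kept);
```

The strict comparison against null matters here: a looser filter would also discard legitimate tokens such as "0".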

Normalizers

A normalizer's purpose is to map every possible spelling of a word to one consistent form. Most stemming algorithms expect normalized text anyway.

The most common normalization is to convert words to lower case. For some languages, though, this is not enough, since there may also be diacritics that need to be removed.

Normalizers are available for the following languages:

  • English
  • Greek

    use NlpTools\Utils\Normalizers\Normalizer;
    use NlpTools\Documents\TokensDocument;

    $s = "People Are Strange WhEn you 're A Stranger
    faces look Ugly when you 're Alone";
    // split on any whitespace so the line break also separates tokens
    $tokens = preg_split('/\s+/', $s);

    $norm = Normalizer::factory("English");

    // standalone usage
    print_r($norm->normalizeAll($tokens));

    // usage via TransformationInterface
    $d = new TokensDocument($tokens);
    $d->applyTransformation($norm);
    print_r($d->getDocumentData());
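The difference between plain lower-casing and diacritic removal can be sketched in a few lines. The function below is a hypothetical helper, not the library's Greek normalizer. It assumes the mbstring extension is available and only handles the seven accented lower-case Greek vowels listed in the map.

```php
<?php
// Hypothetical helper: lower-case a token, then strip the Greek tonos accent.
// Assumes the mbstring extension is loaded and the input is UTF-8.
function normalizeToken($token)
{
    // Multi-byte aware lower-casing; plain strtolower would skip Greek letters.
    $lower = mb_strtolower($token, 'UTF-8');

    // strtr with an array replaces whole substrings, so it is safe to use
    // with multi-byte characters.
    return strtr($lower, array(
        "ά" => "α",
        "έ" => "ε",
        "ή" => "η",
        "ί" => "ι",
        "ό" => "ο",
        "ύ" => "υ",
        "ώ" => "ω",
    ));
}

$normalized = array_map('normalizeToken', array("WhEn", "Ελληνικά"));
print_r($normalized);
```

For English tokens the accent map is a no-op, which is why lower-casing alone is usually enough there.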

Classifier Based Transformation

This transformation classifies its input and then applies a different set of transformations depending on the predicted class.

A simple use case is a multilingual transformation pipeline that applies a different set of preprocessors (stop words, stemmers, etc.) depending on the language of the input document.

    use NlpTools\Classifiers\ClassifierInterface;
    use NlpTools\Utils\ClassifierBasedTransformation;
    use NlpTools\Stemmers;
    use NlpTools\Utils\Normalizers\Normalizer;
    use NlpTools\Utils\StopWords;
    use NlpTools\Documents\TokensDocument;

    // ClassifierInterface is an interface, so it is implemented, not extended
    class LanguageDetector implements ClassifierInterface
    {
        ...
    }

    $lang_detector = new LanguageDetector(...);

    // each pipeline: normalize, drop stop words, then stem
    $greek = array(
        Normalizer::factory("Greek"),
        // the stop word file is read as one word per line
        new StopWords(
            explode(
                "\n",
                file_get_contents("greek_stop_words")
            )
        ),
        new Stemmers\GreekStemmer()
    );
    $english = array(
        Normalizer::factory("English"),
        new StopWords(
            explode(
                "\n",
                file_get_contents("english_stop_words")
            )
        ),
        new Stemmers\PorterStemmer()
    );

    // register one pipeline per class label the classifier can return
    $transform = new ClassifierBasedTransformation($lang_detector);
    $transform->register("English", $english);
    $transform->register("Greek", $greek);

    $s = "This text contains both Ελληνικά and English";
    $d = new TokensDocument(explode(" ", $s));
    $d->applyTransformation($transform);
    print_r($d->getDocumentData());
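The dispatch idea behind this transformation can be sketched without the library. Everything below is a self-contained illustration under stated assumptions, not the NlpTools implementation. ToyLanguageDetector, CallbackTransformation and ToyClassifierBasedTransformation are hypothetical names, and the toy classify method takes a bare token, which is simpler than what a real ClassifierInterface implementation would receive.

```php
<?php
// Local stand-in for the interface, so the sketch runs without NlpTools.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical classifier: "Greek" if the token contains any character
// from the Greek script, otherwise "English".
class ToyLanguageDetector
{
    public function classify($token)
    {
        return preg_match('/\p{Greek}/u', $token) === 1 ? "Greek" : "English";
    }
}

// Hypothetical adapter that turns any callable into a transformation.
class CallbackTransformation implements TransformationInterface
{
    private $fn;

    public function __construct($fn)
    {
        $this->fn = $fn;
    }

    public function transform($value)
    {
        return call_user_func($this->fn, $value);
    }
}

// Sketch of the dispatch: classify the value, then run that class's pipeline.
class ToyClassifierBasedTransformation implements TransformationInterface
{
    private $classifier;
    private $pipelines = array();

    public function __construct(ToyLanguageDetector $classifier)
    {
        $this->classifier = $classifier;
    }

    public function register($class, array $transformations)
    {
        $this->pipelines[$class] = $transformations;
    }

    public function transform($value)
    {
        $class = $this->classifier->classify($value);
        if (!isset($this->pipelines[$class])) {
            return $value; // no pipeline registered: leave the value untouched
        }
        foreach ($this->pipelines[$class] as $t) {
            $value = $t->transform($value);
        }
        return $value;
    }
}

$transform = new ToyClassifierBasedTransformation(new ToyLanguageDetector());
// English pipeline: lower-case, then tag. Greek pipeline: tag only.
$transform->register("English", array(
    new CallbackTransformation('strtolower'),
    new CallbackTransformation(function ($t) { return "en:" . $t; }),
));
$transform->register("Greek", array(
    new CallbackTransformation(function ($t) { return "el:" . $t; }),
));

$tokens = explode(" ", "This text contains both Ελληνικά and English");
$out = array_map(array($transform, 'transform'), $tokens);
print_r($out);
```

The prefixes exist only to make visible which pipeline handled each token. Because the composite is itself a transformation, it can be used anywhere a single transformation is expected.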
