Transformations
Any machine learning task, natural language processing included, requires a large amount of preprocessing in addition to feature engineering.
NlpTools provides Feature Factories for feature engineering and has recently added a TransformationInterface for general preprocessing work.
    interface TransformationInterface
    {
        /**
         * Return the value transformed.
         * @param  mixed $value The value to transform
         * @return mixed
         */
        public function transform($value);
    }
Again, the interface is very generic, which gives the library a lot of flexibility. Below I list several classes that implement this interface and can be applied to documents for easy preprocessing.
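Because the interface has a single method, writing a custom transformation takes only a few lines. The sketch below is an illustration and not part of NlpTools: the TrimPunctuation class is hypothetical, the interface is redeclared locally (copied from the definition above) so that the snippet runs without the library, and array_map stands in for a document's applyTransformation call.

```php
<?php
// Redeclared locally so the snippet is self-contained.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical transformation: strips punctuation from both ends of a token.
class TrimPunctuation implements TransformationInterface
{
    public function transform($value)
    {
        return trim($value, ".,;:!?");
    }
}

$tokens = array("Hello,", "world!");

// Apply the transformation to every token.
$trimmed = array_map(array(new TrimPunctuation(), 'transform'), $tokens);

echo implode(" ", $trimmed), PHP_EOL;
```

Since the transformation only sees one value at a time, it stays independent of where the tokens come from.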
Stemmers
"In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form." (Wikipedia)
Stemmers can be used either via the TransformationInterface above or standalone via the stem function.
Available stemmers in NlpTools
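    use NlpTools\Stemmers\PorterStemmer;
    use NlpTools\Documents\TokensDocument;

    $s = "People are strange when you 're a stranger faces look ugly when you 're alone";
    // split the sentence into tokens and wrap them in a document
    $d = new TokensDocument(explode(" ", $s));
    $stemmer = new PorterStemmer();

    // standalone usage: stem a single word
    $stem = $stemmer->stem("strange");

    // usage via TransformationInterface: stem every token of the document
    $d->applyTransformation($stemmer);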
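To make the idea of stemming concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm and not how NlpTools implements its stemmers. The CrudeStemmer class, its suffix list and the minimum stem length of three characters are all assumptions chosen for illustration, and the interface is redeclared locally so the snippet runs on its own.

```php
<?php
// Redeclared locally so the snippet is self-contained.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical toy stemmer: removes the first matching suffix from a short list.
class CrudeStemmer implements TransformationInterface
{
    private $suffixes = array("ing", "ed", "s");

    public function stem($word)
    {
        foreach ($this->suffixes as $suffix) {
            $stemLength = strlen($word) - strlen($suffix);
            // Keep at least three characters so short words such as "is" survive.
            if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
                return substr($word, 0, $stemLength);
            }
        }
        return $word;
    }

    // Exposing stem() through transform() lets the same object act as a transformation.
    public function transform($value)
    {
        return $this->stem($value);
    }
}

$stemmer = new CrudeStemmer();
$stems = array_map(array($stemmer, 'transform'), array("faces", "looking", "looked", "strange"));

echo implode(" ", $stems), PHP_EOL;
```

A real stemmer applies many more rules and conditions. The point here is only that a stemmer maps several surface forms ("looking", "looked") to one stem.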
Stop words
Stop words are words that are filtered out because they add little or nothing to the meaning of the text.
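    use NlpTools\Utils\StopWords;
    use NlpTools\Documents\TokensDocument;

    $s = "People are strange when you 're a stranger faces look ugly when you 're alone";
    // split the sentence into tokens and wrap them in a document
    $d = new TokensDocument(explode(" ", $s));

    // the words to filter out
    $stop = new StopWords(
        array(
            "are",
            "you",
            "'re",
            "a"
        )
    );

    // remove the stop words from the document
    $d->applyTransformation($stop);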
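One way a stop word filter can fit a one-value-in, one-value-out interface is to return null for unwanted tokens and let the caller discard the nulls. The sketch below works under that assumption. SimpleStopWords and applyToTokens are hypothetical stand-ins, not the library's StopWords class or its document code.

```php
<?php
// Redeclared locally so the snippet is self-contained.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical stop word filter: returns null for tokens that should be dropped.
class SimpleStopWords implements TransformationInterface
{
    private $stopWords;

    public function __construct(array $stopWords)
    {
        // Flip the list so that membership checks become key lookups.
        $this->stopWords = array_flip($stopWords);
    }

    public function transform($value)
    {
        return isset($this->stopWords[$value]) ? null : $value;
    }
}

// Hypothetical helper: transform every token, then discard the nulls and re-index.
function applyToTokens(array $tokens, TransformationInterface $transformation)
{
    $mapped = array_map(array($transformation, 'transform'), $tokens);
    $kept = array_filter($mapped, function ($token) {
        return $token !== null;
    });
    return array_values($kept);
}

$tokens = explode(" ", "People are strange when you 're a stranger");
$filtered = applyToTokens($tokens, new SimpleStopWords(array("are", "you", "'re", "a")));

echo implode(" ", $filtered), PHP_EOL;
```

The comparison is exact, so "Are" would not match "are". This is one reason to normalize tokens before filtering them.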
Normalizers
A normalizer maps every possible spelling of a word to a single, consistent form. Most stemming algorithms expect normalized text anyway.
The most common normalization is converting words to lower case. For some languages this is not enough, because diacritics may also need to be removed.
Normalizers are available for the following languages:
- English
- Greek
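    use NlpTools\Utils\Normalizers\Normalizer;
    use NlpTools\Documents\TokensDocument;

    $s = "People Are Strange WhEn you 're A Stranger faces look Ugly when you 're Alone";
    // split the sentence into tokens and wrap them in a document
    $d = new TokensDocument(explode(" ", $s));
    $norm = Normalizer::factory("English");

    // standalone usage (assuming the normalize() method): normalize a single word
    $word = $norm->normalize("Strange");

    // usage via TransformationInterface: normalize every token of the document
    $d->applyTransformation($norm);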
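The two normalization steps described above, lower-casing and diacritic removal, can be sketched in plain PHP. This is an illustration and not the library's Greek normalizer. The function name normalizeGreekToken is hypothetical, the accent map covers only the seven vowels with tonos, and the snippet assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: lower-case a token, then strip the tonos accent from Greek vowels.
function normalizeGreekToken($token)
{
    // strtolower() only handles ASCII, so use the multibyte variant for Greek.
    $lower = mb_strtolower($token, 'UTF-8');

    // Small illustrative map; a complete normalizer would cover more characters.
    return strtr($lower, array(
        'ά' => 'α',
        'έ' => 'ε',
        'ή' => 'η',
        'ί' => 'ι',
        'ό' => 'ο',
        'ύ' => 'υ',
        'ώ' => 'ω',
    ));
}

$normalized = array_map('normalizeGreekToken', array("Ελληνικά", "Strange"));

echo implode(" ", $normalized), PHP_EOL;
```

For English tokens the lower-casing step alone does the work, which matches the observation that some languages need more than a case change.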
Classifier Based Transformation
This transformation classifies the input it receives and then applies a different set of transformations depending on the resulting class.
A simple use is a multilingual transformation pipeline that applies a different set of preprocessors (different stop words, stemmers, etc.) depending on the language of the input document.
    use NlpTools\Classifiers\ClassifierInterface;
    use NlpTools\Utils\ClassifierBasedTransformation;
    use NlpTools\Stemmers;
    use NlpTools\Utils\Normalizers\Normalizer;
    use NlpTools\Utils\StopWords;
    use NlpTools\Documents\TokensDocument;

    // a classifier that returns the language of its input
    class LanguageDetector implements ClassifierInterface { ... }

    $lang_detector = new LanguageDetector(...);

    // preprocessing pipeline for Greek input
    $greek = array(
        Normalizer::factory("Greek"),
        // stop word list, one word per line
        new StopWords(explode("\n", ...)),
        new Stemmers\GreekStemmer()
    );

    // preprocessing pipeline for English input
    $english = array(
        Normalizer::factory("English"),
        // stop word list, one word per line
        new StopWords(explode("\n", ...)),
        new Stemmers\PorterStemmer()
    );

    // choose the pipeline according to the detected language
    $transform = new ClassifierBasedTransformation($lang_detector);
    $transform->register("English", $english);
    $transform->register("Greek", $greek);

    $s = "This text contains both Ελληνικά and English";
    $d = new TokensDocument(explode(" ", $s));
    $d->applyTransformation($transform);
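The dispatch idea itself is small enough to sketch without the library. In the version below everything is hypothetical: DispatchingTransformation and CallableTransformation are illustrative names, the "language detector" is a one-line check for Greek letters, and each token is classified on its own. It is meant to show the mechanism, not to reproduce ClassifierBasedTransformation. The snippet assumes the mbstring extension is available.

```php
<?php
// Redeclared locally so the snippet is self-contained.
interface TransformationInterface
{
    public function transform($value);
}

// Hypothetical adapter that turns any callable into a transformation.
class CallableTransformation implements TransformationInterface
{
    private $fn;

    public function __construct(callable $fn)
    {
        $this->fn = $fn;
    }

    public function transform($value)
    {
        return call_user_func($this->fn, $value);
    }
}

// Hypothetical dispatcher: classify the value, then run the pipeline registered for that class.
class DispatchingTransformation implements TransformationInterface
{
    private $classify;
    private $pipelines = array();

    public function __construct(callable $classify)
    {
        $this->classify = $classify;
    }

    public function register($class, array $transformations)
    {
        $this->pipelines[$class] = $transformations;
    }

    public function transform($value)
    {
        $class = call_user_func($this->classify, $value);
        $pipeline = isset($this->pipelines[$class]) ? $this->pipelines[$class] : array();
        foreach ($pipeline as $transformation) {
            $value = $transformation->transform($value);
            if ($value === null) {
                break; // an earlier step dropped the value
            }
        }
        return $value;
    }
}

// Toy detector: a token containing any Greek letter counts as Greek.
$detect = function ($token) {
    return preg_match('/\p{Greek}/u', $token) === 1 ? "Greek" : "English";
};

$transform = new DispatchingTransformation($detect);
$transform->register("English", array(new CallableTransformation('strtolower')));
$transform->register("Greek", array(
    new CallableTransformation(function ($t) { return mb_strtolower($t, 'UTF-8'); }),
    new CallableTransformation(function ($t) { return strtr($t, array('ά' => 'α')); }),
));

$tokens = explode(" ", "This text contains both Ελληνικά and English");
$result = array_map(array($transform, 'transform'), $tokens);

echo implode(" ", $result), PHP_EOL;
```

Because the dispatcher implements the same interface as the transformations it wraps, it can be nested or combined with them like any other preprocessing step.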