NlpTools

Natural language processing in php

Tagging parts of speech in Greek

Jul 21st, 2013

In this final post I present a usable POS tagger model and CLI application that reaches 93.2% accuracy in tagging parts of speech on the held-out test set, a noticeable improvement on the results of the thesis I base my work on.

Features

  1. The word
  2. One-, two- and three-letter suffixes
  3. The word without the above suffixes
  4. The previous word
  5. The next word
  6. If the word contains a number
  7. If the word is one letter

In code, the feature function is:

    function ($class, $doc) {
        list($w, $prev, $next) = $doc->getDocumentData();
        $features = array();
        $len = mb_strlen($w, "utf-8");

        // the actual word in lower case
        $features[] = "$class ^ ".mb_strtolower($w, "utf-8");

        if ($len > 3) {
            // the word's suffixes
            $features[] = "$class ^ sub(-1)=".mb_strtolower(mb_substr($w, -1, 3, "utf-8"), "utf-8");
            $features[] = "$class ^ sub(-2)=".mb_strtolower(mb_substr($w, -2, 3, "utf-8"), "utf-8");
            $features[] = "$class ^ sub(-3)=".mb_strtolower(mb_substr($w, -3, 3, "utf-8"), "utf-8");
        }

        // the word without the suffixes
        if ($len > 5) {
            $features[] = "$class ^ pre(-3)=".mb_strtolower(mb_substr($w, 0, -3, "utf-8"), "utf-8");
        }
        if ($len > 4) {
            $features[] = "$class ^ pre(-2)=".mb_strtolower(mb_substr($w, 0, -2, "utf-8"), "utf-8");
        }
        if ($len > 3) {
            $features[] = "$class ^ pre(-1)=".mb_strtolower(mb_substr($w, 0, -1, "utf-8"), "utf-8");
        }

        // the previous word
        if (isset($prev[0])) {
            $features[] = "$class ^ ctx(-1)=".mb_strtolower($prev[0], "utf-8");
        }

        // the next word
        if (isset($next[0])) {
            $features[] = "$class ^ ctx(1)=".mb_strtolower($next[0], "utf-8");
        }

        // the word contains a digit
        if (preg_match("/\d/u", $w)) {
            $features[] = "$class ^ has_number";
        }

        // the word is a single letter
        if (mb_strlen($w, "utf-8") == 1) {
            $features[] = "$class ^ one_letter";
        }

        return $features;
    }
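
To see what these features look like for a concrete token, here is a minimal stand-alone sketch of the same logic. It is not the code from the repository: posFeatures() is a hypothetical name, and it takes the word and its neighbours as plain strings instead of an NlpTools document, so it runs without the library (only the mbstring extension is assumed).

```php
<?php
// Stand-alone sketch of the feature function above. posFeatures() is a
// hypothetical helper; the real function receives an NlpTools document.
function posFeatures($class, $w, $prev = null, $next = null)
{
    $features = array();
    $len = mb_strlen($w, "utf-8");

    // the word itself in lower case
    $features[] = "$class ^ " . mb_strtolower($w, "utf-8");

    // one, two and three letter suffixes (only for words longer than 3)
    if ($len > 3) {
        for ($i = 1; $i <= 3; $i++) {
            $suffix = mb_substr($w, -$i, $i, "utf-8");
            $features[] = "$class ^ sub(-$i)=" . mb_strtolower($suffix, "utf-8");
        }
    }

    // the word without those suffixes, longest cut first
    for ($i = 3; $i >= 1; $i--) {
        if ($len > $i + 2) {
            $stem = mb_substr($w, 0, -$i, "utf-8");
            $features[] = "$class ^ pre(-$i)=" . mb_strtolower($stem, "utf-8");
        }
    }

    // the previous and the next word, when present
    if ($prev !== null) {
        $features[] = "$class ^ ctx(-1)=" . mb_strtolower($prev, "utf-8");
    }
    if ($next !== null) {
        $features[] = "$class ^ ctx(1)=" . mb_strtolower($next, "utf-8");
    }

    // the word contains a digit / the word is a single letter
    if (preg_match("/\d/u", $w)) {
        $features[] = "$class ^ has_number";
    }
    if ($len == 1) {
        $features[] = "$class ^ one_letter";
    }

    return $features;
}

// Features of "αγελάδα" under the candidate class "noun"
$features = posFeatures("noun", "αγελάδα", "μας", "βόσκει");
echo implode(PHP_EOL, $features), PHP_EOL;
```

Every feature string is prefixed with the candidate class, so the same word yields a different set of features for each class the model considers.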

Code

The code (not the models) can be found on GitHub. Most of the repository is taken up by the console application. Two files will interest anyone who wants to extend the tagger: PosTagger.php and PosTrainingSet.php.

Models

I am publishing one model trained on both the test set and the training set, a total of 31554 tokens. The complete model is 77MB serialized and contains about half a million features. Although it contains all the information learned during training, it is impractical to use; with a little pruning we can keep the same level of accuracy at roughly one fifth of the model size and startup time.

Thus, I will publish 4 different files here. 'model.bin' is the complete model. Each of the others, named 'model_thre_{num}.bin', is produced from model.bin by removing all features with value less than or equal to {num}.

File                  Size     Features   Accuracy on training set
model.bin             77M      ~500K      99.974%
model_thre_0.09.bin   17M      ~132K      99.974%
model_thre_0.49.bin   6.2M     ~52K       99.974%
model_thre_0.99.bin   3.2M     ~31K       99.822%
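
The pruning step can be sketched in a few lines. This is only an illustration under assumptions: the model is represented here as a plain array mapping feature strings to weights, pruneModel() is a hypothetical name, and the weights are made up. It follows the description literally (drop every feature whose value is less than or equal to the threshold); the published files may have been produced differently, for example by comparing absolute values.

```php
<?php
// Hypothetical helper: keep only the features whose weight is strictly
// greater than $threshold, i.e. remove those with value <= $threshold.
function pruneModel(array $weights, $threshold)
{
    return array_filter($weights, function ($value) use ($threshold) {
        return $value > $threshold;
    });
}

// Made-up weights, for illustration only
$weights = array(
    "noun ^ sub(-1)=α"     => 1.7,
    "verb ^ ctx(-1)=μας"   => 0.3,
    "article ^ one_letter" => 0.05,
);

$pruned = pruneModel($weights, 0.09);
echo count($weights), " features before, ", count($pruned), " after", PHP_EOL;
```

array_filter() preserves the keys, so the surviving features keep their names and weights unchanged.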


Usage

You can download the code from GitHub and run composer install, or download a usable (and executable) phar archive.

    git clone https://github.com/angeloskath/pos-tag.git
    cd pos-tag/
    composer install

With code

    // require("phar://pos-tag.phar/vendor/autoload.php");
    require("vendor/autoload.php");

    $tagger = new \PosTagger();
    $tagger->loadModelFromFile("path/to/model.bin");

    // $tok = new NlpTools\Tokenizers\WhitespaceTokenizer();
    // $tokens = $tok->tokenize("Η καλή μας αγελάδα βόσκει κάτω στην λιακάδα");
    // $tokens = $tok->tokenize("The quick brown fox jumped over the lazy dog");
    $tokens = array("Η","καλή","μας","αγελάδα","βόσκει","κάτω","στην","λιακάδα");

    $tags = $tagger->tag($tokens);
    echo implode(" ",$tags), PHP_EOL;
    // article adjective pronoun noun verb adverb article noun
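
The tagger returns one tag per token, in the same order, so the two arrays can be zipped to print each word next to its tag. The sketch below hard-codes the tags from the comment above instead of loading a model, so it runs on its own; in real use $tags would come from $tagger->tag($tokens).

```php
<?php
$tokens = array("Η","καλή","μας","αγελάδα","βόσκει","κάτω","στην","λιακάδα");

// Hard-coded for illustration, copied from the expected output above
$tags = explode(" ", "article adjective pronoun noun verb adverb article noun");

// array_map() with a null callback zips the arrays into (word, tag) pairs
$pairs = array_map(null, $tokens, $tags);

foreach ($pairs as $pair) {
    list($word, $tag) = $pair;
    echo $word, "\t", $tag, PHP_EOL;
}
```

Zipping is used instead of array_combine() because a sentence can contain the same word more than once, and duplicate keys would silently overwrite each other.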

With the console app

Download the console application and the models.

    # download the console app
    wget http://php-nlp-tools.com/files/pos-tag/pos-tag.phar -O pos-tag
    chmod +x pos-tag
    # download the model
    wget http://php-nlp-tools.com/files/pos-tag/models/model_thre_0.49.bin -O model.bin

Now you can tag plain text in Greek. Examples:

    # -m model.bin is not necessary because model.bin is the default
    ./pos-tag tag -m model.bin "Η καλή μας αγελάδα βόσκει κάτω στην λιακάδα"

    # Assuming greek_text is a file containing some text in Greek.
    ./pos-tag tag <greek_text

    # You can see a pretty useful help message for every command
    ./pos-tag help tag
    ./pos-tag help features

    # and of course you will see a list of available commands if
    # you simply run the app
    ./pos-tag

    # now follows an example tagging the first Greek text that
    # I found in one of the open tabs in my browser
    ./pos-tag tag -o "<w> <info><t></info><n>" "Το σώμα κειμένων του Ινστιτούτου Επεξεργασίας του Λόγου αναπτύχθηκε επί σειρά ετών και σήμερα περιλαμβάνει περισσότερες από 47.000.000 λέξεις"
    Το article
    σώμα noun
    κειμένων noun
    του article
    Ινστιτούτου noun
    Επεξεργασίας noun
    του article
    Λόγου noun
    αναπτύχθηκε verb
    επί preposition
    σειρά noun
    ετών noun
    και conjunction
    σήμερα adverb
    περιλαμβάνει verb
    περισσότερες adjective
    από preposition
    47.000.000 numeral
    λέξεις noun
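
Judging only from the output above, the -o format string seems to be a template applied once per token. The sketch below is a guess at how such a template could be expanded, not the application's actual code: it assumes <w> stands for the word, <t> for the tag and <n> for a newline, and it treats <info>...</info> as terminal color markup that can be dropped for plain output. renderTagged() is a hypothetical name; ./pos-tag help tag is the authoritative reference for the placeholders.

```php
<?php
// Hypothetical helper: expand the assumed placeholders <w>, <t> and <n>.
function renderTagged($format, $word, $tag)
{
    // drop what looks like console color markup
    $plain = preg_replace('#</?info>#', '', $format);

    // strtr() with an array does not rescan replaced text, so a word that
    // happened to contain "<t>" would not be substituted again
    return strtr($plain, array(
        '<w>' => $word,
        '<t>' => $tag,
        '<n>' => "\n",
    ));
}

echo renderTagged("<w> <info><t></info><n>", "σώμα", "noun");
```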