Tagging parts of speech in Greek
Jul 21st, 2013
In this final post I present a usable POS tagger model and a CLI application that reaches 93.2% accuracy in tagging parts of speech on the held-out test set, a noticeable improvement over the results of the thesis my work is based on.
Features
- The word
- One-, two- and three-letter suffixes
- The word without the above suffixes
- The previous word
- The next word
- Whether the word contains a number
- Whether the word is a single letter
    function ($class, $doc) {
        // the actual word in lower case
        if ($len>3) {
            // the word's suffixes
        }
        // the words without the suffixes
        if ($len>5)
        if ($len>4)
        if ($len>3)
        // the previous word
        // the next word
        $features[] = "$class ^ has_number";
        $features[] = "$class ^ one_letter";
        return $features;
    }
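The outline above elides most of its body, so here is a minimal, self-contained sketch of how the listed features could be generated. It is not the repository's code: the function name `sketchFeatures`, the feature string prefixes (`word=`, `suffix=`, `stem=`, `prev=`, `next=`) and the exact length guards are assumptions. The guards are one reading of the `$len>3`, `$len>4` and `$len>5` checks in the outline. It needs the mbstring extension, because Greek text is multi-byte in UTF-8.

```php
<?php
// Hypothetical helper, NOT the repository's implementation: builds feature
// strings for the token at position $i, following the feature list above.
function sketchFeatures(string $class, array $tokens, int $i): array
{
    $word = mb_strtolower($tokens[$i], 'UTF-8');
    $len = mb_strlen($word, 'UTF-8');
    $features = ["$class ^ word=$word"];

    // One-, two- and three-letter suffixes, only for words longer than 3 letters.
    if ($len > 3) {
        for ($n = 1; $n <= 3; $n++) {
            $features[] = "$class ^ suffix=" . mb_substr($word, -$n, null, 'UTF-8');
        }
    }

    // The word without each suffix. Assumed guard: $len > $n + 2, which gives
    // the thresholds 3, 4 and 5 for the 1-, 2- and 3-letter suffix.
    for ($n = 1; $n <= 3; $n++) {
        if ($len > $n + 2) {
            $features[] = "$class ^ stem=" . mb_substr($word, 0, $len - $n, 'UTF-8');
        }
    }

    // Neighbouring words, when they exist.
    if ($i > 0) {
        $features[] = "$class ^ prev=" . mb_strtolower($tokens[$i - 1], 'UTF-8');
    }
    if ($i < count($tokens) - 1) {
        $features[] = "$class ^ next=" . mb_strtolower($tokens[$i + 1], 'UTF-8');
    }

    // Boolean features are emitted only when they hold.
    if (preg_match('/\d/', $word) === 1) {
        $features[] = "$class ^ has_number";
    }
    if ($len === 1) {
        $features[] = "$class ^ one_letter";
    }

    return $features;
}

$tokens = ["Η", "καλή", "μας", "αγελάδα"];
print_r(sketchFeatures("noun", $tokens, 3));
```

Each feature string is prefixed with the candidate class, as in the outline, so the same observation produces a different feature for every tag the model considers.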
Code
The code (not the models) can be found on GitHub. Most of the repository contents are for the console application. Two files might interest anyone who wants to extend the tagger: PosTagger.php and PosTrainingSet.php.
Models
I am publishing one model trained on both the test set and the training set, a total of 31,554 tokens. The complete model is 77MB serialized and contains more than half a million features. Although it contains everything learned during training, it is impractical to use. With a little pruning we can keep the same level of accuracy at one fifth of the model size and startup time.
Thus, I am publishing four files here. 'model.bin' is the complete model. Each of the others, named 'model_thre_{num}.bin' where {num} is a threshold value, is produced from model.bin by removing all features with value less than or equal to {num}.
| File | Size | Number of features | Accuracy on training set |
|---|---|---|---|
| model.bin | 77 MB | ~ 500K | 99.974% |
| model_thre_0.09.bin | 17 MB | ~ 132K | 99.974% |
| model_thre_0.49.bin | 6.2 MB | ~ 52K | 99.974% |
| model_thre_0.99.bin | 3.2 MB | ~ 31K | 99.822% |
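The pruning step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the model is treated as a plain array mapping feature strings to learned values, which is not necessarily how model.bin is laid out, and the function name `pruneModel` and the sample weights are made up. The comparison follows the wording of this post literally (drop values less than or equal to the threshold); whether the original pruning compared absolute values is not stated.

```php
<?php
// Hypothetical sketch: $weights maps feature strings to learned values.
// The real layout of model.bin is not described in this post.
function pruneModel(array $weights, float $threshold): array
{
    // Keep only features whose value is strictly greater than the threshold,
    // i.e. drop everything "less than or equal to {num}".
    return array_filter(
        $weights,
        function ($value) use ($threshold) {
            return $value > $threshold;
        }
    );
}

// Made-up weights, for illustration only.
$weights = [
    "noun ^ suffix=α"   => 1.37,
    "verb ^ suffix=ει"  => 0.52,
    "noun ^ has_number" => 0.09,
    "verb ^ one_letter" => 0.004,
];

foreach ([0.09, 0.49, 0.99] as $threshold) {
    $kept = pruneModel($weights, $threshold);
    printf("threshold %.2f keeps %d of %d features\n", $threshold, count($kept), count($weights));
}
```

Because `array_filter` preserves keys, the pruned array can be used wherever the full one was, just with fewer entries to load.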
Usage
You can download the code from GitHub and run composer install, or download a ready-to-use executable phar archive.
    git clone https://github.com/angeloskath/pos-tag.git
    cd pos-tag/
    composer install
With code
    // when using the phar archive, load its autoloader instead:
    // require("phar://pos-tag.phar/vendor/autoload.php");
    require("vendor/autoload.php");

    $tagger = new \PosTagger();
    $tagger->loadModelFromFile("path/to/model.bin");

    // split the sentence into tokens on whitespace
    $tok = new NlpTools\Tokenizers\WhitespaceTokenizer();
    $tokens = $tok->tokenize("Η καλή μας αγελάδα βόσκει κάτω στην λιακάδα");
    // $tokens = $tok->tokenize("The quick brown fox jumped over the lazy dog");

    $tags = $tagger->tag($tokens);
    // article adjective pronoun noun verb adverb article noun
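To show what can be done with the result, here is a small sketch that pairs each token with its tag, one pair per line. It assumes that `tag()` returns an array of tags aligned with the input tokens, which the comment in the snippet above suggests but this post does not spell out. The tagger call is replaced by literal arrays taken from that snippet so the example runs on its own, and `pairTokensWithTags` is a hypothetical helper name.

```php
<?php
// Tokens of the example sentence and the tags listed for it in this post.
$tokens = ["Η", "καλή", "μας", "αγελάδα", "βόσκει", "κάτω", "στην", "λιακάδα"];
$tags = ["article", "adjective", "pronoun", "noun", "verb", "adverb", "article", "noun"];

// Hypothetical helper: returns one "word tag" string per token.
function pairTokensWithTags(array $tokens, array $tags): array
{
    // Re-index both arrays so positions line up even if keys were sparse.
    $tokens = array_values($tokens);
    $tags = array_values($tags);
    if (count($tokens) !== count($tags)) {
        throw new InvalidArgumentException("tokens and tags must have the same length");
    }

    $lines = [];
    foreach ($tokens as $i => $token) {
        $lines[] = $token . " " . $tags[$i];
    }
    return $lines;
}

echo implode("\n", pairTokensWithTags($tokens, $tags)), "\n";
```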
With the console app
Download the console application and the models.
    # download the console app
    wget http://php-nlp-tools.com/files/pos-tag/pos-tag.phar -O pos-tag
    chmod +x pos-tag

    # download the model
    wget http://php-nlp-tools.com/files/pos-tag/models/model_thre_0.49.bin -O model.bin
Now you can tag plain text in Greek. Examples:
    # -m model.bin is not necessary because model.bin is the default
    ./pos-tag tag -m model.bin "Η καλή μας αγελάδα βόσκει κάτω στην λιακάδα"

    # Assuming greek_text is a file containing some text in Greek.
    ./pos-tag tag <greek_text

    # You can see a pretty useful help message for every command
    ./pos-tag help tag
    ./pos-tag help features

    # and of course you will see a list of available commands if
    # you simply run the app
    ./pos-tag

    # now follows an example tagging the first Greek text that
    # I found in one of the open tabs in my browser
    ./pos-tag tag -o "<w> <info><t></info><n>" "Το σώμα κειμένων του Ινστιτούτου Επεξεργασίας του Λόγου αναπτύχθηκε επί σειρά ετών και σήμερα περιλαμβάνει περισσότερες από 47.000.000 λέξεις"

    Το article
    σώμα noun
    κειμένων noun
    του article
    Ινστιτούτου noun
    Επεξεργασίας noun
    του article
    Λόγου noun
    αναπτύχθηκε verb
    επί preposition
    σειρά noun
    ετών noun
    και conjunction
    σήμερα adverb
    περιλαμβάνει verb
    περισσότερες adjective
    από preposition
    47.000.000 numeral
    λέξεις noun