NlpTools

Natural language processing in php

Developing a Greek part of speech tagger

Jun 29th, 2013

As part of a recent attempt to develop tools for natural language processing in Greek, I decided to implement the state of the art in Greek POS tagging (the paper reporting the best accuracy that I could find) and offer the result in a form that is easy to use as a subsystem in one's own projects.

The tagger is based on the 2011 BSc thesis of Evangelia Koleli (the paper is in Greek). She was kind enough to provide me with the dataset she used. Evangelia's work also provides an API for using the tagger in your own projects (in Java); I am aiming at even simpler integration, especially for web projects, hence PHP.

In this series of posts I will be documenting the development of the tagger.

Problem statement

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context.

Wikipedia

In this project I will be using only the small set of parts of speech because I believe it is a good tradeoff between information and model simplicity. The small set of categories as defined by Evangelia Koleli is the following:

  • verb
  • noun
  • adjective
  • adverb
  • article
  • pronoun
  • numeral
  • preposition
  • particle
  • conjunction
  • punctuation
  • other

The approach

POS tagging is a sequence tagging problem. The algorithms typically used aim to find the most likely sequence of tags for the observed sequence of words. It is also usually handled one sentence at a time.

For this tagger the approach is much simpler and more straightforward. We treat the problem as a simple classification problem in which each word is classified on its own. We can use the complete word sequence for feature creation but none of the tag sequence (e.g. we cannot use the knowledge that the previous word is a noun, but we can use the previous word itself).
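To make the constraint concrete, here is a minimal sketch of what per-word feature creation could look like. The function name `contextFeatures`, the feature strings and the choice of a two-character suffix are all assumptions made for illustration; they are not the features used by the tagger or part of NlpTools. The point is only that every feature is derived from the words of the sentence and never from a tag.

```php
<?php
// Hypothetical helper (not part of NlpTools): builds features for the word
// at position $i using only the words of the sentence, never any tags.
function contextFeatures(array $words, $i)
{
    $w = mb_strtolower($words[$i], "utf-8");
    $features = array(
        "word=" . $w,
        // Assumption: word endings are a useful cue in an inflected language
        "suffix=" . mb_substr($w, -2, 2, "utf-8"),
    );
    // Previous word, or a start-of-sentence marker
    if ($i > 0) {
        $features[] = "prev=" . mb_strtolower($words[$i - 1], "utf-8");
    } else {
        $features[] = "prev=<s>";
    }
    // Next word, or an end-of-sentence marker
    if ($i < count($words) - 1) {
        $features[] = "next=" . mb_strtolower($words[$i + 1], "utf-8");
    } else {
        $features[] = "next=</s>";
    }
    return $features;
}

// Example sentence, chosen only for illustration
$sentence = array("Το", "σπίτι", "είναι", "μεγάλο");
$first = contextFeatures($sentence, 0);
$last = contextFeatures($sentence, 3);
```

Note that nothing like "prevtag=noun" can appear in the feature list: the classifier never sees the tags it assigned to neighbouring words.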

Baseline

Before we start discussing the tagger implementation I would like to present a baseline tagger. The baseline tagger simply tags each word with the most common tag that word had in the training set. If it has never seen the word before, it tags it with the most common tag overall (usually noun). In fact, adding a set of targeted transformation rules to this baseline classifier results in the simple and effective Brill tagger.

The baseline tagger usually has an accuracy of approximately 90%, which is huge for a baseline compared to other classification tasks, but it makes intuitive sense for this task since many words can only be tagged with one part of speech.

    class Baseline implements Classifier
    {
        // word => (tag => count), each row sorted by count, descending
        protected $map = array();
        // tag => count over the whole training set, sorted descending
        protected $classmap;

        public function __construct(TrainingSet $training)
        {
            // Start every counter at zero for each known tag
            $entry = array_fill_keys($training->getClassSet(), 0);
            $this->classmap = $entry;
            foreach ($training as $class => $d) {
                list($w) = $d->getDocumentData();
                $w = mb_strtolower($w, "utf-8");
                if (!isset($this->map[$w])) {
                    $this->map[$w] = $entry;
                }
                $this->map[$w][$class]++;
                $this->classmap[$class]++;
            }
            // Sort so that the most frequent tag comes first
            foreach ($this->map as &$occ) {
                arsort($occ);
            }
            unset($occ); // break the reference left over from the loop
            arsort($this->classmap);
        }

        public function classify(array $classes, Document $d)
        {
            list($w) = $d->getDocumentData();
            $w = mb_strtolower($w, "utf-8");
            // Unseen word: fall back to the most common tag overall
            if (!isset($this->map[$w])) {
                return key($this->classmap);
            }
            // Seen word: return the tag it had most often in training
            return key($this->map[$w]);
        }
    }

This baseline tagger, learned from the training set, has an accuracy of 82.62% on the test set. Yet it is a very poor tagger, which is easy to see from the confusion matrix or simply from the per-tag precision and recall shown in the table below.

Tag           Precision (%)   Recall (%)
verb          100.0           49.47
noun          60.65           98.90
adjective     91.03           41.90
adverb        96.21           85.23
article       94.80           98.55
pronoun       92.46           73.50
numeral       96.95           72.94
preposition   98.11           99.18
particle      98.92           98.92
conjunction   99.78           98.09
punctuation   100.0           98.93
other         100.0           26.82

From the table above one can see that more than half of the verbs have been tagged incorrectly (verb recall is 49.47%). The confusion matrix shows that they have mostly been tagged as nouns (as most misclassified words are), which also explains the really bad precision of the noun tag.

The above shows that accuracy, or even precision, recall and F1 score, are not good metrics of the quality of a POS tagger. Using the confusion matrix one can understand how badly a misclassification would stand out to a human evaluator. Classifying verbs as nouns definitely does not look good, even with 90% accuracy.
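For readers who want to reproduce this kind of analysis, the following is a minimal sketch of how a confusion matrix and the per-tag precision and recall can be computed from two parallel lists of tags. The helper names `confusionMatrix` and `precisionRecall` are hypothetical (they are not NlpTools functions), and the five-word data set is invented purely for illustration.

```php
<?php
// Hypothetical helper: counts how often each gold tag was predicted as each
// tag. $gold and $predicted are parallel arrays of tag names.
function confusionMatrix(array $gold, array $predicted)
{
    $matrix = array();
    foreach ($gold as $i => $g) {
        $p = $predicted[$i];
        if (!isset($matrix[$g][$p])) {
            $matrix[$g][$p] = 0;
        }
        $matrix[$g][$p]++;
    }
    return $matrix; // $matrix[gold tag][predicted tag] => count
}

// Hypothetical helper: precision and recall (in percent) for a single tag.
function precisionRecall(array $matrix, $tag)
{
    $truePositives = isset($matrix[$tag][$tag]) ? $matrix[$tag][$tag] : 0;
    // Row sum: all words whose gold tag is $tag
    $goldTotal = isset($matrix[$tag]) ? array_sum($matrix[$tag]) : 0;
    // Column sum: all words that were predicted as $tag
    $predictedTotal = 0;
    foreach ($matrix as $row) {
        if (isset($row[$tag])) {
            $predictedTotal += $row[$tag];
        }
    }
    return array(
        "precision" => $predictedTotal ? 100.0 * $truePositives / $predictedTotal : 0.0,
        "recall" => $goldTotal ? 100.0 * $truePositives / $goldTotal : 0.0,
    );
}

// Toy data, invented for illustration: one verb is mistaken for a noun
$gold      = array("verb", "verb", "noun", "noun", "article");
$predicted = array("noun", "verb", "noun", "noun", "article");

$m = confusionMatrix($gold, $predicted);
$verb = precisionRecall($m, "verb");
$noun = precisionRecall($m, "noun");
```

In this toy example four out of five words are tagged correctly, yet the cell `$m["verb"]["noun"]` shows exactly where the error went: the mistake lowers verb recall and noun precision at the same time, the same pattern seen in the table for the baseline tagger.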

In the next post we will be evaluating features using the much simpler Naive Bayes model.
