Natural language processing in php

Source classifier in PHP Nov 13th, 2013

Project: Programming language detection

Recently I had the need to identify the programming language of several snippets of code without relying on the file extension. I originally thought that it would be trivial to find code that does exactly that but as it turns out I have only found one implementation in Ruby. Thus, I decided to write my own source classifier in php.

Finding a dataset

The hardest part was to find and download a dataset for use with the classifier. I decided to download sources from Google code jam 2013 solutions. This decision was made for the following two simple reasons, firstly I would use the model to classify similar types of source code (small solutions to programming problems) and secondly it provides easy access to many different programming languages.

The quality of the produced classifier heavily depends on the dataset so one could improve the performance of the provided model by training on a different dataset.

Classification model and Features

Since it seemed to work well for Mr Chris Lowis I decided to use the same simple Naive Bayes model. The sources are read as is, they are not normalized. I use the WhitespaceAndPunctuationTokenizer to split the sources into tokens.

The feature factory is a simple frequency feature factory that cuts off the maximum frequency at 4 occurences.

  1. class CodeFeatures implements FeatureFactoryInterface
  2. {
  3. public function getFeatureArray($class, DocumentInterface $doc)
  4. {
  5. $tokens = $doc->getDocumentData();
  6. $tokens = array_count_values($tokens);
  7. foreach ($tokens as $tok=>&$v) {
  8. $v = min($v, 4);
  9. }
  10. return $tokens;
  11. }
  12. }

Using the LanguageDetector

I have wrapped the MultinomialNBClassifier in a class to ease its use, its training and its serialization.

  1. $detector = LanguageDetector::loadFromFile("/path/to/model");
  2. $lang = $detector->classify(<<<CODE
  3. #include <stdio.h>
  4. int main() {
  5. printf("Hello world!");
  6. }
  7. CODE
  8. );
  9. echo $lang, PHP_EOL; // C

In the github repository there is already a pretrained model that classifies among the following languages (the popular ones according to Google code jam):

  • C
  • C#
  • C++
  • Clojure
  • Go
  • Haskell
  • Java
  • Javascript
  • Pascal
  • Perl
  • PHP
  • Python
  • Ruby
  • Scala
  • Visual Basic

Train on your own files

In the repo there is also a bin directory that provides a lang-detect script. It is a simple script that allows training of new models on new datasets, evaluation of models and using a model to classify a file as a source code.


  1. # retrain the provided model
  2. bin/lang-detect train "data/train"
  3. # evaluate the trained model
  4. bin/lang-detect evaluate "data/test" # should print 0.98
  5. # classify some code
  6. bin/lang-detect classify "some/path/code.cpp"

The structure of the directories for training and evaluating should be one subdirectory per class and each subdirectory should contain one file per document (one source file). You can see an example of the above structure in the data/train and data/test directories.

Future work

Since it is a much less interesting and unimportant problem than the ones I usually like to battle with in my free time I have left a lot of concepts unexplored and went for the fastest half good solution (although it turned out better than half good).

One should be able to get much better results by changing both the feature factory and the tokenizer.


The tokenizer should probably be able to understand different types of strings and ignore them or mark them as "string". Maybe the same should apply to comments as well. Operators, parentheses and brackets should be parsed as separate tokens but not every single punctuation character. Some could be grouped like -> or =>.

Feature factory

As far as the feature factory is concerned, document frequency dictionaries could be used to emphasize to the keywords per class and differentiate them from the identifiers. Words at the beginning and ending of the document should be weighted differently as they are intuitively more important in differentiating the programming languages (imports, opening/closing tags, etc). Finally statistics about the identifiers could be collected, for instance whether camel casing or underscores are preferred.

Dbscan text clustering Sep 3rd, 2013

In this article I will be introducing Density Based Spatial Clustering of Applications with Noise, known as DBSCAN. DBSCAN is the latest addition to the Clustering namespace of php (it is still under development and not merged into master).

DBSCAN is a density based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes


Dbscan Implementation

The current implementation in NlpTools takes as a parameter an object that implements SpatialIndexInterface.

  1. // comments stripped
  2. interface SpatialIndexInterface
  3. {
  4. public function setDistanceMetric(Distance $d);
  5. public function index(array &$docs);
  6. public function add($doc);
  7. public function regionQuery($doc, $e);
  8. public function kNearestNeighbors($doc, $k);
  9. }

This is used to abstract away the logic of searching a set of points for neighbors so that this process can be optimized. For instance we could implement k-d trees or ball trees for faster neighbor searching. The current implementation uses naive linear search and has complexity O(n2).


  1. use NlpTools\Clustering\Dbscan;
  2. use NlpTools\Similarity\Euclidean;
  3. use NlpTools\Documents\TrainingSet;
  4. use NlpTools\FeatureFactories\DataAsFeatures;
  5. $tset = new TrainingSet();
  6. // load the training set with documents
  7. // create a large outer circle of points
  8. for ($i=0.0;$i<2*M_PI;$i+=M_PI/20) {
  9. $tset->addDocument(
  10. 'A',
  11. EuclideanPoint::getRandomPointAround(
  12. 150+70*cos($i),
  13. 100+70*sin($i),
  14. 8
  15. )
  16. );
  17. $tset->addDocument(
  18. 'A',
  19. EuclideanPoint::getRandomPointAround(
  20. 150+70*cos($i),
  21. 100+70*sin($i),
  22. 8
  23. )
  24. );
  25. }
  26. // create an inner smaller circle of points
  27. for ($i=0.0;$i<2*M_PI;$i+=M_PI/10) {
  28. $tset->addDocument(
  29. 'B',
  30. EuclideanPoint::getRandomPointAround(
  31. 150+20*cos($i),
  32. 100+20*sin($i),
  33. 8
  34. )
  35. );
  36. $tset->addDocument(
  37. 'B',
  38. EuclideanPoint::getRandomPointAround(
  39. 150+20*cos($i),
  40. 100+20*sin($i),
  41. 8
  42. )
  43. );
  44. }
  45. // we are stating that we consider a region to be dense
  46. // if it has at least 4 points in a circle of radius 20
  47. $clust = new Dbscan(
  48. 4, // the minimum number of points needed
  49. 20, // the radius in which we are looking for points
  50. new Euclidean() // we will use the Euclidean distance
  51. );
  52. // do the actual clustering
  53. list($clusters,$noise) = $clust->cluster($tset, new DataAsFeatures());

The above code creates a very simple dataset with two distinct clusters in the two-dimensional euclidean space. This simple dataset is often used as an example to illustrate the differences between DBSCAN and K-Means.

K-Means DBSCAN  

The second example that follows illustrates the property of DBSCAN to identify noise points and determine the number of clusters in the datatset. The dark blue points are noise. The points to be clustered are uniformly random.

Tagging parts of speech in Greek Jul 21st, 2013

Project: Greek POS Tagger

In this final post I will present a usable pos tagger model and cli application with 93.2% accuracy in tagging parts of speech on the held out test set (improving noticeably on the results of the thesis I base my work on).


  1. The word
  2. One, Two and Three letter suffixes
  3. The word without the above suffixes
  4. The previous word
  5. The next word
  6. If the word contains a number
  7. If the word is one letter
  1. function ($class, $doc) {
  2. list($w, $prev, $next) = $doc->getDocumentData();
  3. $features = array();
  4. $len = mb_strlen($w, "utf-8");
  5. // the actual word in lower case
  6. $features[] = "$class ^ ".mb_strtolower($w,"utf-8");
  7. if ($len>3) {
  8. // the word's suffixes
  9. $features[]="$class ^ sub(-1)=".mb_strtolower(mb_substr($w,-1, 3, "utf-8"), "utf-8");
  10. $features[]="$class ^ sub(-2)=".mb_strtolower(mb_substr($w,-2, 3, "utf-8"), "utf-8");
  11. $features[]="$class ^ sub(-3)=".mb_strtolower(mb_substr($w,-3, 3, "utf-8"), "utf-8");
  12. }
  13. // the words without the suffixes
  14. if ($len>5)
  15. $features[] = "$class ^ pre(-3)=".mb_strtolower(mb_substr($w, 0, -3, "utf-8"), "utf-8");
  16. if ($len>4)
  17. $features[] = "$class ^ pre(-2)=".mb_strtolower(mb_substr($w, 0, -2, "utf-8"), "utf-8");
  18. if ($len>3)
  19. $features[] = "$class ^ pre(-1)=".mb_strtolower(mb_substr($w, 0, -1, "utf-8"), "utf-8");
  20. // the previous word
  21. if (isset($prev[0]))
  22. $features[] = "$class ^ ctx(-1)=".mb_strtolower($prev[0],"utf-8");
  23. // the next word
  24. if (isset($next[0]))
  25. $features[] = "$class ^ ctx(1)=".mb_strtolower($next[0],"utf-8");
  26. if (preg_match("/\d/u",$w))
  27. $features[] = "$class ^ has_number";
  28. if (mb_strlen($w,"utf-8")==1)
  29. $features[] = "$class ^ one_letter";
  30. return $features;
  31. }


The code (not the models) can be found at github. Most of the contents of the repository are for the console application. There are two files that might interest anyone who wants to extend the tagger, PosTagger.php and PosTrainingSet.php.


I am publishing one model trained on both the test set and the training set, on a total of 31554 tokens. The complete model is 77MB serialized and contains more than half a million features. Although it contains all the information learned from the training it is impractical to use and with a little bit of pruning we can keep the same level of accuracy with one fifth the model size and startup time.

Thus, I will publish here 4 different files. 'model.bin' is the complete model. Each one of the others named 'model_thre_{num}.bin' with num being a variable number is simply produced from model.bin after removing all features with value less than or equal to {num}.

File Size in bytes Size in features Accuracy in training set
model.bin 77M ~ 500K 99.974%
model_thre_0.09.bin 17M ~ 132K 99.974%
model_thre_0.49.bin 6.2M ~ 52K 99.974%
model_thre_0.99.bin 3.2M ~ 31K 99.822%



You can dowload the code from github and run composer install or download a usable (and executable) phar archive.

  1. git clone
  2. cd pos-tag/
  3. composer install

With code

  1. // require("phar://pos-tag.phar/vendor/autoload.php");
  2. require("vendor/autoload.php");
  3. $tagger = new \PosTagger();
  4. $tagger->loadModelFromFile("path/to/model.bin");
  5. // $tok = new NlpTools\Tokenizers\WhitespaceTokenizer();
  6. // $tokens = $tok->tokenize("Η καλή μας αγελάδα βόσκει κάτω στην λιακάδα");
  7. // $tokens = $tok->tokenize("The quick brown fox jumped over the lazy dog");
  8. $tokens = array("Η","καλή","μας","αγελάδα","βόσκει","κάτω","στην","λιακάδα");
  9. $tags = $tagger->tag($tokens);
  10. echo implode(" ",$tags), PHP_EOL;
  11. // article adjective pronoun noun verb adverb article noun

With the console app

Download the console application and the models.

  1. # download the console app
  2. wget -O pos-tag
  3. chmod +x pos-tag
  4. # download the model
  5. wget -O model.bin

Now you can tag plain text in Greek. Examples:

  1. # -m model.bin is not necessary because model.bin is the default
  2. ./pos-tag tag -m model.bin "Η καλή μας αγελάδα βόσκει κάτω στην λιακάδα"
  3. # Assuming greek_text is a file containing some text in greek.
  4. ./pos-tag tag <greek_text
  5. # You can see a pretty useful help message for every command
  6. ./pos-tag help tag
  7. ./pos-tag help features
  8. # and of course you will see a list of available commands if
  9. # you simply run the app
  10. ./pos-tag
  11. # now follows an example tagging the first greek text that
  12. # I found in one of the open tabs in my browser
  13. ./pos-tag tag -o "<w> <info><t></info><n>" "Το σώμα κειμένων του Ινστιτούτου Επεξεργασίας του Λόγου αναπτύχθηκε επί σειρά ετών και σήμερα περιλαμβάνει περισσότερες από 47.000.000 λέξεις"
  14. Το article
  15. σώμα noun
  16. κειμένων noun
  17. του article
  18. Ινστιτούτου noun
  19. Επεξεργασίας noun
  20. του article
  21. Λόγου noun
  22. αναπτύχθηκε verb
  23. επί preposition
  24. σειρά noun
  25. ετών noun
  26. και conjunction
  27. σήμερα adverb
  28. περιλαμβάνει verb
  29. περισσότερες adjective
  30. από preposition
  31. 47.000.000 numeral
  32. λέξεις noun

Evaluating features with Naive Bayes Jul 21st, 2013

Project: Greek POS Tagger

Following Evangelia's paper we will be using a Maxent model for our final prediction model. But training Maxent models is an order of magnitude slower than training a Naive Bayes model. We could simply implement the features defined in the paper and train one final model, but because I want to gain a better understanding of the POS tagger I will be implementing them one at a time and evaluating each time the improvement. To avoid waiting, I will be using a simple Naive Bayes model.

Only the data as features

If we simply train a Naive Bayes model with the only features the presence of the words themselves then we have a model that does not differ much from our baseline. Only one feature fires for each word and for the unknown words the prior gets to decide the class (actually the results would be identical if it were not for smoothing the model).

Consequently the results are a model with a bit worse accuracy than the baseline but with much the same behaviour (it still mistakes half the verbs for nouns).

Options at hand

Up until now the features do not rely at all on our knowledge as a linguist or even on our knowledge of the greek language. As a matter of fact we could create such a baseline parser in any language just by feeding it a big pre-annotated corpus.

To improve further we will have to add features that more or less depend on our knowledge of the task of recognizing parts of speech. For instance the knowldge that the ending of verbs is an important characteristic of the greek language or that if the previous word is "την" (an article) it is very unlikely that an article will follow.

We can separate the features in two categories.

  • Contextual
  • Language based

We will of course add features of both types in the end, but I would like to discuss the differences of those two categories. I believe they deserve to be in different categories because they rely on different knowledge of the language.

The contextual features are features that try to exploit the fact that combinations of words have meaning e.g. "from the house" and "to the house" have exactly the opposite meaning. Contextual features can be generic, they will always carry information about the current word, regardless the language.

Language based methods rely on our knowledge of the specific language (or family of languages) that we are targeting. e.g. The fact that if the suffix of the word is 'ω' the word is probably a verb.

Contemplating on the Brill Tagger, what I have been describing above are rules that could be used in such a tagger. For this machine learning based tagger we will describe (in code) the way of coming up with rules and then the model, through the training process, will evaluate which rules are successful and should be relied upon and which are not.

For instance we will not be using as a feature the following function 'ends_in_omega' but the function 'last_char'.

  1. // not a good feature that only checks one possible
  2. // suffix and relies on our expert knowledge that this
  3. // suffix is important
  4. function ends_in_omega($w) {
  5. return mb_substr($w,-1,1,'utf-8')=='ω';
  6. }
  7. // a good feature. It does not rely at all on our
  8. // knowledge of a specific suffix. It simply asks for
  9. // the model to evaluate if a suffix is important for
  10. // a specific part of speech
  11. function last_char($w) {
  12. return mb_substr($w,-1,1,'utf-8');
  13. }

Actually trying out features


I have already mentioned that having only the data as features results in a slightly worse result than our baseline due to model smoothing.

  1. function ($class, $doc) {
  2. list($w) = $doc->getDocumentData();
  3. return mb_strtolower($w,"utf-8");
  4. }


Using the last three letters of the word should drastically improve the performance of our tagger because it shall now recognize the common suffixes produced by the greek grammatical conjugation, thus it shall improve upon the most important flaw of our tagger, the huge misclassification of verbs.

  1. function ($class, $doc) {
  2. list($w) = $doc->getDocumentData();
  3. if (mb_strlen($w,"utf-8")<=3)
  4. return;
  5. $suffixes = array();
  6. $suffixes[] = 'substr(-1) = '.mb_strtolower(mb_substr($w,-1, 3, "utf-8"), "utf-8");
  7. $suffixes[] = 'substr(-2) = '.mb_strtolower(mb_substr($w,-2, 3, "utf-8"), "utf-8");
  8. $suffixes[] = 'substr(-3) = '.mb_strtolower(mb_substr($w,-3, 3, "utf-8"), "utf-8");
  9. return $suffixes;
  10. }

Indeed those features improve the accuracy from the baseline (Data only) 80.6% to 87% and most importantly they improve verb's recall from 49.47% to 95.59% .


Using the context of the two surrounding words we expect a general improvement on the quality of the tagging not specific to any part of speech.

  1. function ($class, $doc) {
  2. list($w,$prev,$next) = $doc->getDocumentData();
  3. $f = array();
  4. for ($i=0;$i<count($prev);$i++) {
  5. $f[]= "prev($i) = {$prev[$i]}";
  6. }
  7. for ($i=0;$i<count($next);$i++) {
  8. $f[] = "next($i) = {$next[$i]}";
  9. }
  10. return $f;
  11. },

What we are actually getting though is more complex than that. If applying this feature on the baseline we will see a decrease in accuracy from 80.6% to 78.7%. Yet, with a second look we can see that the quality of the tagging has increased. The mistakes are spread out more (instead of being clustered on the nouns (not necessarily a good thing)) and we can now identify approximately 80% of the verbs instead of 50%.

In addition, if combined with the suffixes feature the overall accuracy increases to 88.3% from 87%.

In the next post

In the next and last post of this series I will present all the feature functions, package the tagger code in a github repository and package the trained model for ease of use (maybe even a cli application).

Developing a Greek part of speech tagger Jun 29th, 2013

Project: Greek POS Tagger

Within a recent attempt of mine to develop tools for natural language processing in Greek I decided to implement the state of the art in Greek POS tagging (the paper reporting the best accuracy that I could find) and offer the result in a way that it will be easy to use in one's projects as a subsystem.

The tagger is based on the work of Evangelia Koleli and her BSc thesis (the paper is in Greek) in 2011. She was kind enough to provide me with the dataset she used. Evangelia's work also provides an API for using the tagger in your own projects (in Java), I am aiming at even simpler integration (and especially web projects hence php).

In this series of posts I will be documenting the development of the tagger.

Problem statement

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context.


In this project I will be using only the small set of parts of speech because I believe it is a good tradeoff between information and model simplicity. The small set of categories as defined by Evangelia Koleli is the following:

  • verb
  • noun
  • adjective
  • adverb
  • article
  • pronoun
  • numeral
  • preposition
  • particle
  • conjuction
  • punctuation
  • other

The approach

POS Tagging is a sequence tagging problem. The algorithms used aim at finding the most likely sequence of tags that could correspond to the observed sequence, the words. It is also, usually, dealt with on a per sentence basis.

For this tagger the approach is actually quite simpler and straightforward. We deal with the problem as a simple classification problem. We can use the complete word sequence for feature creation but none of the tag sequence (ex.: we cannot use the knowledge that the previous word is a noun but we can use the previous word).


Before we start discussing the tagger implementation I would like to present a baseline parser. The baseline parser simply tags each word with the most common tag this word had in the training set. If it has never seen the word before then it tags it with the most common tag (usually noun). Actually, adding a set of targeted transformation rules to the above baseline classifier results in the simple and effective Brill Tagger.

The Baseline parser usually has an acuraccy of approximately 90% which is huge for a baseline, compared to other classification tasks, but it makes intuitive sense for this task since many words can only be tagged with one part of speech.

  1. class Baseline implements Classifier
  2. {
  3. protected $map;
  4. protected $classmap;
  5. public function __construct(TrainingSet $training) {
  6. $entry = array_fill_keys($training->getClassSet(), 0);
  7. $this->classmap = $entry;
  8. foreach ($training as $class=>$d) {
  9. list($w) = $d->getDocumentData();
  10. $w = mb_strtolower($w,"utf-8");
  11. if (!isset($this->map[$w]))
  12. $this->map[$w] = $entry;
  13. $this->map[$w][$class]++;
  14. $this->classmap[$class]++;
  15. }
  16. foreach ($this->map as &$occ) {
  17. arsort($occ);
  18. }
  19. arsort($this->classmap);
  20. }
  21. public function classify(array $classes, Document $d) {
  22. list($w) = $d->getDocumentData();
  23. $w = mb_strtolower($w,"utf-8");
  24. if (!isset($this->map[$w]))
  25. return key($this->classmap);
  26. return key($this->map[$w]);
  27. }
  28. }

This Baseline tagger learned from the training set has an accuracy of 82.62% on the test set. Yet this tagger is a very poor tagger and this can be easily illustrated if one sees the confusion matrix or simply the per tag precision and recall shown in the table below.

verb noun adjective adverb article pronoun numeral preposition particle conjuction punctuation other
Precision 100.0 60.65 91.03 96.21 94.80 92.46 96.95 98.11 98.92 99.78 100.0 100.0
Recall 49.47 98.90 41.90 85.23 98.55 73.50 72.94 99.18 98.92 98.09 98.93 26.82

From the table above one can see that more than half of the verbs have been tagged incorectly. By checking the confusion matrix we see that they have been mostly tagged as nouns (most misclassifications are) which also explains the really bad precision of the noun tag.

The above shows that accuracy or even precision, recall and F1 score are not good metrics of the quality of a POS tagger. Using the confusion matrix one can understand how badly a misclassification would stand out to a human evaluator. Classifying verbs as nouns definitely does not look good even with 90% accuracy.

In the next post we will be evaluating features using the much simpler Naive Bayes model.

Maximum entropy model for sentiment detection Jun 20th, 2013

Project: Sentiment detection

In the second post in the series of sentiment detection we will train a maximum entropy model to perform the exact same classification on the exact same data and compare the procedure and the results.

Getting the optimizer

Later on NlpTools will support more external optimizers like Megam or Mallet but for now the only optimizer supported is the one developed specifically for NlpTools in Go that performs parallel batch gradient descent.

You can download binaries for you architecture or build from source.

Training a Maximum Entropy model

There are only three changes required (to our previous file) to train and evaluate a Maxent model instead of a Naive Bayes model.

Feature Factory

You need to create a different type of feature factory. In Maxent features should target specific classes so in each feature the class name should be prepended. In addition, we should model presence and not frequency of features. To achieve the above we will use the FeatureFactory FunctionFeatures.

  1. $ff = new FunctionFeatures(
  2. function ($c, $d) {
  3. return array_map(
  4. function ($t) use($c) {
  5. return "$c ^ $t"; // target the feature to a specific class
  6. },
  7. $d->getDocumentData()
  8. );
  9. }
  10. )
  11. );

Model instantiation

We should change the model instatiation code to now create and train a Maxent model.

  1. // create empty Maxent
  2. $model = new Maxent(array());
  3. $model->train(
  4. $ff, // the feature factory
  5. $tset, // the documents
  6. new ExternalMaxentOptimizer(
  7. 'gradient-descent' // the path to the external optimizer
  8. )
  9. );


Finally, we need to change the classifier from MultinomialNaiveBayes to FeatureBasedLinearClassifier.

  1. $cls = new FeatureBasedLinearClassifier($ff, $model);

We then run the php script with the exact same parameters as in the previous post. Training time will take longer this time.


Maximum entropy models usually perform a bit better than Naive Bayes (on par with SVM). They are, though, much harder to train. Our model confirms the rule by achieving 86% accuracy on Pang's and Lee's dataset (No features other than the data).

Hierarchical Clustering - NlpTools vs NLTK Jun 15th, 2013

First of all, I want to clarify that this post is not about bashing NLTK or even implying that NlpTools can be a replacement for the great tool NLTK is. It is simply a comment on a bad implementation of an algorithm that is probably not considered important by the NLTK developers.

In the rest of the post I will assume that you have read the previous post regarding hierarchical clustering algorithms, their asymptotic time complexities and what I have chosen to implement so far in NlpTools.


I wanted to test the output of the hierarchical clustering implementation in NlpTools so I decided to run a comparison of the resulting dendrograms with an already established and tested library. I browsed NLTK and read through the clustering package's documentation page. Based on the information there I went on to look at the implementation at github.

Naive implementation

NLTK's implementation is the one described in the previous post as naive. It has time complexity O(n3). According to the docs, they chose to support only the cosine similarity distance metric and the group average linkage method because thus they could do some optimizations and gain a speed up, so there must be something I am missing. In addition, NLTK uses numpy for efficient computation of matrix operations.

Considering the above, I still thought php would be slower since I did not even use the native heap implementation (SplHeap) for the linkage method. I created a dataset of two-dimensional random points and decided to run and compare the times. The code that I used to run the clustering and get the times can be found in this gist (I wrote the gist before I decided to write this post).

How bad the implementation actually is

As I have mentioned already the implementation is asymptotically worse than the one in NlpTools. But does this really account for the huge difference shown on the above graph. Actually it doesn't. NLTK does not use a dissimilarity matrix at all in this implementation. It recomputes the pairwise cluster similarity again for each merge. That means, besides the unnecessary function call, that the asymptotic complexity of the algorithm is even worse (something not shown by this sample dataset). The complexity of the algorithm is actually O(n3•M) where M is the number of dimensions that our points have.

Keep in mind that this dataset's points only have two dimensions which means that the difference we see here is mostly due to the function call overhead and not due to the data dimensionality problem (which can be huge especially when dealing with text data).

Assumptions about why

Saving memory

Using a matrix of pairwise similarities or dissimilarities requires O(n2) memory space while computing the similarities each time only requires O(n) for the clusters. Thus NlpTools would exhaust a computers memory before NLTK ever reached to 50%. On the other hand, if it ever reached 50% there is no way it would finish computing the clusters in our life time, so there is not much point in saving memory (except if the computer had really small memory, about 100MB).

Relying too much on numpy

Numpy is a great library for fast matrix operations in python, that doesn't mean that numpy operations are O(1) constant time. Accessing lists and dicts in python, on the other hand, is. One thing one can assume is that NLTK developers got a bit carried away by the greatness of numpy and thought it to be the tool for every job. The "efficient speed up" is actually what makes this implementation that slow.

10+ times speed up with 46 lines of code

The following changes are a quick and dirty addition of a similarity matrix to the algorithm. Asymptotic time complexity still remains O(n3) and memory usage is now O(n2). The code is a drop in replacement for the lines 46 to 73 of nltk/cluster/ that doesn't require changes in any other part of the nltk cluster module. Moreover, by making a small change in the 11th line below this implementation could accept any function as a distance metric.

  1. # variables describing the initial situation
  2. N = len(vectors)
  3. cluster_len = [1]*N
  4. cluster_count = N
  5. index_map = range(N)
  6. # construct the similarity matrix
  7. sim= {}
  8. for i in xrange(N):
  9. for j in xrange(i+1,N):
  10. sim[ (i,j) ] = self._average_similarity(vectors[i],1,vectors[j],1)
  11. while cluster_count > max(self._num_clusters, 1):
  12. (i,j),_sim = max(
  13. sim.iteritems(),
  14. key=lambda x: x[1])
  15. # update similarities
  16. cli = cluster_len[i]
  17. clj = cluster_len[j]
  18. cl = cli+clj
  19. del sim[ (i,j) ]
  20. for x in xrange(i):
  21. if (x,i) not in sim: continue
  22. sim[ (x,i) ] = (sim[ (x,i) ]*cli + sim[ (x,j) ]*clj)/cl
  23. del sim[ (x,j) ]
  24. for x in xrange(i+1,j):
  25. if (i,x) not in sim: continue
  26. sim[ (i,x) ] = (sim[ (i,x) ]*cli + sim[ (x,j) ]*clj)/cl
  27. del sim[ (x,j) ]
  28. for x in xrange(j+1,N):
  29. if (i,x) not in sim: continue
  30. sim[ (i,x) ] = (sim[ (i,x) ]*cli + sim[ (j,x) ]*clj)/cl
  31. del sim[ (j,x) ]
  32. # merge the clusters
  33. cluster_len[i] = cl
  34. self._dendrogram.merge(index_map[i],index_map[j])
  35. cluster_count -= 1
  36. # update the index map to reflect the indexes if we
  37. # had removed j
  38. for x in xrange(j+1,N):
  39. index_map[x] -= 1
  40. index_map[j] = -1

Graph running times

For the following graphs each algorithm was run 10 times and the average is plotted. I would also plot the standard deviation but it was insignificant. The data can be found here as csv and here as a gnumeric document.

Comparison of every implementation

NLTK series is incomplete because it took too much time to run the tests. You can interpolate that it would take more than 8 hours to actually cluster 1200 points ten times.

Relative performance

The following graph is actually the most interesting. It shows that with this improvement we have not actually solved any problem, we have simply made the algorithm 13 times faster. The execution time still increases by the cube of the number of points which means that the NlpTools implementation will always be incrementally faster as the number of points increases.

O(n2logn) hierarchical clustering Jun 7th, 2013

Hierarchical clustering is the process in which given a dissimilarity matrix between documents one has to join those that are most similar and update the matrix given an update strategy (single/complete link, group average, etc)

Naive case

The above process, in the general case, is of complexity O(n3) since we have to traverse n2 elements (in the matrix) and find their minimum and we have to do this n times. If one does code it this way, especially in PHP that is a relatively slow language, the maximum number of documents that can be clustered in a reasonable time (less than a day) is about 1000 (depending heavily on the computer).

Improving the general case

One could improve the general case of O(n3) simply by noticing that most of the time is spent looking for a minimum. If we use a heap instead of a matrix, then finding the minimum is an O(1) operation. Yet inserting and removing from the heap (as we updated the matrix previously) is an O(logn) operation and we have to do it n times for each merge. Thus the new complexity of the algorithm is now O(n2logn) which improves substantially the usefulness of the algorithm in php. (for instance if n=1000 we improve the running time approximately by 100).

As fast as possible

More than 40 years ago Mr. Sibson developed the SLINK algorithm for single link hierarchical clustering, based on a new mathematical representation of a dendrogram he devised an algorithm with time complexity O(n2). It is easily provable that no algorithm can be asymptotically faster than that. The SLINK algorithm, although easy to write in code, it deviates quite a lot from intuition in the sense that the algorithm does not merge two clusters n times untill all clusters are merged into one but instead it recursively updates the dendrogram from having i points to having i+1 points (thus the new dendrogram representation).

What is NlpTools using

PHP is a relatively slow language (captain obvious). The fastest algorithm possible should be used in order to achieve usable (or good) performance. That being said, NlpTools except for a toolkit and library for NLP, it also aims to be a good resource for introduction to machine learning and NLP algorithms. Thus, sometimes a simpler solution is chosen than the faster one.

The first implementation of hierarchical clustering in NlpTools will be using a custom made heap that makes it easy to understand what is going on in the Hierarchical class that implements the Clusterer interface and yet provides much better performance than the naive implementation.

What about SLINK and CLINK

Those algorithms will probably be added in the clustering namespace as two completely new classes that provide the same results as using Hierarchical with SingleLink and CompleteLink strategies accordingly. Top priority though is to implement yet more merge strategies like Ward and Centroid.

And a beautiful looking dendrogram?

The clustering above is done using single link hierarchical clustering with euclidean distance. The images were created using gd in php and the code is included in NlpTools as tests for the clustering namespace.

Text clustering May 10th, 2013

In this article I will introduce one thing that has been missing from NlpTools, clustering. The base similarity concepts have long been included in NlpTools and it was a matter of time to implement some clustering methods.

Current status

So far only KMeans has been implemented and I have not yet published it. KMeans is, probably, the simplest clustering algorithm and one of the fastest. It allows for clustering with many distance metrics (euclidean, cosine similarity, hamming, etc). Combined with simhash it can be used to cluster big corpuses even in php.


Simple euclidean clustering at two dimensions.

  1. use NlpTools\Clustering\KMeans;
  2. use NlpTools\Clustering\CentroidFactories\Euclidean as EuclideanCF;
  3. use NlpTools\Similarity\Euclidean;
  4. $clust = new KMeans(
  5. 2, // 2 clusters
  6. new Euclidean(), // the distance metric
  7. // must implement Distance interface
  8. new EuclideanCF(), // how we will be creating centroids
  9. 0.001 // the cutoff value (can be ommited)
  10. );
  11. $clust->cluster(
  12. $tset, // a training set
  13. $ff // a feature factory that will transform the document data
  14. // to data compatible with the distance metric and the
  15. // centroid factory
  16. );

The only thing that needs to change, in order to cluster using cosine similarity at two dimensions, is the parameters at the initialization of the KMeans class.

  1. use NlpTools\Clustering\KMeans;
  2. use NlpTools\Clustering\CentroidFactories\MeanAngle;
  3. use NlpTools\Similarity\CosineSimilarity;
  4. $clust = new KMeans(
  5. 4, // 4 clusters
  6. new CosineSimilarity(), // the distance metric
  7. new MeanAngle(), // how we will be creating centroids
  8. 0.001 // the cutoff value (can be ommited)
  9. );


NlpTools will also contain the following clustering methods

  • Hierarchical clustering
  • Expectation Maximization clustering (probably with Naive Bayes model)

Moreover I plan to add some tools for internal evaluation of clusters (silhouette, SSE, Correlation).

Backwards incompatible changes

When the clustering branch is merged with the master branch the following backwards incompatible changes will take place.

  • SetDistance and SetSimilarity, the similarity and distance interfaces accordingly are renamed to Distance and Similarity
  • The above interfaces no longer require arrays as parameters, thus we could even define the distance between two strings without the need to split those strings in character sets

Introducing Latent Dirichlet Allocation Apr 30th, 2013

In this article I will introduce work still in progress in NlpTools, Latent Dirichlet Allocation with batch Gibbs sampling. I haven't updated the API Docs yet so one can only see the API by looking to the code.

The implementation is based on the paper by Griffiths and Steyvers. With this new model a new namespace is also introduced in NlpTools (although it is arguable if it will stay) the NlpTools\Random namespace. It provides useful constructs for dealing with random distributions. As always it is coded in an as needed basis and only Gamma, Dirichlet and Normal distributions are implemented (much based on

I would also like to mention Mathieu Blondel's python implementation that helped seriously in debugging mine and from which I took the log likelihood implementation.

A graphical example

The implementation comes with an accompanying test, the beautiful graphical example used by Griffiths and Steyvers in their own paper.

We create a set of documents (images) while sticking to the model's assumptions, sampling each topic from a Dirichlet distribution.


Examples of generated documents:

Topics found in 1st iteration (log likelihood -278203.915):

Topics found in 5th iteration (log likelihood -205116.986):

Topics found in 50th iteration (log likelihood -133652.225):


For example the following code finds the topics of an array of text documents.

  1. use NlpTools\FeatureFactories\DataAsFeatures;
  2. use NlpTools\Tokenizers\WhitespaceTokenizer;
  3. use NlpTools\Documents\TokensDocument;
  4. use NlpTools\Documents\TrainingSet;
  5. use NlpTools\Models\Lda;
  6. // supposedly the list of documents is in a file that is passed
  7. // as first parameter
  8. if (file_exists($argv[1])
  9. $docs = file($argv[1]);
  10. else
  11. die(1);
  12. $tok = new WhitespaceTokenizer();
  13. $tset = new TrainingSet();
  14. foreach ($docs as $f) {
  15. $tset->addDocument(
  16. '', // the class is not used by the lda model
  17. new TokensDocument(
  18. $tok->tokenize(
  19. )
  20. )
  21. );
  22. }
  23. $lda = new Lda(
  24. new DataAsFeatures(), // a feature factory to transform the document data
  25. 5, // the number of topics we want
  26. 1, // the dirichlet prior assumed for the per document topic distribution
  27. 1 // the dirichlet prior assumed for the per word topic distribution
  28. );
  29. // run the sampler 50 times
  30. $lda->train($tset,50);
  31. // synonymous to calling getPhi() as per Griffiths and Steyvers
  32. // it returns a mapping of words to probabilities for each topic
  33. // ex.:
  34. // Array(
  35. // [0] => Array(
  36. // [word1] => 0.0013...
  37. // ....................
  38. // [wordn] => 0.0001...
  39. // ),
  40. // [1] => Array(
  41. // ....
  42. // )
  43. // )
  44. // $lda->getPhi(10)
  45. // just the 10 largest probabilities
  46. $lda->getWordsPerTopicsProbabilities(10)
  47. );
Next »