Natural language processing in php

Source classifier in PHP

Nov 13th, 2013

Recently I needed to identify the programming language of several code snippets without relying on the file extension. I originally thought it would be trivial to find code that does exactly that, but as it turns out I found only one implementation, written in Ruby. Thus, I decided to write my own source classifier in PHP.

Finding a dataset

The hardest part was finding and downloading a dataset for the classifier. I decided to download sources from the Google Code Jam 2013 solutions, for two simple reasons: first, I would use the model to classify similar types of source code (small solutions to programming problems), and second, Code Jam provides easy access to many different programming languages.

The quality of the resulting classifier depends heavily on the dataset, so one could improve the performance of the provided model by training it on a different dataset.

Classification model and Features

Since it seemed to work well for Mr Chris Lowis, I decided to use the same simple Naive Bayes model. The sources are read as is, without any normalization. I use the WhitespaceAndPunctuationTokenizer to split the sources into tokens.
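To give an idea of what such a tokenizer produces, here is a rough standalone approximation of whitespace-and-punctuation tokenization. It is only a sketch and not the library's implementation: the function name tokenizeSource and the regular expression are assumptions made for illustration, and the real tokenizer may treat underscores and Unicode characters differently.

```php
<?php
// Hypothetical helper: approximates a whitespace-and-punctuation tokenizer.
// Runs of letters, digits and underscores become one token; every other
// non-whitespace character becomes a token of its own.
function tokenizeSource(string $source): array
{
    preg_match_all('/\w+|[^\w\s]/u', $source, $matches);
    return $matches[0];
}

// Prints the tokens: printf ( " Hello world ! " ) ;
echo implode(' ', tokenizeSource('printf("Hello world!");')), PHP_EOL;
```

Note that every punctuation character ends up as its own token, so an operator like -> is split into two tokens.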

The feature factory is a simple frequency feature factory that caps each token's frequency at 4 occurrences.

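    use NlpTools\Documents\DocumentInterface;
    use NlpTools\FeatureFactories\FeatureFactoryInterface;

    class CodeFeatures implements FeatureFactoryInterface
    {
        public function getFeatureArray($class, DocumentInterface $doc)
        {
            // Count how many times each token occurs in the document.
            $tokens = array_count_values($doc->getDocumentData());
            // Cap every count at 4 so very frequent tokens do not dominate.
            foreach ($tokens as &$v) {
                $v = min($v, 4);
            }
            unset($v); // drop the reference left behind by the loop
            return $tokens;
        }
    }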
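To show how capped counts feed a Naive Bayes model, here is a minimal standalone sketch of multinomial Naive Bayes with add-one smoothing. It is not the MultinomialNBClassifier from the library; the names capCounts, trainNB and classifyNB, the model layout and the toy training data are all hypothetical.

```php
<?php
// Sketch only: all function names and the model layout are hypothetical.

// Count tokens and cap each count, mirroring what CodeFeatures does.
function capCounts(array $tokens, int $max = 4): array
{
    return array_map(
        function ($count) use ($max) {
            return min($count, $max);
        },
        array_count_values($tokens)
    );
}

// $docs is a list of [label, tokens] pairs.
function trainNB(array $docs): array
{
    $model = ['docs' => [], 'counts' => [], 'totals' => [], 'vocab' => []];
    foreach ($docs as [$label, $tokens]) {
        $model['docs'][$label] = ($model['docs'][$label] ?? 0) + 1;
        foreach (capCounts($tokens) as $token => $count) {
            $model['counts'][$label][$token] = ($model['counts'][$label][$token] ?? 0) + $count;
            $model['totals'][$label] = ($model['totals'][$label] ?? 0) + $count;
            $model['vocab'][$token] = true;
        }
    }
    return $model;
}

// Returns the label with the highest log posterior.
function classifyNB(array $model, array $tokens): string
{
    $numDocs = array_sum($model['docs']);
    $vocabSize = count($model['vocab']);
    $best = '';
    $bestScore = -INF;
    foreach ($model['docs'] as $label => $docCount) {
        $score = log($docCount / $numDocs); // log prior
        $total = $model['totals'][$label] ?? 0;
        foreach (capCounts($tokens) as $token => $count) {
            // Add-one (Laplace) smoothed log likelihood of the token.
            $seen = $model['counts'][$label][$token] ?? 0;
            $score += $count * log(($seen + 1) / ($total + $vocabSize));
        }
        if ($score > $bestScore) {
            $best = $label;
            $bestScore = $score;
        }
    }
    return (string) $best;
}

// Toy training set, far too small for real use.
$model = trainNB([
    ['c',   ['#', 'include', '<', 'stdio', '.', 'h', '>', 'int', 'main', '(', ')']],
    ['php', ['<', '?', 'php', 'echo', '$', 'x', ';', '$', 'y', ';']],
]);
echo classifyNB($model, ['int', 'main', '(', ')', ';']), PHP_EOL; // c
```

Capping the counts keeps a token that appears hundreds of times in one file (a semicolon, say) from outweighing everything else in the score.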

Using the LanguageDetector

I have wrapped the MultinomialNBClassifier in a class to ease its use, its training and its serialization.

    $detector = LanguageDetector::loadFromFile("/path/to/model");
    $lang = $detector->classify(<<<CODE
    #include <stdio.h>
    int main() {
        printf("Hello world!");
    }
    CODE
    );
    echo $lang, PHP_EOL; // C

In the GitHub repository there is already a pretrained model that classifies among the following languages (the popular ones according to Google Code Jam):

  • C
  • C#
  • C++
  • Clojure
  • Go
  • Haskell
  • Java
  • JavaScript
  • Pascal
  • Perl
  • PHP
  • Python
  • Ruby
  • Scala
  • Visual Basic

Train on your own files

In the repo there is also a bin directory that provides a lang-detect script. It is a simple script for training new models on new datasets, evaluating models, and using a model to classify a source file.


    # retrain the provided model
    bin/lang-detect train "data/train"
    # evaluate the trained model
    bin/lang-detect evaluate "data/test" # should print 0.98
    # classify some code
    bin/lang-detect classify "some/path/code.cpp"

The directories used for training and evaluation should contain one subdirectory per class, and each subdirectory should contain one file per document (one source file). The data/train and data/test directories are examples of this structure.
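As an illustration of that layout, the sketch below walks a directory with one subdirectory per class and collects (class, file) pairs. It is not the code used by lang-detect; the helper name listLabelledFiles and the throwaway demo directory are assumptions.

```php
<?php
// Hypothetical helper: returns a list of [class, file path] pairs for a
// directory that holds one subdirectory per class.
function listLabelledFiles(string $root): array
{
    $pairs = [];
    $classDirs = glob(rtrim($root, '/') . '/*', GLOB_ONLYDIR) ?: [];
    foreach ($classDirs as $classDir) {
        $class = basename($classDir); // the subdirectory name is the label
        foreach (glob($classDir . '/*') ?: [] as $file) {
            if (is_file($file)) {
                $pairs[] = [$class, $file];
            }
        }
    }
    return $pairs;
}

// Demo on a throwaway directory containing c/hello.c and php/hello.php.
$root = sys_get_temp_dir() . '/lang-detect-demo-' . uniqid();
mkdir($root . '/c', 0777, true);
mkdir($root . '/php', 0777, true);
file_put_contents($root . '/c/hello.c', "int main() { return 0; }\n");
file_put_contents($root . '/php/hello.php', "<?php echo 'hi';\n");

foreach (listLabelledFiles($root) as [$class, $file]) {
    echo $class, "\t", basename($file), PHP_EOL;
}
```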

Future work

Since this is a much less interesting and less important problem than the ones I usually like to battle with in my free time, I have left a lot of concepts unexplored and went for the fastest half-good solution (although it turned out better than half good).

One should be able to get much better results by changing both the feature factory and the tokenizer.


Tokenizer

The tokenizer should probably be able to recognize the different types of strings and either ignore them or mark them as "string". Maybe the same should apply to comments as well. Operators, parentheses and brackets should be parsed as separate tokens, but not every single punctuation character needs to be its own token: some could be grouped, like -> or =>.
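A rough sketch of that idea, under the assumption that a handful of regular expressions is good enough: string literals collapse into a single "string" token and a few multi-character operators stay in one piece. The function name tokenizeCode and the operator list are hypothetical, and comments, multi-line strings and language-specific quoting are not handled.

```php
<?php
// Hypothetical tokenizer sketch, not part of the library.
function tokenizeCode(string $source): array
{
    // Alternatives are tried in order, so literals and grouped operators
    // win over single punctuation characters.
    $parts = [
        '"(?:\\\\.|[^"\\\\])*"',     // double-quoted string literal
        "'(?:\\\\.|[^'\\\\])*'",     // single-quoted string literal
        '->|=>|==|!=|<=|>=|&&|\|\|', // operators kept as one token
        '\w+',                       // identifiers, keywords, numbers
        '[^\w\s]',                   // any other punctuation character
    ];
    preg_match_all('~' . implode('|', $parts) . '~', $source, $matches);

    $tokens = [];
    foreach ($matches[0] as $token) {
        // A complete literal is at least two characters long (the quotes).
        $isLiteral = strlen($token) >= 2 && ($token[0] === '"' || $token[0] === "'");
        $tokens[] = $isLiteral ? 'string' : $token;
    }
    return $tokens;
}

// Prints: $ user -> name = string ;
echo implode(' ', tokenizeCode('$user->name = "Bob";')), PHP_EOL;
```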

Feature factory

As far as the feature factory is concerned, document frequency dictionaries could be used to emphasize the keywords of each class and differentiate them from the identifiers. Words at the beginning and end of the document should be weighted differently, as they are intuitively more important in differentiating programming languages (imports, opening/closing tags, etc.). Finally, statistics about the identifiers could be collected, for instance whether camel casing or underscores are preferred.
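Two of these ideas are easy to sketch. The functions below are hypothetical and untested against the real dataset: identifierStyleCounts counts camel-cased versus underscore-separated identifiers, and positionTaggedTokens adds extra features for the first and last few tokens so that a classifier can weight them separately.

```php
<?php
// Hypothetical feature sketches, not part of the library.

// Counts how many tokens look like camelCase or snake_case identifiers.
function identifierStyleCounts(array $tokens): array
{
    $counts = ['camelCase' => 0, 'snake_case' => 0];
    foreach ($tokens as $token) {
        if (preg_match('/^[a-z]+(?:[A-Z][a-z0-9]*)+$/', $token)) {
            $counts['camelCase']++;
        } elseif (preg_match('/^[a-z0-9]+(?:_[a-z0-9]+)+$/', $token)) {
            $counts['snake_case']++;
        }
    }
    return $counts;
}

// Appends "head:" and "tail:" features for the first and last $n tokens.
// $n is expected to be a positive integer.
function positionTaggedTokens(array $tokens, int $n = 3): array
{
    $features = $tokens;
    foreach (array_slice($tokens, 0, $n) as $token) {
        $features[] = 'head:' . $token;
    }
    foreach (array_slice($tokens, -$n) as $token) {
        $features[] = 'tail:' . $token;
    }
    return $features;
}

print_r(identifierStyleCounts(['getFeatureArray', 'array_count_values', 'min', 'loadFromFile']));
print_r(positionTaggedTokens(['<', '?', 'php', 'echo', '1', ';'], 2));
```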