NlpTools

Natural language processing in php

Sentiment detection Mar 30th, 2013

Project: Sentiment detection

We will be doing a bit of sentiment detection using NlpTools, aiming to recreate the results of a popular sentiment classification paper by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques".

Getting and preparing the dataset

The dataset that we will be using is the polarity dataset v2.0 found here. The dataset is a collection of movie reviews from IMDb, already labeled and tokenized. NlpTools does not have any tools for model evaluation, so we will have to code our own for this mini NLP project.

We will use 10-fold cross-validation, so we will be creating lists of files split into training and evaluation sets. I am assuming that you are using either Linux or Mac; if that is not the case, you can either download the lists from here and replace _DIR_ with the path to the files, or write a script to create them on your own.

First we will create a list of all the files and shuffle it. Run the following commands in a terminal, one directory above the neg and pos directories.

ls neg/ | xargs -I file echo `pwd`/neg/file >> /tmp/imdb.list
ls pos/ | xargs -I file echo `pwd`/pos/file >> /tmp/imdb.list
shuf /tmp/imdb.list >/tmp/imdb-shuffled.list
for i in {200..2000..200};do p=$(($i/200)); head -n $i /tmp/imdb-shuffled.list | tail -n 200 >test_$p ;done
for i in {200..2000..200};do p=$(($i/200)); s=$(($i-200)); e=$((2000-$i)); head -n $s /tmp/imdb-shuffled.list >train_$p; tail -n $e /tmp/imdb-shuffled.list >>train_$p; done
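If you are not on Linux or a Mac, the same split can be produced with a short PHP script instead of the shell commands above. The following is only a sketch; it assumes the same layout (a `neg/` and a `pos/` directory next to the script) and the function name is illustrative.

```php
<?php
// Split a shuffled list of document paths into $k (train, test) pairs,
// one pair per fold. Illustrative alternative to the shell one-liners.
function make_folds(array $files, $k)
{
    $per_fold = (int)(count($files) / $k);
    $folds = array();
    for ($i = 0; $i < $k; $i++) {
        // the $i-th chunk is held out for testing
        $test = array_slice($files, $i * $per_fold, $per_fold);
        // everything before and after that chunk is used for training
        $train = array_merge(
            array_slice($files, 0, $i * $per_fold),
            array_slice($files, ($i + 1) * $per_fold)
        );
        $folds[] = array($train, $test);
    }
    return $folds;
}

// Gather, shuffle and write train_*/test_* files (only if the
// neg/ and pos/ directories are actually there).
if (is_dir('neg') && is_dir('pos')) {
    $files = array_merge(glob('neg/*'), glob('pos/*'));
    shuffle($files);
    foreach (make_folds($files, 10) as $i => $fold) {
        list($train, $test) = $fold;
        file_put_contents('train_'.($i + 1), implode("\n", $train)."\n");
        file_put_contents('test_'.($i + 1), implode("\n", $test)."\n");
    }
}
```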

We now have 20 files named test_* and train_* that contain the lists of documents to be used for each fold. Next we will code a function that creates a training set from a list of documents.

use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;

function create_training_set($fname) {
    $tset = new TrainingSet();
    $tok = new WhitespaceTokenizer();
    $subs = function ($s) { return substr($s,0,-1); };
    array_walk(
        array_filter( // filter empty lines (''==false)
            array_map( // map $subs to remove the new line from the end
                $subs,
                file($fname)
            )
        ),
        function ($file) use($tset,$tok) {
            $tset->addDocument(
                (strpos($file,'pos')===FALSE) ? 'neg' : 'pos',
                new TokensDocument(
                    $tok->tokenize(
                        file_get_contents($file)
                    )
                )
            );
        }
    );
    return $tset;
}

Classification

In this post we will train a Naive Bayes model without removing stopwords or infrequent words, as done by Pang and Lee. We will be using unigram frequencies as features.

Let's set things up.

include('vendor/autoload.php');
include('create_training_set.php');

$train = $argv[1]; // the first parameter is the training set list file
$test = $argv[2]; // the second parameter is the test set list file

// ... check if the files are ok ...

// use the function we coded above to create the sets of documents
$training_set = create_training_set($train);
$test_set = create_training_set($test);

Now that we have created the training/test sets, we must train our model. First we will decide what features to use and whether we will create our own feature factory or use one provided by NlpTools. Since we simply want to use unigram frequencies as features, we can use DataAsFeatures.

Let's train our model.

use NlpTools\Models\FeatureBasedNB;
use NlpTools\FeatureFactories\DataAsFeatures;

$feature_factory = new DataAsFeatures();
$model = new FeatureBasedNB();
$model->train($feature_factory, $training_set);

Now we should evaluate our trained model on the test set. We will use the simple measure of accuracy (correct/total), as used by Pang and Lee in their paper.

use NlpTools\Classifiers\MultinomialNBClassifier;

// counter of correct predictions
$correct = 0;
// the classifier
$cls = new MultinomialNBClassifier($feature_factory, $model);
$classes = $test_set->getClassSet();
foreach ($test_set as $class=>$doc) {
    // predict a class for this doc
    $prediction = $cls->classify(
        $classes,
        $doc
    );
    // if correct add one to the counter
    $correct += (int)($prediction==$class);
}
echo ((float)$correct)/count($test_set), PHP_EOL;

To run the script, assuming we have saved the above 3 snippets in a file named nb.php, we can simply write the following in a terminal.

php nb.php path/to/train_1 path/to/test_1

And to run the cross-validation (replace "data/" with the path that the train_* files are in).

echo "(`for i in {1..10}; do php nb.php data/train_$i data/test_$i; done | paste -sd+`)*10" | bc

Using these lists (you need to replace _DIR_ with the path to the files), my model has an accuracy of 81.85%, which is actually quite a bit better than the 78.7% achieved by 3-fold cross-validation in Pang and Lee's paper.

Maxent

In the next post on this topic we will train a maximum entropy model and see if we can push this 81.85% any further.

Application outline Mar 19th, 2013

Project: Spam detection service

In this second post about the project, I will outline the Silex application and the input/output formats of the RESTful service.

Input

The email input format, as previously mentioned, will be JSON. We will only be receiving the sender's email address, the subject and the body of the email. Each endpoint that receives emails as input will receive many emails at once, as an array.

{
    "documents": [
        {
            "from": "example@example.com",
            "subject": "Test subject",
            "body": "Lorem ipsum dolor sit amet...",
            "class": "HAM"
        },
        {
            "from": "example@spam.com",
            "subject": "Cheap phones",
            "body": "Lorem ipsum dolor sit amet...",
            "class": "SPAM"
        },
        ...
    ]
}

If the emails are not for training but for classification, the class attribute will be ignored.
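To make the handling of that payload concrete, here is a sketch of how the service might turn a "documents" array into (class, email) pairs, keeping the label for training and ignoring it for classification. The helper name is illustrative, not part of NlpTools or the service code shown later.

```php
<?php
// Parse the JSON input format described above into (class, email) pairs.
// When $training is false the "class" attribute is ignored.
function parse_documents($json, $training)
{
    $data = json_decode($json, true);
    $out = array();
    if (!isset($data['documents']) || !is_array($data['documents']))
        return $out;
    foreach ($data['documents'] as $doc) {
        // an email is just (from, subject, body)
        $email = array($doc['from'], $doc['subject'], $doc['body']);
        if ($training)
            $out[] = array($doc['class'], $email); // keep the label
        else
            $out[] = array(null, $email); // label is ignored
    }
    return $out;
}
```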

Output

The app's responses will always be in JSON too. We will either be outputting an error message of the following form (with an HTTP status of 5xx or 4xx, of course)

{
    "error": "This is a message"
}

or an array of labels with HTTP status 200

["HAM","SPAM", ...]

or an empty body with an informative HTTP status (like 204).

Bootstrapping

Below follows the skeleton of our service's code. There are two functions that train models (train_new and retrain); both return a model and a training context (so that we can perform incremental training).

require __DIR__.'/../vendor/autoload.php';

// Silex
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
// NlpTools
use NlpTools\Models\FeatureBasedNB;
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Classifiers\MultinomialNBClassifier;

/*
 * Train a model
 * */
function train_new(Request $req) {
    ...
}
function retrain(FeatureBasedNB $model,array &$ctx, Request $req) {
    ...
}

// new App
$app = new Silex\Application();
// a UID provider
$app->register(new UIDServiceProvider());
// a middleware to parse json requests to array
$app->before(function (Request $req) {
    if (substr($req->headers->get('Content-Type'),0,16) === 'application/json') {
        $data = json_decode($req->getContent(),true);
        $req->request->replace(is_array($data) ? $data : array());
    }
});

// Create a new model
$app->post('/models',function (Request $req) use($app) {
    ...
});
// Check if a model exists
$app->get('/{uid}',function (Request $req,$uid) use($app) {
    ...
})->assert('uid','[0-9a-f]+');
// Delete an existing model
$app->delete('/{uid}',function (Request $req,$uid) use($app) {
    ...
})->assert('uid','[0-9a-f]+');
// Report a number of emails as spam or not
$app->post('/{uid}/report',function (Request $req,$uid) use($app) {
    ...
})->assert('uid','[0-9a-f]+');
// Classify a set of documents
$app->post('/{uid}/classify',function (Request $req,$uid) use($app) {
    ...
})->assert('uid','[0-9a-f]+');

$app->run();

Let's take each endpoint in turn.

POST /models

Posting to /models creates a new model and trains it if emails are provided. $app['UID'] is a unique id provider. A folder is created for the model along with two files, one representing the model and another the training context. Both files contain Base64 encoded serialized php arrays or objects.
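That persistence scheme can be shown in isolation: a value is serialized, Base64 encoded and written to disk, and restored by reversing the steps. The two helper names below are illustrative; the endpoint code inlines these calls directly.

```php
<?php
// Persist a model (or training context) the way the service does:
// Base64 encoded serialized PHP values in a plain file.
function save_model($path, $model)
{
    file_put_contents($path, base64_encode(serialize($model)));
}

// Restore it by reversing the two steps.
function load_model($path)
{
    return unserialize(base64_decode(file_get_contents($path)));
}
```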

// Create a new model
$app->post('/models',function (Request $req) use($app) {
    $uid = $app['UID']();
    if ( file_exists("models/$uid") )
    {
        return $app->json(
            array(
                "error"=>"Could not allocate a unique id for the new model"
            ),
            500
        );
    }
    if ( !mkdir("models/$uid") )
    {
        return $app->json(
            array(
                "error"=>"Could not allocate a unique id for the new model"
            ),
            500
        );
    }
    list($model,$ctx) = train_new($req);
    if ($ctx !== null)
        file_put_contents("models/$uid/ctx",base64_encode(serialize($ctx)));
    file_put_contents("models/$uid/model",base64_encode(serialize($model)));
    return $app->json(
        array(
            "id" => $uid
        ),
        201
    );
});

GET /{uid}

This endpoint simply returns 204 if the model was found or 404 if it was not.
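Since the handler body was elided in the skeleton above, here is one way its core could look: a plain function stands in for the Silex closure so the status logic can be shown on its own (names are illustrative).

```php
<?php
// The whole decision behind GET /{uid}: does the model's folder exist?
// Returns the HTTP status the endpoint would answer with.
function model_status($basedir, $uid)
{
    return file_exists("$basedir/$uid") ? 204 : 404;
}
```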

DELETE /{uid}

This endpoint simply deletes the model if it exists, or returns a 404 error otherwise.
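Again the handler body was elided above, so here is a sketch of the delete step: remove the model's two files and its folder, reporting whether the model existed at all (the function name is illustrative).

```php
<?php
// Delete a model's folder, returning false when there is nothing to
// delete so the caller can answer with a 404.
function delete_model($basedir, $uid)
{
    $dir = "$basedir/$uid";
    if (!file_exists($dir))
        return false;
    // the folder holds at most the model and the training context
    foreach (array('model', 'ctx') as $f) {
        if (file_exists("$dir/$f"))
            unlink("$dir/$f");
    }
    return rmdir($dir);
}
```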

POST /{uid}/report

This endpoint is used to further train the model. The model is unserialized from the files; if a context exists it is trained incrementally, otherwise it is trained from scratch.

$app->post('/{uid}/report',function (Request $req,$uid) use($app) {
    if (!file_exists("models/$uid"))
    {
        return $app->json(
            array('error'=>'Model not found'),
            404
        );
    }
    if (!file_exists("models/$uid/ctx"))
    {
        list($model,$ctx) = train_new($req);
        if ($ctx === null)
            return $app->json(array('error'=>'No documents were reported'),400);
        file_put_contents("models/$uid/ctx",base64_encode(serialize($ctx)));
        file_put_contents("models/$uid/model",base64_encode(serialize($model)));
    }
    else
    {
        $ctx = unserialize(base64_decode(file_get_contents("models/$uid/ctx")));
        $model = unserialize(base64_decode(file_get_contents("models/$uid/model")));
        if (!retrain($model,$ctx,$req))
            return $app->json(array('error'=>'No documents were reported'),400);
        file_put_contents("models/$uid/ctx",base64_encode(serialize($ctx)));
        file_put_contents("models/$uid/model",base64_encode(serialize($model)));
    }
    return new Response('',204);
})->assert('uid','[0-9a-f]+');

POST /{uid}/classify

This endpoint actually uses the model to classify a set of documents (emails). We will be talking about EmailFeatures and EmailDocument in the next post.

$app->post('/{uid}/classify',function (Request $req,$uid) use($app) {
    if (!file_exists("models/$uid"))
    {
        return $app->json(
            array('error'=>'Model not found'),
            404
        );
    }
    $response = array();
    if ($req->request->has('documents') && is_array($req->request->get('documents')))
    {
        // create the feature factory
        $ff = new EmailFeatures();
        // get the model from the file
        $model = unserialize(base64_decode(file_get_contents("models/$uid/model")));
        // create a classifier
        $cls = new MultinomialNBClassifier($ff,$model);
        // for each document
        foreach ($req->request->get('documents') as $doc)
        {
            // create an email document because that is what
            // EmailFeatures expects
            $email = new EmailDocument(
                $doc['from'],
                $doc['subject'],
                $doc['body']
            );
            // actually use the model to predict the class of this EmailDocument
            $response[] = $cls->classify(array('SPAM','HAM'),$email);
        }
    }
    // show the predictions
    return $app->json($response);
})->assert('uid','[0-9a-f]+');

Spam detection as a service Mar 7th, 2013

Project: Spam detection service

For the first mini NLP project we will create a spam detection service, for use with email systems or really any system. Our purpose is to create an HTTP endpoint (a REST service) that will manage models (create / train / classify / delete). The classifier will be a binary classifier, with the class names being SPAM and HAM for spam and not spam respectively.

The purpose of the service is not to make well known anti-spam services obsolete; it is to provide a good starting ground for using NlpTools for Bayesian classification and for creating a custom classifier that, given a specific context (and extra work by you), could even outperform well known solutions.

Classification model to use

At the time of writing, NlpTools only supports two models, Naive Bayes and Maximum Entropy (conditional exponential model). Although Maximum Entropy models are known to perform better at most NLP tasks, I will be using Naive Bayes for the following reasons:

  1. NB is a lot less computationally intensive to train than MaxEnt
  2. Because of reason 1, we can use pure PHP and still create a service usable in real-world scenarios
  3. We can incrementally train an NB model without keeping all the previous documents (we will be keeping a training context)
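The idea behind reason 3 is that a Naive Bayes "training context" can be little more than per-class token counts: new documents only add to the counts, so the old documents never need to be kept. The following sketch shows the idea only; it is not NlpTools' actual context format, and the function name is illustrative.

```php
<?php
// Incrementally fold a newly reported document into the counts that a
// Naive Bayes model is estimated from. No previously seen document is
// needed, only the running counts in $ctx.
function update_context(array &$ctx, $class, array $tokens)
{
    foreach ($tokens as $tok) {
        if (!isset($ctx[$class][$tok]))
            $ctx[$class][$tok] = 0;
        $ctx[$class][$tok]++;
    }
}
```

Re-estimating the model from `$ctx` after each report gives the same probabilities as retraining on all documents from scratch.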

Service overview

We want to be able to perform the following functions

  1. Create a model
  2. Train a model on a set of emails (report emails as spam)
  3. Classify a set of emails
  4. Delete a model

This means our RESTful service will have at least 4 endpoints. Let's also use JSON as the only transport format, so as not to complicate things.

Tools

We will now set up our environment. Besides the obvious, NlpTools, we will also be using Silex to help us build our service. We will install both using composer. Create a directory to hold the project (e.g. spam-service), then create a composer.json.

{
    "require": {
        "silex/silex": "1.0.*@dev",
        "nlp-tools/nlp-tools": "1.0.*@dev"
    },
    "autoload": {
        "psr-0": {"": "web/"}
    }
}

It is also obvious from the above composer.json file that we will be coding in the web directory and that our code naming conventions will comply with the PSR-0 standard. Now run composer to install Silex and NlpTools and their dependencies. If you don't have composer, run the following.

$ curl -s http://getcomposer.org/installer | php
$ php composer.phar install

Finally, for the first part we need to set up a web server to serve our spam service. The Silex documentation has plenty of information about setting up Silex with any web server. I will add the config file I am using for nginx.

server {
    listen localhost:80;
    server_name 127.0.0.1;

    location / {
        # example: unix:/var/run/php-fpm/www.sock
        fastcgi_pass unix:/path/to/spam-service/spam-service.sock;
        include fastcgi_params;
        # replace /path/to/spam-service with whatever directory you
        # created to put the project in
        fastcgi_param SCRIPT_FILENAME /path/to/spam-service/web/index.php;
    }
}