Project: Sentiment detection
We will be doing a bit of sentiment detection using NlpTools, aiming to
recreate the results of a popular sentiment classification paper
by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan,
"Thumbs up? Sentiment Classification using Machine Learning Techniques".
Getting and preparing the dataset
The dataset that we will be using is the polarity dataset v2.0
found here. It is a collection of movie
reviews from IMDb, already labeled and tokenized.
NlpTools does not have any tools for model evaluation, so we will have to
code our own for this mini NLP project.
We will use 10-fold cross-validation: the 2000 reviews are shuffled and
split into 10 folds of 200 documents each; each fold serves once as the
test set while the remaining 1800 documents are used for training.
We will be creating lists of files split for training and evaluation.
I am assuming that you are using either Linux or Mac; if that is not the
case, really sorry, you can either download the lists from here
and replace _DIR_ with the path to the files, or write a script to create
them on your own.
First we will create a list of all the files and shuffle it. Run the
following commands in a terminal, one directory above the neg and pos
directories.
ls neg/ | xargs -I file echo `pwd`/neg/file >> /tmp/imdb.list
ls pos/ | xargs -I file echo `pwd`/pos/file >> /tmp/imdb.list
shuf /tmp/imdb.list >/tmp/imdb-shuffled.list
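# split the shuffled list into 10 folds: each test_$p takes 200 consecutive
# reviews, and each train_$p takes the remaining 1800 (everything before
# and after its test slice)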
for i in {200..2000..200};do p=$(($i/200)); head -n $i /tmp/imdb-shuffled.list | tail -n 200 >test_$p ;done
for i in {200..2000..200};do p=$(($i/200)); s=$(($i-200)); e=$((2000-$i)); head -n $s /tmp/imdb-shuffled.list >train_$p; tail -n $e /tmp/imdb-shuffled.list >>train_$p; done
We now have 20 files named test_* and train_* that contain the lists of
documents to be used for each fold. Next we will code a function to
create a training set from such a list of documents.
use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
function create_training_set($fname) {
    $tset = new TrainingSet();
    $tok = new WhitespaceTokenizer();
    $subs = function ($s) { return substr($s,0,-1); };
    $files = array_map( // map $subs to remove the newline from the end of each path
        $subs,
        file($fname) // one document path per line
    );
    array_walk(
        $files,
        function ($file) use($tset,$tok) {
            $tset->addDocument(
                (strpos($file,'pos')===FALSE) ? 'neg' : 'pos', // label from the path
                new TokensDocument(
                    $tok->tokenize(
                        file_get_contents($file)
                    )
                )
            );
        }
    );
    return $tset;
}
Classification
In this post we will train a Naive Bayes model without
removing stopwords or infrequent words, as in Pang and Lee's paper. We
will be using the frequency of unigrams as features.
Let's set things up.
include('vendor/autoload.php');
include('create_training_set.php');
$train = $argv[1]; // the first parameter is the training set list file
$test = $argv[2]; // the second parameter is the test set list file
// ... check if the files are ok ...
// use the function we coded above to create the sets of documents
$training_set = create_training_set($train);
$test_set = create_training_set($test); // the same function builds both sets
Now that we have created the training and test sets, we must train our model.
First we will decide which features
to use and whether we will create our own feature factory or use one
provided by NlpTools. Since we simply want to use the
frequency of unigrams as features, we can use DataAsFeatures.
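Conceptually, DataAsFeatures just hands the classifier each document's data (here, its tokens) as the feature array, so a word that occurs n times contributes n features; that is what makes this a frequency-of-unigrams model. A rough sketch of the idea (illustrative only, not the library's actual source):

// Illustrative sketch: the essence of DataAsFeatures is to return the
// document's data unchanged as the feature array. For a TokensDocument
// the data is the token list, so duplicates encode term frequency.
function data_as_features($class, $doc) {
    return $doc->getDocumentData(); // e.g. array('great','great','movie')
}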
Let's train our model.
// imports needed by this snippet (they go at the top of nb.php)
use NlpTools\Models\FeatureBasedNB;
use NlpTools\FeatureFactories\DataAsFeatures;

$feature_factory = new DataAsFeatures();
$model = new FeatureBasedNB();
$model->train($feature_factory, $training_set);
Now we should evaluate our trained model on the test set. We will use
the simple measure of accuracy (correct/total), as Pang and Lee do in
their paper.
// import needed by this snippet (goes at the top of nb.php)
use NlpTools\Classifiers\MultinomialNBClassifier;

// counter of correct predictions
$correct = 0;
// the classifier
$cls = new MultinomialNBClassifier($feature_factory, $model);
$classes = $test_set->getClassSet();
foreach ($test_set as $class=>$doc) {
// predict a class for this doc
$prediction = $cls->classify(
$classes,
$doc
);
// if correct add one to the counter
$correct += (int)($prediction==$class);
}
echo ((float)$correct)/count($test_set), PHP_EOL;
Now, to run the script: if we have saved the above three snippets in a file
named nb.php, we can simply write the following in a terminal.
php nb.php path/to/train_1 path/to/test_1
And to run the cross-validation (replace "data/" with the path that the
train_* and test_* files are in). The one-liner below sums the ten
accuracies and multiplies by 10, which is the same as averaging them and
expressing the result as a percentage.
echo "(`for i in {1..10}; do php nb.php data/train_$i data/test_$i; done | paste -sd+`)*10" | bc
Using these lists (you need to replace _DIR_ with the
path to the files), my model has an accuracy of 81.85%, which is
noticeably better than the 78.7% achieved with 3-fold
cross-validation in Pang and Lee's paper.
Maxent
In the next post on this topic we will train a maximum entropy model
and see if we can push this 81.85% any further.
Project: Spam detection service
In this second post about the project, I will outline the Silex
application and the input/output formats of the RESTful service.
Input
The email input format, as previously mentioned,
will be JSON. We will only be receiving the sender's email address,
the subject and the body of each email. Each endpoint that receives emails
as input will accept many emails at once, as an array.
{
"documents": [
{
"from": "example@example.com",
"subject": "Test subject",
"body": "Lorem ipsum dolor sit amet...",
"class": "HAM"
},
{
"from": "example@spam.com",
"subject": "Cheap phones",
"body": "Lorem ipsum dolor sit amet...",
"class": "SPAM"
},
...
]
}
If the emails are not for training but for classification, the class
attribute will be ignored.
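Once a request body is parsed (see the middleware in the bootstrapping section below), each handler can sanity-check its input along these lines. This is a hedged sketch: the helper name is mine, and only the field names follow the format above.

// Hypothetical helper: check that a parsed "documents" array matches
// the input format described above
function valid_documents(array $docs) {
    foreach ($docs as $d) {
        if (!isset($d['from'], $d['subject'], $d['body']))
            return false;
    }
    return count($docs) > 0;
}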
Output
The app's responses will always be JSON too. We will either be outputting
an error message in the following form (with an HTTP status of 4xx or 5xx,
of course)
{
"error": "This is a message"
}
or an array of labels with HTTP status 200
["HAM","SPAM",....]
or an empty body with an informative HTTP status (like 204).
Bootstrapping
Below follows the skeleton of our service's code. There are two functions
that train the models (train_new and retrain); both return a model and a
training context (so that we can perform incremental training).
require __DIR__.'/../vendor/autoload.php';
// Silex
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
// NlpTools
use NlpTools\Models\FeatureBasedNB;
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Classifiers\MultinomialNBClassifier;
/*
 * Train a model from scratch; returns array($model, $ctx)
 */
function train_new(Request $req) {
    ...
}

/*
 * Incrementally train an existing model using its training context
 */
function retrain(FeatureBasedNB $model, array &$ctx, Request $req) {
    ...
}
// new App
$app = new Silex\Application();
// a UID provider
$app->register(new UIDServiceProvider());
// a middleware to parse json requests to array
$app->before(function (Request $req) {
    if (substr($req->headers->get('Content-Type'),0,16) === 'application/json') {
        // replace the request data with the decoded JSON body
        $data = json_decode($req->getContent(), true);
        $req->request->replace(is_array($data) ? $data : array());
    }
});
// Create a new model
$app->post('/models',function (Request $req) use($app) {
...
});
// Check if a model exists
$app->get('/{uid}',function (Request $req,$uid) use($app) {
...
})->assert('uid','[0-9a-f]+');
// Delete an existing model
$app->delete('/{uid}',function (Request $req,$uid) use($app) {
...
})->assert('uid','[0-9a-f]+');
// Report a number of emails as spam or not
$app->post('/{uid}/report',function (Request $req,$uid) use($app) {
...
})->assert('uid','[0-9a-f]+');
// Classify a set of documents
$app->post('/{uid}/classify',function (Request $req,$uid) use($app) {
...
})->assert('uid','[0-9a-f]+');
$app->run();
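train_new and retrain are left as stubs above. One possible fleshing-out is sketched below; it assumes that FeatureBasedNB::train returns a training context array and that train_with_context consumes one for incremental training (treat both method signatures as assumptions to verify against the NlpTools API), and it uses the EmailFeatures and EmailDocument classes introduced in the next post. The documents_to_training_set helper is my own name.

// Sketch only: build a TrainingSet from the posted documents
function documents_to_training_set(Request $req) {
    $tset = new TrainingSet();
    foreach ((array)$req->request->get('documents', array()) as $d) {
        $tset->addDocument(
            $d['class'],
            new EmailDocument($d['from'], $d['subject'], $d['body'])
        );
    }
    return $tset;
}

function train_new(Request $req) {
    $tset = documents_to_training_set($req);
    if (count($tset) === 0)
        return array(null, null); // nothing to train on
    $model = new FeatureBasedNB();
    // assumption: train() returns a context usable for later updates
    $ctx = $model->train(new EmailFeatures(), $tset);
    return array($model, $ctx);
}

function retrain(FeatureBasedNB $model, array &$ctx, Request $req) {
    $tset = documents_to_training_set($req);
    if (count($tset) === 0)
        return false;
    // assumption: train_with_context() updates both model and context
    $model->train_with_context($ctx, new EmailFeatures(), $tset);
    return true;
}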
Let's take each endpoint in turn.
POST /models
Posting to /models creates a new model and trains it if emails are
provided. $app['UID'] is a unique id provider. A folder is created
for the model, holding two files: one representing the model and one the
training context. Both files contain Base64-encoded serialized PHP arrays
or objects.
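Since every endpoint below reads or writes these files, two tiny helpers capture the scheme (the helper names are mine, for illustration only):

// Hypothetical helpers for the storage scheme described above:
// Base64-encoded serialized PHP, one file per object
function save_object($path, $obj) {
    return file_put_contents($path, base64_encode(serialize($obj))) !== false;
}
function load_object($path) {
    return unserialize(base64_decode(file_get_contents($path)));
}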
// Create a new model
$app->post('/models',function (Request $req) use($app) {
    $uid = $app['UID']();
    if (!$uid) {
        return $app->json(
            array(
                "error"=>"Could not allocate a unique id for the new model"
            ),
            500
        );
    }
    if ( !mkdir("models/$uid") ) {
        return $app->json(
            array(
                "error"=>"Could not create a directory for the new model"
            ),
            500
        );
    }
    list($model,$ctx) = train_new($req);
    if ($ctx !== null) {
        // emails were provided, so persist the trained model and its
        // context (Base64-encoded serialized PHP, as described above)
        file_put_contents("models/$uid/model", base64_encode(serialize($model)));
        file_put_contents("models/$uid/ctx", base64_encode(serialize($ctx)));
    }
    return $app->json(
        array(
            "id" => $uid
        ),
        201
    );
});
GET /{uid}
This endpoint simply returns 204 if the model was found or 404 if it
was not.
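Given the storage layout above, a minimal sketch of this handler could be:

// Sketch: 204 if the model's folder exists, 404 otherwise
$app->get('/{uid}',function (Request $req,$uid) use($app) {
    if (is_dir("models/$uid"))
        return new Response('',204);
    return $app->json(array('error'=>'Model not found'),404);
})->assert('uid','[0-9a-f]+');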
DELETE /{uid}
This endpoint simply deletes the model if it exists or returns a 404 error
otherwise.
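Again assuming the models/{uid} layout, a sketch:

// Sketch: remove the model's files and its folder
$app->delete('/{uid}',function (Request $req,$uid) use($app) {
    if (!is_dir("models/$uid"))
        return $app->json(array('error'=>'Model not found'),404);
    array_map('unlink', glob("models/$uid/*"));
    rmdir("models/$uid");
    return new Response('',204);
})->assert('uid','[0-9a-f]+');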
POST /{uid}/report
This endpoint is used to further train the model. The model is unserialized
from its files; if a training context exists it is trained incrementally,
otherwise it is trained from scratch.
$app->post('/{uid}/report',function (Request $req,$uid) use($app) {
    if (!file_exists("models/$uid/model")) {
        return $app->json(
            array('error'=>'Model not found'),
            404
        );
    }
    if (!file_exists("models/$uid/ctx")) {
        // no training context: train a new model from scratch
        list($model,$ctx) = train_new($req);
        if ($ctx === null)
            return $app->json(array('error'=>'No documents were reported'),400);
    } else {
        // unserialize the model and its context and train incrementally
        $model = unserialize(base64_decode(file_get_contents("models/$uid/model")));
        $ctx = unserialize(base64_decode(file_get_contents("models/$uid/ctx")));
        if (!retrain($model,$ctx,$req))
            return $app->json(array('error'=>'No documents were reported'),400);
    }
    // persist the updated model and context
    file_put_contents("models/$uid/model", base64_encode(serialize($model)));
    file_put_contents("models/$uid/ctx", base64_encode(serialize($ctx)));
    return new Response('',204);
})->assert('uid','[0-9a-f]+');
POST /{uid}/classify
This endpoint actually uses the model to classify a set of documents
(emails). We will talk about EmailFeatures and EmailDocument in
the next post.
$app->post('/{uid}/classify',function (Request $req,$uid) use($app) {
    if (!file_exists("models/$uid/model")) {
        return $app->json(
            array('error'=>'Model not found'),
            404
        );
    }
    $response = array();
    if ($req->request->has('documents') && is_array($req->request->get('documents'))) {
        // create the feature factory
        $ff = new EmailFeatures();
        // get the model from the file
        $model = unserialize(base64_decode(file_get_contents("models/$uid/model")));
        // create a classifier
        $cls = new MultinomialNBClassifier($ff,$model);
        // for each document
        foreach ($req->request->get('documents') as $doc) {
            // create an email document because that is what
            // EmailFeatures expects
            $email = new EmailDocument(
                $doc['from'],
                $doc['subject'],
                $doc['body']
            );
            // actually use the model to predict the class of this email
            $response[] = $cls->classify(array('SPAM','HAM'),$email);
        }
    }
    // show the predictions
    return $app->json($response);
})->assert('uid','[0-9a-f]+');
Project: Spam detection service
For the first mini NLP project we will create a spam detection service, for
use with email systems and potentially any other system. Our purpose is to
create an HTTP endpoint (a REST service) that manages models (create /
train / classify / delete). The classifier will be a binary classifier
with the class names SPAM and HAM,
for spam and not spam respectively.
The purpose of the service we will create is not to make well-known
anti-spam services obsolete; it is to provide a good starting ground for
using NlpTools for Bayesian classification, and for creating a custom
classifier that, given a specific context (and extra work by you), could
even outperform well-known solutions.
Classification model to use
At the time of writing, NlpTools only supports two models: Naive Bayes
and Maximum Entropy (conditional exponential model). Although Maximum
Entropy models are known to perform better at most NLP tasks, I will be
using Naive Bayes for the following reasons:
- NB is a lot less computationally intensive to train than MaxEnt
- Because of reason 1, we can use pure PHP and still create a service usable in real-world scenarios
- We can train a NB model incrementally without keeping all the previous documents (we will be keeping a training context instead; see the sketch below)
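Concretely, the plan for the last point is that training yields a context we can persist and later feed back together with new documents. Roughly (the method names are assumptions to verify against NlpTools; the details follow in later posts):

// Rough shape of incremental NB training (assumed API; see later posts)
$model = new FeatureBasedNB();
$ctx = $model->train($ff, $tset);                 // initial training returns a context
// ... later, when new emails are reported ($new_tset) ...
$model->train_with_context($ctx, $ff, $new_tset); // update model and context in place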
Service overview
We want to be able to perform the following functions
- Create a model
- Train a model on a set of emails (report emails as spam)
- Classify a set of emails
- Delete a model
This means our RESTful service will have at least four endpoints. Let's
also use JSON as the only transport format, so as not to complicate things.
Tools
We will now set up our environment. Besides the obvious, NlpTools, we will
also be using Silex to help us out with building
our service. We will install both using composer.
Create a directory to hold the project (e.g. spam-service), then create the following composer.json.
{
"require": {
"silex/silex": "1.0.*@dev",
"nlp-tools/nlp-tools": "1.0.*@dev"
},
"autoload": {
"psr-0": {"": "web/"}
}
}
It is also apparent from the above composer.json file that we will be
coding in the web directory and that our code naming conventions will
comply with the PSR-0 standard. Now run composer to install Silex and
NlpTools along with their dependencies. If you don't have composer, run the following.
$ curl -s http://getcomposer.org/installer | php
$ composer.phar install
Finally, for this first part, we need to set up a web server that will serve
our spam service. The Silex documentation
has plenty of information on setting up Silex to play well with any web
server. Below is the config file I am using for nginx.
server {
listen localhost:80;
server_name 127.0.0.1;
location / {
# example: unix:/var/run/php-fpm/www.sock
fastcgi_pass unix:/path/to/spam-service/spam-service.sock;
include fastcgi_params;
# replace /path/to/spam-service with whatever directory you
# created to put the project in
fastcgi_param SCRIPT_FILENAME /path/to/spam-service/web/index.php;
}
}