NlpTools

Natural language processing in php

Application outline Mar 19th, 2013

Project: Spam detection service

In this second post about the project, I will outline the Silex application and the input/output formats of the RESTful service.

Input

As previously mentioned, the email input format will be JSON. We will receive only the sender's email address, the subject and the body of each email. Every endpoint that accepts emails as input expects them as an array, so many emails can be sent in a single request.

{
    "documents": [
        {
            "from": "example@example.com",
            "subject": "Test subject",
            "body": "Lorem ipsum dolor sit amet...",
            "class": "HAM"
        },
        {
            "from": "example@spam.com",
            "subject": "Cheap phones",
            "body": "Lorem ipsum dolor sit amet...",
            "class": "SPAM"
        },
        ...
        ...
        ...
    ]
}

If the emails are not for training but for classification, the class attribute will be ignored.
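To make the input contract concrete, here is a minimal sketch of how the decoded request body could be checked before use. The function extract_documents is hypothetical and not part of the service's code, and silently skipping malformed documents is an assumption rather than something the service is known to do.

```php
<?php
// Hypothetical helper: pull the well-formed emails out of a decoded request body.
function extract_documents(array $payload, $requireClass = false)
{
    if (!isset($payload['documents']) || !is_array($payload['documents'])) {
        return array();
    }
    $docs = array();
    foreach ($payload['documents'] as $doc) {
        if (!is_array($doc)) {
            continue;
        }
        // every email must carry the three string fields
        foreach (array('from', 'subject', 'body') as $key) {
            if (!isset($doc[$key]) || !is_string($doc[$key])) {
                continue 2; // skip this document entirely (assumed policy)
            }
        }
        $class = isset($doc['class']) ? $doc['class'] : null;
        // when training, only the two known labels are accepted
        if ($requireClass && !in_array($class, array('SPAM', 'HAM'), true)) {
            continue;
        }
        $docs[] = array(
            'from'    => $doc['from'],
            'subject' => $doc['subject'],
            'body'    => $doc['body'],
            // for classification the class attribute is ignored
            'class'   => $requireClass ? $class : null,
        );
    }
    return $docs;
}
```

Calling it with $requireClass set to true would suit the training endpoints, while the default suits classification, where the class attribute is dropped.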

Output

The app's response will always be in JSON too. We will output either an error message in the following form (with an HTTP status of 4xx or 5xx, of course)

{
    "error": "This is a message"
}

or an array of labels with HTTP status 200

["HAM","SPAM",....]

or an empty body with an informative HTTP status (like 204).
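The three response shapes can be sketched with plain PHP, independently of Silex. The helper names below are hypothetical; each returns a status code paired with the body that would be sent.

```php
<?php
// Hypothetical helpers illustrating the three response shapes.

// An error: a JSON object with a single "error" key and a 4xx/5xx status.
function error_response($message, $status)
{
    return array($status, json_encode(array('error' => $message)));
}

// A successful classification: a JSON list of labels with status 200.
function labels_response(array $labels)
{
    // array_values guarantees a JSON list rather than a JSON object
    return array(200, json_encode(array_values($labels)));
}

// Nothing to report: an empty body with status 204.
function empty_response()
{
    return array(204, '');
}
```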

Bootstrapping

Below follows the skeleton of our service's code. There are two functions that train the models (train_new, retrain) and both return a model and a training context (so that we can perform incremental training).

require __DIR__.'/../vendor/autoload.php';

// Silex
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;

// NlpTools
use NlpTools\Models\FeatureBasedNB;
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Classifiers\MultinomialNBClassifier;

/*
 * Train a model
 */
function train_new(Request $req) {
    ...
}

function retrain(FeatureBasedNB $model, array &$ctx, Request $req) {
    ...
}

// new App
$app = new Silex\Application();

// a UID provider
$app->register(new UIDServiceProvider());

// a middleware to parse json requests to array
$app->before(function (Request $req) {
    if (substr($req->headers->get('Content-Type'), 0, 16) === 'application/json') {
        $data = json_decode($req->getContent(), true);
        $req->request->replace(is_array($data) ? $data : array());
    }
});

// Create a new model
$app->post('/models', function (Request $req) use ($app) {
    ...
});

// Check if a model exists
$app->get('/{uid}', function (Request $req, $uid) use ($app) {
    ...
})->assert('uid', '[0-9a-f]+');

// Delete an existing model
$app->delete('/{uid}', function (Request $req, $uid) use ($app) {
    ...
})->assert('uid', '[0-9a-f]+');

// Report a number of emails as spam or not
$app->post('/{uid}/report', function (Request $req, $uid) use ($app) {
    ...
})->assert('uid', '[0-9a-f]+');

// Classify a set of documents
$app->post('/{uid}/classify', function (Request $req, $uid) use ($app) {
    ...
})->assert('uid', '[0-9a-f]+');

$app->run();
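The JSON middleware is the one piece of the skeleton whose behaviour is easy to get wrong, so here is its logic pulled out into a stand-alone function that can be run without Silex. The name parse_json_body is hypothetical; returning null for non-JSON requests stands in for "leave the request untouched".

```php
<?php
// Stand-alone version of the middleware's logic (hypothetical function name).
function parse_json_body($contentType, $rawBody)
{
    // Compare only the first 16 characters, so that
    // "application/json; charset=utf-8" is accepted as well.
    if (substr((string) $contentType, 0, 16) !== 'application/json') {
        return null; // not JSON: the request would be left as it is
    }
    $data = json_decode($rawBody, true);
    // scalars, null and invalid JSON all collapse to an empty array
    return is_array($data) ? $data : array();
}
```

Note that a malformed body is not rejected at this point. It simply becomes an empty array, and it is up to each endpoint to decide whether a request without documents is an error.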

Let's take each endpoint in turn.

POST /models

Posting to /models creates a new model and trains it if emails are provided. $app['UID'] is a unique id provider. A folder is created for the model, holding two files: one for the model itself and one for the training context. Both files contain Base64-encoded serialized PHP arrays or objects.

// Create a new model
$app->post('/models', function (Request $req) use ($app) {
    $uid = $app['UID']();
    if (file_exists("models/$uid"))
    {
        return $app->json(
            array(
                "error" => "Could not allocate a unique id for the new model"
            ),
            500
        );
    }
    if (!mkdir("models/$uid"))
    {
        return $app->json(
            array(
                "error" => "Could not allocate a unique id for the new model"
            ),
            500
        );
    }
    list($model, $ctx) = train_new($req);
    // the context is null when no emails were provided
    if ($ctx !== null)
        file_put_contents("models/$uid/ctx", base64_encode(serialize($ctx)));
    file_put_contents("models/$uid/model", base64_encode(serialize($model)));
    return $app->json(
        array(
            "id" => $uid
        ),
        201
    );
});
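The id provider and the file format used above can be sketched in a few lines. UIDServiceProvider is not shown in this post, so new_uid below is only a hypothetical stand-in (it assumes PHP 7 or later for random_bytes); save_blob and load_blob are likewise hypothetical names wrapping the same serialize/Base64 calls the endpoint uses.

```php
<?php
// Hypothetical stand-in for the id provider.
function new_uid()
{
    // 16 hex characters, which satisfies the routes' [0-9a-f]+ assertion
    return bin2hex(random_bytes(8));
}

// Write a value the same way the endpoint writes the model and the context.
function save_blob($path, $value)
{
    return file_put_contents($path, base64_encode(serialize($value))) !== false;
}

// Read a value back; only use this on files the service wrote itself.
function load_blob($path)
{
    return unserialize(base64_decode(file_get_contents($path)));
}
```

Since unserialize can instantiate arbitrary classes, this format is only reasonable as long as nothing but the service itself writes to the models folder.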

GET /{uid}

This endpoint simply returns 204 if the model was found or 404 if it was not.
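The code for this endpoint is not listed here, but the check it performs can be sketched as a plain function. The name model_status and the $modelsDir parameter are hypothetical.

```php
<?php
// Hypothetical sketch: map "does the model folder exist?" to an HTTP status.
function model_status($modelsDir, $uid)
{
    // Re-check the uid shape so the value can never point outside $modelsDir.
    if (!preg_match('/\A[0-9a-f]+\z/', $uid)) {
        return 404;
    }
    return is_dir("$modelsDir/$uid") ? 204 : 404;
}
```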

DELETE /{uid}

This endpoint simply deletes the model if it exists or returns a 404 error otherwise.
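Deletion is not listed either. A minimal sketch, assuming the folder holds nothing but the "model" and "ctx" files described earlier, could look like the following. The function name is hypothetical, and answering 204 on success is an assumption, since the post does not say which status the endpoint returns.

```php
<?php
// Hypothetical sketch: remove a model folder and report an HTTP status.
function delete_model($modelsDir, $uid)
{
    $dir = "$modelsDir/$uid";
    if (!preg_match('/\A[0-9a-f]+\z/', $uid) || !is_dir($dir)) {
        return 404;
    }
    // a model folder holds at most two files: "model" and "ctx"
    foreach (array('model', 'ctx') as $file) {
        if (is_file("$dir/$file")) {
            unlink("$dir/$file");
        }
    }
    // rmdir only succeeds on an empty folder
    return rmdir($dir) ? 204 : 500;
}
```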

POST /{uid}/report

This endpoint is used to further train the model. The model is unserialized from its files. If a training context exists, the model is trained incrementally; otherwise it is trained from scratch.

$app->post('/{uid}/report', function (Request $req, $uid) use ($app) {
    if (!file_exists("models/$uid"))
    {
        return $app->json(
            array('error' => 'Model not found'),
            404
        );
    }
    if (!file_exists("models/$uid/ctx"))
    {
        // no context yet: train from scratch
        list($model, $ctx) = train_new($req);
        if ($ctx === null)
            return $app->json(array('error' => 'No documents were reported'), 400);
        file_put_contents("models/$uid/ctx", base64_encode(serialize($ctx)));
        file_put_contents("models/$uid/model", base64_encode(serialize($model)));
    }
    else
    {
        // a context exists: load it and train incrementally
        $ctx = unserialize(base64_decode(file_get_contents("models/$uid/ctx")));
        $model = unserialize(base64_decode(file_get_contents("models/$uid/model")));
        if (!retrain($model, $ctx, $req))
            return $app->json(array('error' => 'No documents were reported'), 400);
        file_put_contents("models/$uid/ctx", base64_encode(serialize($ctx)));
        file_put_contents("models/$uid/model", base64_encode(serialize($model)));
    }
    return new Response('', 204);
})->assert('uid', '[0-9a-f]+');
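To see why keeping a training context makes incremental training possible, consider a toy version in which the context is nothing more than per-class word counts. This is an illustration only: it is not how NlpTools represents its context, and toy_train is a hypothetical function. It takes the context by reference and returns false when there is nothing to train on, which is how the endpoint above uses retrain.

```php
<?php
// Toy illustration only: NOT the NlpTools training context.
function toy_train(array &$ctx, array $documents)
{
    foreach ($documents as $doc) {
        $class = $doc['class'];
        $words = preg_split('/\W+/', strtolower($doc['body']), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (!isset($ctx[$class][$word])) {
                $ctx[$class][$word] = 0;
            }
            // new counts are added on top of whatever was counted before
            $ctx[$class][$word]++;
        }
    }
    // mirror the endpoint's "No documents were reported" check
    return count($documents) > 0;
}
```

Because the counts survive between calls, a second report only has to process the new emails. Without the stored context, every previously reported email would have to be sent again.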

POST /{uid}/classify

This endpoint actually uses the model to classify a set of documents (emails). We will be talking about EmailFeatures and EmailDocument in the next post.

$app->post('/{uid}/classify', function (Request $req, $uid) use ($app) {
    if (!file_exists("models/$uid"))
    {
        return $app->json(
            array('error' => 'Model not found'),
            404
        );
    }
    $response = array();
    if ($req->request->has('documents') && is_array($req->request->get('documents')))
    {
        // create the feature factory
        $ff = new EmailFeatures();
        // get the model from the file
        $model = unserialize(base64_decode(file_get_contents("models/$uid/model")));
        // create a classifier
        $cls = new MultinomialNBClassifier($ff, $model);
        // for each document
        foreach ($req->request->get('documents') as $doc)
        {
            // create an email document because that is what
            // EmailFeatures expects
            $email = new EmailDocument(
                $doc['from'],
                $doc['subject'],
                $doc['body']
            );
            // actually use the model to predict the class of this EmailDocument
            $response[] = $cls->classify(array('SPAM', 'HAM'), $email);
        }
    }
    // show the predictions
    return $app->json($response);
})->assert('uid', '[0-9a-f]+');
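Since the endpoint appends exactly one label per document, in request order, a client can match labels to emails by position. A small client-side sketch follows; pair_labels is a hypothetical helper and not part of the service.

```php
<?php
// Hypothetical client-side helper: match each label to the email it belongs to.
function pair_labels(array $documents, array $labels)
{
    if (count($documents) !== count($labels)) {
        throw new InvalidArgumentException('Expected one label per document');
    }
    $paired = array();
    foreach (array_values($documents) as $i => $doc) {
        $paired[] = array(
            'subject' => $doc['subject'],
            'label'   => $labels[$i],
        );
    }
    return $paired;
}
```

The count check matters because a request without a documents array gets an empty list back rather than an error.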