NlpTools

Natural language processing in php

Dbscan text clustering Sep 3rd, 2013

In this article I will be introducing Density Based Spatial Clustering of Applications with Noise, known as DBSCAN. DBSCAN is the latest addition to the Clustering namespace of php (it is still under development and not merged into master).

DBSCAN is a density based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes

Wikipedia

Dbscan Implementation

The current implementation in NlpTools takes as a parameter an object that implements SpatialIndexInterface.

  1. // comments stripped
  2. interface SpatialIndexInterface
  3. {
  4. public function setDistanceMetric(Distance $d);
  5. public function index(array &$docs);
  6. public function add($doc);
  7. public function regionQuery($doc, $e);
  8. public function kNearestNeighbors($doc, $k);
  9. }

This is used to abstract away the logic of searching a set of points for neighbors so that this process can be optimized. For instance we could implement k-d trees or ball trees for faster neighbor searching. The current implementation uses naive linear search and has complexity O(n2).

Examples

  1. use NlpTools\Clustering\Dbscan;
  2. use NlpTools\Similarity\Euclidean;
  3. use NlpTools\Documents\TrainingSet;
  4. use NlpTools\FeatureFactories\DataAsFeatures;
  5. $tset = new TrainingSet();
  6. // load the training set with documents
  7. // create a large outer circle of points
  8. for ($i=0.0;$i<2*M_PI;$i+=M_PI/20) {
  9. $tset->addDocument(
  10. 'A',
  11. EuclideanPoint::getRandomPointAround(
  12. 150+70*cos($i),
  13. 100+70*sin($i),
  14. 8
  15. )
  16. );
  17. $tset->addDocument(
  18. 'A',
  19. EuclideanPoint::getRandomPointAround(
  20. 150+70*cos($i),
  21. 100+70*sin($i),
  22. 8
  23. )
  24. );
  25. }
  26. // create an inner smaller circle of points
  27. for ($i=0.0;$i<2*M_PI;$i+=M_PI/10) {
  28. $tset->addDocument(
  29. 'B',
  30. EuclideanPoint::getRandomPointAround(
  31. 150+20*cos($i),
  32. 100+20*sin($i),
  33. 8
  34. )
  35. );
  36. $tset->addDocument(
  37. 'B',
  38. EuclideanPoint::getRandomPointAround(
  39. 150+20*cos($i),
  40. 100+20*sin($i),
  41. 8
  42. )
  43. );
  44. }
  45. // we are stating that we consider a region to be dense
  46. // if it has at least 4 points in a circle of radius 20
  47. $clust = new Dbscan(
  48. 4, // the minimum number of points needed
  49. 20, // the radius in which we are looking for points
  50. new Euclidean() // we will use the Euclidean distance
  51. );
  52. // do the actual clustering
  53. list($clusters,$noise) = $clust->cluster($tset, new DataAsFeatures());

The above code creates a very simple dataset with two distinct clusters in the two-dimensional euclidean space. This simple dataset is often used as an example to illustrate the differences between DBSCAN and K-Means.

K-Means DBSCAN  

The second example that follows illustrates the property of DBSCAN to identify noise points and determine the number of clusters in the datatset. The dark blue points are noise. The points to be clustered are uniformly random.