Natural language processing in php


Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Wikipedia

The NlpTools\Clustering namespace will contain a number of clustering algorithms to be used with the library. Currently it contains:

  1. K-Means
  2. Hierarchical Agglomerative Clustering


All clustering classes extend the abstract class Clusterer and implement the following abstract method.

  1. abstract class Clusterer
  2. {
  3. ...
  4. abstract public function cluster(TrainingSet $tset, FeatureFactory $ff);
  5. ...
  6. }

The clusterers should return an array containing as a first element the clusters (an array with grouped document indices) and any other data produced by the clustering (ex.: centroids and distances for K-Means).


KMeans is a very popular, simple and fast clustering algorithm. It minimizes the within cluster sum of square distances from each point to the corresponding centroid. It can be implemented with any metric of distance coupled with a centroid generation strategy. In NlpTools this means that the KMeans instance receives a Distance instance and a new class, a CentroidFactory.

Centroid Factories

The centroid factories are usually coupled with a distance metric and simply produce a new data point, given a set of points, such that the sum of the distances of this point to every point in the set is minimized. In NlpTools there are a number of CentroidFactories already defined:

Simply the centroid of the points in the euclidean space (x_1 + x_2 + ...)/n
A new number that the sum of the hamming distance from every number in the provided set is minimized
Compute a unit vector in the euclidean space such that the sum of the cosine similarities between the centroid and every other vector in the set is maximized (cosine distance minimized)

You can see a very simple example (with semi decent visualization if you have gd) in the tests of NlpTools and a very simple one that follows.

  1. use NlpTools\Clustering\KMeans;
  2. use NlpTools\Similarity\Euclidean;
  3. use NlpTools\Clustering\CentroidFactories\Euclidean as EuclideanCF;
  4. use NlpTools\Documents\TrainingSet;
  5. use NlpTools\Documents\TokensDocument;
  6. use NlpTools\FeatureFactories\DataAsFeatures;
  7. $tset = new TrainingSet();
  8. $tset->addDocument(
  9. '', //class is not used so it can be empty
  10. new TokensDocument(array('x'=>0,'y'=>0))
  11. );
  12. $tset->addDocument(
  13. '', //class is not used so it can be empty
  14. new TokensDocument(array('x'=>1,'y'=>0))
  15. );
  16. $tset->addDocument(
  17. '', //class is not used so it can be empty
  18. new TokensDocument(array('x'=>0,'y'=>1))
  19. );
  20. $tset->addDocument(
  21. '', //class is not used so it can be empty
  22. new TokensDocument(array('x'=>3,'y'=>3))
  23. );
  24. $tset->addDocument(
  25. '', //class is not used so it can be empty
  26. new TokensDocument(array('x'=>2,'y'=>3))
  27. );
  28. $tset->addDocument(
  29. '', //class is not used so it can be empty
  30. new TokensDocument(array('x'=>3,'y'=>2))
  31. );
  32. $clust = new KMeans(
  33. 2, // two clusters
  34. new Euclidean(),
  35. new EuclideanCF()
  36. );
  37. $clust->cluster($tset, new DataAsFeatures())
  38. );

Hierarchical Clustering

NlpTools implements hierarchical agglomerative clustering. This clustering method works in the following steps. Each datapoint starts at its own cluster. Then a merging strategy is initialized (usually this initialization includes computing a dis-/similarity matrix). Then iteratively two clusters are merged until only one cluster remains.

  1. class Hierarchical extends Clusterer
  2. {
  3. public function __construct(
  4. MergeStrategyInterface $ms,
  5. DistanceInterface $d
  6. )
  7. {
  8. ...
  9. }
  10. }

The cluster method does not return grouped indices of the training set's documents but instead it returns a dendrogram which can be easily cut into any number of clusters using the static method dendrogramToClusters($tree, $clustercount).

Merge Strategies

The merge strategies need to implement the following interface.

  1. interface MergeStrategyInterface
  2. {
  3. /**
  4. * Study the docs and preprocess anything required for
  5. * computing the merges
  6. */
  7. public function initializeStrategy(DistanceInterface $d, array &$docs);
  8. /**
  9. * Return the next two clusters for merging and assume
  10. * they are merged (ex. update a similarity matrix)
  11. *
  12. * @return array An array with two numbers which are the cluster ids
  13. */
  14. public function getNextMerge();
  15. }

The merge strategies available in NlpTools are the following.

  • SingleLink
  • CompleteLink
  • GroupAverage

You can see examples of using hierarchical clustering in NlpTools' tests or the following small example.

  1. use NlpTools\Clustering\Hierarchical;
  2. use NlpTools\Clustering\MergeStrategies\SingleLink;
  3. use NlpTools\Similarity\Euclidean;
  4. use NlpTools\FeatureFactories\DataAsFeatures;
  5. $clust = new Hierarchical(
  6. new SingleLink(),
  7. new Euclidean()
  8. );
  9. // Assuming $tset is the same from the above example in K-Means
  10. $dendrogram = $clust->cluster($tset, new DataAsFeatures());
  11. print_r($dendrogram);
  12. print_r(Hierarchical::dendrogramToClusters($dendrogram, 2));

« Similarity / Transformations »