Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Wikipedia
The NlpTools\Clustering namespace will contain a number of clustering algorithms to be used with the library. Currently it contains:
- K-Means
- Hierarchical Agglomerative Clustering
Interfaces
All clustering classes extend the abstract class Clusterer and implement the following abstract method.
abstract class Clusterer { ... abstract public function cluster(TrainingSet $tset, FeatureFactory $ff); ... }
The clusterers should return an array containing as a first element the clusters (an array with grouped document indices) and any other data produced by the clustering (ex.: centroids and distances for K-Means).
KMeans
KMeans is a very popular, simple and fast clustering algorithm. It minimizes the within cluster sum of square distances from each point to the corresponding centroid. It can be implemented with any metric of distance coupled with a centroid generation strategy. In NlpTools this means that the KMeans instance receives a Distance instance and a new class, a CentroidFactory.
Centroid Factories
The centroid factories are usually coupled with a distance metric and simply produce a new data point, given a set of points, such that the sum of the distances of this point to every point in the set is minimized. In NlpTools there are a number of CentroidFactories already defined:
- Euclidean
- Simply the centroid of the points in the euclidean space (x_1 + x_2 + ...)/n
- Hamming
- A new number that the sum of the hamming distance from every number in the provided set is minimized
- MeanAngle
- Compute a unit vector in the euclidean space such that the sum of the cosine similarities between the centroid and every other vector in the set is maximized (cosine distance minimized)
You can see a very simple example (with semi decent visualization if you have gd) in the tests of NlpTools and a very simple one that follows.
use NlpTools\Clustering\KMeans; use NlpTools\Similarity\Euclidean; use NlpTools\Clustering\CentroidFactories\Euclidean as EuclideanCF; use NlpTools\Documents\TrainingSet; use NlpTools\Documents\TokensDocument; use NlpTools\FeatureFactories\DataAsFeatures; $tset = new TrainingSet(); $tset->addDocument( '', //class is not used so it can be empty ); $tset->addDocument( '', //class is not used so it can be empty ); $tset->addDocument( '', //class is not used so it can be empty ); $tset->addDocument( '', //class is not used so it can be empty ); $tset->addDocument( '', //class is not used so it can be empty ); $tset->addDocument( '', //class is not used so it can be empty ); $clust = new KMeans( 2, // two clusters new Euclidean(), new EuclideanCF() ); $clust->cluster($tset, new DataAsFeatures()) );
Hierarchical Clustering
NlpTools implements hierarchical agglomerative clustering. This clustering method works in the following steps. Each datapoint starts at its own cluster. Then a merging strategy is initialized (usually this initialization includes computing a dis-/similarity matrix). Then iteratively two clusters are merged until only one cluster remains.
class Hierarchical extends Clusterer { public function __construct( MergeStrategyInterface $ms, DistanceInterface $d ) { ... } }
The cluster method does not return grouped indices of the training set's documents but
instead it returns a dendrogram which can be
easily cut into any number of clusters using the static method
dendrogramToClusters($tree, $clustercount)
.
Merge Strategies
The merge strategies need to implement the following interface.
interface MergeStrategyInterface { /** * Study the docs and preprocess anything required for * computing the merges */ /** * Return the next two clusters for merging and assume * they are merged (ex. update a similarity matrix) * * @return array An array with two numbers which are the cluster ids */ public function getNextMerge(); }
The merge strategies available in NlpTools are the following.
- SingleLink
- CompleteLink
- GroupAverage
You can see examples of using hierarchical clustering in NlpTools' tests or the following small example.
use NlpTools\Clustering\Hierarchical; use NlpTools\Clustering\MergeStrategies\SingleLink; use NlpTools\Similarity\Euclidean; use NlpTools\FeatureFactories\DataAsFeatures; $clust = new Hierarchical( new SingleLink(), new Euclidean() ); // Assuming $tset is the same from the above example in K-Means $dendrogram = $clust->cluster($tset, new DataAsFeatures());