Similarity
Similarity in NlpTools is defined in the context of feature vectors. It is also very closely related to distance (many times one can be transformed into other).
Interfaces
We have two interfaces Similarity and Distance.
interface DistanceInterface { } interface SimilarityInterface { }
Jaccard Index
Jaccard Index is simply the count of items in the union of two sets divided by the count of items in their intersection. You can read more about Jaccard Index in Wikipedia.
Cosine Similarity
Given two vectors compute their similarity as the cosinus of their angle θ. In NlpTools the feature sets are made into vectors according to the following example.
$features = array('A','B','A','C','A','C');
will be made into the vector
$v = array('A'=>3,'B'=>1,'C'=>2);
Simhash
Simhash is an implementation of the locality sensitive hash function families proposed by Moses Charikar using the Earth Mover's Distance.
Locality sensitive hash functions are hash functions that map similar documents to neighboring (in hamming space) representations.
Examples
include ('vendor/autoload.php'); use \NlpTools\Tokenizers\WhitespaceTokenizer; use \NlpTools\Similarity\JaccardIndex; use \NlpTools\Similarity\CosineSimilarity; use \NlpTools\Similarity\Simhash; $s1 = "Please allow me to introduce myself I'm a man of wealth and tase"; $s2 = "Hello, I love you, won't you tell me your name Hello, I love you, let me jump in your game"; $tok = new WhitespaceTokenizer(); $J = new JaccardIndex(); $cos = new CosineSimilarity(); $simhash = new Simhash(16); // 16 bits hash $setA = $tok->tokenize($s1); $setB = $tok->tokenize($s2); printf ( " Jaccard: %.3f Cosine: %.3f Simhash: %.3f SimhashA: %s SimhashB: %s ", $J->similarity( $setA, $setB ), $cos->similarity( $setA, $setB ), $simhash->similarity( $setA, $setB ), $simhash->simhash($setA), $simhash->simhash($setB) );