Similarity in NlpTools is defined in the context of feature vectors. It is closely related to distance, and often one can be transformed into the other.
NlpTools provides two interfaces, one for similarity and one for distance.
- interface DistanceInterface
- interface SimilarityInterface
The Jaccard index is the number of items in the intersection of two sets divided by the number of items in their union. You can read more about the Jaccard index on Wikipedia.
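As a rough illustration, the Jaccard index |A ∩ B| / |A ∪ B| can be computed by hand in plain PHP. This is a minimal sketch and not the NlpTools implementation: `jaccard_index` is a hypothetical helper, and returning 0.0 for two empty sets is an assumption made here.

```php
<?php
// Hypothetical helper, not part of NlpTools: Jaccard index of two token lists.
function jaccard_index(array $a, array $b): float
{
    // Treat both inputs as sets by dropping duplicate items.
    $a = array_unique($a);
    $b = array_unique($b);

    $intersection = count(array_intersect($a, $b));
    $union = count($a) + count($b) - $intersection;

    // Assumption of this sketch: two empty sets give 0.0.
    return $union === 0 ? 0.0 : $intersection / $union;
}

$setA = array('a', 'b', 'c');
$setB = array('b', 'c', 'd', 'e');

// The intersection {b, c} has 2 items, the union {a, b, c, d, e} has 5.
printf("%.3f\n", jaccard_index($setA, $setB)); // prints 0.400
```

Identical sets give 1 and disjoint sets give 0, so larger values mean more similar sets.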
Cosine similarity measures the similarity of two vectors as the cosine of the angle θ between them. In NlpTools the feature sets are made into vectors according to the following example.
$features = array('A','B','A','C','A','C'); will be made into the vector $v = array('A'=>3,'B'=>1,'C'=>2);
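The conversion and the cosine computation can be sketched in plain PHP as follows. This is an illustration under stated assumptions, not the library's code: `cosine_similarity` is a hypothetical helper, it uses `array_count_values` to build the count vectors, and it returns 0.0 for an empty input, which may differ from how NlpTools handles zero vectors.

```php
<?php
// Hypothetical helper, not part of NlpTools: cosine similarity of two feature lists.
function cosine_similarity(array $featuresA, array $featuresB): float
{
    // Count occurrences: array('A','B','A') becomes array('A'=>2,'B'=>1).
    $v1 = array_count_values($featuresA);
    $v2 = array_count_values($featuresB);

    // Dot product over the features present in both vectors.
    $dot = 0;
    foreach ($v1 as $feature => $count) {
        if (isset($v2[$feature])) {
            $dot += $count * $v2[$feature];
        }
    }

    // Euclidean norms of the two count vectors.
    $square = function ($x) { return $x * $x; };
    $norm1 = sqrt(array_sum(array_map($square, $v1)));
    $norm2 = sqrt(array_sum(array_map($square, $v2)));

    // Assumption of this sketch: an empty input gives similarity 0.0.
    if ($norm1 == 0 || $norm2 == 0) {
        return 0.0;
    }

    return $dot / ($norm1 * $norm2);
}

$a = array('A', 'B', 'A', 'C', 'A', 'C'); // vector A=>3, B=>1, C=>2
$b = array('A', 'C', 'C');                // vector A=>1, C=>2

// dot = 3*1 + 2*2 = 7, norms are sqrt(14) and sqrt(5), so cos θ = 7 / sqrt(70).
printf("%.3f\n", cosine_similarity($a, $b)); // prints 0.837
```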
Locality-sensitive hash functions are hash functions that map similar documents to neighboring representations in Hamming space, that is, to hashes that differ in only a few bits. Simhash is one such function.
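To make "neighboring in Hamming space" concrete, the sketch below counts the differing bits between two hashes written as strings of 0s and 1s. `hamming_distance` is a hypothetical helper, the two 16-bit hashes are made up for illustration, and the normalized similarity in the last line is one plausible way to turn the distance into a score, not necessarily the formula NlpTools uses.

```php
<?php
// Hypothetical helper, not part of NlpTools: Hamming distance of two bit strings.
function hamming_distance(string $h1, string $h2): int
{
    if (strlen($h1) !== strlen($h2)) {
        throw new InvalidArgumentException('Hashes must have the same length');
    }

    // Count the positions at which the two hashes differ.
    $distance = 0;
    for ($i = 0; $i < strlen($h1); $i++) {
        if ($h1[$i] !== $h2[$i]) {
            $distance++;
        }
    }

    return $distance;
}

// Made-up 16-bit hashes that differ in two positions.
$hashA = '1011001110001111';
$hashB = '1011001010011111';

$d = hamming_distance($hashA, $hashB);
$length = strlen($hashA);

// Assumed normalization: the fraction of bits on which the hashes agree.
printf("distance: %d, similarity: %.3f\n", $d, ($length - $d) / $length);
// prints "distance: 2, similarity: 0.875"
```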
- include ('vendor/autoload.php');
- use \NlpTools\Tokenizers\WhitespaceTokenizer;
- use \NlpTools\Similarity\JaccardIndex;
- use \NlpTools\Similarity\CosineSimilarity;
- use \NlpTools\Similarity\Simhash;
- $s1 = "Please allow me to introduce myself
- I'm a man of wealth and taste";
- $s2 = "Hello, I love you, won't you tell me your name
- Hello, I love you, let me jump in your game";
- $tok = new WhitespaceTokenizer();
- $J = new JaccardIndex();
- $cos = new CosineSimilarity();
- $simhash = new Simhash(16); // 16 bits hash
- $setA = $tok->tokenize($s1);
- $setB = $tok->tokenize($s2);
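- printf (
-     "
-     Jaccard:  %.3f
-     Cosine:   %.3f
-     Simhash:  %.3f
-     SimhashA: %s
-     SimhashB: %s
-     ",
-     $J->similarity($setA, $setB),
-     $cos->similarity($setA, $setB),
-     $simhash->similarity($setA, $setB),
-     $simhash->simhash($setA), // bit string of the first hash
-     $simhash->simhash($setB)  // bit string of the second hash
- );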