NlpTools

Natural language processing in PHP

Similarity

Similarity in NlpTools is defined in the context of feature vectors. It is closely related to distance: in many cases one can be transformed into the other.

Interfaces

There are two interfaces, SimilarityInterface and DistanceInterface.

    interface DistanceInterface
    {
        public function dist(array &$setA, array &$setB);
    }
    interface SimilarityInterface
    {
        public function similarity(array &$setA, array &$setB);
    }
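As a rough illustration of how a class might satisfy both interfaces, the sketch below redeclares the two interfaces locally so that it runs on its own, then implements them in a hypothetical OverlapCoefficient class. That class is not part of NlpTools, and the 1 - similarity conversion in dist() is only one common choice for similarities bounded in [0, 1].

```php
<?php
// Local copies of the two interfaces, so the sketch runs without NlpTools.
interface DistanceInterface
{
    public function dist(array &$setA, array &$setB);
}
interface SimilarityInterface
{
    public function similarity(array &$setA, array &$setB);
}

// Hypothetical class, not part of NlpTools: the share of the smaller set's
// distinct items that also appear in the other set.
class OverlapCoefficient implements SimilarityInterface, DistanceInterface
{
    public function similarity(array &$setA, array &$setB)
    {
        $a = array_unique($setA);
        $b = array_unique($setB);
        $smaller = min(count($a), count($b));
        if ($smaller === 0) {
            return 0.0; // assumption: an empty set is similar to nothing
        }
        return count(array_intersect($a, $b)) / $smaller;
    }

    // Assumed conversion: a similarity in [0, 1] becomes a distance in [0, 1].
    public function dist(array &$setA, array &$setB)
    {
        return 1 - $this->similarity($setA, $setB);
    }
}

$x = array('a', 'b', 'c');
$y = array('b', 'c', 'd', 'e');
$m = new OverlapCoefficient();
// Two shared items out of the three in the smaller set.
printf("similarity: %.3f, distance: %.3f\n", $m->similarity($x, $y), $m->dist($x, $y));
```

Because the interface methods take their arguments by reference, the arrays have to be passed as variables rather than as literals.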

Jaccard Index

The Jaccard index is the number of items in the intersection of two sets divided by the number of items in their union. You can read more about the Jaccard index on Wikipedia.
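To make the ratio concrete, here is a minimal sketch that computes the Jaccard index as the size of the intersection over the size of the union. The helper name jaccardIndex() is hypothetical and this is not the NlpTools implementation; returning 0 for two empty sets is an assumption.

```php
<?php
// Hypothetical helper, not the NlpTools implementation.
// Token lists are treated as sets, so duplicates are ignored.
function jaccardIndex(array $a, array $b)
{
    $a = array_unique($a);
    $b = array_unique($b);
    $intersection = count(array_intersect($a, $b));
    $union = count(array_unique(array_merge($a, $b)));
    if ($union === 0) {
        return 0.0; // assumption: two empty sets get similarity 0
    }
    return $intersection / $union;
}

$setA = array('the', 'cat', 'sat');
$setB = array('the', 'cat', 'ran', 'away');
// 2 shared items ('the', 'cat') out of 5 distinct items overall.
printf("%.3f\n", jaccardIndex($setA, $setB)); // prints 0.400
```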

Cosine Similarity

Given two vectors, their similarity is computed as the cosine of the angle θ between them. In NlpTools, feature sets are converted into vectors as in the following example.

$features = array('A','B','A','C','A','C');
will be made into the vector
$v = array('A'=>3,'B'=>1,'C'=>2);
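The conversion and the cosine itself can be sketched in a few lines. The helper cosineSimilarity() below is hypothetical and not the NlpTools implementation. It assumes features are strings or integers, which is what array_count_values() accepts, and it returns 0 when either vector is empty.

```php
<?php
// Hypothetical helper, not the NlpTools implementation.
function cosineSimilarity(array $featuresA, array $featuresB)
{
    // Turn each feature list into a vector of occurrence counts,
    // e.g. array('A','B','A') becomes array('A'=>2,'B'=>1).
    $v1 = array_count_values($featuresA);
    $v2 = array_count_values($featuresB);

    // Dot product over the keys the two vectors share.
    $dot = 0;
    foreach ($v1 as $key => $count) {
        if (isset($v2[$key])) {
            $dot += $count * $v2[$key];
        }
    }

    // Euclidean length of each vector.
    $square = function ($c) { return $c * $c; };
    $norm1 = sqrt(array_sum(array_map($square, $v1)));
    $norm2 = sqrt(array_sum(array_map($square, $v2)));
    if ($norm1 == 0 || $norm2 == 0) {
        return 0.0; // assumption: an empty vector is similar to nothing
    }
    return $dot / ($norm1 * $norm2);
}

$a = array('A', 'B', 'A', 'C', 'A', 'C'); // vector A=>3, B=>1, C=>2
$b = array('A', 'B', 'B');                // vector A=>1, B=>2
// dot = 3*1 + 1*2 = 5, lengths sqrt(14) and sqrt(5), so 5 / sqrt(70).
printf("%.3f\n", cosineSimilarity($a, $b)); // prints 0.598
```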

Simhash

Simhash is an implementation of the locality sensitive hash function families proposed by Moses Charikar using the Earth Mover's Distance.

Locality-sensitive hash functions are hash functions that map similar documents to neighboring representations in Hamming space.
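The idea of neighboring hashes can be illustrated with the commonly described simhash construction: every feature votes on each bit of the fingerprint, and similarity is the share of bit positions on which two fingerprints agree. The helpers simhashSketch() and hammingSimilarity() are hypothetical, the use of md5 as the per-feature hash is an assumption, and the Simhash class in NlpTools may differ in these details.

```php
<?php
// Hypothetical helpers; the NlpTools Simhash class may work differently.
function simhashSketch(array $features, $bits = 16)
{
    // One vote counter per output bit. $bits should be at most 28 here,
    // because only 28 bits of the md5 digest are used below.
    $counts = array_fill(0, $bits, 0);
    foreach ($features as $f) {
        // Assumption: the first 7 hex digits (28 bits) of md5 act as the
        // per-feature hash; this fits in an integer on 32-bit builds too.
        $h = hexdec(substr(md5($f), 0, 7));
        for ($i = 0; $i < $bits; $i++) {
            // A set bit votes +1 for position $i, a clear bit votes -1.
            $counts[$i] += (($h >> $i) & 1) ? 1 : -1;
        }
    }
    // A positive total becomes 1, anything else becomes 0.
    $out = '';
    foreach ($counts as $c) {
        $out .= ($c > 0) ? '1' : '0';
    }
    return $out;
}

// Share of bit positions on which two equal-length bit strings agree.
function hammingSimilarity($hashA, $hashB)
{
    $same = 0;
    $len = strlen($hashA);
    for ($i = 0; $i < $len; $i++) {
        if ($hashA[$i] === $hashB[$i]) {
            $same++;
        }
    }
    return $same / $len;
}

$docA = array('please', 'allow', 'me', 'to', 'introduce', 'myself');
$docB = array('please', 'allow', 'me', 'to', 'introduce', 'him');
$hashA = simhashSketch($docA);
$hashB = simhashSketch($docB);
echo $hashA, "\n", $hashB, "\n";
printf("%.3f\n", hammingSimilarity($hashA, $hashB));
```

Feature lists that share most of their items cast mostly the same votes, so their fingerprints tend to differ in only a few bit positions.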

Examples

    include('vendor/autoload.php');

    use \NlpTools\Tokenizers\WhitespaceTokenizer;
    use \NlpTools\Similarity\JaccardIndex;
    use \NlpTools\Similarity\CosineSimilarity;
    use \NlpTools\Similarity\Simhash;

    // Two short texts to compare
    $s1 = "Please allow me to introduce myself
    I'm a man of wealth and taste";
    $s2 = "Hello, I love you, won't you tell me your name
    Hello, I love you, let me jump in your game";

    $tok = new WhitespaceTokenizer();
    $J = new JaccardIndex();
    $cos = new CosineSimilarity();
    $simhash = new Simhash(16); // 16 bits hash

    // Tokenize both texts into feature sets
    $setA = $tok->tokenize($s1);
    $setB = $tok->tokenize($s2);

    // Print each similarity, followed by the simhash of each set
    printf(
        "
        Jaccard: %.3f
        Cosine: %.3f
        Simhash: %.3f
        SimhashA: %s
        SimhashB: %s
        ",
        $J->similarity(
            $setA,
            $setB
        ),
        $cos->similarity(
            $setA,
            $setB
        ),
        $simhash->similarity(
            $setA,
            $setB
        ),
        $simhash->simhash($setA),
        $simhash->simhash($setB)
    );
