class Simhash implements SimilarityInterface, DistanceInterface
Simhash is an implementation of the locality sensitive hash function families proposed by Moses Charikar using the Earth Mover's Distance:
http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
A better description of the implementation can be found at
http://infolab.stanford.edu/~manku/papers/07www-duplicates.pdf
The current implementation uses md5 by default to hash the document's
features. Weighted features are not supported (unless duplicating a
feature is considered adding weight to it).
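As a rough illustration of what "hashing a feature" can mean here, the sketch below turns the md5 digest of a feature into a string of '0'/'1' characters and keeps the first $len bits. The function name md5_to_bits and the truncation strategy are assumptions made for this example, not the library's actual code; md5 yields 128 bits, so the sketch assumes $len is at most 128.

```php
<?php
// Hypothetical helper (not part of the documented class): hash a feature
// with md5 and return the first $len bits as a string of '0'/'1' characters.
function md5_to_bits(string $feature, int $len): string
{
    $bits = '';
    // md5() returns 32 hex characters; each one expands to 4 bits.
    foreach (str_split(md5($feature)) as $hex) {
        $bits .= str_pad(decbin(hexdec($hex)), 4, '0', STR_PAD_LEFT);
    }
    // Assumption: keep only the leading $len bits (at most 128).
    return substr($bits, 0, $len);
}
```

Any callable with the same shape (feature in, fixed-length bit string out) could presumably be passed as the $hash argument of the constructor, but the exact contract expected by the class is not stated in this page.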
Methods
| __construct(integer $len, callable $hash = 'self::md5') |
string | simhash(array $set) | Compute the locality sensitive hash for this set.
int | dist(array $A, array $B) |
float | similarity(array $A, array $B) |
Details
at line 39
public __construct(integer $len, callable $hash = 'self::md5')
at line 62
public string simhash(array $set)
Compute the locality sensitive hash for this set.
Maintain a vector ($boxes) of length $this->length, with every entry
initialized to 0. Each member of the set is hashed to a bit vector of
$this->length bits. For each of these bits we increment the
corresponding $boxes dimension if the bit is 1 and decrement it if
the bit is 0. Finally, the signs of the dimensions of the $boxes
vector form the locality sensitive hash.
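The procedure described above can be sketched as standalone PHP. This is a minimal sketch under stated assumptions, not the class's actual source: the names hash_feature and simhash_sketch are hypothetical, the feature hash is the leading $len bits of md5 (so $len is assumed to be at most 128), and a box that ends at exactly 0 is mapped to '0'.

```php
<?php
// Hypothetical helper: first $len bits of md5($feature) as a '0'/'1' string.
function hash_feature(string $feature, int $len): string
{
    $bits = '';
    foreach (str_split(md5($feature)) as $hex) {
        $bits .= str_pad(decbin(hexdec($hex)), 4, '0', STR_PAD_LEFT);
    }
    return substr($bits, 0, $len);
}

// Hypothetical sketch of the simhash computation described in the text.
function simhash_sketch(array $set, int $len = 64): string
{
    // One counter per output bit, all starting at 0.
    $boxes = array_fill(0, $len, 0);

    foreach ($set as $feature) {
        $bits = hash_feature($feature, $len);
        for ($i = 0; $i < $len; $i++) {
            // Increment on a 1 bit, decrement on a 0 bit.
            $boxes[$i] += ($bits[$i] === '1') ? 1 : -1;
        }
    }

    // The sign of each counter gives one bit of the signature.
    // Assumption: a counter of exactly 0 is treated as a 0 bit.
    $signature = '';
    foreach ($boxes as $box) {
        $signature .= ($box > 0) ? '1' : '0';
    }
    return $signature;
}
```

Because every feature contributes exactly +1 or -1 per dimension, repeating a feature in the input array is the only way to give it more influence: in simhash_sketch(['a', 'a', 'b'], 8) the two copies of 'a' outvote 'b' in every dimension, so the result equals the hash of 'a' alone.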
We have departed from the original implementation at the
following points:
1. Each feature has a weight of 1, but feature duplication is
allowed.