NlpTools API
Class

NlpTools\Similarity\Simhash

class Simhash implements SimilarityInterface, DistanceInterface

Simhash is an implementation of the locality-sensitive hash function families proposed by Moses Charikar using the Earth Mover's Distance: http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

A better description of the implementation can be found at
http://infolab.stanford.edu/~manku/papers/07www-duplicates.pdf

The current implementation uses md5 by default to hash the document's
features. Weighted features are not supported, although repeating a
feature in the set effectively gives it more weight.

Methods

__construct(integer $len, callable $hash = 'self::md5')

string simhash(array $set)

Compute the locality sensitive hash for this set.

int dist(array $A, array $B)

float similarity(array $A, array $B)

Details

at line 39
public __construct(integer $len, callable $hash = 'self::md5')

Parameters

integer $len The length of the simhash in bits
callable $hash The hash function to compute the hashes of the features

at line 62
public string simhash(array $set)

Compute the locality sensitive hash for this set.

Maintain a vector ($boxes) of length $this->length, initialized to
0. Each member of the set is hashed to a bit vector of $this->length
bits. For each of these bits, the corresponding $boxes dimension is
incremented if the bit is 1 and decremented if it is 0. Finally, the
signs of the dimensions of the $boxes vector form the locality
sensitive hash.

This implementation departs from the original at the following
point:
1. Each feature has a weight of 1, but feature duplication is
allowed.
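
The steps above can be sketched in a few lines of PHP. This is a minimal illustration, not the NlpTools source: the function names featureBits and simhashSketch are hypothetical, md5 is used because it is the documented default, and mapping a positive box to '1' and any other value to '0' is an assumption about how the signs are encoded.

```php
<?php
// Illustrative sketch only; names and sign encoding are assumptions.

/**
 * Hash a feature with md5 and return its 128 bits as a '0'/'1' string.
 */
function featureBits(string $feature): string
{
    $bits = '';
    // md5($s, true) returns 16 raw bytes; expand each byte to 8 bits.
    foreach (str_split(md5($feature, true)) as $byte) {
        $bits .= str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return $bits;
}

/**
 * Compute a $length-bit hash (as a bit string) for a list of features.
 * Requires $length <= 128 because md5 yields only 128 bits.
 */
function simhashSketch(array $set, int $length): string
{
    if ($length < 1 || $length > 128) {
        throw new InvalidArgumentException('Length must be between 1 and 128.');
    }
    $boxes = array_fill(0, $length, 0);
    foreach ($set as $feature) {
        $bits = featureBits((string) $feature);
        for ($i = 0; $i < $length; $i++) {
            // A 1 bit increments the box, a 0 bit decrements it.
            $boxes[$i] += ($bits[$i] === '1') ? 1 : -1;
        }
    }
    // Assumption: a positive box becomes '1', anything else becomes '0'.
    $hash = '';
    foreach ($boxes as $box) {
        $hash .= ($box > 0) ? '1' : '0';
    }
    return $hash;
}

$hash = simhashSketch(['the', 'quick', 'brown', 'fox'], 64);
```

Because every feature contributes exactly +1 or -1 per dimension, listing a feature twice doubles its contribution, which is how duplication acts as a weight. The result does not depend on the order of the features.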

Parameters

array $set

Return Value

string The bits of the hash as a string

at line 93
public int dist(array $A, array $B)

Parameters

array $A
array $B

Return Value

int The distance, in the range [0, $this->length]
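
The documented range [0, $this->length] is consistent with a Hamming distance between the two sets' hashes, that is, the number of bit positions in which they differ. The sketch below assumes that interpretation and operates directly on two bit strings of the kind simhash() returns; hammingDistance is a hypothetical helper, not part of NlpTools.

```php
<?php
// Illustrative sketch only; assumes dist() is a Hamming distance.

/**
 * Count the positions at which two equal-length bit strings differ.
 */
function hammingDistance(string $hashA, string $hashB): int
{
    if (strlen($hashA) !== strlen($hashB)) {
        throw new InvalidArgumentException('Hashes must have the same length.');
    }
    $distance = 0;
    for ($i = 0, $n = strlen($hashA); $i < $n; $i++) {
        if ($hashA[$i] !== $hashB[$i]) {
            $distance++;
        }
    }
    return $distance;
}

$d = hammingDistance('1100', '1010');
```

Identical hashes give 0, and hashes that differ in every bit give the hash length, matching the bounds of the documented range.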

at line 114
public float similarity(array $A, array $B)

Parameters

array $A
array $B

Return Value

float The similarity, in the range [0, 1]
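
One plausible way to obtain a value in [0, 1] is to normalize the distance by the hash length and subtract it from 1. The sketch below assumes that definition and a Hamming distance over bit strings; similarityFromHashes is a hypothetical helper, and the formula is an assumption rather than something this page states.

```php
<?php
// Illustrative sketch only; the normalization formula is an assumption.

/**
 * Similarity of two equal-length bit strings:
 * 1 - (number of differing bits) / (hash length).
 */
function similarityFromHashes(string $hashA, string $hashB): float
{
    $length = strlen($hashA);
    if ($length === 0 || $length !== strlen($hashB)) {
        throw new InvalidArgumentException('Hashes must be non-empty and of equal length.');
    }
    $distance = 0;
    for ($i = 0; $i < $length; $i++) {
        if ($hashA[$i] !== $hashB[$i]) {
            $distance++;
        }
    }
    return ($length - $distance) / $length;
}

$s = similarityFromHashes('1100', '1010');
```

Under this definition identical hashes score 1 and hashes that differ in every bit score 0.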