I've been trying for weeks now to get a system running that can handle larger-than-RAM datasets and return queries in acceptable time. It's running OK now, but far from optimal (the DB is ~100 GB and contains a few hundred million entries).
Does anyone here have experience with any of the implementations (likelike, lshkit, etc.) and can recommend one that handles larger sets? All the implementations I have found were either unmaintained, outdated, not working, or not suitable for production use.
Will definitely take a look at the paper, but unfortunately it's always a very long way from a paper to an actual implementation (no code has been published as far as I can see).
I've been playing with an implementation on top of Lightning MDB (LMDB)[1]. Your profile doesn't list an email, but feel free to email me if you're interested.
100GB isn't that big a deal. If you have at least 16GB of RAM it should be a breeze. There are much larger data sets in OpenLDAP in production around the world.
But I wouldn't choose Python for large-scale data processing work. The Python CPU/memory overhead is something like 100:1 compared to C. (This is why I worked on rtorrent and ditched the original BitTorrent client ASAP, and why I hate BitBake....)
The biggest problem at the moment is actually degrading performance, though I'm almost 100% sure this isn't caused by LMDB itself but by the bindings I've tried.
In the end, doing it directly in C is probably the only thing that will actually work.
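FWIW, before dropping all the way to C it may be worth checking which knobs the bindings expose. Assuming the py-lmdb bindings (just a guess at what's in use), the settings that tend to matter most for a larger-than-RAM DB are map_size, readahead, and batching writes into a small number of transactions. A rough sketch, with made-up paths and sizes:

```python
import lmdb

# Hypothetical path and sizes; adjust to the actual dataset.
env = lmdb.open(
    "/data/lsh-index",
    map_size=200 * 2**30,   # maximum DB size (200 GiB); LMDB mmaps this lazily
    readahead=False,        # usually recommended once the DB no longer fits in RAM
    writemap=True,          # can speed up writes; read the caveats in the docs first
    map_async=True,         # let the OS flush asynchronously (weaker durability)
)

# Batch writes into one transaction instead of one transaction per key;
# per-transaction overhead is a common source of binding-level slowness.
with env.begin(write=True) as txn:
    for i in range(1000):
        txn.put(("doc:%08d" % i).encode(), b"signature-bytes")

# Reads: a single read transaction can serve many lookups cheaply.
with env.begin() as txn:
    print(txn.get(b"doc:00000042"))
```

None of this is guaranteed to explain the degradation you're seeing, but disabling readahead and coalescing transactions are the two cheapest things to rule out before rewriting in C.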
This is the first time I've heard of nilsimsa. Looking through the docs for the algorithm, this is how it's described, but I'm still not clear on what a "character separation" is in this context or which three characters are chosen:
> Nilsimsa uses eight sets of character separations (character, character, etc.) and takes the three characters and performs a computation on them: (((tran[((a)+(n))&255]^tran[(b)]*((n)+(n)+1))+tran[(c)^tran[n]])&255), where a, b, and c are the characters, n is 0-7 indicating which separation, and tran is a permutation of [0-255]. The result is a byte, and nilsimsa throws all these bytes from all eight separations into one histogram and encodes the histogram.
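Not an authoritative answer, but the per-trigram step translates almost directly from that quote. As far as I can tell from the reference code, the trigrams come from a short sliding window over the input, and the eight "separations" are eight different choices of three positions within that window (I haven't verified the exact combinations). Here's a sketch of just the accumulation step, with a randomly generated stand-in for the real tran permutation (the real one is a fixed constant baked into the implementation):

```python
import random

# Stand-in for nilsimsa's fixed 256-byte permutation table.
rng = random.Random(0)
TRAN = list(range(256))
rng.shuffle(TRAN)

def tran3(a, b, c, n):
    """The quoted per-trigram computation: a, b, c are character codes,
    n in 0..7 says which of the eight 'separations' produced the trigram."""
    return ((TRAN[(a + n) & 255] ^ TRAN[b] * (n + n + 1)) + TRAN[c ^ TRAN[n]]) & 255

def accumulate(trigrams_by_separation):
    """Throw every tran3 byte, from all eight separations, into one
    256-bin histogram (which nilsimsa then thresholds into a 256-bit digest)."""
    hist = [0] * 256
    for n, trigrams in enumerate(trigrams_by_separation):   # n = 0..7
        for a, b, c in trigrams:
            hist[tran3(a, b, c, n)] += 1
    return hist
```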
I've found that in practice, shingling text and then minhashing it is scary-good at finding similar documents.
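For anyone who hasn't tried it, the whole pipeline is small: break the text into overlapping character k-grams ("shingles"), keep the minimum of each of a family of hash functions, and the fraction of matching minima estimates the Jaccard similarity of the shingle sets. A toy sketch (k=5 and 128 hashes are arbitrary defaults, and salted md5 stands in for a proper hash family):

```python
import hashlib

def shingles(text, k=5):
    """Overlapping character k-grams of the normalized input text."""
    text = " ".join(text.split()).lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash(shingle_set, num_hashes=128):
    """One minimum per 'hash function'; each slot is md5 with a different
    salt, which is slow but easy to follow."""
    sig = []
    for i in range(num_hashes):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("The quick brown fox jumps over the lazy dog"))
b = minhash(shingles("The quick brown fox jumped over a lazy dog"))
print(similarity(a, b))
```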
For some context, locality sensitive hashing is handy for quickly finding "similar" data points in high-dimensional space. This comes up a lot in collaborative filtering and recommendation problems (find similar products to hawk at your customers) and topic modeling (find similar text that is semantically getting at the same thing).
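The "quickly" part usually comes from banding the signatures: split each minhash signature into b bands of r rows, hash each band, and only compare items that collide in at least one band. A minimal sketch on top of minhash signatures like the ones above (bands and rows are the knobs trading recall against candidate volume):

```python
from collections import defaultdict

def lsh_buckets(signatures, bands=32, rows=4):
    """signatures: dict of item_id -> minhash signature (list of ints),
    with len(signature) == bands * rows. Items whose signatures agree on
    all rows of some band land in the same bucket for that band."""
    buckets = defaultdict(set)
    for item_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(item_id)
    return buckets

def candidate_pairs(buckets):
    """Only pairs that share at least one bucket get compared in full."""
    pairs = set()
    for items in buckets.values():
        items = sorted(items)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                pairs.add((items[i], items[j]))
    return pairs
```

With 128-slot signatures, bands=32 and rows=4 make two items candidates if they agree on any one band of four consecutive slots; raising rows makes the filter stricter and cheaper, at the cost of recall.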