I've been trying for weeks now to get a system running that can handle larger-than-RAM datasets and return queries in acceptable time. It's running OK now, but far from optimal (the DB is ~100 GB and contains a few hundred million entries).
Does anyone here have experience with any of the implementations (likelike, lshkit, etc.) and can recommend one that handles larger sets? All the implementations I have found were either unmaintained, outdated, not working, or not suitable for production use.
Will definitely take a look at the paper, but unfortunately it's always a very long way from a paper to an actual implementation (no code has been published as far as I can see).
I've been playing with an implementation on top of Lightning MDB (LMDB)[1]. Your profile doesn't list an email, but feel free to email me if you're interested.
100GB isn't that big a deal. If you have at least 16GB of RAM it should be a breeze. There are much larger data sets in OpenLDAP in production around the world.
But I wouldn't choose Python for large-scale data processing work. The Python CPU/memory overhead is something like 100:1 compared to C. (This is why I worked on rtorrent and ditched the original BitTorrent client ASAP, and why I hate BitBake....)
The biggest problem at the moment is actually degrading performance, though I'm almost 100% sure this isn't caused by LMDB itself but by the bindings I've tried.
In the end, doing it directly in C is probably the only thing that will actually work.
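FWIW, before dropping all the way to C it may be worth checking which knobs the bindings expose. Assuming the py-lmdb bindings (just a guess at what's in use), the settings that tend to matter most for a larger-than-RAM DB are map_size, readahead, and batching writes into a small number of transactions. A rough sketch, with made-up paths and sizes:

```python
import lmdb

# Hypothetical path and sizes; adjust to the actual dataset.
env = lmdb.open(
    "/data/lsh-index",
    map_size=200 * 2**30,   # maximum DB size (200 GiB); LMDB mmaps this lazily
    readahead=False,        # usually recommended once the DB no longer fits in RAM
    writemap=True,          # can speed up writes; read the caveats in the docs first
    map_async=True,         # let the OS flush asynchronously (weaker durability)
)

# Batch writes into one transaction instead of one transaction per key;
# per-transaction overhead is a common source of binding-level slowness.
with env.begin(write=True) as txn:
    for i in range(1000):
        txn.put(("doc:%08d" % i).encode(), b"signature-bytes")

# Reads: a single read transaction can serve many lookups cheaply.
with env.begin() as txn:
    print(txn.get(b"doc:00000042"))
```

None of this is guaranteed to explain the degradation you're seeing, but disabling readahead and coalescing transactions are the two cheapest things to rule out before rewriting in C.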
This is the first time I've heard of nilsimsa. Looking through the docs for the algorithm, this is how it's described, but I'm still not clear on what a "character separation" is in this context or which three characters are chosen:
> Nilsimsa uses eight sets of character separations (character, character, etc.) and takes the three characters and performs a computation on them: (((tran[((a)+(n))&255]^tran[(b)]*((n)+(n)+1))+tran[(c)^tran[n]])&255), where a, b, and c are the characters, n is 0-7 indicating which separation, and tran is a permutation of [0-255]. The result is a byte, and nilsimsa throws all these bytes from all eight separations into one histogram and encodes the histogram.
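Not an authoritative answer, but the per-trigram step translates almost directly from that quote. As far as I can tell from the reference code, the trigrams come from a short sliding window over the input, and the eight "separations" are eight different choices of three positions within that window (I haven't verified the exact combinations). Here's a sketch of just the accumulation step, with a randomly generated stand-in for the real tran permutation (the real one is a fixed constant baked into the implementation):

```python
import random

# Stand-in for nilsimsa's fixed 256-byte permutation table.
rng = random.Random(0)
TRAN = list(range(256))
rng.shuffle(TRAN)

def tran3(a, b, c, n):
    """The quoted per-trigram computation: a, b, c are character codes,
    n in 0..7 says which of the eight 'separations' produced the trigram."""
    return ((TRAN[(a + n) & 255] ^ TRAN[b] * (n + n + 1)) + TRAN[c ^ TRAN[n]]) & 255

def accumulate(trigrams_by_separation):
    """Throw every tran3 byte, from all eight separations, into one
    256-bin histogram (which nilsimsa then thresholds into a 256-bit digest)."""
    hist = [0] * 256
    for n, trigrams in enumerate(trigrams_by_separation):   # n = 0..7
        for a, b, c in trigrams:
            hist[tran3(a, b, c, n)] += 1
    return hist
```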
I've found that in practice, shingling text and then minhashing it is scary-good at finding similar documents.
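For anyone who hasn't tried it, the whole pipeline is small: break the text into overlapping character k-grams ("shingles"), keep the minimum of each of a family of hash functions, and the fraction of matching minima estimates the Jaccard similarity of the shingle sets. A toy sketch (k=5 and 128 hashes are arbitrary defaults, and salted md5 stands in for a proper hash family):

```python
import hashlib

def shingles(text, k=5):
    """Overlapping character k-grams of the normalized input text."""
    text = " ".join(text.split()).lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash(shingle_set, num_hashes=128):
    """One minimum per 'hash function'; each slot is md5 with a different
    salt, which is slow but easy to follow."""
    sig = []
    for i in range(num_hashes):
        salt = str(i).encode()
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("The quick brown fox jumps over the lazy dog"))
b = minhash(shingles("The quick brown fox jumped over a lazy dog"))
print(similarity(a, b))
```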
For some context, locality sensitive hashing is handy for quickly finding "similar" data points in high-dimensional space. This comes up a lot in collaborative filtering and recommendation problems (find similar products to hawk at your customers) and topic modeling (find similar text that is semantically getting at the same thing).
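The "quickly" part usually comes from banding the signatures: split each minhash signature into b bands of r rows, hash each band, and only compare items that collide in at least one band. A minimal sketch on top of minhash signatures like the ones above (bands and rows are the knobs trading recall against candidate volume):

```python
from collections import defaultdict

def lsh_buckets(signatures, bands=32, rows=4):
    """signatures: dict of item_id -> minhash signature (list of ints),
    with len(signature) == bands * rows. Items whose signatures agree on
    all rows of some band land in the same bucket for that band."""
    buckets = defaultdict(set)
    for item_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(item_id)
    return buckets

def candidate_pairs(buckets):
    """Only pairs that share at least one bucket get compared in full."""
    pairs = set()
    for items in buckets.values():
        items = sorted(items)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                pairs.add((items[i], items[j]))
    return pairs
```

With 128-slot signatures, bands=32 and rows=4 make two items candidates if they agree on any one band of four consecutive slots; raising rows makes the filter stricter and cheaper, at the cost of recall.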