Sorry to say this, but this is actually worse than just truncating the hash.
Why? Bloom filters are designed to maintain a reasonable false positive rate while still allowing you to add items extremely quickly (high insertion rate).
Since the bad password list is presumably fixed (or almost so), you can do checks faster and with a lower FP rate by just truncating the hashes and prefix-sorting the list ahead of time.
I get your point but I don't think it's accurate. If the list contains 500 million hashes (approximately 2^29), then in order to achieve a false positive rate of 0.001, you must store at least a 39-bit prefix of each hash. A Bloom filter can do better -- approximately 14 bits per entry -- because it uses multiple hash functions.
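Both figures check out; here's a quick back-of-the-envelope in Python (n and the target FP rate are taken from the comment above, everything else is just arithmetic):

    import math

    n = 500_000_000   # ~2**29 banned-password hashes
    fp = 0.001        # target false positive rate

    # Sorted truncated hashes: a random lookup collides with ~n / 2**b
    # entries, so you need b >= log2(n / fp) prefix bits.
    bits_truncated = math.ceil(math.log2(n / fp))      # -> 39

    # Bloom filter at its optimum: log2(1/fp) / ln(2) bits per entry.
    bits_bloom = math.log2(1 / fp) / math.log(2)       # -> ~14.4

    print(bits_truncated, round(bits_bloom, 1))        # 39 14.4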
But don't you still have to store/manage the whole 30 GB list of hashes? The point of a Bloom filter is that it can record membership of any arbitrary string in 7-8 BITS, depending on how many hash functions you use to test for existence.
If you use a secure hash on the banned passwords, the outputs will be random (as good as arbitrary). Then just truncate them to 7-8 bits, sort by prefix, and search with binary search. It will be more efficient than a Bloom filter because there can be zero empty space between entries: it's a structure similar to a Bloom filter but with no wasted space (no empty entries).
A Bloom filter is great when you need a probabilistic hash table that is fairly efficient and can have records added efficiently.
With a fixed list as I described above, the insertion cost is enormous (on average you move half of the array elements), but the space/speed efficiency reaches the theoretical limit.
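A minimal sketch of that build-once/binary-search scheme, assuming SHA-256 and a 40-bit prefix (both choices are illustrative, not from the thread; per the math upthread, the prefix width sets the FP rate at roughly n / 2**PREFIX_BITS):

    import hashlib
    from bisect import bisect_left

    PREFIX_BITS = 40  # illustrative width; FP rate ~ n / 2**PREFIX_BITS

    def prefix(pw: str) -> int:
        # Truncate a secure hash of the password to the leading prefix bits.
        return int.from_bytes(
            hashlib.sha256(pw.encode()).digest()[:PREFIX_BITS // 8], "big")

    def build(banned):
        # One-time preprocessing: hash, truncate, sort.
        return sorted(prefix(pw) for pw in banned)

    def probably_banned(table, pw: str) -> bool:
        # Binary search over the sorted prefixes; no empty slots anywhere.
        p = prefix(pw)
        i = bisect_left(table, p)
        return i < len(table) and table[i] == p

    table = build(["123456", "password", "qwerty"])
    print(probably_banned(table, "password"))  # True
    print(probably_banned(table, "hunter2"))   # False (barring a prefix collision)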
Your math is bogus. 7-8 bits? With a single entry, that would give a false positive rate of 1/256, or about 0.4%. And with even 1k entries, nearly all 256 possible truncated values would be occupied, so almost every lookup would report a hit.
Or, flipping this: every time you add a password, you check that it hasn't been used before. You can only set 256 passwords, total, before the space of truncated hashes is filled.
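A quick way to see the saturation (hash choice illustrative):

    import hashlib

    # With an 8-bit truncation there are only 256 possible values,
    # so the table saturates almost immediately.
    seen = set()
    for i in range(10_000):
        seen.add(hashlib.sha256(f"pw{i}".encode()).digest()[0])
    print(len(seen))  # 256: every possible truncated hash is now "taken"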
Bloom filters don't do anything special to change FP rates; a compact list of random hashes is always more efficient than a Bloom filter until the filter is 100% full.
But that's still 12G of memory, since you can't compress it. Is there a data structure that's better optimized for space and probabilistic membership tests?
Cuckoo filters and Bloom filters differ in how they handle increased load. As a cuckoo filter's load increases, insertions become more likely to fail, and so its insertion time grows exponentially; the false positive rate, however, stays the same. A Bloom filter can keep accepting insertions, at the cost of an ever-rising false positive rate.
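For the curious, here's a toy partial-key cuckoo filter sketch (8-bit fingerprints, 4-slot buckets; all sizes are illustrative assumptions, not from the thread). The eviction loop in insert() is where the rising cost under load shows up:

    import hashlib
    import random

    class CuckooFilter:
        def __init__(self, num_buckets=1 << 16, bucket_size=4, max_kicks=500):
            assert num_buckets & (num_buckets - 1) == 0, "power of two required"
            self.mask = num_buckets - 1
            self.buckets = [[] for _ in range(num_buckets)]
            self.bucket_size = bucket_size
            self.max_kicks = max_kicks

        def _fp_and_index(self, item):
            d = hashlib.sha256(item.encode()).digest()
            fp = d[0] | 1                              # nonzero 8-bit fingerprint
            return fp, int.from_bytes(d[1:5], "big") & self.mask

        def _alt_index(self, i, fp):
            # XOR with a hash of the fingerprint is an involution (for a
            # power-of-two table), so either index can recover the other.
            h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big")
            return (i ^ h) & self.mask

        def insert(self, item):
            fp, i1 = self._fp_and_index(item)
            i2 = self._alt_index(i1, fp)
            for i in (i1, i2):
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            # Both buckets full: evict a random fingerprint and relocate it.
            # The fuller the table, the longer this loop tends to run.
            i = random.choice((i1, i2))
            for _ in range(self.max_kicks):
                j = random.randrange(len(self.buckets[i]))
                fp, self.buckets[i][j] = self.buckets[i][j], fp
                i = self._alt_index(i, fp)
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            return False  # too loaded: the insertion failure described above

        def __contains__(self, item):
            fp, i1 = self._fp_and_index(item)
            return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

Note the false positive rate is fixed by the fingerprint width regardless of load, while a failed insert() is the hard stop a Bloom filter never hits.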
To store 500 million hashes in 860MB, wouldn't you need to truncate them to 13 bits each? That'd give a false positive rate of 1000 in 1000, slightly worse than the Bloom filter's false positive rate of 1 in 1000.
Depends on the data structure and encoding. For instance, another simple/fast/non-optimal encoding would be a table of 2^32 bit flags (one flag per 32-bit hash prefix), which would weigh in at 0.5 GiB. That would give you a ~11.5% false positive rate.
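A sketch of that bit-flag table, assuming SHA-256 and using the first 32 bits of the hash as the index (note it genuinely allocates the full 512 MiB):

    import hashlib

    table = bytearray((1 << 32) // 8)  # 2**32 bit flags = 512 MiB

    def _slot(pw: str) -> int:
        # First 32 bits of a secure hash of the password.
        return int.from_bytes(hashlib.sha256(pw.encode()).digest()[:4], "big")

    def add(pw: str) -> None:
        s = _slot(pw)
        table[s >> 3] |= 1 << (s & 7)

    def maybe_banned(pw: str) -> bool:
        s = _slot(pw)
        return bool(table[s >> 3] & (1 << (s & 7)))

With 500 million entries inserted, the expected fraction of set bits is 1 - e^(-n/m) ≈ 11% (or n/m ≈ 11.6% if you ignore collisions among the inserted hashes), which is where the ~11.5% figure comes from.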
No, it's not the same structure as a Bloom filter without empty slots. It's the same structure as a Bloom filter with only one hash function. Bloom filters use multiple.
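For concreteness, a minimal multiple-hash-function Bloom filter sketch (the sizing formulas are the standard optima; deriving the k indices by double hashing from one digest is an implementation choice, not something from the thread):

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, n, fp):
            # Standard optimal sizing: m = -n*ln(fp)/ln(2)^2 bits,
            # k = (m/n)*ln(2) hash functions.
            self.m = math.ceil(-n * math.log(fp) / math.log(2) ** 2)
            self.k = max(1, round(self.m / n * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Derive k indices from one SHA-256 digest via double hashing.
            d = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item):
            for p in self._positions(item):
                self.bits[p >> 3] |= 1 << (p & 7)

        def __contains__(self, item):
            return all(self.bits[p >> 3] & (1 << (p & 7))
                       for p in self._positions(item))

    bf = BloomFilter(n=1_000_000, fp=0.001)
    bf.add("password")
    print("password" in bf, "hunter2" in bf)  # True False (with probability ~0.999)

Sized for n = 500 million and fp = 0.001 this comes out to about 7.2 billion bits at k = 10 hash functions, i.e. roughly the 860MB figure quoted upthread.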