
Sorry to say this, but this is actually worse than just truncating the hash.

Why? Bloom filters are designed to maintain a reasonable false positive rate while still allowing you to add items extremely quickly (high insertion rate).

Since the bad password list is presumably fixed (or almost so) you can do checks faster and with a lower FP rate by just truncating the hashes and prefix sorting the list ahead of time.



I get your point but I don't think it's accurate. If the list contains 500 million hashes (approximately 2^29), then in order to achieve a false positive rate of 0.001, you must store at least a 39-bit prefix of each hash. A Bloom filter can do better -- approximately 14 bits per entry -- because it uses multiple hash functions.
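
For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope calculation in Python, assuming n = 500 million entries and a target false positive rate p = 0.001 (the figures from the comment above):

    import math

    n = 500_000_000  # entries in the banned-password list (~2^29)
    p = 0.001        # target false positive rate

    # Truncated b-bit prefix: a random lookup collides with probability ~ n / 2**b,
    # so we need n / 2**b <= p.
    prefix_bits = math.ceil(math.log2(n / p))

    # Optimal Bloom filter: bits per entry = log2(1/p) / ln(2) ~= 1.44 * log2(1/p)
    bloom_bits_per_entry = math.log2(1 / p) / math.log(2)

    print(prefix_bits)                     # 39
    print(round(bloom_bits_per_entry, 1))  # 14.4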


But don't you still have to store/manage the whole 30 GB list of hashes? The point of a Bloom filter is that it can represent an arbitrary string in 7-8 bits, depending on how many hash functions you use to test for existence.


If you use a secure hash on the banned passwords, the hashes will be random (as good as arbitrary). Then just truncate them to 7-8 bits, sort by prefix, and search them using binary search. It will be more efficient than a Bloom filter because there can be zero empty space between entries. It creates a structure similar to a Bloom filter but with no wasted space (no empty entries).

A Bloom filter is great when you need a probabilistic hash table that is fairly efficient and can have records added efficiently.

With a fixed list as I described above, the insertion cost is enormous (on average, half the array elements have to be moved), but the space/speed efficiency reaches the theoretical limit.
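
A minimal sketch of that scheme in Python: hash, truncate to a fixed-width prefix, sort once ahead of time, then binary-search at lookup time. The 5-byte (40-bit) prefix and the helper names (build_filter, probably_banned) are illustrative choices, not anything from the thread:

    import bisect
    import hashlib

    PREFIX_BYTES = 5  # 40-bit prefixes, chosen purely for illustration

    def prefix(password: str) -> bytes:
        # secure hash, truncated to a fixed-width prefix
        return hashlib.sha256(password.encode()).digest()[:PREFIX_BYTES]

    def build_filter(banned_passwords):
        # built once, ahead of time; keeping this sorted is the expensive part
        return sorted(prefix(pw) for pw in banned_passwords)

    def probably_banned(sorted_prefixes, password: str) -> bool:
        p = prefix(password)
        i = bisect.bisect_left(sorted_prefixes, p)
        return i < len(sorted_prefixes) and sorted_prefixes[i] == p

    banned = build_filter(["123456", "password", "qwerty"])
    print(probably_banned(banned, "password"))                     # True
    print(probably_banned(banned, "correct horse battery staple")) # almost certainly False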


Your math is bogus. 7-8 bits? If there's one entry, that would give false positives of 1/256, or about 0.4%. Even with just 1k entries you would likely fill the vast majority of the 256 possible values.

Or, flipping this: every time you add a password, you check that it hasn't been used before. You can only set 256 passwords, total, before the space of truncated hashes is filled.


Bloom filters don't do anything special to change FP rates; a compact list of random hashes is always more efficient than using a Bloom filter until the filter is 100% full.


Thanks for the clarification; with a fixed list that makes sense.


But that's still 12G of memory, since you can't compress it. Is there a data structure that's better optimized for space and probabilistic membership tests?


Cuckoo Filter:

https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

Cuckoo filters and Bloom filters differ in how they handle increased load. As a cuckoo filter's load increases, insertions are more likely to fail, so insertion time rises sharply; the false positive rate, however, remains the same. Bloom filters can keep inserting items, at the cost of an ever-rising false positive rate.

https://brilliant.org/wiki/cuckoo-filter
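
A toy sketch in Python of the partial-key cuckoo hashing idea the paper describes: short fingerprints stored in small buckets, with the alternate bucket derived by XORing the bucket index with a hash of the fingerprint. The parameters and helper names are illustrative, not the paper's reference implementation:

    import hashlib
    import random

    class CuckooFilter:
        def __init__(self, num_buckets=1 << 16, bucket_size=4, max_kicks=500):
            assert num_buckets & (num_buckets - 1) == 0  # power of two so XOR indexing works
            self.mask = num_buckets - 1
            self.bucket_size = bucket_size
            self.max_kicks = max_kicks
            self.buckets = [[] for _ in range(num_buckets)]

        def _fingerprint(self, item: str) -> int:
            h = hashlib.sha256(item.encode()).digest()
            return (h[0] % 255) + 1  # 8-bit, non-zero fingerprint

        def _index(self, item: str) -> int:
            h = hashlib.sha256(item.encode()).digest()
            return int.from_bytes(h[1:5], "big") & self.mask

        def _alt_index(self, index: int, fp: int) -> int:
            # i2 = i1 XOR hash(fingerprint); applying it twice gets you back to i1
            h = hashlib.sha256(bytes([fp])).digest()
            return (index ^ int.from_bytes(h[:4], "big")) & self.mask

        def insert(self, item: str) -> bool:
            fp = self._fingerprint(item)
            i1 = self._index(item)
            i2 = self._alt_index(i1, fp)
            for i in (i1, i2):
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            # both buckets full: evict a victim and relocate it, up to max_kicks times
            i = random.choice((i1, i2))
            for _ in range(self.max_kicks):
                j = random.randrange(self.bucket_size)
                fp, self.buckets[i][j] = self.buckets[i][j], fp
                i = self._alt_index(i, fp)
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            return False  # filter too full; this is the failure mode mentioned above

        def contains(self, item: str) -> bool:
            fp = self._fingerprint(item)
            i1 = self._index(item)
            return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    cf = CuckooFilter()
    cf.insert("password")
    print(cf.contains("password"))  # True
    print(cf.contains("hunter2"))   # very likely False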


You can compress it by just truncating the hashes to be as short as you want.


To store 500 million hashes in 860MB, wouldn't you need to truncate them to 13 bits each? That'd give a false positive rate of 1000 in 1000, slightly worse than the Bloom filter's false positive rate of 1 in 1000.
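
Roughly, assuming an 860 MB budget spread over 500 million fixed-width entries:

    n = 500_000_000
    budget_bits = 860_000_000 * 8

    bits_per_entry = budget_bits // n            # 13
    fp_rate = min(1.0, n / 2 ** bits_per_entry)  # 1.0 -- every 13-bit value is already taken
    print(bits_per_entry, fp_rate)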


Depends on the data structure and encoding. For instance, another simple/fast/non-optimal encoding would be a table of 2^32 one-bit flags indexed by a 32-bit hash prefix, which would weigh in at 0.5 GiB. That would give you a ~11.5% false positive rate.
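
Something along these lines (a minimal sketch; slot/add/probably_banned are just illustrative names, and it is effectively a Bloom filter with a single hash function):

    import hashlib

    # a flat table of 2**32 one-bit flags = 512 MiB (this really allocates it)
    table = bytearray(1 << 29)

    def slot(password: str) -> int:
        # the top 32 bits of a secure hash pick the flag
        return int.from_bytes(hashlib.sha256(password.encode()).digest()[:4], "big")

    def add(password: str) -> None:
        i = slot(password)
        table[i >> 3] |= 1 << (i & 7)

    def probably_banned(password: str) -> bool:
        i = slot(password)
        return bool(table[i >> 3] & (1 << (i & 7)))

    add("123456")
    print(probably_banned("123456"))       # True
    print(probably_banned("tr0ub4dor&3"))  # very likely False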


I think the method you're describing is equivalent to a Bloom filter that only uses one hash function per key.


It's the same structure as a Bloom filter but more efficient because it has no empty slots.


No, it's not the same structure as a Bloom filter without empty slots. It's the same structure as a Bloom filter with only one hash function. Bloom filters use multiple.
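
For contrast, a minimal k-hash Bloom filter sketch in Python (deriving the k indexes from slices of a single SHA-256 digest is just a convenience for the example, not a recommendation):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits: int, num_hashes: int):
            assert num_hashes <= 8  # a 32-byte digest gives at most eight 4-byte slices
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _indexes(self, item: str):
            # derive k indexes from non-overlapping slices of one digest
            digest = hashlib.sha256(item.encode()).digest()
            for k in range(self.num_hashes):
                yield int.from_bytes(digest[4 * k:4 * k + 4], "big") % self.num_bits

        def add(self, item: str) -> None:
            for i in self._indexes(item):
                self.bits[i >> 3] |= 1 << (i & 7)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[i >> 3] & (1 << (i & 7)) for i in self._indexes(item))

    bf = BloomFilter(num_bits=1 << 20, num_hashes=7)
    bf.add("password")
    print("password" in bf)  # True
    print("hunter2" in bf)   # very likely False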


Multiple hash functions only exist because Bloom filters don't use secure hashes (for speed reasons), so there are collisions.

Normally for passwords you would use a secure hash, negating this


No, the math on bloom filters already assumes a random oracle.


Lookups in a Bloom filter are basically constant time. Not sure you can go faster than that. Check out Eugene Spafford's 1992 paper.

https://dl.acm.org/citation.cfm?id=134593


To make a sorted hash list as space-efficient as a Bloom filter you need to compress it by taking advantage of the shared prefixes of consecutive hashes.
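
One concrete way to do that is to store the gaps between consecutive sorted prefixes and Golomb/Rice-code them; the shared-prefix view and the gap view are two sides of the same idea. A toy Python sketch, with the prefix width and Rice parameter picked arbitrarily for the example:

    import hashlib

    def prefixes(passwords, bits=40):
        # sorted, de-duplicated truncated hashes
        vals = {int.from_bytes(hashlib.sha256(p.encode()).digest(), "big") >> (256 - bits)
                for p in passwords}
        return sorted(vals)

    def rice_encode(sorted_vals, k):
        # store each gap to the previous value as unary(quotient) + k-bit remainder
        out, prev = [], 0
        for v in sorted_vals:
            gap, prev = v - prev, v
            q, r = gap >> k, gap & ((1 << k) - 1)
            out.append("1" * q + "0" + format(r, "0%db" % k))
        return "".join(out)

    def rice_decode(bits_str, k, count):
        vals, pos, prev = [], 0, 0
        for _ in range(count):
            q = 0
            while bits_str[pos] == "1":
                q, pos = q + 1, pos + 1
            pos += 1  # skip the terminating 0
            prev += (q << k) | int(bits_str[pos:pos + k], 2)
            pos += k
            vals.append(prev)
        return vals

    vals = prefixes(["123456", "password", "qwerty", "letmein"])
    encoded = rice_encode(vals, k=38)  # k ~ log2(expected gap) for 40-bit prefixes, 4 entries
    assert rice_decode(encoded, k=38, count=len(vals)) == vals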


You're right, although I don't think compression would save a ton of space since the hashes are random. It might save a bit or maybe two per entry.


Would that be a trie?


Probably some sort of radix tree (which is a kind of trie) would be best.



