Sorry to say this, but this is actually worse than just truncating the hash.
Why? Bloom filters are designed to maintain a reasonable false positive rate while still allowing you to add items extremely quickly (high insertion rate).
Since the bad password list is presumably fixed (or almost so), you can do checks faster and with a lower FP rate by just truncating the hashes and prefix-sorting the list ahead of time.
I get your point but I don't think it's accurate. If the list contains 500 million hashes (approximately 2^29), then in order to achieve a false positive rate of 0.001, you must store at least a 39-bit prefix of each hash. A Bloom filter can do better -- approximately 14 bits per entry -- because it uses multiple hash functions.
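Both figures check out; here's a quick back-of-the-envelope in Python (n and the target FP rate are taken from the comment above, everything else is just arithmetic):

    import math

    n = 500_000_000   # ~2**29 banned-password hashes
    fp = 0.001        # target false positive rate

    # Sorted truncated hashes: a random lookup collides with ~n / 2**b
    # entries, so you need b >= log2(n / fp) prefix bits.
    bits_truncated = math.ceil(math.log2(n / fp))      # -> 39

    # Bloom filter at its optimum: log2(1/fp) / ln(2) bits per entry.
    bits_bloom = math.log2(1 / fp) / math.log(2)       # -> ~14.4

    print(bits_truncated, round(bits_bloom, 1))        # 39 14.4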
But don't you still have to store/manage the whole 30 GB list of hashes? The point of a Bloom filter is that it can record membership of any arbitrary string in 7-8 BITS, depending on how many hash functions you use to test for existence.
If you use a secure hash on the banned passwords, the outputs will be random (as good as arbitrary). Then just truncate them to 7-8 bits, sort by prefix, and search with binary search. It will be more efficient than a Bloom filter because there can be zero empty space between entries: it's a structure similar to a Bloom filter but with no wasted space (no empty entries).
A Bloom filter is great when you need a probabilistic hash table that is fairly efficient and can have records added efficiently.
With a fixed list as I described above, the insertion cost is enormous (on average you move half of the array elements), but the space/speed efficiency reaches the theoretical limit.
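A minimal sketch of that build-once/binary-search scheme, assuming SHA-256 and a 40-bit prefix (both choices are illustrative, not from the thread; per the math upthread, the prefix width sets the FP rate at roughly n / 2**PREFIX_BITS):

    import hashlib
    from bisect import bisect_left

    PREFIX_BITS = 40  # illustrative width; FP rate ~ n / 2**PREFIX_BITS

    def prefix(pw: str) -> int:
        # Truncate a secure hash of the password to the leading prefix bits.
        return int.from_bytes(
            hashlib.sha256(pw.encode()).digest()[:PREFIX_BITS // 8], "big")

    def build(banned):
        # One-time preprocessing: hash, truncate, sort.
        return sorted(prefix(pw) for pw in banned)

    def probably_banned(table, pw: str) -> bool:
        # Binary search over the sorted prefixes; no empty slots anywhere.
        p = prefix(pw)
        i = bisect_left(table, p)
        return i < len(table) and table[i] == p

    table = build(["123456", "password", "qwerty"])
    print(probably_banned(table, "password"))  # True
    print(probably_banned(table, "hunter2"))   # False (barring a prefix collision)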
Your math is bogus. 7-8 bits? With a single entry, that would give a false positive rate of 1/256, or about 0.4%. And with even 1k entries, nearly all 256 possible truncated values would be occupied, so almost every lookup would report a hit.
Or, flipping this: every time you add a password, you check that it hasn't been used before. You can only set 256 passwords, total, before the space of truncated hashes is filled.
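A quick way to see the saturation (hash choice illustrative):

    import hashlib

    # With an 8-bit truncation there are only 256 possible values,
    # so the table saturates almost immediately.
    seen = set()
    for i in range(10_000):
        seen.add(hashlib.sha256(f"pw{i}".encode()).digest()[0])
    print(len(seen))  # 256: every possible truncated hash is now "taken"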
Bloom filters don't do anything special to change FP rates; a compact list of random hashes is always more efficient than a Bloom filter until the filter is 100% full.
But that's still 12G of memory, since you can't compress it. Is there a data structure that's better optimized for space and probabilistic membership tests?
Cuckoo filters and Bloom filters differ in how they handle increased load. As a cuckoo filter's load increases, insertions become more likely to fail, and so its insertion time grows exponentially; the false positive rate, however, stays the same. A Bloom filter can keep accepting insertions, at the cost of an ever-rising false positive rate.
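For the curious, here's a toy partial-key cuckoo filter sketch (8-bit fingerprints, 4-slot buckets; all sizes are illustrative assumptions, not from the thread). The eviction loop in insert() is where the rising cost under load shows up:

    import hashlib
    import random

    class CuckooFilter:
        def __init__(self, num_buckets=1 << 16, bucket_size=4, max_kicks=500):
            assert num_buckets & (num_buckets - 1) == 0, "power of two required"
            self.mask = num_buckets - 1
            self.buckets = [[] for _ in range(num_buckets)]
            self.bucket_size = bucket_size
            self.max_kicks = max_kicks

        def _fp_and_index(self, item):
            d = hashlib.sha256(item.encode()).digest()
            fp = d[0] | 1                              # nonzero 8-bit fingerprint
            return fp, int.from_bytes(d[1:5], "big") & self.mask

        def _alt_index(self, i, fp):
            # XOR with a hash of the fingerprint is an involution (for a
            # power-of-two table), so either index can recover the other.
            h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big")
            return (i ^ h) & self.mask

        def insert(self, item):
            fp, i1 = self._fp_and_index(item)
            i2 = self._alt_index(i1, fp)
            for i in (i1, i2):
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            # Both buckets full: evict a random fingerprint and relocate it.
            # The fuller the table, the longer this loop tends to run.
            i = random.choice((i1, i2))
            for _ in range(self.max_kicks):
                j = random.randrange(len(self.buckets[i]))
                fp, self.buckets[i][j] = self.buckets[i][j], fp
                i = self._alt_index(i, fp)
                if len(self.buckets[i]) < self.bucket_size:
                    self.buckets[i].append(fp)
                    return True
            return False  # too loaded: the insertion failure described above

        def __contains__(self, item):
            fp, i1 = self._fp_and_index(item)
            return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

Note the false positive rate is fixed by the fingerprint width regardless of load, while a failed insert() is the hard stop a Bloom filter never hits.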
To store 500 million hashes in 860MB, wouldn't you need to truncate them to 13 bits each? That'd give a false positive rate of 1000 in 1000, slightly worse than the Bloom filter's false positive rate of 1 in 1000.
Depends on the data structure and encoding. For instance, another simple/fast/non-optimal encoding would be a table of 2^32 bit flags (one flag per 32-bit hash prefix), which would weigh in at 0.5 GiB. That would give you a ~11.5% false positive rate.
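A sketch of that bit-flag table, assuming SHA-256 and using the first 32 bits of the hash as the index (note it genuinely allocates the full 512 MiB):

    import hashlib

    table = bytearray((1 << 32) // 8)  # 2**32 bit flags = 512 MiB

    def _slot(pw: str) -> int:
        # First 32 bits of a secure hash of the password.
        return int.from_bytes(hashlib.sha256(pw.encode()).digest()[:4], "big")

    def add(pw: str) -> None:
        s = _slot(pw)
        table[s >> 3] |= 1 << (s & 7)

    def maybe_banned(pw: str) -> bool:
        s = _slot(pw)
        return bool(table[s >> 3] & (1 << (s & 7)))

With 500 million entries inserted, the expected fraction of set bits is 1 - e^(-n/m) ≈ 11% (or n/m ≈ 11.6% if you ignore collisions among the inserted hashes), which is where the ~11.5% figure comes from.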
No, it's not the same structure as a Bloom filter without empty slots. It's the same structure as a Bloom filter with only one hash function. Bloom filters use multiple.
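For concreteness, a minimal multiple-hash-function Bloom filter sketch (the sizing formulas are the standard optima; deriving the k indices by double hashing from one digest is an implementation choice, not something from the thread):

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, n, fp):
            # Standard optimal sizing: m = -n*ln(fp)/ln(2)^2 bits,
            # k = (m/n)*ln(2) hash functions.
            self.m = math.ceil(-n * math.log(fp) / math.log(2) ** 2)
            self.k = max(1, round(self.m / n * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Derive k indices from one SHA-256 digest via double hashing.
            d = hashlib.sha256(item.encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item):
            for p in self._positions(item):
                self.bits[p >> 3] |= 1 << (p & 7)

        def __contains__(self, item):
            return all(self.bits[p >> 3] & (1 << (p & 7))
                       for p in self._positions(item))

    bf = BloomFilter(n=1_000_000, fp=0.001)
    bf.add("password")
    print("password" in bf, "hunter2" in bf)  # True False (with probability ~0.999)

Sized for n = 500 million and fp = 0.001 this comes out to about 7.2 billion bits at k = 10 hash functions, i.e. roughly the 860MB figure quoted upthread.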