The bloom filter will tell you whether _the hash of your password_ is in the set. It doesn't matter whether your _password_ matches a pwned one, it only matters whether your password has the same hash as one that is pwned. This tool expands on that by simplifying the test such that it gets more false positives, but still has no false negatives.
Imagine that password hashes are six characters, and `abcdef` is the only pwned password we know about. A smaller version of this tool would reject your password if it hashes to anything that starts with `abc` (such as `abc000`, `abcdee`, `abcdef`, etc). With this toy example, we can see that this is deliberately pessimistic -- it only is a "your password might be pwned" indicator.
It's like throwing out anything in your fridge that is past its expiration date, even if you don't know whether it's actually spoiled.
>Imagine that password hashes are six characters, and `abcdef` is the only pwned password we know about. A smaller version of this tool would reject your password if it hashes to anything that starts with `abc` (such as `abc000`, `abcdee`, `abcdef`, etc).
Doesn't that not work very well at all though? If these "abcXXX" strings are hashes, than just because they share a prefix doesn't mean the unhashed password are even remotely similar.
"good-password-123-dog-xxx..." could hash to "abcdeg" and "password" could be the thing that hashed to "abcdef".
You're completely right, but I think you're misinterpreting the usefulness of this bloom filter. It would most likely serve only as the first layer in a password strength check.
Because there is no possibility of a false negative, you know that if the bloom filter does not include the hash you gave it, the password definitely does not exist in the original corpus. In this case, the password is probably at least semi-unique. If you're using this bloom filter in the context of a web application registration process, this would likely be the case for anyone trying to sign up with a strong/unique password, e.g. a random one generated by a password manager.
However, if the bloom filter maybe contains the hash, you would likely fall back to some more expensive strength check, like querying the full HIBP database to see if this exact password definitely exists in there.
Similarity between passwords doesn't really matter for the purposes of this checker - all it's looking for is a _possible_ exact match. If your password hashes to "abcXXX", then there are two possibilities:
- "abc" is in the bloom filter, and so your password _might_ be in the list (depending on what the remaining characters past "abc" are)
- "abc" is not in the bloom filter, so your password _is definitely not_ in the list.
This results in a high rate of false positives (times when a password is incorrectly thought to be part of the list, but it's not) but that's deemed to be an acceptable tradeoff for this algorithm.
Collisions for the entire hash are rare, but this is just a prefix of the hash. I imagine it is significantly more common for a collision in the prefix.
I maintain a library that talks to the Have I Been Pwned password database, which uses a 5-hexadecimal-digit prefix of the SHA1 hash.
Someone filed a PR asking to add a cache for HIBP's responses, and I did some quick math and came up with ~1200 as a birthday-paradox number for the first five digits of a SHA1 hex digest (assuming uniform/random distribution of hex digits).
Imagine that password hashes are six characters, and `abcdef` is the only pwned password we know about. A smaller version of this tool would reject your password if it hashes to anything that starts with `abc` (such as `abc000`, `abcdee`, `abcdef`, etc). With this toy example, we can see that this is deliberately pessimistic -- it only is a "your password might be pwned" indicator.
It's like throwing out anything in your fridge that is past its expiration date, even if you don't know whether it's actually spoiled.