Which hashing algorithm is best for uniqueness and speed? (programmers.stackexchange.com)
243 points by fekberg on Oct 14, 2013 | 81 comments



A more thorough test suite for hash functions is SMHasher, by the author of MurmurHash. It measures collisions, output uniformity, biased bits, etc., as well as speed, under several input distributions.

https://code.google.com/p/smhasher/

Here is the output for some popular hash functions:

https://code.google.com/p/smhasher/wiki/SMHasherDump


If someone could run smhasher for the hash functions in the original article (and perhaps also MurmurHash 2/3) and summarize the results for the small key sizes, I will give them a cookie.


I REALLY think we're all tired of random people giving us yet another cookie to store...


It's worth remembering that when selecting things at random from a given pool, there is always a chance of collision. If you think of a hash as taking a value to a "random" result, with different objects going to results uniformly "at random", then you have to expect collisions. This is why the term "perfect hash" is defined so carefully.

For those who are interested, the probability of collision, the size of the hash destination, and the number of objects chosen are related by this formula:

    Choosing from N items, with N large, and wanting a
    probability T of having no collisions, the number of
    items you can choose at random with replacement is:

        k ~ sqrt( -2 N ln(T) )
So

        T ~ exp(-(k^2)/(2N))
Source: http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN2013...
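
To make that concrete, here is a minimal sketch in C (my own illustration, not code from the linked page) that plugs a 32-bit hash space into both expressions:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double N = 4294967296.0;             /* 2^32 possible 32-bit hash values  */
        double T = 0.5;                      /* want a 50% chance of no collision */

        double k = sqrt(-2.0 * N * log(T));  /* items you can hash before that    */
        printf("k ~ %.0f\n", k);             /* ~77163                            */

        printf("T ~ %.3f\n", exp(-(k * k) / (2.0 * N)));  /* back to ~0.5 */
        return 0;
    }

That ~77,000 figure is the same birthday bound that comes up further down the thread for 32-bit hashes.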


There's always a chance of a collision, but for a large enough pool relative to the number of objects the chance may be sufficiently smaller than the chance of hardware failure that it can be legitimately ignored. For something where the entire space must in some way be realized (eg. hashtable) this won't be the case, obviously.


Most of the time, that isn't true. Hardware failures are (barring quality control issues) independent; hash collisions are not.

For example, if you assume your hash will not have collisions, and design your hash table without any mechanism for coping with one, and it happens to have a clash on "Jones" and "Smith", your address book app will become unreliable for a rather large fraction of your users.


But at the scales we're talking about, it won't happen to have a clash on "Jones" and "Smith". This won't be a hash table - this will be things like a git commit hash, where even generating one hash per nanosecond would take tens of millions of years before a collision becomes likely, and where the storage required just to store the hashes would be in the yottabytes.


My research group at CMU uses a lot of hashing these days and we care mostly about performance. We've settled on three as our basics:

* CityHash as a nice, general-purpose hash, particularly on newer machines with CRC32c support in hardware;

* SpookyHash for longer strings, particularly with no CRC32 support;

* CRC32c in hardware for a quick & dirty hash with a possibly slightly poorer distribution.

It depends a lot on what you're hashing. For strings, both City and Spooky are great. You should benchmark both on your target architecture(s). If you're trying to take, e.g., integers to integers, we've found that CRC32c is a pretty darn good fallback if you need insanely high performance.
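
For the integer-to-integer case, the hardware path is essentially one intrinsic per word. A minimal sketch of the idea (assuming GCC or Clang on an SSE4.2-capable x86; not our production code, and the starting value is an arbitrary choice):

    /* Sketch only: CRC32c of a 64-bit key via the SSE4.2 instruction.
       Build with e.g.:  gcc -O2 -msse4.2 crc_hash.c                    */
    #include <nmmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    static inline uint32_t crc32c_hash_u64(uint64_t key)
    {
        /* any fixed starting value works; ~0 is a common choice */
        return (uint32_t)_mm_crc32_u64(0xFFFFFFFFu, key);
    }

    int main(void)
    {
        for (uint64_t i = 0; i < 4; ++i)
            printf("%llu -> %08x\n", (unsigned long long)i, crc32c_hash_u64(i));
        return 0;
    }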

If you're running a hash table exposed to externally controlled input, then you also want to consider (whether you need) algorithms that provide decent resistance against "hash flooding". SipHash is the strongest contender there short of throwing your hands in the air and using MD5 with a local secret seed.


I think it's worth reiterating your last paragraph. If your hash table is exposed to external input, an effective DoS attack can be mounted by turning your hash table into an expensive linked list.

http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.h...


One issue to be careful of is assuming that you don't need the guarantees a cryptographic hash provides (preimage resistance, extension resistance, etc.) in your application. If an adversary can force hash collisions they may, depending on your uses for the hashing algorithm, be able to cause degradation or denial of service, or perhaps trigger information disclosures if they're very smart. If there's any way in which an attacker might be able to have some control over the dataset you are hashing, you should probably be looking at using a cryptographic hash anyway.


Is denial of service from an attacker who has your source something you actually consider when building a product?

I can see trying to prevent data breaches and such, but going so far as to be wary of denial of service attacks from people who have your source code might be wasted effort.


The prevalence of frameworks means attackers probably have your source, or at least enough of it to be interesting.

This specific denial of service attack hit a lot of systems in 2012, and all you needed to know was which framework they were using (or just try all of them, as the attack has incredible leverage).

It's as simple as crafting a query string that results in O(n²) hash-table insertion behavior, letting you burn minutes of server CPU with each request.


Coming from the perspective of a startup that is building a product, future risk/estimated damage from DOS attacks is not worth any of today's efforts to prevent. I mean, weren't they all able to fix it after the fact and not suffer much in the end?

These are problems you only run into if you've got a successful product (which happens very rarely) - and solving these problems before you have users might not be a good allocation of engineering resources for startups.

I liken it to making a system that scales in advance to millions of concurrent users when you haven't even got one.


The thing is that anywhere in your code where you are doing something which involves hashing data to create a digest for some purpose you are almost certainly attempting to do something a bit clever (I'm going to map the problem from the domain of entities to a domain of proxies for entities that are smaller and easier to deal with. I'll just deal with the special cases...). And if you're making a decision to use Murmur2 in that role, you're probably trying to be very clever. Whenever a programmer tries to do something clever, in my experience, there is something they haven't thought about. Often the something they haven't thought about leads to a security vulnerability.

The fact is it's often difficult to tell the difference between a place where the security properties of the digest algorithm you use are irrelevant, and a place where it might be critical. For example, a zip file uses a CRC32 checksum to validate each file; git uses SHA1 to uniquely identify objects; CouchDB shard mapping uses a strategy based on CRC32... if git had used murmur, would that cause problems for it? Is CouchDB being clever by using a fast hash for sharding at the expense of being secure?

Now, quick, make a decision which algorithm to use to generate Etag headers for your web API HTTP response entities: are you going to go with SHA256 or Murmur2? Murmur2 is much faster, and it's certainly a clever solution... but are you sure you didn't create a security hole?

I'm not saying "nobody ever got fired for choosing SHA256," but I am kind of saying that you have to be very sure you know the security model you're operating in before you build on a hash algorithm that doesn't provide strong extension and preimage resistance.


You should code as if any attacker has your source. If denial of service attacks are something you need to consider, you need to consider them from someone who has your source.


Denial of service attacks are usually not devastating (they've happened to many large services multiple times) and there's a very real engineering cost in being absolutely certain that you are immune to such attacks.

I think most people here don't fall under the umbrella of those who need to consider denial-of-service attacks. When your primary goal should be to launch your product (which you aren't even sure will take off) I would think trying to prevent denial-of-service attacks is a waste of time.


I'm not sure I disagree with any of that. I simply said that if you need to consider DoS you should consider DoS with the assumption your attacker has access to your source.


My understanding is that (in general) for associative containers, optimized trie implementations are faster than hash tables with cryptographic hashes.


>>FNV-1a collisions

>>altarage collides with zinke

>>altarages collides with zinkes

I found this most interesting. I have no idea how FNV-1a is constructed but I'm sure this is an interesting (mathematical?) quirk in an otherwise great algorithm since it had only 4 collisions for more than a couple of hundred thousand inputs.


It's not surprising, and it is not peculiar to FNV-1a. Most of these hashes will exhibit this same behavior.

Since the hash iterates over the characters in the string, once you find two colliding strings S and T, appending any other string to both gives a new pair S' and T' whose hashes are also identical. Try it yourself:

    #include <stdio.h>
    #include "hash_32a.c"
    int main(int argc, char* argv[])
    {
        const char* s[] = {
            "altarage",
            "zinke",
            "altarage_foo",
            "zinke_foo",
            "altarage_xyzzy",
            "zinke_xyzzy"
        };
        int i;
        for (i = 0; i < sizeof s / sizeof (char*); ++i)
        {
            Fnv32_t x = fnv_32a_str(s[i], FNV1_32A_INIT);
            printf("%s: %08x\n", s[i], x);
        }
    }


    $ ./test_fnv1a
    altarage:       e460d8b6
    zinke:          e460d8b6
    altarage_foo:   3d8619c5
    zinke_foo:      3d8619c5
    altarage_xyzzy: 2a6373cf
    zinke_xyzzy:    2a6373cf


I saw that too. DJB2a also has

>>playwright collides with snush

>>playwrighting collides with snushing
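
Same mechanism as the FNV-1a example above: once the internal state matches, any common suffix keeps it matching. A quick sketch with DJB2a (start at 5381, then hash * 33 XOR c), reusing the article's colliding pair:

    #include <stdio.h>

    /* DJB2a as used in the article: h = 5381, then h = h * 33 ^ c per byte */
    static unsigned int djb2a(const char *s)
    {
        unsigned int h = 5381u;
        for (; *s; ++s)
            h = (h * 33u) ^ (unsigned char)*s;
        return h;
    }

    int main(void)
    {
        const char *w[] = { "playwright", "snush", "playwrighting", "snushing" };
        for (int i = 0; i < 4; ++i)
            printf("%-14s %08x\n", w[i], djb2a(w[i]));
        return 0;
    }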


Yeah, that seems to violate the goal of the avalanche effect.

http://en.wikipedia.org/wiki/Avalanche_effect


That's a design goal for cryptographic hash functions, which these are not.


For something like English words, where you will have a lot of common prefixes and suffixes, it seems a legitimate design goal as well (though it may not be a tradeoff worth making, to be sure).


For those that love this topic, xxHash[1] has an SMHasher score of 10[2] and is ~2x as fast as Murmur3 -- a really interesting, incredibly fast, high-quality hash function.

  [1] https://code.google.com/p/xxhash
  [2] https://code.google.com/p/xxhash/wiki/xxh32


I was going to say this as well. It has a Node.js port that I'm using for static file versioning, and the speed is great.


It doesn't mention SipHash: https://131002.net/siphash/


If you have the reputation points, please suggest it to be included in the test on Stack Exchange.

SipHash was developed as a replacement for CityHash and MurmurHash, as well as being a fast and secure PRF.


I added a comment to suggest it.


Thanks!


Nice answer. However, in practice current Intel processors do CRC32 in hardware. This blows away the rest and has really nice results. So I use portable code that uses Intel's slicing-by-8 CRC32 implementation on generic machines and then automatically switches to SSE4.2 on the x86s most people are actually using.


Nice analysis.

Since that post was submitted, MurmurHash3 and a number of other notable hash functions have been released. If you liked the linked post, you should go read Aggregate Knowledge's series of posts on hash functions for a look at some newer algorithms: http://blog.aggregateknowledge.com/2012/02/02/choosing-a-goo...


The best hash algorithm for uniqueness and speed uses knowledge of the distribution of the incoming values and exploits whatever patterns are available. It's not one-size-fits-all. If I am being fed arbitrary binary blobs, taking the first 4 bytes might be as unique as any other approach (and will clearly be faster than anything else that actually depends on the data). If I am being fed English text, that is going to perform horribly on uniqueness.
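
That degenerate "first 4 bytes" hash is a one-liner; a sketch for illustration (prefix_hash is a made-up name, and it only makes sense when the input distribution already makes those bytes close to uniform):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* "Take the first 4 bytes" as the hash. */
    static uint32_t prefix_hash(const void *blob, size_t len)
    {
        uint32_t h = 0;
        memcpy(&h, blob, len < 4 ? len : 4);  /* degrade gracefully for short blobs */
        return h;
    }

    int main(void)
    {
        const unsigned char blob[] = { 0xde, 0xad, 0xbe, 0xef, 0x00, 0x01 };
        printf("%08x\n", prefix_hash(blob, sizeof blob));
        return 0;
    }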

If the source of the data might benefit (DoS attack being a common example) from deliberately causing collisions then there are additional constraints and looking at cryptographic hash functions may be warranted.


I often see these "plot the output of the RNG on a 2d grid" things. I understand they're useful for demonstrating non-randomness, but if you're searching for patterns, do they reveal anything an FFT wouldn't?


Human brains are quite good at spotting some kinds of patterns. A visual representation is a trivially easy way to access that pattern spotting ability. Trivial as in you can do it in the time it's taken me to write this reply. It also communicates well.


Isn't an FFT just a different way of looking at the data?


A Fourier transform is not merely another graphical representation of data (although, it could be used as such). Among other things, it lends itself to generating features and analysis that can be used for detecting patterns by computer, not just humans.


Essentially, yes, that was my understanding, too. I've only used it in signal processing - converting from the time domain to the frequency domain. Not sure how that would apply to this.


I was thinking specifically of this:

  When i look at the FNV-1a "number" map, i think i see subtle
  vertical patterns. With Murmur i see no patterns at all. 
  What do you think?
Wouldn't that vertical repetition, and the other repeating patterns the other hashes form, show up as a nice spike in the frequency domain? Whereas I would expect uniform hashing to generate a flat, white-noise spectrum.


I find the avalanche maps used here easier to grasp: http://home.comcast.net/~bretm/hash/5.html

Here is FNV, quite a lot of reds: http://home.comcast.net/~bretm/hash/6.html And of course all green for SHA-1: http://home.comcast.net/~bretm/hash/9.html


This seems to be very out of date? My understanding was that when SipHash arrived it knocked out all of these? (Please correct me if I'm wrong.)

This may be an interesting weakness of the Stack Overflow system; if there's an answer that used to be very good, it may take a lot to knock it out of the first position when reality changes.


I don't know of a performance comparison between SipHash and other secure hashes other than [1], which compares it to MD5, "Python's new randomized hash, with a 128-bit key and 64-bit output", and a couple of SHA's. It is significantly slower than DJB2, though.

[1] http://bench.cr.yp.to/results-auth.html


You can wiki-edit any SO answer (with low karma it's queued for approval before being shown); unfortunately, it looks like something in the culture or UI prevents this from being common.


> Yes. I started writing my test program to see if hash collisions actually happen - and are not just a theoretical construct.

This should be no surprise. The birthday paradox, as Knuth points out in TAoCP, makes it clear why collisions happen. Given 23 people in a room, there is a greater than 50% chance that at least two have the same birthday.

Placing only 23 keys into a hash table with 365 entries with a perfectly random hash function will probably result in at least one collision.


    Placing only 23 keys into a hash table with 365 entries with
    a perfectly random hash function will probably result in at
    least one collision.
No, it won't "probably" result in at least one collision, it will result in at least one collision with probability about 1/2. As quoted elsewhere, placing k objects in a hash table with N entries will result in no collisions with probability about exp(-(k^2)/(2N)).
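
For the 23-keys-in-365-slots case, the approximation is easy to check against the exact product; a small sketch:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double N = 365.0;
        const int    k = 23;

        double exact = 1.0;                      /* P(no collision), exact */
        for (int i = 0; i < k; ++i)
            exact *= (N - i) / N;

        double approx = exp(-((double)k * k) / (2.0 * N));

        printf("exact  %.4f\n", exact);          /* ~0.4927 */
        printf("approx %.4f\n", approx);         /* ~0.4845 */
        return 0;
    }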


Interesting. I was going to complain, as I thought "probably" meant the probability of occurrence was greater than 50%. That doesn't seem to be the case.

  Probably: almost certainly; as far as one knows or can tell
I would guess waynecochran thought something similar.


Curious. I've just taken a straw poll in the office. I asked "if someone said something was probably going to happen, what probability would you give it?"

The answers were:

* 80%

* 50% to 90%

* 90%

* > 80%

* 80%

* 80% to 90%


My guess would have been "exactly strictly higher than 50%".


While mathematically correct, this does not seem to be how people use that phrase. "X is probably going to happen" implies there is some evidence in favor of X, while "X is probably not going to happen" implies evidence against. Now, in regular speech, people do not feel compelled to force themselves to choose one or the other, but are just as likely to say "I have no idea", implying no evidence in either direction.

From an Information Theory point of view, 50% probability means zero information, and 25%/75% probabilities mean one bit of information against/in-favor-of the event. This sounds roughly consistent with the results of the poll, with people assigning higher probabilities implying higher information density of the evidence.


Same; "more probable than not", which falls back on your definition.


Yes, you're probably right.


I'm curious as to why the emphasis on speed here? I would think that hashes are mostly for either confirming a payload, or as a security measure against plain text passwords/keys.

I would think that, at least in the latter case, something slower and more CPU/memory-intensive would be more desirable. Something like scrypt may be overkill, especially if you use it for password hashes under load with a lot of logins, but the comparison seems to fly in the face of that.

This also doesn't really mention random seed/salt values.


I'd think it's far more common to use hashes for various data structures (i.e. hash map) than for cryptographic or data integrity purposes.


Even for data structures, you generally want to use a cryptographic hash function, to prevent collision and other similar attacks from degrading the performance of a data structure.

For example, if I can sufficiently reverse your hashing process, I can degrade a hash tree into a linked list for a particular set of data - something your firewalls and other security measures will have virtually no chance of stopping, but which will greatly increase the runtime taken to parse (malicious) instructions about that data.

A common form of this is maliciously chosen inserts of new items, causing the structure to degenerate into a list.


Well, there are various applications where you don't necessarily need the cryptographic property. Hashing is widely used for large-scale back-end machine learning tasks (LSH, linear models...) where speed and randomness are the only properties that matter.


Until recently I don't think any major hash table implementation (Java, the STL, Python, Ruby, etc.) used a cryptographically secure hashing algorithm. They're just too slow - slow enough that you might as well use a tree-based data structure, which doesn't suffer from hash-collision-based attacks.

This is changing a bit now that SipHash is available. Unfortunately SipHash is still a bit too slow and forces an uncomfortable trade-off.


I would be really surprised to find that a tree is faster than a fast cryptographic hash function like SipHash. Cache misses to main memory are worth tens of instructions, and that number is going up in many cases. Hash functions can be pretty efficient at allowing the processor to run multiple instructions in parallel.

If you know the tree will be cached and/or the keys you have to hash are large, then sure, a tree might be a win, but hashing a small key might only cost the equivalent of two or three cache misses. Comparing tree nodes is not instruction-free either.

I think it is really a case by case sort of thing, but there is no substitute for measuring (and nailing down the comparison to specific hash functions).


"Unfortunately SipHash still is a bit too slow and forces an uncomfortable trade-off."

2.6x slower than DJB2 in my recent Rust trial. (SipHash is the hashing function in Rust's standard library.)


This is why hashes for hash tables and similar data structures are seeded with a random value upon interpreter startup. This throws off your attack well enough to make it barely feasible with a hashing algorithm that distributes sufficiently randomly. There was a round of DoS vulnerabilities against PHP, Ruby, Java, and others, all based on that attack vector a while ago, but those were all fixed.
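
The mechanics are simple; a minimal sketch of the idea, using FNV-1a purely for illustration (the helper names are made up, and - as the reply below notes - a secret seed by itself is not a cure for a weak hash):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static uint32_t hash_seed;               /* chosen once at interpreter startup */

    static uint32_t seeded_fnv1a(const char *s)
    {
        uint32_t h = 2166136261u ^ hash_seed; /* perturb the FNV offset basis */
        for (; *s; ++s) {
            h ^= (unsigned char)*s;
            h *= 16777619u;
        }
        return h;
    }

    int main(void)
    {
        srand((unsigned)time(NULL));          /* stand-in for a real CSPRNG */
        hash_seed = (uint32_t)rand();
        printf("%08x\n", seeded_fnv1a("Jones"));
        return 0;
    }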


A secret seed is not sufficient to prevent DoS vulnerabilities. That is one of the things the creators of SipHash demonstrated. You can create collisions with MurmurHash3 that work regardless of the seed.


They've also released code to generate an arbitrarily large number of colliding keys for MurmurHash2 and CityHash64, and one to recover the secret key of Python's randomized hash function. At some point here you're doing cryptography whether you want to or not -- and most of these randomized hash functions are just lousy cryptography.


This is right. My wording was incorrect, excuse me. I should have said, more correctly, "This is why the hashing functions used nowadays include randomization with a seed determined at interpreter startup." JRuby, for example, now uses Perl's hashing algorithm; it used to rely on MurmurHash2, just as CRuby did.


The original question asked for

> Example (good) uses include hash dictionaries.

which I interpret as meaning hash tables or similar. If so, hash function performance matters a lot, especially for short strings.


If you are hashing fixed-size keys, then tabulation hashing is simple and fast with good theoretical properties:

http://en.wikipedia.org/wiki/Tabulation_hashing

http://rcoh.me/pyhash.pdf
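
For a 32-bit key, simple tabulation hashing is just four table lookups and three XORs; a minimal sketch (tab_init/tab_hash are illustrative names, and the rand()-based table setup is a placeholder for a proper RNG):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint32_t T[4][256];                /* one table of random words per key byte */

    static void tab_init(void)
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 256; ++j)
                T[i][j] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    }

    static uint32_t tab_hash(uint32_t key)
    {
        return T[0][ key        & 0xff] ^
               T[1][(key >> 8)  & 0xff] ^
               T[2][(key >> 16) & 0xff] ^
               T[3][(key >> 24) & 0xff];
    }

    int main(void)
    {
        srand(12345);
        tab_init();
        printf("%08x %08x\n", tab_hash(42), tab_hash(43));
        return 0;
    }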


It is worth mentioning that libstdc++, in version 4.6.3 at least, implements the Murmur and FNV hash functions for user-defined key types.

If you want to use Murmur you use std::_Hash_impl::hash and for FNV you use std::_Fnv_hash_impl::hash. (Include bits/functional_hash.h)

The implementation is at libsupc++/hash_bytes.cc


Would the security of such algorithms improve dramatically if say 5 years from now we'd live in a world where everyone has a 64-bit device or machine, and the algorithms can be optimized for 64-bit? Are current algorithms limited by 32-bit a lot or not really?


Security of such algorithms now: none.

Security of such algorithms now where everyone has a 64-bit device: none.

Don't confuse hash functions like these and cryptographic hash functions.


gperf - https://www.gnu.org/software/gperf/ - this is a "perfect" application for a "perfect hash" generator.


Seeing at least one collision in the tests makes me wonder why the chance of a Bitcoin address collision is almost nil. https://en.bitcoin.it/wiki/Weaknesses#Generating_tons_of_add...


Bitcoin uses a cryptographic hash function, SHA256. There are, at present, no known collisions with SHA256.

This question is about non-cryptographic hash functions, as are commonly used in hash tables[1] which implement associative arrays, or sharding databases based on the hash of the record key.

Non-cryptographic hash functions are simply not in the same league as cryptographic ones; cryptographic hash functions are optimized to make it infeasible for an attacker to generate a different message with the same hash, while a non-cryptographic hash function generally simply has to have uniformly random output over its keyspace.

[1] http://en.wikipedia.org/wiki/Hash_table


All of which is true and well written. However, it brings up an obvious point missed in the otherwise well-written article: yes, if I use 32-bit hashes to shard into 2^32 databases, some will have more collisions than others - i.e. using all 32 bits of a 32-bit hash exposes some "bad looking" variations. But I don't care about non-randomness in the LSBs, because I'm far more likely to shard into, perhaps, 2^4 database machines.

The original article did not descend into that obvious area of research. I see no particular reason why a hash algo couldn't shove its worst randomness into the least significant byte (which I simply don't care about) as an inherent result of smooshing the best randomness into its most significant nibble, which I do care very much about. Given the likely use case for a sharding hash, a smart hash designer would make sure that most of the effort goes into a smooth distribution in the MSBs, and perhaps totally ignore the LSBs, for a given amount of latency / computation / electrical power. After all, the actual users are more likely to shard based on the first 4 bits than the first 24 bits. Although you'll always run into people who think it's funny to pull their shard subset out of the hash using the LSB (why?) or some random byte in the middle (why?).

I think MSB for shard key comes out of the tradition in the really olden days of sharding based on raw unhashed data. Sometimes that's random enough that the MSB of the data makes an excellent shard key, and hashing would just slow things down for a minimal gain, even today.


Whoa, it's conventional to use the MSB? I usually just do

  bucket = hash(x) & mask
Should I expect non-cryptographic hashes and RNGs to produce better noise in the MSBs? That would explain "Sun Redefines Randomness"


Well, look on the bright side, at least you're not calculating

  bucket = hash(x) % buckets

Unless you're stuck supporting a non-power-of-two number of shards, in which case you need modulus - and then shoving more randomness into the LSBs is more important than the MSBs.

I've thought about it some and I think anyone with a networking background is automatically going to "subnet" their data from MSB down out of pure raw habit. Of course you split top down, just like IP addrs.

I've also seen non-technical popular explanations of sharding which stereotypically use something like phone numbers and start at the most significant digit.

And a stereotypical test question is that this does not work well with things like customer IDs or transaction IDs, because they tend to be sequential-ish - unless you're talking about some kind of data-mining thing, not daily live transactions.

On the other hand it works well if you shard (sorta) by date if you use the right (usually non-native) format. So if you have dates like 2013-10-14 and shard by YYYYMM somehow, then it could be easy to wipe 201210 from the database because whichever shard it's on, it's probably not impacting the latency figures for 201310 today. Unless you wanted to speed up the delete by smoothing it across all shards, in which case sharding by hash ends up being the smart idea.

Trying to do tricky things can turn into a mess when the tricky thing changes mid project, too.
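
For a power-of-two shard count, the two conventions are just a mask versus a shift; a small sketch (16 shards assumed for illustration, reusing a hash value from the FNV-1a example above):

    #include <stdint.h>
    #include <stdio.h>

    #define SHARD_BITS 4                          /* 2^4 = 16 shards */

    static unsigned shard_from_lsb(uint32_t h)
    {
        return h & ((1u << SHARD_BITS) - 1);      /* lowest 4 bits   */
    }

    static unsigned shard_from_msb(uint32_t h)
    {
        return h >> (32 - SHARD_BITS);            /* highest 4 bits  */
    }

    int main(void)
    {
        uint32_t h = 0xe460d8b6;                  /* FNV-1a("altarage") from above */
        printf("lsb shard: %u\n", shard_from_lsb(h));   /* 6  */
        printf("msb shard: %u\n", shard_from_msb(h));   /* 14 */
        return 0;
    }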


This is actually an important topic for me. I am implementing a sharding distribution based on consistent hashing using MurmurHash3 as hash function.

I am taking the first 4 bytes of the hash function output and using that. I checked, and MurmurHash3 mixes the first and last 8 bytes of the output as a last step, but I am not sure how much differentiation there is in the first 4 bytes.

I guess it is something I should check.


These are 32-bit hashes. 160-bit hashes with similar characteristics are outrageously less likely to generate collisions. Something like 3.4 X 10^38 less likely if I'm running ColinWright's formula properly.

That difference is computationally extremely significant. It took the SO poster ~9ms to find a collision with, e.g., Murmur. If you mapped those results to the 160-bit hash, finding a collision, even ignoring the added time to compute the larger hash, would take 97 octillion years.


This is not the right way to look at it. Finding a collision on CRC256 (a 256-bit CRC code) is nowhere near as hard as finding a collision on SHA256 (assuming SHA256 is secure). A CRC code is just the remainder modulo a polynomial, so adding any multiple of that polynomial to the string will give you a collision. The probability of a pair of independently and uniformly sampled random strings having the same CRC256 hash is certainly small, but cryptographic hashes make a strong guarantee: even if the pair is not sampled independently or uniformly a collision is unlikely.

It is also worth pointing out that the hash size is not necessarily a measure of security. Very Smooth Hash is a good example of this: VSH has security that depends on the hardness of a problem that is closely related to integer factorization, and produces hashes that are as long as its parameters. You might need 3072 bit parameters for VSH to be secure, and will thus have 3072 bit hashes; but the hardness of finding a collision will be about as hard as brute-forcing a 128 bit keyspace (estimating these things is something of a dark art, and I am not an expert; it might be that VSH requires much larger parameters than RSA for equivalent security).


Of course an application like Bitcoin needs a cryptographic hash. But the parent comment was concerned that a single-digit number of collisions showed up in a couple hundred thousand test cases of a 32-bit hash. A similar count of collisions has a good chance of showing up with a 32-bit cryptographic hash, too. It's the cryptographic properties in concert with the size of the range that make the function secure for its intended use.


Cryptographic hashes are an entire category of hash functions unto themselves. The absolute minimum guarantee that a cryptographic hash makes is that for any polynomial time algorithm, the probability of the algorithm computing a pair (x,y) such that x != y and H(x) = H(y) is "negligible", i.e. smaller than 1/p(n) for all polynomials p and large enough n; in other words, cryptographic hashes guarantee that collisions are extremely rare. Much like you can create a cipher with security that is reducible to the hardness of some problem (e.g. ElGamal is secure if the Diffie-Hellman problem is hard), it is possible to create collision-resistant hashes with such security proofs, e.g. Very Smooth Hash, SWIFFT. In practice, though, we just use hashes that are believed to be secure according to heuristic tests; hence the need to periodically update standard hash functions, rather than to just increase parameter sizes.

Also interesting is universal hashing, which has applicability in cryptography (for randomness extraction, among other things):

https://en.wikipedia.org/wiki/Universal_hashing


The length of the digest is 32 bits in the linked example and 160 bits for Bitcoin. Assuming the hash functions are good, if you only have 32 bits of output you can expect to find a collision after about 77,000 hashes; the examples used three times that! With 160 bits, you need to generate many orders of magnitude more hashes before a collision is likely.

See http://en.wikipedia.org/wiki/Birthday_attack


>CRC32 collisions

>codding collides with gnu

accident?




