Hacker News new | past | comments | ask | show | jobs | submit login
Overflow in consistent hashing (2018) (rmarcus.info)
81 points by g0xA52A2A 8 months ago | hide | past | favorite | 15 comments



This is a very cool page. I love little simulations like this for building intuition for systems problems.

Practical systems deal with this by not caring strongly about overflow (caches), by running at low enough utilizations that overflow is very unlikely given their item counts (e.g. Dynamo), by using explicit partitioning rather than consistent hashing (e.g. DynamoDB), by being able to take advantage of multi-tenancy to drive up per-physical-node utilization even in the case of low per-logical-node utilization, or by using some additional algorithmic sophistication (e.g. Chen et al https://arxiv.org/pdf/1908.08762).

In practice, this kind of overflow is a big deal for systems that deal with relatively small numbers of large objects, and are not as big a deal for systems that deal with large numbers of small objects. Try out the numbers in the page's "Handy Calculator" to see how that plays out.

It's also worth mentioning that this isn't unique to consistent hashing, but is a problem with random load balancing more generally. "Pick a random server and send traffic to it" is an OK load balancing strategy when requests are small and servers are large, but a terrible one when requests become relatively large or expensive. In the general load balancing/placement problem this is easier than the storage case, because you don't need to find requests again after dispatching them. That makes simple algorithms like best-of-2 and best-of-k applicable.


This is my post from 2018 (I didn't submit it to HN), and it could definitely use a "here's what practical systems do" update! I'll put it on the TODO list...

Your point about systems dealing with a relatively small number of large objects vs. small objects also makes sense: this is essentially the "cost" of an overflow (4kb spills once in a blue moon? Oh well, handle that as a special case. 4TB spills once in a blue moon? The system might crash). This is more obvious, as you also point out, in load balancing.

One aspect I found very counter-intuitive: before this investigation, I would've guessed that having a large number of large bins makes overflow increasingly unlikely. This is only partially true: more bins is obviously good, but larger bins are actually more sensitive to changes in load factor!

Overall, I think you are right that this is not really a concern in modern systems today. Compared to Dynamo, I still think Vimeo's solution (linked at the bottom of the post) is both intuitive and low-complexity. But regardless, more of an interesting mathematical diversion than a practical systems concern these days.


Yeah there's a nice load balancing paper called the power of two choices that shows sending two requests gives surprisingly good results.


Anyone interested in consistent hashing should take a look at the simpler and more general rendezvous hashing[1] too.

[1] https://en.m.wikipedia.org/wiki/Rendezvous_hashing


There is also jump consistent hashing https://arxiv.org/pdf/1406.2294


There is also Aggregated Rendezvous Hashing[1] available, which kind of solves problem of growth of required calculations for large numbers of nodes.

[1] https://github.com/SenseUnit/ahrw


I used a different technique that seems to work well in practice. Prehash the names of the nodes beforehand. Then hash the key and combine the hashes using a much cheaper algorithm [1]. You only need to do a single hash per key as with consistent hashing and then a very fast O(n) operation instead of a hash to find the optimal node. This does degrade to an O(nlogn) sort if you need to find the N best nodes instead of the single best node (e.g. to have a concept of fallbacks that you hand off to the next link in the chain), but I found that to not actually matter where I implemented it (generally was routing on < 100 nodes total in the first place).

[1] https://stackoverflow.com/a/27952689


Actually, I think what a lot of real systems do is equivalent to pre-computing a "random" table that has suitable balancing properties, and then use the hash to index into it.

e.g. Ceph used to have a big problem with overloaded placement groups, causing some disks to get twice as much load; max throughput was when those maxed out, leaving the rest half-idle. I don't recall the details of the current solution, but I think it's equivalent to generating a random assignment, and then tweaking it to get rid of over-full bins.

The original Chord-style consistent hashing is easier in the P2P environments it was designed for, but typically consistent hashing is used today in much more closely-coupled systems.


Aren't you supposed to use more buckets than nodes, with each node hosting a number of buckets not all adjacent on the circle? This I expect would reduce the problems described in the article, though not eliminate them of course.


Having multiple logical buckets per physical node doesn't fix this problem. It does help ensure that the bucket sizes are closer to uniform, but not that items hash uniformly into the available buckets. Even if all the buckets are exactly uniform (as in some of the simulations on this page, if I understand correctly), to inconsistent hashing of items to buckets leads to inconsistent load.

Multiplicity does help with a major operational concern, though: when a node fails and recovery is needed, the recovery traffic can be spread uniformly across all cluster members rather than hot-spotting a small number of neighbors. Incidentally, this is a classic congestive collapse scenario in consistent hashed systems: a node looks failed because its overloaded, which starts recovery, which adds load to the neighbors which makes them look overloaded, and the whole thing collapses.


I always thought what systems did in practice is that each node can have a variable number of logical buckets assigned to it, so if there is an uneven distribution, the physical node subscribes to more or less buckets. This makes it so that the maximum logical bucket size is the maximum number of items a physical node can hold.


I believe this problem diminishes as the key count goes up.


Depends!

If you double the number of keys and you double the number of bins (load factor stays constant), then the problem becomes much worse very quickly.

If you double the number of keys and you double the size of each bin (load factor stays constant), then the problem diminishes as you suggest. BUT, larger bins are more sensitive to changes in load factor.

The sibling comment ( https://news.ycombinator.com/item?id=40415826 ) does a good job of summarizing how post-2018 systems handle this issue.


Load balancing solves the issue of non-uniform hashing by generating two hashes and picking the hash that corresponds to the node with lower load. Can something similar be done here?


No because the point of consistent hashing is that it doesn’t require state other than the number of nodes to map items to nodes. Picking the lower load hash would require tracking much more state




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: