
Ha! I did have an outage a few years back because route announcements and the data plane were on the same host. Having a separate health-check service & route server is a trade-off of complexity vs. control. I could see the argument when you only have a couple of hosts total. With dozens of endpoints in a fleet, it's quite nice to tolerate more wonky grey failures.

Unfortunately I don't know of any existing public lib/application/framework that does this type of layer 2/3/4 load balancing for fleets of endpoints. The VRRP/CARP/keepalived/heartbeat crew seem focused on master/slave failover, which is totally uninteresting to me.
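
For what it's worth, the "tolerate grey failures" part of a fleet-wide health checker can be sketched in a few lines. This is purely illustrative Python (the function and threshold are made up, not from any real system): the checker only withdraws routes for endpoints it believes are down, but if "too many" look down at once it assumes the checker itself is the problem and fails open rather than withdrawing the whole fleet.

```python
def healthy_next_hops(results, quorum=0.5):
    """Given {endpoint: passed_health_check}, return the endpoints whose
    routes should stay announced.

    If fewer than `quorum` of endpoints look healthy, assume the health
    checker (or its network path) is suffering a grey failure and fail
    open: keep announcing everything instead of withdrawing the fleet.
    """
    healthy = [ep for ep, ok in results.items() if ok]
    if len(healthy) < quorum * len(results):
        return sorted(results)      # fail open: don't trust the checker
    return sorted(healthy)          # normal case: drop only the sick ones
```

The interesting design choice is the fail-open branch: with one host doing both announcements and data plane, a sick host silently takes itself (and its traffic) down; with a separate checker, a sick *checker* could take the whole fleet down, so you bound how much it is allowed to withdraw.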




Yeah, it's pretty custom. We started doing this nearly 10 years ago, and when explaining it to vendors their eyes would glaze over. These days it seems quite a bit more common - and I hope I had a tiny bit to do with that, evangelizing it wherever I could.

Curious how you solved the hash redistribution problem? We never came up with anything good (some clever hacks though!), but luckily for our uses it wasn't a big deal and we could do away with a whole shedload of complexity.
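
(For anyone following along: the "hash redistribution problem" is that when a balancer joins or leaves an ECMP group, naive modulo hashing remaps most flows to a different balancer, breaking their connections. The usual mitigation nowadays is consistent hashing, which bounds the remapping to roughly the failed node's share. A toy hash ring in Python, purely illustrative of the idea, not anything either of us actually ran:)

```python
import hashlib
from bisect import bisect


def _h(s):
    """Stable hash of a string onto a large integer ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class Ring:
    """Consistent-hash ring: each node gets `vnodes` points on the ring,
    and a flow key maps to the node owning the next point clockwise."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node(self, key):
        i = bisect(self.points, _h(key)) % len(self.points)
        return self.ring[i][1]
```

Removing a node deletes only that node's points, so every flow that didn't map to it keeps its old successor point - i.e. only the dead node's share of flows moves, instead of nearly all of them under modulo hashing.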

The best we came up with was to pre-assign all the IPs (or over-assign, if you wanted more fine-grained balancing) that a given cluster could ever maximally utilize. Then distribute those IPs evenly across the load balancers, and have the remaining machines take over those IPs should one fail. This was complicated as hell, and it obviously broke layer 3 to the access port, so it was a non-starter.
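
If I follow, the bookkeeping is roughly: spread a fixed, pre-allocated VIP pool round-robin across the balancers, and on a failure have the survivors adopt only the dead peer's VIPs, so everyone else's flows stay put. A toy Python sketch of that logic (hypothetical names and addresses, not the actual implementation):

```python
def assign_vips(vips, lbs):
    """Round-robin a pre-allocated VIP pool across the load balancers."""
    lbs = sorted(lbs)
    return {vip: lbs[i % len(lbs)] for i, vip in enumerate(sorted(vips))}


def takeover(assignment, failed):
    """Redistribute only the failed balancer's VIPs across the survivors;
    VIPs owned by healthy balancers keep their existing owner, so their
    flows are undisturbed."""
    survivors = sorted({lb for lb in assignment.values() if lb != failed})
    out, i = {}, 0
    for vip, lb in sorted(assignment.items()):
        if lb == failed:
            out[vip] = survivors[i % len(survivors)]
            i += 1
        else:
            out[vip] = lb
    return out
```

The over-assignment trick in the comment above falls out naturally: more VIPs per balancer means each takeover moves a smaller slice of traffic, at the cost of a bigger pool to pre-allocate and announce.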

I'm sure we had better/more clever ideas, but we never had reason to chase them down, so I honestly forget. At this point, if someone needs to refresh a page once out of every 100 million requests, I'm pretty happy.



