
Ha! I did have an outage a few years back because route announcements and the data plane were on the same host. Having a separate health-check service & route server is a trade-off of complexity vs. control. I could see the argument when you only have a couple of hosts total. With dozens of endpoints in a fleet, it's quite nice to tolerate more wonky grey failures.

Unfortunately I don't know of any existing public lib/application/framework that does this type of layer 2/3/4 load balancing for fleets of endpoints. The VRRP/CARP/keepalived/heartbeat crew seem focused on master/slave failover, which is totally uninteresting to me.
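
For what it's worth, the "tolerate grey failures" part of a fleet-wide health checker can be sketched in a few lines. This is purely illustrative Python (the function and threshold are made up, not from any real system): the checker only withdraws routes for endpoints it believes are down, but if "too many" look down at once it assumes the checker itself is the problem and fails open rather than withdrawing the whole fleet.

```python
def healthy_next_hops(results, quorum=0.5):
    """Given {endpoint: passed_health_check}, return the endpoints whose
    routes should stay announced.

    If fewer than `quorum` of endpoints look healthy, assume the health
    checker (or its network path) is suffering a grey failure and fail
    open: keep announcing everything instead of withdrawing the fleet.
    """
    healthy = [ep for ep, ok in results.items() if ok]
    if len(healthy) < quorum * len(results):
        return sorted(results)      # fail open: don't trust the checker
    return sorted(healthy)          # normal case: drop only the sick ones
```

The interesting design choice is the fail-open branch: with one host doing both announcements and data plane, a sick host silently takes itself (and its traffic) down; with a separate checker, a sick *checker* could take the whole fleet down, so you bound how much it is allowed to withdraw.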




Yeah, it's pretty custom. We started doing this nearly 10 years ago, and when explaining it to vendors their eyes would glaze over. These days it seems quite a bit more common - and I hope I had a tiny bit to do with that, evangelizing it wherever I could.

Curious how you solved the hash redistribution problem? We never came up with anything good (some clever hacks though!), but luckily for our uses it wasn't a big deal and we could do away with a whole shedload of complexity.
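
(For anyone following along: the "hash redistribution problem" is that when a balancer joins or leaves an ECMP group, naive modulo hashing remaps most flows to a different balancer, breaking their connections. The usual mitigation nowadays is consistent hashing, which bounds the remapping to roughly the failed node's share. A toy hash ring in Python, purely illustrative of the idea, not anything either of us actually ran:)

```python
import hashlib
from bisect import bisect


def _h(s):
    """Stable hash of a string onto a large integer ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class Ring:
    """Consistent-hash ring: each node gets `vnodes` points on the ring,
    and a flow key maps to the node owning the next point clockwise."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node(self, key):
        i = bisect(self.points, _h(key)) % len(self.points)
        return self.ring[i][1]
```

Removing a node deletes only that node's points, so every flow that didn't map to it keeps its old successor point - i.e. only the dead node's share of flows moves, instead of nearly all of them under modulo hashing.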

The best we came up with was to pre-assign all the IPs (or over-assign, if you wanted more fine-grained balancing) that a given cluster could ever maximally utilize. Then distribute those IPs evenly across the load balancers, and have the remaining machines take over those IPs should one fail. This was complicated as hell, and it obviously broke layer 3 to the access port, so it was a non-starter.
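
If I follow, the bookkeeping is roughly: spread a fixed, pre-allocated VIP pool round-robin across the balancers, and on a failure have the survivors adopt only the dead peer's VIPs, so everyone else's flows stay put. A toy Python sketch of that logic (hypothetical names and addresses, not the actual implementation):

```python
def assign_vips(vips, lbs):
    """Round-robin a pre-allocated VIP pool across the load balancers."""
    lbs = sorted(lbs)
    return {vip: lbs[i % len(lbs)] for i, vip in enumerate(sorted(vips))}


def takeover(assignment, failed):
    """Redistribute only the failed balancer's VIPs across the survivors;
    VIPs owned by healthy balancers keep their existing owner, so their
    flows are undisturbed."""
    survivors = sorted({lb for lb in assignment.values() if lb != failed})
    out, i = {}, 0
    for vip, lb in sorted(assignment.items()):
        if lb == failed:
            out[vip] = survivors[i % len(survivors)]
            i += 1
        else:
            out[vip] = lb
    return out
```

The over-assignment trick in the comment above falls out naturally: more VIPs per balancer means each takeover moves a smaller slice of traffic, at the cost of a bigger pool to pre-allocate and announce.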

I'm sure we had better/more clever ideas, but we never had reason to chase them down, so I honestly forget. At this point, if someone needs to refresh a page once out of every 100 million requests, I'm pretty happy.



