
This reminds me of a recent project where I was doing some maintenance work on a system in which one component used ZooKeeper, which relies on a distributed consensus algorithm similar to Paxos.

Guess which component was the only one that failed to come back up after a simple reboot?

Turns out that ZooKeeper has default "boot" timeouts that are too short for typical JVM startup times on some hosting platforms, so it just throws up its hands and never tries again after the first attempt.

I want to know who the person was that sat down, cracked his knuckles, and then went about writing the code to deliberately add a failure mode to what is otherwise a very failure-tolerant algorithm.
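
For what it's worth, the fix on our side was raising the startup window in zoo.cfg. If I remember right, the follower connect/sync window is initLimit * tickTime, and the stock sample config ships with 10 * 2000 ms = 20 s. The values below are illustrative for a slow-booting environment, not a recommendation:

    # zoo.cfg -- illustrative values, not a recommendation
    # base time unit in milliseconds
    tickTime=2000
    # followers get initLimit * tickTime (here 120 s) to connect to the leader and sync
    initLimit=60
    # a follower may lag the leader by at most syncLimit * tickTime (here 20 s)
    syncLimit=10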




Well, all valid consensus algorithms are guaranteed to eventually reach agreement, but it's not possible to guarantee a deadline in an asynchronous, distributed environment.

However, in practice, we don't want our system to stall forever waiting for consensus, so a timeout is often used to terminate the algorithm.
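
The subtlety is whether the timeout terminates one attempt or the whole process. A minimal sketch of the retrying flavour, where the timeout bounds each round but the loop keeps the algorithm live (proposeOnce() is a hypothetical placeholder for one consensus round, not any real ZooKeeper API):

    import java.time.Duration;
    import java.util.Optional;

    final class ConsensusLoop {

        static Optional<String> proposeOnce(Duration timeout) {
            // ... run one round; return empty if no decision within the timeout ...
            return Optional.empty();
        }

        static String proposeUntilDecided(Duration perAttempt) throws InterruptedException {
            long backoffMs = 500;
            while (true) {
                Optional<String> decided = proposeOnce(perAttempt);
                if (decided.isPresent()) {
                    return decided.get();   // safety: at most one value is ever decided
                }
                // Giving up for good after the first failed attempt is a policy
                // choice layered on top, not something consensus itself requires.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 30_000);   // capped exponential backoff
            }
        }
    }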


ZooKeeper is consistent at all costs and often sacrifices availability for it, by design.


This superficially sounds like a good design choice, but makes no sense whatsoever if you stop and think about it for a minute.

Clustering systems like ZooKeeper are pointless for scenarios where things never fail. They serve no purpose in a system that is infallible, unchanging, and perfect. They're intended for handling failures and other changes.[1]

Power outages happen. Temporary but total network outages happen, sometimes even across multi-region deployments. Cluster-wide reboots may need to occur occasionally. Startup might be slow due to a storm of other systems also starting and slowing everything down.

These scenarios have to be handled. Handling them is not optional for any system designed around failure scenarios.

VMware handled it fine. Windows Failover Clustering handled it fine. SQL Server AlwaysOn Availability Groups handled it fine. Kubernetes was back up and running before anything else.

ZooKeeper didn't. It just sat there. No nodes were down. No services had crashed. It wasn't out of disk space. No data was corrupted. No "inconsistency" of any kind had occurred.

It just decided not to start. Because it was, what? Too impatient? It saw that 60 seconds had elapsed, and apparently 61 seconds is just too long to wait, so no cluster! No! Too slow! Boot faster next time!

What if this occurs at 1am on a system without 24/7, round-the-clock support? ZooKeeper will go down, and stay down.

In this case it was a planned outage, so I had to stay up until 3:30 am troubleshooting why ZooKeeper wouldn't behave like any other cluster I had ever dealt with. Fun.

[1] This lesson was taught to me by a CommVault backup product designer. He explained that "backup products" aren't. They're recovery products, and their only purpose is to be used in a disaster. If they can't handle a disaster, they're worse than worthless: they give a false sense of confidence! CommVault was designed by people who had operated forward IT bases for the military during the Yugoslav civil war. They had to recover systems from data centres that were literally smoking craters in the ground. ZooKeeper struggles with planned reboots. It's not even vaguely in the same league of trustworthiness.


> This superficially sounds like a good design choice, but makes no sense whatsoever if you stop and think about it for a minute.

Sure it does; it makes sense for systems where continuing in an inconsistent state would be worse than stopping. If your bank doesn't let people spend their money that's bad, but if it lets people spend their money twice it's disastrous.

Obviously the ideal thing is to always keep working correctly, but anyone who's had a clustered system come up in split-brain mode knows the value of a system that will sometimes stop rather than continuing.


Is etcd better than zookeeper in this regard?


It seems to be able to handle being stopped and started just fine, at least in my (limited) experience. For example, Azure Kubernetes Service (AKS) has a stop button that turns off the entire cluster, and pressing the start button "just works". No troubleshooting required.

The ability to stop and start clusters on a whim is very handy for large non-prod environments, for example.
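
For reference, the same stop/start is scriptable with the Azure CLI (resource group and cluster name below are placeholders):

    # Stop the whole cluster, control plane (and thus etcd) included
    az aks stop --resource-group my-rg --name my-aks

    # Bring it back up; no manual recovery steps needed
    az aks start --resource-group my-rg --name my-aks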



