
> correctness unfortunately rapidly breaks down when physical disks are used, since these may misdirect reads and writes.

Isn't the whole point of consensus to get around discrepancies due to faults in individual devices? With a high enough quorum in Raft, for example, you can make the probability of data loss arbitrarily low, so that it doesn't matter at all in practice.

"Isn't the whole point of consensus to get around discrepancies due to faults in individual devices"

Yes, but only with respect to the fault models actually adopted, proven, implemented, and tested. Again, most formal proofs, papers, and implementations do not incorporate any kind of storage fault model, yet this is essential for correctness in a production deployment.

If you look closely at Paxos and Raft, they only attempt correctness for the network fault model {packets may be reordered/lost/delayed/replayed/redirected} and the process fault model {crash failure}, nothing else. For example, there's no mention of what to do if your disk forgets some writes, whether because of a misdirected write, a latent sector error, or corruption in the middle of the committed consensus log.
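
To make that concrete, here's a minimal sketch in Go of what a storage fault model adds to a consensus log. It is not taken from any particular implementation; the Entry layout and the verify function are assumptions for illustration only. The idea is simply that every entry carries its own index and a checksum, so a latent sector error or a misdirected write is detected at read time instead of being silently applied.

    // A minimal sketch of what a storage fault model adds to a consensus log:
    // every entry carries its own index and a checksum, so a latent sector
    // error or a misdirected write is detected on read instead of being
    // silently applied to the state machine.
    package main

    import (
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    type Entry struct {
        Index    uint64
        Term     uint64
        Payload  []byte
        Checksum uint32 // CRC over Index, Term, and Payload
    }

    func checksum(e Entry) uint32 {
        buf := make([]byte, 16, 16+len(e.Payload))
        binary.LittleEndian.PutUint64(buf[0:8], e.Index)
        binary.LittleEndian.PutUint64(buf[8:16], e.Term)
        return crc32.ChecksumIEEE(append(buf, e.Payload...))
    }

    // verify is what a fault-aware log does on every read: a checksum mismatch
    // means corruption; a valid entry in the wrong slot means a misdirected write.
    func verify(e Entry, expectedIndex uint64) error {
        if checksum(e) != e.Checksum {
            return fmt.Errorf("entry %d: corrupt (latent sector error?)", expectedIndex)
        }
        if e.Index != expectedIndex {
            return fmt.Errorf("entry %d: slot holds entry %d (misdirected write?)", expectedIndex, e.Index)
        }
        return nil
    }

    func main() {
        e := Entry{Index: 7, Term: 2, Payload: []byte("debit account 42")}
        e.Checksum = checksum(e)

        e.Payload[0] ^= 0xFF      // simulate a single flipped bit on disk
        fmt.Println(verify(e, 7)) // detected, not silently applied
    }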

Unfortunately, simply increasing the quorum size will not actually lower the probability of data loss. In fact, there are trivial cases where even a single local disk sector failure on one replica can cause an entire Paxos or Raft cluster to suffer global data loss.
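
Here is one such trivial case, sketched in Go as a toy 3-replica scenario (hypothetical, not taken from any specific system): a committed entry is durable on a majority, one sector fails, the affected replica reacts by truncating its log, and a new majority that has never seen the entry elects a leader without it.

    // A toy 3-replica scenario: entry 5 is committed (durable on a majority),
    // then a single sector failure plus the common "truncate the log at the
    // corruption" reaction loses it globally.
    package main

    import "fmt"

    func main() {
        logs := map[string][]int{
            "A": {1, 2, 3, 4, 5},
            "B": {1, 2, 3, 4, 5},
            "C": {1, 2, 3, 4}, // lagging follower, never received entry 5
        }

        // A latent sector error corrupts entry 5 on replica A. A naive recovery
        // is to truncate A's log at the corrupt entry and rejoin the cluster.
        logs["A"] = logs["A"][:4]

        // A and C now form a majority that has never seen entry 5, so a new
        // leader can be elected without it and B will be truncated to match:
        // a committed entry is gone even though only one sector ever failed.
        fmt.Println("A:", logs["A"]) // [1 2 3 4]
        fmt.Println("C:", logs["C"]) // [1 2 3 4]
        fmt.Println("B:", logs["B"]) // [1 2 3 4 5] -- about to be overwritten
    }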

"Protocol-Aware Recovery for Consensus-Based Storage" from UW-Madison goes into this in detail, showing where well-known distributed systems are not correct, and also showing how this can be fixed. It won Best Paper at FAST '18 but is still catching on in industry.

We implemented Protocol-Aware Recovery for TigerBeetle [1], and use deterministic simulation testing to inject all kinds of storage faults on all replicas in a cluster, at rates as high as 20% simultaneously across all 5 replicas, including the leader. It was pretty magical for us to watch TigerBeetle just keep on running the first time, with high availability and no loss of data, despite radioactively high levels of corruption across the cluster.

[1] https://www.tigerbeetle.com
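
For a sense of what that kind of fault injection can look like, here is a minimal sketch in Go (not TigerBeetle's actual simulator; FaultyDisk, the sector layout, and the fault rate are illustrative assumptions): a seeded PRNG makes the fault schedule deterministic, so any failure the simulator finds can be replayed exactly from the seed.

    // A minimal sketch of deterministic storage fault injection: a seeded PRNG
    // decides, reproducibly, which reads come back corrupted or misdirected.
    package main

    import (
        "fmt"
        "math/rand"
    )

    type FaultyDisk struct {
        prng      *rand.Rand
        faultRate float64 // e.g. 0.20 to fault 20% of reads
        sectors   map[uint64][]byte
    }

    func NewFaultyDisk(seed int64, faultRate float64) *FaultyDisk {
        return &FaultyDisk{
            prng:      rand.New(rand.NewSource(seed)),
            faultRate: faultRate,
            sectors:   make(map[uint64][]byte),
        }
    }

    func (d *FaultyDisk) Write(sector uint64, data []byte) {
        d.sectors[sector] = append([]byte(nil), data...)
    }

    // Read returns the sector's contents, but with probability faultRate it
    // injects a fault: either a flipped byte (corruption) or the contents of
    // the neighbouring sector (a misdirected read/write), if one exists.
    func (d *FaultyDisk) Read(sector uint64) []byte {
        data := append([]byte(nil), d.sectors[sector]...)
        if d.prng.Float64() < d.faultRate {
            if d.prng.Intn(2) == 0 && len(data) > 0 {
                data[d.prng.Intn(len(data))] ^= 0xFF
            } else if other, ok := d.sectors[sector+1]; ok {
                data = append([]byte(nil), other...)
            }
        }
        return data
    }

    func main() {
        // The same seed always produces the same fault schedule, so any bug
        // found by the simulator can be replayed exactly.
        disk := NewFaultyDisk(42, 0.20)
        disk.Write(0, []byte("entry-0"))
        disk.Write(1, []byte("entry-1"))
        for i := 0; i < 5; i++ {
            fmt.Printf("read sector 0: %q\n", disk.Read(0))
        }
    }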
