Thanks to you too! I've also enjoyed this dialogue and it helps to formulate one's thoughts by writing them down.
Roughly speaking, for me personally there are two big ideas in the PAR paper (or three, if we count deterministic snapshots, but let's not go into that here). I'll just quote these directly from the paper:
1. "3.4.2 Determining Commitment
The main insight to fix the leader’s faulty log safely and quickly is to distinguish uncommitted entries from possibly committed ones; while recovering the committed entries is necessary for safety, uncommitted entries can be safely discarded. Further, discarding uncommitted faulty entries immediately is crucial for availability. For instance, in case (c)(i), the faulty entry on S1 cannot be fixed since there are no copies of it; waiting to fix that entry results in indefinite unavailability. Sometimes, an entry could be partially replicated but remain uncommitted; for example, in case (c)(ii), the faulty entry on S1 is partially replicated but is not committed. Although there is a possibility of recovering this entry from the other node (S2), this is not necessary for safety; it is completely safe for the leader to discard this uncommitted entry." — [working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster; a rough sketch of this commitment check follows after these quotes]
2. "3.3.3 Disentangling Crashes and Corruption in Log
An interesting challenge arises when detecting corruptions in the log. A checksum mismatch for a log entry could occur due to two different situations. First, the system could have crashed in the middle of an update; in this case, the entry would be partially written and hence cause a mismatch. Second, the entry could be safely persisted but corrupted at a later point. Most log-based systems conflate these two cases: they treat a mismatch as a crash [30]. On a mismatch, they discard the corrupted entry and all subsequent entries, losing the data. Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2)."
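To make these two distinctions concrete, here is a rough sketch of how a leader might react when a log entry fails its checksum, assuming a 5-node cluster as in the paper's Figure 4. This is my own simplified Python pseudocode, not the paper's CTRL implementation and not TigerBeetle's code; all of the names (log, followers, has_entry, has_intact_copy, and so on) are hypothetical.

```python
MAJORITY = 3  # assumed: a 5-node cluster (2f+1 with f = 2), as in the paper's Figure 4

def repair_faulty_entry(log, index, followers):
    """Called on the leader when the entry at `index` fails its checksum."""

    # Idea 2 (3.3.3): disentangle crash from corruption. A torn write from a
    # crash can only affect the not-yet-synced tail of the log; a mismatch in
    # an entry that was already durably persisted must be a later corruption
    # and must not be treated as if the system had simply crashed mid-write.
    if index == log.tail_index() and not log.was_synced(index):
        log.truncate_from(index)                 # ordinary crash recovery
        return "torn write discarded"

    # Idea 1 (3.4.2): determine commitment before deciding how to react.
    have = 1 + sum(1 for f in followers if f.has_entry(index))   # leader counts too
    intact = [f for f in followers if f.has_intact_copy(index)]

    if have < MAJORITY:
        # Provably uncommitted: fewer than a majority have any record of it,
        # so it cannot have been committed. Discarding it immediately is safe,
        # and is what keeps cases (c)(i) and (c)(ii) available.
        # (Recovering it from `intact`, if non-empty, would also be safe, and
        # is how one can choose to maximize durability of replicated data.)
        log.truncate_from(index)
        return "uncommitted entry discarded"

    if intact:
        # Possibly committed, and an intact copy exists: recover it.
        log.overwrite(index, intact[0].read_entry(index))
        return "entry recovered from a follower"

    # Possibly committed, but no intact copy exists anywhere: only here must
    # the leader wait; the faults exceed what the cluster can tolerate.
    raise RuntimeError("possibly committed entry is unrecoverable")
```

The final branch is the only one where the leader has to stop and wait; the `have < MAJORITY` branch is exactly what keeps cases (c)(i) and (c)(ii) available rather than waiting indefinitely for a copy that may not exist.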
Finally, why reconfiguration is not always an option [again, see (c)(i) above in particular]:
"Reconfigure. In this approach, a faulty node is removed and a new node is added. However, to change the configuration, a configuration entry needs to be committed by a majority. Hence, the system remains unavailable in many cases (for example, when a majority are alive but one node’s data is corrupted). Although Reconfigure is not used in practical systems to tackle storage faults, it has been suggested by prior research [15, 44]."
The key to grokking the PAR paper for me personally was seeing that what should intuitively be an isolated, local "process-stopper" event can in fact propagate through the consensus protocol into a showstopper event for the global cluster. This connection between local storage and global consensus was actually introduced in their prior work "Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions" (https://www.usenix.org/conference/fast17/technical-sessions/...).
What was also interesting for us with TigerBeetle (and perhaps I should have mentioned this earlier in our discussion) was that our deterministic simulation testing (https://github.com/coilhq/tigerbeetle#simulation-tests) could very quickly pick up where the lack of PAR would lead to cluster unavailability if we injected just one or two faults here and there.
We actually found some interesting cases where PAR helped us maximize the durability of data we had already replicated, and therefore maximize availability, simply by being more careful about the crucial distinction between committed and uncommitted data.
So it's not just theoretical: our test suite is also telling us experimentally that we need PAR. We wouldn't pass our storage fault model without it.
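For a flavour of what injecting "one or two faults here and there" looks like, here is a highly simplified sketch of the kind of check a deterministic simulator can run. The simulator API shown (make_cluster, corrupt_log_entry, tear_last_write, tick, committed_count) is hypothetical; it is not TigerBeetle's actual simulator, which is linked above.

```python
import random

# A minimal sketch of a storage-fault availability check in a deterministic
# simulator; all of the simulator's methods used here are hypothetical.

def check_availability_under_storage_faults(seed, make_cluster, max_ticks=100_000):
    rng = random.Random(seed)                 # all randomness flows from the seed
    sim = make_cluster(replicas=5, rng=rng)   # deterministic in-memory cluster

    # Inject a couple of storage faults "here and there": corrupt one log
    # entry on one replica, and tear the last write on another.
    sim.corrupt_log_entry(replica=1, index=rng.randrange(sim.log_length(replica=1)))
    sim.tear_last_write(replica=3)

    committed_before = sim.committed_count()
    for _ in range(max_ticks):
        sim.tick()                            # deliver messages, fire timeouts
        if sim.committed_count() > committed_before:
            return True                       # the cluster still makes progress

    # Without PAR-style recovery, runs like this can stall here forever, and
    # the failing seed reproduces the exact schedule deterministically.
    raise AssertionError(f"cluster unavailable after storage faults (seed={seed})")
```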
Regarding VSR's Recovery Protocol, perhaps the best explanation of what I'm thinking is simply the YouTube talk linked in Aleksey Charapko's DistSys Reading Group discussion of the VSR paper.
Thanks again for the interesting discussion! Let me know if you ever want to catch up to chat consensus and storage. I'd be excited to hear about Cassandra's new deterministic simulation framework you mentioned.
> Let me know if you ever want to catch up to chat consensus and storage.
Likewise!
> working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster
Unless I am badly misreading figure 4, example (c)(i) seems impossible to encounter - with a correct implementation of Paxos, at least. For a record to be appended to the distributed log, it must have first been proposed to a majority of participants. So the log record must be recoverable from these other replicas by performing normal Paxos recovery, or may be invalidated if no other quorum has witnessed it. By fail-stopping in this situation, a quorum of the remaining replicas will either agree a different log record to append if it had not reached a quorum, or restore the uncommitted entry if it had. (c)(ii) seems similar to me.
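To spell out what I mean by normal Paxos recovery, here is a minimal sketch of the prepare-phase rule that either restores such a record or allows it to be invalidated (plain textbook single-decree Paxos with hypothetical names, not Cassandra code):

```python
def choose_value_to_propose(prepare_responses, my_value):
    """Decide what a new proposer must propose after gathering promises from a
    quorum. Each response is either None (nothing accepted by that acceptor)
    or a (ballot, value) pair for the value it last accepted."""
    accepted = [r for r in prepare_responses if r is not None]
    if accepted:
        # Some acceptor in the quorum has already accepted a value, so it may
        # have been chosen: the proposer must re-propose the highest-ballot
        # value, which is what "restores" a record witnessed by other replicas.
        return max(accepted, key=lambda r: r[0])[1]
    # No acceptor in this quorum has witnessed a value, so it cannot have been
    # chosen by any quorum: the proposer is free to propose a different record.
    return my_value
```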
All of (b) and (d) seem to exceed the failure tolerance of the cluster if you count corruption failures as fail-stop, and (a) is not a problem for protocols that verify their data with a quorum before answering, or for those that ensure a faulty process cannot be the leader.
> Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2).
Again, Figure 2 does not list data loss as a possibility for either Crash (fail-stop) or Reconfigure approaches.
> We actually found some interesting cases where PAR was able to help us maximize the durability of data we had already replicated
This is interesting. I’m not entirely clear what the distinction is that’s offered by PAR here, as I think all distributed consensus protocols must separate uncommitted, maybe-committed and definitely-committed records for correctness. At least in Cassandra these three states are expressly separated into promise, proposal and commit registers.
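Roughly, a conceptual sketch of those three per-key registers (illustrative Python, not Cassandra's actual implementation or class names):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaxosRegisters:
    """Per-key Paxos state: promise, proposal (maybe-committed), commit."""
    promised_ballot: int = -1              # promise register
    accepted: Optional[tuple] = None       # proposal register: (ballot, value)
    committed: Optional[tuple] = None      # commit register: (ballot, value)

    def on_prepare(self, ballot):
        # Promise only ballots newer than anything already promised.
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return True
        return False

    def on_propose(self, ballot, value):
        # Accept a proposal only at or above the current promise; the value
        # is now maybe-committed until a commit is learned.
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted = (ballot, value)
            return True
        return False

    def on_commit(self, ballot, value):
        # Learning a commit moves the value into the definitely-committed register.
        self.committed = (ballot, value)
        if self.accepted and self.accepted[0] <= ballot:
            self.accepted = None           # uncommitted state can be cleared
```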
Either way, my colleagues and I will be developing the first real-world leaderless transaction system for Cassandra over the coming year, and you've convinced me to expand Cassandra's deterministic cluster simulations to include a variety of disk faults, as this isn't a particularly large amount of work and I'm sure there will be some unexpected surprises in old and new code anyway.