Thanks to you too! I've also enjoyed this dialogue and it helps to formulate one's thoughts by writing them down.
Roughly speaking, for me personally there are two big ideas in the PAR paper (or three, if we count deterministic snapshots, but let's not go into that here). I'll just quote these directly from the paper:
1. "3.4.2 Determining Commitment
The main insight to fix the leader’s faulty log safely and quickly is to distinguish uncommitted entries from possibly committed ones; while recovering the committed entries is necessary for safety, uncommitted entries can be safely discarded. Further, discarding uncommitted faulty entries immediately is crucial for availability. For instance, in case (c)(i), the faulty entry on S1 cannot be fixed since there are no copies of it; waiting to fix that entry results in indefinite unavailability. Sometimes, an entry could be partially replicated but remain uncommitted; for example, in case (c)(ii), the faulty entry on S1 is partially replicated but is not committed. Although there is a possibility of recovering this entry from the other node (S2), this is not necessary for safety; it is completely safe for the leader to discard this uncommitted entry." — [working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster; a rough sketch of this commitment check follows after these quotes]
2. "3.3.3 Disentangling Crashes and Corruption in Log
An interesting challenge arises when detecting corruptions in the log. A checksum mismatch for a log entry could occur due to two different situations. First, the system could have crashed in the middle of an update; in this case, the entry would be partially written and hence cause a mismatch. Second, the entry could be safely persisted but corrupted at a later point. Most log-based systems conflate these two cases: they treat a mismatch as a crash [30]. On a mismatch, they discard the corrupted entry and all subsequent entries, losing the data. Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2)."
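To make these two distinctions concrete, here is a rough sketch of how a leader might react when a log entry fails its checksum, assuming a 5-node cluster as in the paper's Figure 4. This is my own simplified Python pseudocode, not the paper's CTRL implementation and not TigerBeetle's code; all of the names (log, followers, has_entry, has_intact_copy, and so on) are hypothetical.

```python
MAJORITY = 3  # assumed: a 5-node cluster (2f+1 with f = 2), as in the paper's Figure 4

def repair_faulty_entry(log, index, followers):
    """Called on the leader when the entry at `index` fails its checksum."""

    # Idea 2 (3.3.3): disentangle crash from corruption. A torn write from a
    # crash can only affect the not-yet-synced tail of the log; a mismatch in
    # an entry that was already durably persisted must be a later corruption
    # and must not be treated as if the system had simply crashed mid-write.
    if index == log.tail_index() and not log.was_synced(index):
        log.truncate_from(index)                 # ordinary crash recovery
        return "torn write discarded"

    # Idea 1 (3.4.2): determine commitment before deciding how to react.
    have = 1 + sum(1 for f in followers if f.has_entry(index))   # leader counts too
    intact = [f for f in followers if f.has_intact_copy(index)]

    if have < MAJORITY:
        # Provably uncommitted: fewer than a majority have any record of it,
        # so it cannot have been committed. Discarding it immediately is safe,
        # and is what keeps cases (c)(i) and (c)(ii) available.
        # (Recovering it from `intact`, if non-empty, would also be safe, and
        # is how one can choose to maximize durability of replicated data.)
        log.truncate_from(index)
        return "uncommitted entry discarded"

    if intact:
        # Possibly committed, and an intact copy exists: recover it.
        log.overwrite(index, intact[0].read_entry(index))
        return "entry recovered from a follower"

    # Possibly committed, but no intact copy exists anywhere: only here must
    # the leader wait; the faults exceed what the cluster can tolerate.
    raise RuntimeError("possibly committed entry is unrecoverable")
```

The final branch is the only one where the leader has to stop and wait; the `have < MAJORITY` branch is exactly what keeps cases (c)(i) and (c)(ii) available rather than waiting indefinitely for a copy that may not exist.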
Finally, why reconfiguration is not always an option [again, see (c)(i) above in particular]:
"Reconfigure. In this approach, a faulty node is removed and a new node is added. However, to change the configuration, a configuration entry needs to be committed by a majority. Hence, the system remains unavailable in many cases (for example, when a majority are alive but one node’s data is corrupted). Although Reconfigure is not used in practical systems to tackle storage faults, it has been suggested by prior research [15, 44]."
The key to grokking the PAR paper for me personally was seeing that what should intuitively be an isolated, local "process-stopper" event can in fact propagate through the consensus protocol into a showstopper event for the global cluster. This connection between local storage and global consensus was actually introduced in their prior work "Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions" (https://www.usenix.org/conference/fast17/technical-sessions/...).
What was also interesting for us with TigerBeetle (and perhaps I should have mentioned this earlier in our discussion) was that our deterministic simulation testing (https://github.com/coilhq/tigerbeetle#simulation-tests) could very quickly pick up where the lack of PAR would lead to cluster unavailability if we injected just one or two faults here and there.
We actually found some interesting cases where PAR helped us maximize the durability of data we had already replicated, and therefore maximize availability, simply by being more careful about the crucial distinction between committed and uncommitted data.
So it's not just theoretical: our test suite is also telling us experimentally that we need PAR. We wouldn't pass our storage fault model without it.
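For a flavour of what injecting "one or two faults here and there" looks like, here is a highly simplified sketch of the kind of check a deterministic simulator can run. The simulator API shown (make_cluster, corrupt_log_entry, tear_last_write, tick, committed_count) is hypothetical; it is not TigerBeetle's actual simulator, which is linked above.

```python
import random

# A minimal sketch of a storage-fault availability check in a deterministic
# simulator; all of the simulator's methods used here are hypothetical.

def check_availability_under_storage_faults(seed, make_cluster, max_ticks=100_000):
    rng = random.Random(seed)                 # all randomness flows from the seed
    sim = make_cluster(replicas=5, rng=rng)   # deterministic in-memory cluster

    # Inject a couple of storage faults "here and there": corrupt one log
    # entry on one replica, and tear the last write on another.
    sim.corrupt_log_entry(replica=1, index=rng.randrange(sim.log_length(replica=1)))
    sim.tear_last_write(replica=3)

    committed_before = sim.committed_count()
    for _ in range(max_ticks):
        sim.tick()                            # deliver messages, fire timeouts
        if sim.committed_count() > committed_before:
            return True                       # the cluster still makes progress

    # Without PAR-style recovery, runs like this can stall here forever, and
    # the failing seed reproduces the exact schedule deterministically.
    raise AssertionError(f"cluster unavailable after storage faults (seed={seed})")
```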
Regarding VSR's Recovery Protocol, perhaps the best explanation of what I'm thinking is simply the YouTube talk linked in Aleksey Charapko's DistSys Reading Group discussion of the VSR paper.
Thanks again for the interesting discussion! Let me know if you ever want to catch up to chat consensus and storage. I'd be excited to hear about Cassandra's new deterministic simulation framework you mentioned.
> Let me know if you ever want to catch up to chat consensus and storage.
Likewise!
> working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster
Unless I am badly misreading figure 4, example (c)(i) seems impossible to encounter - with a correct implementation of Paxos, at least. For a record to be appended to the distributed log, it must have first been proposed to a majority of participants. So the log record must be recoverable from these other replicas by performing normal Paxos recovery, or may be invalidated if no other quorum has witnessed it. By fail-stopping in this situation, a quorum of the remaining replicas will either agree a different log record to append if it had not reached a quorum, or restore the uncommitted entry if it had. (c)(ii) seems similar to me.
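To spell out what I mean by normal Paxos recovery, here is a minimal sketch of the prepare-phase rule that either restores such a record or allows it to be invalidated (plain textbook single-decree Paxos with hypothetical names, not Cassandra code):

```python
def choose_value_to_propose(prepare_responses, my_value):
    """Decide what a new proposer must propose after gathering promises from a
    quorum. Each response is either None (nothing accepted by that acceptor)
    or a (ballot, value) pair for the value it last accepted."""
    accepted = [r for r in prepare_responses if r is not None]
    if accepted:
        # Some acceptor in the quorum has already accepted a value, so it may
        # have been chosen: the proposer must re-propose the highest-ballot
        # value, which is what "restores" a record witnessed by other replicas.
        return max(accepted, key=lambda r: r[0])[1]
    # No acceptor in this quorum has witnessed a value, so it cannot have been
    # chosen by any quorum: the proposer is free to propose a different record.
    return my_value
```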
All of (b) and (d) seem to exceed the failure tolerance of the cluster if you count corruption failures as fail-stop, and (a) is not a problem for protocols that verify their data with a quorum before answering, or for those that ensure a faulty process cannot be the leader.
> Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2).
Again, Figure 2 does not list data loss as a possibility for either Crash (fail-stop) or Reconfigure approaches.
> We actually found some interesting cases where PAR was able to help us maximize the durability of data we had already replicated
This is interesting. I’m not entirely clear what the distinction is that’s offered by PAR here, as I think all distributed consensus protocols must separate uncommitted, maybe-committed and definitely-committed records for correctness. At least in Cassandra these three states are expressly separated into promise, proposal and commit registers.
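Roughly, a conceptual sketch of those three per-key registers (illustrative Python, not Cassandra's actual implementation or class names):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaxosRegisters:
    """Per-key Paxos state: promise, proposal (maybe-committed), commit."""
    promised_ballot: int = -1              # promise register
    accepted: Optional[tuple] = None       # proposal register: (ballot, value)
    committed: Optional[tuple] = None      # commit register: (ballot, value)

    def on_prepare(self, ballot):
        # Promise only ballots newer than anything already promised.
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return True
        return False

    def on_propose(self, ballot, value):
        # Accept a proposal only at or above the current promise; the value
        # is now maybe-committed until a commit is learned.
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted = (ballot, value)
            return True
        return False

    def on_commit(self, ballot, value):
        # Learning a commit moves the value into the definitely-committed register.
        self.committed = (ballot, value)
        if self.accepted and self.accepted[0] <= ballot:
            self.accepted = None           # uncommitted state can be cleared
```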
Either way, my colleagues and I will be developing the first real-world leaderless transaction system for Cassandra over the coming year, and you've convinced me to expand Cassandra's deterministic cluster simulations to include a variety of disk faults, as this isn't a particularly large amount of work and I'm sure there will be some unexpected surprises in old and new code anyway.