> For example, how can a single replica even detect that it had a disk fault if the last I/O it acknowledged externally was simply never written to disk at all, despite an fsync barrier?
Could you explain how the paper addresses this particular kind of fault, as I cannot find it? This isn't one of the fault categories listed in Table 2, and the paper assumes that the storage layer reliably detects faults. I cannot see how a local system by itself can ever reliably detect this class of fault.
I can imagine coordinators detecting that a response from a replica is not up-to-date, but that happens anyway in most consensus protocols, and it doesn't help correctness if only f+1 replicas had persisted the record (and another f have persisted a record with a conflicting outcome).
>I would highly recommend the PAR paper to you, if you have not incorporated it already into Cassandra.
I have read (well, skimmed) the PAR paper a few times, and while it is a nice paper I am unconvinced it is particularly helpful, beyond reiterating the fact that you cannot trust your disks - but databases have to address this problem anyway.
I think that the kind of active recovery it outlines could even be counter-productive in some cases. Once a disk fault is detected, it is unclear that it is safe or helpful to try to replicate correct data to the node. It may have less storage available to it now (if it has sensibly isolated the faulty disk), or it may be unable to persist the data at all, so the recovery process may simply generate a lot of recurring work for the system.
Probably the most general purpose course of action is to replace the node, since nodes are plentiful. The faulty node can then have its disk replaced before being returned to the pool of available nodes.
> Probably the most general purpose course of action is to replace the node, since nodes are plentiful. The faulty node can then have its disk replaced before being returned to the pool of available nodes.
This is expensive and wasteful, and could potentially lead to cascading failure. Worse, what if all nodes have a single disk sector error in different places on disk? The cluster would be lost.
> what if all nodes have a single disk sector error in different places on disk?
Well, there are many ways to skin this particular cat, but pretty much all large-scale storage systems (think S3) use some kind of erasure coding. That is, the replication protocol replicates a "log" whose execution generates a "state" of the "state machine". The log is fully replicated, but the state is erasure coded. This scheme tolerates single-sector errors in the state. If you have an error in the log you throw away the entire replica, but since the state is 1000x larger than the log, this is not a big deal.
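To make the log/state split concrete, here's a toy sketch in Python, using single-parity XOR purely for illustration (production systems typically use Reed-Solomon codes with several parity shards; none of this is from the paper under discussion):

```python
# Toy sketch of "replicate the log, erasure-code the state": the small log
# is copied to every replica, while the large state is split into shards
# plus parity so a single lost/corrupt shard can be rebuilt locally.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(state: bytes, k: int) -> list[bytes]:
    """Split state into k data shards plus one XOR parity shard."""
    shard_len = -(-len(state) // k)                       # ceiling division
    padded = state.ljust(k * shard_len, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def reconstruct(shards: list[bytes], lost: int) -> bytes:
    """Rebuild the single shard at index `lost` (say, one bad disk sector)."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    rebuilt = survivors[0]
    for s in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, s)
    return rebuilt

shards = encode(b"state machine contents ...", k=4)
assert reconstruct(shards, lost=2) == shards[2]           # one bad shard is recoverable
```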
At the end of the day the disk must guarantee something, or else you are screwed no matter what. For example, if an acceptor acknowledges a phase-1 (view change) message but the disk lies about storing the new ballot/view/term, then all these protocols are incorrect.
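To spell that example out, here is a rough sketch of the acceptor side (Python; the names and file layout are mine, purely for illustration):

```python
import json, os

# Rough sketch of why an acceptor must durably record its phase-1 promise
# before replying.

class Acceptor:
    def __init__(self, path: str):
        self.path = path
        self.promised_ballot = self._load()

    def _load(self) -> int:
        try:
            with open(self.path) as f:
                return json.load(f)["promised_ballot"]
        except FileNotFoundError:
            return 0

    def on_prepare(self, ballot: int) -> bool:
        if ballot <= self.promised_ballot:
            return False                      # reject: already promised a higher ballot
        self.promised_ballot = ballot
        with open(self.path, "w") as f:
            json.dump({"promised_ballot": ballot}, f)
            f.flush()
            os.fsync(f.fileno())              # replying is only safe AFTER this survives
        # If the disk acknowledges the fsync but silently drops the write, a
        # restarted acceptor reloads an older promise and may later accept a
        # proposal under a stale ballot; two proposers can then each believe
        # they have a majority, and safety is gone. No purely local check can
        # detect this.
        return True
```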
> Well, there are many ways to skin this particular cat
Sure, of course, but why shouldn't we implement the easy wins as well? PAR is pretty simple to implement. It's not exclusive to other techniques, and it increases availability significantly. It's not either/or but both I believe.
> At the end of the day the disk must guarantee something, or else you are screwed no matter what. For example, if an acceptor acknowledges a phase-1 (view change) message but the disk lies about storing the new ballot/view/term, then all these protocols are incorrect.
No, that's actually exactly my point about Viewstamped Replication as per the 2012 revision (and your example is precisely what I've had in mind throughout this thread).
VSR's view change is different from all the others, in that it does not require any guarantee from disk. The whole view change protocol is entirely in-memory. It's the one protocol (at least that I know of) that remains correct even where all the others would be incorrect (since, unlike VSR, they unfortunately require pristine stable storage for correctness).
I think we are making the same point; you just missed this aspect of VSR, in that it places no reliance on disk (at all) for correctness. This sets it apart from all the others.
VSR is closer to what I like to call a "near-byzantine" model. You can have completely byzantine storage (which is probably not a bad way to think of physical disks and firmwares and kernel caches) and VSR will remain correct, despite requiring only the same resources as an otherwise non-byzantine protocol.
Hmm... how does it work? Say I finish the view change and then everybody loses power. When the power comes back, how do you know what the old view was? What prevents two distinct nodes from being both leaders of view v+1 (one before the power loss and one after power comes back) thus issuing inconsistent messages?
Recoverable disk faults of course are handled at the storage layer, so we're talking here about unrecoverable faults. My risk model prefers replacing any disk that is encountering unrecoverable errors, to avoid the disk causing additional problems as it degrades (no matter how good your error detection is, it is preferable not to lean on it more often than necessary).
To limit the impact to only the affected data, it is possible to relocate just that disk's data to a new replica, minimising the burden.
I'll concede that a better model might permit a replica some number of unrecoverable errors (few, but more than one in some time interval) before a total replacement is performed. Even then, there are still repair mechanisms to bring the replica up to date, and simply performing normal recovery for any consensus operations whose state it cannot recover, before it participates in further decisions, is sufficient to ensure correctness.
> could potentially lead to cascading failure
Trying to correct faults on broadly failing disks can lead to cascading failure, as the cluster becomes preoccupied with correcting faults that cannot be resolved, and user queries become unserviceable. Bootstrapping a new replica is also a very modest CPU burden.
> Worse, what if all nodes have a single disk sector error in different places on disk?
In my experience this isn't a concern in practice, and theoretically the risk is also minimal. Unrecoverable disk errors occur at a rate of fewer than one per many petabytes. This provides plenty of time to replace a replica - and if that isn't sufficient, you can increase your replication factor until you can survive the requisite number of coincident faults.
> Could you explain how the paper addresses this particular kind of fault?
I'm genuinely curious to hear your answer here if you have a moment. I can imagine some mechanisms to help recover from a single fault like this in a distributed fashion, but would love to understand what I might be missing from the paper.
> My risk model prefers replacing any disk that is encountering unrecoverable errors, to avoid the disk causing additional problems as it degrades (no matter how good your error detection is, it is preferable not to lean on it more often than necessary). To limit the impact to only the affected data, it is possible to relocate just that disk's data to a new replica, minimising the burden.
Your risk model is of course not exclusive of the PAR protocol. However, PAR would mean that a cluster can at least recover itself and then replace the disk in the background, without this being a showstopper event.
The paper has some very simple examples of where a single disk sector failure can render a cluster unable to complete a view change and thus completely unavailable, but where PAR's distinction between uncommitted/committed ops can mean that the cluster can make forward progress and recover.
There are also further optimizations that you can do on top of PAR once you get started. For example, you can support constant-time acks even from lagging followers that don't have a complete log, to help them catch up faster and reduce ack latency spikes, while still sticking to the VSR invariant of not allowing gaps in the committed log (as distinct from the uncommitted log). I won't go into further detail, but you can see exactly how we do this in TigerBeetle (src/vsr/replica.zig).
> In my experience this isn't a concern in practice, and theoretically the risk is also minimal. Unrecoverable disk errors occur at a rate of fewer than one per many petabytes.
Theoretically, but in practice a bad batch could easily have higher failure rates. You would also need to make sure that your scrubbing rates are surfacing bad sectors soon enough, and that you can then reconfigure faster than additional sectors fail. It's also rare to see test suites actually injecting storage faults and testing that the scrubbing system can detect and repair them. This aspect of recovery is often simply not tested.
However, as I mentioned before, PAR also shows cases where even a single local disk sector failure is enough to bring down a cluster that doesn't implement protocol-aware recovery for consensus storage.
The principle is that local storage and the global consensus protocol really do need to be integrated, unless neither maximizing availability nor network efficiency is a primary goal of the system.
> Could you explain how the paper addresses this particular kind of fault?
Sure, this kind of fault is addressed by Viewstamped Replication's 2012 revision, not by PAR. See the Recovery Protocol in the paper by Liskov and Cowling. In the paper it's used to recover RAM contents after a crash, but the technique is also helpful for storage. Indeed, the paper always has stable storage in mind; it just doesn't require it for correctness, as other consensus protocols tend to do.
VSR is more of a "write-back cache" style protocol, and I believe this makes it a better place for engineers to start from, because it's easier to work with disks when you don't place much reliance on them.
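For anyone following along, here is a very loose sketch of the shape of that recovery protocol, paraphrased from Section 4.3 (Python; field names abridged, networking elided, and this is my reading rather than the paper's pseudocode):

```python
from dataclasses import dataclass
from typing import Optional
import secrets

@dataclass
class RecoveryResponse:
    view: int
    nonce: int
    from_primary: bool
    log: Optional[list] = None        # only the primary of the latest view fills these in
    op_number: Optional[int] = None
    commit_number: Optional[int] = None

def recover(broadcast_recovery, f: int):
    """Run by a replica that has lost its state (a crash, or a wiped disk)."""
    nonce = secrets.randbits(64)              # guards against stale responses
    replies = broadcast_recovery(nonce)       # send RECOVERY(nonce), gather replies
    valid = [r for r in replies if r.nonce == nonce]
    if len(valid) < f + 1:
        raise RuntimeError("need at least f+1 responses; keep waiting")
    latest_view = max(r.view for r in valid)
    primary = next((r for r in valid if r.view == latest_view and r.from_primary), None)
    if primary is None:
        raise RuntimeError("no response yet from the primary of the latest view; keep waiting")
    # Adopt the primary's log, op-number and commit-number, then rejoin as normal.
    return primary.log, primary.op_number, primary.commit_number
```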
> However, PAR would mean that a cluster can at least recover itself and then replace the disk in the background, without this being a showstopper event.
This isn’t a showstopper event; it’s a process-stopper event. I also outlined non-PAR approaches to recovering just fine without replacing the node.
> Theoretically, but in practice
I also have extensive practical confirmation that this approach works well.
> This aspect of recovery is often simply not tested.
I agree that test suites do not cover this well enough, and I commend you for working this into TigerBeetle. This is something I hope to expand Cassandra's new deterministic simulation framework to incorporate in future as well, but Cassandra does benefit from a great deal of real-world exposure to this kind of fault.
> PAR also shows cases where even a single local disk sector failure is enough to bring down a cluster that doesn't implement protocol-aware recovery for consensus storage
I’m not sure I agree. The paper discounts reconfiguration because there could only be f+1 live processes and one could have corruption in a relevant sector, but under normal models this is simply f live processes. It seems that to accommodate this scenario we must duplicate the record identifier, and store both copies separately from the record itself.
It's not clear to me this scenario warrants the additional complexity, storage and bandwidth, as I’m not sure guarding against this is enough to reduce your replication factor. But it's worth considering, and this particular recovery enhancement is quite simple (we just need to duplicate the ballot in Paxos, so we can arbitrate between a split decision of other replicas that retain an intact record), so thanks for highlighting it more clearly for me.
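For what it's worth, the record layout I have in mind is something like the following sketch (Python, purely illustrative, and not what Cassandra does today):

```python
import struct, zlib

# Sketch of the "duplicate the ballot" idea: store the record identifier
# twice, each copy self-checksummed and kept apart from the record body, so
# a corrupt body (or one corrupt header) can still be attributed to its
# ballot and arbitrated against other replicas' intact copies.

def pack_header(ballot: int, body_crc: int) -> bytes:
    payload = struct.pack(">QI", ballot, body_crc)
    return payload + struct.pack(">I", zlib.crc32(payload))

def read_header(raw: bytes):
    payload, crc = raw[:12], struct.unpack(">I", raw[12:16])[0]
    return struct.unpack(">QI", payload) if zlib.crc32(payload) == crc else None

def encode_record(ballot: int, body: bytes) -> bytes:
    header = pack_header(ballot, zlib.crc32(body))
    return header + body + header                   # second identifier copy at the tail

def decode_record(buf: bytes):
    head = read_header(buf[:16]) or read_header(buf[-16:])
    if head is None:
        return None                                  # both identifier copies lost
    ballot, body_crc = head
    body = buf[16:-16]
    intact = zlib.crc32(body) == body_crc
    return ballot, (body if intact else None), intact
```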
> this kind of fault is addressed by Viewstamped Replication's 2012 revision
Could you point me to the place in the paper? I cannot see how VR solves this problem without introducing a risk of inconsistency or necessitating an additional round-trip before responding to a client. Specifically, I think there are only three ways to address this particular problem, and I don’t see them discussed in VR:
1. The coordinator may record the responses of each replica before acknowledging to the client, so that the loss of a write to any one disk may be recoverable.
2. The coordinator may require k>f+1 responses before answering a client, so that we may tolerate k-(f+1) disks losing a write (a quick sketch of this arithmetic follows below).
3. The coordinator may require an additional round-trip to record the consensus decision before answering a client.
I don’t see any of these approaches discussed in VR. I see discussion of recovering the log from other replicas, but this cannot be done safely if the replica is not itself aware that it has lost data.
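To make the arithmetic of option 2 concrete (a trivial sketch, assuming n = 2f+1 replicas):

```python
# Option 2 above: if the coordinator waits for k acknowledgements out of
# n = 2f+1 replicas, and d of those acknowledged writes were silently lost
# before reaching disk, the record is still held by k - d replicas; it stays
# recoverable as long as k - d >= f + 1.

def silently_lost_writes_tolerated(k: int, f: int) -> int:
    return max(0, k - (f + 1))

assert silently_lost_writes_tolerated(k=3, f=1) == 1   # wait for all 3 of 3
assert silently_lost_writes_tolerated(k=2, f=1) == 0   # a bare majority tolerates none
```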
> This isn’t a showstopper event; it’s a process-stopper event.
Again, the PAR paper describes the specific examples where a single disk sector failure on a single replica can be a showstopper event for the whole cluster, even precluding reconfiguration, since the log might no longer be functional.
I know I've been hammering this point (single local disk sector failure = global cluster data loss or else global cluster unavailability), but it's really why PAR is essential for a distributed database, and why it won best paper at FAST '18.
> Could you point me to the place in the paper?
Sure, just grep the 2012 VSR Revisited paper throughout for "recovery protocol" to follow the discussion thread through the paper. It's one of the four main sub-protocols in the VSR paper.
I’d just like to say, by way of preface, that I appreciate this back and forth, as text can feel more combative than intended, and I have learned in the process.
> Again, the PAR paper describes the specific examples where a single disk sector failure on a single replica can be a showstopper event
Again, I don’t believe it does, but I may have missed it. If you could specify the particular scenario you have in mind, that would be great. The paper explicitly rejects recovery only in the scenario where there are already f failed processes, so we are talking about f+1 faults.
In essence, AFAICT the paper says that you can obtain additional failure tolerance by treating corruption as a non-fail-stop failure. My contention is that the replication factor necessary for surviving other faults already confers sufficient protection against corruption, and that as a result you most likely cannot reduce your replication factor with this improvement.
> Sure, just grep the 2012 VSR Revisited paper
Thanks. I read Section 4.3 before writing my prior reply, and I do not believe it addresses the problem.
Thanks to you too! I've also enjoyed this dialogue and it helps to formulate one's thoughts by writing them down.
Roughly speaking, there are two big ideas in the PAR paper for me personally (or three, if we count deterministic snapshots, but let's not go into that here). I'll just quote these directly from the paper:
1. "3.4.2 Determining Commitment
The main insight to fix the leader’s faulty log safely and quickly is to distinguish uncommitted entries from possibly committed ones; while recovering the committed entries is necessary for safety, uncommitted entries can be safely discarded. Further, discarding uncommitted faulty entries immediately is crucial for availability. For instance, in case (c)(i), the faulty entry on S1 cannot be fixed since there are no copies of it; waiting to fix that entry results in indefinite unavailability. Sometimes, an entry could be partially replicated but remain uncommitted; for example, in case (c)(ii), the faulty entry on S1 is partially replicated but is not committed. Although there is a possibility of recovering this entry from the other node (S2), this is not necessary for safety; it is completely safe for the leader to discard this uncommitted entry." — [working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster; a rough sketch of this commitment check follows after these quotes]
2. "3.3.3 Disentangling Crashes and Corruption in Log
An interesting challenge arises when detecting corruptions in the log. A checksum mismatch for a log entry could occur due to two different situations. First, the system could have crashed in the middle of an update; in this case, the entry would be partially written and hence cause a mismatch. Second, the entry could be safely persisted but corrupted at a later point. Most log-based systems conflate these two cases: they treat a mismatch as a crash [30]. On a mismatch, they discard the corrupted entry and all subsequent entries, losing the data. Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2)."
Finally, why reconfiguration is not always an option [again, see (c)(i) above in particular]:
"Reconfigure. In this approach, a faulty node is removed and a new node is added. However, to change the configuration, a configuration entry needs to be committed by a majority. Hence, the system remains unavailable in many cases (for example, when a majority are alive but one node’s data is corrupted). Although Reconfigure is not used in practical systems to tackle storage faults, it has been suggested by prior research [15, 44]."
The key to grokking the PAR paper for me personally was seeing that what should intuitively be only an isolated local "process-stopper" event can in fact propagate through the consensus protocol into being a showstopper event for the global cluster. This connection between local storage and global consensus was actually introduced in their prior work "Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions" (https://www.usenix.org/conference/fast17/technical-sessions/...).
What was also interesting for us with TigerBeetle (and perhaps I should have mentioned this earlier in our discussion) was that our deterministic simulation testing (https://github.com/coilhq/tigerbeetle#simulation-tests) could very quickly pick up cases where the lack of PAR would lead to cluster unavailability, if we injected one or two faults here and there.
We actually found some interesting cases where PAR was able to help us maximize the durability of data we had already replicated, and therefore maximize availability, simply by being more careful in distinguishing committed from uncommitted data, in cases where this distinction was crucial.
So it's not just theoretical: our test suite is also experimentally telling us that we need PAR. We wouldn't pass our storage fault model without it.
Regarding VSR's Recovery Protocol, perhaps the best explanation of what I'm thinking is simply the YouTube talk linked to in the discussion of the VSR paper by Aleksey Charapko's DistSys Reading Group.
Thanks again for the interesting discussion! Let me know if you ever want to catch up to chat consensus and storage. I'd be excited to hear about Cassandra's new deterministic simulation framework you mentioned.
> Let me know if you ever want to catch up to chat consensus and storage.
Likewise!
> working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster
Unless I am badly misreading Figure 4, example (c)(i) seems impossible to encounter - with a correct implementation of Paxos, at least. For a record to be appended to the distributed log, it must first have been proposed to a majority of participants. So the log record must either be recoverable from these other replicas by performing normal Paxos recovery, or may be invalidated if no other quorum has witnessed it. By fail-stopping in this situation, a quorum of the remaining replicas will either agree on a different log record to append, if it had not reached a quorum, or restore the uncommitted entry if it had. (c)(ii) seems similar to me.
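To spell out what I mean by "normal Paxos recovery" (a sketch only, with ballot and acceptor plumbing heavily simplified):

```python
# To repair or invalidate a suspect slot, a proposer runs phase 1 with a
# higher ballot against a majority of acceptors and inspects the promises.

def recover_slot(promises, fallback_value):
    """promises: phase-1 replies from a majority, each (accepted_ballot, accepted_value),
    or (None, None) if that acceptor has accepted nothing for this slot."""
    accepted = [(b, v) for b, v in promises if b is not None]
    if accepted:
        # Some acceptor in the quorum accepted a value: it may already be
        # chosen, so we must re-propose the value with the highest ballot.
        _, value = max(accepted, key=lambda bv: bv[0])
        return value
    # No acceptor in a majority has it: the corrupted local record was never
    # chosen, so it can safely be superseded (e.g. by a no-op or a new value).
    return fallback_value
```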
All of (b) and (d) seem to exceed the failure tolerance of the cluster if you count corruption failures as fail-stop, and (a) is not a problem for protocols that verify their data with a quorum before answering, or for those that ensure a faulty process cannot be the leader.
> Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2)."
Again, Figure 2 does not list data loss as a possibility for either Crash (fail-stop) or Reconfigure approaches.
> We actually found some interesting cases where PAR was able to help us maximize the durability of data we had already replicated
This is interesting. I’m not entirely clear on what distinction PAR offers here, as I think all distributed consensus protocols must separate uncommitted, maybe-committed and definitely-committed records for correctness. At least in Cassandra these three states are expressly separated into promise, proposal and commit registers.
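Roughly, the shape of those three per-key registers is something like this (a sketch, not Cassandra's actual code):

```python
from dataclasses import dataclass
from typing import Optional

# Per-key Paxos state with the three registers mentioned above: a promise
# (highest ballot promised), a proposal (maybe committed), and a commit
# (definitely committed). Sketch only; durability and networking elided.

@dataclass
class PaxosState:
    promised_ballot: int = 0
    accepted: Optional[tuple] = None      # (ballot, value) -- maybe committed
    committed: Optional[tuple] = None     # (ballot, value) -- definitely committed

    def on_prepare(self, ballot: int):
        if ballot > self.promised_ballot:
            self.promised_ballot = ballot
            return True, self.accepted    # report any maybe-committed proposal
        return False, None

    def on_propose(self, ballot: int, value) -> bool:
        if ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted = (ballot, value)
            return True
        return False

    def on_commit(self, ballot: int, value) -> None:
        self.committed = (ballot, value)
```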
Either way, my colleagues and I will be developing the first real-world leaderless transaction system for Cassandra over the coming year, and you’ve convinced me to expand Cassandra’s deterministic cluster simulations to include a variety of disk faults. This isn’t a particularly large amount of work, and I’m sure there will be some unexpected surprises in old and new code anyway.