Paxos automatically determined safe and secure

jorangreef · on Oct 28, 2021

“Paxos is one of the first and most celebrated ideas that laid the foundation for how different things come to an agreement asynchronously”

“Most—if not all—consensus algorithms fundamentally derive concepts from Paxos”

It's not well known (even in the field of distributed systems) that in fact Viewstamped Replication pioneered the field of consensus in 1988, a year before Paxos, with an intuitive algorithm essentially identical to Raft, so that Raft is more within the family and tradition of Viewstamped Replication (as is Multi-Paxos state machine replication) than the other way round.

Viewstamped Replication is significantly easier to implement correctly in the presence of real world storage faults, which formal proofs often do not take into account in their fault model. For example, Paxos and Raft proofs typically assume only pristine stable storage, and the correctness unfortunately rapidly breaks down when physical disks are used, since these may misdirect reads and writes.

UW-Madison have done some excellent research in this area in particular, the intersection of consensus protocol correctness and storage faults, and I would recommend reading "Protocol-Aware Recovery for Consensus-Based Storage" if you'd like to dive into some of this: https://www.usenix.org/conference/fast18/presentation/alagap...

Aleksey Charapko's DistSys Reading Group also recently covered Viewstamped Replication as a foundational paper, focusing on the crystal clear 2012 revision by Barbara Liskov and James Cowling, which I was able to present, covering some of our experience implementing the protocol for TigerBeetle, a distributed database with a strict storage fault model: http://charap.co/reading-group-viewstamped-replication-revis...

_benedict · on Oct 28, 2021

What people think of as Paxos was not originally intended to be called Paxos by Lamport, but the leader election phase for what is now called Multi-Paxos (and he intended to call Paxos). So I don't think there's a difference for algorithmic heritage between Paxos and Multi-Paxos (perhaps you know this, but for those reading your comment it might help clarify).

I think it is anyway reasonable to say that most consensus algorithms derive concepts from Paxos, since (despite how terribly it was presented by Lamport) it is the consensus protocol that captured most attention. Most recent advances in distributed consensus derive from Paxos, not Viewstamped Replication. As far as I know all leaderless distributed consensus protocols derive from Paxos, and most protocol optimisations that have been developed apply to Paxos or one of its derivatives.

I also happen to think Paxos is pretty easy to implement correctly, particularly by comparison to the other protocols, in large part due to its active replication semantics, permitting that commands may be processed by replicas in any order. This means failover is much less complicated to negotiate, as the new leader does not expect to have a complete view of the log. Though of course membership changes remain complicated, and it may be beneficial for the leader to be able to assume it has a complete view of the log - but this is an optimisation rather than an inherent property for correct implementation.

> correctness unfortunately rapidly breaks down when physical disks are used, since these may misdirect reads and writes

What do you mean by "misdirect" here?

jorangreef · on Oct 28, 2021

I would say that the majority of leader-based state machine replication protocols today are based on the more popular RAFT, which is itself a derivative of Viewstamped Replication.

At least most leader-based consensus implementations that I know of seem to be derived from RAFT these days? I would also guess that leaderless distributed protocols derived from Paxos tend to be more niche than leader-based state machine replication?

While Viewstamped Replication as contributed by Brian M. Oki certainly established all the foundational elements of consensus that Paxos would later reiterate and generalize, Viewstamped Replication also immediately showed how to use consensus to do leader-based state machine replication, and to do this simply and practically, as seen in RAFT's massive success in industry.

> What do you mean by "misdirect" here?

A misdirected read/write I/O. This is a rare kind of storage fault in the literature where the disk device redirects the I/O to a different sector. It's rare, but it happens.

> I also happen to think Paxos is pretty easy to implement correctly

What was especially interesting about the PAR paper from UW-Madison was how even fairly common single disk sector faults such as latent sector errors or corruption could cause global cluster data loss for implementations based on Paxos and RAFT as specified, where they depend on pristine fault-free stable storage for correctness. The PAR paper further motivated why simple checksums are not enough and why local storage and global protocol need to be aware of each other if cluster availability is to be maximized.

Implementing Paxos correctly in the presence of a realistic storage fault model is certainly non-trivial.

I believe Viewstamped Replication (VSR), and in particular the 2012 revision by Liskov and Cowling, is a much better place to start when implementing a state machine replication protocol, since it places absolutely no demands on the disk for correctness of the protocol.

In fact, this was the key reason we picked VSR for TigerBeetle, it's not only elegant, but it's one of the very few protocols that actually make sense in a production context where a realistic storage fault model is at play. For example, the VSR view change is entirely correct without any reliance on disk, whereas most other protocols would need significant deviations (and proofs) to make the same guarantee.

_benedict · on Oct 28, 2021

> I would say that the majority of leader-based state machine replication protocols today are based on the more popular RAFT, which is itself a derivative of Viewstamped Replication.

In production systems, most software appears to use Raft. In academia much more significant advances have been made for Paxos, and most of the research I am aware of is invested into Paxos. I am not really aware of much protocol advancement in academia that is built upon Raft?

> Implementing Paxos correctly in the presence of a realistic storage fault model is certainly non-trivial.

I disagree. The failure model for Paxos as specified assumes that processes fail-stop. So all that is required for correctness is that a process fail-stops in the presence of a detectable disk fault. To detect the kind of fault you report here, we only require checksums - either that are in part computed upon data that is not stored in the sector in question, or that are computed over multiple sectors (or both). This latter criterion holds pretty commonly, I think.

Of course, all distributed consensus protocols must handle reconfiguration in order to replace faulty processes, which would need to be invoked here. So this does not really seem any more challenging?

Certainly the software I maintain (Apache Cassandra), that currently employs Paxos is protected against this kind of fault by virtue of the chunk-level checksums that span many sectors, and the property that processes will not respond to operations that encounter any kind of local fault.

jorangreef · on Oct 28, 2021

"So all that is required for correctness is that a process fail-stops in the presence of a detectable disk fault."

Yes, except the PAR paper goes into many more storage faults. Recovering from these is not trivial and needs to be integrated into the consensus protocol, something not specified by Paxos.

For example, how can a single replica even detect that it had a disk fault if the last I/O it acknowledged externally was simply never written to disk at all, despite an fsync barrier? To recover from this correctly at the consensus layer, you would need your consensus protocol to also include and run a recovery protocol at startup, akin to Viewstamped Replication's 2012 Recovery Protocol. Of course there are details I am leaving out here, but the point should hopefully be clear, it's not obvious or trivial when a storage fault model is assumed.

I would highly recommend the PAR paper to you, if you have not incorporated it already into Cassandra.

And because this is not only about correctness, you also want your system to be able to recover and to be highly available in the presence of storage faults. Otherwise, the redundancy afforded by the consensus protocol and replication is not really being fully utilized and resources are being wasted.

_benedict · on Oct 28, 2021

> For example, how can a single replica even detect that it had a disk fault if the last I/O it acknowledged externally was simply never written to disk at all, despite an fsync barrier?

Could you explain how the paper addresses this particular kind of fault, as I cannot find it? This isn't one of the fault categories listed in Table 2, and the paper assumes that the storage layer reliably detects faults. I cannot see how a local system by itself can ever reliably detect this class of fault?

I can imagine coordinators detecting that a response from a replica is not up-to-date, but that happens anyway in most consensus protocols and this doesn't help correctness if only f+1 replicas had persisted the record (and another f have persisted a record with a conflicting outcome)

>I would highly recommend the PAR paper to you, if you have not incorporated it already into Cassandra.

I have read (well, skimmed) the PAR paper a few times, and while it is a nice paper I am unconvinced it is particularly helpful, beyond reiterating the fact that you cannot trust your disks - but this problem has to anyway be addressed by databases.

I think that the kind of active recovery it outlines could even be counter-productive in some cases. Once a disk fault is detected, it is unclear that it is safe or helpful to try to replicate correct data to the node. It may have less storage available to it now, if it has sensibly isolated the faulty disk, or may be unable to persist the data and so the recovery process may simply result in a lot of recurring work for the system.

Probably the most general purpose course of action is to replace the node, since nodes are plentiful. The faulty node can then have its disk replaced before being returned to the pool of available nodes.

jorangreef · on Oct 28, 2021

> Probably the most general purpose course of action is to replace the node, since nodes are plentiful. The faulty node can then have its disk replaced before being returned to the pool of available nodes.

This is expensive and wasteful, and could potentially lead to cascading failure. Worse, what if all nodes have a single disk sector error in different places on disk? The cluster would be lost.

MatteoFrigo · on Oct 28, 2021

> what if all nodes have a single disk sector error in different places on disk?

Well, there are many ways to skin this particular cat, but pretty much all large-scale storage systems (think S3) use some kind of erasure coding. That is, the replication protocol replicates a "log" whose execution generates a "state" of the "state machine". The log is fully replicated, but the state is erasure coded. This scheme tolerates single-sector errors in the state. If you have an error in the log you throw away the entire replica, but since the state is 1000x larger than the log, this is not a big deal.

At the end of the day the disk must guarantee something, or else you are screwed no matter what. For example, if an acceptor acknowledges a phase-1 (view change) message but the disk lies about storing the new ballot/view/term, then all these protocols are incorrect.

jorangreef · on Oct 28, 2021

> Well, there are many ways to skin this particular cat

Sure, of course, but why shouldn't we implement the easy wins as well? PAR is pretty simple to implement. It's not exclusive to other techniques, and it increases availability significantly. It's not either/or but both I believe.

> At the end of the day the disk must guarantee something, or else you are screwed no matter what. For example, if an acceptor acknowledges a phase-1 (view change) message but the disk lies about storing the new ballot/view/term, then all these protocols are incorrect.

No, that's actually exactly my point about Viewstamped Replication as per the 2012 revision (and your example is precisely what I've had back in mind throughout this thread).

VSR's view change is different from all the others, in that it does not require any guarantee from disk. The whole view change protocol is entirely in-memory. It's the one protocol (at least that I know of) that remains correct, even where all the others would be incorrect (since, contrary to VSR, they unfortunately require pristine stable storage for correctness).

I think we are making the same point, you just missed this aspect of VSR, in that it places no reliance on disk (at all) for correctness. This sets it apart from all the others.

VSR is closer to a "near-byzantine" model as I like to call it. You can have completely byzantine storage (which is probably not a bad way to think of physical disks and firmwares and kernel caches) and VSR will remain correct, despite requiring only the same resources as an otherwise non-byzantine protocol.

MatteoFrigo · on Oct 28, 2021

Hmm... how does it work? Say I finish the view change and then everybody loses power. When the power comes back, how do you know what the old view was? What prevents two distinct nodes from being both leaders of view v+1 (one before the power loss and one after power comes back) thus issuing inconsistent messages?

_benedict · on Oct 28, 2021

> This is expensive and wasteful.

Recoverable disk faults of course are handled at the storage layer, so we're talking here about unrecoverable faults. My risk model prefers replacing any disk that is encountering unrecoverable errors, to avoid the disk causing additional problems as it degrades (no matter how good your error detection is, it is preferable not to lean on it more often than necessary).

To ensure only the affected data is concerned, it is possible to relocate only that disk's data to a new replica, to minimise the burden.

I'll concede that a better model might permit a replica some number of unrecoverable errors (few, but more than one in some time interval) before a total replacement is performed, but in this case there are still repair mechanisms to bring the replica up-to-date, and simply performing normal recovery for consensus operations whose state it cannot recover before it participates in further decisions is sufficient for ensuring correctness.

> could potentially lead to cascading failure

Trying to correct faults on broadly failing disks can lead to cascading failure, as the cluster becomes preoccupied with correcting faults that cannot be resolved, and user queries become unserviceable. Bootstrapping a new replica is also a very modest CPU burden.

> Worse, what if all nodes have a single disk sector error in different places on disk?

This isn't a concern in practice from experience, and theoretically the risk is also minimal. Unrecoverable disk errors occur at a rate of fewer than one per many petabytes. This provides plenty of time to replace a replica - and if this isn't sufficient, you can increase your replication factor until you can survive the requisite number of coincident faults.

> Could you explain how the paper addresses this particular kind of fault?

I'm genuinely curious to hear your answer here if you have a moment. I can imagine some mechanisms to help recover from a single fault like this in a distributed fashion, but would love to understand what I might be missing from the paper.

jorangreef · on Oct 28, 2021

> My risk model prefers replacing any disk that is encountering unrecoverable errors, to avoid the disk causing additional problems as it degrades (no matter how good your error detection is, it is preferable not to lean on it more often than necessary). To ensure only the affected data is concerned, it is possible to relocate only that disk's data to a new replica, to minimise the burden.

Your risk model is of course not exclusive of the PAR protocol. However, PAR would mean that a cluster can at least recover itself and then replace the disk in the background, without this being a showstopper event.

The paper has some very simple examples of where a single disk sector failure can render a cluster unable to complete a view change and thus completely unavailable, but where PAR's distinction between uncommitted/committed ops can mean that the cluster can make forward progress and recover.

There are also further optimizations that you can do on top of PAR once you get started, for example to support constant-time acks even from lagging followers which don't have a complete log, to help them catch up faster and reduce ack latency spikes, while still sticking to the VSR invariant of not allowing gaps in the committed log (as distinct from the uncommitted log). I won't go into further details, but you can see how we do this exactly in TigerBeetle (src/vsr/replica.zig).

> This isn't a concern in practice from experience, and theoretically the risk is also minimal. Unrecoverable disk errors occur at a rate of fewer than one per many petabytes.

Theoretically, but in practice a bad batch could easily have higher failure rates. You would also need to make sure that your scrubbing rates are surfacing bad sectors soon enough and that you can then reconfigure faster than additional sectors fail. It's also rare to see test suites actually injecting storage faults and testing that the scrubbing system can detect and repair. This aspect of recovery is often simply not tested.

However, as I mentioned before, PAR also shows cases where even a single local disk sector is enough to bring down a cluster that doesn't implement protocol-aware recovery for consensus storage.

The principle is that local storage and global consensus protocol really do need to be integrated, unless maximizing availability or network efficiency are not primary goals of the system.

> Could you explain how the paper addresses this particular kind of fault?

Sure, this kind of fault is addressed by Viewstamped Replication's 2012 revision, not by PAR. See the Recovery Protocol in the paper by Liskov and Cowling. In the paper, it's used to recover RAM contents after a crash but the technique is also helpful for storage, and indeed the paper always has stable storage in mind, it's just that it doesn't require it for correctness as other consensus protocols tend to do.

VSR is more of a "write-back cache" style protocol and I believe this makes it a better place for engineers to start from, because it's easier to work with disks, when you don't place much reliance on them.

_benedict · on Oct 28, 2021

> However, PAR would mean that a cluster can at least recover itself and then replace the disk in the background, without this being a showstopper event.

This isn’t a showstopper event, it’s a process stopper event. I also outlined non-PAR approaches to recovering just fine without replacing the node.

> Theoretically, but in practice

I also have extensive practical confirmation that this approach works well.

> This aspect of recovery is often simply not tested.

I agree that test suites do not cover this well enough, and I commend you for working this into Tiger Beetle. This is something I hope to expand Cassandra's new deterministic simulation framework to incorporate in future as well, but Cassandra does benefit from a great deal of real world exposure to this kind of fault.

> PAR also shows cases where even a single local disk sector is enough to bring down a cluster that doesn't implement protocol-aware recovery for consensus storage.

I’m not sure I agree. The paper discounts reconfiguration because there could only be f+1 live processes and one could have corruption in a relevant sector, but under normal models this is simply f live processes. It seems that to accommodate this scenario we must duplicate the record identifier, and store both separately from the record itself.

It's not clear to me this scenario warrants the additional complexity, storage and bandwidth, as I’m not sure guarding against this is enough to reduce your replication factor. But it's worth considering, and this particular recovery enhancement is quite simple (we just need to duplicate the ballot in Paxos, so we can arbitrate between a split decision of other replicas that retain an intact record), so thanks for highlighting it more clearly for me.

> this kind of fault is addressed by Viewstamped Replication's 2012 revision

Could you point me to the place in the paper? I cannot see how VR solves this problem without introducing a risk of inconsistency, without necessitating an additional round-trip before responding to a client. Specifically, I think there are only three ways to address this particular problem, and I don’t see them discussed in VR:

1. The coordinator may record the responses of each replica before acknowledging to the client, so that the loss of any a write to any one disk may be recoverable. 2. The coordinator may require k>f+1 responses before answering a client, so that we may tolerate the loss of k-(f+1) disks losing a write 3. The coordinator may require an additional round-trip to record the consensus decision before answering a client

I don’t see any of these approaches discussed in VR. I see discussion of recovering the log from other replicas, but this cannot be done safely if the replica is not itself aware that it has lost data.

jorangreef · on Oct 28, 2021

> This isn’t a showstopper event, it’s a process stopper event.

Again, the PAR paper describes the specific examples where a single disk sector failure on a single replica can be a showstopper event — for the whole cluster, even precluding reconfiguration since the log might no longer be functional.

I know I've been hammering this point (single local disk sector failure = global cluster data loss or else global cluster unavailability), but it's really why PAR is essential for a distributed database, and why it won best paper at FAST '18.

> Could you point me to the place in the paper?

Sure, just grep the 2012 VSR Revisited paper throughout for "recovery protocol" to follow the discussion thread through the paper. It's one of the four main sub-protocols in the VSR paper.

Here's the link to the paper itself: http://pmg.csail.mit.edu/papers/vr-revisited.pdf

The protocol is section 4.3 but it's good to see the surrounding discussion in the paper where it's referenced.

We also discussed this aspect of VSR in some extra detail at Aleksey Charapko's DistSys Reading Group recently: http://charap.co/reading-group-viewstamped-replication-revis...

_benedict · on Oct 28, 2021

I’d just like to say in preface that I appreciate this back and forth, as text can feel more combative than intended. I have learned in the process.

> Again, the PAR paper describes the specific examples where a single disk sector failure on a single replica can be a showstopper event

Again, I don’t believe it does, but I may have missed it. If you could specify the particular scenario you imagine that would be great. The paper explicitly only rejects recovery in the scenario that there are already f failed processes, so we are talking about f+1 faults.

In essence, AFAICT the paper says that you can obtain additional failure tolerance by treating corruption as a non-fail-stop failure, and my contention is that the necessary replication factor for surviving other faults confers sufficient protection against corruption, and that as a result you most likely cannot reduce your replication factor with this improvement.

> Sure, just grep the 2012 VSR Revisited paper

Thanks. I read Section 4.3 before writing my prior reply, and I do not believe it addresses the problem.

jorangreef · on Oct 28, 2021

Thanks to you too! I've also enjoyed this dialogue and it helps to formulate one's thoughts by writing them down.

Roughly speaking, there are two big ideas (or three, if we count deterministic snapshots, but let's not go into that here) in the PAR paper for me personally — I'll just quote these directly from the paper:

1. "3.4.2 Determining Commitment The main insight to fix the leader’s faulty log safely and quickly is to distinguish uncommitted entries from possi- bly committed ones; while recovering the committed en- tries is necessary for safety, uncommitted entries can be safely discarded. Further, discarding uncommitted faulty entries immediately is crucial for availability. For in- stance, in case (c)(i), the faulty entry on S1 cannot be fixed since there are no copies of it; waiting to fix that entry results in indefinite unavailability. Sometimes, an entry could be partially replicated but remain uncommit- ted; for example, in case (c)(ii), the faulty entry on S1 is partially replicated but is not committed. Although there is a possibility of recovering this entry from the other node (S2), this is not necessary for safety; it is completely safe for the leader to discard this uncommitted entry." — [working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster]

2. "3.3.3 Disentangling Crashes and Corruption in Log An interesting challenge arises when detecting corrup- tions in the log. A checksum mismatch for a log entry could occur due to two different situations. First, the system could have crashed in the middle of an update; in this case, the entry would be partially written and hence cause a mismatch. Second, the entry could be safely per- sisted but corrupted at a later point. Most log-based sys- tems conflate these two cases: they treat a mismatch as a crash [30]. On a mismatch, they discard the corrupted entry and all subsequent entries, losing the data. Discard- ing entries due to such conflation introduces the possibil- ity of a global data loss (as shown earlier in Figure 2)."

Finally, why reconfiguration is not always an option [again, see (c)(i) above in particular]:

"Reconfigure. In this approach, a faulty node is removed and a new node is added. However, to change the configuration, a configuration entry needs to be committed by a majority. Hence, the system remains unavailable in many cases (for example, when a majority are alive but one node’s data is corrupted). Although Reconfigure is not used in practical systems to tackle storage faults, it has been suggested by prior research [15, 44]."

The key to grokking the PAR paper for me personally was seeing that what should intuitively be only an isolated local "process-stopper" event can in fact propagate through the consensus protocol into being a showstopper event for the global cluster. This connection between local storage and global consensus was actually introduced in their prior work "Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions" (https://www.usenix.org/conference/fast17/technical-sessions/...).

What was also interesting for us with TigerBeetle (and perhaps I should have mentioned this earlier in our discussion), was that our deterministic simulation testing (https://github.com/coilhq/tigerbeetle#simulation-tests) could actually pick up very quickly where the lack of PAR would lead to cluster unavailability if we inject one or two faults here and there.

We actually found some interesting cases where PAR was able to help us maximize the durability of data we had already replicated, and therefore maximize availability, simply by being more careful in our distinction between committed and uncommitted data, and where this distinction was crucial.

So it's not just theoretical, our test suite is also experimentally telling us that we need PAR. We wouldn't pass our storage fault model without it.

Regarding VSR's Recovery Protocol, perhaps the best explanation of what I'm thinking is simply the YouTube talk linked to in the discussion of the VSR paper by Aleksey Charapko's DistSys Reading Group.

Thanks again for the interesting discussion! Let me know if you ever want to catch up to chat consensus and storage. I'd be excited to hear about Cassandra's new deterministic simulation framework you mentioned.

_benedict · on Oct 28, 2021

> Let me know if you ever want to catch up to chat consensus and storage.

Likewise!

> working through figure 4 carefully is key to understanding the paper's contribution here, especially (c)(i) which is a simple scenario that can lead to indefinite unavailability of the cluster

Unless I am badly misreading figure 4, example (c)(i) seems impossible to encounter - with a correct implementation of Paxos, at least. For a record to be appended to the distributed log, it must have first been proposed to a majority of participants. So the log record must be recoverable from these other replicas by performing normal Paxos recovery, or may be invalidated if no other quorum has witnessed it. By fail-stopping in this situation, a quorum of the remaining replicas will either agree a different log record to append if it had not reached a quorum, or restore the uncommitted entry if it had. (c)(ii) seems similar to me.

All of (b) and (d) seem to exceed the failure tolerance of the cluster if you count corruption failures as fail-stop, and (a) is not a problem for protocols that verify their data with a quorum before answering, or for those that ensure a faulty process cannot be the leader.

> Discarding entries due to such conflation introduces the possibility of a global data loss (as shown earlier in Figure 2)."

Again, Figure 2 does not list data loss as a possibility for either Crash (fail-stop) or Reconfigure approaches.

> We actually found some interesting cases where PAR was able to help us maximize the durability of data we had already replicated

This is interesting. I’m not entirely clear what the distinction is that’s offered by PAR here, as I think all distributed consensus protocols must separate uncommitted, maybe-committed and definitely-committed records for correctness. At least in Cassandra these three states are expressly separated into promise, proposal and commit registers.

Either way, I and colleagues will be developing the first real-world leaderless transaction system for Cassandra over the coming year, and you’ve convinced me to expand Cassandra’s deterministic cluster simulations to include a variety of disk faults, as this isn’t a particularly large amount of work and I’m sure there will anyway be some unexpected surprises in old and new code.

hawk_ · on Oct 28, 2021

> correctness unfortunately rapidly breaks down when physical disks are used, since these may misdirect reads and writes.

Isn't the whole point of consensus to get around discrepancies due to faults in individual devices. With high enough quorum in Raft for example you can lower the probability of dataloss to an arbitrarily low point to not matter at all in practice.

jorangreef · on Oct 28, 2021

"Isn't the whole point of consensus to get around discrepancies due to faults in individual devices"

Yes, but only with respect to the fault models actually adopted, proven, implemented and tested. Again, most formal proofs, papers and implementations are not actually incorporating any kind of a storage fault model, yet this is essential for correctness in a production deployment.

If you look closely at Paxos and RAFT, they only attempt correctness for the network fault model {packets may be reordered/lost/delayed/replayed/redirected} and process fault model {crash failure}, nothing else. For example, there's no mention of what to do if your disk forgets some writes either because of a misdirected write, or because of a latent sector error, or because of corruption in the middle of the committed consensus log.

Unfortunately, simply increasing quorum will also not actually lower the probability of data loss. In fact, there are trivial cases where even a single local disk sector failure on one replica can cause an entire Paxos or RAFT cluster to experience global cluster data loss.

Protocol-Aware Recovery for Consensus-Based Storage from UW-Madison goes into this in detail, showing where well-known distributed systems are not correct, and also showing how this can be fixed. It won best paper at FAST '18 but is still catching on in industry.

We implemented Protocol-Aware Recovery for TigerBeetle [1], and use deterministic simulation testing to inject all kinds of storage faults on all replicas in a cluster, even as high as 20% simultaneously on all 5 replicas, including the leader. It was pretty magical for us to watch TigerBeetle just keep on running the first time, with high availability and no loss of data, despite radioactively high levels of corruption across the cluster.

[1] https://www.tigerbeetle.com

hawk_ · on Oct 28, 2021

do you happen to have a mirror for the vr-revisited paper? This link doesn't work

https://pmg.csail.mit.edu/papers/vr-revisited.pdf

jorangreef · on Oct 29, 2021

I think the issue is with the scheme, this works:

http://pmg.csail.mit.edu/papers/vr-revisited.pdf

hawk_ · on Oct 31, 2021

you're right. Firefox wouldn't let me click the http link and automatically convert it to https. however if I click download directly instead of just loading it, it works.

throw10920 · on Oct 27, 2021

Assuming it's possible for most proofs to become automatically generated (which I don't think is a given), then it seems like verification work will shift from manually writing proofs into writing the specification from which proofs will automatically be generated - in which case, verification engineers will be able to look forward to an exciting future of figuring out how to tell the computer what is meant by "is".

Or doing system implementation directly - it doesn't seem like automated program synthesis is going anywhere fast...

(I jest, while still growing to be more of a fan of type systems, borrow checkers, and formal proofs)

lmm · on Oct 28, 2021

I mean that's always been the key skill of programming. Telling the computer how to do the thing isn't the hard part, figuring out exactly what needs to be done is.

johnbender · on Oct 28, 2021

More likely we’re looking at codesign/correct-by-construction for fully general purpose verification.

In many cases it’s possibly to refine a state machine based specification to an imperative implementation (and thereby carry safety properties down to the implementation) but at present the implementation usually looks like the state machine (thus codesign)

ncmncm · on Oct 28, 2021

A proof about a state machine is more useful than a proof about a program, or an automatically generated program. Proving that a program correctly implements a state machine is a much simpler exercise than teaching a proof system about your programming language.

Jweb_Guru · on Oct 28, 2021

> A proof about a state machine is more useful than a proof about a program, or an automatically generated program.

Easier maybe, but more useful? I disagree. Just because it's difficult to prove things about real programs doesn't change the fact that a formally verified program has really important advantages--like being able to actually run it, and knowing the executing code actually fulfills the spec rather than relying on a human translation layer.

ncmncm · on Oct 28, 2021

No, more useful. A proof about a state machine applies to any implementation of the state machine, in any language. A generated program is only itself. Any needed modifications make it not the program generated, or proven.

Furthermore: such output depends, for correctness, on a complete and accurate description of the salient details of the programming language, and the environment it needs to run in, provided to the proof system. But such a description is more complicated and error-prone than the description of the state machine. So, you need an extra proof that the language and system description are accurate and complete, which is not provided.

Jweb_Guru · on Oct 28, 2021

A proof that the language and system description are accurate and complete is very difficult, I agree. But it only has to be done once, it does not have to be repeated for each proof, so that does not seem relevant to me wrt the advantages of such a proof.

As for your first point, I would wager (having done it many times) that it is considerably easier to translate a proven, working program to another language correctly, than it is to do the same thing with an abstract state machine.

ncmncm · on Oct 28, 2021

For it to be "done once", it has to be done at all. I have not seen any reason to believe it was done at all. When posting a boast about having done the easy part, it does not inspire confidence that the hard part was done, too, but nobody thought the hard part was worth mentioning.

So, until I see the announcement, the reasonable interpretation is that it has not been done.

threatofrain · on Oct 27, 2021

Towards an Automatic Proof of Lamport's Paxos (2021)

https://arxiv.org/pdf/2108.08796.pdf

sethammons · on Oct 27, 2021

> The tool the researchers designed for this proof is called IC3PO, a model checking system that looks through every state a program can enter and determines whether it matches a description of safe behavior. If the protocol is correct, IC3PO produces what’s termed an inductive invariant—a proof by induction that the property holds in all cases. If instead a bug is found in the protocol, it will produce a counter-example and execution trace, showing step by step how the bug manifests.

That's really cool.

thrtythreeforty · on Oct 28, 2021

You can also find this in hardware test benches - Symbiyosys (FOSS) and JasperGold (very proprietary) can prove via induction that assertions about your system on chip hold true for all time.

meetups323 · on Oct 28, 2021

> The inductive invariant IC3PO produced for Paxos in under an hour identically matches the human-written one previously derived with significant manual effort using a technique called interactive theorem proving. On top of speeding the process up, it also produces a proof with very succinct and digestible documentation.

Is there anything actually different between "interactive theorem proving" in a sufficiently advanced language (say Coq) and what they've done here? Honest question, I had to drop out of my college Coq class on account of a broken leg (and lack of willpower).

johnbender · on Oct 28, 2021

In Coq the user generally writes meta programs to generate proofs as programs at varying levels of automation (Ltac scripts). Here they used IC3 to automatically produce proof that their state based model of paxos satisfies the invariant.

In principle one could build an automated proof system for a state machine-esque formalism written in Coq using Ltac but that would be a nightmare in practice.

generationP · on Oct 28, 2021

Wondering where the word "secure" in the title is coming from. These are purely safety results; the setting doesn't even allow for Byzantine failures.

lowbloodsugar · on Oct 27, 2021

I thought that Lamport already did that with TLA+? What is new about this one?

MatteoFrigo · on Oct 28, 2021

TLA+ is a specification language. Its goal is to say exactly what the algorithm is, because an informal description in English or even pseudo-code is usually too ambiguous and not amenable to mechanical proofs. TLA+ by itself does not prove anything. You can write incorrect algorithms in TLA+, in the same way that you can write incorrect C programs.

Once you have a TLA+ specification, lazy people like me run the specification through a tool called TLC that exhaustively explores all possible behaviors of the algorithm in a finite search space. For example, the specification may say that a property is valid for all N, but TLC checks it for N=1, 2, and 3. This step is not a "proof" (it's more like a test suite), but people like me say "good enough" and ship at this point.

Lamport and colleagues have a tool called TLAPS where you can write a proof yourself (e.g., valid for all N), and the tool checks that the proof proves what it claims to prove.

The next step, which is what this paper is all about, is to derive the proof automatically.

aman-goel · on Oct 28, 2021

Here is a link to a YouTube video on this work- https://youtu.be/nPwlj6w6EXU

The tool IC3PO is open source on GitHub- https://github.com/aman-goel/ic3po

ummonk · on Oct 28, 2021

Has anyone done the same for raft consensus?

_r2gl · on Oct 28, 2021

The conclusion in the paper states it's left as future work.

"Future work also includes automatically inferring inductive proofs for other distributed protocols, such as Byzantine Paxos [15], Raft [90], etc., and exploring the verification of consensus algorithms in blockchain applications. "

https://arxiv.org/pdf/2108.08796.pdf

johnbender · on Oct 28, 2021

Raft has been manually verified which was the hurdle here that makes the result interesting:

https://github.com/uwplse/verdi-raft

baby · on Oct 28, 2021

or for byzantine-fault tolerant consensus like hotstuff?

dgan · on Oct 28, 2021

Could this be done in Isabelle? Isabelle has the ability to prove inductive statements. (Disclaimer : i am a noob reading Concrete Semantics right now)

johnisgood · on Oct 28, 2021

https://github.com/Componolit/libsparkcrypto/blob/componolit...

It makes my head spin!

jt_thurs_82 · on Oct 27, 2021

> the most foundational distributed computing protocols—known as Paxos—meets its specifications.

> Paxos is one of the most important examples of the category

>“Most—if not all—consensus algorithms fundamentally derive concepts from Paxos,” Goel said.

A lot of burying the lede here, trying to imply Paxos powers many more applications than it actually does. While Paxos is in decent widespread use, and influential, I find some of the verbiage here lying by omission.

User23 · on Oct 27, 2021

First, automatically checking Paxos is very cool. Nothing I'm saying in any way diminishes that achievement.

Next, the result is entirely unsurprising. Paxos is proved correct since its construction was formally derived[1] to be so.

Finally, this is a nice opportunity to highlight the distinction between program verification and program derivation. In general, it's considerably easier to derive a program that has a property than it is to verify that an arbitrary program has a property. For example the halting problem is uncomputable in general and yet it's easy for a competent programmer to derive loops that always terminate. This is analogous to how it's much easier to determine that some large number is the product of two primes when you yourself produced it by multiplying those same two primes than it is to factor the product when the multiplier and multiplicand are not already known.

[1] https://www.microsoft.com/en-us/research/uploads/prod/2016/1...

Jweb_Guru · on Oct 27, 2021

That seems pretty accurate, TBH. Correct distributed consensus is almost exclusively done via some flavor of Paxos.

As for the actual research, it probably needs much closer reading, but if they can automatically infer nontrivial inductive invariants that's huge for automatic verification, since this is the usual stumbling block for those kinds of tools.

m0zg · on Oct 27, 2021

Most people have never dealt with large resilient systems, and do not understand what "distributed consensus" even means, and what it enables. Hence the derisive comments on this important result.

ncmncm · on Oct 28, 2021

As I understand it, Raft is much more commonly relied upon than Paxos. Raft uses concepts from Paxos, but is not Paxos. So, a proof about Paxos is not a proof about Raft.

Jweb_Guru · on Oct 28, 2021

Raft is more or less a Paxos variant with good marketing. Sure, the proof for vanilla Paxos won't directly apply, but it's not inaccurate to say that it derives heavily from Paxos, which is all the quote above was saying. See https://emptysqua.re/blog/paxos-vs-raft/.

ncmncm · on Oct 28, 2021

...which is what I just said: the proof is about Paxos, not Raft, and so applies to relatively rarebsystems that do rely directly

ncmncm · on Oct 28, 2021

[Buggy HN.] ... relatively rare systems that do rely directly on Paxos, not Raft.

Jweb_Guru · on Oct 28, 2021

If you read the article, you can see that there are barely any material differences between Paxos and Raft. Certainly no more than there are between the description of a program as an abstract state machine and its actual implementation (and based on our other comment thread, you seem to be under the impression that there is not much difference there).

ncmncm · on Oct 28, 2021

I don't need to read the article to know that Paxos and Raft have substantial commonality. That does not make them the same. That does not make a proof about one a proof about the other. That does not make a program that implements one a program that implements the other.

An abstract state machine, such as defines a version of Paxos, maps trivially one-to-one to parts of a program that implements it. A different abstract state machine, such as one that implements Raft, maps trivially one-to-one to a different program.

You might reasonably demonstrate that two different state machines define the same system behavior, but you would need to provide that demonstration. That has not been reported.

scrubs · on Oct 27, 2021

This is what I come to HN for. Thanks for the post.

0xbadcafebee · on Oct 27, 2021

When people talk about security vulnerabilities, they focus on the most glaring holes. "Memory safety!!!!!" etc. Meanwhile, cyber criminals merrily go about their day hijacking massive computer systems without even using a memory safety vuln. Because the criminals don't give a crap what kind of holes there are - they any kind will do.

When people talk about distributed consensus algorithms, they're similarly missing the important point. A useful algorithm isn't the end-all be-all of the system. What's important is that the entire system be operated to a high degree of reliability in a way that meets the needs of a specific application. Paxos or not, if your implementation keeps falling over for reasons other than consensus, or you don't build in the tools to properly deal with the knock-on effects of consensus issues, etc, then the whole thing is going to blow up anyway.

No algorithm on its own will make a better system. You need all the other parts of the system to be just as awesome, or it falls apart just the same.

lamontcg · on Oct 28, 2021

Its important to have a distributed consensus protocol like Paxos, which has been validated, as a lego block in building any larger system.

We need a lot more of those well engineered, proven, validated lego blocks. Memory safety is a cross cutting concern which makes all those lego blocks much safer.

We are about 30 years behind the curve and have a lot of ground to make up. That doesn't make these efforts useless, that just means that we need 10 times the effort.

drivebycomment · on Oct 28, 2021

What a strawman. The article doesn't claim or imply that Paxos on its own will make a better system. You can't build secure systems without having solid building blocks and primitives. Just like proving encryption algorithms and cryptographic have functions have certain properties is important and essential, proving properties about Paxos is important and useful.

johnisgood · on Oct 28, 2021

Speaking of proving encryption algorithms and such: https://github.com/Componolit/libsparkcrypto

Ada/SPARK! They added ECDSA, ECGDSA, too. :)

jiggawatts · on Oct 28, 2021

This reminds me of a recent project where I was doing some maintenance work on a system where one component was using ZooKeeper, which uses a distributed consensus algorithm similar to Paxos.

Guess which component was the only one that failed to come back up after a simple reboot?

Turns out that ZooKeeper has default "boot" timeouts that are too short for typical JVM startup times on some hosting platforms, so it just throws up its hands and never tries again after the first attempt.

I want to know who was the person that sat down, cracked his knuckles, and then want about writing the code to deliberately add a failure mode to what is otherwise a very failure-tolerant algorithm.

dikei · on Oct 28, 2021

Well, all valid consensus algorithms are guaranteed to eventually reach agreement, but it's not possible to guarantee about deadline in an asynchronous, distributed environment.

However, in practice, we don't want our system to stall forever waiting for consensus, so a timeout is often used to terminate the algorithm.

lmm · on Oct 28, 2021

ZooKeeper is consistent at all costs and often sacrifices availability for it, by design.

jiggawatts · on Oct 28, 2021

This superficially sounds like a good design choice, but makes no sense whatsoever if you stop and think about it for a minute.

Clustering systems like ZooKeeper are pointless for scenarios where things never fail. They serve no purpose in a system that is infallible, unchanging, and perfect. They're intended for handling failures and other changes.[1]

Power outages happen. Temporary but total network outages happen, sometimes across even multi-region deployments. Cluster-wide reboots may need to occur occasionally. Startup might be slow due to a storm of other systems also starting and slowing down everything.

These scenarios have to be handled. It's not optional for any system design for failure scenarios.

VMware handled it fine. Windows Failover Clustering handled it fine. SQL Server AlwaysOn Availability Groups handled it fine. Kubernetes was back up and running before anything else.

ZooKeeper didn't. It just sat there. No nodes were down. No services had crashed. It wasn't out of disk space. No data was corrupted. No "inconsistency" of any kind had occurred.

It just decided not to start. Because it was, what? Too impatient? It saw that 60 seconds had elapsed, and apparently 61 seconds is just too long to wait, so no cluster! No! Too slow! Boot faster next time!

What if this occurs at 1am on a system without 24/7, round-the-clock support? ZooKeeper will go down, and stay down.

In this case this was a planned outage, so I had to stay up until 3:30 am troubleshooting why ZooKeeper wouldn't act like any other cluster I had ever dealt with. Fun.

[1] I got this lesson taught to me by a CommVault backup product designer. He explained that "backup products" aren't. They're recovery products, and their only purpose is to be used in a disaster. If they can't handle a disaster, they're worse than worthless, they give a false sense of confidence! CommVault was designed by people that had operated forward IT bases for the military in the Yugoslav civil war. They had to recover systems from data centres that were literally smoking craters in the ground. ZooKeeper struggles with planned reboots. It's not even vaguely in the same league of trustworthiness.

lmm · on Oct 29, 2021

> This superficially sounds like a good design choice, but makes no sense whatsoever if you stop and think about it for a minute.

Sure it does; it makes sense for systems where continuing in an inconsistent state would be worse than stopping. If your bank doesn't let people spend their money that's bad, but if it lets people spend their money twice it's disastrous.

Obviously the ideal thing is to always keep working correctly, but anyone who's had a clustered system come up in split-brain mode knows the value of a system that will sometimes stop rather than continuing.

polskibus · on Oct 28, 2021

Is etcd better than zookeeper in this regard?

jiggawatts · on Oct 28, 2021

It seems to be able to handle being stopped and started just fine, at least in my (limited) experience. I.e.: Azure Kubernetes Service (AKS) has a stop button that turns off the entire cluster. Pressing the start button "just works". No troubleshooting required.

The ability to stop and start clusters on a whim is very handy for large non-prod environments, for example.

tester756 · on Oct 27, 2021

https://xkcd.com/538/