Errors in Database Systems, Eventual Consistency, and the CAP Theorem

justinsb · on April 7, 2010

I think Stonebraker is right: CAP has been dramatically over-applied.

aristus · on April 7, 2010

It's probably to do with differences in scale. At the LAN level, he's right: partitioning is rare and C/A might make the most sense. At the WAN level, partitioning is a fact of life.

There are at least two problems to solve on opposite ends of scale: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. They may be special cases of a more general problem, and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.

justinsb · on April 7, 2010

Agreed, but he's addressed WAN partitioning very well. Editing: "[Consider] a partition in a WAN network. There is enough redundancy engineered into today’s WANs that a partition is quite rare ... the most likely WAN failure is to separate a small portion of the network from the majority. In this case, the majority can continue with straightforward algorithms, and only the small portion must block. Hence, it seems unwise to give up consistency all the time in exchange for availability of a small subset of the nodes in a fairly rare scenario."

That last sentence is a very strong put-down of NoSQL; if it wasn't published on the ACM website, it should have a "zing!" at the end of it.

mmalone · on April 7, 2010

Actually he's not really addressing it at all. What stats does he have that show WAN partitions are rare? I find that short WAN partitions occur regularly. And how does the small portion "block?" How does it even know that it's the "small portion?" If a read comes into the "small portion" (maybe because the client's in the same data center) what does the small portion do? If it responds, the response may be inconsistent. If it doesn't, the data store is unavailable (at least as far as that client is concerned).

cx01 · on April 7, 2010

Paxos (which you'll probably use to implement a strongly consistent distributed system) allows nodes to reach consensus as long as a majority of them are able to talk to each other. So if a partition includes a majority of nodes, this partition will just continue working like before. The minority partition will not be able to reach any decisions at all and will be unavailable. Strictly speaking, the minority partition doesn't even know it's a minority partition, because this is an undecidable problem in asynchronous distributed systems. But the important point is: At most one partition will continue making decisions so that consistency is guaranteed.

So you're right in that the minority partition will be unavaiable, but I think it's a worthwhile tradeoff.

mmalone · on April 7, 2010

It _may_ be a worthwhile tradeoff. But if you're in three datacenters and one of them splits you could have a third of your requests fail. Depending on what you're doing, that may be unacceptable. I agree that it's a tradeoff. I disagree with the people who are saying that emphasizing consistency is always the right thing to do.

cx01 · on April 7, 2010

The problem is: If a node performs a write in an eventually-consistent data-store, the write will not be visible immediately. Alternatively (in a strongly consistent data-store) the node could just retry the write until it succeeds. From the perspective of a user that wouldn't really make much of a difference, but in the latter case there would at least be no risk of having multiple inconsistent versions of the same data.

mmalone · on April 7, 2010

Your first point is not true. In the absence of partitions, all "eventually consistent" data stores that I know of will give you strong consistency. The eventual consistency bit only comes into play if a problem occurs. (It's probably also worth noting that many RDBMS replication strategies won't give you strong consistency at all - even in ideal circumstances.)

I suppose you could retry if a write fails (e.g., can't reach a quorum), but you could theoretically end up retrying forever (and 10 seconds or so is forever as far as an interactive user is concerned)... eventually you need to either fail or write inconsistently. So we're just delaying the inevitable.

Also if you're accepting quorum writes to the "major partition" you still have to repair the "minor partition" when it comes back online. There's no traditional DB that implements the sort of read-repair/hinted-handoff/anti-entropy mechanics that Cassandra, Voldemort, and the various proprietary big data stores use.

cx01 · on April 7, 2010

The case that I have to retry indefinitely will only occur when the partition is permanent. And in that case an eventually consistent system will not help either, because clients will not see the writes (as long as the partition exists), so the eventual consistent system doesn't offer me any advantage.

mmalone · on April 7, 2010

My point was that for an interactive application it doesn't take a very long time for user experience to degrade substantially.

And we've glossed over my points about newer databases exposing these knobs to clients. It's possible to do exactly what you're describing using Cassandra, for example (er, you might have to do some consistency level tweaking to make writes fail if you can't get a quorum of authoritative nodes, but it wouldn't be hard - and I'm not even sure if that's necessary). It's not possible to do it with MySQL or PostgreSQL without building some intelligent partitioning layer on top. And that layer will probably make it impossible to do joins and add relationship constraints, so you lose any benefits these systems bring to the table.

justinsb · on April 7, 2010

Quorum protocols are the simple mechanism he's talking about. The argument he's making is that you shouldn't be so binary about saying 'well this subset of nodes is unavailable, so the system is not available'. That's why CAP is so overapplied in the real world: just because some fraction of your nodes are offline, that doesn't mean the whole system is offline.

Your sharing your statistics on WAN partitions happening regularly would be a welcome contribution to the debate. There's a hierarchy of failures, and I think it's generally accepted that WAN partitions happen less often than, say, one node in a cluster crashing. Statistics that show otherwise would let us talk in specifics rather than the abstract.

mmalone · on April 7, 2010

Sure, but quorum protocols only provide strong consistency in the absence of partitions. If a partition occurs you may not be able to get a quorum (where R + W > N) and, again, you're stuck with either being unavailable or potentially inconsistent. There's really no way around it... AFAIK it's a logical impossibility.

I'm not sure I get your argument re: CAP being overapplied. The key point the whole "AP" camp is making is exactly what you're saying - "just because some fraction of your nodes are offline, that doesn't mean the whole system is offline." What it does mean, though, is that some of your data may be stale. But eventually it won't be.

As for WAN partitions, I agree, they're not as frequent as single node failures. But as far as CAP is concerned it doesn't really matter. A partition is a partition, whether it's one node or half your cluster. The frequency that "WAN partitions" occurs depends on how you define a "WAN partition." If you consider a single lost TCP connection a short-lived partition (it pretty much is), or if you consider a DNS or power outage a WAN partition (in the sense that a whole cluster might disappear) then I think we can all come up with lots of ways WAN partitions can and do occur. I do agree that the entire Internet doesn't go down very often.

justinsb · on April 7, 2010

You choose your quorum trading off cost/complexity vs risk-tolerance. You ensure that not forming a quorum is impossible in scenarios that you care about. e.g. You may decide it's OK not to form a quorum if the entire USA power grid goes offline.

The broad problem is that you're trying to apply the mathematical proof of the CAP theorem to the real world. For example, the proof of the CAP theorem treats single-node failures as a case of network partitioning, which is logically elegant. But in the real world, it's just not realistic to consider a dropped TCP connection as equivalent to the failure of a datacenter, as you seem to be doing.

mmalone · on April 7, 2010

Er, no. I'm just not differentiating between the various reasons why a single node may be unavailable. It doesn't really matter _why_ the node is unavailable... it just is.

FWIW, databases like Cassandra expose the consistency tradeoff to the client. You can do quorum reads/writes with Cassandra. You can't with MySQL or PostgreSQL.

Edit: you can choose between quorum reads/writes and stronger or weaker consistency levels with Cassandra, but can't with MySQL / PostgreSQL.

justinsb · on April 7, 2010

I'm not going to treat a cosmic ray corrupting one single network packet the same way I treat a hurricane cutting off power to a datacenter for 2 weeks. I do see the intellectual appeal in doing so, but we'll just have to agree to disagree!

mmalone · on April 7, 2010

Uhm, of course they're not the same thing... but they have the same effect. The point is that the system remains available even if a node becomes unavailable for _whatever_ reason. I'm not sure what you're disagreeing on... There are common and uncommon modes of failure. Of course we should prioritize handling the common ones. But if we can handle all of them at once that's ideal. And, as I said in an earlier comment, when you're doing a million operations a second, failures that are one-in-a-million happen every second.

strlen · on April 7, 2010

It's an interesting article, as fundamentally sound systems have been built which are "CA" within a single datacenter. BigTable is the premier example.

There are however several issues with this. First, eventual consistency means eventual consistency of data on the replicas; it doesn't mean that a client can't be presented with a consistent view. Dynamo-pattern systems use quorums to provide this. Even when a quorums aren't used (R + W < N), at least with Voldemort, the weak (no read-your-writes) form of eventual consistency is still only a failure condition: when the "coordinator" (first in the preference list) node for the specific key fails after a write has happened, but before the next read is made.

Second, a CA system means that the core switch (and there may be multiple within a datacenter, especially in the case of startups leasing colocation space or using a managed/"cloud" hosting provider) is now a single point of failure. The way to build around this is by building an "AP" layer on top of the "CA" layer, spanning multiple core switches. This is similar to Yahoo's PNUTS and Google's Spanner. Both of these had been multi-year projects, by companies with expertise in distributed computing, with very specific and limited requirements (i.e., they were building this for themselves, not selling them as general purpose solution). Which brings me to the next point:

"CA" consistency is much more difficult (that is, error prone) to implement than "AP" consistency. Two phase commit is one way to do so, but doesn't provide fault tolerance (it won't withstand the failure of the coordinator/master node). Paxos is one way to do so, but a high performance Paxos implementation requires leases and is still very tricky to implement. Again, it took years for Google to build, trouble shoot and perfect the Chubby/GFS/BigTable stack; the first version did not have a fault tolerant master and the query model is still much simpler than SQL.

That's why I am skeptical when people claim to be able to deliver to market (within months, not years) a commercial solution that provides strong consistency, fault tolerance, horizontal scalability (without a hard upper limit), supports multiple datacenters and still allows execution of SQL queries (even if without certain types of JOINs) with OLTP-suitable performance. That's not to say it's logically impossible, it's just a very bold claim to make.

kennu · on April 7, 2010

I always figured that the C in CAP is more about scalability. If you keep adding servers indefinitely, it becomes slower and slower to commit every transaction to every server. But if you ditch the C, you can write fully in parallel to all servers.

cx01 · on April 7, 2010

ACID transactions require consistency (C in CAP) in order to guarantee atomicity. So if you ditch C, you'll just be unable to perform transactions at all.

In general, weakening the consistency guarantees is not done to improve scalability, but rather to improve the partition tolerance. There are replication techniques (e.g. chain replication) that allow very high performance without sacrificing consistency.

kingkilr · on April 7, 2010

The C in CAP and the C in ACID are not the same C. Consistency in the ACID sense means that all constraints are met, Consistency in the CAP sense means all servers have the same data.

cx01 · on April 7, 2010

I know. I actually explained this not too long ago here: http://news.ycombinator.com/item?id=1191003

But what I'm saying here is: You cannot have ACID transactions without the C in CAP.

dm_mongodb · on April 7, 2010

fwiw, here is the formal version of the theorem from the gilbert/lynch proof paper:

Theorem 1. It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: * Availability, * Atomic consistency, in all fair executions (including those in which messages are lost).

kennu · on April 7, 2010

Without knowing much about it, I can imagine that chain replication could reduce the problem of copying data to every server from O(n) to something like O(log n).

But can it also manage simultaneous writes (of possibly conflicting data) to multiple masters, to achieve write scalability? Won't it get slowed down, checking every master for conflicts before committing?

cx01 · on April 7, 2010

You do not copy data to all servers; in practice you may choose a replication degree of 3, so all data is written to 3 servers. In chain replication the first server in the chain will decide on the order of writes (i.e. he will serialize the writes), which is very fast. In a P2P system you'll need to do some kind of distributed consensus for every write, which (when conflicts happen) will be much slower than having one node serialize the writes.

Having multiple masters will most likely add overhead for conflict resolution and should be avoided, but it's really orthogonal to the whole CAP stuff. If you allow writes to multiple masters without conflict resolution at write-time, then you will need to do the conflict-resolution at read-time, which is going to be much more complicated, because your read may yield multiple versions.

kennu · on April 7, 2010

That was pretty much my original point about giving up consistency to achieve (write) scalability. E.g. CouchDB is based on that principle: You can write to all masters in parallel and the system provides a mechanism for resolving the conflicts afterwards.

wmf · on April 7, 2010

If you keep adding servers indefinitely, it becomes slower and slower to commit every transaction to every server.

The point of database partitioning is that a transaction that doesn't span partitions only touches a constant number of servers (usually 2 or 3).

kennu · on April 7, 2010

I would interpret that as a compromise where C, A and P are divided into some ratio (you still don't get 100% of each). I understand Cassandra does that well if the application fits the model.

mmalone · on April 7, 2010

I posted this comment to the ACM article, but they're pre-moderating and apparently someone's shirking their responsibilities... so I'm reposting here. Hope you guys don't mind:

Hey Michael,

Interesting read. I'm with you on most points. I do have a couple comments though...

Failure mode 1 is an application error.

Failure mode 2 should result in a failed write. It shouldn't be too hard to trap errors programmatically and handle them intelligently / not propagate them. Of course the devil's in the details and hardware / interpreter / other problems in components that are outside of the DBs control can make things more difficult. These are the sorts of issues that tend to stick around until a system is "battle tested" and run in a couple of large / high volume operations.

Failure modes 3, 4, 5, 6 (partition in a local cluster) - this seems to be where the meat of your argument is... but you totally gloss over your solution. I'm not sure I believe that network partitions (even single node failures) are easily survived by lots of algorithms... Or, more specifically, I don't believe that they can be survived while maintaining consistency (in the CAP sense, not the ACID sense). I threw together a super simplified "proof" of why consistency is essentially impossible in this situation in a recent talk. See http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrel... - slides 16 through 20. What algorithms are there to get around this? If a replica is partitioned you either can't replicate to it and have to fail (unavailable) or you can't replicate to it and succeed anyways (replica is inconsistent).

I also don't buy the argument that partitions (LAN or WAN) are rare and therefore we shouldn't worry about them. For a small operation this may be true, but when you're doing a million operations a second then a one-in-a-million failure scenario will happen every second.

Failure mode 7 will probably result in some data loss unless (as you mention) you're willing to live with the latency of waiting for durable multi-datacenter writes to occur. But having that option is definitely nice, and that's a trade off that I'd like to be able to make on a per-write basis. I may choose to accept that latency when I'm recording a large financial transaction, for example. Another thought related to this issue - in a lot of ways writing something to memory on multiple nodes is more "durable" than writing it to disk on one. So you may be able to do multi-DC replicated writes in memory with tolerable latency assuming your DCs are close enough that the speed of light isn't limiting. That should get you durability up to the point where the entire eastern seaboard disappears, at least.

Failure mode 8 is another core issue that I think you're glossing over. WAN failures (particularly short-lived ones) can and do happen on a regular basis. It's true that routing issues are typically resolved quickly, but it's another law-of-large-numbers thing. Amazon AWS had an issue that took an entire data center offline a while back, for example. Shit happens. In CAP terms this is really the same thing as a failure modes 3, 4, 5, 6, and 7 though. So the same arguments apply. Re: your argument that only a small segment splits - what happens when a read comes into the small split segment (maybe from a client in the same datacenter)? If data has been updated on the larger segment it couldn't have been replicated, so again you're either serving stale data or your data store is unavailable.

Thanks for putting this together, it was an interesting read. Looking forward to hearing more about some of these issues!