Paxos Made Moderately Complex (paxos.systems)
85 points by sargun on Aug 26, 2015 | 29 comments



Paxos is one of the simplest distributed algorithms out there.

For a proposal i with value x:

1. Ask everyone not to participate in proposals older than i, get a majority to agree and send back any previously accepted values.

2. Ask everyone to accept proposal i with the newest previously accepted value or x if none, get a majority to accept.

- On failure, restart with i > max(previous proposals)

That is the Paxos algorithm.
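In code, the two rounds fit on a page. Here's a minimal in-process sketch in Python (my own names and message shapes; no networking, retries, or persistence, so take it as an illustration rather than an implementation):

    # Minimal single-decree Paxos sketch, in-process only.
    # State and message names roughly follow "Paxos Made Simple".

    class Acceptor:
        def __init__(self):
            self.promised_n = -1      # highest proposal number promised
            self.accepted_n = -1      # highest proposal number accepted
            self.accepted_v = None    # value accepted with accepted_n

        def prepare(self, n):
            """Round 1: promise to ignore proposals numbered below n."""
            if n > self.promised_n:
                self.promised_n = n
                return ("promise", self.accepted_n, self.accepted_v)
            return ("nack", self.promised_n, None)

        def accept(self, n, v):
            """Round 2: accept unless a higher-numbered prepare was promised."""
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted_n, self.accepted_v = n, v
                return "accepted"
            return "nack"

    def propose(acceptors, n, x):
        """Run proposal n with value x; return the chosen value, or None to retry."""
        majority = len(acceptors) // 2 + 1

        # Round 1: collect promises from a majority.
        promises = [p for p in (a.prepare(n) for a in acceptors) if p[0] == "promise"]
        if len(promises) < majority:
            return None               # retry later with a larger n

        # Use the newest previously accepted value, or x if none.
        newest = max(promises, key=lambda p: p[1])
        value = newest[2] if newest[1] >= 0 else x

        # Round 2: get a majority to accept (n, value).
        acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
        return value if acks >= majority else None

    if __name__ == "__main__":
        acceptors = [Acceptor() for _ in range(3)]
        print(propose(acceptors, n=1, x="A"))   # -> A (chosen)
        print(propose(acceptors, n=2, x="B"))   # -> A (the chosen value sticks)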

Its fault-tolerance comes from the fact that you can tolerate losing the minority (1 out of 3 / 2 out of 5). There's no magic. If the majority fails, Paxos blocks. If two proposers keep blocking each other's proposals by going through round 1, Paxos can take forever.

The goal of Paxos is to agree on 1 value. Once consensus is reached (majority agrees), the value can never be changed, since no one can complete round 1 without getting back the newest accepted value from at least one node. However, you can run Paxos again and again to agree on a sequence of values (e.g. transitions in a state machine). This is Multi-Paxos.
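One way to picture that (my own sketch, not a prescribed design): a replicated log where each slot is decided by its own single-decree Paxos instance, and the state machine replays slots in order. run_paxos below is a placeholder for the full two-round protocol above.

    # Hypothetical sketch: Multi-Paxos as one single-decree instance per log slot.

    def run_paxos(slot, value):
        # Placeholder: in a real system this runs rounds 1 and 2 among a majority
        # of acceptors for this slot and returns whatever value was chosen there.
        return value

    class ReplicatedLog:
        def __init__(self):
            self.log = {}        # slot number -> chosen value
            self.next_slot = 0

        def append(self, command):
            """Propose `command` for successive slots until it is the chosen value."""
            while True:
                slot, self.next_slot = self.next_slot, self.next_slot + 1
                chosen = run_paxos(slot, command)
                self.log[slot] = chosen
                if chosen == command:
                    return slot   # our command owns this slot

        def apply_in_order(self, apply_fn):
            """Replay chosen commands in slot order against the state machine."""
            for slot in sorted(self.log):
                apply_fn(self.log[slot])

With a stable leader you can also avoid repeating round 1 for every slot, which is one of the optimizations the next paragraph is about.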

The Paxos algorithm is optimal, but Multi-Paxos allows for a lot of optimizations and there is no clearly defined implementation. A pretty typical optimization is to elect a leader (through Paxos) which makes the proposals. This avoids duelling proposers and ensures you have a node that is always aware of the latest accepted value. Making this optimization requires you to make decisions that fall outside the scope of Paxos, such as how long a node remains leader, what happens if the leader fails, and what happens if two nodes think they are the leader. These optimizations can easily cause you to lose the fault-tolerance and consistency guarantees that Paxos gives you within the scope of reaching consensus on a single value.

It's important to remember that Paxos itself is easy to understand, and a lot of the problems in Multi-Paxos can be solved by running Paxos. You need to familiarize yourself with it as though it were a programming construct.


I agree that Paxos is much simpler than many people apparently think. It certainly didn't help that the original paper is almost completely useless for learning Paxos.

I highly recommend that anyone interested in learning Paxos study this pseudocode from MIT's distributed systems class; it was the single thing that helped me understand Paxos the most:

http://nil.csail.mit.edu/6.824/2015/notes/paxos-code.html


In step 2 you write that you "Ask everyone to accept proposal i", but to satisfy the safety property of Paxos you can only accept proposals from round "i" if you were a participant of round "i" in step 1.


Ah true, I probably should have written "all round 1 participants". It can be enforced by the receiver as well. There are some other implied receiver-side steps that I did not write down, such as rejecting all proposals with a sequence number < i.
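Spelled out against the acceptor in my sketch above, those receiver-side guards are just comparisons with the highest promised number (field names are illustrative):

    # Receiver-side checks: ignore round 1 requests at or below the promised
    # number, and reject round 2 requests below it.

    def on_prepare(acceptor, n):
        if n <= acceptor.promised_n:
            return "nack"             # stale or duplicate prepare
        acceptor.promised_n = n
        return ("promise", acceptor.accepted_n, acceptor.accepted_v)

    def on_accept(acceptor, n, v):
        if n < acceptor.promised_n:
            return "nack"             # the proposer was superseded after round 1
        acceptor.promised_n = n
        acceptor.accepted_n, acceptor.accepted_v = n, v
        return "accepted"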


At my previous employer, I worked on Arakoon, a consistent distributed key-value store based on Multi-Paxos, implemented in OCaml, using a fork of TokyoCabinet as the local storage mechanism.

Implementing the protocol in an industrial setting brings up lots of ... interesting design and engineering issues for sure!

See https://github.com/Incubaid/arakoon


There is a popular alternative as well, Raft. https://ramcloud.stanford.edu/wiki/download/attachments/1137...

Has anyone used both? What are the pros and cons?


Raft has a dirty little secret: your cluster breaks when two nodes become partitioned from each other (but not the rest of the cluster) and one of them is the leader.

Scenario: nodes {1, 2, 3}, where 2 and 3 are partitioned from each other and 3 is the current leader.

Node 2 fires its leader election timer and broadcasts RequestVotes to 1 and 3, but only 1 gets it; now 3 is ignored for the term of its leadership and 2 is the only node capable of quorum. In a decent implementation, 3 will be nack'd the next time it sends an AppendEntries to 1, and 1 will pass along the current term. 3 isn't hearing from the leader, so it broadcasts RequestVotes (either failing once due to not having heard the current term from 1, or jumping straight into the next term, which actually increases the rate of cluster livelock). Leadership bounces back and forth rapidly, making your cluster worthless.
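For what it's worth, here's a toy round-based Python trace of that bounce. It's my own simplification (identical logs; only terms, votes and heartbeats are modeled), so it illustrates the failure mode rather than real Raft:

    # Three nodes, the 2 <-> 3 link cut, node 3 the initial leader.

    REACHABLE = {(1, 2), (2, 1), (1, 3), (3, 1)}

    class Node:
        def __init__(self, nid):
            self.id = nid
            self.term = 1
            self.leader = None

    def run_election(candidate, nodes):
        """Candidate bumps its term and asks every reachable peer for a vote."""
        candidate.term += 1
        votes = 1                                   # votes for itself
        for n in nodes.values():
            if n is candidate or (candidate.id, n.id) not in REACHABLE:
                continue
            if candidate.term > n.term:             # logs are equal, so term decides
                n.term = candidate.term
                n.leader = None
                votes += 1
        won = votes > len(nodes) // 2
        if won:
            for n in nodes.values():                # heartbeats to reachable peers
                if n is candidate or (candidate.id, n.id) in REACHABLE:
                    n.leader = candidate.id
        return won

    nodes = {i: Node(i) for i in (1, 2, 3)}
    nodes[1].leader = nodes[3].leader = 3           # node 3 starts as leader

    for _ in range(6):
        # Whichever of 2 and 3 cannot hear the current leader times out first.
        isolated = nodes[2] if nodes[1].leader != 2 else nodes[3]
        won = run_election(isolated, nodes)
        print(f"node {isolated.id} runs an election in term {isolated.term}: "
              f"{'wins' if won else 'loses (stale term)'}")

Each pass through the loop, the node that can't hear the leader either wins by bumping the term through node 1 or burns a term catching up, which is exactly the back-and-forth described above.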

The current hotness is spec paxos, which gets the positive trade-offs of both cheap paxos and fast paxos, if you're able to actually implement it in your DC (has networking assumptions).


How's that a "dirty little secret"? Raft not supporting asymmetric partitions is somewhere on the third line of the original paper.

Besides that, I genuinely wonder how often this happens in the real world. I can see a typical case where part of your cluster is behind a NAT. Otherwise, what could cause an asymmetric partition?


That sounds easy, actually. Take a cloud provider like AWS. They divide their regions into "availability zones", which are supposedly in different buildings, don't share the same power, internet connectivity, etc. Say you have 3 AZs, and you put a node in each. A simple network connectivity issue between AZ 1 and 2 (but not between 1 and 3 or 2 and 3) would cause this scenario to happen, assuming I'm understanding this correctly.


Consider a three-location network with routes between each pair of locations. Then the link between two of them fails, with no mechanisms in place to automatically re-route.

That's not such an uncommon situation. E.g. quite a few large to mid-sized companies with servers in three locations do not have the in-house skillset to get BGP set up and/or set up VPNs between the locations, with a means to update routes automatically on outages. Some larger shops will have "metro LAN" type setups with separate ports for dedicated connections between racks in different data centres, and won't even have enough capacity on their public ports to handle a failover of the traffic that normally goes over the private connection between two of the data centres - the expectation is often that the data centre operator will handle redundancy... (until they don't...)

It's not a terribly hard thing to fix, but it's also something few people seem to think about until they reach a much larger size.


> The current hotness is spec paxos, which gets the positive trade-offs of both cheap paxos and fast paxos, if you're able to actually implement it in your DC (has networking assumptions).

Sounds interesting. Is "spec paxos" like Spanner in requiring special hardware? If you could share some links or references to more information about it, that would be great, thanks.



I always considered Mencius to be the "next hotness in consensus algorithms".


Raft is 'simpler' than Paxos because it's not only a consensus protocol; it includes log management, etc. On the other hand, Raft could be (with a stretch) considered one of the protocols in the Paxos family, whilst other family members have their specific strengths (and complexities), e.g. being more efficient in WAN networks.

If you're interested in Raft, take a look at Kontiki (https://github.com/NicolasT/kontiki). Could use some maintenance though...


Paxos is poorly explained in the multi-decree setting, because the explanation isn't particularly opinionated about how SMR and logging are done. Raft only has a multi-decree variant, and it has opinions on how to do leader election, etc.

Fundamentally, both protocols (multi-decree Paxos, i.e. Multi-Paxos, and Raft) are equivalent in performance and capability. There are other variants of Paxos (ePaxos) that have advantages over Paxos in some cases, especially in the WAN.


Here's a Paxos implementation in C++ that I wrote for my old startup's distributed database product:

https://github.com/scalien/scaliendb/tree/master/src/Framewo...

If you just want to agree on a leader, there's PaxosLease:

https://github.com/scalien/scaliendb/tree/master/src/Framewo...


What was Scalien, and its purpose?


Scalien was my startup, its purpose was to sell ScalienDB and make money :)


Why didn't it work out?


1. Our selling point was: "there's a bunch of others like MongoDB out there, but they're eventually consistent; if you really care about your data, use ScalienDB, it uses Paxos and there are no consistency issues". But (i) EC/Mongo/etc. were so hyped that our message wasn't heard, (ii) people didn't really understand what we were saying, and (iii) others were making false claims in their marketing that their product could also be tuned to be consistent. We didn't really understand how to market this. Contrast this with the excellent execution of Mongo.

2. We got foobared by potential investors: we focused on one group, but they eventually walked away, and by that time we had run out of money. The other side of this was long enterprise deal lifecycles. Convincing somebody who is big enough to pay for DB software to use your alpha product takes 6-12 months, which is the same timescale on which you run out of money.

3. We did have some bugs, so we were claiming consistency/reliability, but it was an alpha product with bugs. (Of course it was: it was just the two of us writing it, and we didn't even have money to buy testing infrastructure!)

In the end we quit after ~3.5 years, after we had exhausted all options, at great personal (but not monetary) cost; e.g. I got divorced shortly after. Fortunately we were able to exit gracefully from all business affairs.


Why list all of those variants without referring to Byzantine Paxos at all? Byzantine failure covers everything except a clean end of participation; perhaps it's worth mentioning the flavor of Paxos that can deal with it.


Author of the website here: Byzantine Paxos is fundamentally a very different protocol from Multi-Paxos because of the failure model it supports. The variants I included are all variants of the Multi-Paxos protocol, so even though Byzantine Paxos includes Paxos in its name, it was not in the scope of this work. I may extend the content of the website to cover all protocols related to Paxos, underline how they are related and how they differ, and include Byzantine Paxos in that discussion.


So, Byzantine failure is neat, but I've never actually seen a deployed system that implements true Byzantine Paxos; I'm really not sure what kind of production system would implement it.


Nitpick: Paxos is a consensus protocol, not "a protocol for state machine replication in an asynchronous environment that admits crash failures".

Consensus can be used for SMR, atomic broadcast, and leader election. One can argue that all of these replicate a sort of state machine. However, that dilutes the meaning of state machine replication -- replication of state with durability and performance in mind.


Nice writeup. Complex protocols like this always make me wonder: how do people even test them?


They have these things called "proofs": http://cstheory.stackexchange.com/questions/25851/correctnes...

Jokes aside, there are two ways out: huge simulators or writing very very simple code that matches 1-to-1 the steps of the proof. And Lamport's writing style for proofs [1] lends itself to simple matching implementations.

[1] "How to Write a 21st Century Proof" http://research.microsoft.com/en-us/um/people/lamport/pubs/p...


Jepsen, QuickCheck, interleaving testers like Concuerror; or by implementing it directly in a proof assistant/language like Coq, as was recently done with Raft.


TLA+ model checking and an implementation that follows the spec as closely as possible.


http://harry.me/blog/2014/12/27/neat-algorithms-paxos/ is quite a good article for an introduction to Paxos.

Last time on HN: https://news.ycombinator.com/item?id=8806835



