Call me maybe: RabbitMQ (aphyr.com)
231 points by timf on June 8, 2014 | 42 comments



RabbitMQ requiring a reliable network is causing problems for us in production. Anyone else struggled with this?

We're running several clusters on different providers; one is on Digital Ocean, another on a partner's VMware vMotion-based system. The kernel gets soft lockups now and then (in the case of vMotion, when VMs are automatically migrated to other physical nodes), which causes RabbitMQ partitions. The lockups may last only a few seconds, but I've seen minute-long pauses.

When this happens, RabbitMQ starts throwing errors at clients. Queues disappear and clients can't do anything even though the local node is running. Although I understand the behaviour, that's not what I want from a queue; I want a local node to store messages until it rejoins the cluster and can dispatch them, and I want the local node to continue offering messages to those listening.

Unfortunately, RabbitMQ's design doesn't allow this: Queues live on the node they were created on, and RabbitMQ does not replicate them. We have turned on autohealing, but I don't like the fact that minority nodes simply wipe their data when they rejoin the cluster. The federation plugin doesn't look like a great solution.

I really like RabbitMQ, but maybe it's time to consider something else. Any suggestions? Something equally lightweight and "multimaster", but without the partition problem?


You need to look into using HA queues (i.e. mirroring) and ensure you are using a client that properly supports consumer cancel notifications, and that you are actually processing them correctly.
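For illustration, here's a rough sketch with the Python pika client (the queue name and the resubscribe/handle helpers are made up). Mirroring is enabled server-side with a policy, e.g.:

  rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all"}'

and the consumer opts into being cancelled on failover so it can react instead of silently hanging:

  import pika

  conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  ch = conn.channel()

  def on_cancel(method_frame):
      # The broker cancelled our consumer (e.g. the mirrored queue
      # failed over); re-declare and re-consume rather than hang.
      resubscribe()  # hypothetical recovery logic

  ch.add_on_cancel_callback(on_cancel)
  ch.basic_consume(queue="ha.tasks",
                   on_message_callback=handle,  # hypothetical handler
                   arguments={"x-cancel-on-ha-failover": True})
  ch.start_consuming()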

There is also significant TCP/IP and RabbitMQ tuning that can be done to make failover much faster.

I have used RabbitMQ for years, and though it's not perfect, if you do try something else you will quickly find that it's still far ahead of the competition.

Best of luck!


HA queues have this caveat in the documentation:

    This solution requires a RabbitMQ cluster, which means
    that it will not cope seamlessly with network partitions
    within the cluster and, for that reason, is not
    recommended for use across a WAN
And then:

    However, there is currently no way for a slave to know
    whether or not its queue contents have diverged from the
    master to which it is rejoining (this could happen
    during a network partition, for example). As such, when
    a slave rejoins a mirrored queue, it throws away any
    durable local contents it already has and starts empty.
So I don't think that's helpful to us at all.

We are going to migrate to a client that supports consumer cancel notifications, though. Thanks for the tip.


True. What I do recommend, though, is using RMQ clusters on each cloud (where networking should be a bit more reliable, the exception being AWS, which always sucks in this regard), then using federation (probably via shovel, but there are other means) to the other clouds.

Ultimately though, when you get to this stage, I question whether your app is big enough to warrant this, and suggest you use Azure Service Bus/Simple Queue Service/whatever else your providers make available.

If your business really needs such control over messaging, I understand. I have been in the situation where those easy ways out aren't available, and I know your pain. Unfortunately there is no vendor you can go to to make it go away; Tibco, Sterling etc. aren't much better than RMQ.

I wish you the best of luck with your multi-cloud federated messaging system. I do highly suggest you look at Azure Service Bus, though; I have nothing but praise for it despite being a devout RMQ zealot.


You misunderstood me, I think. Our clouds aren't connected. The partitioning problem exists within each data center (e.g., Digital Ocean).

So federation/shovel is probably not the solution.

SQS is far too simple for our needs. No idea what Azure is, but if it's SaaS the latency will likely be too high. We need local performance.


Can you point me at a good place to start for RMQ and TCP/IP stack tuning?


Sure, I would recommend the below in sysctl:

  net.ipv4.tcp_keepalive_time=5
  net.ipv4.tcp_keepalive_probes=5
  net.ipv4.tcp_keepalive_intvl=1
This will tune the TCP keepalives to decrease the time it takes for most client stacks to realize a server has gone away. It should also be configured on the servers participating in the cluster themselves. (Note that these kernel settings only affect sockets that enable SO_KEEPALIVE, which is what the {keepalive, true} option below does.)

As for RabbitMQ tuning, I recommend these settings at a minimum:

  [
   {rabbit, [{tcp_listen_options, [binary,
                                  {packet, raw},
                                  {reuseaddr, true},
                                  {backlog, 128},
                                  {nodelay, true},
                                  {exit_on_close, false},
                                  {keepalive, true}]}
            ]}
  ].


Thank you! I have some reading to do. Appreciate it!


Presumably running a cluster on a public VM hosting provider is not a recommended configuration.

Kernel lockups lasting minutes at a time seems like a serious issue a lot of software might have problems with.

Have you tried federation or shovel modes?


The Federation plugin looks interesting, but all the configuration that's necessary looks like a chore.

For one: unlike most Unix daemons, the configuration must apparently be applied as rabbitmqctl commands after the daemon has started, which is awkward to do in a Puppet environment; as far as I can tell, you can get into situations where RabbitMQ starts in clustering mode, and there is a race condition before you can apply the plugin config.
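To give an idea of what that looks like (upstream name, URI, and policy pattern are made up), the federation config is issued against a running broker along these lines:

  rabbitmqctl set_parameter federation-upstream my-upstream \
    '{"uri":"amqp://other-node.example.com"}'
  rabbitmqctl set_policy --apply-to exchanges federate-me '^federated\.' \
    '{"federation-upstream-set":"all"}'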

Secondly, the documentation is very poor. I have read everything pretty closely, but I still don't know if federation preserves the exact same behaviour, from the client's point of view, with regard to consistency, atomicity etc. What's missing is a feature matrix of how RabbitMQ behaves in the different modes.

The same goes for the Shovel plugin, except that fortunately it allows static config to be declared in rabbitmq.config. But again, because of the poor documentation, I still don't know how it would behave in practice.


Apache Kafka seems like a promising alternative, although I haven't fully evaluated it yet.


Kafka is a lower-level system than RabbitMQ. It doesn't support:

* Topic routing. With RabbitMQ you can bind queue X to exchange Y with the routing keys "foo.bar.*" and "foo.*.baz". The queue will then get all messages matching those keys. That allows for very flexible pub-sub-style routing of messages; we use this extensively. As far as I can see, it's not really possible with Kafka without message duplication. (See the sketch at the end of this comment.)

* Nacking model. Since Kafka queues are strictly linear, if a consumer fails to consume a message and wants to nack it and then process the remaining messages, it can't do so, since it has a single "read head". It would have to re-enqueue the message in that case.

* Prioritization. Not supported. You will have to create topics for different priorities and then consume those topics at a different rate depending on the priority.

* Message TTL.

And it's susceptible to network partitions, just like RabbitMQ.
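To make the routing and nacking points concrete, here is a rough sketch with the Python pika client (exchange/queue names and the process helper are made up):

  import pika

  conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
  ch = conn.channel()

  # Topic exchange: queues bind with wildcard patterns.
  # '*' matches exactly one dot-separated word, '#' matches zero or more.
  ch.exchange_declare(exchange="events", exchange_type="topic")
  ch.queue_declare(queue="billing")
  ch.queue_bind(queue="billing", exchange="events", routing_key="foo.bar.*")
  ch.queue_bind(queue="billing", exchange="events", routing_key="foo.*.baz")

  def on_message(channel, method, properties, body):
      try:
          process(body)  # hypothetical handler
          channel.basic_ack(delivery_tag=method.delivery_tag)
      except Exception:
          # Nack and requeue just this message; later messages keep
          # flowing -- there is no single "read head" as in Kafka.
          channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

  ch.basic_consume(queue="billing", on_message_callback=on_message)
  ch.start_consuming()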


I considered it, but it requires Apache ZooKeeper, which is yet another thing that needs to be configured and maintained. Not a fan of Java-based software either, to be honest.


> Not a fan of Java-based software either, to be honest.

It makes me sad to see this is still something people say. I hope you reconsider this sentiment and investigate some of the really great Java-based software out there, especially Zookeeper and Kafka.

It's worth checking out how Zookeeper and Kafka did in aphyr's testing:

http://aphyr.com/posts/291-call-me-maybe-zookeeper

http://aphyr.com/posts/293-call-me-maybe-kafka


Every Java-based backend service I have come across, be it ElasticSearch, LogStash, Hadoop or PuppetDB, has been a memory-hogging beast. Part of this is due to the GC, which tends to use more heap space than the program actually needs. Java is fast, but I have yet to see anyone claim it's lightweight.


The worst part is "let's statically preallocate* the heap size like it's 1975".

*Note: never more than 31GB
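(For context, the preallocation in question is the JVM's -Xms/-Xmx pair, e.g. "java -Xms8g -Xmx8g -jar service.jar", with a made-up jar name. The 31GB figure comes from compressed object pointers, which only work for heaps below roughly 32GB; going past that ceiling can actually increase memory use.)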


Kafka seems to suffer from a lack of partition tolerance, by the way, according to Aphyr. Not happy about the fact that it will just wipe a partition upon electing a new leader.


> Not a fan of Java-based software either, to be honest.

Yet so many of the services you use run on it.


Not sure that's an argument for anything. A lot of businesses build on crappy software.


If you didn't read the article, the first part is related to this blog post:

http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphor...

That post talks about how to build a distributed semaphore using RabbitMQ in the clustered configuration.

Aphyr's post is good, but the RabbitMQ team makes it pretty clear that this setup is not resilient to network partitions. They even mention it at the end of the blog post. It is also clear in the documentation.

https://www.rabbitmq.com/distributed.html

The big general question is: how likely are you to encounter network partitions?

Aphyr believes they are very dangerous and likely to happen to you at some point and should be taken more seriously by the programming world.

I tend to take more of an "it depends on your environment" view. Recently, with VM and cloud deployments, network partitions have become quite likely, so they should be at the top of your "things to worry about" list. But I haven't seen one happen on a local LAN. During testing I have induced partitions by hand (by pulling the network cable out of a switch), but I just haven't seen one happen otherwise. I probably got lucky. I have seen other things come up, like memory corruption and memory leaks in software; many other things worry me more than a network partition on a LAN.

All that said, it is great to see these experiments run. Please read and study all the other "Call me maybe" posts; they are just very good. It turns out most products with a "distributed" component will fall over in the face of a partition. And keep in mind that it is better to have Aphyr discover these issues than your customers ;-)


I think this is a misunderstanding in terminology. "Network partition" in this context does not necessarily mean your LAN cable is broken; it means that you have a cluster of servers that expect to communicate, and some of them cannot communicate with each other.

That could be because of a faulty network, but it could also be due to hardware failures, misconfiguration, slowness of servers (hard to distinguish from unreachable ones), and a whole host of other causes. If you have more than just a couple of servers, this will happen sooner or later.

If you only have a couple of servers, IMHO don't bother with a truly distributed system; use a database with failover to a replication slave (lazy or eager, depending on what consistency you need) and be done with it.


Aphyr (Kyle Kingsbury) & Peter Bailis have a nice survey of actual network partitions that have occurred in production systems: http://aphyr.com/posts/288-the-network-is-reliable


If I'm reading the dates right, they admitted its lack of resilience to partitions after the Aphyr blog post came out. That's a completely different thing from proactively raising the issue.

It's also not clear what usefulness, if any, a lock service has beyond a demonstration if it can't survive partitions.

Network partitions are rare on small-scale networks over short durations. But as the saying goes, if you run a billion trials, the one-in-a-million event happens a thousand times. The larger a network grows, the more likely you are to eventually experience the joy of a partition.


I'm the author of that blog post.

The blog post was published on Feb 19th. The first edit is from Feb 20th and the second from March 10th.

Aphyr's blog post is from June 6th, 2014, so clearly we made the edits well before it came out.


> they admit its lack of resilience to partitions after the Aphyr blog post came out. That's a completely different thing than proactively raising this issue.

In that blog post. This doc page was there long before.

https://www.rabbitmq.com/distributed.html

Network partition tolerance is a general configuration trade-off, not related just to that particular trick of distributed semaphores. Presumably someone who set up a clustered configuration already decided on the likelihood of experiencing a network partition and read the docs.

Now there is another issue explored, and that is tolerance to network failures in general between clients and even a single server. That is (the way I understand it) not related to the clustered or un-clustered configuration; it relates to the stability of network connections between clients and server(s).


I don't know if you read the blog post, but it demonstrates that the partition-tolerant mode from the page you just linked is not actually partition tolerant, and that RabbitMQ, even in that mode, can't be used as a lock service.

I didn't see the part of the post about connections between clients and servers, but I was reading on mobile, so maybe I just missed it.


> but it demonstrates that the partition-tolerant mode from the page you just linked is not actually partition tolerant

I was talking about the part about picking a response to a partition detection event; in this case "ignore" (see the "Recommendations" section), instead of "autoheal" or "pause_minority".
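For reference, that choice is the cluster_partition_handling setting in rabbitmq.config; a minimal example:

  [
   {rabbit, [{cluster_partition_handling, ignore}]}
  ].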


I've seen it happen; not what I'd call often, but not so rare that you're astonished by it. NICs fail, switch ports or entire switches fail; stuff like that happens from time to time.


I thought that in order to create a distributed locking system, you need to be able to reliably fence unreachable nodes. "Network Partitions" sound a lot like "Split Brain" to me. I am more familiar with traditional clustering solutions such as Pacemaker/Corosync and the GFS2 DLM than RabbitMQ, so perhaps I am missing something here?


That's more of a software design consequence than a fundamental limitation.

It's common to build fault-tolerant systems by designing a master-slave architecture, and then bolting a failover mechanism on top so that exactly one node is the master at any time. That approach suffers from precisely the problem you describe: if you can't contact a node, there's no foolproof way to ensure it isn't still acting as a master, so you risk split-brain unless you can remotely fence it.

Consensus algorithms like Paxos/Raft don't suffer from this problem; in order to make steady progress, there should be only one master, but they still operate consistently if that assumption is violated in the short term. So no fencing is needed.

A lot of people seem to have a fundamental misunderstanding that using Paxos for leader election is enough to make a system consistent, and it's emphatically not. For instance, I hope this isn't still true in current versions, but for a long time HBase was vulnerable to losing committed data because it misused Zookeeper and allowed multiple regionservers to simultaneously "own" a given table region. (I found out about that issue when it happened to me in production, so now my default assumption is that all "distributed locking" systems are broken until proven otherwise.)
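To make the fencing point concrete, here is a minimal sketch (my illustration, not how HBase actually works): have the lock service hand out a monotonically increasing token with every acquisition, and have the protected resource reject anything stale:

  # Hypothetical store guarded by fencing tokens.
  class FencedStore:
      def __init__(self):
          self.highest_token_seen = 0
          self.data = {}

      def write(self, token, key, value):
          # A "master" that was partitioned away still holds an old
          # token, so its writes are refused even though it believes
          # it is still the master.
          if token < self.highest_token_seen:
              raise RuntimeError("stale lock holder; write fenced off")
          self.highest_token_seen = token
          self.data[key] = value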


In relation to the second part of the article, regarding lost messages, it would be interesting to see the same test done with HA queues:

http://www.rabbitmq.com/ha.html

I might be wrong, but I'd suspect that it might solve the problem: while the partitioned nodes will still discard their data, it should be present on the master node.


Those tests were done with mirrored queues:

"We’ll be using durable, triple-mirrored writes across our five-node cluster"


The seriousness of this depends on the consequences of the failure and how much of the time the lock is held. If the failure causes double-charging a customer, that's likely far more serious than if it double-emails a customer. And if the lock spends most of its time not held, then a doubling of its messages can probably be spotted and fixed far more easily than if the lock spends most of its time held.


How does RabbitMQ compare with 0MQ?


They are conceptually very different.

Rabbit is a broker based on the AMQP model of exchanges, queues and bindings. Producers and consumers talk to Rabbit, which processes, stores and dispatches messages across the cluster. It supports both direct (one message goes to a single consumer) and fan-out (every message goes to all consumers) delivery, and can pick targets based on wildcard routes. Queues can be "durable", meaning Rabbit will store messages on disk atomically in case of a crash. You can restart Rabbit, as well as producers and consumers, and the queues will still be intact.

ZeroMQ is (despite the name) not a message queue in the classical sense; it's more like multicast TCP or UDP, with some message queue functionality. To start, since Zero is embedded in each program, a "queue" does not persist if a producer and consumer are not both running; any buffering occurs in the consumer, so if a producer emits messages and no consumers are running, the messages are lost. Similarly, there is no durability; all messages must be handled right away.

ZeroMQ is a low-level toolset for implementing robust messaging -- it could be used to implement a high-level broker like Rabbit, but in itself it doesn't compete directly.
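A minimal pyzmq sketch (Python; the port and payloads are arbitrary) shows the embedded nature: there is no broker process, just two peers, and anything published before a subscriber connects is simply dropped:

  import time
  import zmq

  ctx = zmq.Context()

  # Producer side: a PUB socket embedded directly in the process;
  # no broker, no persistence.
  pub = ctx.socket(zmq.PUB)
  pub.bind("tcp://*:5556")
  pub.send_string("metrics cpu=0.93")  # dropped: nobody is subscribed yet

  # Consumer side (normally a separate process).
  sub = ctx.socket(zmq.SUB)
  sub.connect("tcp://localhost:5556")
  sub.setsockopt_string(zmq.SUBSCRIBE, "metrics")
  time.sleep(0.1)  # let the subscription propagate ("slow joiner")

  pub.send_string("metrics cpu=0.41")
  print(sub.recv_string())  # only the second message arrives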


Cool. I've been bitten by RabbitMQ before. When are you going to give Cassandra the Jepsen treatment?


It already exists! Back from the 2013 series. http://aphyr.com/posts/294-call-me-maybe-cassandra/


The unfortunate reality for RabbitMQ is that about 99.99% of the problems people have with it (assuming they are using it as a queue) stem from one of the following: a) they are using a bad/incomplete client, b) they aren't using the HA features correctly, or c) they are doing all of that but aren't checking for failure states (CCN etc.) properly, so their application isn't responsive enough in failure conditions.


I don't think any of that applies here.


A bit meta: that was a really well-explained description of a non-trivial problem. I wish there were (or I knew of) a curated collection of bloggers who wrote this well about software.


Aphyr's Clojure tutorial series is one of the best programming language tutorials I've ever read. I'm definitely a fan.

http://aphyr.com/tags/Clojure-from-the-ground-up


The rest of Aphyr's stuff is just as good. My personal favourite: http://aphyr.com/posts/283-call-me-maybe-redis



