I make systems performant and reliable by eliminating their lock-free queues.
To be more precise... I like to use single-writer ring buffers. (Technically you could say they are a sort of lock-free queue. But there is little to go wrong in a ring buffer.) Mainly, I reduce coupling enough that taking the occasional mutex is not a noticeable drain. Explicit mutex locking is less prone to obscure heisenbugs than lock-free techniques.
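For concreteness, here is a minimal sketch of the kind of single-writer (single-producer/single-consumer) ring buffer being described; the class and names are illustrative, not the commenter's actual code:

```
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Single-producer / single-consumer ring buffer. The only shared state is two
// monotonically increasing indices, each written by exactly one thread.
template <typename T, size_t kCapacity>   // kCapacity must be a power of two
class SpscRing {
    static_assert((kCapacity & (kCapacity - 1)) == 0, "capacity must be a power of two");
public:
    bool try_push(const T& v) {                     // called only by the writer thread
        uint64_t w = write_.load(std::memory_order_relaxed);
        uint64_t r = read_.load(std::memory_order_acquire);
        if (w - r == kCapacity) return false;       // full
        buf_[w & (kCapacity - 1)] = v;
        write_.store(w + 1, std::memory_order_release);   // publish the slot
        return true;
    }
    std::optional<T> try_pop() {                    // called only by the reader thread
        uint64_t r = read_.load(std::memory_order_relaxed);
        uint64_t w = write_.load(std::memory_order_acquire);
        if (r == w) return std::nullopt;            // empty
        T v = buf_[r & (kCapacity - 1)];
        read_.store(r + 1, std::memory_order_release);    // free the slot
        return v;
    }
private:
    alignas(64) std::atomic<uint64_t> write_{0};    // owned by the producer
    alignas(64) std::atomic<uint64_t> read_{0};     // owned by the consumer
    T buf_[kCapacity];
};
```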
What I rip out are queues treated as if "lock-free" coupling were free. It's not free; it is at least 50 cycles, each interaction. Nowadays you can get a hell of a lot of work done in 50 cycles. Each "lock-free" operation engages bus-level interactions more like a mutex than not, so it doesn't buy what you hope.
To reduce coupling, you increase batching. Instead of one event per interaction, do ten, or fifty.
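As a sketch of what that batching can look like (illustrative, not the commenter's code): the consumer reads the producer's published index once, handles everything visible, and publishes its own index once, so a single pair of cache-line interactions covers the whole batch.

```
#include <atomic>
#include <cstddef>
#include <cstdint>

// Drain everything currently visible in one pass. One acquire-load of the
// producer's index and one release-store of the consumer's index cover the
// whole batch, instead of one pair of synchronizations per event.
// N must be a power of two; call only from the consumer thread.
template <typename T, size_t N, typename Fn>
size_t drain_batch(std::atomic<uint64_t>& write_idx,
                   std::atomic<uint64_t>& read_idx,
                   T (&buf)[N], Fn&& handle) {
    uint64_t r = read_idx.load(std::memory_order_relaxed);
    uint64_t w = write_idx.load(std::memory_order_acquire);  // one sync point
    for (uint64_t i = r; i != w; ++i)
        handle(buf[i & (N - 1)]);
    read_idx.store(w, std::memory_order_release);            // one sync point
    return static_cast<size_t>(w - r);
}
```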
Most "lock-free" implementations you find are not portable off of x86 systems, which have a much more forgiving memory model than, say, ARM. You think you were careful, but the system is better at finding your mistakes than you are.
It's an interesting point - "lock free" then is actually more about consistently good performance than theoretically maximising performance. If you never have to wait for a lock then you know exactly how long your process will take (as long as it isn't CPU starved). As soon as there's a lock, it's going to become unpredictable: when there's a lot of contention you'll wait a long time, and when there isn't you won't have to wait at all. And best of all, with lock-free you shouldn't ever get deadlocks.
This property in turn could get you better performance in the end because you may not have to size everything for such a big range of contention, so you can optimize utilization of the hardware more without risking things hitting a peak and experiencing severe degradation.
> If you never have to wait for a lock then you know exactly how long your process will take
I think you may be confusing lock-free with wait-free here. I believe it's wait-free that gives you a bound on the number of steps the process can take to complete an operation, but I'm not an expert.
Yes, wait-free is a more precise description of single-writer ring buffer performance.
Lock-free queues are not fixed-time, because operations on them involve negotiations on the bus to reclaim the cache line involved, often behind other queued operations.
With the ring buffer, the writer never gives up ownership of the head cache line, and the next one can be pre-fetched (speculatively) because the writes are walking memory sequentially. There is a little hiccup when you wrap around, so it's not absolutely fixed-time.
I put my ring buffers on hugepages so I am not competing for page-map TLB cache.
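On Linux that placement can be done roughly like this (a sketch; it assumes hugepages have been reserved on the machine, and real code would want a fallback):

```
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Back the ring buffer with 2 MiB hugepages so the hot data needs far fewer
// TLB entries. Requires reserved hugepages, e.g.:
//   echo 64 > /proc/sys/vm/nr_hugepages
// 'bytes' should be a multiple of the hugepage size.
void* alloc_ring_storage(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return nullptr;   // real code would fall back to normal pages
    }
    return p;
}
```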
For anyone curious about the sense in which ncmmcm's ring buffers aren't lock-free data structures, we had some discussion about their specific details earlier at https://news.ycombinator.com/item?id=25264023
Maybe he means cases where a lock-free queue can't implement batching easily. Batching by making the messages themselves contain batches of messages makes them variable-sized, which can cause less efficient allocations etc., whereas with a lock you can hold the lock, insert N same-sized messages, and release the lock.
If it is done as a linked list there may be good multiple insert options, but then you have a linked list and all the extra jumping around memory that involves compared to a ring buffer.
Perhaps the author is referring to a technique such as how a cache can be made concurrent [1]. In that case every read is a write, as LRU requires moving entries. Since an LRU is often implemented as a doubly-linked list, it either is a very complex lock-free algorithm or performed under a lock. The lock-free approach doesn't scale because all the recent entries move to the MRU position (tail), so there is contention.
The approach used in that article is to publish the events into multi-producer/single-consumer queues. When a queue fills, the batch can be replayed under a lock to catch up the LRU. This way the lock is used to ensure a single writer, but it does not suffer lock contention. The ring buffer is lock-free, but the cache itself is not. Lock-free can suffer contention or be more expensive algorithmically than doing the work under a lock, so whichever strategy is chosen requires careful consideration of the system's performance.
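A rough sketch of that shape (hypothetical names, and a thread-local buffer standing in for the article's multi-producer queues): reads only record which key they touched, and the LRU list is reordered in batches under the lock, so the lock is taken once per batch rather than once per hit.

```
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Lock amortization for an LRU cache, heavily simplified: readers buffer
// their "touches" and the doubly-linked LRU list is only updated in batches
// under the mutex.
class AmortizedLru {
public:
    void record_hit(const std::string& key) {
        // Simplified: one buffer per thread (shared across instances); the
        // real design uses per-cache, multi-producer ring buffers.
        thread_local std::vector<std::string> pending;
        pending.push_back(key);
        if (pending.size() >= kBatch) {
            std::lock_guard<std::mutex> g(lock_);   // one lock per kBatch hits
            for (const auto& k : pending) touch_locked(k);
            pending.clear();
        }
    }
private:
    static constexpr std::size_t kBatch = 64;
    void touch_locked(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return;
        lru_.splice(lru_.end(), lru_, it->second);  // move entry to MRU (back)
    }
    std::mutex lock_;
    std::list<std::string> lru_;   // front = LRU victim, back = MRU
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```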
Exactly the point: the goal of batching is to amortize the cost of synchronization over a large number of events. With enough events to amortize over, the cost of synchronization drops into the noise. Then, you are better off avoiding bug-prone lock-free operations in favor of explicit lock operations that actual humans can reason about.
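In code, the trade being described looks something like this (my sketch, with made-up names): take the mutex once per batch and move everything, so the per-event cost of the lock is divided by the batch size.

```
#include <deque>
#include <mutex>
#include <vector>

std::mutex mu;
std::deque<int> shared_queue;

// One lock acquisition covers the whole batch; at, say, 50 events per call
// the per-event cost of the mutex is negligible, and the code stays easy to
// reason about.
void publish_batch(std::vector<int>& local_batch) {
    std::lock_guard<std::mutex> g(mu);
    shared_queue.insert(shared_queue.end(), local_batch.begin(), local_batch.end());
    local_batch.clear();
}
```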
This is known as BP-Wrapper, a batching and prefetching technique whose paper also coined the term lock amortization. It is sometimes confused with flat combining, which came out a little later and is a slightly different concept.
Amortizing batched operations across occasional overhead goes back to the mistiest origins of computation, and to industrial operations research before that. The very first brick-masons carried a hod of bricks to the top of their scaffold and then placed them before going back for more.
I promise you that practically nobody doing it has ever heard of "BP-Wrapper", or of flat-combining.
In the context of concurrent programming they are well known. We are not discussing how to program a PDP-8 but how to efficiently design multithreaded algorithms. The article recommends Nir Shavit's book; Shavit has written many well-known papers, including the one on flat combining.
> Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again.
Another possibility is that the thread dies while holding a mutex (or exclusively possessing a resource). It may seem like a stretch for shared-memory systems, but not so much for distributed systems. Recovering the resource is not an easy task, especially considering that the recovering threads may also die, or that the thread assumed dead may suddenly come back to life. In my opinion, making a parallel system robust to that kind of malfunction goes far beyond what the industry considers acceptable as lock-free algorithms.
I gave some thought to that scenario in my project Robusta [1]. I doubt it could be useful for any practical purpose but at least it works for a universal computational model.
> In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of “locking up” the entire application in some way, whether it’s deadlock, livelock – or even due to hypothetical thread scheduling decisions made by your worst enemy. That last point sounds funny, but it’s key. Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again.
> [...]
> A final precision: Operations that are designed to block do not disqualify the algorithm. For example, a queue’s pop operation may intentionally block when the queue is empty. The remaining codepaths can still be considered lock-free.
I don't get it; how could "locking a mutex" not be considered an "operation that is designed to block"?
Would somebody be kind enough to tell me what I'm missing here? Thanks!
> I don't get it; how could "locking a mutex" not be considered an "operation that is designed to block"?
That's not what is being claimed here. What is being said is that an algorithm that blocks a thread can still be considered lock-free. There's a difference between lock-freedom and wait-freedom.
The basic differentiator is that a locking algorithm has the potential to keep executing without making progress. Lock-free code guarantees that, as long as threads keep running, at least one of them makes progress.
The queue implementation that blocks on an empty queue is fine if inserting into the queue is still lock-free, i.e., your insert operation eventually completes regardless of whether a reader is spinning on the lock that allows it to pop a value from the queue.
That means that while popping a value can block, inserting does not. Hence the blocking pop does not turn the whole queue into a locking algorithm.
I think it's saying that not _all_ blocking operations are inherently disqualifying. That doesn't mean every blocking operation (like acquiring a shared mutex) is allowed, though.
> your worst enemy could simply never schedule that thread again
That seems too extreme to me. If the scheduler itself is potentially malicious, then no multi-threaded implementation is safe, since we can't assume that any thread we start will ever actually run.
If the properties of concern for your thread or program are only safety properties (“never does a bad thing”) then the fact that the program may never do anything at all is just fine!
Then, if you want some liveness properties (“eventually does something good”) you’ll need to embed some assumption about fairness/scheduling/progress just as you’ve said!
This appears in the early concurrency literature (TLA etc., which I can't be bothered to look up).
Remembering that you are your own worst enemy, what this implies is that some other perfectly sane decision you made in the past has come to bite you in the arse.
Alistarh et al. (https://arxiv.org/abs/1311.3200) show that, assuming we have a kinder, non-malicious scheduler, the situation looks even better: lock-free algorithms (some process can make progress) have the stronger progress property of being wait-free (every process can make progress).
As omazurov mentioned as a top-level comment, threads can also die, which is rather like never being scheduled. In practice, I suspect that you're right that it's often too extreme of a model—many programs use locks and don't worry too much about this issue.
"Malice" doesn't need to come into play. A thread could simply be waiting in line behind a non-preemptible thread which does not relinquish the CPU due to a bug.
One real-world example is assuming that the audio thread (scheduled at realtime priority if possible) is guaranteed to run, while the GUI thread may or may not run, or get descheduled at any point. Even if the GUI thread locks up (for example the user is resizing the window), the audio thread should not stop producing sound.