I make systems performant and reliable by eliminating their lock-free queues.
To be more precise... I like to use single-writer ring buffers. (Technically you could say they are a sort of lock-free queue. But there is little to go wrong in a ring buffer.) Mainly, I reduce coupling enough that taking the occasional mutex is not a noticeable drain. Explicit mutex locking is less prone to obscure heisenbugs than lock-free techniques.
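For concreteness, here is a minimal sketch of the kind of single-writer (single-producer/single-consumer) ring buffer being described; the class and names are illustrative, not the commenter's actual code:

```
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Single-producer / single-consumer ring buffer. The only shared state is two
// monotonically increasing indices, each written by exactly one thread.
template <typename T, size_t kCapacity>   // kCapacity must be a power of two
class SpscRing {
    static_assert((kCapacity & (kCapacity - 1)) == 0, "capacity must be a power of two");
public:
    bool try_push(const T& v) {                     // called only by the writer thread
        uint64_t w = write_.load(std::memory_order_relaxed);
        uint64_t r = read_.load(std::memory_order_acquire);
        if (w - r == kCapacity) return false;       // full
        buf_[w & (kCapacity - 1)] = v;
        write_.store(w + 1, std::memory_order_release);   // publish the slot
        return true;
    }
    std::optional<T> try_pop() {                    // called only by the reader thread
        uint64_t r = read_.load(std::memory_order_relaxed);
        uint64_t w = write_.load(std::memory_order_acquire);
        if (r == w) return std::nullopt;            // empty
        T v = buf_[r & (kCapacity - 1)];
        read_.store(r + 1, std::memory_order_release);    // free the slot
        return v;
    }
private:
    alignas(64) std::atomic<uint64_t> write_{0};    // owned by the producer
    alignas(64) std::atomic<uint64_t> read_{0};     // owned by the consumer
    T buf_[kCapacity];
};
```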
What I rip out are queues treated as if "lock-free" coupling were free. It's not free; it is at least 50 cycles, each interaction. Nowadays you can get a hell of a lot of work done in 50 cycles. Each "lock-free" operation engages bus-level interactions more like a mutex than not, so it doesn't buy what you hope.
To reduce coupling, you increase batching. Instead of one event per interaction, do ten, or fifty.
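As a sketch of what that batching can look like (illustrative, not the commenter's code): the consumer reads the producer's published index once, handles everything visible, and publishes its own index once, so a single pair of cache-line interactions covers the whole batch.

```
#include <atomic>
#include <cstddef>
#include <cstdint>

// Drain everything currently visible in one pass. One acquire-load of the
// producer's index and one release-store of the consumer's index cover the
// whole batch, instead of one pair of synchronizations per event.
// N must be a power of two; call only from the consumer thread.
template <typename T, size_t N, typename Fn>
size_t drain_batch(std::atomic<uint64_t>& write_idx,
                   std::atomic<uint64_t>& read_idx,
                   T (&buf)[N], Fn&& handle) {
    uint64_t r = read_idx.load(std::memory_order_relaxed);
    uint64_t w = write_idx.load(std::memory_order_acquire);  // one sync point
    for (uint64_t i = r; i != w; ++i)
        handle(buf[i & (N - 1)]);
    read_idx.store(w, std::memory_order_release);            // one sync point
    return static_cast<size_t>(w - r);
}
```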
Most "lock-free" implementations you find are not portable off of x86 systems, which have a much more forgiving memory model than, say, ARM. You think you were careful, but the system is better at finding your mistakes than you are.
It's an interesting point - "lock free" then is actually more about consistently good performance than theoretically maximising performance. If you never have to wait for a lock then you know exactly how long your process will take (as long as it isn't CPU starved). As soon as there's a lock, it's going to become unpredictable: when there's a lot of contention you'll wait a long time, and when there isn't you won't have to wait at all. And best of all, with lock-free you shouldn't ever get deadlocks.
This property in turn could get you better performance in the end because you may not have to size everything for such a big range of contention, so you can optimize utilization of the hardware more without risking things hitting a peak and experiencing severe degradation.
> If you never have to wait for a lock then you know exactly how long your process will take
I think you may be confusing lock-free with wait-free here. I believe it's wait-free that gives you a bound on the number of steps the process can take to complete an operation, but I'm not an expert.
Yes, wait-free is a more precise description of single-writer ring buffer performance.
Lock-free queues are not fixed-time, because operations on them involve negotiations on the bus to reclaim the cache line involved, often behind other queued operations.
With the ring buffer, the writer never gives up ownership of the head cache line, and the next one can be pre-fetched (speculatively) because the writes are walking memory sequentially. There is a little hiccup when you wrap around, so it's not absolutely fixed-time.
I put my ring buffers on hugepages so I am not competing for page-map TLB cache.
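On Linux that placement can be done roughly like this (a sketch; it assumes hugepages have been reserved on the machine, and real code would want a fallback):

```
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Back the ring buffer with 2 MiB hugepages so the hot data needs far fewer
// TLB entries. Requires reserved hugepages, e.g.:
//   echo 64 > /proc/sys/vm/nr_hugepages
// 'bytes' should be a multiple of the hugepage size.
void* alloc_ring_storage(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return nullptr;   // real code would fall back to normal pages
    }
    return p;
}
```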
For anyone curious about the sense in which ncmmcm's ring buffers aren't lock-free data structures, we had some discussion about their specific details earlier at https://news.ycombinator.com/item?id=25264023
Maybe he means cases where a lock-free queue can't implement batching easily. Batching by making the messages themselves contain batches of messages makes them variable-sized, which can cause less efficient allocations etc., whereas with a lock you can hold the lock, insert N same-sized messages, and release the lock.
If it is done as a linked list there may be good multiple insert options, but then you have a linked list and all the extra jumping around memory that involves compared to a ring buffer.
Perhaps the author is referring to a technique such as how a cache can be made concurrent [1]. In that case every read is a write, as LRU requires moving entries. Since an LRU is often implemented as a doubly-linked list, it either is a very complex lock-free algorithm or performed under a lock. The lock-free approach doesn't scale because all the recent entries move to the MRU position (tail), so there is contention.
The approach used in that article is to publish the events into multi-producer/single-consumer queues. When a queue fills, the batch can be replayed under a lock to catch up the LRU. This way the lock is used to ensure a single writer, but it does not suffer lock contention. The ring buffer is lock-free, but the cache itself is not. Lock-free can suffer contention or be more expensive algorithmically than doing the work under a lock, so whichever strategy is chosen requires careful consideration of the system's performance.
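A rough sketch of that shape (hypothetical names, and a thread-local buffer standing in for the article's multi-producer queues): reads only record which key they touched, and the LRU list is reordered in batches under the lock, so the lock is taken once per batch rather than once per hit.

```
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Lock amortization for an LRU cache, heavily simplified: readers buffer
// their "touches" and the doubly-linked LRU list is only updated in batches
// under the mutex.
class AmortizedLru {
public:
    void record_hit(const std::string& key) {
        // Simplified: one buffer per thread (shared across instances); the
        // real design uses per-cache, multi-producer ring buffers.
        thread_local std::vector<std::string> pending;
        pending.push_back(key);
        if (pending.size() >= kBatch) {
            std::lock_guard<std::mutex> g(lock_);   // one lock per kBatch hits
            for (const auto& k : pending) touch_locked(k);
            pending.clear();
        }
    }
private:
    static constexpr std::size_t kBatch = 64;
    void touch_locked(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return;
        lru_.splice(lru_.end(), lru_, it->second);  // move entry to MRU (back)
    }
    std::mutex lock_;
    std::list<std::string> lru_;   // front = LRU victim, back = MRU
    std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```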
Exactly the point: the goal of batching is to amortize the cost of synchronization over a large number of events. With enough events to amortize over, the cost of synchronization drops into the noise. Then, you are better off avoiding bug-prone lock-free operations in favor of explicit lock operations that actual humans can reason about.
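In code, the trade being described looks something like this (my sketch, with made-up names): take the mutex once per batch and move everything, so the per-event cost of the lock is divided by the batch size.

```
#include <deque>
#include <mutex>
#include <vector>

std::mutex mu;
std::deque<int> shared_queue;

// One lock acquisition covers the whole batch; at, say, 50 events per call
// the per-event cost of the mutex is negligible, and the code stays easy to
// reason about.
void publish_batch(std::vector<int>& local_batch) {
    std::lock_guard<std::mutex> g(mu);
    shared_queue.insert(shared_queue.end(), local_batch.begin(), local_batch.end());
    local_batch.clear();
}
```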
This is known as BP-Wrapper, a batching and prefetching technique whose paper also coined the term lock amortization. It is sometimes confused with flat combining, which came out a little later and is a slightly different concept.
Amortizing batched operations across occasional overhead goes back to the mistiest origins of computation, and to industrial operations research before that. The very first brick-masons carried a hod of bricks to the top of their scaffold and then placed them before going back for more.
I promise you that practically nobody doing it has ever heard of "BP-Wrapper", or of flat-combining.
In the context of concurrent programming they are well known. We are not discussing how to program a PDP-8 but how to efficiently design multithreaded algorithms. The article recommends Nir Shavit's book; Shavit has written many well-known papers, including the one on flat combining.
> Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again.
Another possibility is that the thread dies while holding a mutex (or exclusively possessing a resource). It may seem like a stretch for shared-memory systems, but not so much for distributed systems. Recovering the resource is not an easy task, especially considering that the recovering threads may also die, or that the thread assumed dead may suddenly come back to life. In my opinion, making a parallel system robust to that kind of malfunction goes far beyond what the industry considers acceptable as lock-free algorithms.
I gave some thought to that scenario in my project Robusta [1]. I doubt it could be useful for any practical purpose but at least it works for a universal computational model.
> In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of “locking up” the entire application in some way, whether it’s deadlock, livelock – or even due to hypothetical thread scheduling decisions made by your worst enemy. That last point sounds funny, but it’s key. Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again.
> [...]
> A final precision: Operations that are designed to block do not disqualify the algorithm. For example, a queue’s pop operation may intentionally block when the queue is empty. The remaining codepaths can still be considered lock-free.
I don't get it; how could "locking a mutex" not be considered an "operation that is designed to block"?
Would somebody be kind enough to tell me what I'm missing here? Thanks!
> I don't get it; how could "locking a mutex" not be considered an "operation that is designed to block"?
That's not what is being claimed here. What is being said is that an algorithm that blocks a thread can still be considered lock-free. There's a difference between lock-freedom and wait-freedom.
The basic differentiator is that a locking algorithm has the potential to keep executing without making progress. Lock-free code guarantees that, as long as threads keep running, at least one of them makes progress.
The queue implementation that blocks on an empty queue is fine if inserting into the queue is still lock-free, i.e., your insert operation eventually completes regardless of whether a reader is spinning on the lock that allows it to pop a value from the queue.
That means that while popping a value can block, inserting does not. Hence the blocking pop does not turn the whole queue into a locking algorithm.
I think it's saying that not _all_ blocking operations are inherently disqualifying. That doesn't mean every blocking operation (like acquiring a shared mutex) is allowed, though.
> your worst enemy could simply never schedule that thread again
That seems too extreme to me. If the scheduler itself is potentially malicious, then no multi-threaded implementation is safe, since we can't assume that any thread we start will ever actually run.
If the properties of concern for your thread or program are only safety properties (“never does a bad thing”) then the fact that the program may never do anything at all is just fine!
Then, if you want some liveness properties (“eventually does something good”) you’ll need to embed some assumption about fairness/scheduling/progress just as you’ve said!
This appears in the early concurrency literature (TLA etc., which I can't be bothered to look up).
Remembering that you are your own worst enemy, what this implies is that some other perfectly sane decision you made in the past has come to bite you in the arse.
Alistarh et al. (https://arxiv.org/abs/1311.3200) show that, assuming we have a kinder, non-malicious scheduler, the situation looks even better: lock-free algorithms (some process can make progress) have the stronger progress property of being wait-free (every process can make progress).
As omazurov mentioned as a top-level comment, threads can also die, which is rather like never being scheduled. In practice, I suspect that you're right that it's often too extreme of a model—many programs use locks and don't worry too much about this issue.
"Malice" doesn't need to come into play. A thread could simply be waiting in line behind a non-preemptible thread which does not relinquish the CPU due to a bug.
One real-world example is assuming that the audio thread (scheduled at realtime priority if possible) is guaranteed to run, while the GUI thread may or may not run, or get descheduled at any point. Even if the GUI thread locks up (for example the user is resizing the window), the audio thread should not stop producing sound.