[flagged] Condvars and atomics do not mix (zeux.io)
38 points by ibobev on March 24, 2024 | 22 comments



Rust doesn't solve this. You can easily make the same mistake as the article describes, in Rust.

Thiez points out that a Rust mutex wraps the data it protects. But that doesn't fix the issue described in the article. The C++ in the article also accesses all mutex-protected data safely. You could translate it almost line-for-line to Rust and still have the same race hazard.

It's because the race hazard is from wait-condition ordering, rather than mutex-protected data.

The article explains that it's tempting and intuitive to make kill_flag an atomic outside the data protected by the mutex, because it's sticky state, only written in one place.

I've seen this pattern used often enough in the real world, sometimes with the flag being a global variable. People do it. Shutdowns get stuck occasionally, nobody is sure why. Sometimes the flag is set in a signal handler, where it isn't safe to use mutex/notify operations anyway, and nobody thought hard enough about making those wakeups reliable.

But those patterns lead to a race hazard with wait conditions. The article goes on to explain that kill_flag should be part of the data protected by the condition's mutex. And then there's no need to make it atomic.
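
For concreteness, a sketch of that approach, reusing the member names from the article's code (the flag can be a plain bool since every access happens under m_mutex):

  // Worker side: the kill flag is part of the predicate, checked under the mutex.
  std::unique_lock<std::mutex> lock(m_mutex);
  m_has_work.wait(lock, [&] { return m_kill_flag || !m_queue.empty(); });

  // Shutdown side: set the flag under the same mutex, then notify.
  {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_kill_flag = true;
  }
  m_has_work.notify_all();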

After seeing that, it might be a surprise to find kill_flag doesn't have to be protected by the mutex after all!

That intuition that it can go outside was correct. It just needs a little ordering. The following code will also avoid the race hazard:

  // Notify all workers that they need to die right now.
  m_kill_flag = true;
  // Acquire and release the mutex to ensure condvar waiters reliably see all state changes.
  { std::unique_lock<std::mutex> lock(m_mutex); }
  // Wake all sleeping condvar waiters.
  m_has_work.notify_all();
I would not recommend this for the situation in the article. But you can see how it generalises: for example, you might have a global kill flag or memory statistic that's used for many different things, protected by different mutexes.


IMHO, the best (in the sense of comprehensibility) fix to this is to never wait indefinitely. Waits should always have a timeout and on timeout all preconditions should be rechecked.
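
For example, with the article's names (a sketch; the exact timeout is arbitrary):

  std::unique_lock<std::mutex> lock(m_mutex);
  while (!m_kill_flag && m_queue.empty()) {
    // Even if a notification is lost, the predicate is re-checked within 100 ms.
    m_has_work.wait_for(lock, std::chrono::milliseconds(100));
  }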


Not sure why this is flagged, but there's nothing actually special about atomics here. I thought it was fairly common knowledge that when using condition variables you must always couple them with a mutex to provide a critical region and avoid lost wakeups. This is easier to see with the no-abstractions pthread version:

On the waiter side

   pthread_mutex_lock(&mutex);
   while (!predicate) {
     pthread_cond_wait(&cvar, &mutex);
   }
   pthread_mutex_unlock(&mutex);
On the signaler side

  pthread_mutex_lock(&mutex);
  toggle_predicate(&predicate);
  pthread_mutex_unlock(&mutex);
  pthread_cond_signal(&cvar);
(Note that it's safe to signal after unlock, although a bit more error-prone, as you have to be careful when destroying the cvar.) The mutex serves two key purposes: the first is providing acq/rel semantics on observations of the predicate; the second is that it provides a barrier region to ensure that the predicate is either changed before the waiter enters the wait loop, or after the waiter has added itself to the wait queue so that signal will wake it up.

Changing the predicate var to atomic allows you to get rid of the first one (and I think it can in fact be a relaxed atomic since mutex lock already has acq/rel semantics), but you still need it to serve the second purpose.
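
In C++ terms, something like this (my sketch; memory order left at the default for simplicity):

  #include <atomic>
  #include <condition_variable>
  #include <mutex>

  std::mutex mutex;
  std::condition_variable cvar;
  std::atomic<bool> predicate{false};

  // Waiter: the atomic takes care of visibility, but the mutex is still what
  // guarantees the store lands either before this check or after the waiter
  // has been queued.
  void waiter() {
    std::unique_lock<std::mutex> lock(mutex);
    while (!predicate) {
      cvar.wait(lock);
    }
  }

  // Signaler: the store no longer has to happen under the mutex, but the empty
  // lock/unlock before signaling is still needed to avoid a lost wakeup.
  void signaler() {
    predicate = true;
    { std::lock_guard<std::mutex> lock(mutex); }
    cvar.notify_one();
  }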

See this SO answer for more info: https://stackoverflow.com/a/9918764


I don't have the most experience writing concurrent code, but wouldn't semaphores have made more sense here? The issue was that condition variables aren't "sticky", yet semaphores are, and so would not have that problem.


A condition variable is used because the workers are waiting for a condition, `m_queue.size() > 0`, in addition to the kill flag. The kill flag is sticky, but the queue condition isn't.


Since the post doesn't really explain it: the race condition is that when notify_all is called, there might be a thread that has already acquired the lock and checked the flag but not yet started waiting, so that thread won't get the notification and may never wake up.

I think wrapping the call to notify_all in the lock would also fix the race condition.
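
Something like this, reusing the article's names:

  m_kill_flag = true;  // still an atomic store outside the mutex
  {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_has_work.notify_all();  // notifying while holding the lock closes the window
  }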


Rust solves this (in common cases) by having the mutex wrap the data it protects. You can't make this mistake because you can't perform any operation without first taking the lock.

Of course this doesn't cover all use cases, but hopefully the people who have the exceptional situation are smart enough to avoid the problem in the article.


Rust does not solve this by having the mutex wrap the data it protects. The problem is checking something in the condition variable's predicate which isn't guarded by the mutex. Stopping you from accessing the data guarded by the mutex without holding the mutex does nothing to help with this.

You would need to make it so that the CV predicate can only access the state guarded by the mutex, which is not something that Rust does.


You can still put atomics outside the mutex in Rust.


FWIW, my C++ toolkit library, KJ, does the same thing.[0]

But presumably you could still write a condition predicate which looks at things which aren't actually part of the mutex-wrapped structure? Or is the Rust type system able to enforce that the callback can only consider the mutex-wrapped value and values that are constant over the lifetime of the wait? (You need the latter e.g. if you are waiting for the mutex-wrapped value to compare equal to some local variable...)

[0] https://github.com/capnproto/capnproto/blob/e6ad6f919aeb381b...


The right solution when using atomics is to use a binary semaphore instead of condition variables, and implement the wait as

    if (!cond) {
      sem.try_acquire();
      if (!cond) {
        sem.acquire();
      }
    }
Though this only works if there is a single waiter.
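
Fleshed out with C++20's std::binary_semaphore (a sketch for the one-shot kill-flag case, since release() on an already-full binary semaphore isn't allowed):

  #include <atomic>
  #include <semaphore>

  std::atomic<bool> cond{false};
  std::binary_semaphore sem{0};

  // Waiter (must be the only one).
  void wait_for_cond() {
    if (!cond) {
      sem.try_acquire();   // drain a leftover permit, if any
      if (!cond) {
        sem.acquire();     // block until the signaler releases
      }
    }
  }

  // Signaler (one-shot, e.g. a kill flag).
  void signal_cond() {
    cond = true;
    sem.release();
  }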


Why was this article flagged?


Well, that's the sort of mistake even a noob won't make.


C++, the infinite foot-gun factory


Whether or not that's true, this issue is independent of C++.

Concurrent programming is hard, partly because concurrent anything is hard (even in meatspace).


Abstractions should really make it easier than what's described in the article. There's no reason that in 2024, after fifty-plus years of computer science, we still have articles saying "hey, here's a common pitfall that will throw no errors or warnings at you, but will deadlock non-deterministically one time out of one billion".


Sometimes, this is precisely what you need.

The problem with what is described in the article is the race condition. If you hit the race on your one pass through this code, you're dead.

But there are other situations where you know you will repeatedly pass through the code, and the race doesn't matter.

So the language/programming model has abstractions that can be used successfully for the latter. This has the unfortunate side effect of them also being capable of being used to create bugs in the former.

If you want a real-world example of this sort of thing, consider a single-reader/single-writer circular buffer/queue. You don't need any sort of interlock between the read ptr/index and the write ptr/index for this sort of data structure, because if they are mis-ordered (i.e. the writer has added more data but not yet updated the write ptr), the worst that can happen is that the reader thinks there is less data in the buffer to read than there actually is.
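
Concretely, something like this (my own sketch; power-of-two capacity assumed so the indices stay valid across wrap-around):

  #include <atomic>
  #include <cstddef>

  template <typename T, std::size_t N>
  struct SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<std::size_t> write_idx{0};  // only the writer modifies this
    std::atomic<std::size_t> read_idx{0};   // only the reader modifies this

    bool push(const T& v) {                 // writer only
      std::size_t w = write_idx.load(std::memory_order_relaxed);
      if (w - read_idx.load(std::memory_order_acquire) == N)
        return false;                       // full
      buf[w & (N - 1)] = v;
      write_idx.store(w + 1, std::memory_order_release);  // publish after the data write
      return true;
    }

    bool pop(T& out) {                      // reader only
      std::size_t r = read_idx.load(std::memory_order_relaxed);
      if (write_idx.load(std::memory_order_acquire) == r)
        return false;                       // looks empty; may under-report, never over-reports
      out = buf[r & (N - 1)];
      read_idx.store(r + 1, std::memory_order_release);
      return true;
    }
  };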

However, that analysis relies on the model that the reader and writer are periodically checking the data structure. If instead they are both doing a single write and a single read, this same design will lead to loss of data flow between them. To ensure this doesn't happen, you need a mutex that they both share.

So is the right thing to do to require any such shared data to be protected by a mutex? Well, if your only goal is to prevent data races, then sure. But if you also care about e.g. realtime behavior, then absolutely not.

This is just one example. There are many more cases where in case A you don't want or need synchronization primitives and in case B you absolutely do.


I'm familiar with the trade-off. But I disagree that there's a choice to make between safe and powerful. It is possible to make safe behaviors by default, while leaving the door open for more advanced behaviors. I guess code linters can help as crutches. But the right abstractions too.

It's the motivating example for languages like C++ in the first place (over assembly): type safety, easier reasoning about control flow, etc.

Note how the case here is very simple - waiting on a variable to change - but advice just boils down to "be more careful" instead of "here's the foolproof way to do this". You can only ask a programmer to watch out for so many things, and C++ has way too many of these things.


It's not safe vs powerful, typically. It's more often safe versus performant.


I would agree with GP comment: in the particular case described in the article using channels instead of lower-level primitives would solve the problem.

E.g. when workers need to be stopped, you close the channel sender, and the receiver returns an appropriate result to all wait calls.


That's a highly specific solution that doesn't address the fundamental interplay between atomics and various kinds of synchronization primitives.


This objection of yours was directly addressed by the comment I'm trying to support.

E.g. yes, it is important to know the low-level details, but also the abstraction chosen is just not fit for the purpose. And it's partially C++'s fault because channels are not in the standard library.



