Hacker News

The article is nice, aside from the buggy and suboptimal implementations of the mutex.

But I would like to stress that futexes are not just useful for implementing mutexes (nor is that even their most interesting feature). They are a generalized waitqueue abstraction that can be used for all kinds of signaling primitives. Another nice thing about futexes is that they consume no resources when they are not in use: you could implement a fast-pathed mutex on top of, say, an eventfd, but allocating a file descriptor for each mutex could exhaust that resource quickly.




The mutex example was purely didactic, and definitely not a complete mutex. It was meant to demonstrate the basics of how a mutex works. The blog post has been updated to clarify this.


I didn't really get the example: futex is itself a syscall, so how does implementing a mutex in terms of futex give you a "fast userspace" mutex? The example presented seems to do almost nothing in userspace. Am I missing something?


mutex_lock makes 0 system calls in the fast case (when the cmpxchg succeeds on the first try). mutex_unlock always makes 1 system call, but TFA mentions that this can be optimized by adding some user-space complexity to check whether signaling might be required (if you can demonstrate there are zero waiters, the FUTEX_WAKE is not needed).
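A minimal sketch of that shape (my own code, not the article's; assumes Linux and C11 atomics — the futex word must be a 32-bit integer):

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* Thin wrapper over the raw futex(2) syscall (glibc exposes no libc API). */
static long sys_futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* 0 = unlocked, 1 = locked. */
void mutex_lock(atomic_int *m) {
    int expected = 0;
    /* Fast path: one CAS in user space; no syscall if uncontended. */
    if (atomic_compare_exchange_strong(m, &expected, 1))
        return;
    /* Slow path: sleep in the kernel until the word may have changed. */
    while (1) {
        sys_futex(m, FUTEX_WAIT, 1);   /* sleeps only while *m is still 1 */
        expected = 0;
        if (atomic_compare_exchange_strong(m, &expected, 1))
            return;
    }
}

void mutex_unlock(atomic_int *m) {
    atomic_store(m, 0);
    sys_futex(m, FUTEX_WAKE, 1);   /* always 1 syscall in this naive version */
}
```

In the uncontended case the lock is acquired with nothing but the CAS, which is the whole "fast userspace" part.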


The example really should have fleshed this out, because most of the point of the feature is that you can have a pathway that involves no kernel operations in the common case. The rest of its point is that if you do end up waiting, the kernel scheduler knows about it.


Yes, usually you keep a (saturating) waiter count; in the limit, just one additional bit.
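The one-extra-bit version can be sketched like this (my own code, following the three-state scheme from Drepper's "Futexes Are Tricky": 0 = unlocked, 1 = locked with no waiters, 2 = locked with possible waiters; names are mine):

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static long futex_syscall(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void mutex3_lock(atomic_int *m) {
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                            /* uncontended: no syscall */
    if (c != 2)
        c = atomic_exchange(m, 2);         /* record that we will wait */
    while (c != 0) {
        futex_syscall(m, FUTEX_WAIT, 2);   /* sleep while word is still 2 */
        c = atomic_exchange(m, 2);         /* take lock, keep "waiters" bit
                                              conservatively set */
    }
}

void mutex3_unlock(atomic_int *m) {
    /* If the old value was 1, no thread ever waited: skip the syscall. */
    if (atomic_exchange(m, 0) == 2)
        futex_syscall(m, FUTEX_WAKE, 1);
}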


When the CAS (compare-and-swap) succeeds in user space, the subsequent futex call to wait will not be made, I suppose.


What are some use cases for having so many futexes that would exceed the number of available file descriptors? Are you assuming a low FD limit, or do you really mean millions/billions of futexes?


The more futexes you have, the less contention there ought to be for each. In principle. In practice, of course, you often end up with contention for certain items anyway, but it is hard to know in advance which. What matters for this case is that all the unlocked ones hardly cost anything, especially if you have some use for the other bytes in the cache line each is in (or if "false sharing", with multiple of them in the same cache line, is OK). The greater the ratio of threads or cores to locks, the less you probably worry about false sharing or contention.
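When you do care about false sharing, the usual fix is to pad each lock word out to its own cache line (illustrative sketch, my own; the 64-byte line size is an assumption, typical on x86-64):

```c
#include <stdalign.h>
#include <stdatomic.h>

/* alignas(64) forces each lock onto its own 64-byte cache line, so the
   struct is padded to 64 bytes: memory traded for fewer coherence misses. */
struct padded_lock {
    alignas(64) atomic_int word;
};

/* Adjacent elements land on distinct cache lines, unlike a bare
   atomic_int array where 16 locks would share one line. */
struct padded_lock locks[16];
```

Conversely, if the locks are mostly uncontended, dropping the padding and packing them (accepting false sharing) saves the memory, which is the trade-off described above.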

But it is not a very good design for throughput, because any given lock is in general very unlikely to live in the cache of the core that wants it, and getting it there takes a long time. If, as is common, a tiny subset of the data set is "hot" and could fit in the sum of the caches available, you would like for any hot item to only ever be in one cache, with only the core that cache belongs to ever looking at it. If you can ensure this, then you don't need locks for those. If you can mostly ensure it, the locks will anyway already live in the right cache and will mostly not be contended for.

So, if you do have a hot subset of big data, it may be better to treat the hot subset specially, falling back to simple, general, maybe slow methods for other items until they become hot and are moved to the hot section and assigned to a specific core.

If there is no hot subset, you anyway want to avoid having cores wait around for items (with their locks) to be fetched into cache before you can lock and use them. You can issue "pre-fetch" instructions for a series of items a core is interested in while it works on others, and hope that by the time you get back to each one it has been fetched into cache and is ready to be worked on immediately, and that no other core will want it right away.
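The pattern looks roughly like this (a hypothetical sketch using the GCC/Clang `__builtin_prefetch` builtin; the workload and the look-ahead distance are my assumptions, and the right distance depends on the machine and the per-item work):

```c
#include <stddef.h>

#define AHEAD 8   /* assumed prefetch distance; tune per workload */

/* While working on item i, hint the CPU to start fetching item i+AHEAD,
   so it is hopefully already in cache when we reach it. */
long sum_items(const int *items, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&items[i + AHEAD], 0);  /* 0 = for read */
        total += items[i];   /* the "work" on the current item */
    }
    return total;
}
```

The prefetch is only a hint: correctness does not depend on it, and another core grabbing the line in the meantime just wastes the fetch.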



