The article is nice, aside from the buggy and suboptimal implementations of the mutex.
But I would like to stress that futexes are not just useful to implement mutexes (nor they are even the most interesting feature). They are a generalized waitqueue abstraction that can be used for all kind of signaling primitives. Also the nice thing about futexes is that they do not consume resources when they are not in use: you could implement a fast pathed mutex on top of, say, an eventfd, but allocating a file descriptor for each mutex can potentially run out of resources quickly.
The mutex example was purely for didactic purposes, and definitely not a complete mutex. It was meant to help demonstrate the basics of how a mutex works. Blog post has been updated to clarify this.
I didn't really get the example, futex is itself a syscall, so how does implementing mutex in terms of futex help you create a "fast userspace" mutex? The example presented looks to do almost nothing in userspace. Am I missing something?
mutex_lock makes 0 system calls in the fast case (when cmpxchg succeeds on the first try). mutex_unlock makes 1 system call always, but TFA mentions that it can be optimized by adding some user-space complexity to check if signalling might be required (if you can demonstrate zero waiters, then the FUTEX_WAKE is not needed).
The example really should have fleshed this out, because most of the point of the feature is that you can have a pathway that involves no kernel operations in the common case. The rest of its point is that if you do end up waiting, the kernel scheduler knows about it.
What are some use cases for having so many futexes that would exceed the number of available file descriptors? Are you assuming a low FD limit, or do you really mean millions/billions of futexes?
The more futexes you have, the less contention there ought to be for each. In principle. In practice, of course, you often end up with contention for certain items anyway, but it is hard to know in advance which. What matters for this case is that all the unlocked ones hardly cost anything, especially if you have some use for the other bytes in the cache line each is in (or if "false sharing", with multiple of them in the same cache line, is OK). The greater the ratio of threads or cores to locks, the less you probably worry about false sharing or contention.
But it is not a very good design for throughput, because any given lock is in general very unlikely to live in the cache of the core that wants it, and getting it there takes a long time. If, as is common, a tiny subset of the data set is "hot" and could fit in the sum of the caches available, you would like for any hot item to only ever be in one cache, with only the core that cache belongs to ever looking at it. If you can ensure this, then you don't need locks for those. If you can mostly ensure it, the locks will anyway already live in the right cache and will mostly not be contended for.
So, if you do have a hot subset of big data, it may be better to treat the hot subset specially, falling back to simple, general, maybe slow methods for other items until they become hot and are moved to the hot section and assigned to a specific core.
If there is no hot subset, you want to anyway avoid having cores waiting around for items (with their locks) to be fetched into cache so you can lock and use them. You can issue "pre-fetch" instructions for a series of items a core is interested in, while it works on others, and hope that by the time you get back to each it has been fetched into cache and is ready to be worked on immediately; and that no other core will want it right away.
But I would like to stress that futexes are not just useful to implement mutexes (nor they are even the most interesting feature). They are a generalized waitqueue abstraction that can be used for all kind of signaling primitives. Also the nice thing about futexes is that they do not consume resources when they are not in use: you could implement a fast pathed mutex on top of, say, an eventfd, but allocating a file descriptor for each mutex can potentially run out of resources quickly.