
It is correct that on x86 all normal stores have release semantics and all normal loads have acquire semantics (only a few instructions behave differently, i.e. the string instructions and the streaming "non-temporal" store instructions).

However, the ordering properties of x86 stores and loads are stronger than that. An x86 store is not only a release store; it additionally behaves as what could be called an "initial store", i.e. a store that is guaranteed to be performed before all subsequent stores.

An x86 load is not only an acquire load; it additionally behaves as what could be called a "final load", i.e. a load that is guaranteed to be performed after all prior loads.

The research paper that introduced the concepts of acquire loads and release stores (in May 1990) was mistaken: it claimed that these 2 kinds of accesses are enough for the synchronization of accesses to shared resources.

While it is true that acquire loads and release stores are sufficient for implementing critical sections, there are other algorithms for accessing shared resources that need the other 2 kinds of ordered loads and stores, i.e. final loads and initial stores.

On x86, this does not matter, because the normal loads and stores provide all 4 kinds of ordered loads and stores, so any algorithm can be implemented without using memory barriers or other special instructions.

In ISAs that provide only 2 kinds of ordered loads and stores, i.e. acquire loads and release stores, like Arm Aarch64, the other 2 kinds of ordered loads and stores must be synthesized using memory barriers, i.e. an initial store is made by a weakly-ordered store followed by a store barrier, while a final load is made by a weakly-ordered load preceded by a load barrier.

Arm Aarch64 does not have load barriers, but it has stronger acquire barriers, which can always be used instead of any load barriers. The Arm Aarch64 acquire barrier has the confusing mnemonic DMB.LD, apparently inspired by the x86 LFENCE, which is also an acquire barrier, not a load barrier, despite its confusing mnemonic.

(a load barrier guarantees that all prior loads are performed before all subsequent loads; an acquire barrier guarantees that all prior loads are performed before all subsequent loads and stores; such memory barriers are stronger than necessary; the weaker ordering properties provided by the 4 kinds of ordered loads and stores are sufficient)
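To make this concrete, here is a minimal sketch (my own illustration, not part of the comment above) of how these two extra kinds of ordered accesses can be approximated in portable C++, which exposes only acquire/release orderings plus fences; the fences are stronger than the pure store/load barriers described above, and the variable names are purely illustrative:

  #include <atomic>

  std::atomic<int> a{0}, b{0};

  // "Initial store": a store that must be performed before all subsequent stores.
  // Approximated by a relaxed store followed by a release fence (which also
  // orders prior loads, i.e. it is stronger than strictly needed).
  void initial_store_sketch() {
      a.store(1, std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_release);
      b.store(2, std::memory_order_relaxed);  // ordered after the store to a
  }

  // "Final load": a load that must be performed after all prior loads.
  // Approximated by an acquire fence followed by a relaxed load (which also
  // orders subsequent stores, i.e. it is stronger than strictly needed).
  int final_load_sketch() {
      int x = b.load(std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_acquire);
      return x + a.load(std::memory_order_relaxed);  // ordered after the load of b
  }

On x86 these two fences compile to no machine instruction, matching the claim above that plain x86 loads and stores already provide all 4 kinds of ordering.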




Interesting. I don't consider myself an expert, but I have written my share of lock-free algorithms. Which algorithms did you specifically need an "initial store" or a "final load" for?

I could see these being useful for seq locks, but usually you represent them with relaxed operations and explicit memory barriers.


An example is when you have a shared data structure that is written by one or more writers and to which one or more readers want read-only access.

In this case, the readers can use optimistic access to the shared data structure, also known as lock-free access (a very inappropriate name in my opinion).

Thus all the readers access the shared resource concurrently, but they must be able to detect when an update is in progress or when an update has overlapped their read access in time, in order to stall or retry their reading of the shared data structure.

To implement this, a version counter is used, which is advanced by a writer around each update of the shared data. The version counter must be updated twice, both before and after the rest of the shared data is updated. The first update must produce an invalid version number, to enable the detection of an update in progress.

There are two simple ways to achieve this. The version counter can be complemented before the data update and negated after the data update. Complement followed by negate is equal to increment, and negative version numbers indicate an update in progress. The alternative is to increment the version counter both before and after the data update, in which case odd numbers indicate an update in progress and even numbers are valid version numbers.
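As a quick check of the complement/negate arithmetic (my own illustration, assuming two's-complement integers):

  #include <cassert>
  #include <cstdint>

  int main() {
      int32_t v = 42;           // current, valid version number
      int32_t during = ~v;      // complemented before the update: negative => update in progress
      int32_t after = -during;  // negated after the update
      assert(during < 0);
      assert(after == v + 1);   // complement followed by negate == increment
  }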

A reader must read the version number both before and after reading the shared data. An invalid version number read before the data stalls the reading, while an invalid version number read after the data, or one that differs from the first version number, causes the reading to be retried.

The two version counter stores by a writer and the two version counter loads by a reader have different ordering requirements.

The version counter store before data update must be an "initial store" (as defined in the previously posted message).

The version counter store after data update must be a release store.

The version counter load before data reading must be an acquire load.

The version counter load after data reading must be a "final load" (as defined in the previously posted message).
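Putting these four requirements together, a minimal sketch of such a writer and reader in C++ could look like the following (my own illustration; since C++ offers only acquire/release orderings plus fences, the "initial store" and "final load" are approximated with fences as described earlier, and the names seq and data are arbitrary):

  #include <atomic>
  #include <cstdint>

  std::atomic<uint32_t> seq{0};   // even = valid version, odd = update in progress
  std::atomic<uint64_t> data[2];  // the shared data (relaxed accesses, for brevity)

  void writer(uint64_t a, uint64_t b) {
      uint32_t s = seq.load(std::memory_order_relaxed);
      seq.store(s + 1, std::memory_order_relaxed);          // "initial store" ...
      std::atomic_thread_fence(std::memory_order_release);  // ... performed before the data stores
      data[0].store(a, std::memory_order_relaxed);
      data[1].store(b, std::memory_order_relaxed);
      seq.store(s + 2, std::memory_order_release);          // release store, after the data stores
  }

  bool reader(uint64_t& a, uint64_t& b) {
      uint32_t s0 = seq.load(std::memory_order_acquire);    // acquire load, before the data loads
      if (s0 & 1) return false;                             // update in progress: caller retries
      a = data[0].load(std::memory_order_relaxed);
      b = data[1].load(std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_acquire);  // "final load" ...
      uint32_t s1 = seq.load(std::memory_order_relaxed);    // ... performed after the data loads
      return s0 == s1;                                      // otherwise the caller retries
  }

On x86 the two fences compile to no machine instruction, while on Arm Aarch64 they become DMB barriers, which is exactly the overhead that the lack of "initial stores" and "final loads" imposes.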

There are also other frequently used algorithms, e.g. for queuing locks, which need "initial stores".

Explicit memory barriers are a mistake in ISA design. They have stronger ordering properties than the various kinds of ordered loads and stores that can replace them, and these stronger ordering properties are never needed.

Moreover, explicit memory barriers increase the code size and they occupy instruction decode slots and various locations in the internal structures of a CPU core without providing any benefit.

Explicit memory barriers are absolutely necessary for Arm Aarch64 and for Intel/AMD x86-64 (for the latter only when accessing memory regions with weakly-ordered accesses), but only because these ISAs do not include all the kinds of ordered loads and stores that are needed.


You are describing a seq lock. Hans Boehm [1] has discussed in detail the impedance mismatch between seq locks and the acquire/release model, and the bottom line is that you need compiler help to recognize non-modifying RMWs that are purely used to establish ordering and to lower them to the correct optimal sequence. No compiler does it yet, so we all hack around it.

[1] "Can Seqlocks Get Along with Programming Language Memory Models?" I can no longer find an easily accessible free version. The link at HP from Boehms home is broken.


I am pretty sure that variants of this algorithm were in use many decades before Linux and before the coining of the name "seq lock", which is not a good name for it.

It is not needed only on multiprocessors; it is needed even when there is a single writer and a single reader, both running on the same CPU core, e.g. when some data is updated by an interrupt service routine or by a signal handler and the main program must also access that data.

However, before the fashion of implementing weakly-ordered memory accesses, which spread mostly after 1990, nobody paid much attention to the ordering requirements for its stores and loads, because in those ancient times all existing computers satisfied those ordering requirements implicitly for all stores and loads.

Thanks for pointing to the paper written by Hans Boehm; SciHub has it.


A name is a name; RCU is a similarly terrible name, but it stuck and it is immediately recognizable. The kernel (and related articles on LWN) simply popularized the algorithm; I have no doubt that it was in use well before Linux, but that's how I learned it.


(oh, I just realized you must be the same Adrian from RWT who was arguing for initial stores/final loads)



