Ways to break your systems code using volatile (2010)

raphlinus · on May 26, 2019

This should say 2010. I believe much of it is out of date, as C11 does have a memory model, and does provide both atomics and barriers. Many, if not most, uses of volatile should probably be replaced by atomics.

https://en.cppreference.com/w/c/atomic

oconnor663 · on May 26, 2019

Herb Sutter's (three hour long!) Atomic Weapons talk goes over the modern meaning of volatile towards the end: https://youtu.be/KeLBd2EJLOU?t=5299

User23 · on May 26, 2019

You're not wrong, but C is a language where plenty of apps are still using older versions, so while as per site etiquette this should have the date, it's still interesting, especially for the embedded space.

sk0g · on May 27, 2019

There's using an outdated version, and then there's C.

The more commonly used language specs, IME, are C99 and C89!

That's 20 and 30 years of 'stability'.

icedchai · on May 26, 2019

Many projects are stuck in C99, or even C89...

BubRoss · on May 27, 2019

And many projects aren't. It is still better to label a title correctly.

kosma · on May 27, 2019

Many compilers are stuck in C99 or C89 or even earlier. There are other worlds besides gcc, clang and msvc.

aidenn0 · on May 26, 2019

In terms of "Using volatile too much" I found a comment along the lines of "Not sure why this has to be volatile, but it doesn't work without it" and the answer was "There is a race condition and volatile slows down one path enough to make it go away."

Yuck.

alain94040 · on May 26, 2019

You really should just use volatile for device drivers when accessing IO space with side-effects. Do not use volatile to build your own synchronization primitives.

vardump · on May 26, 2019

Don't forget about memory barriers. Otherwise your driver will fail on other CPU architectures.

Just because it works on x86, doesn't mean it works on ARM, MIPS, POWER or RISC-V. CPUs other than x86 can reorder stores with other stores and loads with other loads. It can cause the CPU to do the store that starts DMA before the stores that set up length and address are done!

Or just use C11 or C++11 memory model. Although those are still not available in too many cases, curse of having to use an ancient compiler...

burfog · on May 26, 2019

Even on x86, even with the C11 memory model, you can still get burned by transaction reordering as the MMIO passes through bridges. Plain old PCI can do this.

oconnor663 · on May 26, 2019

I thought x86 wasn't allowed to do write-write reordering? Does that rule not apply to peripherals? Is an `mfence` guaranteed to fix it, or are there just no rules at all at that point?

burfog · on May 26, 2019

PCI isn't x86, and x86 isn't PCI. PCI itself, for example in a PCI-to-PCI bridge chip, can do store buffering and read prefetches. The PCI specification lays out what you must to to suppress this.

There are at least 5 different sets of rules for ordering on x86, due to memory types. It's in the Intel manual, along with a table that shows how they interact with each other.

ridiculous_fish · on May 26, 2019

What do you think about signal handlers?

Atomics may be implemented with locks, which makes them unsuitable for signal handlers. The only guaranteed lock-free type is `std::atomic_flag` which is not very useful.

`volatile sig_atomic_t` still seems like the better choice for signals.

gpderetta · on May 27, 2019

If the architecture is so broken that atomic load and stores need to use locks, I can't see how would sig_atomic_t would ever be implementable.

andreyv · on May 28, 2019

Indeed, "volatile sig_atomic_t" is the recommended official C way to change a flag from a signal handler.

dahfizz · on May 26, 2019

I think the title and parts of the article are misleading. Using volatile will never make a correct program incorrect. It cannot "break" a correct implementation.

It should not be overused, because as the article mentions it makes for slower and more confusing code, but it's not quite something to be afraid of either.

It is slower to use volatile, and bad form

burfog · on May 26, 2019

That note at the end about Linux is missing a link to the Documentation/volatile-considered-harmful.txt document. Basically, don't use volatile. Here, with your choice of formatting:

https://github.com/torvalds/linux/blob/master/Documentation/...

https://www.mjmwired.net/kernel/Documentation/volatile-consi...

https://www.kernel.org/doc/html/latest/process/volatile-cons...

User23 · on May 27, 2019

Neat how inline assembly is one of the valid use cases. I'm given to understand that essential parts of the Linux kernel can't actually be implemented in pure C and that some assembly is required.

Narishma · on May 27, 2019

Isn't that the case for most (all?) operating systems?

SAI_Peregrinus · on May 27, 2019

The entire section on declarations can also be fixed by always binding type modifiers and quantifiers to the left. Rewriting the examples:

    int* p;                              // pointer to int  
    int volatile* p_to_vol;              // pointer to volatile int  
    int* volatile vol_p;                 // volatile pointer to int  
    int volatile* volatile vol_p_to_vol; // volatile pointer to volatile int

This method always starts with the most basic type, then adds modifiers sequentially. The modifier binds to everything left of it.

klingonopera · on May 26, 2019

  > "Side note: although at first glance this code looks like it fails to account for the case where TCNT1 overflows from 65535 to 0 during the timing run, it actually works properly for all durations between 0 and 65535 ticks."

From example 1, ignoring device and setup-specifics what to do when TCNT1 overflows, it actually works properly for all ticks, both "first" and "second" are unsigned (therefore behaviour is defined), and the delta between them both is always between 0 and 65535, no matter what values they may have, and also correct in all cases.

E.g.:

  timeDelta = timeStampNow - timeStampLast = 0 - 65535 = 1

tntn · on May 26, 2019

But if the duration is > 65535 ticks, the calculated duration will be wrong, no? There is no mechanism to count how many times TCNT1 overflows, so it will be incorrect if the duration of what you are timing exceeds 65535 ticks.

klingonopera · on May 27, 2019

That is correct, yes. I had erroneously understood that the author meant "all durations between 0 and 65535 ticks" as "any duration between the device's 0th and 65535th tick", my bad... Also makes this entire thread obsolete, but FWIW, one shouldn't be attempting to measure a duration that can't even be contained in the variable's bit width. Some workarounds would be to add more bits, slow down the tickrate or add overflow counters.

klingonopera · on May 26, 2019

Can someone please tell me why this got downvoted? It's particularly annoying when correcting common misconceptions to get penalised for it... (and if what I said was wrong, I'd also like to know why!)

icedchai · on May 26, 2019

Probably because the article already says it works for all ticks...

klingonopera · on May 26, 2019

Yes, upon rereading I see that, too. When I first read it, I understood in such a way that it only would work until (a timestamp of) 65535 ticks since device-startup had passed, but he was referring to durations of that length.

Thank you for clarifying.

legohead · on May 27, 2019

I've never had to use volatile in code. This was all very interesting!

For issue #5, a possible solution not mentioned could be to write inline assembly, no? It would keep the array non-volatile and should be portable.

kelnos · on May 28, 2019

Inline assembly is basically the definition of non-portable.

loeg · on May 26, 2019

volatile should only be used for accessing MMIO registers in device drivers; that's it.

ridiculous_fish · on May 26, 2019

There's definitely more uses. For example, shared memory between processes: you should mark it volatile.

C++ atomics are no good here, because they are not guaranteed to be lock free or address free.

loeg · on May 27, 2019

Shared memory is way outside the scope of standard C or C++. It's implementation-defined. It's inconsistent to insist on the weakest definition of atomics allowed by the C/C++ standard(s) and simultaneous invoke one of the weirdest implementation-defined mechanisms defined by POSIX. If your implementation provides shared memory of some kind, it's up to your implementation to define some sort of reasonable semantics.

In POSIX' case, it's up to POSIX operating systems to define reasonable semantics on the memory, using constructs like PTHREAD_PROCESS_SHARED and "robust" pthread mutexes.

vardump · on May 27, 2019

> C++ atomics are no good here, because they are not guaranteed to be lock free or address free.

That's not right; you can still use std::memory_order to get the memory barriers generated that are required. These are going to obviously be lock free, they deal with memory ordering—what you tried to deal with volatile, but in general case.

See: https://en.cppreference.com/w/cpp/atomic/atomic/store

Effectively std::atomic stores and loads generate volatile accesses plus the required memory barriers to get the desired behavior.

ychen306 · on May 27, 2019

maybe use fences?

ychen306 · on May 27, 2019

This is probably not useful for production, but volatile is a great way to see what kind of code compiler generates in a realistic setting. For example, if you want to see how compiler optimizes a code snippet and the code depends on a constant that you don't want to get constant folded away.

drivebycomment · on May 27, 2019

That is a reasonable heuristic but your statement is not technically correct. E.g. you need volatile around setjmp/longjmp and that has nothing to do with IO.

rurban · on May 27, 2019

And if your GC doesn't dump the registers. Only with volatile you can keep all locals on the stack.

And yes, that's not stupid. It's actually faster than all the register "optimizations" for practical use cases in fast VM's. Register saving across calls and at the GC is much more expensive. mem2reg is an antipattern mostly

loeg · on May 27, 2019

GC (of the kind where an external context walks other thread machine stacks) definitely falls under implementation-defined :-).

loeg · on May 27, 2019

Technically for sig_atomic_t as well. Both are... messier parts of the C abstract machine.

kaetemi · on May 27, 2019

Volatile seems quite sufficient for a PleaseExitThread boolean.

flafla2 · on May 26, 2019

Edit: Looks like the slides had an inaccuracy (see replies). Huh, looks like I learned something today :)

I think a good way of summarizing volatile is this slide from my parallel architectures class [1]:

    > Class exercise: describe everything that might occur during the 
    > execution of this statement
    >     volatile int x = 10
    >
    > 1. Write to memory
    > 
    > Now describe everything that might occur during the execution of
    > this statement
    >     int x = 10
    > 
    > 1.  Virtual address to physical address conversion (TLB lookup)
    > 2.  TLB miss
    > 3.  TLB update (might involve OS)
    > 4.  OS may need to swap in page to get the appropriate page 
    >     table (load from disk to physical address)
    > 5.  Cache lookup (tag check)
    > 6.  Determine line not in cache (need to generate BusRdX)
    > 7.  Arbitrate for bus
    > 8.  Win bus, place address, command on bus
    > 9.  All caches perform snoop (e.g., invalidate their local 
    >     copies of the relevant line)
    > 10. Another cache or memory decides it must respond (let’s 
    >     assume it’s memory)
    > 11. Memory request sent to memory controller
    > 12. Memory controller is itself a scheduler
    > 13. Memory controller checks active row in DRAM row buffer.
    >     (May > need to activate new DRAM row. Let’s assume it does.)
    > 14. DRAM reads values into row buffer
    > 15. Memory arbitrates for data bus
    > 16. Memory wins bus
    > 17. Memory puts data on bus
    > 18. Requesting cache grabs data, updates cache line and tags, 
    >     moves line into exclusive state
    > 19. Processor is notified data exists
    > 20. Instruction proceeds
    > * This list is certainly not complete, it’s just 
    >   what I came up with off the top of my head.

It's also worth mentioning that this assumes a uniprocessor model, so out-of-order execution is still possible which leads to complications in any sort of multithreaded or networked system (See #5, 6, 7, 8 in the OP article).

I think a lot of the confusion stems from the illusion that a uniprocessor + in-order execution model implies to programmers who have never dealt with system-level code. I think in the future, performant software will require a bit more understanding of the underlying hardware on the part of your average software developer -- especially when you care about any sort of parallelism. It doesn't help that almost all common CS curriculum ignores parallelism until the 3rd year or more.

[1] http://www.cs.cmu.edu/~418/lectures/12_snoopimpl.pdf - the last 2 slides

colanderman · on May 26, 2019

That interpretation of those slides is incorrect. "volatile" means nothing more than "ensure that a store instruction is issued". It absolutely does not bypass any of the mechanisms listed. Write a test program and look at the assembler output on multiple architectures for proof. (Or look at the intermediate output from Clang.)

Looking at the formatting on the actual slides, I think the 1st is meant to be a question, and the 2nd is the answer. That the first contains the word "volatile" and the second doesn't looks to me like an editing error; they probably both said "volatile" at one time (or didn't) and the proof failed to update one when updating the other.

chrisseaton · on May 26, 2019

> looks to me like an editing error; they probably both said "volatile" at one time (or didn't) and the proof failed to update one when updating the other

Isn't it sobering to think that a university slide could have a minor error like that, someone could read it and internalise it as being very important, and then go off and ask interview questions about it (as suggested on the slide!!!!) for the rest of their career!

(Not the fault of the student in this thread, of course.)

keldaris · on May 26, 2019

I don't understand these slides. The volatile keyword does not magically bypass the mechanism by which modern CPUs write to main memory. Am I missing something, or are they somehow meant to be ironic?

flafla2 · on May 26, 2019

It (in theory) should bypass any caches in between physical memory and the CPU. Of course this is compiler/arch/OS dependent so YMMV...

The slide is admittedly a bit vague, the point is mostly to convey "lots of complicated things that you probably haven't considered are going on in the background to speed up memory accesses in a uniprocessor model." Keep in mind the class is exploring parallel architectures, and that lecture is about snooping-based cache coherence.

chrisseaton · on May 26, 2019

Check for yourself - look at the compiler output using https://godbolt.org for volatile on a variety of architectures. Ask yourself 'where is the logic to bypass the cache or virtual memory?' You won't find it.

keldaris · on May 26, 2019

The volatile keyword will certainly lead to implications for cache coherency, but it cannot bypass the TLB or somehow magically avoid the need to involve the memory controller. Unless I'm grossly misunderstanding something, a majority of the points on the second slide should also be on the first.

burfog · on May 26, 2019

Yep. The slide is completely wrong. It is showing low-level architecture details that would be 100% identical between the two cases. Volatile changes nothing on that list.

Volatile just makes sure the compiler bothers. Otherwise, a pair of writes to the same memory location could be optimized by eliminating the first write. Volatile makes the compiler do that. Of course, the CPU itself may then do this optimization, so volatile is thus not good enough for IO.

keldaris · on May 26, 2019

> It is showing low-level architecture details that would be 100% identical between the two cases.

To be as charitable as I can possibly be, the only part that could theoretically make sense is that the compiler could emit non-temporal store instructions to bypass the cache. I know compilers currently don't do that for volatile, but I don't know why.

comex · on May 27, 2019

> the only part that could theoretically make sense is that the compiler could emit non-temporal store instructions to bypass the cache. I know compilers currently don't do that for volatile, but I don't know why.

Two reasons:

First, using nontemporal accesses would break mixed volatile and non-volatile accesses to the same memory, something which is not defined by the C standard but which some programs rely on anyway.

Second, more importantly: why would they?

- If the address you’re accessing points to hardware registers, the page table entry should be marked non-cacheable, which makes nontemporal accesses unnecessary. And if for some reason it’s not marked properly, nontemporal accesses wouldn’t be sufficient to guarantee that things work anyway, because nontemporal is just a hint which the hardware may not respect. In any case, at least on x86, AFAIK the only nontemporal instructions access 128+ bits of memory at a time, which wouldn’t even work for hardware registers (which generally require you to use a specific access size).

- If the address you’re using points to regular memory, on the other hand, volatile is probably being used to implement atomics, in which case bypassing the cache is unnecessary and also slow. In theory, compilers could compile volatile into accesses surrounded by memory barrier instructions, which would enforce a stronger memory ordering (while being faster than bypassing the cache entirely), especially useful on architectures with weaker memory models than x86. In fact, that’s what volatile does in Java. But in C, it’s pretty long-established that volatile accesses should just compile to regular load/store instructions, and any necessary barriers must be inserted manually. People writing high-performance code wouldn’t be happy if the compiler started inserting unnecessary barrier instructions for them… In any case, usage of volatile for atomics is deprecated in favor of C/C++11 atomics, which do insert barriers for you.

Gibbon1 · on May 27, 2019

I think the reason is the details are too complicated to be captured by the volatile keyword.

For instance the processor I use has a controller that enforces consistency on IO memory operations. So volatile works 'fine'. I know that. The compiler is targeting a core not an implementation. So it has no idea.

tempguy9999 · on May 26, 2019

I don't get this, most likely due to my ignorance, but I thought volatile doesn't necessarily force anything to RAM, it can just push it out so cache coherence handles the rest, between cores (and perhaps peripherals). MESI can do the work without actually hitting memory.

if you want to force actually to ram then perhaps you'd need a memory barrier.

This is not my area though. Wrong? Right?

johntb86 · on May 26, 2019

Yes, and you'll either need to set up the memory mapping as uncached or issue the correct cache flush/invalidate operations.

spc476 · on May 26, 2019

What happens with this code?

    volatile int x;
    int          y;
    int          z;
    
    x = 10;
    x = 20;
    y = x;
    z = x;

Answer:

    the constant 10 is written to x
    the constant 20 is written to x
    the contents of x is read and written into y
    the contents of x is read and written into z

Now, what happens with this code?

    int x;
    int y;
    int z;
    
    x = 10;
    x = 20;
    y = x;
    z = x;

One answer is the same as the above. Another valid answer is:

    the constant 20 is written to x
    the constant 20 is written to y
    the constant 20 is written to z

Why? Because x is not used between the two assignments, so the first will never be seen. Also, x is not used between it's assignment and the assignment to y, so the compiler can do constant propagation.

All volatile does it tell the compiler "all writes must happen, and no caching of reads".

tempguy9999 · on May 27, 2019

Understood but we're talking about different things I think (though this is very much not my area).

You're saying volatile is acting as a kind of memory barrier instruction for the compiler - got it. But I'm saying I understand that at the CPU level, just considering x86 instructions, writes don't have to be forced to RAM, despite a common assumption that they are; they can remain in caches. See johntb86's reply confirming this.

the8472 · on May 26, 2019

> Now describe everything that might occur during the execution of this statement.

fun fact: either swap+loopback devices+FUSE/network filesystems or userfaultfd means arbitrary userspace code execution including IO to remote machines might occur.

chrisseaton · on May 26, 2019

How does volatile bypass, for example, a TLB lookup or miss?

flafla2 · on May 26, 2019

I didn't write the slides myself, but I think the implication is that the TLB is not consulted at all and the physical address is resolved again for every memory access. Of course, this is compiler / architecture / OS dependent though, so YMMV. The point is mostly to convey "lots of stuff you probably didn't consider is going on in the background and may have a nontrivial impact on parallelism."

chrisseaton · on May 26, 2019

> I think the implication is that the TLB is not consulted at all

This is not true.

flafla2 · on May 26, 2019

Could you clarify? I am merely a student of that class and we didn't discuss TLBs in detail so I'm all ears for details.

chrisseaton · on May 26, 2019

On an architecture with virtual protected memory (the one being described in the slides) there is no compiler control over the TLB. There is no mechanism for the compiler to bypass it. It isn't the semantics in theory or in practice for volatile to bypass almost anything on that list you have for the non-volatile case. It just isn't true. There must be some misunderstanding somewhere that is only clarified viva voce.

If you are still in the class I'd love to hear a clarification - maybe I'm wrong!

nullwasamistake · on May 26, 2019

Ironically volatile is just as bad in Java for different reasons. Frequently used for "lock free" synchronization, its usually actually worse than using locks because it can't be cached between cores. The variable is always loaded from main memory, which is usually much worse than holding a lock mutex in registers.

the8472 · on May 26, 2019

The standard pattern for working with atomics in java (volatiles are of limited use without atomic field updaters or varhandles) is to read it into a local variable, operate on that and only write it back to the volatile once you're done.

That has many benefits, among them the ability to store its value in registers.

nullwasamistake · on May 27, 2019

For primitives Java uses special CPU instructions. In the Atomic* package. It's not recommended for plain objects.

chrisseaton · on May 26, 2019

> Frequently used for "lock free" synchronization, its usually actually worse than using locks

If you want lock-free what do you suggest we use instead of volatile?

> which is usually much worse than holding a lock mutex in registers

How can you hold a mutex in a register? That doesn't make any sense.

nullwasamistake · on May 27, 2019

You can use a lock but held in a regular CPU register. They're just regular variables for the most part.

Lock free in Java is usually worse that what the JVM can pull off with lock elison

chrisseaton · on May 27, 2019

> You can use a lock but held in a regular CPU register. They're just regular variables for the most part.

I don't understand this. If your lock variable is in a CPU register how do other CPUs acquire the lock?

> Lock free in Java is usually worse that what the JVM can pull off with lock elision

I don't understand this either. Java's lock elision is only going to make a concurrent object 'lock-free' in the case where the object does not escape the compilation unit. In which case again how would another thread use it? Java will also combine adjacent critical sections created by monitors even if they escape, but it won't make them lock-free in that case.

nullwasamistake · on May 27, 2019

JVM is very smart about locking. My knowledge is limited but this is a great article. https://shipilev.net/jvm/anatomy-quarks/19-lock-elision/

chrisseaton · on May 27, 2019

I work with the JVM at Oracle and I’ve given talks about the lock elision algorithm. It doesn’t do what you think it does and what it does do is not related to lock-free like you think it is.

tus87 · on May 26, 2019

Err...volatile just tells the compiler not to cache the value in a register, that's it. If you don't understand volatile you really, really are not the kind of programmer who should even think about using it.

aidenn0 · on May 26, 2019

The very first example in TFA shows the compiler doing more than this.

empiricus · on May 26, 2019

This is my understanding of volatile as well: volatile just forces read/write to memory. What a read/write to memory entails is a different story. What happens with no volatile is again another story. If my understanding is wrong, someone please enlighten me.

moefh · on May 27, 2019

"Forcing read/write to memory" is very different from "not caching the value in a register". Optimizations can involve not just caching values in registers, but also reordering operations, calculating things at compile-time and so on.

For a trivial example, see this code:

    int f() {
        int sum = 0;
        for (int i = 0; i < 10; i++) sum += i;
        return sum;
    }

As you can see from [1], a smart compiler will calculate the sum at compile time and make the function simply return the resulting number (i.e., no loop is generated).

If you make "sum" volatile, the compiler is forced to do the loop[2].

[1] https://godbolt.org/z/3sX5mU

[2] https://godbolt.org/z/F5CiDJ