Secrets of dispatch_once; x86 performance trickery

userbinator · on June 7, 2014

What I would do is have dispatch_once NOP out the call instruction that called it as one of its first operations. Inside, an atomic exchange + compare on the predicate to stop any other threads that slipped past the call, and that's all there is to it. CPUs are optimised for executing NOPs since they're so commonly used for alignment, so the only thing this costs is fetch bandwidth. Here's a quick x86 implementation PoC off the top of my head:

    ; assumptions:
    ; * this function is called via a 2-byte call instruction
    ; * cdecl register allocation
    ; * [esp] : return addr
    ; * [esp+4] : predicate ptr
    ; * [esp+8] : ptr to func() to be called once
    dispatch_once:
        pop edx              ; edx = block
        pop ecx              ; ecx = pred ptr
        mov eax, 1
        xchg eax, [ecx]
        test eax, eax        ; was a different thread here before us?
        jnz dispatch_done
        mov eax, [esp]            ; return address
        mov [eax-2], word 0x9090  ; overwrite the call
        jmp edx                   ; execute the block +return
    dispatch_done:
        ret

As the saying goes, "The fastest way to do something is to not do it at all." In this case, the compare + function call can be completely eliminated after the first time.

Marat_Dukhan · on June 7, 2014

One day the call instruction will cross cache line boundary, the two cache lines will propogate to other cores' instruction caches with a delay, some thread will try to execute partially changed call instruction, and some senior programmer will get mad trying to debug it, a la "look, I swap these two lines and it gets to work" (b/c the call instruction does not cross cache line anymore)

userbinator · on June 7, 2014

I realise this approach has its assumptions and limitations, I was just trying to show how far you can take things if you're really after absolute performance. NOP'ing out the call instruction is the subtle part, but if you can ensure that all your calls to this function are 2-byte calls and aligned, then everything works out great.

Direct calls on x86 are 5 bytes so NOP'ing them out atomically is a bit more difficult; two ways of doing it I can think of: [1] use an 8-byte load/store, or [2] ensure the displacement part is aligned, and overwrite it with the address of a single RET instruction so the call immediately returns. The first way could be slower but effectively NOPs out the whole call, the second is simpler and faster to do but leaves the useless call/ret (while still eliminating the compare/branch.)

Incidentally, doing this is much easier on RISCs like MIPS and ARM thanks to fixed-length instructions (that automatically align), and they tend to have much shorter pipelines too so the flush penalties are lower.

lilyball · on June 7, 2014

That requires the ability to write to the executable page, which is definitely not allowed on iOS, and not necessarily allowed elsewhere either.

It also makes the assumption that a given call to `dispatch_once()` will only ever be called with the same predicate. If it selects among a set of predicates (or is given a predicate by whatever function called it), then NOPing it out would completely break the application.

mikeash · on June 7, 2014

Even ignoring W^X problems, how do you ensure that the reads and writes of the data are properly ordered with respect to the reads and writes of the code?

cfallin · on June 7, 2014

x86 self-modifying code semantics are actually quite strong relative to most other ISAs -- a store to the currently executing code-stream logically takes effect immediately, so you could in theory modify the instruction immediately following the store, and be OK. In other words, instruction memory ordering is exactly as strong as data memory ordering.

(That said, self-modifying code is usually a bad idea when performance is concerned, because the CPU must conservatively flush the pipeline when it detects that any store might have modified the raw bytes of any instruction in the pipeline, and might have to conservatively throw away other state -- probably more expensive than whatever other atomic operations we were playing tricks to avoid.)

k4st · on June 7, 2014

My understanding is that the semantics are not as strong as they appear, although I have no experience with other architectures, so what is said to be the case and what is the case might still be relatively easier to use than other ISAs.

The optimization and software developer manuals suggest that self- and cross-modifying code must always be used with a synchronizing instruction (e.g. CPUID).

Also, this LKML discussion (https://lkml.org/lkml/2009/3/2/194) suggests that only modifying the first byte of an instruction with an int3 is safe, whereas modifying the other parts of an instruction can result in spurious faults when that instruction is next executed, unless the correct algorithm is followed.

cfallin · on June 7, 2014

Fair point -- for cross-modifying code (one core's stores affect another core's fetch), this may be true.

For self-modifying code, though (same core doing the store and fetch), the semantics are strong. See the Intel Developer Manual 3A [1], section 11.6: "[the processor] check[s] whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated." In other words, stores are checked against instructions already in the pipe.

Also, I was involved in an out-of-order x86 design and know for certain that we cared about getting SMC right. No serializing instructions necessary :-)

[1] http://www.intel.com/content/dam/www/public/us/en/documents/...

k4st · on June 7, 2014

Good to know! I'm in a funny place in my research project where I recently found out about the limitations of cross-modifying code. I had previously assumed that everything would work out for the best (doing hot-code patching in a DBT system that runs in kernel space), and didn't (appear to) face issues on the last version of my DBT system. For the next version, I'd like to play things safer. However, I suppose I could transparently recover from GP faults, as this is something I already do for other scenarios.

Do you have any suggestions (besides stop_machine-like IPIs for synchronizing all CPUs) on how to go about dynamic code patching in a JIT-based DBT-system? In my case, it's a priori unclear on whether or not the instruction being patched has been prefetched by another core.

cfallin · on June 7, 2014

That sounds like a tricky/interesting problem -- out of curiosity, what's the higher-level problem you're trying to solve? What are the required semantics? For CMC, synchronizing with the destination core is probably necessary. Details of how the snooping works aren't documented and you probably can't rely on particular uarch implementation details anyhow.

But -- specifically for DBT, and just a guess -- you're trying to avoid an indirection when going between translated blocks/traces, and patching them directly together? Or at least somehow modify blocks/traces already in use by other cores. Then -- at the cost of more memory usage (and icache misses), you might be able to sidestep the IPIs by generating new traces and pushing synchronization up a level to whatever dispatch/map mechanism you use to find translated code. (Think of a persistent datastructure where the only mutable state is the root pointer, not the datastructure nodes -- same concept, same concurrency benefits.)

k4st · on June 7, 2014

I'm trying to solve a few related high-level problems. One, is that I want Valgrind-like debugging for the kernel. In my last kernel DBT system, I was able to do some pretty neat things, but actually using the DBT system was hell. A lot of this had to do with some of my poor design decisions (e.g. quick hacks that revealed interesting research areas, but were never refactored into good code).

Another problem that I want to solve is turn on/off pervasive profiling. This will sound similar to kprobes / systemtap / dtrace / etc, but what I want you to think about is something like "tainting" some objects (like injecting radioactive die) and then being able to observe their entire lifetime. I want to make it easy for someone without 1) domain specific knowledge of parts of the kernel, and 2) the ability to change the kernel source code to be able to answer the following types of questions: "if I write some data to a socket, how long does it take that data to go out over the wire, where are hold ups, etc."

Specifically for DBT, you hit the nail on the head: I am patching jumps in the code cache to point to other blocks in the code cache. Your suggestion is analogous to a copy-on-write tree/graph. I will have to think more on it, as it is interesting.

If you're curious about my project, then feel free to reach out to me :-) My email is on my HN profile.

hollerith · on June 7, 2014

DBT == dynamic binary translation, in case anyone else is wondering.

MichaelGG · on June 7, 2014

This is used in JITs, too - is there a reason why the author wouldn't cover it? Other than being "cheating" so to speak? I suppose environments (like game consoles) where you cannot write to executable pages?

comex · on June 7, 2014

Copying my comment from the post:

I don't think the cpuid is actually required, but feel free to tell me that I'm wrong.

The comment in dispatch_once talks about systems with weakly ordered memory models like ARM. But x86 has a strongly ordered model: among other things, "any two stores are seen in a consistent order by processors other than those performing the stores" [Intel]. After calling dispatch_once, I expect to see any writes performed by the once routine, but once the store to the dispatch_once_t flag has been observed, this is already guaranteed.

The main reordering pitfall in x86 is that reads may be reordered with earlier writes to different locations, but I don't see how that would cause a problem here.

nkurz · on June 7, 2014

I'm tending to agree with your reasoning, but am reaching a slightly different conclusion: there might be a problem, but if there is, I don't see how 'cpuid' solves it.

Like any delay, it reduces the chances that it will occur, but I don't think it provides any guarantee. What is to prevent a process from reading the initial predicate==0 and then being swapped out for thousands of cycles just before the conditional is executed? I presume the solution for this is that the "write side" does CAS before it actually executes the block. But since it's doing this anyway, I don't see how the 'cpuid' is actually helping things. I think the CAS will be doing a read-to-own whether or not it succeeds, and thus it should never see the wrong state.

Edit: Looking at the comments in the source through the link that Marat supplied below, I'm now realizing that the (supposed) purpose of the 'cpuid' is not to prevent double execution of the initialization block, but to prevent the process that performs the initialization from pre-reading a data variable that is not yet initialized. Since this isn't necessary on x86, it seems even more likely the 'cpuid' is doing anything useful here.

nkurz · on June 7, 2014

Great article! I'm still trying to wrap my head around this, but in the meantime I'll quibble about one tiny throwaway aside:

DISPATCH_EXPECT is a macro that tells the compiler to emit code that tells the CPU that the branch where 8predicate is ~0l is the more likely path. This can improve the success of branch prediction, and thus improve performance.

While this is what the macro would do on processors that support such code, x86/x64 isn't one of those processors (Itanium was). Instead, it's more important purpose here is to hint to the compiler (not the processor) to lay out the conditional in assembly such that the expected result is on the faster "branch not taken" path.

That said, is the actual implementation of the "write side" viewable online?

Marat_Dukhan · on June 7, 2014

There are hints for taken/non-taken branches in x86. However, the only family to use them was Pentium 4; other processors silently ignore those hints.

As for the write side, you'll find it here: https://opensource.apple.com/source/libdispatch/libdispatch-...

gsg · on June 7, 2014

Hints are useless, but arranging for cold code to be on a forward branch can be beneficial.

colin_mccabe · on June 7, 2014

The missing element from this explanation (or maybe I missed it?) is that dispatch_once uses atomic instructions (cmpxchg, to be exact), to ensure that even if multiple threads attempt to create the singleton, only one succeeds. The fast path for reads only kicks in after everything else has been tidied up.

Another way to do this kind of thing on Linux is to use ELF TLS to have a thread-local variable identifying if the expensive operation has been completed. If the TLS is not there, you can take a mutex and fall back to the slow path.

rossjudson · on June 7, 2014

This feels racy, and I'm not sure it will work on multi-node NUMA architectures. It's one thing to delay and hope that your write is visible in the local processor cache; it's another to ensure that write is visible to other processor nodes, when there's no explicit cache coherence rule that says it will be.

quotemstr · on June 7, 2014

Relying on an implementation detail of the cpuid instruction seems like a very bad idea. It'd be safer, I think, and just as fast, to initially use the atomic version of the code, then NOP out the LOCK prefix once initialization is complete.

mikeash · on June 7, 2014

The rules are different when you're an OS vendor who makes all of the hardware that the OS officially supports.

It would be a bad idea for you or I, but for Apple there's no problem. If Intel ever ships a CPU that needs something else, then Apple can update their code for the OS release that goes out on the corresponding Macs.

userbinator · on June 7, 2014

Intel specifies CPUID to be serialising, this is unlikely to change in the future.

cfallin · on June 7, 2014

It's actually part of the architectural definition, so Intel could not change this without breaking ISA compatibility. And they take x86 compatibility very seriously!

jsnell · on June 7, 2014

That can't possibly work. There's just one dispatch_one function but many callsites, and for each callsite you could have multiple distinct predicates.

quotemstr · on June 7, 2014

You'd allocate a trampoline for each predicate, jump unconditionally into that, then jump unconditionally back to mainline code. Each trampoline can be modifier individually. Of course that doesn't work with the current dispatch_once API, though.

mikeash · on June 7, 2014

Doesn't this have all the same problems as the way they currently do it? The only difference between the two is that Apple's implementation puts the flag in a variable, while you're basically proposing to put the flag in code. You still need to ensure proper ordering on both reads and writes and ensure that speculative execution doesn't end up giving you the wrong answer.

adamjernst · on June 7, 2014

Requires writing to executable pages, which iOS doesn't have.

MichaelGG · on June 7, 2014

iOS doesn't have x86 processors either, does it?

paxswill · on June 7, 2014

Unless I'm mistaken, the simulator is an x86 implementation of iOS.

n0rm · on June 7, 2014

Can someone link a mirror?

I get timeout for some reason.