Even ignoring W^X problems, how do you ensure that the reads and writes of the data are properly ordered with respect to the reads and writes of the code?
x86 self-modifying code semantics are actually quite strong relative to most other ISAs -- a store to the currently executing code-stream logically takes effect immediately, so you could in theory modify the instruction immediately following the store, and be OK. In other words, instruction memory ordering is exactly as strong as data memory ordering.
(That said, self-modifying code is usually a bad idea when performance is concerned, because the CPU must conservatively flush the pipeline when it detects that any store might have modified the raw bytes of any instruction in the pipeline, and might have to conservatively throw away other state -- probably more expensive than whatever other atomic operations we were playing tricks to avoid.)
My understanding is that the semantics are not as strong as they appear -- though I have no experience with other architectures, so even with the gap between what is documented and what actually holds, x86 may still be easier to work with than other ISAs.
The optimization and software developer manuals suggest that self- and cross-modifying code must always be paired with a serializing instruction (e.g. CPUID).
Also, this LKML discussion (https://lkml.org/lkml/2009/3/2/194) suggests that only replacing the first byte of an instruction with an int3 is safe, whereas modifying the other bytes of an instruction can result in spurious faults when that instruction is next executed, unless the correct algorithm is followed.
Fair point -- for cross-modifying code (one core's stores affect another core's fetch), this may be true.
For self-modifying code, though (same core doing the store and fetch), the semantics are strong. See the Intel Developer Manual 3A [1], section 11.6: "[the processor] check[s] whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated." In other words, stores are checked against instructions already in the pipe.
Also, I was involved in an out-of-order x86 design and know for certain that we cared about getting SMC right. No serializing instructions necessary :-)
Good to know! I'm in a funny place in my research project where I recently found out about the limitations of cross-modifying code. I had previously assumed that everything would work out for the best (doing hot-code patching in a DBT system that runs in kernel space), and didn't (appear to) face issues on the last version of my DBT system. For the next version, I'd like to play things safer. However, I suppose I could transparently recover from GP faults, as this is something I already do for other scenarios.
Do you have any suggestions (besides stop_machine-like IPIs for synchronizing all CPUs) on how to go about dynamic code patching in a JIT-based DBT system? In my case, it's a priori unclear whether the instruction being patched has been prefetched by another core.
That sounds like a tricky/interesting problem -- out of curiosity, what's the higher-level problem you're trying to solve? What are the required semantics? For CMC, synchronizing with the destination core is probably necessary. Details of how the snooping works aren't documented and you probably can't rely on particular uarch implementation details anyhow.
But -- specifically for DBT, and just a guess -- you're trying to avoid an indirection when going between translated blocks/traces, and patching them directly together? Or at least somehow modify blocks/traces already in use by other cores. Then -- at the cost of more memory usage (and icache misses), you might be able to sidestep the IPIs by generating new traces and pushing synchronization up a level to whatever dispatch/map mechanism you use to find translated code. (Think of a persistent datastructure where the only mutable state is the root pointer, not the datastructure nodes -- same concept, same concurrency benefits.)
I'm trying to solve a few related high-level problems. One is that I want Valgrind-like debugging for the kernel. In my last kernel DBT system, I was able to do some pretty neat things, but actually using the DBT system was hell. A lot of this had to do with some of my poor design decisions (e.g. quick hacks that revealed interesting research areas, but were never refactored into good code).
Another problem I want to solve is turning pervasive profiling on and off. This will sound similar to kprobes / systemtap / dtrace / etc., but what I want you to think about is something like "tainting" some objects (like injecting radioactive dye) and then being able to observe their entire lifetime. I want to make it easy for someone without 1) domain-specific knowledge of parts of the kernel, or 2) the ability to change the kernel source code, to answer the following types of questions: "if I write some data to a socket, how long does it take that data to go out over the wire, where are the hold-ups, etc."
Specifically for DBT, you hit the nail on the head: I am patching jumps in the code cache to point to other blocks in the code cache. Your suggestion is analogous to a copy-on-write tree/graph. I will have to think more on it, as it is interesting.
If you're curious about my project, then feel free to reach out to me :-) My email is on my HN profile.