IMHO the best way to think of it (well, it's how I have long thought of it) is two lemmas:
1 - The compiler has more "fingers" than a human does. Back when I wrote programs in assembly I would use printout to keep track of what I was doing and for debugging, and so would often put a finger on the paper to mark where a jump was and then go to the destination to see if it matched up, etc. This process isn't at all scalable, so the compiler is inevitably going to do better than I ever could on anything more than something trivial.
2 - I know more about the program constraints than I could ever express (much less be bothered expressing) in a HLL. "All the values will be less than 16" or "I can use this memory location for something else after I've read from it" or who knows what. So sometimes I can chop stuff out or even rewrite the compiler output locally to speed it up.
Also I can only do this for a few architectures...and with modern x86 there are so many variants with all sorts of microoptimization affordances that I can rarely beat the compiler (this is a corollary of lemma 1)
A compiler is a combinatorial optimizer (think bin-packing). In general, optimizers/solvers basically search for the best solution. Most production compilers don't have solvers in them; they use heuristics instead, but even the best solvers use tons of heuristics. Naturally a computer will search/try heuristics faster and more thoroughly than you, but sometimes you can do better because performant searching is all about "knowing where to look".
Modern compilers are not doing much searching in general. It's mostly applying some feed-forward heuristic to determine whether to apply a transformation or not.
I think a slower, search-based compiler could have a lot of potential for the hottest parts, where you're willing to spend exorbitant time on a search.
I have heard search based compiler optimization called "superoptimization"[1]. It seems interesting, but as far as I know has not seen much industrial use.
It scales well enough. According to a talk I recently attended, given by one of the developers, you can apparently run Souper on SQLite in 24 hours with a beefy machine.
That’s a cool data point, thanks! Though mind that it is a superoptimizer on LLVM bitcode, not on machine code itself, which avoids all the combinatorial explosion of register allocation and probably a bunch more, but I don’t know enough about the program to say.
> Modern compilers are not doing much searching in general.
This is false. Any compiler that does register allocation and instruction scheduling (all of them) is searching for an optimal (or just good enough) solution to an optimization problem.
I'm not sure there is a clear separation between applying heuristics and searching a space. Often in compilers you search a subset of a space using heuristics, and you can adjust those to control how much of the space you cover.
For example, here is a pass that reorders WebAssembly globals in the Binaryen optimizer:
We have a simple criterion for the quality of a solution - how big the binary is with a given order - but the space of possible orders is huge (every permutation that keeps every global after its dependencies). What we do is a targeted search of that space using some heuristics, with parameters that work well enough and aren't too slow in practice.
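To make "targeted search with heuristics" concrete, here is a rough sketch in plain C of the general shape of such a pass - not the actual Binaryen code (which is C++ and more involved), just an illustration: greedily place whichever not-yet-placed global has the highest use count among those whose dependencies are already placed, on the (illustrative) assumption that frequently used globals benefit from smaller LEB-encoded indices.

    /* Illustrative only: a greedy "pick the best ready candidate" ordering,
     * not the real Binaryen pass. A global may only appear after the globals
     * it depends on; among the ready ones we prefer the most-used one. */
    #include <stdbool.h>
    #include <stddef.h>

    enum { MAX_GLOBALS = 1024 };   /* sketch assumes n <= MAX_GLOBALS */

    typedef struct {
        size_t use_count;     /* heuristic score: how often the global is referenced */
        size_t dep_count;     /* number of globals this one must come after */
        const size_t *deps;   /* indices of those globals */
    } Global;

    /* Write a dependency-respecting order of 0..n-1 into `order`, greedily. */
    void reorder(const Global *g, size_t n, size_t *order) {
        bool placed[MAX_GLOBALS] = { false };
        for (size_t filled = 0; filled < n; filled++) {
            size_t best = n;   /* sentinel: no candidate chosen yet */
            for (size_t i = 0; i < n; i++) {
                if (placed[i]) continue;
                bool ready = true;   /* are all dependencies already placed? */
                for (size_t d = 0; d < g[i].dep_count; d++)
                    if (!placed[g[i].deps[d]]) { ready = false; break; }
                if (!ready) continue;
                if (best == n || g[i].use_count > g[best].use_count)
                    best = i;
            }
            /* assumes the dependency graph is acyclic, so a candidate always exists */
            placed[best] = true;
            order[filled] = best;
        }
    }

You can dial how much of the space gets explored - e.g. try the top few candidates at each step instead of just one and keep whichever order measures smallest - which is the sense in which the heuristic parameters control how much of the search space you cover.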
> I'm not sure there is a clear separation between applying heuristics
There is, and it's quite simple: if your heuristic reduces the size of your search space faster than it would take to just perform the search (i.e. try solutions), then you have a real algo on your hands. Otherwise you're just searching. This is basically the border between P and NP, and it's just that in compilers most of the problems are NP-hard, so none of the heuristics are really that good.
To me an algorithm is closed, while a heuristic (aka rule of thumb) is just a fast way to probably get a better solution / subset of the solution space at the cost of possibly missing the optimal result or even ending up in a pessimal corner case.
With an NP-complete problem you'd rather have some solution than use up your lifetime searching for the best.
i dunno what "closed" means here? converges? lots of things with heuristics converge...
most things people think of as algorithms are just heuristics codified. take for example unit propagation in DPLL (fairly modern SAT approach); quoting wiki[1]
> In practice, this often leads to deterministic cascades of units, thus avoiding a large part of the naive search space.
that *in practice* there means there's no guarantee because ofc not - otherwise they would've proven something about NP. but they didn't, they just came up with a way to search that's sometimes/often not bad. a lot of people call this an algorithm ("... is a refinement of the earlier Davis–Putnam algorithm") because it fits the dictionary definition (a repeatable process), but i do not, because it's a repeatable process that isn't proven to produce anything (ie run faster than brute force). and the intuition i'm proposing for why it doesn't is that it doesn't actually shrink the search space fast enough.
note, my definitions/intuitions don't translate/have the same force elsewhere (especially in continuous/differentiable spaces) but they're a pretty standard/obvious perspective in combinatorial optimization (np-hard/np-complete problems).
I don't know what you're asking for - this isn't some kind of controversial topic - any iterative algo that isn't polynomial time (or is approximate) is search.
In the context of compilers there are many. Look at this block diagram for Chaitin's register allocator:
I am going to be the token programming language researcher and say that what you really want is a dependently typed assembly language that your dependently typed higher level language lowers to. One school of thought that has yet to bear fruit in “mainstream“ programming, but gives tantalizing hints of what is possible, is expressing increasing amounts of your program's constraints in the type system, thereby informing the compiler precisely about your intent.
Doesn't make any sense. The "type constraints" on assembly operands (registers and numbers) are the ISA, and thus those constraints are combinatorial, not logical.
You misunderstand. If I, at the assembly level, could express things like "this register ranges from 0 to 4"—I might be able to provide ever-lower levels of the stack with more information about my intent. Maybe the ISA can now do something clever to further optimize my program. Now, there will likely always be a time when all types are erased, but if we can push that erasure lower and lower in the stack, then compilers have more and more information about what we actually want them to do, provided our type systems are rich enough to express that intent.
that's not a type system, that's an ILP. and we already have that in many compilers (bindings to ILP solvers).
> Maybe the ISA can now do something clever to further optimize my program.
this is a malformed sentence - the ISA is fixed and isn't making any game-time decisions about your code (modulo branch prediction/cache prefetching). it's all in the compiler, and like i said we already have this in many many compilers and it's surfaced in various ways (eg trip count on loops).
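eg in C you can already hand the optimizer a value-range fact with a plain branch to __builtin_unreachable() (GCC/Clang) and it will happily exploit it - no type system required. a minimal sketch with made-up helper names:

    /* Tell GCC/Clang that x is always in [0, 4]; value-range facts like this
     * let the optimizer drop range checks, shrink switches, etc. */
    static inline unsigned clamp_hint(unsigned x) {
        if (x > 4)
            __builtin_unreachable();   /* promise: this path never happens */
        return x;
    }

    int describe(unsigned x) {
        switch (clamp_hint(x)) {   /* the bounds check guarding the jump table can go away */
        case 0: return 10;
        case 1: return 11;
        case 2: return 12;
        case 3: return 13;
        case 4: return 14;
        default: return -1;        /* provably dead given the hint */
        }
    }

same idea as clang's __builtin_assume / LLVM's llvm.assume - it's exactly the kind of info the trip-count machinery and friends consume.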
There's a reason "token programming language researchers" are incapable of understanding modern computer architectures: a huge gap where an EE education would have been.
Correct - most of them think the goal is an infinite tower of sophisticated "abstractions" not realizing there's nothing abstract about a finite piece of silicon.
While I do think there are a lot of reasons why one should not write in assembly, time-critical base routines can often benefit a lot from being written in assembly (take GMP's mpn module for instance). However, this puts some constraints on how the code has to be structured (in GMP's case, it means that everything the mpz module does is derived from the mpn module), which cannot be said for all applications.
> The dispatch loop takes a single indirect branch to the opcode-specific implementation. This means that the branch will be nigh unpredictable!
Modern branch predictors can actually predict indirect branches with multiple destinations, because they hash recent branch history into the prediction. The exact same indirect branch will end up with multiple BTB entries, based on previous control flow.
I was curious where the threaded code speedup is actually coming from. It's possible much of the speedup is coming from eliminating the extra branch back to the dispatch loop. Or maybe the history tracking in the M1's branch predictor doesn't work well for this type of control flow.
So I checked out this code and modified the next macro, adding a dummy branch over a nop, to roughly isolate this factor.
On my M1, the extra unconditional branch benchmarks at 1.08 sec for fib, and 1.19 sec for Mandelbrot (all other times were the same).
Looks like the speedup is a mixture of the two. Eliminating that extra branch is responsible for about 20-30% of the speedup, and improving prediction on indirect branches is responsible for the other 70-80%
Ok, I spent quite a bit of time looking at performance counters, trying to understand what the M1's branch predictor was doing.
The branch predictor is really accurate with a common dispatcher; it predicts those indirect branches correctly 99.25% of the time. Switching to threaded jumps improves this slightly to 99.75%, but not because the indirect branches are at different addresses. This improvement in accuracy is entirely because of the unconditional branch back to the dispatcher that was removed as a side effect. With that gone, the branch predictor can now track branch histories that are about twice as many VM instructions long.
My modified version with a dummy branch in the threaded dispatcher negates this longer history and (according to the performance counters) results in the exact same 99.25% correctly predicted branches as the common dispatcher, yet it's still significantly faster, only 20% slower than threaded jumps.
-------------------
Why are threaded jumps faster on the M1, if it's not increasing branch prediction accuracy?
Well, the M1 essentially has two branch predictors[1]. The faster one can return a prediction in one cycle, but it's not checking branch history at all, and it's almost always wrong for these unpredictable indirect branches. The slower predictor does take branch history into account, but takes three cycles to produce the correct result.
Which means there is a short pipeline stall when the second prediction comes in. This stall doesn't show up as a BRANCH_INDIR_MISPRED_NONSPEC, because the branch was correctly predicted. Instead, it seems to show up in the FETCH_RESTART counter.
So while threaded jumps don't improve overall branch prediction accuracy (except because of the longer history), they do slightly improve the accuracy of the faster branch predictor. With a common dispatcher, it predicts wrong almost 100% of the time, but with threaded code, the accuracy improves to 40%.
[1] Or at least that's a decent mental model. I suspect it's actually a single ITTAGE branch predictor that returns the zero-history result in 1 cycle.
The other thing I noticed while playing around: performance absolutely falls off a cliff if your hot loop starts missing in the L1i cache.
This blog post [1] from CloudFlare has a few interesting hints about the M1's branch predictor. First, (in their testing) to get the one cycle predictions at all, your hot code needs to fit in just 4KB.
Second, if the branch target isn't in the L1 cache, you don't get a prediction at all. The branch target prediction probably points directly at the cache line + way, so even if a cache line moves to a different way, the prediction will still fail.
Which means I'm not sure this optimisation is worthwhile. It works for fibonacci and mandelbrot because they have reasonably tight loops, and adding a dispatcher to each instruction handler doesn't push the hot code over the limit. But when interpreting more generic code, you might be better off trying to minimise cache usage.
> Making all of the opcode implementations the same size (padding to the size of the largest opcode implementation with .balign 256), then removing the jump table entirely. This was also slower, also probably because of cache friendliness: the opcode implementations go from 16.6 KiB total to 64 KiB.
Normalizing the opcode implementation length to the maximum length is probably not optimal. A possible experiment would be to normalize to the modal length. I would expect most opcodes to be arithmetic or logic operations and to share the same implementation length. The (hopefully few) more complex ones would need a trampoline.
> I've proven to my satisfaction that writing an interpreter in assembly is both fun and performant!
Fun, maybe/maybe not, but define "performant". I might drop into assembly for a 1 or 2 order-of-magnitude speedup on a non-toy project. Even then, what are my requirements? What is the bus factor of LuaJIT again? Try getting support for s390x, or getting your patch accepted.
Look at the speedup of Lua v. LuaJIT (C language VM v. C/Lua VM with JIT code gen):
I wish you wouldn't broadcast the sentiment contained in the first paragraph. Compilers lack the ability to consistently perform many basic optimizations to an embarrassing extent. Including even the ones you would think would be the first optimizations you'd implement when writing a compiler. Open up Godbolt and tell me if you still think the compiler knows best. I try to submit at least one issue to LLVM every time I look at the assembly it gives me. I say "try to" because sometimes it's too much to deal with.
And by the way, probably the absolute worst things the compiler does are the "smart" things. I write my code knowing what the emit should look like, and the compiler sometimes thinks it has a better idea than how I wrote the code and makes the emit a lot more convoluted and slower than it would be if it just did what I said. Actually, I do know the machine decently well, thank you.
Saying "centuries of engineering" is misleading. A lot of those centuries are people arguing about theoretical optimizations of dubious value while we still haven't even finished on the most basic optimizations that are obviously beneficial.
The second paragraph, making it sound all mysterious that compilers are even more awful at interpreters than at normal code, really speaks to a complete lack of exposure to the subject matter. This stuff is only esoteric because you couldn't be bothered to look at godbolt.org. Which is totally fine by the way, just don't go telling me how high quality compilers are if you've never even looked.
That would be like me telling web developers that Dreamweaver produces phenomenal HTML and CSS, without ever looking at what it emits.
Sorry for the rant but it just bothers me how much praise is heaped upon compilers like they are some kind of gift from God, forged by monks, wizards, and unrivaled geniuses. A compiler is just a tool. There is no magic.
No problem here. I deleted a rant turning into a screed about blog articles superimposing "The Book of Dragon" and "The Dragon Book", but after some consideration found this link:
With respect to "refuting the central point", if in the OP it is "Performance engineering should be about knowledge and science, not superstition and myth. (Godbolt not Godwin?)", then I agree. See:
"Any sufficiently advanced technology is indistinguishable from magic."
When people call themselves engineers without ever having set foot in an engineering school, with a degree covering compiler development, but rather only a six-week bootcamp, compilers feel to them like a sufficiently advanced technology.
You really don't need a degree for this stuff. A typical degree has 1 class about this, and half of it is dedicated to formal language theory or the unfortunate practice of using parser generators.
Then my degree was atypical, as we had several classes about this, scattered around two years of other stuff.
The theoretical stuff, the history and evolution of programming languages, compiler and language design, the actual toy language implementation all the way to native code.
That was 30 years ago; maybe the quality of teaching went down since then, I guess.
Get Modern Compiler Implementation in C [0] (with Java and ML versions also available), or the Kaleidoscope LLVM tutorial [1], and do the whole thing in 2 weeks, as a starting point.
Very cool! I enjoy seeing people writing asm, and letting us get the most out of our CPUs.
I see you already tried what I thought of, which is getting rid of the jump table and making each instruction handler the same size. Do you think that could still work if you limited each instruction handler to 64 or 32 bytes instead of 256, and then for longer handlers jumped to a larger body of code somewhere else?
Instead of using unsafe Rust to optimize the interpreter loop, I would prefer to write a transpiler that compiles the source UXN binaries to local CPU binaries, which would not require making code unsafe/less readable and would permit further speed enhancements by getting rid of an interpreter loop altogether.
This is an interesting idea, but gets tricky if someone writes self-modifying code!
There are Uxn instructions which write to RAM; normally, this is used for storing data, but nothing prevents programs from editing their own code as they're running.
> but nothing prevents programs from editing their own code as they're running
On some platforms writing to memory mapped in with PROT_EXEC will trigger a page fault and the process will be killed. In other words, self modifying executables are effectively forbidden by design (the workaround is to unmap the memory, map it as PROT_WRITE, modify, unmap, map it in as PROT_EXEC, and resume execution - which is what JITs do on MacOS, for example).
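A minimal sketch of that dance in C, using mprotect to flip the protections in place rather than a literal unmap/remap (assumes `code` points at a page-aligned region whose length is a multiple of the page size; error handling omitted):

    /* W^X-respecting code patching: the region is never writable and
     * executable at the same time. */
    #include <sys/mman.h>
    #include <string.h>

    void patch_code(void *code, size_t len, const void *new_bytes, size_t n) {
        mprotect(code, len, PROT_READ | PROT_WRITE);   /* make it writable      */
        memcpy(code, new_bytes, n);                    /* patch the instructions */
        mprotect(code, len, PROT_READ | PROT_EXEC);    /* executable again      */
        __builtin___clear_cache((char *)code, (char *)code + len); /* needed on ARM */
    }

On macOS with MAP_JIT regions the actual mechanism is roughly pthread_jit_write_protect_np toggling per thread, but the shape is the same.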
The parent comment was referring to programs running inside the UXN virtual machine, which explicitly supports and endorses self-modifying code. Many of the creators' own programs take advantage of this capability, so any conforming implementation has to find a way to make it work.
There's a better workaround, incidentally, which is to map the same memory to two different address ranges with different access permissions. This allows code in the JIT region to be updated while threads may be running in it.
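For the curious, a sketch of that dual-mapping setup on Linux using memfd_create (illustrative names, error handling omitted): the rw view is what the compiler thread writes through, the rx view is what you execute, and both alias the same physical pages.

    /* Linux-only sketch of a dual-mapped JIT region: writes through *rw are
     * immediately visible through *rx, without ever having a single mapping
     * that is both writable and executable. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    void make_jit_region(size_t len, void **rw, void **rx) {
        int fd = memfd_create("jit", 0);              /* anonymous shared file  */
        ftruncate(fd, len);
        *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        *rx = mmap(NULL, len, PROT_READ | PROT_EXEC,  MAP_SHARED, fd, 0);
        close(fd);                                    /* mappings keep it alive */
    }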
> Making all of the opcode implementations the same size (padding to the size of the largest opcode implementation with .balign 256), then removing the jump table entirely.
Another idea would be to sort the opcodes by size (32-byte, 64-byte, 128-byte, 256-byte - or perhaps +1 to avoid the set-associativity problem below) rather than simply indexing.
> This was also slower, also probably because of cache friendliness: the opcode implementations go from 16.6 KiB total to 64 KiB.
This is probably cache-unfriendly due to alignment too (associativity limits mean more conflict misses for highly-aligned memory), not just size. Most documentation only talks about this problem regarding the data cache though ...
0x100002ec0: b 0x100002d1c ; jump back to the dispatch loop
I'd expect tail call based control flow - i.e. computing the jump destination at the end of each opcode function - to be nicer to the branch predictor because it could track targets separately.
Currently that's hard to achieve in Rust. There's an experiment in progress to see if guaranteed tail calls can be added to the language
It would be interesting to see what happens if you turn on profile-guided optimization. This seems like a very good application for PGO.
The 518ms profile result for the branch-table load is very peculiar. To my eyes, it suggests that the branch-table load is incurring cache misses at a furious rate. But I can't honestly think of why that would be. On a Pi 4 everything should fit in L2 comfortably. Are you using a low-performance ARM processor like an A53?
I would love to understand this uxn thing better. Is it some academic experiment or does it have real-world applications? What’s the deal with the additional return stack?
Why does making the FETCH part of the cycle a macro make it faster? Surely the branch predictor is fine with unconditional immediate branches. What am I missing here?
Also, it has jump-to-register immediately after the instruction that sets that register. Wouldn't it be faster if it went like:
get jmp address
execute VM opcode
jmp to next instruction
Modern out-of-order CPUs (like the M1) can't see branches until far too late.
The M1's frontend is at least 24 instructions past the unconditional branch before the earliest possible moment it can even see it.
So the branch predictor isn't just responsible for predicting which way conditional branches go. It must remember where all branches are, and their targets, so that the frontend can follow them with zero-cycle delay. This means all branches, including call and return instructions too.
Which means that unconditional immediate branches cost about the same as a correctly predicted conditional branch.
But that's not actually why the fetch has been moved.
The other thing to note is that the frontend and backend of a modern CPU are completely disconnected. The frontend doesn't even try to get the correct address of an indirect jump from the backend. It always uses the branch predictor to predict the indirect branch.
And by inlining, each VM instruction has its own indirect jump, which means it gets a different slot in the branch predictor, allowing for better predictions.
At least that's the theory behind threaded code. I'm unsure how much of this speedup is coming from eliminating the extra unconditional immediate branch and how much is from better prediction of indirect branches.
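For reference, the classic way to get this effect in a C interpreter is GNU computed gotos, where every opcode body ends with its own goto * instead of looping back to a shared dispatch point. A minimal sketch (hypothetical two-opcode VM, not the article's actual code):

    /* GNU C computed gotos: each opcode body ends with its own indirect
     * branch. Unlisted opcodes are left NULL; this is illustration only. */
    #include <stdint.h>

    int64_t run(const uint8_t *pc) {
        /* GCC explicitly allows initializing a static array with label addresses. */
        static void *labels[256] = {
            [0] = &&op_inc, [1] = &&op_dec, [255] = &&op_halt
        };
        int64_t acc = 0;

        #define NEXT() goto *labels[*pc++]
        NEXT();
    op_inc:  acc += 1; NEXT();
    op_dec:  acc -= 1; NEXT();
    op_halt:
        #undef NEXT
        return acc;
    }

Each of those goto * sites is a distinct indirect branch instruction, so the predictor can key its prediction off the address of the jump itself - the "different slot" effect described above.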
Side note: Intel CPUs since Skylake and also recent AMD CPUs (since Zen 3 or so?) store a history for indirect branches. On such processors, using threaded jumps does not really improve performance anymore (I've even seen 1-2% slowdowns on some cores).
Pretty sure it's Haswell and Zen 2. They both implement IT-TAGE based branch predictors.
I just assumed the M1 branch predictor would also be in the same class, but I guess not. In another comment (https://news.ycombinator.com/item?id=40952404), I did some tests to confirm that it was actually the threaded jumps responsible for the speedup.
I'm tempted to dig deeper, see what the M1's branch predictor can and can't do.
Turns out that M1 can track the history of indirect branches just fine, but it takes 3 cycles for a correct prediction. With threaded jumps, the M1 gets a slightly higher hit rate for the initial 1 cycle prediction.
Unfortunately, it was hard to read with the monokai.css theme because comments were nearly invisible, and a lot of your information was in comments. Changing the color from #75715e to #95917e did the trick. I guess Monokai is for programmers who never read comments.
Does the compiler do anything differently if you stick to Rust but change the dispatch implementation to be an #[inline] fn next() that you put at the end of each opcode?
It looks to me like that crate is supposed to help with exposing rust functions to C safely, not calling foreign functions in foreign code safely. Am I missing something?