An ordinary developer cannot write performant x86 without a massive optimizing c...

LegionMammal978 · on July 11, 2024

> Actual instruction encoding is horrible. If you’re arguing that you can write a high-level assembly over the top, then you aren’t so much writing assembly as you are writing something in between.

One instruction in x86 assembly is one instruction in the machine code, and one instruction as recognized by the processor. And except for legacy instructions that we shouldn't teach people to use, each of these is not much higher-level than an instruction in any other assembly language. So I still don't see what the issue is, apart from "the byte encoding is wacky".

(There are μops beneath it of course, but these are reordered and placed into execution units in a very implementation-dependent manner that can't easily be optimized until runtime. Recall how VLIW failed at exposing this to the programmer/compiler.)

> padding a cache line, avoiding instructions with too many uops

Any realistic ARM or RISC-V processor these days also supports out-of-order execution with an instruction cache. The Cortex processors even have μops to support this! The classic 5-stage pipeline is obsolete outside the classroom. So if you're aiming for maximum performance, I don't see how these are concerns that arise far less in other assembly languages. E.g., you'll always have to be worried about register dependencies, execution units, optimal loop unrolling, etc. It's not like a typical program will be blocked on the μop cache in any case.

> APX 32 registers and more normal shorter instructions

APX doesn't exist yet, and I'd wager there's a good chance it will never reach consumer CPUs.

hajile · on July 11, 2024

Which “one instruction” is the right one? You can have the exact same instruction represented many different ways in x86. Some are shorter and some are longer (and some are different, but the same length). When you say to add two numbers, which instruction variant is correct?
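To make the "many encodings" point concrete, here is a small sketch that builds a few byte encodings by hand. The opcode and ModRM values follow the standard ADD encodings. The helper `modrm` and the variable names are made up for this illustration, and this is not how any real assembler is structured.

```python
# Illustrative only: hand-assemble a few equivalent x86 ADD encodings.
# Helper and variable names are hypothetical, chosen for this sketch.

def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModRM fields (2 + 3 + 3 bits) into one byte."""
    return (mod << 6) | (reg << 3) | rm

EAX, EBX = 0, 3          # 32-bit register numbers
REG_DIRECT = 0b11        # mod=11 means "rm is a register, not memory"

# add eax, ebx -- two different 2-byte encodings of the same operation
add_rm_r = bytes([0x01, modrm(REG_DIRECT, EBX, EAX)])  # ADD r/m32, r32
add_r_rm = bytes([0x03, modrm(REG_DIRECT, EAX, EBX)])  # ADD r32, r/m32

# add eax, 1 -- three encodings with three different lengths
one32 = (1).to_bytes(4, "little")
add_imm8 = bytes([0x83, modrm(REG_DIRECT, 0, EAX), 0x01])     # imm8, sign-extended
add_eax_imm32 = bytes([0x05]) + one32                         # short form for EAX
add_rm_imm32 = bytes([0x81, modrm(REG_DIRECT, 0, EAX)]) + one32

for name, enc in [
    ("add eax, ebx (01 /r)", add_rm_r),
    ("add eax, ebx (03 /r)", add_r_rm),
    ("add eax, 1 (83 /0 ib)", add_imm8),
    ("add eax, 1 (05 id)", add_eax_imm32),
    ("add eax, 1 (81 /0 id)", add_rm_imm32),
]:
    print(f"{name:24} {len(enc)} bytes: {enc.hex(' ')}")
```

All five byte strings are valid, and an assembler has to pick one for each mnemonic; most pick the shortest unless told otherwise.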

There isn’t a straightforward answer. As I alluded to in another statement you quoted, padding cache lines is a fascinating example. Functions should ideally start at the beginning of a cache line. This means the rest of the preceding cache line needs to be filled with something. NOP seems like the perfect solution, but it adds unnecessary instructions to the uop cache, which slows things down. Instead, the compiler goes to the code on the previous cache line and expands instructions to their largest size to fill the space, because adding a few bytes of useless prefix is allowed and doesn’t generate NOPs in the uop cache.
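As a rough sketch of the arithmetic involved (not of what any particular compiler actually does), assume a 64-byte cache line and a made-up end address for the previous function. The names `CACHE_LINE`, `padding_to_next_line`, and `prev_end` are hypothetical.

```python
# Illustrative only: how many bytes must be absorbed so the next function
# starts on a cache-line boundary. The 64-byte line size is an assumption.
CACHE_LINE = 64

def padding_to_next_line(addr: int, line: int = CACHE_LINE) -> int:
    """Bytes between addr and the next multiple of line (0 if already aligned)."""
    return (-addr) % line

prev_end = 0x103D                      # hypothetical end of the previous function
gap = padding_to_next_line(prev_end)   # bytes to fill before the boundary

# Two valid encodings of `add eax, 1`: imm8 form and imm32 form.
short_add = bytes.fromhex("83c001")
long_add = bytes.fromhex("81c001000000")
growth = len(long_add) - len(short_add)

print(f"gap to fill: {gap} bytes")
print(f"switching one add to its long form grows the code by {growth} bytes")
```

In this contrived case, re-encoding a single instruction closes the gap exactly, with no NOP emitted. Real code would need to combine several such choices, and whether that beats a multi-byte NOP depends on the microarchitecture.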

Your uop statements aren’t what I was talking about. Almost all RISC-V instructions result in just one uop. On fast machines, they may even generate fewer than one uop each if they get fused together.

x86 has a different situation. If your instruction generates too many uops (or is too esoteric), it will skip the fast hardware decoders and be sent to a microcode decoder. There’s a massive penalty for doing this that slows performance to a crawl. To my knowledge, no such instructions exist in any modern ISA.

Intel only has one high performance core. When they introduce APX, it’ll be on everything. There’s good reason to believe this will be introduced. It adds lots of good features and offers increased code density in quite a few situations which is something we haven’t seen in a meaningful way since AMD64.

saagarjha · on July 12, 2024

> x86 has a different situation. If your instruction generates too many uops (or is too esoteric), it will skip the fast hardware decoders and be sent to a microcode decoder. There’s a massive penalty for doing this that slows performance to a crawl. To my knowledge, no such instructions exist in any modern ISA.

It's not surprising that modern ISAs are missing legacy instructions.