Which “one instruction” is the right one? x86 can represent the exact same instruction in many different ways. Some encodings are shorter, some are longer, and some are different but the same length. When you say to add two numbers, which instruction variant is correct?
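To make that concrete, here's a rough Python sketch that builds a few of the standard encodings of `add eax, ebx` and `add eax, 1` by hand. The opcodes and ModRM layout are the documented 32-bit forms of ADD; the helper names (`modrm`, `add_reg_reg`, `add_eax_imm`) are just made up for the illustration, and this is nowhere near a full assembler.

```python
# Register numbers used in the ModRM byte (32-bit general-purpose registers).
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}

def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModRM fields: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

def add_reg_reg(dst: str, src: str) -> list[bytes]:
    """Two equivalent register-direct encodings of `add dst, src`."""
    d, s = REG[dst], REG[src]
    return [
        bytes([0x01, modrm(0b11, s, d)]),  # ADD r/m32, r32: dst sits in the rm field
        bytes([0x03, modrm(0b11, d, s)]),  # ADD r32, r/m32: dst sits in the reg field
    ]

def add_eax_imm(value: int) -> list[bytes]:
    """Three encodings of `add eax, value` for a value that fits in a signed byte."""
    assert -128 <= value <= 127
    imm8 = value.to_bytes(1, "little", signed=True)
    imm32 = value.to_bytes(4, "little", signed=True)
    return [
        bytes([0x83, modrm(0b11, 0, 0)]) + imm8,   # ADD r/m32, imm8  (3 bytes)
        bytes([0x05]) + imm32,                     # ADD EAX, imm32   (5 bytes)
        bytes([0x81, modrm(0b11, 0, 0)]) + imm32,  # ADD r/m32, imm32 (6 bytes)
    ]

# Print every variant as hex so the different lengths are visible.
for enc in add_reg_reg("eax", "ebx") + add_eax_imm(1):
    print(enc.hex(" "))
```

All five byte strings do the same architectural work as their siblings; an assembler just has to pick one.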

There’s no straightforward answer. As I alluded to in another statement you quoted, padding cache lines is a fascinating example. Functions should ideally start at the beginning of a cache line, which means the rest of the preceding cache line needs to be filled with something. NOP seems like the perfect solution, but it adds unnecessary instructions to the uop cache, which slows things down. Instead, the compiler goes back to the code on the previous cache line and expands instructions to larger encodings to fill the space, because adding a few bytes of useless prefix is allowed and doesn’t generate NOPs in the uop cache.
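Here's a toy Python model of that padding trick, to show the bookkeeping rather than what any real compiler or assembler does. It assumes a 64-byte cache line, uses the CS segment-override byte (0x2E) as the redundant prefix on the assumption that it's harmless on the instructions being padded, and respects the 15-byte limit on x86 instruction length. `pad_to_cache_line` is a hypothetical helper, not a real tool's API.

```python
CACHE_LINE = 64      # assumed cache-line size in bytes
MAX_INSN_LEN = 15    # architectural limit on x86 instruction length
PAD_PREFIX = 0x2E    # CS segment override; assumed to be ignored on these instructions

def pad_to_cache_line(insns: list[bytes], start: int = 0) -> list[bytes]:
    """Grow instructions with redundant prefixes so the code ends on a line boundary.

    `insns` are already-encoded instructions laid out back to back from `start`.
    If the prefixes cannot cover the whole gap, the remainder is left unfilled
    and the caller would have to handle it some other way (for example a NOP).
    """
    end = start + sum(len(i) for i in insns)
    gap = -end % CACHE_LINE  # bytes missing up to the next cache-line boundary
    padded = list(insns)
    # Walk backwards from the last instruction, stretching each one in turn.
    for idx in range(len(padded) - 1, -1, -1):
        if gap == 0:
            break
        room = MAX_INSN_LEN - len(padded[idx])
        take = min(room, gap)
        padded[idx] = bytes([PAD_PREFIX]) * take + padded[idx]
        gap -= take
    return padded

# 29 copies of `add eax, ebx` (01 D8) occupy 58 bytes, leaving a 6-byte gap.
code = [bytes.fromhex("01d8")] * 29
padded = pad_to_cache_line(code)
print(sum(len(i) for i in padded), padded[-1].hex(" "))
```

In this model the last instruction simply absorbs the whole gap. A real tool would presumably spread the padding more carefully, since piling many prefixes onto one instruction can itself hurt decoding on some cores.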

Your uop statements aren’t what I was talking about. Almost all RISC-V instructions result in just one uop. On fast machines, they may even average fewer than one uop each when adjacent instructions get fused together.

x86 is in a different situation. If your instruction generates too many uops (or is too esoteric), it will skip the fast hardware decoders and be sent to a microcode decoder. There’s a massive penalty for doing this that slows performance to a crawl. To my knowledge, no such instructions exist in any modern ISA.

Intel only has one high-performance core. When they introduce APX, it’ll be on everything. There’s good reason to believe APX will actually ship: it adds lots of good features and offers increased code density in quite a few situations, which is something we haven’t seen in a meaningful way since AMD64.