Does a compiler use all x86 instructions? (2016) (pepijndevos.nl)
103 points by davidgerard on July 18, 2022 | 95 comments


There are still instructions for binary coded decimal arithmetic in our 64-bit superscalar multicore processors (although they throw #UD in 64-bit mode).

DAA and AAA for addition, DAS and AAS for subtraction, AAM for multiplication, AAD for division.

The software interrupts (INT) used for system calls were almost universally replaced by SYSENTER and SYSCALL (and their SYSEXIT/SYSRET counterparts).

The ENTER and LEAVE instructions, for setting up and tearing down the stack frame in function calls, are slower than direct modification of *SP and *BP.

The extra segment registers (ES, FS, and GS) are nearly useless in a flat memory model (CS, DS, ES, and SS must be equal for practical purposes). Thread-local storage uses FS in Linux, and the kernel uses GS for per-CPU data structures. It looks like BSD may use GS for thread-local storage.


I've recently observed gcc 10.x emitting the LEAVE instruction in functions using __builtin_alloca or __attribute__(( force_align_arg_pointer )), although only for x86_64 targets (for i686 it's still a MOV ESP, EBP / POP EBP pair).


I looked this up once: LEAVE is fast, but ENTER is slower than just doing it by hand. So compilers will use LEAVE under some circumstances, but ENTER is a dead instruction AFAIK, just like LOOP and a few others.


How to convert an integer to a hex digit in 5 bytes:

    ; AL in range 0..35 (usually 0..15)
    cmp   al,10
    sbb   al,0x69
    das
    ; AL = '0'..'9', 'A'..'Z'
(equivalent code also works on the Z80. Maybe also with the 6502's decimal mode?)

The oldest and least used feature is probably the parity flag. It originally came from the Datapoint 2200 terminal, and still exists even in 64-bit mode!


Yes, the classic "cmp sbb das". It's well-known amongst the Asm/demoscene community. The opposite, ASCII to nybble in 4 bytes, is a little less well-known:

    ; AL = 30h..39h, 41h..46h, 61h..66h
    and al, 4fh
    aam 37h
    ; AL = 0..F
It's case-insensitive too.


6502 decimal mode is a flag that changes the behavior of arithmetic operations; there's no easy way to convert BCD<->binary.


But the way it changes ALU operation is similar in effect to doing "DAS" after subtraction (it doesn't "convert" to BCD, only adjusts the result). There may be differences in how invalid BCD values are handled though.


FS and GS are used by Windows APIs. The 286 segmentation model was used to play with stuff before NX became a thing, though (or VT-x/SVM).


Why are there so many LEAs?

LEA does the addressing math without an actual fetch/store. The address form is:

  base + k * n + offset
Every time you put & before an lvalue there is a chance for LEA.
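
For illustration, a sketch in C (hypothetical function names; the commented assembly is what gcc or clang typically emit at -O2 for the SysV x86-64 ABI):

    /* Both the address-of and the scaled arithmetic fold into one LEA. */
    int *element_addr(int *base, long i) {
        return &base[i];          /* lea rax, [rdi + rsi*4] */
    }

    long affine(long n, long off) {
        return off + n * 8 + 3;   /* lea rax, [rsi + rdi*8 + 3] */
    }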

https://stackoverflow.com/questions/1658294/whats-the-purpos...


IIRC it's also (ab?)used for a lot of actual math that happens to fit that pattern, which is a lot of math.


it's basically the Internet equivalent of fma which is the best floating point instruction.


It is not as generic. k can only be the constants 1, 2, 4 or 8, which makes the "multiplication" a shift. And the offset is also a constant (8 or 32 bits).

But you can multiply by 3, 5, or 9 (i.e. 2+1, 4+1, or 8+1) by having base and n be the same register.
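
For instance (hypothetical function name; gcc and clang typically emit a single LEA here at -O1 and above):

    long times5(long x) {
        return x * 5;     /* lea rax, [rdi + rdi*4] */
    }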


Unfortunately compilers can't emit FMA instructions from regular floating point math without (an equivalent of) `-ffast-math`, as the fused result isn't bit-identical to a*b + c done with two instructions.


> Unfortunately compilers can't emit FMA instructions from regular floating point math without specifying (an equivalent of) `-ffast-math`

... if you specify an ISO language standard (e.g. -std=c99). By default, in GNU mode, gcc will happily emit FMA at -O3. From the man page:

By default, -fexcess-precision=fast is in effect; this means that operations may be carried out in a wider precision than the types specified in the source if that would result in faster code, and it is unpredictable when rounding to the types specified in the source code takes place. [...] [-fexcess-precision=standard] is enabled by default for C if a strict conformance option such as -std=c99 is used.
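
A quick way to see the difference (a sketch; the file and function names are hypothetical, and the exact output depends on gcc version and target):

    /* fma_demo.c (hypothetical)
       gcc -O3 -mfma -S fma_demo.c          -> typically one vfmadd* instruction
       gcc -O3 -mfma -std=c99 -S fma_demo.c -> separate vmulss + vaddss */
    float muladd(float a, float b, float c) {
        return a * b + c;
    }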

> Unfortunately compilers can't emit FMA instructions without specifying [-ffast-math]

I would argue that it would have been fortunate if FMA was disabled by default. And that it is unfortunate that it is not.

Yes, FMA has a performance boost and (usually) better accuracy... But it comes at the cost of bit-for-bit reproducibility. If we let the compiler automatically decide when to apply FMA, then a compiler upgrade or even unrelated code changes could lead to slightly different results as rounding is performed at different points in the computation. For numerical codes in which small perturbations can yield wildly different solution paths, debugging can become a nightmare.
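
To make the rounding difference concrete, a sketch (the fma() library call stands in for the contracted form; compile with something like gcc -O2 -ffp-contract=off demo.c -lm so the first expression really is two instructions):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
        /* a*b = 1 - 2^-54 exactly, which rounds to 1.0 in double,
           so the two-instruction version prints 0x0p+0 ... */
        printf("%a\n", a * b + c);
        /* ... while the fused version keeps the exact product
           and prints -0x1p-54. */
        printf("%a\n", fma(a, b, c));
        return 0;
    }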

Of course this is a subjective preference and it strongly depends on the application domain. I agree that for some people defaulting to best performance at -O3 is the right choice.


> By default, in GNU mode, gcc will happily emit FMA at -O3.

I had no idea that gcc was in 'GNU mode' by default, and that specifying -std would turn that off. I always assumed it just had a default standard version that is (very) irregularly incremented.

> I would argue that it would have been fortunate if FMA was disabled by default.

I agree, and (outside of my earlier ignorance of GNU mode) it is most everywhere.

My 'unfortunate' wasn't aimed at compilers per se, but rather at the (unavoidable) non-associative nature of floating point. I do wish that it was easier to specify at a per-file or per-function level that emitting FMA / performing algebraic and other non-bit-for-bit-reproducible optimizations is ok.


This is a major portion of why I use Julia as my primary math programming language. By default it doesn't re-associate (unlike C), which makes it much easier to write error-compensating arithmetic, and it has macros to make it easy to give any function/expression fastmath semantics (or narrower re-association-only semantics without the NaN/Inf/subnormal effects of fastmath). It also has a reinterpret function which makes it much simpler to do bitcasts than in C, where there are 100 different ways, 99% of which are technically UB.
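
For contrast, the one C bitcast idiom that is clearly well-defined is the memcpy one; a minimal sketch (hypothetical function name):

    #include <stdint.h>
    #include <string.h>

    uint32_t bits_of_float(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);   /* compilers lower this to a single mov */
        return u;
    }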


There are parallel strict C or C++ standard version flags and corresponding GNU variants (for example -std=gnu++17) which enable GNU extensions for each corresponding standard. The GNU mode (of whatever standard is the default for a specific gcc version) is enabled by default.


I think you maybe meant to say “integer”?


yes. yes I did.


Load effective address


thank you for spelling out the mnemonic!


Something I heard a while back is that very roughly 90%+ of compiler-generated code is made up of 19 instructions, and the rest are rare enough that, on x86/64, you shouldn't memorize them; just keep the manual handy. I wonder if that included various variations of instructions. But I am very curious about OP's question now.

Moreover, can you influence GCC over what instructions it emits? On Gentoo, for example, you can set up your make file so that the whole system is compiled with your specific processor in mind.


> Moreover, can you influence GCC over what instructions it emits? On Gentoo, for example, you can set up your make file so that the whole system is compiled with your specific processor in mind.

For what it's worth, this is largely unnecessary nowadays. Tuning for a specific processor used to let the compiler do meaningful instruction scheduling and enabled the use of SSE instructions. However:

1) Modern x86 CPUs all use out-of-order execution, making instruction scheduling unnecessary. (Some older Atom CPUs are in-order, but their performance is pretty awful even with instruction scheduling.)

2) If you're building for x86_64, you can assume a certain baseline of instruction set support -- in particular, CMOV, SSE, and SSE2 instructions are always available on any 64-bit CPU. While these don't cover all instructions, it does cover the bulk of what the compiler is likely to want to generate in typical application code.


> For what it's worth, this is largely unnecessary nowadays.

Not true; compilers remain extremely conservative in their default codegen. We're talking pre-Nehalem (SSE2 at the top end).

I just checked on godbolt, by default u32::count_ones compiled with rustc 1.62 with -O generates 15 lines of assembly hand-counting the bits in the word.

Adding `-C target-cpu=x86-64-v2` reduces that to a single popcnt.

Even if you don’t care for SSE3+, there’s a handful of really useful instructions which are not part of the baseline x86 profile, especially for bit-twiddling and endianness manipulation: popcnt, lzcnt (v3), movbe (v3), BMI1 and BMI2 (v3)
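
A C analogue of the Rust example (a sketch; gcc and clang behave similarly):

    /* Builds to a bit-twiddling sequence or a libgcc call with plain -O2,
       and to a single popcnt with -march=x86-64-v2 (or -mpopcnt). */
    int count_ones(unsigned int x) {
        return __builtin_popcount(x);
    }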


My vote is that pext and pdep are the highlight from BMI2.
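
A small taste of PEXT (a sketch; _pext_u64 is the BMI2 intrinsic from <immintrin.h>, so compile with -mbmi2 or -march=x86-64-v3):

    #include <immintrin.h>
    #include <stdint.h>

    /* Gather the low bit of every nibble of x into the low 16 bits. */
    uint64_t nibble_low_bits(uint64_t x) {
        return _pext_u64(x, 0x1111111111111111ULL);
    }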


Just be aware that even though these instructions are supported on pre-Zen 3 AMD, they are pretty much emulated and not very fast.

Otherwise I agree that these are awesome, I know a fast Hilbert-curve implementation using those.


Yes, there’s the annoying tendency of extensions to be slow on the first generation of processors that support them, and then faster afterwards.


While that's true, the PDEP/PEXT numbers are a bit on the extreme side on AMD's early generations. On Zen 1 and Zen 2 these instructions have a latency and reciprocal throughput of 18 and 19 clock cycles respectively. On Zen 3 it's a perfectly sensible 3-cycle latency and 1-cycle reciprocal throughput [1].

I'm not aware of any other extension with such a slow implementation in its first generations.

[1] https://www.agner.org/optimize/instruction_tables.pdf

It's unfortunate that PDEP and PEXT are part of BMI2. AMD probably wanted to push BMI2 for the other instructions, and they had to include emulated PDEP and PEXT before they had a proper implementation of those.


This stuff seems actually quite difficult to me. For example on one silly programming contest problem[0] there's a cluster of users from about 6th place to about 21st place who are basically reading memory sequentially as fast as they can using some simple loop that does 2 or more 32-byte loads into vector registers and then does some trivial calculations that can be done concurrently with the next loads (this is maybe 20 instructions if you choose to do a lot of loads per iteration, way fewer than that if you do just 1 or 2). There is also a small group of users who are going significantly faster! The top solution seems to be reading RAM 20% faster than a loop whose instructions look fine to me.

[0]: https://highload.fun/tasks/5/leaderboard
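
A sketch of the kind of loop being described, assuming AVX2 and a length that is a multiple of 64 bytes (hypothetical function name; the "trivial calculation" here is just summing the bytes):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_bytes(const uint8_t *p, size_t n) {
        __m256i zero = _mm256_setzero_si256(), acc = zero;
        for (size_t i = 0; i < n; i += 64) {
            /* two 32-byte loads per iteration */
            __m256i a = _mm256_loadu_si256((const __m256i *)(p + i));
            __m256i b = _mm256_loadu_si256((const __m256i *)(p + i + 32));
            /* vpsadbw horizontally sums groups of 8 bytes into 64-bit lanes */
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(a, zero));
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(b, zero));
        }
        __m128i s = _mm_add_epi64(_mm256_castsi256_si128(acc),
                                  _mm256_extracti128_si256(acc, 1));
        return (uint64_t)_mm_extract_epi64(s, 0) +
               (uint64_t)_mm_extract_epi64(s, 1);
    }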


Even with out-of-order execution there's still a use for instruction scheduling: move those loads as far forward in the instruction stream as you can. A CPU can't execute instructions out of order if it hasn't decoded them yet.


These days CPUs have instruction windows that are hundreds of instructions deep. That typically allows for way more load hoisting than a compiler can realistically do (and that's pretty much the reason why all in-order or VLIW designs have fallen out of favor for general purpose computation).


They can't do a load before actually reaching the load instruction, and they can't fill the instruction buffer all that fast, maybe 5 instructions per clock, depending on loads of factors. They don't know which instructions are on the critical path, so it is perfectly possible to pick the wrong ones in this regard. Having a big instruction buffer does not necessarily mean that the CPU can always pick a perfect selection of instructions from it; that is already a very hard problem that only becomes harder when you give it over a hundred to pick from.

In short, the out of order instruction buffer can do some amazing stuff for code that would otherwise run much slower, that doesn't mean you can't gain or lose performance by reordering instructions. For non-trivial code the best composition is almost certainly different between CPUs.


OoO windows do not need to do a perfect job. They just need to do a better job than the compiler at exposing memory parallelism, and that's usually the case.


Instruction placement and scheduling are still important on OoO cores, e.g. intentionally splitting a function over multiple pages to keep the commonly used parts of functions together in fewer pages, which helps with instruction cache utilisation and iTLB pressure.

Lots of high-performance microarchitectures use assumptions about control flow for optimisation, e.g. using the direction (backward, forward) of conditional jumps and branches as a weak prediction, converting conditional branches over a single instruction into predicated execution, or fusing common instruction pairs into a single micro-operation. At least limited cooperation between compiler and CPU core is required to make good use of those features.


> Something I heard a while back is that very roughly 90%+ of compiler-generated code is made up of 19 instructions, and the rest are rare enough that, on x86/64, you shouldn't memorize them; just keep the manual handy. I wonder if that included various variations of instructions.

It would definitely have to include the various variations. I ran it on my system, and of the top 38 instructions, 13 of them are variations on `mov`. The 22nd most common instruction is `and`, 26th is `shl`, 34th is `shr`, 36th is `or`, 39th is `imul`. You need to know those.


You can with -march flags, but there was a long time where the added instructions weren't useful for "typical" programs, only high end ones using SIMD. So many of them still won't be generated, since autovectorization isn't all that reliable.


/usr/bin has certainly no -march=native extensions, not even SIMD. It's extremely conservative, unless you are using Gentoo (nobody does anymore).

/usr/local/bin with tuned CFLAGS sounds better for such a statistical analysis.


> Moreover, can you influence GCC over what instructions it emits? On Gentoo, for example, you can set up your make file so that the whole system is compiled with your specific processor in mind.

Yes, of course. On a Gentoo ~amd64 system where everything is compiled with "-O3 -march=native", I get 1311 different ASM instructions: https://gist.github.com/stefantalpalaru/932b6d0ef439756e825d...


Wow, that is fascinating. I need to play with this myself; I had never heard of some of those, like prefetcht0.


The article raises an interesting question, but the methodology used to determine whether his system's binaries use the entire available set of instructions is dubious at best. He doesn't describe what operating system or distribution he is running, or how the binaries were built at all.

If it's on Gentoo with all the optimization settings tweaked out, it may be useful data. If it's on something like Debian where binaries are built to a lowest common denominator, that's... much less useful.


On my Gentoo system with `-march=znver2` and `-ftree-vectorize` set, I ran:

    find /usr/bin /usr/lib64 \( -type f -executable -or -name '*.so*' \) -exec objdump -d '{}' + | cut -f3 | grep -oE "^[a-z]+" | sort | uniq -c
Which should find all executables and shared libraries. It found 1174 instructions. So ballpark 3 times as many instructions as on his system, and twice as many as the number of instructions he claims to exist. So you're right; suffice it to say his methodology has some holes in it.

Interestingly, xor dropped from 6th place on his list to 11th on mine. My list promoted jmp to 4th place, above je in 5th, which surprised me.

His objdump calling convention appears to be different than mine; he's seeing the `callq`, `jmpq`, and `retq` instructions, I get `call`, `jmp`, and `ret`. I don't know what that means.


The mnemonics ending in -q just specify explicitly that they are 64-bit (quad-word, where "word" was 16-bit on Intel 8088) instructions, while the size of the mnemonics without -q is specified implicitly by a previous assembler directive.

Also on a Gentoo system, even when searching just /usr/bin, I have seen a result similar to yours, with plenty of AVX instructions. This is expected, because our executables have been compiled for a modern CPU, while those of the author have been compiled for a generic 64-bit CPU, i.e. a 20-year-old one.

The high frequency of "jmp", which I have noticed too, can come e.g. from a higher proportion of "if ... else ..." versus simple "if", or from a higher proportion of "switch" versus "if", or from some compiler that does tail-call optimization.

The methodology is OK for determining the static frequency of instructions, which determines the size of a program. The dynamic frequency of instructions, which influences the execution time, can be very different, because many computational instructions are mostly used either in loops or in frequently invoked functions, so their static frequency can be orders of magnitude less than their dynamic frequency.

It is convenient to add " | sort -n " at the end of the one-liner, to get the list of mnemonics already sorted by frequency.


> If it's on something like Debian where binaries are built to a lowest common denominator, that's... much less useful.

Depending on the point one wants to make. Only very few people will run maxed out Gentoo builds, but there is a huge number of people running debian and red hat derived distributions. Thus findings there represent what is in broad use.


If you build a binary locally, you may enable the use of instruction sets that your processor supports.

When you download an x86 binary, it will likely stick to the subset of instructions that most processors have.


The subset that most have is going to be pretty broad. These days, you'd definitely assume that processors have SSE2, for example.

However, most SSE2 instructions won't be emitted by GCC until you crank things up to -O3.


> These days, you'd definitely assume that processors have SSE2, for example.

In fact, x86_64 specifically guarantees the availability of SSE2.


These are completely unrelated issues.

Vectorising non-vector C code is difficult (and likely relies on other expensive optimisations around flow control manipulation).

And it's not a "sure fire" optimisation, because the compiler can absolutely vectorise a loop whose 99th-percentile trip count is an iteration or two, making the vectorisation a complete waste of compile time and runtime.

So being an expensive and not perfectly reliable optimisation, it makes sense for it to be at -O3.

That has nothing to do with ISA subsetting/compatibility.


It sounds like we agree on everything here, but I’m not sure what you are saying is a “completely unrelated issue”.


Not surprising; after all, C doesn't have vector math. There is no way to make sensible use of these instructions without the compiler significantly altering the program flow first.


Using -O3 is a very sensible way to use vector instructions, in practice.

It’s not universally good. It’s just that if you take naïve C code that does bulk operations on arrays, and compile it at -O3, you might get a very nice improvement from the auto-vectorizer. With some additional restrict keyword annotations, you can sometimes improve the performance significantly.

Then there are non-standard tricks you can use, like __builtin_assume_aligned().

This is something of an arcane art. Sometimes it just doesn’t work at all, sometimes it’s not obvious how to write your code so that the autovectorizer can work well, and getting the autovectorizer to work requires enabling various code transformation passes that will often make code worse. However, it’s still worth using, because when your function does autovectorize well, it saves you a ton of work trying to vectorize it manually.
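
A sketch of the kind of code that tends to autovectorize cleanly under these hints (hypothetical function name; restrict rules out aliasing, __builtin_assume_aligned permits aligned loads):

    void scale(float *restrict dst, const float *restrict src, int n) {
        float *d = __builtin_assume_aligned(dst, 32);
        const float *s = __builtin_assume_aligned(src, 32);
        for (int i = 0; i < n; i++)
            d[i] = 2.0f * s[i];   /* gcc -O3 typically emits mulps/vmulps */
    }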


Actually, if you tell the compiler to use SSE and above, it will use them extensively, even for floating point instead of x87.


Yes, it will do fun things like turn 1.0f + 1.0f into (1.0f,0,0,0) + (1.0f,0,0,0); that is better than having to deal with the old floating point stack, but doesn't really exploit the full potential of SSE instructions.


SSE has scalar instructions - like "addss", "subss" etc.

These will be used by default (at O2 without vectorization enabled) for single-precision floating point math on x86-64 (where stack-based FPU is deprecated).

The vector instructions you're referring to end with "ps"; they're very unlikely to be emitted for typical fp operations.
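
For example (a sketch; the commented assembly is gcc's typical -O2 output on x86-64):

    float add(float a, float b) {
        return a + b;     /* addss xmm0, xmm1 -- scalar, not packed */
    }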


There are even more instructions: see this lecture on finding undocumented instructions https://www.youtube.com/watch?v=KrksBdWcZgQ


Highly recommend watching this, as it was a great presentation. Thank you for posting it.


The article is interesting but short. It is a pretty interesting question, but I am still missing the answer to it! (Probably: it does not use all instructions.)


> It is pretty interesting question, but I am still missing the answer for it!

In a very literal sense, the answer is trivially yes, because the compiler can be forced to generate any instruction sequence in an inline assembly block.

But assuming that isn't what you mean, there are plenty of instructions that the compiler will never emit when compiling CPU-independent C code (i.e. no inline assembly or CPU-specific compiler intrinsics). These fall into a few general categories:

1) Privileged instructions, like IN/OUT, WAIT, or SGDT. These instructions aren't even usable outside the kernel, and there's no way to represent their effects in pure C anyway.

2) Instructions which interact with specific hardware capabilities of the CPU, like the AES/SHA instructions, CPUID, RDTSC, or PREFETCH. Much like the privileged instructions, there's no way to represent their effects in C.

3) 8086 legacy instructions like ENTER, LOOP, or XLAT, which operate on 16-bit registers in highly inflexible ways, making them awkward for a compiler to generate code around. Most of them are also microcoded, making them slower than equivalent code sequences -- so there's no reason to use them anyway.

4) Instructions which interact directly with the flags register, like LAHF or PUSHF. C code doesn't have any concept of the flags register, and most compilers only generate flag-dependent instructions immediately after setting flags (e.g. in CMP/Jcc sequences), so they never need to save/restore flags.

5) Complex SIMD instructions like PSHUFB or PUNPCKxx which modern compilers can't reason deeply enough about your code to use effectively. (I'd love to be proven wrong on this one!)


There's definitely been a "death spiral" for some of these instructions.

Nobody used "LOOP", for example, so it tended to be poorly optimized, which made compilers even less likely to use it-- just use the seperate instructions that have similar effects. Eventually it actually became a liability to run it efficiently. (The famous "Windows 95 won't work right on a fast K6-2" bug is based on this problem-- LOOP was much less efficient on a Pentium than a K6)


>The famous "Windows 95 won't work right on a fast K6-2" bug is based on this problem

Wow, that's pretty interesting on its own. It turns out a program was trying to figure out how long the LOOP instruction would take to run by running it 2^20 times and dividing it by the amount of time taken in milliseconds. On a Pentium 3 this took 17ms, but on a K6-2 (and faster) this could take 0 milliseconds.

https://gigazine.net/gsc_news/en/20200606-windows95-failed-s...


Then suddenly the REP prefix became much more useful than it used to be (around Nehalem, I think?), to the point that REP MOVS could be as fast as or faster than a SIMDified copy routine.


Isn't REP MOVS about the only one of those useful in application code? I think it's special-cased in the CPU to use whatever is available to copy as fast as possible.


It takes some time to start up, but can be the fastest for large block copies.

REPNE SCASB is useful to implement strlen, and REPE CMPSB for strcmp. Don't know offhand how optimized these are though.

REP LODSB, on the other hand, is completely useless :)
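
A sketch of the copy case using GCC extended asm (hypothetical helper; the "+D"/"+S"/"+c" constraints pin the operands to RDI/RSI/RCX as REP MOVSB expects):

    #include <stddef.h>

    static void *copy_rep_movsb(void *dst, const void *src, size_t n) {
        void *d = dst;
        __asm__ volatile ("rep movsb"
                          : "+D"(d), "+S"(src), "+c"(n)
                          : : "memory");
        return dst;
    }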


> Privileged instructions, like IN/OUT, WAIT, or SGDT. These instructions aren't even usable outside the kernel, and there's no way to represent their effects in pure C anyway.

It's possible to allow IN/OUT from outside of the kernel (FreeBSD has set_i386_ioperm(2) and io(4), Linux has ioperm(2) and iopl(2); other OSes may vary). You still wouldn't be able to emit IN/OUT from pure C, as you said.
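
On Linux that looks roughly like this (a sketch assuming glibc's <sys/io.h> wrappers; needs root):

    #include <sys/io.h>   /* ioperm(), outb() on x86 Linux with glibc */

    int main(void) {
        if (ioperm(0x80, 1, 1) != 0)   /* request access to port 0x80 */
            return 1;
        outb(0x42, 0x80);              /* write a byte to the POST-code port */
        return 0;
    }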


And this works by setting bits in the I/O permission bitmap at the end of the task state segment (TSS), which directs the CPU to allow IN/OUT for specific ports in a process.

[So glad that I studied the system programming manual back then and now can look smart on a forum :]


I thought that Linux did not use the TSS, though. I understand the TSS is mandatory, but my understanding was that Linux just creates a single TSS entry per CPU, just to satisfy the requirement that there's something there.


Whilst researching my post, I saw something that said that when switching tasks, Linux copies the relevant TSS information of the soon-to-be-current task into the TSS for the CPU. So it can still do all the TSS stuff, it just doesn't keep a TSS entry for every task.


> the compiler will never emit ... instructions like PSHUFB or PUNPCKxx

These are most certainly generated by GCC from C++ code (no intrinsics and such) - I can even provide evidence in the form of a codebase with appropriate build flags.

Their use is rather straightforward I think, but still.


Huh, TIL. I guess compilers have gotten smarter since I last looked into this a decade or so ago. :)


> 5) Complex SIMD instructions like PSHUFB or PUNPCKxx which modern compilers can't reason deeply enough about your code to use effectively. (I'd love to be proven wrong on this one!)

I ran his example on my own machine and found this:

    193110 vshufps
    179984 pshufd
    129664 vpshufd
     88279 vpshufb
     73569 shufps
     14409 pshuflw
      8931 pshufhw
      8896 vpshuflw
      5997 pshufb
      3115 vshufpd
      1730 pshufw
      1514 vpshufhw
      1281 vshufi
       620 shufpd
        42 vpshufbitqmb
         4 vshuff

    113872 vunpcklps
     89357 vpunpcklwd
     87224 vpunpcklqdq
     80218 vpunpckhwd
     69531 vunpckhps
     51203 punpcklbw
     36962 punpcklqdq
     29082 punpcklwd
     28967 vpunpckldq
     22217 vunpcklpd
     20415 vpunpcklbw
     19423 vpunpckhdq
     18669 unpcklps
     16693 punpckldq
     16620 vpunpckhbw
     14712 vpunpckhqdq
     12289 punpckhwd
     10788 vunpckhpd
      9446 punpckhbw
      8006 unpckhps
      6207 punpckhdq
      3278 punpckhqdq
       803 unpcklpd
       483 unpckhpd
         6 kunpckwd
         5 kunpckdq
         4 kunpckbw
It's probably pretty likely that some of the lower counts are a result of hand-written assembly, but the compiler has to be responsible for the high-count ones.


ffmpeg and other multimedia libraries are full of handwritten assembly and C/C++ with intrinsic functions. Most math/scientific libraries too.

They also usually contain macros to generate code for multiple variations of instruction sets (SSE2, SSE3, SSE4, AVX, AVX2 at least, maybe more).

Shuffle and Unpack are very, very common in multimedia code.


> PREFETCH

It's generated by GCC with "-fprefetch-loop-arrays", not supported in clang I believe.


>"1) Privileged instructions, like IN/OUT, WAIT, or SGDT. These instructions aren't even usable outside the kernel, and there's no way to represent their effects in pure C anyway."

Could you elaborate on this? I understand you need to be in ring 0 for these instructions, but you still need to be able to compile the kernel. Or am I misunderstanding something obvious, or missing your point completely?


The kernel either uses a separate file written in assembly (.S file) or inline assembly that GCC understands.

https://github.com/torvalds/linux/blob/ce990f1de0bc6ff3de43d...

https://github.com/torvalds/linux/blob/35e43538af8fd2cb39d58...


There isn't any pure C code you can write which would be expressed with those instructions, in the sort of way that MUL would be an expression of "a * b" (for example). The C virtual machine doesn't have the concept of an I/O port or a global descriptor table; the only way of interacting with these features from C is to either use a CPU-specific intrinsic (not available for these instructions) or inline assembly.
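
Which is why such code ends up as inline assembly along these lines (a sketch of the common pattern; the kernel links above contain the real versions):

    /* Emit an OUT instruction directly; no C expression lowers to it.
       "a" pins value to AL, "Nd" allows an immediate port number or DX. */
    static inline void outb(unsigned char value, unsigned short port) {
        __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
    }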


Ah right, these would all be inline assembly in the kernel code, and so the assembler would handle them. Thanks.


Kernels can't really be built with standard-compliant C, and use a noticeable amount of assembly and intrinsics (which are essentially inlined asm functions or generators for those) to drive such details.


SGDT and friends are not privileged. You can execute them perfectly fine in user mode. If you do, you get a number which is completely useless in userspace, but since they don't trap, virtualization software has to validate or rewrite all userspace code to filter them out.

One of the things done by modern CPUs to aid VMs is making them trappable by setting the CR4 UMIP bit.


That those instructions were not made privileged has been widely considered a serious design error of the Intel 80286.

The virtualization modes introduced at about the same time by Intel and AMD, around 2005/2006, were necessary mainly to fix the broken privileged mode of x86/x86-64.

There is a well-known paper from IBM, published in 1970, "A Virtual Machine Time-Sharing System", which listed the requirements for a CPU that can be virtualized, requirements that were violated in the design of the Protected Mode of the 80286 (in 1982), whose defects were inherited by the later Intel and AMD CPUs.

https://www.seltzer.com/margo/teaching/CS508.19/papers/meyer...

In a CPU with a well-designed privileged mode there is no need for any extra "hypervisor" mode, because an operating system cannot detect whether it is executing in the privileged mode on the real machine or in the non-privileged mode in a virtual machine, and the OS can be protected from the user processes by the same mechanisms that protect the user processes from each other.

The UMIP feature available in all recent Intel and AMD CPUs is another fix for the original mistake.


The answer is obviously "no" because there's no such thing as "the set of all x86 instructions." Every vendor has many vendor-specific x86 extensions inside their microcode, some of which end up being adopted by all other vendors (essentially becoming "canonized" as x86 proper). In some cases, these extensions are widely accepted (like x86-64, which was initially developed by AMD, hence its amd64 moniker). In other cases, these extensions die a quiet death, like 3DNow! (also one of AMD's extensions, introduced in their K6-2 line). And in other cases, extensions are only implemented for specific lines of processors (Xeons famously include AVX-512[1]).

To look at a more concrete example of this, gcc has both AVX-512 and 3DNow! flags[2] (because it was around back in 1998), while Go only has AVX-512 flags (and doesn't care about deprecated extensions like 3DNow!).

[1] https://colfaxresearch.com/skl-avx512/

[2] https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html


> Nobody really seems to know how many x86 instructions there are, but someone counted 678, meaning there are over 200 instructions that do not occur even once in all the code in my /usr/bin.


That linked stackoverflow was a bit silly and misleading. They admitted they were too lazy to count the instructions given in Intel's manual.

Of course we know how many instructions there are. Assemblers (e.g. NASM) have to know what instructions exist if Intel or AMD want people to use them. There are undocumented and dead vendor instructions that nobody cares about (3DNow!, Cyrix maybe?), but we definitely know all officially available instructions, because they are all in Intel's manual.


The main problem is: what is the definition of an "instruction"?

For example, let's look at the MOV instruction. How many instructions do you count it as?

The Intel manual lists 2 variants of MOV using the same mnemonic: one for general registers, and another for control registers.

But if we also consider GNU AT&T syntax, then we have movb, movw, movl, movq, and movs, depending on operand size. Do you count that as 5 or 1? To make the matter more complex, movd and movq in GNU syntax can also be used for MMX/SSE registers, while Intel lists MOVD and MOVQ as separate instructions for MMX/SSE (but they also list MOVD and MOVQ in the same section). How many do we count?

If you think all of these are silly and a move is a move, do you also count MOVDQA and MOVDQU as the same (MOVDQA works only with aligned 128-bit data, while MOVDQU also works with unaligned data)? How about the VEX-prefixed versions, VMOVDQA and VMOVDQU, which work exactly the same but have no penalty when used in a stream of VEX-prefixed instructions?

If we count by actual machine code, then there are 10+ variations of the MOV instruction depending on the operands.

(And I still probably have forgotten about a few more MOVxxx instructions)

The linked stackoverflow is a little bit silly, yes, but we actually cannot count how many instructions there are in x86/amd64.


Indeed. This page has some more discussion of the complexities: <https://fgiesen.wordpress.com/2016/08/25/how-many-x86-instru...>


> But if we also consider GNU AT&T syntax

Why, though? The official x86 syntax is Intel syntax. In any case, it doesn't matter. Every possible encoding of each mnemonic is given in Intel's manual. This whole business about "what is an instruction anyway" is just sophistry. Anyone using gas would hopefully know how the AT&T syntax maps to Intel encodings.

> but we actually cannot count how many instructions there are in x86/amd64.

Again, you can. It's quite simple. Define an instruction however you want, and then go and look at the manual. It's a finite, known set. Intel isn't hiding this info.


Maybe I was not clear. The problem is that no one is going to agree to the same definition.

I have already listed several points where disagreements may occur, and that is just the surface of the complexity.


> but we actually cannot count how many instructions there are in x86/amd64

Sorry for being pedantic, but we can, if everyone agrees on what is actually being counted, aka what the definition of an “instruction” is.

You already made the case supporting that POV yourself, then at the end oddly flipped back to implying there was something besides that preventing it. We definitely "can" agree on a definition of an "instruction", but yes, we probably won't.


You are right that, when using a clear definition of what distinct instructions are, it is always possible to count how many instructions an ISA contains.

The only problem is that many people are careless when counting so they do not apply consistent rules, thus obtaining different results.

The most widely used traditional rule is that 2 instructions are not distinct when they perform the same operation on operands of a specified data type, so that the only difference between them is where the operands are.

The operands may be in various kinds of registers, in the instruction stream (immediate operands) or in data memory, and their addresses may be computed in various ways, but the operation performed by the instruction is the same.

The distinct kinds of instructions obtained by this definition can be further grouped in a smaller number of generic instruction types, which perform the same kind of operation, for example a multiplication or a comparison, but on different data types, e.g. on 8-bit/16-bit/32-bit/64-bit integers, signed or unsigned or signed with saturation or unsigned with saturation, fixed-point numbers or floating-point numbers, polynomials with binary coefficients, bit strings or character strings, short vectors or matrices, and so on.


That's a big "if".

We can't even get a single assembly syntax, and the Intel vs. GNU/AT&T syntax war is almost as bad as vi/emacs.


Completely agree, but the point in question wasn’t if it was likely, but if it was possible at all.


LEA is used for calculating the integer A + B * C in one instruction, for C that is 1, 2, 4, or 8.

It is meant for calculating array index addresses (base + index * itemsize), but was found by many assembly programmers of the early days to be useful for general arithmetic. All compilers use this trick.


Sometimes I wonder if just shipping a 64-bit-only, no-legacy-instructions (i.e. cleaned-up x86_64) instruction set would put a stop to the "ARM is fundamentally better" discussion (and sure, there is ARM with legacy instructions etc., but Apple silicon is 64-bit with no legacy cruft AFAIK).


> Sometimes I wonder if just shipping a 64-bit-only, no-legacy-instructions (i.e. cleaned-up x86_64) instruction set would put a stop to the "ARM is fundamentally better" discussion

No

> but Apple silicon is 64-bit with no legacy cruft AFAIK

No. It’s an ARMv8 ISA, legacy cruft and all.


Though it's A64-only (no T32, nor A32), which cuts out a lot of legacy instructions/instruction variants.

Doing the same for x86_64 would be a bit like having no 16-bit and 32-bit specific machine instructions/instruction variants, no support for legacy BIOS things, etc.


AArch64 has very little legacy stuff left.

When ARM announced the A715, they got rid of AArch32 support and said it got them a 75% reduction in the size of the decoder while increasing the width of the decoder from 4 to 5 wide. They even got rid of the uop cache because the new decode was so much smaller and more efficient that it didn't make sense to keep it around.


Depends on the compiler. LGDT is probably never emitted by the compiler, no.



