Article is interesting but short. It is pretty interesting question, but I am st...

duskwuff · on July 18, 2022

> It is pretty interesting question, but I am still missing the answer for it!

In a very literal sense, the answer is trivially yes, because the compiler can be forced to generate any instruction sequence in an inline assembly block.

But assuming that isn't what you mean, there are plenty of instructions that the compiler will never emit when compiling CPU-independent C code (i.e. no inline assembly or CPU-specific compiler intrinsics). These fall into a few general categories:

1) Privileged instructions, like IN/OUT, WAIT, or SGDT. These instructions aren't even usable outside the kernel, and there's no way to represent their effects in pure C anyway.

2) Instructions which interact with specific hardware capabilities of the CPU, like the AES/SHA instructions, CPUID, RDTSC, or PREFETCH. Much like the privileged instructions, there's no way to represent their effects in C.

3) 8086 legacy instructions like ENTER, LOOP, or XLAT, which operate on 16-bit registers in highly inflexible ways, making them awkward for a compiler to generate code around. Most of them are also microcoded, making them slower than equivalent code sequences -- so there's no reason to use them anyway.

4) Instructions which interact directly with the flags register, like LAHF or PUSHF. C code doesn't have any concept of the flags register, and most compilers only generate flag-dependent instructions immediately after setting flags (e.g. in CMP/Jcc sequences), so they never need to save/restore flags.

5) Complex SIMD instructions like PSHUFB or PUNPCKxx which modern compilers can't reason deeply enough about your code to use effectively. (I'd love to be proven wrong on this one!)

hakfoo · on July 18, 2022

There's definitely been a "death spiral" for some of these instructions.

Nobody used "LOOP", for example, so it tended to be poorly optimized, which made compilers even less likely to use it -- they just use the separate instructions that have similar effects. Eventually it actually became a liability to run it efficiently. (The famous "Windows 95 won't work right on a fast K6-2" bug is based on this problem -- LOOP was much less efficient on a Pentium than on a K6.)

nemothekid · on July 18, 2022

>The famous "Windows 95 won't work right on a fast K6-2" bug is based on this problem

Wow, that's pretty interesting on its own. It turns out a program was trying to figure out how long the LOOP instruction would take to run by running it 2^20 times and dividing that by the amount of time taken in milliseconds. On a Pentium 3 this took 17ms, but on a K6-2 (and faster) this could take 0 milliseconds.

https://gigazine.net/gsc_news/en/20200606-windows95-failed-s...
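
The failure mode described above can be sketched in C. This is only an illustration, not the actual Windows 95 code: the function name and the guard are hypothetical, and only the 2^20 iteration count and the millisecond granularity come from the description above. The assumption is that the original divided without checking for a zero elapsed time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CALIBRATION_ITERATIONS (1u << 20)  /* 2^20 LOOP iterations */

/* Hypothetical model of the calibration: LOOP iterations per millisecond.
 * When the measured time rounds down to 0 ms, an unguarded division would
 * fault, so this sketch reports failure through *ok instead. */
uint32_t loops_per_ms(uint32_t elapsed_ms, int *ok)
{
    assert(ok != NULL);
    if (elapsed_ms == 0) {      /* CPU too fast for a millisecond timer */
        *ok = 0;
        return 0;
    }
    *ok = 1;
    return CALIBRATION_ITERATIONS / elapsed_ms;
}
```

With the 17 ms figure quoted above the division is fine; with 0 ms the guard is the only thing standing between the caller and a divide error.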

p_l · on July 18, 2022

Then suddenly the REP prefix became much more useful than it used to be (around Nehalem, I think?), to the point that REP MOV could be as fast as or faster than a SIMDified copy routine.

innocenat · on July 18, 2022

Isn't REP MOV about the only one that's useful in application code? I think it's special-cased in the CPU to use whatever is available to copy as fast as possible.

rep_lodsb · on July 18, 2022

It takes some time to start up, but can be the fastest for large block copies.

REPNE SCASB is useful to implement strlen, and REPE CMPSB for strcmp. Don't know offhand how optimized these are though.

REP LODSB, on the other hand, is completely useless :)
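
For anyone who hasn't seen the REPNE SCASB strlen idiom, here is a portable C model of it (not inline assembly, so it runs anywhere). The variable names mirror the registers in the usual sequence `xor al, al / mov rcx, -1 / repne scasb / not rcx / dec rcx`; the function name is made up for this sketch, and it assumes the direction flag is clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of: xor al,al / mov rcx,-1 / repne scasb / not rcx / dec rcx */
size_t strlen_scasb_model(const char *s)
{
    assert(s != NULL);
    const unsigned char *rdi = (const unsigned char *)s;
    const unsigned char al = 0;     /* byte to scan for: the terminator */
    uint64_t rcx = UINT64_MAX;      /* -1, i.e. effectively no length limit */

    while (rcx != 0) {              /* REPNE: repeat while rcx != 0 ... */
        int zf = (*rdi == al);      /* SCASB: compare AL with the byte at rdi */
        rdi++;                      /* SCASB also advances rdi (DF = 0) */
        rcx--;                      /* the REP prefix decrements rcx per step */
        if (zf)                     /* ... and stops once the bytes matched */
            break;
    }
    return (size_t)(~rcx - 1);      /* not rcx / dec rcx gives the length */
}
```

The counter starts at -1 and is decremented once per byte examined, including the terminator, which is why the final NOT/DEC pair recovers the length.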

toast0 · on July 18, 2022

> Privileged instructions, like IN/OUT, WAIT, or SGDT. These instructions aren't even usable outside the kernel, and there's no way to represent their effects in pure C anyway.

It's possible to allow IN/OUT from outside of the kernel (FreeBSD has set_i386_ioperm(2) and io(4), Linux has ioperm(2) and iopl(2); other OSes may vary). You still wouldn't be able to emit IN/OUT from pure C, as you said.

wruza · on July 18, 2022

And this works by setting bits in io bitmap at the end of a task state segment (TSS), which directs CPU to allow IN/OUT for specific ports in a process.

[So glad that I studied the system programming manual back then and now can look smart on a forum :]
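
A rough C model of that bitmap lookup, for illustration. It only models the check the CPU performs when a process without sufficient I/O privilege executes IN/OUT: one bit per port, a set bit denies access, and a multi-byte access needs every port it touches to be allowed. The type and function names are hypothetical and this is not how any particular kernel lays out its data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IO_PORTS 65536u

/* One bit per I/O port: set = access denied, clear = access allowed. */
typedef struct {
    uint8_t bits[IO_PORTS / 8];
} io_bitmap;

void io_bitmap_deny_all(io_bitmap *bm)
{
    assert(bm != NULL);
    memset(bm->bits, 0xFF, sizeof bm->bits);
}

void io_bitmap_allow(io_bitmap *bm, uint32_t port)
{
    assert(bm != NULL && port < IO_PORTS);
    bm->bits[port / 8] &= (uint8_t)~(1u << (port % 8));
}

/* width is the access size in bytes (1, 2 or 4); every port in the
 * range must have its bit clear for the access to go through. */
int io_access_allowed(const io_bitmap *bm, uint32_t port, uint32_t width)
{
    assert(bm != NULL);
    for (uint32_t p = port; p < port + width; p++) {
        if (p >= IO_PORTS || (bm->bits[p / 8] & (1u << (p % 8))))
            return 0;
    }
    return 1;
}
```

So granting a process port 0x378 alone lets it do a byte-wide access there, but a word-wide access at the same address still faults because it also touches 0x379.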

bogomipz · on July 18, 2022

I thought that Linux did not use the TSS though. I understand the TSS is mandatory, but my understanding was that Linux just creates a single TSS entry per CPU to satisfy the requirement that there's something there.

toast0 · on July 18, 2022

Whilst researching my post, I saw something that said that when switching tasks, Linux copies the soon-to-be-current task's TSS-relevant information into the TSS for the CPU. So it can still do all the TSS stuff, it just doesn't keep a TSS entry for every task.

SotCodeLaureate · on July 18, 2022

> the compiler will never emit ... instructions like PSHUFB or PUNPCKxx

These are most certainly generated by GCC from C++ code (no intrinsics and such) - I can even provide evidence in the form of a codebase with appropriate build flags.

Their use is rather straightforward I think, but still.
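
As an illustration of the kind of plain code this applies to, here are two loops with no intrinsics that are natural candidates for auto-vectorization with unpack or shuffle instructions. The function names are made up for this sketch, and whether a given compiler actually emits PUNPCKxx or PSHUFB for them is an assumption that depends on the compiler version, the optimization level, and the target flags; the way to find out is to build with optimization enabled and read the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-extend each byte to 16 bits: a widening loop. */
void widen_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Interleave two byte streams as a0 b0 a1 b1 ... */
void interleave_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = a[i];
        dst[2 * i + 1] = b[i];
    }
}
```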

duskwuff · on July 18, 2022

Huh, TIL. I guess compilers have gotten smarter since I last looked into this a decade or so ago. :)

nwallin · on July 18, 2022

> 5) Complex SIMD instructions like PSHUFB or PUNPCKxx which modern compilers can't reason deeply enough about your code to use effectively. (I'd love to be proven wrong on this one!)

I ran his example on my own machine and found this:

    193110 vshufps
    179984 pshufd
    129664 vpshufd
     88279 vpshufb
     73569 shufps
     14409 pshuflw
      8931 pshufhw
      8896 vpshuflw
      5997 pshufb
      3115 vshufpd
      1730 pshufw
      1514 vpshufhw
      1281 vshufi
       620 shufpd
        42 vpshufbitqmb
         4 vshuff

    113872 vunpcklps
     89357 vpunpcklwd
     87224 vpunpcklqdq
     80218 vpunpckhwd
     69531 vunpckhps
     51203 punpcklbw
     36962 punpcklqdq
     29082 punpcklwd
     28967 vpunpckldq
     22217 vunpcklpd
     20415 vpunpcklbw
     19423 vpunpckhdq
     18669 unpcklps
     16693 punpckldq
     16620 vpunpckhbw
     14712 vpunpckhqdq
     12289 punpckhwd
     10788 vunpckhpd
      9446 punpckhbw
      8006 unpckhps
      6207 punpckhdq
      3278 punpckhqdq
       803 unpcklpd
       483 unpckhpd
         6 kunpckwd
         5 kunpckdq
         4 kunpckbw

It's pretty likely that some of the lower counts are a result of hand-written assembly, but the compiler has to be doing the high-count ones.

innocenat · on July 18, 2022

ffmpeg and other multimedia libraries are full of handwritten assembly and C/C++ with intrinsic functions. Most math/scientific libraries are too.

They also usually contain macros to generate code for multiple variations of instruction sets (SSE2, SSE3, SSE4, AVX, AVX2 at least, maybe more).

Shuffle and Unpack are very, very common in multimedia code.

SotCodeLaureate · on July 18, 2022

> PREFETCH

It's generated by GCC with "-fprefetch-loop-arrays", not supported in clang I believe.

bogomipz · on July 18, 2022

>"1) Privileged instructions, like IN/OUT, WAIT, or SGDT. These instructions aren't even usable outside the kernel, and there's no way to represent their effects in pure C anyway."

Could you elaborate on this? I understand you need to be in ring 0 for these instructions, but you still need to be able to compile the kernel. Or am I misunderstanding something obvious or missing your point completely?

deckard1 · on July 18, 2022

The kernel either uses a separate file written in assembly (.S file) or inline assembly that GCC understands.

https://github.com/torvalds/linux/blob/ce990f1de0bc6ff3de43d...

https://github.com/torvalds/linux/blob/35e43538af8fd2cb39d58...

duskwuff · on July 18, 2022

There isn't any pure C code you can write which would be expressed with those instructions, in the sort of way that MUL would be an expression of "a * b" (for example). The C virtual machine doesn't have the concept of an I/O port or a global descriptor table; the only way of interacting with these features from C is to either use a CPU-specific intrinsic (not available for these instructions) or inline assembly.
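
To make the contrast concrete, here is a small sketch. The first function is pure C, and the compiler picks the instructions itself. The second is a hypothetical helper (the name is made up) that spells out OUT with GCC-style extended inline assembly, because no C expression means "write a byte to I/O port N". It is guarded so that it only compiles on x86 with a GCC-compatible compiler, and actually executing it requires I/O privileges.

```c
#include <assert.h>
#include <stdint.h>

/* Pure C: the compiler is free to express this with MUL/IMUL
 * (or with something else entirely, such as LEA or shifts). */
uint32_t multiply(uint32_t a, uint32_t b)
{
    return a * b;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Not expressible in pure C, so the instruction is written out.
 * AT&T operand order: outb %al, %dx (or an 8-bit immediate port). */
void port_write_u8(uint16_t port, uint8_t value)
{
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}
#endif
```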

bogomipz · on July 18, 2022

Ah right these would all be inline assembly in the kernel code and so the assembler would handle these. Thanks.

p_l · on July 18, 2022

Kernels can't really be built with standard-compliant C, and use a noticeable amount of assembly and intrinsics (which are essentially inlined asm functions or generators for those) to drive such details.

hyperman1 · on July 18, 2022

SGDT and friends are not privileged. You could execute them perfectly fine in user mode. If you did, you get a number which is completely useless in userspace, but as they don't trap, virtualization software has to validate or rewrite all userspace code to filter them out.

One of the things done by modern CPUs to aid VMs is making them trappable by setting the CR4 UMIP bit.
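
For reference, a sketch of how user code could check whether the CPU advertises UMIP. The bit positions (CPUID leaf 7, subleaf 0, ECX bit 2 for the feature; CR4 bit 11 for the enable) are from the vendor manuals; the function names are hypothetical, and the CPUID call is guarded because `<cpuid.h>` and `__get_cpuid_count` are assumed to be available only with GCC-compatible compilers on x86. Note that this only reports support: whether the kernel has actually set the CR4 bit is not visible this way.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID7_ECX_UMIP (1u << 2)   /* CPUID.(EAX=7,ECX=0):ECX bit 2 */
#define CR4_UMIP        (1u << 11)  /* CR4 bit 11; only ring 0 can set it */

/* Decode the feature bit from an ECX value returned by CPUID leaf 7. */
int ecx_reports_umip(uint32_t leaf7_ecx)
{
    return (leaf7_ecx & CPUID7_ECX_UMIP) != 0;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>

/* Returns 1 if the CPU advertises UMIP, 0 otherwise. */
int cpu_supports_umip(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 7 not available */
    return ecx_reports_umip(ecx);
}
#endif
```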

adrian_b · on July 18, 2022

That those instructions were not privileged has been widely considered a serious design error of the Intel 80286.

The virtualized modes introduced at about the same time by Intel and AMD, around 2005 / 2006, have been necessary mainly for fixing the broken privileged mode of x86/x86-64.

There is the well-known paper from IBM, published in 1970, "A Virtual Machine Time-Sharing System", which has listed the requirements for a CPU that can be virtualized, and which have been violated in the design of the Protected Mode of 80286 (in 1982), whose defects have been inherited by the later Intel and AMD CPUs.

https://www.seltzer.com/margo/teaching/CS508.19/papers/meyer...

In a CPU with a well-designed privileged mode there is no need for any other extra "hypervisor" mode, because an operating system cannot detect whether it is executed in the privileged mode on the real machine or in the non-privileged mode in a virtual machine, and the OS can be protected from the user processes by the same mechanisms that protect the user processes between themselves.

The UMIP feature available in all recent Intel and AMD CPUs is another fix for the original mistake.

dvt · on July 18, 2022

The answer is obviously "no" because there's no such thing as "the set of all x86 instructions." Every vendor has many vendor-specific x86 extensions inside their microcode, some of which end up being adopted by all other vendors (essentially becoming "canonized" as x86 proper). In some cases, these extensions are widely accepted (like x86-64, which was initially developed by AMD, hence its amd64 moniker). In other cases, these extensions die a quiet death, like 3DNow! (also one of AMD's extensions, introduced in their K6-2 line). And in other cases, extensions are only implemented for specific lines of processors (Xeons famously include AVX-512[1]).

For a more concrete example of this, gcc has both AVX-512 and 3DNow! flags[2] (because it was around back in 1998), while Go only has AVX-512 flags (and doesn't care about deprecated extensions like 3DNow!).

[1] https://colfaxresearch.com/skl-avx512/

[2] https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

CodesInChaos · on July 18, 2022

> Nobody really seems to know how many x86 instructions there are, but someone counted 678, meaning there are over 200 instructions that do not occur even once in all the code in my /usr/bin.

deckard1 · on July 18, 2022

That linked stackoverflow was a bit silly and misleading. They admitted they were too lazy to count the instructions given in Intel's manual.

Of course we know how many instructions there are. Assemblers (e.g. NASM) have to know what instructions exist if Intel or AMD want people to use them. There are undocumented and dead vendor instructions that nobody cares about (3DNow, Cyrix maybe?) But we definitely know all officially available instructions because they are all in Intel's manual.

innocenat · on July 18, 2022

The main problem is what is a definition of an "instruction"?

For example, let's look at the MOV instruction. How many instructions do you count it as?

The Intel manual lists 2 variants of MOV using the same mnemonic: one for general registers, and another for control registers.

But if we also consider GNU AT&T syntax, then we have movb, movw, movl, movq, and movs, depending on operand size. Do you count that as 5 or 1? To make matters more complex, movd and movq in GNU syntax can also be used for MMX/SSE registers, while Intel lists MOVD and MOVQ as separate instructions for MMX/SSE (but they also list MOVD and MOVQ in the same section). How many do we count?

If you think all of this is silly and a move is a move, do you also count MOVDQA and MOVDQU as the same (MOVDQA works only with aligned 128-bit data, while MOVDQU also works with unaligned data)? How about the VEX-prefixed versions, VMOVDQA and VMOVDQU, which work exactly the same but have no penalty when used in a stream of VEX-prefixed instructions?

If we count by actual machine code, then there are like 10+ variations of the MOV instruction depending on the operands.

(And I still probably have forgotten about a few more MOVxxx instructions)

The linked stackoverflow is a little bit silly, yes, but we actually cannot count how many instructions there are in x86/amd64.
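
To illustrate the machine-code view, here is a sketch that tabulates the base opcodes that all assemble from the single mnemonic MOV. The table layout and names are made up for this example; the opcode and operand-form strings follow the notation of the vendor manuals, and prefixes (operand size, REX) multiply the variants further, so the count it produces is one possible answer rather than the answer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *opcode;  /* base opcode, manual-style notation */
    const char *form;    /* operand form, Intel syntax */
} mov_encoding;

static const mov_encoding MOV_ENCODINGS[] = {
    {"88 /r",  "MOV r/m8, r8"},
    {"89 /r",  "MOV r/m16/32/64, r16/32/64"},
    {"8A /r",  "MOV r8, r/m8"},
    {"8B /r",  "MOV r16/32/64, r/m16/32/64"},
    {"8C /r",  "MOV r/m16, Sreg"},
    {"8E /r",  "MOV Sreg, r/m16"},
    {"A0",     "MOV AL, moffs8"},
    {"A1",     "MOV AX/EAX/RAX, moffs"},
    {"A2",     "MOV moffs8, AL"},
    {"A3",     "MOV moffs, AX/EAX/RAX"},
    {"B0+rb",  "MOV r8, imm8"},
    {"B8+rd",  "MOV r16/32/64, imm"},
    {"C6 /0",  "MOV r/m8, imm8"},
    {"C7 /0",  "MOV r/m16/32/64, imm"},
    {"0F 20",  "MOV r, CRn"},
    {"0F 22",  "MOV CRn, r"},
    {"0F 21",  "MOV r, DRn"},
    {"0F 23",  "MOV DRn, r"},
};

/* Number of base opcodes in the table: one way of counting "MOV". */
size_t mov_encoding_count(void)
{
    return sizeof MOV_ENCODINGS / sizeof MOV_ENCODINGS[0];
}

const mov_encoding *mov_encoding_at(size_t i)
{
    assert(i < mov_encoding_count());
    return &MOV_ENCODINGS[i];
}
```

Counted by mnemonic this is 1 instruction; counted by base opcode it is every row of the table; counted by AT&T suffix or by prefix combinations it is something else again.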

plopfill · on July 18, 2022

Indeed. This page has some more discussion of the complexities: <https://fgiesen.wordpress.com/2016/08/25/how-many-x86-instru...>

deckard1 · on July 18, 2022

> But if we also consider GNU AT&T syntax

Why, though? The official x86 syntax is Intel syntax. In any case, it doesn't matter. Every possible encoding of each mnemonic is given in Intel's manual. This whole business about "what is an instruction anyway" is just sophistry. Anyone using gas would hopefully know how the AT&T syntax is mapping to Intel encodings.

> but we actually cannot count how many instructions there are in x86/amd64.

Again, you can. It's quite simple. Define an instruction however you want, and then go and look at the manual. It's a finite known set. Intel isn't hiding this info.

innocenat · on July 18, 2022

Maybe I was not clear. The problem is that no one is going to agree to the same definition.

I have already listed several points where disagreements may occur, and that is just the surface of the complexity.

jsjohnst · on July 18, 2022

> but we actually cannot count how many instructions there are in x86/amd64

Sorry for being pedantic, but we can, if everyone agrees on what is actually being counted, aka what the definition of an “instruction” is.

You already made the case supporting that POV yourself, then oddly flipped back to implying there was something preventing it besides that at the end. We definitely “can” agree on a definition of an “instruction”, but yes, we probably won’t.

adrian_b · on July 18, 2022

You are right that when using a clear definition of what distinct instructions are it is always possible to count how many instructions an ISA contains.

The only problem is that many people are careless when counting so they do not apply consistent rules, thus obtaining different results.

The most widely used traditional rule is that 2 instructions are not distinct when they perform the same operation on operands of a specified data type, so that the only difference between them is where the operands are.

The operands may be in various kinds of registers, in the instruction stream (immediate operands) or in data memory, and their addresses may be computed in various ways, but the operation performed by the instruction is the same.

The distinct kinds of instructions obtained by this definition can be further grouped in a smaller number of generic instruction types, which perform the same kind of operation, for example a multiplication or a comparison, but on different data types, e.g. on 8-bit/16-bit/32-bit/64-bit integers, signed or unsigned or signed with saturation or unsigned with saturation, fixed-point numbers or floating-point numbers, polynomials with binary coefficients, bit strings or character strings, short vectors or matrices, and so on.

innocenat · on July 18, 2022

That's a big "if".

We can't even agree on a single assembly syntax, and the Intel vs GNU/AT&T syntax war is almost as bad as vi/emacs.

jsjohnst · on July 18, 2022

Completely agree, but the point in question wasn’t if it was likely, but if it was possible at all.