A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp eax, 0x87654321" only in the presence of the data16 prefix (and, consequently, the width of the immediate: 2 bytes instead of 4). It's the only significant case (I think?) of a prefix changing the total instruction size, and so, for some such instructions, the 16-bit version is sometimes (but not always!) significantly slower. "But not always" as in: if you microbenchmark a loop of such, sometimes you can have entire microseconds of it consistently running at 0.25 cycles/instr avg, and sometimes that same exact code (in the same process!) will measure at 3 cycles/instr (tested on Haswell, but uops.info indicates this happens on all non-Atom Intel since Ivy Bridge).
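To make the encoding difference concrete, here are the two instructions hand-assembled per the Intel SDM (CMP AX, imm16 / CMP EAX, imm32); a quick Python sketch to compare them:

```python
# Hand-assembled per Intel SDM vol. 2:
cmp_eax = bytes([0x3D, 0x21, 0x43, 0x65, 0x87])   # cmp eax, 0x87654321  (3D + imm32)
cmp_ax  = bytes([0x66, 0x3D, 0x31, 0x42])         # cmp ax,  0x4231     (66 + 3D + imm16)

# Same opcode byte (3D); the 66h data16 prefix shrinks the immediate
# from 4 bytes to 2, changing the total instruction length: hence a
# "length-changing prefix" (LCP).
assert cmp_ax[0] == 0x66 and cmp_ax[1] == cmp_eax[0] == 0x3D
assert len(cmp_eax) == 5 and len(cmp_ax) == 4
```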
Probably, if the uops come from the uop cache you get the fast speed, since the prefix and any decode stalls have no impact there (that mess is effectively erased in the uop cache); but if the code has to go through the legacy decoders, you get a stall from the length-changing prefix.
Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, the microcode version, and even more esoteric things like how many incoming jumps target the nearby region of code (and the order in which they were observed by the cache).
Yep, a lot of potential contributors. Though, my test was a single plain 8x-unrolled loop doing nothing else, running for tens of thousands of iterations to take a total of ~0.1ms, i.e. it should trivially fit in the uop cache, and yet there's consistent inconsistency.
Did some 'perf stat'ting, comparing the same test with "cmp eax,1000" vs "cmp ax,1000"; per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%. So presumably sometimes the loop somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck on MITE? (The test is 10 instructions: 8 copies of the cmp, plus a dec+jne that'd get macro-fused, so 90% uops/instr makes sense.)
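For reference, a hand-assembled reconstruction of what such a loop could look like (64-bit mode; this is my sketch of the described instruction mix, not the exact test code):

```python
CMP_AX_1000 = bytes([0x66, 0x3D, 0xE8, 0x03])   # cmp ax, 1000  (data16 + 3D + imm16)
DEC_ECX     = bytes([0xFF, 0xC9])               # dec ecx       (FF /1)

body = CMP_AX_1000 * 8 + DEC_ECX                # 8 cmps + loop counter decrement
rel8 = -(len(body) + 2)                         # displacement back to the loop top
loop = body + bytes([0x75, rel8 & 0xFF])        # jne rel8

# 10 instructions in 36 bytes; the "cmp eax,1000" variant would be 44 bytes
# (five-byte cmps, no prefix), decoding without LCP stalls.
assert len(loop) == 36
```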
> The following alignment situations can cause LCP stalls to trigger twice:
> · An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
> · An instruction [that] starts at offset 13 of a fetch line references a memory location using register and immediate byte offset addressing mode.
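For reference, the fetch-line arithmetic behind those two cases, with 16 bytes as the predecode fetch granularity (the helper functions themselves are mine, purely illustrative):

```python
FETCH_LINE = 16  # predecode fetch granularity on these cores

def fetch_offset(addr: int) -> int:
    """Offset of an instruction's first byte within its fetch line."""
    return addr % FETCH_LINE

def crosses_fetch_line(addr: int, length: int) -> bool:
    """True if the instruction's bytes span two fetch lines."""
    return addr // FETCH_LINE != (addr + length - 1) // FETCH_LINE

# A 4-byte LCP instruction starting at offset 13 spills into the next
# fetch line, so later bytes (e.g. SIB or ModR/M) arrive a fetch later:
assert fetch_offset(13) == 13 and crosses_fetch_line(13, 4)
assert not crosses_fetch_line(0, 4)
```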
So that's the order of funkiness to be expected, fun.
> False LCP stalls occur when (a) instructions with LCP that are encoded using the F7 opcodes, and (b) are located at offset 14 of a fetch line. These instructions are: not, neg, div, idiv, mul, and imul. False LCP experiences delay because the instruction length decoder can not determine the length of the instruction before the next fetch line, which holds the exact opcode of the instruction in its MODR/M byte.
The "true" LCP stall among the F7 opcodes would be "test r16,imm16", but because the opcode information is split across the initial byte and the ModR/M reg field, the other F7 instructions suffer too.
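To make that concrete, here are hand-assembled encodings from the F7 /r group (per the Intel SDM) showing that the operations share the opcode byte and differ only in the ModR/M reg field, yet only test carries an imm16:

```python
TEST_AX = bytes([0x66, 0xF7, 0xC0, 0xE8, 0x03])  # test ax, 1000  (F7 /0, imm16 follows)
NEG_AX  = bytes([0x66, 0xF7, 0xD8])              # neg ax         (F7 /3, no immediate)

def modrm_reg(modrm: int) -> int:
    """The reg field (bits 5:3) of a ModR/M byte selects the F7 operation."""
    return (modrm >> 3) & 7

# Both start 66 F7; the length decoder can't tell whether an imm16 follows
# until it sees the ModR/M byte, which may sit on the next fetch line.
assert TEST_AX[:2] == NEG_AX[:2]
assert modrm_reg(TEST_AX[2]) == 0 and modrm_reg(NEG_AX[2]) == 3
```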