A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp eax, 0x87654321" only in the presence of the data16 prefix (and, consequently, the width of the immediate: 2 bytes instead of 4). It's the only significant case (I think?) of a prefix changing the total instruction size, and so, for some such instructions, the 16-bit version is sometimes (but not always!) significantly slower. "But not always" as in: if you microbenchmark a loop of such, sometimes you can have entire microseconds of it consistently running at 0.25 cycles/instr avg, and sometimes that same exact code (in the same process!) will measure at 3 cycles/instr (tested on Haswell, but uops.info indicates this happens on all non-Atom Intel since Ivy Bridge).
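To make the encoding difference concrete, here are the two instructions hand-assembled per the Intel SDM (CMP AX, imm16 / CMP EAX, imm32); a quick Python sketch to compare them:

```python
# Hand-assembled per Intel SDM vol. 2:
cmp_eax = bytes([0x3D, 0x21, 0x43, 0x65, 0x87])   # cmp eax, 0x87654321  (3D + imm32)
cmp_ax  = bytes([0x66, 0x3D, 0x31, 0x42])         # cmp ax,  0x4231     (66 + 3D + imm16)

# Same opcode byte (3D); the 66h data16 prefix shrinks the immediate
# from 4 bytes to 2, changing the total instruction length: hence a
# "length-changing prefix" (LCP).
assert cmp_ax[0] == 0x66 and cmp_ax[1] == cmp_eax[0] == 0x3D
assert len(cmp_eax) == 5 and len(cmp_ax) == 4
```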
Probably, if the uops come from the uop cache you get the fast speed, since the prefix and any decode stalls have no impact there (that mess is effectively erased in the uop cache); but if the code has to go through the legacy decoders, you get a stall from the length-changing prefix.
Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, the microcode version, and even more esoteric things like how many incoming jumps target the nearby region of code (and the order in which they were observed by the cache).
Yep, a lot of potential contributors. Though, my test was a single plain 8x-unrolled loop doing nothing else, running for tens of thousands of iterations to take a total of ~0.1ms, i.e. it should trivially fit in the uop cache, and yet there's consistent inconsistency.
Did some 'perf stat'ting, comparing the same test with "cmp eax,1000" vs "cmp ax,1000"; per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%. So presumably sometimes the loop somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck on MITE? (The test is 10 instructions: 8 copies of the cmp, plus a dec+jne that'd get macro-fused, so 90% uops/instr makes sense.)
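For reference, a hand-assembled reconstruction of what such a loop could look like (64-bit mode; this is my sketch of the described instruction mix, not the exact test code):

```python
CMP_AX_1000 = bytes([0x66, 0x3D, 0xE8, 0x03])   # cmp ax, 1000  (data16 + 3D + imm16)
DEC_ECX     = bytes([0xFF, 0xC9])               # dec ecx       (FF /1)

body = CMP_AX_1000 * 8 + DEC_ECX                # 8 cmps + loop counter decrement
rel8 = -(len(body) + 2)                         # displacement back to the loop top
loop = body + bytes([0x75, rel8 & 0xFF])        # jne rel8

# 10 instructions in 36 bytes; the "cmp eax,1000" variant would be 44 bytes
# (five-byte cmps, no prefix), decoding without LCP stalls.
assert len(loop) == 36
```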
> The following alignment situations can cause LCP stalls to trigger twice:
> · An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
> · An instruction [that] starts at offset 13 of a fetch line references a memory location using register and immediate byte offset addressing mode.
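For reference, the fetch-line arithmetic behind those two cases, with 16 bytes as the predecode fetch granularity (the helper functions themselves are mine, purely illustrative):

```python
FETCH_LINE = 16  # predecode fetch granularity on these cores

def fetch_offset(addr: int) -> int:
    """Offset of an instruction's first byte within its fetch line."""
    return addr % FETCH_LINE

def crosses_fetch_line(addr: int, length: int) -> bool:
    """True if the instruction's bytes span two fetch lines."""
    return addr // FETCH_LINE != (addr + length - 1) // FETCH_LINE

# A 4-byte LCP instruction starting at offset 13 spills into the next
# fetch line, so later bytes (e.g. SIB or ModR/M) arrive a fetch later:
assert fetch_offset(13) == 13 and crosses_fetch_line(13, 4)
assert not crosses_fetch_line(0, 4)
```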
So that's the order of funkiness to be expected, fun.
> False LCP stalls occur when (a) instructions with LCP that are encoded using the F7 opcodes, and (b) are located at offset 14 of a fetch line. These instructions are: not, neg, div, idiv, mul, and imul. False LCP experiences delay because the instruction length decoder can not determine the length of the instruction before the next fetch line, which holds the exact opcode of the instruction in its MODR/M byte.
The "true" LCP stall among the F7 opcodes would be "test r16,imm16", but because the opcode information is split across the initial byte and the ModR/M reg field, the other F7 instructions suffer too.
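To make that concrete, here are hand-assembled encodings from the F7 /r group (per the Intel SDM) showing that the operations share the opcode byte and differ only in the ModR/M reg field, yet only test carries an imm16:

```python
TEST_AX = bytes([0x66, 0xF7, 0xC0, 0xE8, 0x03])  # test ax, 1000  (F7 /0, imm16 follows)
NEG_AX  = bytes([0x66, 0xF7, 0xD8])              # neg ax         (F7 /3, no immediate)

def modrm_reg(modrm: int) -> int:
    """The reg field (bits 5:3) of a ModR/M byte selects the F7 operation."""
    return (modrm >> 3) & 7

# Both start 66 F7; the length decoder can't tell whether an imm16 follows
# until it sees the ModR/M byte, which may sit on the next fetch line.
assert TEST_AX[:2] == NEG_AX[:2]
assert modrm_reg(TEST_AX[2]) == 0 and modrm_reg(NEG_AX[2]) == 3
```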