
Probably: if the uops come from the uop cache you get the fast speed, since the prefix and any decoding stalls have no impact in that case (that mess is effectively erased in the uop cache); but if the loop has to be decoded, you get a stall due to the length-changing prefix.
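For concreteness, the issue with the 16-bit form is the 66h operand-size prefix: it changes the length of the immediate, so the pre-decoder's length computation for the instruction is wrong and has to be redone. These should be the standard encodings (easy to confirm with a disassembler):

    3D E8 03 00 00    ; cmp eax, 1000 - opcode + imm32
    66 3D E8 03       ; cmp ax, 1000  - 66h prefix shrinks the immediate
                      ;   to imm16, making it a length-changing prefix (LCP)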

Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, the microcode version, and even more esoteric things like how many incoming jumps target the nearby region of code (and the order in which the cache observed them).



Yep, a lot of potential contributors. Though my test was of a single plain 8x-unrolled loop doing nothing else, running for tens of thousands of iterations to take ~0.1ms total, i.e. it should trivially fit in the uop cache, and yet there's consistent inconsistency.
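Something like this, NASM-style (label and exact iteration count illustrative):

    mov ecx, 100000        ; tens of thousands of iterations, ~0.1ms total
    top:
        cmp eax, 1000      ; x8 unrolled; swap in "cmp ax, 1000" for the
        cmp eax, 1000      ;   length-changing-prefix variant
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        dec ecx
        jne top            ; dec+jne macro-fuse into one uop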

Did some 'perf stat'ting, comparing the same test with "cmp eax,1000" vs "cmp ax,1000"; per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%. So presumably the loop sometimes somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck on MITE? (The test is 10 instructions - 8 copies of the cmp, plus dec+jne that'd get macro-fused, so 90% uops/instr makes sense.)
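For anyone reproducing, an invocation along these lines should give those ratios (event names as exposed on recent Intel cores; the binary name is a placeholder):

    perf stat -e instructions,lsd.uops,idq.mite_uops,idq.dsb_uops ./lcptest

idq.dsb_uops is worth watching too - it counts what actually comes out of the uop cache, which should separate "never got into the DSB" from "got into the DSB but not the LSD".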


Sounds a bit like the jcc erratum?
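(For anyone else: on Skylake-family cores the microcode mitigation for the JCC erratum excludes from the uop cache any 32-byte-aligned chunk where a jump crosses or ends at the chunk boundary, kicking those lines back to the legacy decoders - which would look a lot like "stuck on MITE". A sketch of how to rule it out in a hand-written test, NASM syntax, padding amount per your disassembly:

    align 32
    top:
        ; ...loop body as above...
        ; pad with NOPs here if a disassembly shows the fused dec+jne
        ; crossing or ending at a 32-byte boundary
        dec ecx
        jne top

For assembler-generated code, recent GAS (binutils >= 2.34) can handle this automatically via -mbranches-within-32B-boundaries.)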



