
Probably: if the uops come from the uop cache you get the fast speed, since the prefix and any decoding stalls have no impact in that case (that mess is effectively erased in the uop cache); but if the loop has to be decoded, you get a stall due to the length-changing prefix.
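For concreteness, the issue with the 16-bit form is the 66h operand-size prefix: it changes the length of the immediate, so the pre-decoder's length computation for the instruction is wrong and has to be redone. These should be the standard encodings (easy to confirm with a disassembler):

    3D E8 03 00 00    ; cmp eax, 1000 - opcode + imm32
    66 3D E8 03       ; cmp ax, 1000  - 66h prefix shrinks the immediate
                      ;   to imm16, making it a length-changing prefix (LCP)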

Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, the microcode version, and even more esoteric things like how many incoming jumps target the nearby region of code (and the order in which the cache observed them).



Yep, a lot of potential contributors. Though my test was of a single plain 8x-unrolled loop doing nothing else, running for tens of thousands of iterations to take ~0.1ms total, i.e. it should trivially fit in the uop cache, and yet there's consistent inconsistency.
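Something like this, NASM-style (label and exact iteration count illustrative):

    mov ecx, 100000        ; tens of thousands of iterations, ~0.1ms total
    top:
        cmp eax, 1000      ; x8 unrolled; swap in "cmp ax, 1000" for the
        cmp eax, 1000      ;   length-changing-prefix variant
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        cmp eax, 1000
        dec ecx
        jne top            ; dec+jne macro-fuse into one uop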

Did some 'perf stat'ting, comparing the same test with "cmp eax,1000" vs "cmp ax,1000"; per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%. So presumably the loop sometimes somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck on MITE? (The test is 10 instructions - 8 copies of the cmp, plus dec+jne that'd get macro-fused, so 90% uops/instr makes sense.)
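For anyone reproducing, an invocation along these lines should give those ratios (event names as exposed on recent Intel cores; the binary name is a placeholder):

    perf stat -e instructions,lsd.uops,idq.mite_uops,idq.dsb_uops ./lcptest

idq.dsb_uops is worth watching too - it counts what actually comes out of the uop cache, which should separate "never got into the DSB" from "got into the DSB but not the LSD".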


Sounds a bit like the jcc erratum?
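(For anyone else: on Skylake-family cores the microcode mitigation for the JCC erratum excludes from the uop cache any 32-byte-aligned chunk where a jump crosses or ends at the chunk boundary, kicking those lines back to the legacy decoders - which would look a lot like "stuck on MITE". A sketch of how to rule it out in a hand-written test, NASM syntax, padding amount per your disassembly:

    align 32
    top:
        ; ...loop body as above...
        ; pad with NOPs here if a disassembly shows the fused dec+jne
        ; crossing or ending at a 32-byte boundary
        dec ecx
        jne top

For assembler-generated code, recent GAS (binutils >= 2.34) can handle this automatically via -mbranches-within-32B-boundaries.)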



