
Modern out-of-order CPUs (like the M1) can't see branches until far too late.

The M1's frontend is at least 24 instructions past the unconditional branch before the earliest possible moment it can even see it.

So the branch predictor isn't just responsible for predicting which way conditional branches go. It must remember where all branches are, and their targets, so that the front end can follow them with zero cycle delay. This means all branches, including call and return instructions.

Which means that unconditional immediate branches cost about the same as a correctly predicted conditional branch.
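To make the "remember where all branches are, and their targets" part concrete, here is a minimal sketch of a direct-mapped branch target buffer. Everything in it is an assumption for illustration: the names (`btb_next_fetch`, `btb_update`), the 256-entry size, and the fixed 4-byte instruction width are hypothetical, and a real predictor (the M1's included) is far more elaborate than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { BTB_ENTRIES = 256 };  /* hypothetical size, chosen for the example */

struct btb_entry {
    bool     valid;
    uint64_t branch_pc;  /* tag: address of a branch seen earlier */
    uint64_t target;     /* where that branch went last time */
};

struct btb {
    struct btb_entry e[BTB_ENTRIES];
};

/* Assumes fixed 4-byte instructions, so the low two address bits carry no information. */
static unsigned btb_index(uint64_t pc) {
    assert(pc % 4 == 0);
    return (unsigned)((pc >> 2) % BTB_ENTRIES);
}

/* The frontend's question: "after fetching at pc, where do I fetch next?"
   It is answered from the table alone, without decoding the instruction. */
uint64_t btb_next_fetch(const struct btb *b, uint64_t pc) {
    const struct btb_entry *e = &b->e[btb_index(pc)];
    if (e->valid && e->branch_pc == pc)
        return e->target;   /* known branch: redirect immediately */
    return pc + 4;          /* no known branch here: keep fetching sequentially */
}

/* Called once a branch has actually been seen, so the next fetch of pc is redirected. */
void btb_update(struct btb *b, uint64_t pc, uint64_t target) {
    struct btb_entry *e = &b->e[btb_index(pc)];
    e->valid = true;
    e->branch_pc = pc;
    e->target = target;
}
```

In this toy model an unconditional branch that is already in the table costs nothing extra, and one that is missing just falls through to `pc + 4` until something later corrects it.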

But that's not actually why the fetch has been moved.

The other thing to note is that the frontend and backend of a modern CPU are completely disconnected. The frontend doesn't even try to get the correct address of an indirect jump from the backend. It always uses the branch predictor to predict the indirect branch.

And by inlining, each VM instruction has its own indirect jump, which means it gets a different slot in the branch predictor, allowing for better predictions.

At least that's the theory behind threaded code. I'm unsure how much of this speedup is coming from eliminating the extra unconditional immediate branch and how much is from better prediction of indirect branches.
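For anyone who hasn't seen the two dispatch styles side by side, here is a rough sketch. The four-opcode stack VM is made up for the example (it is not the interpreter from the article), there are no stack bounds checks, and the threaded version relies on the GNU C "labels as values" extension, which GCC and Clang support but ISO C does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy bytecode, invented for this example. */
enum { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Central dispatch: every handler returns to the same switch, so there is
   typically one shared indirect jump plus a jump back to the top of the loop. */
long run_switch(const unsigned char *code) {
    long stack[64];
    size_t sp = 0;
    for (;;) {
        unsigned char op = *code++;
        assert(op < OP_COUNT);
        switch (op) {
        case OP_PUSH1:  stack[sp++] = 1; break;
        case OP_ADD:    sp--; stack[sp - 1] += stack[sp]; break;
        case OP_DOUBLE: stack[sp - 1] *= 2; break;
        case OP_HALT:   return stack[sp - 1];
        }
    }
}

/* Threaded dispatch: the fetch and indirect jump are repeated at the end of
   every handler, so each VM instruction has its own jump site. */
long run_threaded(const unsigned char *code) {
    /* GNU C extension: &&label takes the address of a label. */
    static void *const handlers[OP_COUNT] = {
        [OP_PUSH1]  = &&do_push1,
        [OP_ADD]    = &&do_add,
        [OP_DOUBLE] = &&do_double,
        [OP_HALT]   = &&do_halt,
    };
    long stack[64];
    size_t sp = 0;

#define DISPATCH() do { assert(*code < OP_COUNT); goto *handlers[*code++]; } while (0)

    DISPATCH();
do_push1:  stack[sp++] = 1;                   DISPATCH();
do_add:    sp--; stack[sp - 1] += stack[sp];  DISPATCH();
do_double: stack[sp - 1] *= 2;                DISPATCH();
do_halt:   return stack[sp - 1];
#undef DISPATCH
}
```

Both functions compute the same result for the same bytecode. Whether the compiler really keeps the jumps separate in the threaded version depends on the compiler and flags, so it's worth checking the generated assembly before drawing conclusions.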




Side note: Intel CPUs since Skylake and also recent AMD CPUs (since Zen 3 or so?) store a history for indirect branches. On such processors, using threaded jumps does not really improve performance anymore (I've even seen 1-2% slowdowns on some cores).
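A toy simulation of why history makes the difference, under heavy assumptions: the three "predictors" below are deliberately simplistic models I made up for illustration (a single last-target entry, one last-target entry per handler, and a table indexed by the last two targets). They are not how any real Intel, AMD, or Apple predictor is built.

```c
#include <assert.h>
#include <stddef.h>

enum { N_OPS = 3, NO_TARGET = -1 };

/* Correct predictions made by each toy predictor. */
struct hits {
    int shared_last;     /* one dispatch jump, predicts "same target as last time" */
    int threaded_last;   /* one jump per handler, each predicts its own last target */
    int shared_history;  /* one dispatch jump, indexed by the last two targets */
};

/* trace[i] is the opcode executed at step i; each dispatch jumps to trace[i]
   from the end of the handler for trace[i - 1]. */
struct hits simulate(const int *trace, size_t n) {
    int shared = NO_TARGET;
    int per_site[N_OPS];
    int by_history[N_OPS][N_OPS];
    struct hits h = {0, 0, 0};

    for (int a = 0; a < N_OPS; a++) {
        per_site[a] = NO_TARGET;
        for (int b = 0; b < N_OPS; b++)
            by_history[a][b] = NO_TARGET;
    }

    /* Start at 2 so all three predictors are scored on the same jumps. */
    for (size_t i = 2; i < n; i++) {
        int prev2 = trace[i - 2], prev = trace[i - 1], next = trace[i];
        assert(prev2 >= 0 && prev2 < N_OPS);
        assert(prev >= 0 && prev < N_OPS);
        assert(next >= 0 && next < N_OPS);

        if (shared == next) h.shared_last++;
        shared = next;

        if (per_site[prev] == next) h.threaded_last++;
        per_site[prev] = next;

        if (by_history[prev2][prev] == next) h.shared_history++;
        by_history[prev2][prev] = next;
    }
    return h;
}
```

On a repeating opcode pattern like A B A C, the shared last-target entry never hits, the per-handler entries get the jumps out of B and C right but not the ones out of A, and the history-indexed table gets everything right after warm-up even though there is only one jump site. That matches the claim above: once the hardware tracks history, giving each handler its own jump stops buying much.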


Pretty sure it's Haswell and Zen 2. They both implement IT-TAGE based branch predictors.

I just assumed the M1 branch predictor would also be in the same class, but I guess not. In another comment (https://news.ycombinator.com/item?id=40952404), I did some tests to confirm that it was actually the threaded jumps responsible for the speedup.

I'm tempted to dig deeper and see what the M1's branch predictor can and can't do.


It's too late to edit my comment above.

Turns out that the M1 can track the history of indirect branches just fine, but it takes 3 cycles to deliver a correct prediction. With threaded jumps, the M1 gets a slightly higher hit rate for the initial 1-cycle prediction.

https://news.ycombinator.com/item?id=40953764



