It's not unclear in the slightest. If the value is dropped before the callee returns, the callee drops it. If the value is dropped after the callee returns, the caller drops it. The claim that returning by pointer is better for a caller that's going to drop the value later makes no sense. If you think there's an actual problem, point it out and be specific.
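To make the drop-placement point concrete, here's a minimal sketch (the `Payload` type and the names are mine, purely for illustration):

    struct Payload(Vec<u8>);

    impl Drop for Payload {
        fn drop(&mut self) {
            // side effect so the drop site is observable
            println!("dropping payload of len {}", self.0.len());
        }
    }

    // Callee takes ownership and never gives it back: the drop happens
    // inside the callee, before it returns.
    fn consume(p: Payload) {
        println!("using {} bytes", p.0.len());
        // `p` is dropped here
    }

    // Callee hands ownership back: the caller drops the value whenever
    // its own binding goes out of scope.
    fn produce() -> Payload {
        Payload(vec![0u8; 16])
    }

    fn main() {
        consume(produce());   // dropped inside `consume`
        let kept = produce(); // dropped at the end of `main`
        println!("kept {} bytes", kept.0.len());
    }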

The problem isn't the boxing. The problem is that the boxing triggers a needlessly inefficient calling convention, and returning boxed values is central to Rust. The language implementation as it is today uses the machine inefficiently. Rust can make up its own ABI. Returning via pointer when the architecture has tons of registers available is just bad ABI no matter what excuses people use to justify bad performance.
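To be concrete about what "returning via pointer" means here, a rough sketch, using `extern "C"` so the x86_64 SysV rules apply (rustc's own ABI is unspecified but behaves similarly in practice; the struct names are mine):

    #[repr(C)]
    pub struct Small { pub a: u64, pub b: u64 }             // 16 bytes: SysV returns this in RAX:RDX

    #[repr(C)]
    pub struct Big { pub a: u64, pub b: u64, pub c: u64 }   // 24 bytes: SysV returns this via a hidden pointer (sret)

    #[no_mangle]
    pub extern "C" fn make_small() -> Small {
        Small { a: 1, b: 2 }
    }

    #[no_mangle]
    pub extern "C" fn make_big() -> Big {
        // the caller passes a pointer to the return slot; the callee stores
        // the fields through it instead of handing them back in registers
        Big { a: 1, b: 2, c: 3 }
    }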




My mental model says the cost of the two may be about the same, with no clear winner across many benchmarks. I've even recently seen a claim that returning in two registers measured slower than a stack-memory return when tested in another language, so it's now on my TODO list to benchmark it (I work on another language runtime). But I don't actually expect to be able to measure a difference; I expect the latency often simply disappears into the execution of the `ret` instruction. Plus, for a non-trivial function, returning via registers may mean materializing an extra load at the end, whereas the return slot could have been filled much earlier and kept a tiny bit more of the stack hot.


Sure, that part of the stack is going to be in L1 cache, and there's not a huge difference between that and the register file. But think about the code size cost too. The icache hit won't show up in a microbenchmark, but when you have 100,000 of these functions, the difference will be real.

> I’ve even seen recently a claim that using two registers was measured to be slower than stack memory return

I find that difficult to believe. Those output registers are undefined across the call anyway. I get that store-to-load forwarding may hide the latency of the stack access, but equivalent bypassing applies to register results too. At best, the stack pattern runs at the same speed as register return, but with a bigger code footprint. How on earth could a register return be worse? If it really were worse, we'd use a pointer return even for two-word PODs, and we do actually use a pair of registers on x86_64 to return a pair of words.


I agree it’s hard to believe it could be faster. But there must be _some_ cutoff past which returning in more registers gets worse; it’s just a question of where, and more-than-one usually being worse surprises me too. I suspect it’s unlikely anyone actually benchmarked it when making that decision, and they just went with the “obviously it must be” reaction. On reflection, I can see a few reasons the stack pattern might sometimes be equivalent or better for multi-value returns. To your point though, curiously the Win64 ABI uses only one return register (unlike the SysV ABI, which allocates up to two integer registers as you say).

I’m guessing their microbenchmark may have been a case where, if the compiler has to spill the value early anyway, it’s better to spill directly through the sret pointer than to emit extra code to reload it into registers at the end.
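Something like the following is what I imagine that case looks like (`Stats` and the function names are just for illustration):

    pub struct Stats {
        pub min: u64,
        pub max: u64,
    }

    // By-value return: the whole result has to be live at the `ret`,
    // handed back in registers or via a hidden pointer, depending on the ABI.
    pub fn stats_by_value(xs: &[u64]) -> Stats {
        let mut min = u64::MAX;
        let mut max = 0;
        for &x in xs {
            min = min.min(x);
            max = max.max(x);
        }
        Stats { min, max }
    }

    // Out-pointer style: the callee can store each field as soon as it is
    // computed, which is roughly what an sret convention lets the compiler do.
    pub fn stats_into(xs: &[u64], out: &mut Stats) {
        out.min = u64::MAX;
        out.max = 0;
        for &x in xs {
            out.min = out.min.min(x);
            out.max = out.max.max(x);
        }
    }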

The bigger reason I usually see that none of this matters is that if you’ve missed the inlining opportunity on something that small, you’re already so far away from optimal {code size, performance, memory usage} that optimizing the shape of this code (and the ABI) is premature.
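In other words, for something this small it’s the missing cross-crate inlining that matters, not how the struct travels back (again, the names here are just illustrative):

    #[derive(Clone, Copy)]
    pub struct Point { pub x: f64, pub y: f64 }

    // Once this is inlinable across crates, there is no call left at the use
    // site, so the question of registers vs. return slot mostly disappears.
    #[inline]
    pub fn midpoint(a: Point, b: Point) -> Point {
        Point { x: (a.x + b.x) * 0.5, y: (a.y + b.y) * 0.5 }
    }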



