Memory Renaming: Fast and Accurate Processing of Memory Communication (1999) [pdf]

eigenform · on Feb 16, 2024

I tried to characterize this on Zen 2 a little while ago[^1], although I don't think that implementation has anything to do with unused physical registers: it ultimately relies on the idea that the store queue (48 entries on Zen 2) is an extra set of storage locations. During renaming, you try to look up a store queue entry with some of the operands (i.e. a register and some of the displacement bits).

It seems like only the youngest six stores are ever eligible for renaming on Zen 2. I'd love to know how this has changed on Zen 3/4, but I don't have any newer machines to play with.
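To make the lookup concrete, here is a toy model of the idea, not a description of the actual Zen 2 hardware. The class name, the `disp_bits` width, and the rule that overwriting a base register drops matching entries are all assumptions made for illustration; only the "youngest six stores" window comes from the observation above.

```python
from collections import deque


class StoreQueueRenamer:
    """Toy rename-time predictor: match a load to a recent store by
    (base register, low displacement bits). Hypothetical model only."""

    def __init__(self, window=6, disp_bits=8):
        # window=6 mirrors the observation above; disp_bits is an assumption.
        self.window = window
        self.disp_mask = (1 << disp_bits) - 1
        self.recent = deque(maxlen=window)  # only the youngest stores stay eligible
        self.next_id = 0

    def _tag(self, base_reg, disp):
        # Addresses are not known at rename time, so the tag uses operands only.
        return (base_reg, disp & self.disp_mask)

    def rename_store(self, base_reg, disp):
        """Record a store and return its (toy) store queue id."""
        sid = self.next_id
        self.next_id += 1
        self.recent.append((self._tag(base_reg, disp), sid))
        return sid

    def write_reg(self, reg):
        """Assumption: a write to the base register makes old tags stale."""
        kept = (e for e in self.recent if e[0][0] != reg)
        self.recent = deque(kept, maxlen=self.window)

    def rename_load(self, base_reg, disp):
        """Return the id of the youngest matching store, or None.
        A real design would still verify the prediction once addresses resolve."""
        tag = self._tag(base_reg, disp)
        for t, sid in reversed(self.recent):
            if t == tag:
                return sid
        return None
```

In this model a load from `[rsp+8]` right after a store to `[rsp+8]` gets paired with that store, while the same load issued after six unrelated stores finds nothing, because the matching entry has aged out of the window.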

[^1]: https://github.com/eigenform/perfect/blob/main/src/bin/stlf....

KMag · on Feb 16, 2024

The basic idea here is to use unused physical registers in a register-renaming processor as a level-0 cache. I imagine it is much easier to get good performance out of this technique on architectures with weaker memory models than x86.

convolvatron · on Feb 16, 2024

isn't it a little more? I admit to just skimming and having a pretty incomplete grasp. but rather than implementing a spatio-temporal cache (lines over usage), this is explicitly trying to match producers and consumers of individual words.

so yes, like a cache, but it approached naming of values in a slightly different way.

two things bother me here. one is that we have the space for more registers, but we can't afford to pay for them in the instruction encoding. the other of course is the sad truth that hardware people need to innovate under the compiler instead of in cooperation with it.

on the first point, it seems like there may be other approaches to solving the naming issue. register windows was one. but maybe we can wire up producers and consumers more explicitly in the ISA?

fweimer · on Feb 17, 2024

There is Clockhands (and STRAIGHT before that): https://dl.acm.org/doi/fullHtml/10.1145/3613424.3614272

Previous discussion: https://news.ycombinator.com/item?id=38581719

I think such things make sense only as a means for more compact instruction encoding. Eventually, you will get bigger chips that will implement some form of register renaming anyway, to compensate for bottlenecks in the ISA.

_a_a_a_ · on Feb 17, 2024

Not my area but...

> the other of course is the sad truth that hardware people need to innovate under the compiler instead of in cooperation with it

Can you expand a little on that please.

> register windows was one. but maybe we can wire up producers and consumers more explicitly in the ISA?

Register windows, at least in SPARC, were a bugger because they substantially block OoO execution, but I'd be curious about what your 'wiring up of producers and consumers in the ISA' would look like, can you elaborate?

convolvatron · on Feb 17, 2024

Just that compilers and hardware people both build to the ISA. It’s a fixed point in the design space that lets them both iterate independently. But as such, it makes it hard to refactor. A couple times recently I’ve seen projects propose exposing microcode, and while that has trade-offs I guess, that’s one way to provide for a somewhat easier back and forth. It’s sort of an industry-wide Conway’s law.

On the second point - you can think of registers as edges in a dataflow graph (people do). If we run out of registers (names really) then we start writing to memory and we obscure the producer-consumer relationship because of aliasing and we can’t do our register renaming (which is where this paper comes in). Imagine we had a weird instruction encoding that let us specify the graph through direct relations instead of the names.
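A minimal sketch of what "relations instead of names" could look like: each source operand is rewritten as the distance back to the instruction that produced it, so no register name appears in the encoding. The tuple format and the function name `to_distance_form` are made up for illustration, and the sketch only handles straight-line code with no control flow. Distance-based operand schemes of roughly this flavor exist in the research literature, but this is not a faithful model of any of them.

```python
def to_distance_form(program):
    """Rewrite named operands as distances to their producers.

    program: list of (dest, op, srcs) tuples using register names.
    Returns a list of (op, distances), where each distance counts how many
    instructions back the producing instruction is.
    """
    last_writer = {}  # register name -> index of its most recent producer
    out = []
    for i, (dest, op, srcs) in enumerate(program):
        dists = []
        for s in srcs:
            if s not in last_writer:
                raise ValueError(f"{s!r} is read before any instruction writes it")
            dists.append(i - last_writer[s])
        out.append((op, tuple(dists)))
        # The name is only bookkeeping here; it never reaches the output.
        last_writer[dest] = i
    return out


prog = [
    ("a", "li", ()),
    ("b", "li", ()),
    ("c", "add", ("a", "b")),
    ("d", "mul", ("c", "a")),
]
encoded = to_distance_form(prog)
# encoded == [("li", ()), ("li", ()), ("add", (2, 1)), ("mul", (1, 3))]
```

The cost shows up quickly: a value consumed far from its producer needs a large distance field or an explicit copy to bring it closer, which is the same encoding pressure in a different form.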

I guess another idea would be to have hierarchical names (register pages). It’s interesting to think about since we’ve been stuck on this point for a while in isa design.

The only reason I brought up register windows is not that I’m a huge fan of stack-based evaluation, or that they were a good idea, but that they were one identifiable way we did explore how to map that state more implicitly and keep a tight encoding.

I guess another potentially fruitful and related area here is to think about how we manage the cache. I think there is a very defensible argument that these should be scratchpads (again moving the ISA boundary up into the compiler). This makes the naming of those memory locations more explicit.

_a_a_a_ · on Feb 17, 2024

Interesting, thanks. A quick note, SPARC designers apparently did not talk to the compiler writers so didn't understand how much inlining could do, and implemented register windows as the wrong solution. Conversely I understand the Alpha AXP ISA was designed with the hardware and compiler writers pretty much in the same room, and it's beautiful and fast.

If I understand you correctly, I think the solution is to have ISA-independent low-level code (ANDF and C-- were/are examples).

I think register pages might bring you right back into the same issue as register windows: blocking of OoO execution.

Given the extraordinary cost of cache I think the idea of scratchpad memory is very promising, but I don't know how it would best be exposed to the compiler.

Tuna-Fish · on Feb 16, 2024

I believe AMD has implemented this since Zen 2.