
I worked on the A510 (codenamed Klein) at Arm, on the L1 memory system design. It was a fun core to work on, really pushing the bounds of what you can do with an in-order core, and it made me realise the out-of-order/in-order boundary is a lot blurrier than you might think. There's plenty of out-of-order stuff happening in the memory system, for instance.

Whilst there's plenty of reuse from the A55, the memory system was sufficiently different that we built the load/store unit from scratch. I got the honour of writing the RTL for the new one (normally you'd start with existing RTL and rework it). I wonder how many years those files will live on until they decide to start afresh?

It's a real shame the microarchitecture of these things isn't more open; articles like this give you some info but not the full picture. I can see why they described it as a '5 entry load buffer', for instance, but it's not really accurate to the true microarchitecture. There are also lots of fun details of the crazy things you have to do to hit your frequency target (often you just ignore certain rare hazards or error conditions, then detect them later and fix things up).

I'm not at Arm any more (I work on Open Silicon now though around embedded/IoT/Security applications not application class CPU cores). I do miss the days of working on the A510, CPU design is a lot of fun.



Author here.

> I can see why they described it as a '5 entry load buffer' for instance but it's not really accurate to the true microarchitecture.

That's what it looks like to software. I can put 5 loads between two other loads that miss the cache, and the cache miss latency will overlap. It feels like a 5-entry load buffer to software. I'm more than happy to describe the true microarchitecture if Arm talks about it :) Otherwise it'll be "hey, how does it look to software from a performance perspective?"
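To make the software-visible picture concrete, here's a toy cycle-count model of my own (illustrative only; the latencies and the issue rules are made up, not the real microarchitecture): an A53-style core that serialises everything past a miss, versus an A510-style core that lets up to 5 hits and a further miss proceed under an outstanding miss.

```python
HIT = 1    # illustrative L1 hit latency in cycles
MISS = 20  # illustrative L1 miss latency in cycles

def a53_style(loads):
    """Strictly in order: nothing issues past an access until its
    data is back, so latencies serialise."""
    t = 0
    for is_hit in loads:
        t += HIT if is_hit else MISS
    return t

def a510_style(loads, window=5):
    """Up to `window` hits, plus one further miss, may proceed while
    an earlier miss is still outstanding, so miss latencies overlap."""
    issue = 0      # cycle the next load may issue
    done = 0       # cycle by which all issued data has returned
    miss_done = 0  # cycle the outstanding miss's data returns
    hits_under = 0
    for is_hit in loads:
        if is_hit and issue < miss_done:
            if hits_under == window:  # window full: wait for the miss
                issue = miss_done
                hits_under = 0
            else:
                hits_under += 1
        ret = issue + (HIT if is_hit else MISS)
        if not is_hit:
            miss_done = ret
            hits_under = 0
        done = max(done, ret)
        issue += 1  # one load issues per cycle
    return done
```

For the sequence in the article (a miss, 5 hits, then another miss) the serialising model pays both miss latencies back to back, while the overlapping model hides the hits and most of the second miss under the first.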

> There's also lots of fun details of the crazy things you have to do to hit your frequency target (often where you just ignore certain rare hazard or error conditions and detect them later and fix it up).

Yeah, people do this everywhere. Sometimes you can find 20+ cycle penalties for things like a load that depends on a page-crossing store. The trick is making sure the hazards really are rare relative to the penalty. Also, the biggest penalties in practice are cache misses and DRAM latency, in the hundreds of cycles; intra-core penalties never come close.
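As an illustration of the shape of that kind of check (my own sketch, not any real core's logic): spotting the page-crossing store-to-load case is cheap, and resolving it precisely at full speed is the expensive part, so a design may just flag it pessimistically and replay the load later, eating the big penalty only the rare time it fires.

```python
PAGE = 4096  # 4 KiB pages assumed

def crosses_page(addr, size):
    """True if an access spanning [addr, addr+size) touches two pages."""
    return (addr % PAGE) + size > PAGE

def overlaps(a_addr, a_size, b_addr, b_size):
    """True if two byte ranges intersect."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size

def load_needs_replay(store_addr, store_size, load_addr, load_size):
    """Pessimistic fast-path check: rather than forwarding from a
    page-crossing store precisely, flag the overlap and replay the
    load later (a large but rare penalty)."""
    return (crosses_page(store_addr, store_size)
            and overlaps(store_addr, store_size, load_addr, load_size))
```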


On a slight tangent, as you've worked on both Arm and RISC-V cores, could you comment on how the two ISAs compare from a design standpoint? How much more complexity does designing for an ISA like ARM64 add when compared to a more 'purist' RISC design like RISC-V? Many thanks!


My RISC-V experience is all based around the low end (no MMUs, no hypervisors, indeed mostly no caches), so it's hard for me to give a fully informed comparison, but my feeling is the Arm ISA is better specified, and RISC-V won't necessarily be a simpler architecture in the end anyway. There is an explosion of extensions in RISC-V right now, lots of tiny ones along with some bigger ones. Whilst the size of the ISA manual is often highlighted when comparing the two (the Arm Architecture Reference Manual is almost 13,000 pages in the PDF I just downloaded), those pages are there for a reason. It's no surprise that, now there are multiple well-funded startups trying to build high-end RISC-V cores, there's also a massive push to add a lot more stuff to the RISC-V architecture.

My feeling, as it's a bunch of different companies all with a commercial pressure to get things shipped, is we'll end up in a rather chaotic place with the RISC-V ISA, though hopefully that can be improved with time.

I do get frustrated with the loose specification of the RISC-V ISA. The Arm ARM, for instance, has a precise description of its exception model (see section D1.3 'Exceptions'); the RISC-V ISA, in contrast, never explicitly defines what an exception is, and to find out what the exception model is you need to read through the definitions of a few system CSRs (control and status registers) and join the dots yourself. It's seemingly written with the idea that everyone knows what an exception is in the context of computer architecture, so there's no need for a long formal definition, and that everyone will agree on the 'obvious' right answer where the spec isn't precise. However the devil, as ever, is in the details, and the details are often lacking.
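For instance, to work out what trapped you read the mcause CSR and split the interrupt bit from the cause code yourself. A quick decode sketch (cause codes as listed in the RISC-V privileged spec; RV64 assumed):

```python
# Decode a RISC-V mcause value: the top bit distinguishes interrupt
# from exception, the rest is the cause code (RISC-V privileged spec).
EXCEPTION_CODES = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "environment call from U-mode",
    9: "environment call from S-mode",
    11: "environment call from M-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}

XLEN = 64  # RV64 assumed

def decode_mcause(mcause):
    """Split mcause into (is_interrupt, code, description)."""
    is_interrupt = bool(mcause >> (XLEN - 1))
    code = mcause & ((1 << (XLEN - 1)) - 1)
    if is_interrupt:
        return (True, code, "interrupt")
    return (False, code, EXCEPTION_CODES.get(code, "reserved/custom"))
```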

I'd also note there's no complete RISC-V conformance test suite, and indeed even building such a thing could be intractable. Look at PMP (physical memory protection): you could do all kinds of odd implementations of it that would be spec compliant (e.g. only region 12 is allowed to have the execute permission set, and it cannot be read/write/execute) but would be a total nightmare to support in a conformance test suite. So such a suite, when it does hopefully emerge, will actually only confirm compliance for a subset of compliant CPUs. There are some obviously stupid design decisions it's reasonable to rule out, but I'm sure there will be other less clear-cut cases where the spec allows multiple behaviours, the compliance suite will only work with one or two of them, and there'll be some shipping implementation which went with another one. People will either spend lots of time arguing about how the compliance suite should work, or just ignore it altogether. A lot of these difficulties stem from the loose specification of RISC-V.
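For a sense of the configuration space involved: each PMP region is controlled by a pmpNcfg byte holding R/W/X permission bits, a 2-bit address-matching mode and a lock bit (per the privileged spec). A decode sketch:

```python
# Decode one pmpNcfg byte from a RISC-V pmpcfg CSR, per the privileged
# spec: R/W/X permissions, 2-bit address-matching mode A, lock bit L.
A_MODES = {0: "OFF", 1: "TOR", 2: "NA4", 3: "NAPOT"}

def decode_pmpcfg(byte):
    """Return the fields of a single pmpNcfg byte as a dict."""
    return {
        "R": bool(byte & 0x1),         # readable
        "W": bool(byte & 0x2),         # writable
        "X": bool(byte & 0x4),         # executable
        "A": A_MODES[(byte >> 3) & 0x3],  # address matching mode
        "L": bool(byte & 0x80),        # locked until reset
    }
```

Even this one byte gives dozens of legal combinations per region, multiplied across up to 64 regions, before you get to the cross-region and priority behaviours an implementation might restrict in spec-compliant but awkward ways.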


> loose specification of the RISC-V ISA.

This is being worked on with the Sail model [1]. For a RISC-V extension to be ratified it ought to be implemented in Sail. The understanding is also that the RISC-V ISA manual should be built with code snippets from the Sail model (similar to how the Arm ARM is built from ASL definitions). The main issue is a lack of people willing and able to write Sail for RISC-V, but that is beginning to change, since RISC-V member companies are increasingly using Sail. As an example, the RISC-V exception type is defined in [2]. Is that precise enough for you?

The formal RISC-V ISA specification is not finished, you are welcome to make a PR to clarify things.

[1] https://github.com/riscv/sail-riscv

[2] https://github.com/riscv/sail-riscv/blob/master/model/riscv_...


Thanks so much for a super helpful reply!


Amazing to hear from someone so deeply involved in the design process. Thanks for clarifying that some parts of the article are not 100% accurate. Concerning:

> [...] really pushing the bounds of what you can do with an in order core and made me realise the out of order/in order boundary is a lot more blurry than you might think. Plenty of out of order stuff happening in the memory system for instance.

Are you able to expand on that a bit or point me in a direction on where to get some more information (books, articles, papers)? I'd love to read some more on what can be done to bridge the gap between the two and the specifics of A510 in that regard. Only if you want and are able to (NDA wise) of course.

Would you speculate that a future design is going to lean even more heavily/be entirely based around OOE or is such a "hybrid approach" (please excuse my layman terms) still something that can serve certain applications (e.g. space efficiency) well in the future? I understand if you don't want to make a statement one-way or another of course.


> Are you able to expand on that a bit or point me in a direction on where to get some more information (books, articles, papers)?

No books/articles etc to point to but I can highlight something mentioned in the linked article.

> Specifically, the A510 can overlap two cache misses with the following between them: ... 5 loads. The A53 would stall on any memory access past a cache miss.

If you're being strictly in order then on a cache miss you have to stall waiting for the cache fetch to complete (see the A53 behaviour). The A510 doesn't do that: it can happily service a load (or indeed a store, if I'm remembering correctly) that hits in the cache whilst the previous memory access is still ongoing. This is out-of-order behaviour.

> Would you speculate that a future design is going to lean even more heavily/be entirely based around OOE or is such a "hybrid approach" (please excuse my layman terms) still something that can serve certain applications (e.g. space efficiency) well in the future?

My gut feeling (i.e. I haven't spent time doing a proper analysis here) is that an in-order design that leans on out-of-order behaviour at a smaller scale (so no big re-order queues, massive front-end issue windows, etc.) is the sweet spot for small CPUs. Doing a full out-of-order design adds a bunch of complexity that costs you power and area, and adds to the complexity of verification and the risk you'll get a post-silicon bug.

I would note that CPU design is a messy business. It's a mistake to try to bin things into strict categories (which often don't have a precise definition anyway!) like RISC vs CISC or in-order vs out-of-order. When a team is looking at how to push the design to hit some power, performance or area metric they don't care about the label one might apply; they'll do what they need to hit the targets. So everything ends up as a hybrid design one way or another.


As a former GPU architect, that's really interesting, thanks! I didn't realise the A53's caches were strictly in order and couldn't service hits ahead of misses; I always assumed this was something even much simpler designs were capable of.

I think complexity of verification as an argument against out-of-order is questionable, because if out-of-order resulted in a better core and a competitor did manage to build and properly verify such a core, then they would have a strong competitive advantage. But that might not be true in practice given the area/power cost.

As an aside: different GPU vendors also have different limitations when it comes to in-order vs out-of-order caches, and GPUs have the extra complexity that loads are effectively doing "gather", e.g. 32-wide warps doing a load with 32 addresses that may or may not uniquify, so a single "return" to the shader processor may be anything from 1 to 32 (or even 64) cachelines.
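A small model of my own to illustrate that range (64-byte lines and 4-byte lane accesses assumed):

```python
# How many cache lines does one 32-wide gather touch? Anything from 1
# (all lanes hit the same line) up to 64 (every lane's access straddles
# a line boundary). Illustrative model with 64-byte lines.
LINE = 64

def lines_touched(addrs, access_size=4):
    """Set of distinct cache-line indices covered by a warp's gather."""
    lines = set()
    for a in addrs:
        first = a // LINE
        last = (a + access_size - 1) // LINE
        lines.update(range(first, last + 1))
    return lines

# e.g. 32 lanes reading the same word -> 1 line;
# lanes strided a full line apart -> 32 lines;
# lanes each straddling a line boundary -> 64 lines.
```

So a single warp "return" may sit behind anywhere from 1 to 64 cache-line fills, which is what makes the load pipeline's miss tracking so much messier than the scalar CPU case.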

And GPUs get even trickier with the texture unit doing trilinear + anisotropic filtering, so a single pixel may require 32x as many inputs, and you may even get into situations where the cache isn't big enough (or doesn't have enough ways) to handle the worst case and you have to revert to in-order for certain modes, or process things at a finer granularity than entire warps! Or just do in-order for everything with huge latency FIFOs and accept the latency cost. There are lots of different ways to handle this, depending also on what granularity of returns your shader processor can handle. As you said, neither modern CPUs nor GPUs can really be defined using simple labels.

Gather makes things a lot harder for load pipelines, so I'm not surprised Zen 4 seems to still just split it into uops, but I'm curious exactly how Intel handles it in their CPU microarchitectures. Sadly this is the kind of thing that's practically impossible to know as an outsider!


Can you recommend something to read and learn for an experienced hardware designer (with a bit of graphics pipeline knowledge), if I want to make my own toy GPU? The field seems to be exceptionally interesting and I have no idea how to get in :)


Question: how much is the A510 optimized for energy efficiency vs. optimized for die size?

AnandTech did some comparisons where Apple's E-cores are several times more performant than the A55 while consuming only slightly more power, and I'm curious whether this discrepancy could be explained by a different optimisation target. If that's not it, could you speculate on the reasons for the discrepancy?

(I assume A55 is not different enough from A510 to render this question obsolete)


Well, for a start, the die area of each Apple core (both P and E) is double that of Arm's offerings. And that's ignoring the uncore part.


Could you tell a little bit about what simulation tools you used to verify your designs?


CPU verification is generally done at a number of levels. For the A510 we had a block-level test suite for the L1 memory system. This runs in an RTL simulator (provided by commercial EDA tool vendors; Cadence Xcelium, Synopsys VCS and Mentor/Siemens Questa are the big 3). This is good for testing detailed behaviour, in particular weird corner cases that are hard to produce. As a design engineer this is generally your primary verification environment.

Then you have top-level/system-level. Here you're running full programs on the CPU alone, or on the CPU in a wider system simulation. There are multiple ways to run this. Again you can use RTL simulation; the major disadvantage is that it's very slow (think on the order of 1-10 kHz). You can use an FPGA, which has the advantage of speed but limited design visibility (an RTL simulation can give you a complete dump of what every signal is doing on every clock, whereas in an FPGA you have to explicitly add an internal logic analyser to look at a limited selection of signals over a limited time window). Finally you have emulators, effectively special-purpose supercomputers for RTL simulation. These are fantastic: you get good speed (1 MHz order of magnitude) and design visibility closer to an RTL simulator than an FPGA. The downside is cost; they are very, very expensive (think $1 million as a starting point).

At Arm all of the above techniques were used.


Thanks! :)

In case anyone else is interested, I found some additional info about the hardware emulators here:

https://en.wikipedia.org/wiki/Hardware_emulation


Any particular reason they decided to write the RTL from scratch?


It's always a trade-off. On one hand you have an existing known-good design with known properties; on the other, adapting that design to your new microarchitecture can be painful, and starting from scratch can allow you to build a better design in less time. In this case it was a sufficiently radical overhaul of the microarchitecture that the start-from-scratch approach was preferable.



