> Why the duplication? I have not yet observed Metal using different programs for each.
I'm guessing whoever designed the system wasn't sure whether they would ever need to be different, and designed it so that they could be. It turned out that they didn't need to be, but it was either more work than it was worth to change it (considering that simply passing the same parameter twice is trivial), or they wanted to leave the flexibility in the system in case it's needed in future.
I've definitely had APIs like this in a few places in my code before.
I don't understand why the programs are the same. The partial render store program has to write out both the color and the depth buffer, while the final render store should only write out color and throw away depth.
E.g. in their example in the link above for deferred rendering (figure 4) the multiple G buffers won't actually need to leave the on-chip tile buffer - unless there's a partial render before the final shading shader is run.
Right, I had the article's bunny test program on my mind, which looks like it has only one pass.
In OpenGL, the driver would have to scan the following commands to see if it can discard the depth data. If it doesn't see the depth buffer get cleared, it has to be conservative and save the data. I assume mobile GPU drivers in general do make the effort to do this optimization, as the bandwidth savings are significant.
In Vulkan, the application explicitly specifies which attachment (i.e. stencil, depth, color buffer) must be persisted at the end of a render pass, and which need not. So that maps nicely to the "final render flush program".
The quote is about Metal, though, which I'm not familiar with, but a sibling comment points out it's similar to Vulkan in this aspect.
So that leaves me wondering: did Rosenzweig happen to only try Metal apps that always use MTLStoreAction.store in passes that overflow the TVB, or is the Metal driver skipping a useful optimization, or neither? E.g. because the hardware has another control for this?
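To make the distinction being discussed concrete, here is a minimal sketch of the reasoning, not of the AGX driver or of any real API. The `StoreAction` enum and `attachments_to_write` function are hypothetical names; the enum only loosely mirrors Vulkan's `VK_ATTACHMENT_STORE_OP_STORE` / `VK_ATTACHMENT_STORE_OP_DONT_CARE` and Metal's `MTLStoreAction.store` / `.dontCare`. The assumption is that a partial render must keep every attachment, because the pass resumes later, while the final render may drop anything the application marked as not needed.

```python
from enum import Enum


class StoreAction(Enum):
    # Hypothetical stand-ins for the API-level store actions.
    STORE = "store"
    DONT_CARE = "dont_care"


def attachments_to_write(store_actions, is_final_render):
    """Return the attachment names that must be written back to memory."""
    if not is_final_render:
        # Partial render: the pass continues afterwards, so everything
        # in the tile buffer has to survive, whatever the app asked for.
        return sorted(store_actions)
    # Final render: only attachments explicitly marked STORE are kept.
    return sorted(
        name
        for name, action in store_actions.items()
        if action is StoreAction.STORE
    )


pass_actions = {"color0": StoreAction.STORE, "depth": StoreAction.DONT_CARE}
partial = attachments_to_write(pass_actions, is_final_render=False)
final = attachments_to_write(pass_actions, is_final_render=True)
```

Under this model the two store programs would only be identical when every attachment uses a store action, which is one of the possibilities raised above.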
> Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones.
PowerVR has its roots in a desktop video card with somewhat limited release and impact. It really took off when it was used in the Sega Dreamcast home console and the Sega Naomi arcade board. It was only later that people put them in phones.
Unified memory was introduced by SGI with the O2 workstation in 1996, then they used it again with their x86 workstations SGI 320 and 540 in 1999. So it was a workstation-class technology before being a mobile one :)
The N64's unified memory model had a pretty big asterisk though. The system had only 4kB for textures out of 4MB of total RAM. And textures are what uses the most memory in a lot of games.
That’s a somewhat misleading way to describe it. The N64 has 4K texture memory (TMEM), but you can use far more than 4K of texture during a frame—because you can load data into TMEM as many times as you like during a frame.
In practice, you might think of TMEM like it’s a cache, it’s just that you have to manage this cache manually. You can use as much RAM as you like for textures.
TMEM is also not part of main RAM, like the RSP’s DMEM and IMEM.
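A rough sketch of the "manually managed cache" idea, assuming a simplified model where a texture in main RAM is streamed through TMEM in fixed-size chunks. `plan_tmem_loads` is a hypothetical helper, not an N64 SDK function; real code would issue a load command and draw the matching geometry for each chunk.

```python
TMEM_BYTES = 4096  # TMEM size mentioned in the comments above


def plan_tmem_loads(texture_bytes, tmem_bytes=TMEM_BYTES):
    """Split a texture stored in main RAM into TMEM-sized loads.

    Returns a list of (offset, size) pairs, one per load into TMEM.
    """
    loads = []
    offset = 0
    while offset < texture_bytes:
        size = min(tmem_bytes, texture_bytes - offset)
        loads.append((offset, size))
        offset += size
    return loads


# A texture larger than TMEM is still usable; it just takes several loads.
loads = plan_tmem_loads(10_000)
```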
Then that SGI team broke out to form ArtX, developed the GameCube hardware, and was then snatched up by ATI, where it went on to form the foundation of ATI's (now AMD's) GPUs.
But since it's a tiling rendering architecture, which is normal for mobile applications and not how desktop GPUs are architected, it would be fair to call it a mobile GPU.
According to the sources I've read, it uses a tiled rasterizing architecture but it's not deferred in the same way as typical mobile TBDR that bins all vertexes before starting rasterization, deferring all rasterization after all vertex generation, and flushing each tile to the framebuffer once.
NV seems to rasterize vertexes in small batches (i.e. immediately) but buffers the rasterizer output on die in tiles. There can still be significant overlap between vertex generation and rasterization. Those tiles are flushed to the framebuffer, potentially before they are fully rendered, and potentially multiple times per draw call depending on the vertex ordering. They do some primitive reordering to try to avoid flushing as much, but it's not a full deferred architecture.
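For readers unfamiliar with the binning step that a full TBDR performs before any rasterization starts, here is a minimal sketch. It assumes conservative bounding-box binning and an illustrative 32-pixel tile; `bin_triangles` is a made-up name and this is not how any particular GPU implements it.

```python
from collections import defaultdict

TILE = 32  # tile edge in pixels; an illustrative value only


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile coordinate (tx, ty) to the indices of the triangles
    whose screen-space bounding box overlaps that tile."""
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the tiles that exist on screen.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    [(2, 2), (20, 4), (5, 25)],       # fits inside one tile
    [(10, 10), (70, 12), (40, 50)],   # spans several tiles
]
bins = bin_triangles(triangles, width=128, height=64)
```

In a fully deferred design every triangle of the pass is binned like this before the first tile is shaded; the batching described above instead starts rasterizing before all geometry has been seen.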
I actually had one of those cards! The only games I could get it to work with were Half-Life, glQuake, and Jedi Knight, and the bilinear texture filtering had some odd artifacting IIRC
To be fair, the architecture used in the early “desktop” variants was quite different from the modern mobile ones (MBX/SGX and beyond); excepting the TBDR.
I really appreciate the writing and work that was done here.
It is amazing to me how complicated these systems have become. I am looking over the source for the single triangle demo. Most of this is just about getting information from point A to point B in memory. Over 500 lines worth of GPU protocol overhead... Granted, this is a one-time cost once you get it working, but it's still a lot to think about and manage over time.
I've written software rasterizers that fit neatly within 200 lines and provide very flexible pixel shading techniques. Certainly not capable of running a Cyberpunk 2077 scene, but interactive framerates otherwise. In the good case, I can go from a dead stop to final frame buffer in <5 milliseconds. Can you even get the GPU to wake up in that amount of time?
Modern GPU + driver stacks usually have more than one frame in flight. You have to output a frame every 4ms, but you do not need the latency from the start of the application rendering code to the frame being on screen to be 4ms - pipelining is allowed. But keeping that pipelining down to a minimum is also important, as it contributes to input lag, which gamers care about.
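As an illustration of how small a software rasterizer can be, here is a sketch using the common edge-function approach with a caller-supplied "pixel shader". This is not the commenter's code; `edge`, `rasterize` and the `shade` callback are names chosen for this example, and it makes no attempt at speed.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height, shade):
    """Fill a width x height framebuffer, calling shade(w0, w1, w2) with
    barycentric weights for every pixel centre inside the triangle."""
    framebuffer = [[0] * width for _ in range(height)]
    a, b, c = tri
    area = edge(a, b, c)
    if area == 0:
        return framebuffer  # degenerate triangle covers nothing
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel centre
            # Dividing by the signed area makes either winding order work.
            w0 = edge(b, c, p) / area
            w1 = edge(c, a, p) / area
            w2 = edge(a, b, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                framebuffer[y][x] = shade(w0, w1, w2)
    return framebuffer


# Flat "shader" that marks covered pixels with 1.
fb = rasterize([(0, 0), (8, 0), (0, 8)], 8, 8, lambda w0, w1, w2: 1)
covered = sum(map(sum, fb))
```

The `shade` callback is where the "flexible pixel shading" would live: it receives the interpolation weights and can return anything.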
Huh, I always thought tilers re-ran their vertex shaders multiple times -- once with position-only to do binning, and then again when computing for all attributes with each tile; that's what the "forward tilers" like Adreno/Mali do. That's crazy they dump all geometry to main memory rather than keeping it in pipe. It explains why geometry is more of a limit on AGX/PVR than Adreno/Mali.
That's what I thought, too, until I saw ARM's Hot Chips 2016 slides. Page 24 shows that they write transformed positions to RAM, and later write varyings to RAM. That's for Bifrost, but it's implied Midgard is the same, except it doesn't filter out vertices from culled primitives.
That makes me wonder whether the other GPUs with position-only shading - Intel and Adreno - do the same.
As for PowerVR, I've never seen them described as position-only shaders - I think they've always done full vertex processing upfront.
Mali's slides here still show them doing two vertex shading passes, one for positions, and again for other attributes. I'm guessing "memory" here means high-performance in-unit memory like TMEM, rather than a full frame's worth of data, but I'm not sure!
I was under that impression as well. If they write out all attributes, what is really the remaining difference to a traditional immediate mode renderer? Nvidia reportedly has vertex attributes going through memory for many generations already (and they are at least partially tiled...).
I suppose the difference is whether the render target lives in the "SM" and is explicitly loaded and flushed (by a shader, no less!) or whether it lives in a separate hardware block that acts as a cache.
NV has vertex attributes "in-pipe" (hence mesh shaders), and the appearance of a tiler is a misread, it's just a change to the macro-rasterizer about which quads get dispatched first, it's not a true tiler.
The big difference is the end of the pipe, as mentioned; whether you have ROPs or whether your shader cores load/store from a framebuffer segment. Basically, whether or not framebuffer clears are expensive (assuming no fast-clear cheats), or free.
That image gave me flashbacks of gnarly shader debugging I did once. IIRC, I was dividing by zero in some very rare branch of a fragment shader, and it caused those black tiles to flicker in and out of existence. Excruciatingly painful to debug on a GPU.
debugging in situations where there is no ability to halt and step, or in some cases even log, is extremely extremely tricky. Embedded is another domain where that's super common... or drivers or other peripherals.
there probably are tools these days for debugging shaders, potentially commercial packages if Nsight Studio doesn't have it, but yeah, that sort of thing isn't easy.
Seeing how Apple licensed the full PowerVR hardware before, they probably currently have a license for whatever hardware they based their design on.
They originally claimed they completely redesigned it and announced they were therefore going to drop the PowerVR architecture license - that was the reason for the stock price crash and Imagination Technologies sale in 2017.
They have since scrubbed the internet of all such claims and to this day pay for an architecture license. I think it's similar to an ARM architecture license - where it's a license for any derived technology and patents rather than actually being given the RTL for PowerVR-designed cores.
I worked at PowerVR during that time (I have Opinions, but will try to keep them to myself), and my understanding was that Apple hadn't actually taken new PowerVR RTL for a number of years and had significant internal redesigns of large units (e.g. the shader ISA was rather different from the PowerVR designs of the time), but presumably they still use enough of the derived tech and ideas that paying the architecture license is necessary. This transfer was only one way - we never saw anything internal about Apple's designs, so reverse engineering efforts like this are still interesting.
And as someone who worked on the PowerVR cores (not the Apple derivatives) I can assure you everything discussed in the original post is extremely familiar.
It's a small enough group that I'm likely personally identifiable anyway, so will be vague and try to stick to public info and my personal conjecture.
Let's just say that the legal shenanigans of the time caused me to lose my job (part of the sale of Imagination Technologies required closing some countries' offices to avoid more interference from various regulatory bodies). Judge bias accordingly.
And all their noise about "Ground up redesign using no PowerVR tech" kinda conflicts with them still to this day paying for an architecture license - the very thing that they claimed they would be dropping in their press release that caused the imagination technologies share crash and corresponding sale. And this is without even going to court - they issued a press release then immediately relented (and have continued to relent for over 5 years now) at the slightest question. And then scrubbed all mention of that press release.
My general suspicion is apple intended to game the market by intentionally dropping the share price and simply purchase PowerVR at a discount - but in the process pissed off enough people that they rejected the offer, even if it was "better" in terms of value. Or just let them go under and pick everything they want off the resulting fire sale - I heard rumors that apple had already put in an offer to purchase the company that was rejected, and under UK regulation a failed takeover attempt can't be re-attempted for some time, and much of this happened within that window (again, according to fuzzy scuttlebutt, nothing definite).
That or the legal/C-suite of apple don't actually speak to the engineers of apple anymore - they honestly thought that it was a completely ground-up design that didn't derive anything from PowerVR tech, and just sent out the press release thinking "Why are we paying for this??" - then the engineers shuffled in saying that actually they couldn't put together anything better that wasn't a direct derivative, and their noise about a completely internally designed-from-scratch apple GPU was a bit of a stretch.
There's no reason that couldn't be a half-truth - it could be a PowerVR with certain components replaced, or even the entire GPU replaced but with PowerVR-like commands and structure for compatibility reasons. Kind of like how AMD designed their own x86 chip despite it being x86 (Intel's architecture).
Also, if you read Hector Martin's tweets (he's doing the reverse-engineering), Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of. It's what they do with ARM themselves - using their own ARM designs instead of the stock Cortex ones while maintaining ARM compatibility.*
*Thus, Apple has a right to the name "Apple Silicon" because the chip is designed by Apple, and just happens to be ARM-compatible. Other chips from almost everyone else use stock ARM designs from ARM themselves. Otherwise, we might as well call AMD an "Intel design" because it's x86 by the same logic.
> Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of.
They did this with ADB: early PowerPC systems contained a controller chip that had the same API that was implemented in software in the 6502 IOP coprocessor in the IIfx/Q900/Q950.
Didn't Apple have a large or even dominant role in the design of the ARM64/AArch64 architecture? I remember reading somewhere that they developed ARM64 and essentially "gave it" to ARM who accepted but nobody could understand at the time why a 64 bit extension to ARM was needed so urgently, and why some of the details of the architecture had been designed the way they had. Years later with Apple Silicon it all became clear.
> arm64 is the Apple ISA, it was designed to enable Apple’s microarchitecture plans. There’s a reason Apple’s first 64 bit core (Cyclone) was years ahead of everyone else, and it isn’t just caches
> Arm64 didn’t appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes. When Apple began selling iPhones containing arm64 chips, ARM hadn’t even finished their own core design to license to others.
> ARM designed a standard that serves its clients and gets feedback from them on ISA evolution. In 2010 few cared about a 64-bit ARM core. Samsung & Qualcomm, the biggest mobile vendors, were certainly caught unaware by it when Apple shipped in 2013.
> > Samsung was the fab, but at that point they were already completely out of the design part. They likely found out that it was a 64 bit core from the diagnostics output. SEC and QCOM were aware of arm64 by then, but they hadn’t anticipated it entering the mobile market that soon.
> Apple planned to go super-wide with low clocks, highly OoO, highly speculative. They needed an ISA to enable that, which ARM provided.
> M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.
> > ARMv8 is not arm64 (AArch64). The advantages over arm (AArch32) are huge. Arm is a nightmare of dependencies, almost every instruction can affect flow control, and must be executed and then dumped if its precondition is not met. Arm64 is made for reordering.
> > M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.
This is such an interesting counterpoint to the occasional “Just ship it” screed (just one yesterday I think?) we see on HN.
I have to say, I find this long form delivery of tech to be enlightening. That kind of foresight has to mean some level of technical savviness at high decision making levels. Whereas many of us are caught at companies with short sighted/tech naive leadership who clamor to just ship it so we can start making money and recoup the money we’re losing on these expensive tech type developers.
I think the "just ship it" method is necessary when you're small and starting out. Unless you are well funded, you couldn't afford to do what Apple did.
AArch64 does not resemble MIPS at all (beyond the fact that both use fixed-length instructions and separate register-register and load-store instruction groups; these RISC principles had already been used in IBM 801 about 5 years before MIPS, and then they have been used in more than a dozen of other CPU architectures, many of which are more similar to AArch64 than MIPS is).
Therefore, there is no basis for saying that AArch64 is a cleaned-up MIPS-like ISA. Only RISC-V is a MIPS-like ISA.
One of the few features of AArch64 that can be said to be similar to MIPS was its main mistake.
In the initial ARMv8.0 version, the only means provided for implementing atomic operations was a load-and-reserve/store-conditional instruction pair.
This kind of instruction has been popularized by MIPS II, but it had not been invented by MIPS, but by Jensen et al. (November 1987), for the S-1 AAP multiprocessor.
While this instruction pair allows the implementation of lock-free/wait-free data structures, it can be extremely inefficient for implementing locks in systems with many cores (because progress is not guaranteed), so in the ARMv8.1 version the initial mistake has been corrected, by adding atomic instructions of the type fetch-and-op, besides the MIPS-like LL/SC pair.
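The difference between the two styles of atomic can be sketched with a toy model. This is a deterministic simulation with a made-up `Cell` class, not real hardware behaviour; on ARMv8 the exclusive pair corresponds to instructions such as `LDXR`/`STXR`, and the ARMv8.1 fetch-and-op additions include `LDADD`. The assumption modelled here is only that any successful write clears every outstanding reservation.

```python
class Cell:
    """Toy memory word with per-core LL/SC reservations."""

    def __init__(self, value=0):
        self.value = value
        self.reservations = set()

    def load_linked(self, core):
        self.reservations.add(core)
        return self.value

    def store_conditional(self, core, value):
        if core not in self.reservations:
            return False  # someone else wrote in between; caller must retry
        self.value = value
        self.reservations.clear()  # a write invalidates all reservations
        return True

    def fetch_add(self, amount):
        old = self.value
        self.value = old + amount
        self.reservations.clear()
        return old  # always completes in one step, no retry loop


# Two cores race to increment with LL/SC: one of them loses and retries.
cell = Cell(0)
a = cell.load_linked(core=0)
b = cell.load_linked(core=1)
first = cell.store_conditional(0, a + 1)
second = cell.store_conditional(1, b + 1)
retries = 0
while not second:
    retries += 1
    b = cell.load_linked(1)
    second = cell.store_conditional(1, b + 1)

# The same two increments with fetch-and-add need no retries.
cell2 = Cell(0)
cell2.fetch_add(1)
cell2.fetch_add(1)
```

With many cores contending, the retry loop in the first half is where the lack of a progress guarantee shows up.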
The complex features of ARM64 were not newly designed by and large, they were mostly carried over from ARM32 - mostly to take advantage of shared ARM32/ARM64 implementation. Much of the actual design work involved in ARM64 was simplification and things like adding a zero register to the ISA, which is pretty comparable to MIPS.
A zero register already existed in some computers with vacuum tubes, almost 70 years ago, this is not a new idea that can be attributed to MIPS or RISC.
It is a good feature, which can reduce substantially the number of instructions that must be implemented, because many single-operand operations are just special cases of double-operand operations with one null operand.
This is why it was used in many early computers, which had to be simple due to the limitations of their technology, and then it was used again in most RISC CPUs, which have been simplified intentionally (and not only in MIPS; among the more successful RISC ISAs also IBM POWER has it; only 32-bit ARM does not have it, due to its unusually low number of general-purpose registers, in comparison with the other RISC ISAs).
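A small sketch of why a zero register shrinks the instruction set: in the toy interpreter below (a made-up three-operand machine, not a real ISA), "move" and "negate" need no opcodes of their own. AArch64 does the same thing with aliases, for example `MOV Xd, Xm` is encoded as `ORR Xd, XZR, Xm` and `NEG Xd, Xm` as `SUB Xd, XZR, Xm`.

```python
def run(program, regs):
    """Execute (op, dest, src1, src2) tuples on a copy of regs.

    The register named "zr" always reads as 0 and discards writes.
    """
    regs = dict(regs)
    ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "orr": lambda a, b: a | b,
    }

    def read(name):
        return 0 if name == "zr" else regs[name]

    for op, dest, src1, src2 in program:
        result = ops[op](read(src1), read(src2))
        if dest != "zr":
            regs[dest] = result
    return regs


program = [
    ("orr", "x1", "zr", "x0"),  # mov x1, x0
    ("sub", "x2", "zr", "x0"),  # neg x2, x0
]
result = run(program, {"x0": 5, "x1": 0, "x2": 0})
```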
Apple, Acorn Computers and VLSI were founding partners of ARM, if I remember correctly.
My StrongArm powered RiscPC was amazing for the time. It was strange that the contemporaneous Newton was powered by the same (and in some ways better) processor.
The connection between ARM processors being used in desktop and mobile devices is in its early DNA.
They do, and their microarchitecture is unambiguously, hugely different to anything else (some details in 1). The last Apple Silicon chip to use a standard Arm design was the A5X, whereas they were using customised PowerVR GPUs until I think the A11.
They are one of a handful of companies that hold a license allowing them to both customize the reference core and to implement the Arm ISA through their own silicon design. Everyone else's SoCs all use the same Arm reference mask. Qualcomm also holds such a license, which owes to their Snapdragon SoC, just like Apple's A- and M-series, occupying a performance hierarchy above everything else Arm.
According to Hector Martin (the project lead of Asahi) in previous threads of the subject[0], Apple actually has an "architecture+" license which is completely exclusive to them, thanks to having literally been at the origins of ARM: not only can Apple implement the ISA on completely custom silicon rather than license ARM cores, they can customise the ISA (as in add instructions, as well as opt out of mandatory ISA features).
The only Qualcomm designed 64-bit mobile core so far was the Kryo core in the 820. They then assigned that team to server chips (Centriq) then sacked the whole team when they felt they needed to cut cash flow to stave off Avago/Broadcom. The "Kryo" cores from 835 on are rebadged/adjusted ARM cores.
IMO the Kryo/820 wasn't a major failure, it turned out a lot better than the 810 which had A53/A57 cores.
And then they decided they needed a mobile CPU team again and bought Nuvia for ~US$1 Billion.
Qualcomm did use their own design called Kryo for a little while, but is now focusing for the future on cores designed by Nuvia, which they just bought.
As for Apple, they've designed their own cores since the Apple A6 which used the Swift core. If you go to the Wikipedia page, you can actually see the names of their core designs, which they improve every year. For the M1 and A14, they use Firestorm High-Performance Cores and Icestorm Efficiency Cores. The A15 uses Avalanche and Blizzard. If you visit AnandTech, they have deep-dives on the technical details of many of Apple's core designs and how they differ from other core designs including stock ARM.
The Apple A5 and earlier were stock ARM cores, the last one they used being Cortex A9.
For this reason, Apple is about as much an ARM chip as AMD is an Intel chip. Technically compatible, implementation almost completely different. It's also why Apple calls it "Apple Silicon" and it is not just marketing, but actually justified just as much as AMD not calling their chips Intel derivatives.
Kryo started as custom but flopped in the Snapdragon 820 so they moved to a "semi-custom" design, it's unclear how different it really is from the stock Cortex designs.
> Qualcomm did use their own design called Kryo for a little while
Before that, they had Scorpion and Krait, which were both quite successful 32 bit ARM compatible cores at the time.
Kryo started as an attempt to quickly launch a custom 64 bit ARM core and the attempt failed badly enough that Qualcomm abandoned designing their own cores and turned to licensing semi-custom cores from ARM instead.
To be blunt, you're asking about questions that could be solved with a quick google and you are coming off as a bit of a jerk asking for very specific citations with exact specific wording for basic facts like this that, again, could be solved by looking through the wikipedia for "apple silicon" and then bouncing to a specific source. People have answered your question and you're brushing them off because you want it answered in an exact specific way.
> NVIDIA and Samsung, up to this point, have gone the processor license route. They take ARM designed cores (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate them into custom SoCs. In NVIDIA’s case the CPU cores are paired with NVIDIA’s own GPU, while Samsung licenses GPU designs from ARM and Imagination Technologies. Apple previously leveraged its ARM processor license as well. Until last year’s A6 SoC, all Apple SoCs leveraged CPU cores designed by and licensed from ARM.
> With the A6 SoC however, Apple joined the ranks of Qualcomm with leveraging an ARM architecture license. At the heart of the A6 were a pair of Apple designed CPU cores that implemented the ARMv7-A ISA. I came to know these cores by their leaked codename: Swift.
Yes, Apple has been designing and using non-reference cores since the A6 era, and were one of the first to the table with ARMv8 (apple engineers claim it was designed for them under contract to their specifications, but this part is difficult to verify with anything more than citations from individual engineers).
I expect that Apple has said as much in their presentations somewhere, but if you're that keen on finding such an incredibly specific attribution, then knock yourself out. It'll be in an apple conference somewhere, like WWDC. They probably have said "apple-designed silicon" or "custom core" at some point, and that would be your citation - but they also sell products, not hardware, and they don't extensively talk about their architectures since they're not really the product, so you probably won't find a deep-dive like Anandtech from Apple directly where they say "we have 8-wide decode, 16-deep pipeline... etc" sorts of things.
Alyssa's writing style steps you through a technical mystery in a way that remains compelling even if you lack the domain knowledge to solve the mystery yourself.
> Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.
> And with that, we get our bunny.
So what was the configuration that needed to change? Don't leave us hanging!!!
It's been said more than a few times in the past, but I cannot get over just how smart and motivated Alyssa Rosenzweig is - she's currently an undergraduate university student, and was leading the Panfrost project when she was still in high school! Every time I read something she wrote I'm astounded at how competent and eloquent she is.
It's surprising to me the deep contrast between this awe-inspiring deep technical wizardry, and the sometimes-incompetence of driver developers (at least the impression I get from a month spent reverse-engineering Windows drivers) and poor pay of embedded programmers (https://news.ycombinator.com/item?id=31364360). I don't know if striving to develop this kind of deep knowledge myself (though I don't know if I'll ever learn all the skills she has today) is a useful work skill; I get the impression that deep knowledge of how to optimize compilers/compute/apps/servers at the assembly/cache level pays very well (despite being much more similar to embedded compared to web/mobile or backend development).
Does anyone know if she has a proper interview somewhere? I'd love to know how she got so technical in high school to be able to reverse engineer a GPU -- something I would have no idea how to start even with many more years experience (although admittedly I know very little about GPUs and don't do graphics work).
I used to feel this way, too. However, every single one of us has their own unique circumstances.
I can't give too many details unfortunately. But, there's a specific step I took in my career, which was completely random at the time. I was still a student, and I decided not to work somewhere. I resigned two weeks in. Had I not done that, I wouldn't be where I am today. My situation would be totally different.
Yes, some people are very talented. But it does take quite a lot of work and dedication. And yes, sometimes you cannot afford to dedicate your time to learning something because life happens.
Be excited! This means amazing things are coming, from incredibly talented people. And even better when they put out their knowledge in public, in an easy to digest form, letting you learn from them.
I get that. But then I remember at that age, I was only just cobbling together my very first computer from the scrap bin. An honest comparison is nearly impossible.
me too, and idk how to cope with this...
i see younger guys than me creating os, and i am here achieved nothing in life...
i feel so sad, so depressed, my mood flips so hard, sometimes i feel like just leaving everything and getting away.
i know it's not a competition and i don't want to win this, i just want to point at one thing and say, this is created by me. That's all i want and nothing else... But i have nothing in hand... And this happens every time...
and i get depressed and start crying...
And for me, her existence is enough to keep me from getting depressed about my industry. Whatever she's doing is keeping my hopes up for computer engineering.