Accessing tile memory doesn't need an address translation; it's not part of virtual memory. Accessing global memory does.
I think the claim is that the typical working set can exceed what the TLB can map once the GPU scales to 32 or more cores, so the TLB starts thrashing. Optimizing for TBDR would alleviate this because all the tile memory accesses would bypass the TLB, and also likely reduce the working set because the intermediate buffers don't need VM mapping.
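To put rough numbers on that (everything below is an illustrative assumption, not a measured M1 figure), a quick back-of-the-envelope in C shows how a handful of full-screen intermediate buffers can blow past what a TLB maps:

    #include <stdio.h>

    /* Back-of-the-envelope: TLB reach vs. working set.
       All numbers are illustrative assumptions, not measured hardware figures. */
    int main(void) {
        const unsigned long page_size   = 16UL * 1024;              /* 16 KB pages */
        const unsigned long tlb_entries = 1024;                     /* hypothetical GPU TLB */
        const unsigned long tlb_reach   = page_size * tlb_entries;  /* bytes mapped without a miss */

        /* Working set: six full-screen intermediate buffers at 4K, RGBA16F */
        const unsigned long buf_bytes   = 3840UL * 2160 * 8;
        const unsigned long working_set = buf_bytes * 6;

        printf("TLB reach:   %lu MB\n", tlb_reach   / (1024 * 1024));   /* ~16 MB  */
        printf("Working set: %lu MB\n", working_set / (1024 * 1024));   /* ~379 MB */
        /* Once the working set dwarfs the reach, page walks start showing up;
           intermediates that live in tile memory drop out of the set entirely. */
        return 0;
    }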
Yeah, but a TLB is for user-space virtual memory mapping. When you're talking about GPU operations, those usually go through opaque handles or allocated buffers, and the DMA/read/write is done outside of user space (aside from initial data uploads, or the case where you have unsynchronized user-space memory buffers, but that's not usually the norm). Some of this changes a bit with Vulkan, but it's not really clear from the thread how a TLB fits in.
Optimizing for TBR involves reducing draw calls, render target switches, and other operations that are orthogonal to what a TLB historically does.
Pretty much all modern GPUs, discrete or integrated, use MMUs/TLBs to access out-of-core memory. I can confirm that PowerVR has had an MMU since at least the SGX days (the page table management can be seen in the (barely) Linux kernel portion of their driver that they open sourced eons ago). Most of the other integrated GPUs got on board a few years later. Discrete GPUs have had MMUs for over a decade, since that gets them better perf: less validation of command buffers is needed if, worst case, you can only crash your own user context. that_was_always_allowed.jpeg Leaning into those ubiquitous MMUs is half the point of Mantle/Vulkan/DX12/Metal, since you can rely on hardware protection and drop the older standards' required validation checks when the worst you can do is crash your own process (ideally, barring bugs that for sure exist).
...there are modern integrated GPUs that don't use virtual memory for all global resources? I expected that was a given, if only to have an MMU enforce security when accessing RAM.
I'm not privy to the security mechanisms of the integrated GPUs I've worked with; however, I've got a decent amount of experience with mostly tile-based GPUs, and I've never hit TLB misses as the reason a GPU was going slow. On the CPU? Sure. But unless they want to get a lot more specific about what these optimizations are, it's a bit hard to piece together what's going on here.
GPUs don't like to read around in memory (neither do CPUs, for that matter, but they tend to branch and be a lot less predictable), so optimizing reads/writes is something you will do regardless. The painful parts of TBRs usually (but not universally) have to do with draw call setup and render target switching. I remember some of the early iPhones had pretty painfully low draw call limits, tied to the overhead of dispatching the actual calls rather than triangle count or fillrate. It's fairly easy to put together synthetic benchmarks to shake that out.
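If anyone wants to reproduce that kind of test, a minimal sketch (GL ES flavoured, with context/shader/buffer setup omitted and crude glFinish-based timing) looks something like this; the point is just to compare many tiny draws against one big draw at the same triangle count:

    #include <GLES2/gl2.h>
    #include <time.h>

    static double now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    /* Issue `calls` draw calls of `tris_per_call` triangles each; return elapsed ms.
       Assumes a vertex buffer and shader program are already bound. */
    static double time_draws(int calls, int tris_per_call) {
        glFinish();                                    /* drain prior work */
        double t0 = now_ms();
        for (int i = 0; i < calls; i++)
            glDrawArrays(GL_TRIANGLES, 0, tris_per_call * 3);
        glFinish();                                    /* wait for the GPU to finish */
        return now_ms() - t0;
    }

    /* Compare time_draws(10000, 1) vs time_draws(1, 10000): if the first is far
       slower at the same triangle count, you're bound by call setup, not fill
       rate or vertex throughput. */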
Well the claim is also that it isn't generally an issue for the M1's architecture until you scale up to 32 cores or more [1], which is beefier than any other current GPU outside of AMD and NVIDIA's offerings. So it's completely believable that the GPUs you worked on could have used virtual memory for everything without it ever becoming a performance issue.
(alternatively: it looks like the M1 Pro only has 24 MB L3, so it's likely that actual cache misses become an issue before TLB misses, but the Max/Ultra have more cache so that could invert)
The claim is that if you don't "optimize" for TBDR then you are gated by the TLB, and that software written for immediate-mode rendering isn't optimized this way.
However, draw call batching and minimizing render target (and state!) switching are bread-and-butter performance optimizations you'd make for an immediate-mode renderer as well. You can get away with more draw calls, and there are specific things you can do on a per-architecture basis, but they don't involve anything to do with TLBs to the best of my knowledge. If someone wants to spill the beans on how the M1 is somewhat different here, I'm all ears, but it just doesn't line up with most GPU architectures I've seen.
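For concreteness, the bread-and-butter version of that is just sorting submissions so you touch each render target once and keep state changes coherent. A toy sketch (Draw, target_id, pipeline_id, and submit_draw are made-up stand-ins, not any particular API):

    #include <stdlib.h>

    typedef struct {
        int target_id;    /* render target / pass this draw belongs to */
        int pipeline_id;  /* shader + blend + depth state */
        int mesh_id;
    } Draw;

    static int cmp_draw(const void *a, const void *b) {
        const Draw *x = a, *y = b;
        if (x->target_id != y->target_id) return x->target_id - y->target_id;
        return x->pipeline_id - y->pipeline_id;
    }

    /* Sort by target, then by pipeline state, so each pass is submitted once
       and state switches are minimized; then hand the draws to the backend. */
    static void submit_sorted(Draw *draws, size_t n, void (*submit_draw)(const Draw *)) {
        qsort(draws, n, sizeof(Draw), cmp_draw);
        for (size_t i = 0; i < n; i++)
            submit_draw(&draws[i]);
    }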
The big wins between TBR and IMR usually involve getting larger tiles through lighter use of render targets [1], as that increases the total size of the tiles, which drives efficiencies in dispatching them.
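Rough arithmetic on why that is (the 32 KB tile memory figure is an assumption for illustration, not any specific GPU's budget): on-chip tile memory is fixed, so bytes-per-pixel across all attachments sets how many pixels fit in a tile.

    #include <stdio.h>

    int main(void) {
        const int tile_mem_bytes = 32 * 1024;   /* assumed on-chip tile memory */

        /* Case A: one RGBA8 color target + D24S8 depth = 8 bytes/pixel */
        int px_a = tile_mem_bytes / 8;          /* 4096 px, e.g. a 64x64 tile */

        /* Case B: four RGBA16F targets + depth = 4*8 + 4 = 36 bytes/pixel */
        int px_b = tile_mem_bytes / 36;         /* ~910 px, roughly a 32x28 tile */

        printf("Case A: %d pixels per tile\n", px_a);
        printf("Case B: %d pixels per tile\n", px_b);
        return 0;
    }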
I don't know how modern GPUs work exactly, but you could enforce security on physical addresses with circuitry significantly less demanding than a TLB: use big contiguous swaths of memory so the bounds checks are guaranteed to fit in your memory logic, with no buffers or caches needed.
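Purely as an illustration of that kind of scheme (nothing here reflects how any real GPU does it): a per-context base/limit pair checked in the memory path, instead of page tables and a TLB.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t base;    /* start of the context's physical window */
        uint64_t limit;   /* one past the end of the window */
    } PhysWindow;

    /* Range check on a physical access: no page walk, no TLB, just comparators. */
    static bool access_allowed(const PhysWindow *w, uint64_t addr, uint64_t len) {
        return addr >= w->base && addr <= w->limit && len <= w->limit - addr;
    }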
While the AGX does have MMUs and TLBs, unified memory and TLBs are orthogonal. For instance the Raspberry Pi's SoC has unified memory, but no MMU/TLB on the GPU. That IP block can only access physical memory.