The author of those tweets doesn't seem to understand what a TLB is (their terminology is off), doesn't seem to understand how GPUs work (they are explicitly designed to hide memory stalls by switching to different warps, which is why they have huge register files), and doesn't seem to understand memory hierarchies (why would a GPU have its own TLB in the first place?). Besides, we know that M1 comes with a 3072-entry TLB that matches its maximal 48MB cache size, so where did they come up with the "32MB" figure anyway?
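For what it's worth, the arithmetic behind that coverage claim, assuming Apple's 16 KB page size:

    3072 TLB entries × 16 KB per page = 49152 KB = 48 MB reachable without a page walk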
It is entirely possible that M1 Ultra is bandwidth starved for some workloads, but that can be demonstrated using GPU profiling tools (which Apple readily provides) and not by musing about technical concepts that one barely understands.
> why would a GPU have its own TLB in the first place?
The answer is that modern GPUs have their own MMUs, and have had for quite a while. More or less ubiquitous MMUs on GPUs are half the reason why Mantle/Vulkan/DX12/Metal came into being.
Thanks for this! Can you elaborate a bit more and maybe give me some pointers to where I can read about these things in more depth? I of course know about the MMU on traditional dGPUs, but I was not aware that “integrated” GPUs also have a separate MMU. How does it work, and why is it necessary given that the GPU shares the last-level cache and the memory controller with the rest of the system?
Because the TLBs and other MMU hardware aren't in the last-level cache or the memory controller on the vast majority of systems; they normally sit at about the L1 level on each core. This is because you want at least L2 to speak entirely in physical addresses, so the coherency protocol isn't confused by the same page being mapped in different ways. L1 is more complex, typically being virtually indexed but physically tagged. When an access is issued with a virtual address, the TLB translates it to a physical address in parallel with the cache set lookup, and the physical address is then compared against the tags of the ways in the set selected by the virtual address in order to pick the actual cache line for the op.
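If it helps, here's a minimal sketch of that virtually indexed, physically tagged lookup in C. The cache geometry, the identity-mapping tlb_translate(), and all the names are made up purely for illustration, and in real hardware the translation and the set lookup happen in parallel rather than sequentially as written here:

    /* Toy parameters -- not any real core's geometry. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define LINE_BITS 6                     /* 64-byte lines                  */
    #define SET_BITS  6                     /* 64 sets                        */
    #define WAYS      8                     /* 8-way set associative (32 KB)  */
    #define PAGE_BITS 14                    /* 16 KB pages                    */

    typedef struct {
        bool     valid;
        uint64_t phys_tag;                  /* paddr >> (LINE_BITS + SET_BITS) */
        uint8_t  data[1u << LINE_BITS];
    } line_t;

    static line_t l1[1u << SET_BITS][WAYS];

    /* Stand-in for the TLB: virtual page number -> physical page number.
       Just an identity mapping here so the sketch compiles and runs. */
    static uint64_t tlb_translate(uint64_t vpn) { return vpn; }

    /* Virtually indexed, physically tagged: the set index comes straight
       from the virtual address, the TLB supplies the physical address, and
       the physical tag picks the way within the set. */
    static line_t *l1_lookup(uint64_t vaddr)
    {
        uint64_t set   = (vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1);
        uint64_t ppn   = tlb_translate(vaddr >> PAGE_BITS);
        uint64_t paddr = (ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        uint64_t ptag  = paddr >> (LINE_BITS + SET_BITS);

        for (int way = 0; way < WAYS; way++)
            if (l1[set][way].valid && l1[set][way].phys_tag == ptag)
                return &l1[set][way];       /* hit */
        return NULL;                        /* miss: go to L2 with paddr */
    }

    int main(void)
    {
        /* Probe an arbitrary address; every line is invalid, so expect a miss. */
        return l1_lookup(0x12345678u) == NULL ? 0 : 1;
    }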