Well, the claim is also that it isn't generally an issue for the M1's architecture until you scale up to 32 cores or more [1], which is beefier than any other current GPU outside of AMD's and NVIDIA's offerings. So it's completely believable that the GPUs you worked on could have used virtual memory for everything without it ever becoming a performance issue.
(alternatively: it looks like the M1 Pro only has 24 MB L3, so it's likely that actual cache misses become an issue before TLB misses, but the Max/Ultra have more cache so that could invert)
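To put a rough number on the cache-vs-TLB comparison, here's a back-of-the-envelope sketch. It asks how many TLB entries you would need for the TLB's reach (entries times page size) to cover the whole 24 MB L3. The 16 KiB page size is my assumption, not something stated in this thread, and `tlb_entries_to_cover` is just an illustrative helper name.

```python
def tlb_entries_to_cover(cache_bytes: int, page_bytes: int) -> int:
    """Number of TLB entries needed so that TLB reach >= cache size."""
    # Ceiling division: a partially used page still needs its own entry.
    return -(-cache_bytes // page_bytes)


KIB = 1024
MIB = 1024 * KIB

# 24 MB is the L3 figure quoted above; the 16 KiB page size is an assumption.
entries = tlb_entries_to_cover(24 * MIB, 16 * KIB)
print(entries)  # 1536
```

If the real TLB holds fewer entries than that, a working set can still fit in cache while already missing in the TLB, and with more entries it goes the other way. I don't know the actual entry count, so this only frames the question.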
The claim is that if you don't "optimize" for TBDR then you are gated by the TLB and that software written for immediate mode rendering isn't optimized this way.
However, draw call batching and minimizing render target (and state!) switching are all bread-and-butter performance optimizations you'd make for an immediate mode renderer as well. You can get away with more draw calls, and there are specific things you can do on a per-architecture basis, but to the best of my knowledge they don't involve anything having to do with TLBs. If someone wants to spill the beans on how the M1 is somehow different here, I'm all ears, but it just doesn't line up with most GPU architectures I've seen.
The big wins on TBR versus IMR usually come from using fewer render targets[1]: less per-pixel data means larger tiles fit in tile memory, and larger tiles make tile dispatch more efficient.
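A rough sketch of the arithmetic behind that: on-chip tile memory is fixed, so the more bytes each pixel needs across all bound render targets, the smaller the tile has to be. The 64 KiB tile memory figure, the power-of-two square tiles, and the `max_square_tile` helper are all assumptions for illustration, not numbers from any particular GPU.

```python
def max_square_tile(tile_memory_bytes: int, bytes_per_pixel_per_target: list[int]) -> int:
    """Largest power-of-two square tile edge whose pixels fit in tile memory."""
    bytes_per_pixel = sum(bytes_per_pixel_per_target)
    if bytes_per_pixel <= 0 or bytes_per_pixel > tile_memory_bytes:
        raise ValueError("per-pixel footprint must be positive and fit in tile memory")
    edge = 1
    # Keep doubling the edge while the doubled tile still fits.
    while (edge * 2) ** 2 * bytes_per_pixel <= tile_memory_bytes:
        edge *= 2
    return edge


TILE_MEMORY = 64 * 1024  # hypothetical 64 KiB of tile memory

# One RGBA8 color target plus a 32-bit depth buffer: 8 bytes per pixel.
print(max_square_tile(TILE_MEMORY, [4, 4]))           # 64

# Four RGBA8 targets plus depth (a G-buffer-style setup): 20 bytes per pixel.
print(max_square_tile(TILE_MEMORY, [4, 4, 4, 4, 4]))  # 32
```

Under these assumptions, going from one color target to four halves the tile edge, so the same frame needs four times as many tiles.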
[1] https://twitter.com/VadimYuryev/status/1514295707481501700