monocasa on April 14, 2022 | on: Apple's M1 Ultra comes with a 32MB TLB bottleneck

Looking into it more, AGX2 (like pretty much every fairly high-perf modern GPU) is heavily SMT, allowing up to 1024 simultaneous threads per core; the actual number depends on how many registers each shader invocation needs.

https://rosenzweig.io/blog/asahi-gpu-part-3.html
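
To make the register/thread trade-off concrete, here's a rough sketch of how that kind of occupancy limit is usually modeled. The 1024 cap is the figure from above; the register file size is a made-up number purely for illustration (I'm not claiming it's AGX2's real value), and `resident_threads` is a hypothetical helper, not anything from a real driver.

```python
# Upper bound on simultaneous threads per core, as quoted in the comment.
MAX_THREADS_PER_CORE = 1024

# HYPOTHETICAL register file size (in registers) per core.
# Chosen only to make the arithmetic easy; not a real AGX2 figure.
ASSUMED_REGISTER_FILE = 65536


def resident_threads(regs_per_thread,
                     register_file=ASSUMED_REGISTER_FILE,
                     max_threads=MAX_THREADS_PER_CORE):
    """Estimate how many threads fit on one core at once.

    Simplified model: every resident thread holds its own slice of the
    register file, so the count is capped both by the hardware thread
    limit and by how many slices fit in the file.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(max_threads, register_file // regs_per_thread)


if __name__ == "__main__":
    for regs in (32, 64, 128, 256):
        print(regs, "regs/thread ->", resident_threads(regs), "threads")
```

Under these assumed numbers, a shader stays at the full 1024 threads up to 64 registers per invocation, and every doubling of register usage after that halves the number of threads available to hide memory latency. Real hardware typically allocates registers in coarser steps, so treat this as the shape of the curve rather than exact values.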