It sounds like it might be possible, at least at the hardware level. https://github.com/xmrig/xmrig/issues/2060#issuecomment-7702... (bug on some random crypto miner benchmark software, I guess) mentions a 32 MiB page size, although that might be with Linux on M1 rather than macOS on M1.
This "32MB TLB bottleneck" is a weird thing to say. The TLB size is typically stated in terms of pages, right? With huge pages (or "superpages", as macOS may call them?), that should be a lot more than 32MB of total memory.
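To put rough numbers on that: TLB reach is just entry count times page size. A quick sketch, where the entry count is a made-up figure for illustration and not a measured M1 value:

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space a TLB can map without taking a miss."""
    return entries * page_size


# ENTRIES is hypothetical, chosen only to show how reach scales with page size.
ENTRIES = 2048

for label, size in [("16 KiB page", 16 * KIB), ("32 MiB block", 32 * MIB)]:
    reach = tlb_reach(ENTRIES, size)
    print(f"{label}: {reach // MIB} MiB of reach")
```

With the same number of entries, going from 16 KiB pages to 32 MiB blocks multiplies the reach by 2048, so a fixed "32MB" figure only makes sense for one particular page size.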
It absolutely can! On ARM you can just put a block entry in a non-leaf page table to map an entire block of physically contiguous memory with a single TLB entry. From the ARM docs [1], any CPU supporting 16K translation granules will support 32MB L2 blocks.
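For anyone wondering where 32MB comes from, it falls out of the granule size: a table occupies one granule and holds 8-byte descriptors, so a level 2 block covers one full level 3 table's worth of pages. A small sketch of that arithmetic (the function name is mine, not anything from the ARM docs):

```python
KIB = 1024
MIB = 1024 * KIB

# AArch64 translation table descriptors are 64 bits wide.
DESCRIPTOR_BYTES = 8


def l2_block_size(granule: int) -> int:
    """Size mapped by one level 2 block entry for a given translation granule."""
    entries_per_table = granule // DESCRIPTOR_BYTES
    # One L2 block spans everything a full L3 table of pages would map.
    return granule * entries_per_table


for granule in (4 * KIB, 16 * KIB, 64 * KIB):
    block = l2_block_size(granule)
    print(f"{granule // KIB}K granule: {block // MIB} MiB L2 block")
```

The 4K case gives the familiar 2 MiB huge page, and the 16K case gives the 32 MiB block discussed here.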
I suspect, however, that Apple can't really allocate that many (if any at all) 32MB regions of memory at runtime due to fragmentation, unless they've substantially changed their contiguous memory allocator since I last looked.
> I suspect, however, that Apple can't really allocate that many (if any at all) 32MB regions of memory at runtime due to fragmentation, unless they've substantially changed their contiguous memory allocator since I last looked.
That's unfortunate, but at least it's something they could fix in upcoming software versions. Some folks have put a lot of effort into huge pages on Linux, and it's still ongoing. [1] Not too surprising that macOS could have some room for improvement...