I am surprised they allow only 32k tokens when Reformer can reach a context length of 1M on 16GB of VRAM. It seems like they still have ways to optimize it further.
It's not. Reformer uses locality-sensitive hashing to reduce attention complexity from O(n^2) to O(n log n), so in 16GB it can match the performance of a model that would otherwise need 100GB. But nobody scaled it up to 1000 GPUs, because its purpose was the opposite: fitting long contexts on modest hardware.
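Roughly, the bucketing trick looks like this. This is just a toy numpy sketch of angular LSH plus per-bucket attention, not the actual Reformer implementation (which also sorts by bucket, attends within fixed-size chunks and their neighbors, uses multiple hash rounds, and applies causal masking); function names here are mine:

```python
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    # Angular LSH: project vectors through a random matrix and take
    # the argmax over the concatenation [Rx, -Rx]. Nearby vectors
    # tend to land in the same bucket.
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((x.shape[-1], n_buckets // 2))
    rotated = x @ r  # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets=8):
    # Reformer ties queries and keys into one shared tensor qk.
    seq_len, d = qk.shape
    buckets = lsh_buckets(qk, n_buckets)
    out = np.zeros_like(v)
    for b in range(n_buckets):
        idx = np.where(buckets == b)[0]
        if idx.size == 0:
            continue
        # Full softmax attention only *within* a bucket, so cost
        # scales with bucket size squared, not seq_len squared.
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Toy usage: 1,024 positions, 64-dim heads.
qk = np.random.default_rng(1).standard_normal((1024, 64))
v = np.random.default_rng(2).standard_normal((1024, 64))
print(lsh_attention(qk, v).shape)  # (1024, 64)
```

The point is that each query only attends to the handful of keys hashed into its bucket, which is why memory stays flat as the sequence grows.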