I am surprised they allow only 32k tokens when Reformer can reach a context length of 1M on 16GB of VRAM. It seems like they still have ways to optimize it further.
It's not. Reformer uses locality-sensitive hashing to reduce attention complexity from O(n^2) to O(n log n), so in 16GB it can match the performance of a model that would otherwise need 100GB. But nobody scaled it up to 1000 GPUs, because its purpose was the opposite: fitting long contexts on modest hardware.
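Roughly, the bucketing trick looks like this. This is just a toy numpy sketch of angular LSH plus per-bucket attention, not the actual Reformer implementation (which also sorts by bucket, attends within fixed-size chunks and their neighbors, uses multiple hash rounds, and applies causal masking); function names here are mine:

```python
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    # Angular LSH: project vectors through a random matrix and take
    # the argmax over the concatenation [Rx, -Rx]. Nearby vectors
    # tend to land in the same bucket.
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((x.shape[-1], n_buckets // 2))
    rotated = x @ r  # (seq_len, n_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets=8):
    # Reformer ties queries and keys into one shared tensor qk.
    seq_len, d = qk.shape
    buckets = lsh_buckets(qk, n_buckets)
    out = np.zeros_like(v)
    for b in range(n_buckets):
        idx = np.where(buckets == b)[0]
        if idx.size == 0:
            continue
        # Full softmax attention only *within* a bucket, so cost
        # scales with bucket size squared, not seq_len squared.
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Toy usage: 1,024 positions, 64-dim heads.
qk = np.random.default_rng(1).standard_normal((1024, 64))
v = np.random.default_rng(2).standard_normal((1024, 64))
print(lsh_attention(qk, v).shape)  # (1024, 64)
```

The point is that each query only attends to the handful of keys hashed into its bucket, which is why memory stays flat as the sequence grows.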