
You forgot the corollary. What transformers fundamentally reason about is a tensor of shape N x number of input tokens x per-head embedding size (N = number of attention heads). That's the "latent space" between two layers of a transformer; it's what attention produces, and in almost all transformers it has the same shape at every layer except the first and the last. Now if you look at this, you might notice ... that's pretty huge for a latent space. Convolutional nets had latent spaces that gradually shrank down to maybe 100 numbers, often even fewer. The big transformers have an enormous one: for GPT-3 it's 96 x 4096 x 128. That is a hell of a lot of numbers between two layers. And it just keeps the entire input (up to that point) in memory as it slowly fills up the "context". What then reasons about this data is a "layer" of the transformer, which is more or less a resnetted deep neural network.
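A minimal numpy sketch of where that shape comes from; the dimensions below are tiny toy values (not GPT-3's), and the single layer of attention here is only meant to show that the per-layer output has shape (heads, tokens, per-head dim):

    import numpy as np

    # Toy dimensions (illustrative only, far smaller than GPT-3's 96 x 4096 x 128)
    n_heads, seq_len, d_head = 4, 16, 32
    d_model = n_heads * d_head

    rng = np.random.default_rng(0)
    x  = rng.normal(size=(seq_len, d_model))           # input to one layer
    Wq = rng.normal(size=(n_heads, d_model, d_head))   # per-head projections
    Wk = rng.normal(size=(n_heads, d_model, d_head))
    Wv = rng.normal(size=(n_heads, d_model, d_head))

    def softmax(a, axis=-1):
        a = a - a.max(axis=axis, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=axis, keepdims=True)

    q = np.einsum('sd,hde->hse', x, Wq)
    k = np.einsum('sd,hde->hse', x, Wk)
    v = np.einsum('sd,hde->hse', x, Wv)
    scores = np.einsum('hse,hte->hst', q, k) / np.sqrt(d_head)
    out = np.einsum('hst,hte->hse', softmax(scores), v)

    print(out.shape)   # (n_heads, seq_len, d_head) -- the "latent space" between layers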

But convnets were fundamentally limited in how large a latent space they could "think" in; the biggest I've seen were around 1000 dimensions, because we couldn't keep their thinking stable with more dimensions than that. But ... we do know how to do that now.

You could look at this to figure out what transformers do if you radically simplify. Nobody can imagine a 100,000-dimensional space. Just doesn't work, does it? But let's say we have a hypothetical transformer with a context size of 2. Let's call token 1 "x" and token 2 "y". You probably see where I'm going with this. This transformer will learn to navigate a plane in a way similar to what it's seen in the training data. "If near (5, 5), go north by 32" might be what one neuron in one layer does. This is no different in 100,000 dimensions, except now everybody's lost.
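A minimal sketch of that 2-D thought experiment (the rule, the radius, and the function name are invented for illustration):

    import math

    # One hypothetical "neuron": if the current point (x, y) is near (5, 5),
    # push it north by 32; otherwise leave it alone. A transformer layer is a
    # big stack of rules like this, just in a vastly higher-dimensional space.
    def toy_rule(x, y, center=(5.0, 5.0), radius=2.0, step=32.0):
        if math.hypot(x - center[0], y - center[1]) < radius:
            y += step
        return x, y

    print(toy_rule(5.5, 4.8))   # near (5, 5) -> (5.5, 36.8)
    print(toy_rule(0.0, 0.0))   # far away    -> (0.0, 0.0)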

But ... what happens in a convnet with a latent space of 50,000? 100,000? 1,000,000? What happens, for that matter, in a simple deep neural network (i.e. just fully connected layers + softmax) of that size? This was never really tried, for two reasons: the hardware couldn't do it at the time, AND the math wouldn't support it (we didn't know how to deal with some of the resulting problems; you'd likely need to "resnet" both convnets and plain deep networks, for example).
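A sketch of what "resnetting" a plain fully connected network means, in numpy; the width, depth, layer-norm placement, and weight sharing are my own illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def layer_norm(x, eps=1e-5):
        # Per-vector normalization; one of the tricks that keeps wide nets stable.
        return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

    def residual_block(x, W1, W2):
        # A plain two-layer MLP ...
        h = np.maximum(0.0, layer_norm(x) @ W1)   # ReLU
        # ... plus the skip connection ("resnetting") that lets many such
        # blocks be stacked without the signal blowing up or dying out.
        return x + h @ W2

    d = 512        # toy width; the thought experiment is 50,000+ (memory permitting)
    x = rng.normal(size=(1, d))
    W1 = rng.normal(scale=d ** -0.5, size=(d, 4 * d))
    W2 = rng.normal(scale=(4 * d) ** -0.5, size=(4 * d, d))

    for _ in range(24):            # stack 24 residual blocks
        x = residual_block(x, W1, W2)
    print(x.shape, float(np.abs(x).mean()))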

Would the "old architectures" just work with such an incredibly massive latent space?

And there's the other side as well: improving transformers ... what about including MUCH more in the context? A long list of previous conversations, for example. The entire text of textbooks: things like a multiplication table, a list of ways triangles can be proven congruent, the periodic table, physical constants, the expansion rules for differential calculus, "Physics for Scientists and Engineers", the whole thing. Yes, that will absolutely blow out the latent space, but clearly we've decided that a billion or two of extra investment will still allow us to calculate the output.
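A back-of-envelope estimate of what blowing out the context costs in memory alone, assuming GPT-3's published dimensions (96 layers, 96 heads, 128 dims per head) and fp16 keys/values; the context lengths are arbitrary examples:

    # Rough cost of keeping the whole input "in memory": the key/value cache alone
    # grows linearly with context length (and attention compute grows quadratically).
    n_layers, n_heads, d_head = 96, 96, 128      # GPT-3-scale dimensions
    bytes_per_value = 2                          # fp16

    for context in (4_096, 128_000, 1_000_000):  # arbitrary context lengths
        kv_bytes = 2 * n_layers * n_heads * d_head * context * bytes_per_value  # keys + values
        print(f"{context:>9} tokens -> {kv_bytes / 2**30:8.1f} GiB of KV cache")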


