That's not how attention works though. In principle it's perfectly capable of figuring out which parts are important and which aren't; the real problems are that it scales quadratically with context length, so it gets expensive fast beyond small contexts, and that it operates on a token-to-token basis instead of hierarchically over sentences, paragraphs, and sections. The only models that actually do long context get there by skipping attention layers, swapping in something attention-free, or dropping positional encodings, all leading to shit performance. Nobody pretrains on more than like 8k tokens of context, except maybe Google, who can throw TPUs at the problem.
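To put rough numbers on the scaling problem: here's a minimal numpy sketch of vanilla scaled dot-product attention (single head, no batching, toy sizes picked for illustration). The score matrix is (n, n), so doubling the context quadruples the memory and compute, which is basically why pretraining past ~8k gets painful.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention (single head, no batching)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ V                             # (n, d)

# Fine at toy sizes:
n, d = 16, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)

# The (n, n) score matrix is what kills you at long context:
for n in (1_000, 8_000, 64_000):
    print(f"n={n:>6}: float32 score matrix = {n * n * 4 / 1e9:6.2f} GB per head per layer")
```

At n=8k that matrix is already ~0.26 GB per head per layer, and at 64k it's ~16 GB, before you even count activations or the backward pass.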