
That's not how attention works, though. It should be perfectly capable of figuring out which parts are important and which aren't; the real problem is that it doesn't scale beyond small contexts, and it operates on a token-to-token basis instead of hierarchically over sentences, paragraphs, and sections. The only models that actually do long context do so by skipping attention layers, or by doing something without attention or without positional encodings, all of which leads to shit performance. Nobody pretrains on more than about 8k tokens, except maybe Google, who can throw TPUs at the problem.
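
To make the scaling point concrete, here's a minimal sketch of plain scaled dot-product attention (single head, NumPy, illustrative shapes only, not any particular model's implementation): every token scores against every other token, so the score matrix grows quadratically with context length.

    import numpy as np

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: (seq_len, d) -- one head, no batching, purely illustrative
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                    # (seq_len, seq_len): token-to-token
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
        return weights @ v                               # (seq_len, d)

    # At 1k tokens the score matrix has ~1M entries per head per layer;
    # at 8k it's ~67M, at 128k it's ~16B -- which is why naive attention
    # is rarely pretrained much beyond a few thousand tokens of context.
    seq_len, d = 1024, 64
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((seq_len, d)) for _ in range(3))
    print(scaled_dot_product_attention(q, k, v).shape)   # (1024, 64)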



