When I was doing some NLP stuff a few years ago, I downloaded a few blobs of Common Crawl data, i.e. the kind of thing GPT was trained on. I was sort of horrified by the subject matter and quality: spam, advertisements, flame wars, porn... and that seems to be the vast majority of internet content. (If you've talked to a model without RLHF like one of the base Llama models, you may notice the personality is... different!)
I also started wondering about the utility of spending most of the network memorizing infinite trivia (even excluding most of the content above, which is trash), when LLMs don't really excel at that anyway, and need to Google it regardless to give you a source. (Aside: I've heard some people have good luck with "hallucinate then verify" with RAG / Googling...)
In other words, what if we put those neurons to better use? Then I found the Phi-1 paper, which did exactly that. Instead of training the model on slop, they trained it on textbooks! And instead of starting with PhD-level stuff, they started with kid-level stuff and gradually increased the difficulty.
You can get rid of the trivia by training one model on the slop, then training a second model on the first one's outputs, which is called distillation or teacher-student training. But the trivia isn't much of a problem anyway: regularization during training should discourage the model from memorizing random noise.
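To make the distillation idea concrete, here's a toy sketch: a "student" model trained only on a "teacher's" softened output distribution, never on the teacher's noisy training signal. Everything here (the linear models, the toy data, the temperature value) is made up for illustration; real distillation works the same way but on transformer logits at much larger scale.

```python
# Toy teacher-student distillation sketch. All names and data are
# hypothetical; this just demonstrates the mechanism.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Pretend "teacher" logits over 3 classes for 200 inputs: a clean
# linear signal plus noise the teacher memorized from its training data.
X = rng.normal(size=(200, 4))
W_teacher = rng.normal(size=(4, 3))
teacher_logits = X @ W_teacher + 0.5 * rng.normal(size=(200, 3))

# Student: trained to match the teacher's *softened* distribution.
# Temperature T > 1 smooths the targets, so the student learns the
# teacher's generalizations rather than its per-example noise.
T = 2.0
targets = softmax(teacher_logits, T)

W_student = np.zeros((4, 3))
lr = 0.5
for _ in range(500):
    probs = softmax(X @ W_student, T)
    # Gradient of cross-entropy w.r.t. student weights (the 1/T factor
    # comes from the temperature-scaled softmax).
    grad = X.T @ (probs - targets) / (T * len(X))
    W_student -= lr * grad

# The student ends up agreeing with the teacher's top predictions
# without ever seeing the teacher's raw training data.
agreement = np.mean(
    softmax(X @ W_student).argmax(-1) == softmax(teacher_logits).argmax(-1)
)
```

The point of the temperature is exactly the "fail to learn it, in a useful way" effect: the softened targets carry the teacher's judgments about class similarity while washing out the memorized noise.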
The reason LLMs work isn't because they learn the whole internet, it's because they try to learn it but then fail to, in a useful way.
If anything, current models are overly optimized away from this; I get the feeling they mostly want to tell you things from Wikipedia. You don't get a lot of answers that look like they came from a book.
What will we think of next...