Various other models also think they're ChatGPT or built by OpenAI, or at least those are the highest-probability tokens whenever the topic is an AI model or an AI company, because of how heavily they dominate the training data (the internet). It isn't the big reveal it is often held up to be.
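To make the "highest-probability tokens" point concrete, here is a toy sketch. It is only a frequency counter, not how a real LLM works, and the corpus and the name `ExampleLab` are made up for illustration. The point is just that whatever continuation dominates the data wins.

```python
from collections import Counter

def most_likely_next(corpus, prefix):
    """Return (word, relative frequency) of the most common word following `prefix`."""
    counts = Counter()
    for sentence in corpus:
        if sentence.startswith(prefix + " "):
            rest = sentence[len(prefix):].split()
            if rest:
                counts[rest[0]] += 1
    if not counts:
        raise ValueError("prefix never appears in the corpus")
    word, n = counts.most_common(1)[0]
    return word, n / sum(counts.values())

# Hypothetical toy corpus: one company name is simply mentioned far more often.
toy_corpus = ["I was built by OpenAI"] * 8 + ["I was built by ExampleLab"] * 2

word, prob = most_likely_next(toy_corpus, "I was built by")
print(word, prob)
```

A model trained on text where one name overwhelmingly follows "built by" will tend to emit that name, regardless of who actually trained it.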

Add that training off of ChatGPT wouldn't reduce their training costs at all; it would increase them. It is literally all of the same training difficulty, plus paying OpenAI for an enormous number of API calls. Not really seeing the win.
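Back-of-the-envelope version of that argument. Every number here is a placeholder I made up, not anyone's real compute bill or API price; the only thing that matters is that the API spend is added on top of the same compute cost.

```python
def training_cost(compute_cost_usd, synthetic_tokens=0, api_price_per_million_usd=0.0):
    """Total cost = compute for training + whatever was paid to generate data via an API."""
    api_cost = synthetic_tokens / 1_000_000 * api_price_per_million_usd
    return compute_cost_usd + api_cost

# Hypothetical figures, for illustration only.
COMPUTE_USD = 5_000_000            # same training run either way
SYNTHETIC_TOKENS = 1_000_000_000_000
PRICE_PER_MILLION_USD = 10.0

baseline = training_cost(COMPUTE_USD)
with_api_data = training_cost(COMPUTE_USD, SYNTHETIC_TOKENS, PRICE_PER_MILLION_USD)

print(f"baseline:      ${baseline:,.0f}")
print(f"with API data: ${with_api_data:,.0f}")
print(f"extra spend:   ${with_api_data - baseline:,.0f}")
```

Under these assumptions the compute term never shrinks, so the second number can only be larger than the first.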

>The paper describes the corpus only in vague ways.

Anyone who runs a public website has logs absolutely filled with a seemingly infinite number of information aggregators. Just like everyone else, they scraped the entire internet, pulled in all of Wikipedia, etc. Probably lots of pirated books, movie transcripts, etc.

That training could be done more effectively makes intuitive sense to everyone in the field; we just hadn't made that leap. A human doesn't learn to recognize digits by studying 60,000 training digits and then suddenly fail when a real-world digit is slightly rotated or morphed in some way. Improvements of that kind are now being made to how content is ingested.
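For the digit analogy, the classic fix is data augmentation: train on slightly rotated copies so a small rotation at test time isn't a surprise. A minimal pure-Python sketch follows, assuming a square grayscale image stored as a list of rows. The function names and the 15-degree limit are my own choices, and this says nothing about what any particular lab actually does.

```python
import math
import random

def rotate(image, degrees):
    """Rotate a square image (list of rows) about its centre, nearest-neighbour sampling."""
    n = len(image)
    c = (n - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Inverse mapping: find the source pixel that lands on (x, y).
            sx = cos_t * (x - c) + sin_t * (y - c) + c
            sy = -sin_t * (x - c) + cos_t * (y - c) + c
            ix, iy = round(sx), round(sy)
            if 0 <= ix < n and 0 <= iy < n:
                out[y][x] = image[iy][ix]
    return out

def augment(image, copies=4, max_degrees=15.0, seed=0):
    """Return `copies` randomly rotated variants of one training image."""
    rng = random.Random(seed)
    return [rotate(image, rng.uniform(-max_degrees, max_degrees)) for _ in range(copies)]

# Tiny 5x5 "digit 1": a vertical stroke down the middle column.
one = [[1 if x == 2 else 0 for x in range(5)] for _ in range(5)]
variants = augment(one)
```

Each variant keeps the original label, so the model sees many plausible versions of the same digit instead of one canonical pose.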