I don’t see why it’s unreasonable. Training a model that is an order of magnitude bigger requires (at least) an order of magnitude more data, and an order of magnitude more time, hardware, energy, and money.
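A rough back-of-the-envelope sketch of why “an order of magnitude bigger” is more than 10x the bill, assuming a Chinchilla-style compute-optimal ratio of roughly 20 training tokens per parameter and compute on the order of 6 × params × tokens; the constants are assumptions for illustration, not anyone’s actual training recipe:

```python
# Assumed constants for a back-of-the-envelope estimate only.
TOKENS_PER_PARAM = 20        # rough compute-optimal tokens-per-parameter ratio
FLOPS_PER_PARAM_TOKEN = 6    # rough training FLOPs per parameter per token

def training_budget(params: float) -> tuple[float, float]:
    """Return (tokens, FLOPs) implied by the assumed scaling constants."""
    tokens = TOKENS_PER_PARAM * params
    flops = FLOPS_PER_PARAM_TOKEN * params * tokens
    return tokens, flops

for p in (1e10, 1e11):  # 10B vs. 100B parameters
    tokens, flops = training_budget(p)
    print(f"{p:.0e} params -> {tokens:.0e} tokens, {flops:.0e} FLOPs")

# 10x the parameters means ~10x the tokens but ~100x the compute,
# so hardware, energy, and money scale worse than the data does.
```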
Getting an order of magnitude more data isn’t easy anymore. From GPT-2 to GPT-3 we (only) had to scale up to the internet. Now? You can look at other sources like video and audio, but those are inherently more expensive to collect and process, so your data acquisition costs aren’t linear anymore; they’re more like 50x or 100x. Quality also dips, because most speech (for example) isn’t high-quality prose: it’s full of fillers, rambling, and transcription inaccuracies.
And this still doesn’t fix fundamental long-tail issues. If there’s a concept the model needs to see 10 times to learn, you might think scaling your data 10x will fix it. But if that concept is rare, the extra data may not contain 10 more occurrences of it; it’s more likely to contain 9 other things that each appear once. So your model still won’t learn it.
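A toy simulation of that long-tail point, under the assumption that concept frequencies are roughly Zipfian (the exponent and counts here are illustrative, not measurements of any real corpus):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
ZIPF_A = 1.2  # assumed tail heaviness; real corpora are only roughly Zipfian

def rare_fraction(n_samples: int, min_count: int = 10) -> tuple[int, float]:
    """Sample concept occurrences and report how many concepts stay under-seen."""
    counts = Counter(rng.zipf(ZIPF_A, size=n_samples))
    n_concepts = len(counts)
    n_rare = sum(1 for c in counts.values() if c < min_count)
    return n_concepts, n_rare / n_concepts

for n in (1_000_000, 10_000_000):
    n_concepts, frac = rare_fraction(n)
    print(f"{n:>10,} samples: {n_concepts:,} distinct concepts, "
          f"{frac:.0%} seen fewer than 10 times")

# With 10x the data you mostly pick up *new* one-off concepts;
# the share of concepts seen fewer than 10 times barely moves.
```

The point of the sketch: because the tail keeps growing as you add data, the fraction of under-seen concepts stays roughly constant, which is why uniform 10x scaling doesn’t buy you 10 more sightings of the specific rare thing you care about.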