As someone else already mentioned, the scaling laws tell a different story empirically: we haven't hit diminishing returns at all, and there's no end in sight.
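To make "no diminishing returns" a bit more concrete, here's a tiny sketch of the power-law form reported in Kaplan et al. (2020), where loss falls smoothly as a power of compute. The constants below are just ballpark placeholders I've plugged in for illustration, not fitted values for any particular model:

```python
# Sketch of a compute scaling law: loss ~ (C_c / C)^alpha.
# c_crit and alpha are illustrative placeholders (roughly the ballpark
# of the Kaplan et al. 2020 fit), not measured constants.

def loss_from_compute(compute_pfdays: float,
                      c_crit: float = 3.1e8,
                      alpha: float = 0.050) -> float:
    """Predicted loss for a given compute budget (in PF-days)."""
    return (c_crit / compute_pfdays) ** alpha

# On a log-log plot this is a straight line: each 10x of compute buys a
# roughly constant fractional improvement -- it slows, but never plateaus.
for c in [1e-3, 1e-1, 1e1, 1e3, 1e5]:
    print(f"{c:10.0e} PF-days -> loss ~ {loss_from_compute(c):.3f}")
```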
But more anecdotally, LeCun's 1989 paper (arguably the first applied neural network paper) has pretty much the same format as the GPT paper: a large neural network trained on a large dataset (all relative to the era). https://karpathy.github.io/2022/03/14/lecun1989/
It really just seems that you need a certain number of FLOPs before certain capabilities can emerge.
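For a rough sense of what "a certain number of FLOPs" means in practice, the common back-of-the-envelope estimate is training compute ≈ 6 × parameters × training tokens. The model sizes and token counts below are hypothetical examples, not real training runs:

```python
# Rule-of-thumb training compute: C ~ 6 * N * D
# (N = parameter count, D = training tokens). Inputs are hypothetical.

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-the-envelope total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens

for n, d in [(1e8, 1e10), (1e9, 1e11), (1e11, 1e12)]:
    print(f"{n:.0e} params on {d:.0e} tokens ~ {approx_train_flops(n, d):.1e} FLOPs")
```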