If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~6x the 2 trillion tokens used for the Llama 2 training runs.
From that, it seems somewhat likely we've hit the wall on improving <X B parameter LLMs by simply scaling up the training data, which basically forces everyone to keep scaling up parameter counts if they want to keep up with SOTA.