First: while it's not technically incorrect to say that they're learning "patterns" in the training data, the word "pattern" here is doing a lot of work and hides a ton of detail. These aren't simple n-grams like "if the last N tokens were ___, then ___ follows." To generate fluent conversation, new code, or poetry, the model must learn highly abstract structures that start to resemble reasoning, inference, and world-modeling. You can't predict tokens well without building these higher-level capabilities on some level.
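To make the contrast concrete, here's a minimal sketch of what a literal "pattern table" would look like: a toy bigram model that only stores "token X was followed by token Y" counts. The tiny corpus and tokenization are made up for illustration; the point is that this kind of lookup has nothing to generalize with.

```python
from collections import Counter, defaultdict

# Toy bigram "pattern" model: if the last token was X, predict the most
# common follower of X seen in training. (Corpus is invented for illustration.)
corpus = "the cat sat on the mat and the cat slept".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(token: str) -> str:
    # Fails outright on anything it has never seen -- no generalization at all.
    if token not in followers:
        raise KeyError(f"no stored pattern for {token!r}")
    return followers[token].most_common(1)[0][0]

print(predict_next("the"))    # 'cat' -- a memorized co-occurrence
# predict_next("dog")         # KeyError -- nothing resembling reasoning or world-modeling
```

A model that could only do this would fall apart on any prompt that isn't a near-verbatim slice of its training data, which is exactly what fluent LLM output shows isn't happening.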
Second: Generative AI is about approximating an unknown data distribution. Every dataset - text, images, video - is treated as a sample from such a distribution. Success depends entirely on the model's ability to generalize outside the training set. For example, "This Person Does Not Exist" (https://this-person-does-not-exist.com/en) was trained on a dataset of 1024x1024 RGB images. Each image can be thought of as a vector in a 1024x1024x3 = 3145728-dimensional space, and since every coordinate is in [0,1], these vectors all live inside a 3145728-dimensional hypercube. But almost all points in that hypercube are random noise that doesn't look like a person. The points that do look like a person lie on a lower-dimensional manifold embedded in the hypercube. The goal of these models is to infer this manifold from the training data and generate a random point on it.
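Here's a small sketch of that arithmetic and of what "a random point in the hypercube" means. The image size comes from the example above; the array shapes and normalization into [0,1] are assumptions for illustration, not details of that site's actual model.

```python
import numpy as np

# A 1024x1024 RGB image with channel values scaled into [0, 1]
# is one point in a 3,145,728-dimensional hypercube.
H, W, C = 1024, 1024, 3
dim = H * W * C
print(dim)  # 3145728

# A uniformly random point in that hypercube is just pixel noise, not a face:
# face-like images occupy a tiny, lower-dimensional region of the space.
rng = np.random.default_rng(0)
random_point = rng.uniform(0.0, 1.0, size=(H, W, C))

# A real training image would be flattened into the same space like this:
# vector = image.reshape(dim)
```

Sampling uniformly from the cube essentially never produces a face, which is why the model has to learn where the face-like region actually sits.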
Third: Models do what they're trained to do. Next-token prediction is one of those things, but not the whole story. A model that literally just memorized exact fragments would not be able to zero-shot new code examples at all; its weights would encode a nonlinear transformation that is only good at repeating those exact fragments. Instead, a huge amount of training effort goes into making the model generalize to new inputs, and it learns whatever other nonlinear transformation makes it good at doing that instead.
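For reference, this is a minimal sketch of the next-token objective being described: the model is scored on predicting token t+1 from the tokens up to t, via cross-entropy. The tensors here are random stand-ins and the shapes are assumptions; this is not the training code of any particular LLM.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 50_000, 128, 4

# Stand-in training batch and stand-in model output; in a real setup,
# logits would come from model(tokens[:, :-1]).
tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len - 1, vocab_size)

# Next-token cross-entropy: each position is graded on predicting the token after it.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```

Minimizing that loss over an enormous, varied corpus rewards whatever internal transformation predicts well on text the model has never seen, not one that merely replays stored fragments.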