If one were to build a Markov chain from the same amount of input data, would the result be the same? Markov chain chatbots have been around for years, just trained on a much more limited set of data.
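No. The state space has 50257^2048 possible states, i.e. every possible sequence of 2048 tokens drawn from a 50257-token vocabulary. The vast majority of those states have never been seen and never will be, in all of human history.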
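To get a feel for that number, here is a rough back-of-the-envelope sketch. The vocabulary size and context length are the figures from the answer; the corpus size of 10^12 tokens is purely a hypothetical round number picked for illustration, not a figure for any real model.

```python
import math

VOCAB_SIZE = 50257       # figure quoted in the answer
CONTEXT_LENGTH = 2048    # figure quoted in the answer
CORPUS_TOKENS = 10**12   # hypothetical corpus size, for illustration only

# log10(V^L) = L * log10(V); working in logs avoids printing a huge integer.
log10_states = CONTEXT_LENGTH * math.log10(VOCAB_SIZE)
digits = math.floor(log10_states) + 1

# A corpus of N tokens contains at most N distinct contexts, so the
# fraction of the state space it can cover is at most N / V^L.
log10_fraction_seen = math.log10(CORPUS_TOKENS) - log10_states

print(f"number of states has {digits} decimal digits")
print(f"fraction of states seen is at most 10^{log10_fraction_seen:.0f}")
```

Even with a far larger corpus the exponent barely moves: the state count has well over nine thousand digits, so a lookup table over full contexts would be almost entirely empty.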
For example, if your training set uses the words "rain" and "thunder" interchangeably a lot, but the word "today" appears only once, in the sentence "there is no rain today", then a Markov chain built from that data would never output "there is no thunder today", but a transformer might.
In other words, information compression (e.g. equating rain with thunder) isn't just a matter of practicality; it's a requirement for (the current generation of) good language models.
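The rain/thunder example can be checked with a toy first-order (bigram) Markov chain. This is only a minimal sketch: the corpus, the `build_chain` and `can_generate` helpers, and the `<s>`/`</s>` markers are all made up for illustration, and a real chatbot would use a longer context and sample from transition probabilities rather than just test reachability.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers

def build_chain(sentences):
    """Map each word to the set of words that ever follow it in the corpus."""
    transitions = defaultdict(set)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev].add(nxt)
    return transitions

def can_generate(transitions, sentence):
    """True if every word-to-word step in the sentence was seen in training."""
    tokens = [START] + sentence.split() + [END]
    return all(
        nxt in transitions.get(prev, set())
        for prev, nxt in zip(tokens, tokens[1:])
    )

# Toy corpus: "rain" and "thunder" are interchangeable everywhere,
# but "today" appears exactly once, after "rain".
corpus = [
    "we expect rain tonight",
    "we expect thunder tonight",
    "there is no rain tonight",
    "there is no thunder tonight",
    "there is no rain today",
]
chain = build_chain(corpus)

seen_sentence = can_generate(chain, "there is no rain today")
swapped_sentence = can_generate(chain, "there is no thunder today")
print(seen_sentence, swapped_sentence)
```

The chain can reproduce "there is no rain today" but has no transition from "thunder" to "today", so the swapped sentence is unreachable no matter how it is sampled. Raising the order of the chain only makes this worse, since longer contexts are seen even more rarely.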