If one were to make a markov chain with the same amount of input data, would the...

n2d4 · on Dec 11, 2022

No. The state space has 50257^2048 possible states: 50257 possible tokens at each of 2048 context positions. The vast majority of those states have never been seen and never will be, in all of human history.
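To get a feel for that number, here is a rough back-of-the-envelope sketch. The vocabulary size and context length come from the comment above; the training-set size of 10^12 tokens is a purely hypothetical figure chosen for illustration, not a claim about any real model.

```python
import math

VOCAB_SIZE = 50257    # tokens per position, from the comment above
CONTEXT_LEN = 2048    # context positions, from the comment above

# Work in log10 so we never have to print the full integer.
log10_states = CONTEXT_LEN * math.log10(VOCAB_SIZE)
digits = math.floor(log10_states) + 1

# Hypothetical assumption: a corpus of 10^12 tokens can contain
# at most about 10^12 distinct full-length contexts.
LOG10_TOKENS_SEEN = 12
log10_fraction_seen = LOG10_TOKENS_SEEN - log10_states

print(f"number of states: about 10^{log10_states:.0f} ({digits} digits)")
print(f"fraction of states a 10^12-token corpus could cover: "
      f"at most about 10^{log10_fraction_seen:.0f}")
```

Even under a generous assumption about corpus size, a lookup-table Markov chain would have observed a vanishingly small fraction of its possible states.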

For example, if your training set uses the words "rain" and "thunder" interchangeably a lot, but the word "today" appears only once, in the sentence "there is no rain today", then a Markov chain built from that data would never output "there is no thunder today", but a transformer might.
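A minimal sketch of that example, assuming a first-order (bigram) word-level chain and a tiny made-up corpus. The sentences and the helper names `build_transitions` and `can_generate` are hypothetical, chosen only to illustrate the point.

```python
from collections import defaultdict

# Hypothetical toy corpus: "rain" and "thunder" appear in the same
# contexts, but "today" only ever follows "rain".
CORPUS = [
    "there is no rain today",
    "there is no rain here",
    "there is no thunder here",
    "we heard rain outside",
    "we heard thunder outside",
]

def build_transitions(sentences):
    """Map each word to the set of words observed directly after it."""
    transitions = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            transitions[current].add(following)
    return transitions

def can_generate(transitions, sentence):
    """True if every adjacent word pair was observed in the corpus."""
    words = sentence.split()
    return all(b in transitions.get(a, set()) for a, b in zip(words, words[1:]))

transitions = build_transitions(CORPUS)

print(can_generate(transitions, "there is no rain today"))     # seen verbatim
print(can_generate(transitions, "we heard rain here"))         # recombined from seen pairs
print(can_generate(transitions, "there is no thunder today"))  # needs unseen "thunder today"
```

The chain can recombine pairs it has seen, but it assigns zero probability to "thunder today" because that transition never occurred. It has no notion that "rain" and "thunder" behave alike, which is the generalization the comment attributes to the transformer.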

In other words, information compression (e.g. equating "rain" with "thunder") isn't just a matter of practicality; it's a necessary requirement for (the current generation of) good language models.

mysterydip · on Dec 11, 2022

Ah, that's what I was missing. Thanks!

igorkraw · on Dec 11, 2022

It is a Markov chain. Your input context is your state.
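A small sketch of what that means, under the assumption that the model only ever looks at a fixed-length window of tokens. The window length of 3 and the function name `step` are hypothetical, picked to keep the example readable.

```python
from typing import Tuple

State = Tuple[str, ...]

CONTEXT_LEN = 3  # hypothetical tiny window; the thread's figure is 2048

def step(state: State, token: str, context_len: int = CONTEXT_LEN) -> State:
    """Append the sampled token and keep only the last `context_len` tokens.

    The next state depends only on the current state and the token just
    emitted, which is the Markov property with "state" = the whole context.
    """
    return (state + (token,))[-context_len:]

state: State = ("there", "is", "no")
state = step(state, "rain")
print(state)   # the oldest token has been dropped from the window
state = step(state, "today")
print(state)
```

Seen this way, the difference from a classical Markov chain is not the Markov property itself but how the transition probabilities are represented: a learned function of the state rather than a table of counts over observed states.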