ChatGPT needs a language model and a selection model. The language model is a predictive model that, given a state, produces a probability distribution over the next token. For ChatGPT it's a decoder-only model (i.e. an auto-regressive / causal transformer), and the state is the fixed-length context window of preceding tokens.
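Reading "selection model" as the decoding/sampling step, here is a minimal sketch of that split. The `language_model` function is purely hypothetical (a real decoder runs a forward pass over the context window); only the separation of "score next tokens" from "pick one" is the point:

```python
import math
import random

# Hypothetical stand-in for a decoder: maps a context to a next-token distribution.
def language_model(context):
    return {"the": 0.5, "a": 0.3, "cat": 0.2}

# Selection step: sample from the distribution, with temperature as one common knob.
def select(probs, temperature=1.0):
    tokens = list(probs)
    weights = [math.exp(math.log(p) / temperature) for p in probs.values()]
    return random.choices(tokens, weights=weights, k=1)[0]

context = ("the", "cat")
print(select(language_model(context), temperature=0.7))
```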
For a Markov chain, you need to define what "state" means. In the simplest case you have a unigram model, where each next token is completely independent of all previously seen tokens. You can have a bigram model, where the next token depends only on the last token, or an n-gram model that conditions on the last n-1 tokens. A counting sketch is below.
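A minimal sketch of such an n-gram Markov chain, built by counting transitions over a toy corpus (corpus and function names are mine, purely illustrative):

```python
from collections import Counter, defaultdict

def build_ngram_chain(tokens, n=3):
    """Count transitions from each (n-1)-token state to the token that follows it."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        state = tuple(tokens[i : i + n - 1])
        chain[state][tokens[i + n - 1]] += 1
    return chain

def next_token_probs(chain, state):
    counts = chain[state]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

corpus = "the cat sat on the mat the cat ate".split()
chain = build_ngram_chain(corpus, n=3)
print(next_token_probs(chain, ("the", "cat")))  # {'sat': 0.5, 'ate': 0.5}
```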
The problem with building a Markov chain over an n-token state is that it simply doesn't generalize.
The chain may be missing states, so it can't always produce a probability distribution. E.g. since the state is a fixed window, the training data may contain a state like "AA" that transitions to B, giving the sentence "AAB". But the model keeps generating, so the new state becomes "AB", and if "AB" never occurs in the dataset... tough luck, you have to improvise. Work-arounds exist (back-off, smoothing), but their performance comes nowhere near even a basic RNN, let alone LSTMs and transformers.
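To make the unseen-state problem concrete, here is a sketch of a crude back-off over the same toy corpus: if the full (n-1)-token state was never seen, drop the oldest token and consult a lower-order chain. This is only an illustration of the work-around idea; real n-gram systems use proper schemes like Katz back-off or Kneser-Ney smoothing:

```python
import random
from collections import Counter, defaultdict

def build_chains(tokens, max_order=3):
    """chains[k] maps k-token states to Counters over the next token, for k = 1 .. max_order-1."""
    chains = {k: defaultdict(Counter) for k in range(1, max_order)}
    for k in chains:
        for i in range(len(tokens) - k):
            chains[k][tuple(tokens[i : i + k])][tokens[i + k]] += 1
    return chains

def sample_with_backoff(chains, state):
    # Try the longest state first; keep dropping the oldest token until some chain matches.
    for k in range(len(state), 0, -1):
        counts = chains[k].get(tuple(state[-k:]))
        if counts:
            tokens, weights = zip(*counts.items())
            return random.choices(tokens, weights=weights, k=1)[0]
    return None  # state is entirely unseen, even at the single-token level

corpus = "the cat sat on the mat the cat ate".split()
chains = build_chains(corpus, max_order=3)
# ("ate", "the") never occurs as a 2-token state, so we back off to ("the",).
print(sample_with_backoff(chains, ("ate", "the")))
```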