ChatGPT needs a language model and a selection model. The language model is a predictive model that, given a state, produces a probability distribution over the next token; the selection model decides which token to actually emit (greedy, sampling, etc.). For ChatGPT the language model is a decoder model (i.e. an auto-regressive / causal transformer). The state for the language model is a fixed-length window of tokens.
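A minimal sketch of that split, with a toy stand-in for the language model (the real thing is a decoder-only transformer; the uniform toy vocabulary and next_token_distribution here are placeholders, not how ChatGPT computes its distribution):

    import random

    VOCAB = ["the", "cat", "sat", "mat", "."]            # toy vocabulary (placeholder)

    def next_token_distribution(window):
        """Language model: maps a fixed-length window of tokens to a
        probability distribution over the vocabulary. Toy stand-in only."""
        return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

    def select_token(dist, temperature=1.0):
        """Selection model: sample one token from the predicted distribution."""
        tokens = list(dist)
        weights = [p ** (1.0 / temperature) for p in dist.values()]
        return random.choices(tokens, weights=weights, k=1)[0]

    def generate(prompt, steps, window_size=4096):
        tokens = list(prompt)
        for _ in range(steps):
            window = tokens[-window_size:]                # state = fixed-length window
            tokens.append(select_token(next_token_distribution(window)))
        return tokens

    print(" ".join(generate(["the", "cat"], steps=10)))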
For a Markov chain, you need to define what "state" means. In the simplest case you have a unigram where each next token is completely independent of all previously seen tokens. You can have a bi-gram model, where the next state is dependent on the last token, or an n-gram model that uses the last N-1 tokens.
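For concreteness, a toy chain with the state defined as the last N-1 tokens might look like this (a rough sketch assuming a pre-tokenized corpus and raw counts, nothing more):

    from collections import Counter, defaultdict
    import random

    def build_chain(tokens, order=2):
        """Map each state (tuple of `order` consecutive tokens) to counts of the next token."""
        chain = defaultdict(Counter)
        for i in range(len(tokens) - order):
            state = tuple(tokens[i:i + order])
            chain[state][tokens[i + order]] += 1
        return chain

    def step(chain, state):
        """Sample the next token from the empirical distribution stored for `state`."""
        counts = chain[state]
        return random.choices(list(counts), weights=list(counts.values()), k=1)[0]

    corpus = "the cat sat on the mat and the cat ate".split()
    chain = build_chain(corpus, order=2)
    print(step(chain, ("the", "cat")))   # 'sat' or 'ate', proportional to counts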
The problem with building a Markov chain with an n-token state is that it simply doesn't generalize at all.
The chain may also be missing states, in which case it can't produce a probability distribution at all. E.g. since we use a fixed window for the state, the training data might contain a state "AA" that transitions to "B", i.e. the sentence "AAB". The model keeps generating, though, so we need the new state, which is "AB". If "AB" never appears in the dataset, well... tough luck, you have to improvise on how to deal with it. Workarounds exist (e.g. backing off to a shorter context, or smoothing), but they come nowhere near the performance of even a basic RNN, let alone LSTMs and transformers.
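To make that failure mode concrete, here is the "AAB" example as a toy two-token-state chain:

    # Toy illustration of the missing-state problem described above.
    # The training data only ever contained "AAB", so the only known state is ("A", "A").
    chain = {("A", "A"): {"B": 1}}

    state = ("A", "A")
    nxt = "B"                    # the chain happily emits "B"
    state = (state[1], nxt)      # the new state is ("A", "B")
    print(state in chain)        # False: no distribution exists for this state,
                                 # so generation stalls unless we back off or smooth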
As a mathematical model, it's almost completely unhelpful, like saying that all computers are technically state machines because they have a finite amount of memory.
Treating every combination of 4k tokens as a separate state with independent probabilities is useless for making probability estimates.
Better to say that it's a stateless function that computes probabilities for the next token and leave Markov out of it.
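Some back-of-the-envelope numbers for why that framing breaks down: with a vocabulary of, say, 50,000 tokens (a rough GPT-scale figure, assumed here purely for illustration), the number of distinct 4,096-token windows, if each were its own state, is 50,000^4,096:

    import math

    vocab_size, window = 50_000, 4_096
    # log10 of the number of distinct windows if every window were a separate state
    print(window * math.log10(vocab_size))   # ~19,247, i.e. about 10^19,247 states

No corpus visits more than a vanishingly small fraction of those states even once, which is exactly why treating them as independent is hopeless.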
ChatGPT and Markov chains are both text-generating models, but they use different approaches and technologies. A Markov chain generates text based on probabilities of word sequences in a given text corpus, while ChatGPT is a neural-network-based model.
Compared to a Markov chain, ChatGPT is more advanced and capable of producing more coherent and contextually relevant text. It has a better grasp of language structure, grammar, and meaning, and can generate longer and more complex texts.
RLHF does have a Markov model at its backbone, at least in theory: the RL part is formulated as a Markov decision process (though the deep-NN function approximation inside may wash out any practical effect of that Markov assumption).
It's not a Markov chain because by definition a Markov chain only looks at the previous word. ChatGPT looks at a long sequence of previous words. But the general idea is still broadly the same.
That's not correct. In a Markov chain, the current state is a sufficient statistic for the future. For all intents and purposes, you can define the state to include a sufficiently long history, so it does look at a long sequence of words.
Also fair, but then the "current" state would also be a long window/sequence. Maybe that interpretation is valid if you look at the activations inside the network, but I wouldn't know about that.
Yes, the state for both is a long window / sequence. Under this view, the transformer doesn't need to recompute anything for the previous tokens: due to the causal nature of the model, the tokens at [0, ..., N-1] are oblivious to token N. For token N we can reuse the previous computations, since they do not change.
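That reuse is exactly what a KV cache does in practice. A stripped-down, single-head sketch (toy numpy with arbitrary dimensions, not production attention code):

    import numpy as np

    d = 8                                  # toy head dimension
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    k_cache, v_cache = [], []              # keys/values of earlier tokens never change

    def attend(x_new):
        """Process one new token embedding, reusing the cached K/V of earlier tokens."""
        q = x_new @ Wq                     # only the new token's query is needed
        k_cache.append(x_new @ Wk)
        v_cache.append(x_new @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                 # attention output for the new token only

    for _ in range(5):                     # feed tokens one at a time
        out = attend(rng.normal(size=d))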
--
Is ChatGPT just an improved Markov Chain?