
It seems clear that the space of all possible texts to predict is so vast that the only way to do effective prediction over it is to do actual "understanding".

This makes sense if you think about it from a Kolmogorov complexity point of view. A program that outputs correct colors for all of those boxes, and does all the other things, based on memorization alone will end up needing a hopelessly gigantic "Chinese room" dictionary for every combinatorial situation. Even with all the parameters in these models, it would not be enough. On the other hand, a program that simply does the logic and returns the logically correct result would be much shorter.
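To make that concrete, here's a toy Python sketch of the size gap I mean (the rule and the numbers are mine, purely illustrative):

    # Toy sketch of the Kolmogorov-complexity gap: "memorize every case"
    # vs. "implement the rule". Illustrative only.
    import pickle

    # The rule itself: a few characters of code, and it covers *all* inputs.
    def rule(a: int, b: int) -> int:
        return a ^ b

    # The "Chinese room" version: a lookup table for every possible pair of two bytes.
    table = {(a, b): a ^ b for a in range(256) for b in range(256)}

    print(len(pickle.dumps(table)))               # hundreds of kilobytes for 2 bytes of input
    print(len("def rule(a, b): return a ^ b"))    # ~30 characters, and it generalizes

    # Widen the inputs to 32 bits each and the table becomes physically impossible
    # to store, while the program stays the same size.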

Seems obvious, so I'm not sure why this confused argument continues.



Doesn't your argument assume that any patterns inherent to language itself can only be captured by intelligence?

As an extreme example, your argument could be used to support the idea that statistical fitting is impossible, since any process that outputs the correct answer by memorization (e.g., fitting the data) would require a hopelessly gigantic "Chinese room" dictionary for all the possible input:output combinations.

Compression and entropy are super interesting; I don't understand this need to explain them away instead of trying to understand more about why this form of compression is so effective for language models.


Compression and entropy are super interesting! I'm not explaining them away; I'm pointing out that effective compression is likely to end up emulating some of the training data's generative process itself -- the patterns of thought that created text like this -- because it becomes harder and harder to "memorize" as the problem space grows.

If you try to fit samples from the function 5.2*sin(17.3 x + 0.25) using a bunch of piecewise-linear lookup tables as your basis, you'll need a very large table to get good accuracy over any range! A much more effective compression of that data is the function itself. And so if your basis includes sine and cosine functions, you'll get a very accurate fit with those, very quickly and compactly.
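Here's a rough numerical sketch of that in Python (the sample counts and the [0, 1] range are arbitrary choices of mine):

    # Approximating 5.2*sin(17.3*x + 0.25): a piecewise-linear lookup table
    # vs. fitting the three parameters of a sine directly. Illustrative only.
    import numpy as np
    from scipy.optimize import curve_fit

    def f(x):
        return 5.2 * np.sin(17.3 * x + 0.25)

    x_test = np.linspace(0.0, 1.0, 2001)
    y_test = f(x_test)

    # "Memorization" basis: a dense table of stored (x, y) knots, linearly interpolated.
    knots = np.linspace(0.0, 1.0, 200)
    y_lin = np.interp(x_test, knots, f(knots))
    print("lookup table (200 stored points), max error:", np.max(np.abs(y_lin - y_test)))

    # Sine basis: the entire fit is three numbers.
    def model(x, a, w, p):
        return a * np.sin(w * x + p)

    params, _ = curve_fit(model, x_test, y_test, p0=[5.0, 17.0, 0.2])
    print("fitted (a, w, p):", params)
    print("sine fit (3 numbers), max error:",
          np.max(np.abs(model(x_test, *params) - y_test)))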

Claude Shannon built an early language model that was doing "just" statistics, which he describes in his famous paper A Mathematical Theory of Communication. Compression was exactly his goal. He builds up a Markov model that draws letters from English, including ever-longer correlations. A first-order approximation draws random letters according to their frequency of occurrence in English:

OCRO HLI RGWR NMIELWIS...

A second-order approximation is based on the probability of transitioning from one character to the next; for example, Q will always be followed by U:

ON IE ANTSOUTINYS ARE T INCTORE...

The next, more refined model includes trigram probabilities; for example, TH will usually be followed by O or E:

IN NO IST LAT WHEY CRATICT...

It gradually gets more English-like, yes, but you are never going to get GPT-like performance from extending this method to character n-grams of n = 1,000,000. At this point in the paper Shannon starts over, using words and word transition probabilities, to get:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT...

But even then, just extending this Markov model to account for the transition probabilities of n-grams of words won't get you to GPT. And obviously there's no interesting thought happening inside there.
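For anyone who wants to see how simple that kind of model really is, here's the word-level Markov idea in a few lines of Python (my own toy sketch over a stand-in corpus, not Shannon's procedure verbatim):

    # Minimal word-level Markov chain: count which word follows which, then sample.
    import random
    from collections import defaultdict

    corpus = (
        "the cat sat on the mat and the dog sat on the rug while the cat "
        "watched the dog and the dog watched the door"
    ).split()

    # Bigram transition table: word -> list of observed next words.
    transitions = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev].append(nxt)

    def sample(start, length=15):
        word, out = start, [start]
        for _ in range(length):
            followers = transitions.get(word)
            if not followers:
                break
            word = random.choice(followers)
            out.append(word)
        return " ".join(out)

    print(sample("the"))
    # Extending this to trigrams, 4-grams, ... blows up the table combinatorially
    # long before the output starts to look like reasoning.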

That's the point I'm trying to make here: when some people say it's "just statistics", they seem to be imagining a Markov model scaled up to be very large, one that knows which words follow which other words, with various transition probabilities...

But that wouldn't work well enough to correctly play spontaneously-invented logic puzzles. The problem space grows too fast for that to work.

Try to create a chess program that uses a Markov model. You can do it in theory, but you'll effectively need to fit the (10^120)-gram of transition probabilities (Shannon's number). Now compare that to a chess program that does an alpha-beta search. Even monkeys on typewriters are more likely to recreate Deep Blue's programming than a decent Markov-based chess program.
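For contrast, the core of alpha-beta fits in a couple dozen lines; here's a toy Python sketch over a hand-built game tree (nothing like a real chess engine, which also needs move generation, evaluation, move ordering, and so on):

    # Alpha-beta search over a toy game tree: inner lists are positions,
    # integer leaves are scores from the maximizing player's point of view.
    def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
        if not isinstance(node, list):          # leaf: static evaluation
            return node
        if maximizing:
            value = float("-inf")
            for child in node:
                value = max(value, alphabeta(child, alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:               # beta cutoff
                    break
            return value
        else:
            value = float("inf")
            for child in node:
                value = min(value, alphabeta(child, alpha, beta, True))
                beta = min(beta, value)
                if beta <= alpha:               # alpha cutoff
                    break
            return value

    # Tiny hand-built tree: the maximizer's best achievable value is 6.
    tree = [[[5, 6], [7, 4, 5]], [[3]], [[6], [6, 9]]]
    print(alphabeta(tree))                      # -> 6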


Thanks for putting that much more intelligently than I could have.

Tangent: I snooped in your profile and found the Eliezer Yudkowsky interview. I just re-posted it in hopes of further discussion and to raise my one point. https://news.ycombinator.com/item?id=35443581


Who said that ChatGPT is effective?

There's an equivalent of a Google web index under the hood.


The most immediate example: consumer451 said so in the grandparent comment. Can you find a solution by Googling their query?



