Some background before I explain why your suggestion is "not even wrong". If you can predict the next X bits given the previous Y bits of observation, you don't need to store the next X bits: the decompressor can just run the same prediction algorithm you used and write out its predictions without correction. At a high level this is the same as general-purpose AI. If you can predict the next X bits given the previous Y bits of observation, you can make a reasoned decision ("choose option A now so that the next X bits have a certain outcome").
The above is actually the premise of the competition and the reason it exists. What I've said is covered in detail in academic papers by Marcus Hutter et al., who runs this competition: lossless compression can be a scorecard for prediction, which is the same thing as AGI.
Now saying "they should just make it lossy" misses the point. Do you know how to turn lossy data into lossless? You store some data everytime you're wrong on your prediction. ie. Store data when you have loss to make it lossy. This is arithmetic coding in a nutshell. You can turn any lossy data into lossless data with arithmetic coding. This will require more data the more loss you have. The lossless requirement gives us a scorecard of how well the lossy prediction worked.
If you ask for this to be lossy compression, you throw out that scorecard and bring in a lot of unwanted subjectivity.
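To make the mechanics concrete, here's a toy sketch in Python (a made-up "repeat the last character" predictor, not real arithmetic coding; a real coder charges fractional bits based on the model's probabilities instead of whole corrections):

  def predict(history):
      # Toy predictor: guess that the next character repeats the previous one.
      # (Stand-in for whatever model a real entry would use.)
      return history[-1] if history else " "

  def compress(text):
      corrections = []                        # (position, actual char) wherever we mispredict
      for i, ch in enumerate(text):
          if predict(text[:i]) != ch:
              corrections.append((i, ch))
      return len(text), corrections           # fewer corrections = better predictor

  def decompress(length, corrections):
      fixes, out = dict(corrections), ""
      for i in range(length):
          out += fixes.get(i, predict(out))   # trust the shared predictor unless corrected
      return out

  text = "aaabbbbccd"
  stored = compress(text)
  assert decompress(*stored) == text          # lossless round trip
  print(stored)                               # the corrections are the loss scorecard

The decompressor runs the exact same predictor, so only the mispredictions have to be shipped.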
> Some background before I explain why your suggestion is "not even wrong".
The phrase you quote is generally used to imply a stupid or unscientific suggestion. Your succeeding comments about what you think AGI is carry a certitude that isn't warranted.
It's good that you are trying to supply knowledge where you think it is lacking, and I understand there are fora where this sort of public-school lecturing is amusing, but I think your tone is misplaced here.
Years-long compression challenge with dozens of geniuses participating: exists
Random person on the internet: let me improve this thing I've never heard of by using the one fact I know about compression, that there are two kinds
It's absolute hubris and a waste of everyone's time to chime in with low-value, trash comments like "they should make it lossy". It's not unreasonable at all to take a snarky tone in response. "Not even wrong" absolutely applies here, and they carefully, patiently, and in great detail explained why.
I often feel the same way when discussions pop up here or on other forums about topics I'm familiar with. Like randos declaring that researchers in deep learning are "obviously doing it wrong" and should instead do X, where X is an entire subfield that has existed for years with a lot of activity, etc.
So I get where you're coming from. But I'd suggest that a place like HN is in fact a place for random people to inject their half-baked takes. It is just a discussion board where lots of the comments will be uninformed or wrong. Take it or leave it. If you want something else, you need to find more niche communities that are, by their nature, more difficult to find and less public, including IRL discussion, clubs, conferences etc. But it has its use: you and I can jump into any thread, type out what we think after 2 minutes, and get some response. But of course someone even more novice might think that we know more than just that 2 minutes of consideration, and they learn our junk opinion as if it were the result of long experience. It's unavoidable, since nobody knows who the rest of the commenters are.
Online discussions are incredibly noisy, and often even the people who use the jargon and seem knowledgeable to an outsider can be totally off-base, essentially just imitating what the particular science or field "sounds like". Unfortunately, you only learn this gradually and over a long time. If you learn stuff through forums, Reddit, HN, blogs, substacks etc., the first-person experience can be very misleading, because you will soak up lots of nonsense as well. Reading actual books and taking real courses is still very much relevant.
HN and co. are more like the cacophony of what the guy on the street thinks. Very noisy, and only supposed to be a small treat on top of rigorous study. You shouldn't expect to see someone truly breaking new ground in this comment thread. If it disturbs you, you can skip the comments. But trying to "forbid" it, or gatekeep, is futile. It's like trying to tell people in a bar not to discuss how bad the soccer team's coach is, because they don't really have the relevant expertise. Yeah, sure, but people just wanna chat and throw ideas around. It's on the reader to know not to take it too seriously.
ISWYM, but the problem isn't really people making suggestions; it's the way they make them that grates. If Mr Internet-random-guy wants to introduce the issue of lossy compression, then by all means ask why not, but don't say they should.
It comes across as arrogance, probably because it is, and then it sucks up plenty of the time of others who actually do know the subject and have to put things right.
Even more bloody annoying is when people ask questions that even the most cursory web search would answer. Wikipedia is usually a very good place to start. I guess that for these people, the cost is externalised as other people's wasted time.
Then again, we all take turns at being the stupid one, so who am I to complain.
It's inherent in reading comments. And it's also inherent in encountering mere mortals in the real world. And remember how the most annoying and stupid people keep going on about how all the people they meet are stupid and annoying. There's no point in piling on another layer. Close the tab, or comment constructively and charitably. Otherwise you end up with stuff like the badX subreddits (badhistory, badphilosophy), which get their adrenaline/dopamine fix by seeking out explanations they see as ignorant/naive/arrogant and sneering at them, while self-aggrandizing and feeling like they're in the inner circle that knows it all.
The other thing is, you never see all the people who do go to Wikipedia, google or check a book. They won't comment "Hello I'm not commenting now because I went to Wikipedia". They just don't comment.
And Cunningham's Law states "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer."
People are more prone to comment out of frustration than other feelings.
You may find the book "The Structure of Scientific Revolutions" interesting if you have not read it. The author posits that it isn't the people entrenched in a field who offer breakthrough advancements, but instead outsiders looking in.
More often than not, they are aggressively rebutted, which leads to the belief that science progresses one funeral at a time. Perhaps it is you who needs to guard against hubris?
I wish I could claim the logic, but I am a nobody who knows nothing. The author of the book, however, has impressed people for over 70 years with this line of thought and I too agree with him.
I did read the thread. I'm objecting to the idea that stacking lossless corrections onto lossy compression and then measuring the total is a good way to measure what we want to measure here, wrt human knowledge. It may be the best we have, but it's not good.
Why should we care what you think, though? I'm not being nasty, but unless you have a reputation in that field, you have to give some cogent argument or it's just some random, possibly uninformed opinion.
There is nothing wrong with jumping in and saying stuff that probably isn't right on an internet forum! That doesn't make it valuable or insightful, it's just how forums work. Your comment was cool with me and you shouldn't feel like you need to change at all.
That said, the response to your comment was insightful and made interesting points. You did in fact kick off a very interesting conversation!
I feel your tone-policing is misplaced here. When someone is that ignorant or foolish, they should be told so, so that they can hopefully recalibrate or something. There's enough inchoate nonsense on the Internet that I appreciate the efforts to keep discussion on a little higher level here.
> There's enough inchoate nonsense on the Internet
I agree 100%. But the top-level comment is not an example of such.
However, the reply in question – and your comment – are certainly examples of the kind of tone-deaf, needlessly aggressive, hostile, confrontational, and borderline malicious posts I wish I could cleanse the Internet of wholesale.
> tone-deaf, needlessly aggressive, hostile, confrontational, and borderline malicious posts I wish I could cleanse the Internet of wholesale.
I have never said this before, but maybe you're a little too sensitive?
In any event, I feel your characterization of my comment borders on ad hominem and certainly it seems to violate the site guideline to interpret comments charitably.
Banter by definition is teasing, which is insulting. That, along with being humorous in that context and not to be taken seriously, is what makes it banter rather than any other form of exchange.
Let's just go to the dictionary:
Banter (n): the playful and friendly exchange of teasing remarks.
Teasing (adj): intended to provoke or make fun of someone in a playful way.
Is there any evidence that arithmetic coding works at the level of concepts?
As a thought experiment, suppose Copernicus came up with this Hutter prize idea and declared that he would provide the award to whoever could compress the text of his book on the epicycle-based movement of planets around the sun (De revolutionibus orbium coelestium).
Today we can explain the actual motion to high accuracy with a single sentence that would have been understandable in that age: "A line drawn from the sun to any planet sweeps out equal areas in equal times."
This, however, is mostly useless in attempting to win the Copernican Hutter prize. Predicting the wording of some random human's choosing (especially at length) is very far removed from the ability to predict in general.
Arithmetic coding isn't the key thing here; it's just a bitwise algorithm. You predict the next bit is a 1 with 90% certainty? Then you don't need to store much data with arithmetic coding. That's all arithmetic coding is here.
What you're getting at is the 'predictor' that feeds into the arithmetic coder, and that's wide open and can work any way you want it to. LLMs absolutely have context, which is similar to what you're asking about, and they are good predictors of output given complex input (pass GPT a mathematical series and ask it what comes next; if it's right, that's really helpful for compression, as you wouldn't need to store the whole series in full).
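A back-of-the-envelope illustration of that cost (Python, with assumed probabilities): an ideal arithmetic coder charges about -log2(p) bits, where p is the probability the predictor gave to the bit that actually occurred.

  import math

  def cost_bits(p_one, actual):
      # Ideal arithmetic-coding cost for one bit, given the predictor's
      # probability that the bit is 1 and the bit that actually occurred.
      p = p_one if actual == 1 else 1.0 - p_one
      return -math.log2(p)

  print(cost_bits(0.90, 1))   # ~0.15 bits: confident and right, nearly free
  print(cost_bits(0.90, 0))   # ~3.3 bits: confident and wrong, expensive
  print(cost_bits(0.50, 1))   # exactly 1 bit: no prediction, no savings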
So you don't think that writing a wiki article about this could be made smaller by encoding the info in a few logical steps and adding some metadata on what kind of sentence should follow what? The part about decompressing it is the AI part. Where to place a comma can be added at roughly constant cost for any contending program.
Sure my point is more that this adding of commas business dwarfs any real prediction.
Suppose there were a superintelligence that figured out the theory of everything for the universe. It's unclear that it would actually help with this task. You could likely derive things like gravitation, chemistry, etc. easily enough, but the vast majority of your bits would still be used attempting to match the persona and wording of the various Wikipedia authors.
This superintelligence would be masked by some LLM that is slightly better at faking human wording.
But that comma will have the exact same price between 2 contending lossy compressions. In fact, it is a monotonic function of the difference, so the better your lossy compression is, the better your arithmetic one will be — making you measure the correct thing in an objective way.
It’s like, smart people have spent more than 3 minutes on this problem already.
> But that comma will have the exact same price between 2 contending lossy compressions.
Why do you think that? Do you have proof of that?
> making you measure the correct thing
If we're trying to measure knowledge, then the exact wording is not part of being correct.
Very often you will have to be less correct to match Wikipedia's wording. A better lossy encoding of knowledge would have a higher cost to correct it into a perfect match of the source.
> Why do you think that? Do you have proof of that?
You want to encode “it’s cloudy, so it’ll rain”. Your lossy, intelligent algorithm comes up with “it is cloudy so it will rain”.
You save the diff and apply it. If another, worse algorithm can only produce “it’s cloudy so sunny”, it will have to pay more in the diff, which scales with the number of differences between the produced and original string.
You can be less correct if that cumulatively produces better results; that's the beauty of the problem. The last-"mile" difference costs everyone the same, as a function of the difference.
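A rough sketch of that accounting (Python; the strings are the examples from this thread, and counting character edits is only a crude stand-in for what an arithmetic coder would actually charge):

  import difflib

  original = "it's cloudy, so it'll rain"
  guess_a  = "it is cloudy so it will rain"   # close to the original wording
  guess_b  = "it's cloudy so sunny"           # wanders further from it

  def correction_chars(guess, target):
      # Count the target characters that must be inserted or replaced to
      # patch the lossy guess back into the exact original text.
      ops = difflib.SequenceMatcher(None, guess, target).get_opcodes()
      return sum(j2 - j1 for tag, i1, i2, j1, j2 in ops if tag != "equal")

  print(correction_chars(guess_a, original))   # small patch
  print(correction_chars(guess_b, original))   # bigger patch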
How about "it is cloudy so it will rain" and "it's cloudy, so sunny"? Then since we're looking at the commas for this argument, the second algorithm is paying less for comma correction even though it's much wronger.
You seem to be assuming that a less intelligent algorithm is worse at matching the original text in every way, and I don't think that assumption is warranted.
I'll rephrase the last line from my earlier post: What if wikipedia is using the incorrect word in a lot of locations, and the smart algorithm predicts the correct word? That means the smart algorithm is a better encoding of knowledge, but it gets punished for it.
In that case the last mile cost is higher for a smart algorithm.
And even when the last mile cost is roughly the same, the bigger of a percentage it becomes, the harder it is to measure anything else.
And it shuns any algorithm that's (for example) 5% better at knowledge and 2% worse at the last mile, even though such a result should be a huge win. There are lots of possible ways to encode knowledge that will drag things just a bit away from the original arbitrary wording. So even if you use the same sub-algorithm to do the last mile, it will have to spend more bits. I don't think this is an unlikely scenario.
> Then since we're looking at the commas for this argument, the second algorithm is paying less for comma correction even though it's much wronger.
And? It will surely have to be on average more correct than another competitor, otherwise its size will be much larger.
> What if wikipedia is using the incorrect word in a lot of locations,
Then you write s/wrongword/goodword for a few more bytes. It won't be a deciding factor, but to beat trivial compressors you do have to be smarter than just looking at the data - that's the point.
> And it shuns any algorithm that's (for example) 5% better at knowledge and 2% worse at the last mile
That's not how it works. With all due respect, much smarter people than us have been thinking about it for many years - let's not try to make up why it's wrong after thinking about it badly for 3 minutes.
> And? It will surely have to be on average more correct than another competitor, otherwise its size will be much larger.
It's possible to have an algorithm that is consistently closer in meaning but also consistently gets commas (or XML) wrong and pays a penalty every time.
Let's say both that algorithm and its competitor are using 80MB at this stage, before fixups.
Which one is more correct?
If you say "the one that needs fewer bytes of fixups is more correct", then that is a valid metric but you're not measuring human knowledge.
A human knowledge metric would say that the first one is a more correct 80MB lossy encoding, regardless of how many bytes it takes to restore the original text.
> Then you write s/wrongword/goodword for a few more bytes. It won't be a deciding factor
You can't just declare it won't be a deciding factor. If different algorithms are good at different things, it might be a deciding factor.
> That's not how it works. With all due respect, much smarter people than us have been thinking about it for many years - let's not try to make up why it's wrong after thinking about it badly for 3 minutes.
Prove it!
Specifically, prove they disagree with what I'm saying.
There’s no way I could reproduce a calculus textbook verbatim, but I can probably prove all the important theorems in it.
Even then, given half of any sentence in the book, I don’t rate my chances of reproducing the next half. That’s more a question of knowing the author’s style than knowing calculus itself.
Let's say you always come up with the exact same text given an initial seed (say, what you ate that day). Then the remaining cost really is just the arithmetic coding of the difference, since the process is deterministic.
Now, who will get a smaller result with that added arithmetic coding: you, knowing the proofs, or a random guy who doesn't even speak the language? Doesn't that then measure intelligence, all else being equal?
If it's 99.99% accurate, arithmetic coding would need to store next to no data.
Arithmetic coding is optimal at turning probabilistic predictions into lossless data; there's provably no way to do it more efficiently. The better the predictions, the less correction data it needs.
So, given this, why even dwell on approaches that add any form of subjectivity? Arithmetic coding is right there, and it's a simple algorithm.
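Rough numbers behind "next to no data stored" (Python; this assumes the model really does assign 99.99% probability to the bit that occurs, which is what the accuracy figure is standing in for here):

  import math

  p = 0.9999                                  # probability given to the bit that occurs
  cost_right = -math.log2(p)                  # ~0.00014 bits per correctly predicted bit
  cost_wrong = -math.log2(1 - p)              # ~13.3 bits per miss

  n_bits = 8 * 10**9                          # roughly 1 GB of input
  expected_bits = n_bits * (p * cost_right + (1 - p) * cost_wrong)
  print(expected_bits / 8 / 1e6, "MB")        # about 1.5 MB instead of 1000 MB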
There's a section in the link above, "Further Recommended Technical Reading relevant to the Compression=AI Paradigm", and they define it in a reasonably precise mathematical way. It's well accepted at this point. If you can take input and predict what will happen given some options, you can direct towards a certain goal. This ability to direct towards a goal effectively defines AGI. "Make paperclips", and the AI observes the world, works out what decisions need to be made to maximise paperclip output, and then starts taking those decisions; that is essentially what we mean by AGI, and prediction is a piece of it.
I have no stake in this btw; I just had a crack at the above challenge in my younger days. I failed, but I want to get back into it. In theory, a small LLM without any pre-existing training data (to keep the size down) that trains itself on the input as it feeds predictions to an arithmetic coder, with the same process on the decompression side, should work really well here. But I don't have the time these days. Sigh.
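For what it's worth, here's a skeleton of that pipeline in Python, with a tiny adaptive byte-frequency model standing in for the self-training LLM; it only computes the ideal code length rather than doing the actual arithmetic coding:

  import math
  from collections import Counter

  class AdaptiveModel:
      # Order-0 stand-in for the self-training model: starts from a flat
      # prior over bytes and updates its counts as the data streams past.
      def __init__(self):
          self.counts = Counter({b: 1 for b in range(256)})
          self.total = 256

      def prob(self, byte):
          return self.counts[byte] / self.total

      def update(self, byte):
          self.counts[byte] += 1
          self.total += 1

  def ideal_code_length_bits(data):
      # What an arithmetic coder driven by this model would charge; the
      # decompressor runs the identical model, so nothing else is stored.
      model, bits = AdaptiveModel(), 0.0
      for b in data:
          bits += -math.log2(model.prob(b))
          model.update(b)
      return bits

  sample = b"abababababababab" * 64
  print(ideal_code_length_bits(sample) / 8, "bytes vs", len(sample), "raw")

Swap the frequency model for one that learns longer-range structure and the code length drops accordingly; that's the whole game.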
> This ability to direct towards a goal effectively defines AGI
No it doesn't, though it may be argued to be a requirement.
That's the point of the previous commenter - that you are making unjustified assertions using an extrapolation of the views of some researchers. Reiterating it with a pointer to why they believe that to be the case doesn't make it more so.
If that's your favoured interpretation, fine, but that's all it is at this point.
Go argue with the scientists, who state pretty much what I just said verbatim, including full links with proofs, at http://prize.hutter1.net/hfaq.htm#ai :)
> One can prove that the better you can compress, the better you can predict; and being able to predict [the environment] well is key for being able to act well. Consider the sequence of 1000 digits "14159...[990 more digits]...01989". If it looks random to you, you can neither compress it nor can you predict the 1001st digit. If you realize that they are the first 1000 digits of π, you can compress the sequence and predict the next digit. While the program computing the digits of π is an example of a one-part self-extracting archive, the impressive Minimum Description Length (MDL) principle is a two-part coding scheme akin to a (parameterized) decompressor plus a compressed archive. If M is a probabilistic model of the data X, then the data can be compressed (to an archive of) length log(1/P(X|M)) via arithmetic coding, where P(X|M) is the probability of X under M. The decompressor must know M, hence has length L(M). One can show that the model M that minimizes the total length L(M)+log(1/P(X|M)) leads to best predictions of future data. For instance, the quality of natural language models is typically judged by its Perplexity, which is equivalent to code length. Finally, sequential decision theory tells you how to exploit such models M for optimal rational actions. Indeed, integrating compression (=prediction) into sequential decision theory (=stochastic planning) can serve as the theoretical foundations of super-intelligence (brief introduction, comprehensive introduction, full treatment with proofs).
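To make the L(M) + log(1/P(X|M)) part concrete, here's a toy two-part-code comparison in Python (the 16 bits to describe the parameter is an assumed number, not something from the FAQ):

  import math

  # X: 1000 coin flips, 900 of them heads.
  n, heads = 1000, 900

  def fit_bits(p_heads):
      # log2(1/P(X|M)): the arithmetic-coded size of the data under model M.
      return -(heads * math.log2(p_heads) + (n - heads) * math.log2(1 - p_heads))

  cost_fair   = 0  + fit_bits(0.5)   # "fair coin": free to state, fits badly (~1000 bits)
  cost_biased = 16 + fit_bits(0.9)   # "90% heads": pay for the parameter, fits well (~485 bits)
  print(cost_fair, cost_biased)      # the better model wins on total code length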
Whether or not you agree, a lot of people do. There is a trivial sense in which a perfect compression algorithm is a perfect predictor (if it ever mispredicted anything, that error would make it a sub-optimal compressor for a corpus that included that utterance), and there are plenty of ways to prove that a perfect predictor can be used as an optimal actor (if you ever mispredicted the outcome of an event worse than what might be fundamentally necessary due to limited observations or quantum shenanigans, that would be a sub-optimal prediction and hence you would be a sub-optimal compressor), a.k.a. an AGI.
Where a lot of us get off the fence is when we remove "perfect" from the mix. I don't personally think that performance on a compression task correlates very strongly with what we'd generally consider as intelligence. I suspect good AGIs will function as excellent compression routines, but I don't think optimizing on compression ratio will necessarily be fruitful. And I think it's quite possible that a more powerful AGI could perform worse at compression than a weaker one, for a million reasons.
If you had a perfect lossless compressor (one that could compress anything down to its fundamental Kolmogorov complexity), you would also definitionally have an oracle that could compute any computable function.
Intelligence would be a subset of the capabilities of such an oracle.
No, because getting a UTM to compute any computable function only requires a tiny set of instructions for generating every possible permutation over ever-increasing lengths of tape.
It will run forever, and I would agree that in that set there will be an infinite number of functions that when run would be deemed intelligent, but that does not make the computer itself intelligent absent first stumbling on one of those specific programs.
EDIT: Put another way, if the potential to be made to compute in a way we would deem intelligent is itself intelligence, then a lump of random particles is intelligent because it could be rearranged into a brain.
The analogy is not a random lump of particles but all particles in all configurations, of which you are a subset. Is the set of you plus a rock intelligent?
This is a fundamentally flawed argument, because a computer is not in all the states it can execute at once, and naively iterating over the set of possible states might well not have found a single intelligence before the heat death of the universe.
If we were picking me out of an equally large set of objects, then I'd argue that no, the set is not meaningfully intelligent, because the odds of picking me would be negligible enough that it'd be unreasonable in the extreme to assign the set any of my characteristics.
Sorry but i hope my long-winded explanation explains it.
In computer science we have a way to score how good lossy data is. That way is to make it lossless and look at how much data the arithmetic coder needed to correct it from lossy to lossless.
This is a mathematically perfect way to judge (you can't correct lossy data any more efficiently than an arithmetic coder). All the entries here do in fact make probabilistic predictions on the data, and they all use arithmetic coding. So the suggestion misses a key point of the CS involved here. I don't mean to be rude about it, but the idea does need correcting.
Only if you're using a very particular and honestly circular-sounding definition of "good".
Some deviations are more important than others, even if you're looking at deviations that take the same amount of data to correct.
Think about film grain. Some codecs can characterize it, remove it when compressing, and then synthesize new visually matching grain when decompressing.
Let's say it takes a billion bytes to turn the lossy version back into the lossless version.
The version with synthetic film grain still needs a billion bytes or maybe even slightly more bytes, even if the synthetic grain is 95% as good as real grain. The cost to turn it lossless is the wrong metric.