Good lossy compression can be used to achieve lossless compression.
(information theory)
The more accurate the lossy compression is, the smaller the difference between the actual data (lossless) and the approximation. The smaller the difference, the fewer bits required to restore the original data.
So a naive approach is to use an LLM to approximate the text (this would need to be deterministic: zero temperature with a preset seed), then subtract that approximation from the original data. Store this diff, then add it back later to restore the original data bit-for-bit.
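A minimal sketch of that idea, with a tiny deterministic character model standing in for the LLM (the order-2 model, training it on the same text, and the tab-separated correction format are all illustrative assumptions, not anything from an actual entry):

    # Predict-then-correct: a deterministic model guesses each character;
    # we store (and zlib-compress) only the places where it guesses wrong.
    # An order-2 character model stands in for the LLM here; in a real entry
    # the model itself ships with the decompressor and counts toward the size.
    import zlib
    from collections import Counter, defaultdict

    ORDER = 2

    def train(text):
        """For each 2-char context, remember the most frequent next char."""
        counts = defaultdict(Counter)
        for i in range(ORDER, len(text)):
            counts[text[i - ORDER:i]][text[i]] += 1
        # Deterministic tie-breaking, so compressor and decompressor agree
        # (the analogue of zero temperature with a preset seed).
        return {ctx: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
                for ctx, c in counts.items()}

    def predict(model, prefix):
        return model.get(prefix[-ORDER:], ' ')

    def compress(text, model):
        """Keep only the positions where the model's prediction is wrong."""
        wrong = [(i, ch) for i, ch in enumerate(text)
                 if i < ORDER or predict(model, text[:i]) != ch]
        blob = '\n'.join(f'{i}\t{ch}' for i, ch in wrong).encode()
        return len(text), zlib.compress(blob, 9)

    def decompress(length, packed, model):
        fixes = {}
        for line in zlib.decompress(packed).decode().split('\n'):
            i, ch = line.split('\t')
            fixes[int(i)] = ch
        out = []
        for i in range(length):
            out.append(fixes.get(i) or predict(model, ''.join(out)))
        return ''.join(out)

    text = "the quick brown fox jumps over the lazy dog. " * 50
    model = train(text)
    length, packed = compress(text, model)
    assert decompress(length, packed, model) == text      # bit-for-bit round trip
    print(f"{len(text)} chars -> {len(packed)} bytes of corrections")

The better the predictor, the fewer corrections survive, and the correction stream is what actually gets stored.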
In psychoacoustics, one of the most important aspects of how lossy audio compression works is that it throws away sounds a human can't even hear, because they're too short or too subtle to be noticed. This is the difference between data and knowledge. Someone with perfect pitch and a good memory knows exactly what Don't Stop Me Now by Queen sounds like, and that memory is a lot smaller than a digitization of an LP record in mint condition.
This person can cover that song, and everyone will be happy, because they have reproduced everything that makes that song what it is. If anything, we might be disappointed because it's a verbatim reproduction (for some reason we prefer cover songs to introduce their own flavor).
If I ask you to play Don't Stop Me Now and you sound like alcoholic Karaoke, you haven't satisfied the request. You've lost. Actually we've all lost, please stop making that sound, and never do that again.
But that's the good thing about this competition: it's fair to everyone. The perfect-pitch version only needs a tiny amount of additional data to decompress back to the exact original, while the alcoholic-karaoke version has to correct for loads of differences, making it much larger.
The size of this correction grows monotonically with the size of the error in both cases, so you basically don't have to care about it: ranking the corrected, lossless versions gives you the same ordering as ranking the lossy versions directly.
So the more accurate version wins, and the competition remains fair and unambiguous (otherwise, how would you decide whether perfect-pitch cover A or perfect-pitch cover B was better?)
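As a toy illustration of that fairness argument (the lyric snippet and the two deliberately mangled "covers" are made up for the example):

    # Two lossy "covers" of the same line, ranked by how many bytes of
    # corrections each one needs to get back to the original exactly.
    import zlib

    original   = "I'm a shooting star leaping through the sky " * 20
    good_cover = original.replace("leaping", "leapin'")        # one small slip
    bad_cover  = original.replace("a", "o").replace("e", "u")  # off-key everywhere

    def correction_size(cover, target):
        """Byte-wise diff against the target, compressed; matches become zeros."""
        diff = bytes(c ^ t for c, t in zip(cover.encode(), target.encode()))
        return len(zlib.compress(diff, 9))

    print("good cover needs", correction_size(good_cover, original), "bytes")
    print("bad  cover needs", correction_size(bad_cover, original), "bytes")

The closer the cover, the cheaper the correction, so the total (cover plus correction) ranks the contestants the same way their lossy quality would.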
Exactly right. There would be no confusion if the Hutter Prize were a competition for compressing human data.
This is a parallel issue to the ones in conversations around understanding. There's a massive gulf between being able to build an ontology about a thing and having an understanding of it: the former requires only a system of categorizing and relating elements, and has nothing to do with correctness per se; the latter is an effective (note: lossy!) compression of causal factors which generate the categories and relations. It's the difference between "Mercury, a star that doesn't twinkle, orbits Earth, the only planet in existence, and we have inferred curves that predict why it goes backwards in its orbit sometimes" and "Mercury and Earth, planets, both orbit the Sun, a star, and all stable orbits are elliptical".
I think Wikipedia is human data. But to say it's human knowledge (that is, that data is synonymous with knowledge in any context) is actually a pretty hardline epistemological stance to take, one that needs to be carefully examined and justified.
Completely agree, and I think this is where a lot of the confusion in the thread is coming from. The prize claims to compress “human knowledge” but it’s actually a fairly rigid natural language text compression challenge.
I'd conjecture that the algorithms that do well (as measured by compression ratio) on this particular corpus will also do well on corpora that contain minimal 'knowledge'.
Compression algorithms that do well might encode a lot of data about syntax, word triple frequencies, and Wikipedia editorial style, but I really doubt they’ll encode much “knowledge”.
while you are simply assuming (without a good reason) that I am saying
{data} ∩ {knowledge} = ∅
If you look up the thread to the example of singing a Queen song, maybe that will help you better understand the angle that commenter and I share about the difference between data and knowledge.
One of the big problems in this use case is that a dump of Wikipedia contains both knowledge and arbitrary noise, and lossless compression has to preserve every single bit of both. It's hard to tease out "good lossy compression" from the mess, because better and better lossy compression doesn't get you arbitrarily close to the original; it only gets you somewhat close.
That's true until the lossy compression alters the data in a way that actually makes it harder to represent. As an example, a bitstream that 'corrects' a decompressed .MP3 file to match the original raw waveform data would be almost as difficult to compress as the raw file itself would be.
It wouldn't be a matter of simply computing and compressing deltas at each sample, because frequency-domain compression moves the bits around.
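You could poke at that claim with a rough experiment like the one below; the keep-the-strongest-FFT-bins step and the synthetic tones-plus-noise signal are crude stand-ins for MP3's psychoacoustic model and real music, so treat the printed numbers as suggestive at best:

    # Frequency-domain lossy round trip, then a per-sample delta:
    # compare how well zlib compresses the original PCM vs. the residual.
    import zlib
    import numpy as np

    rng = np.random.default_rng(0)

    # Fake "music": a few tones plus a little noise, quantized to 16-bit PCM.
    t = np.arange(2 ** 16) / 44100.0
    wave = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 660.0))
    wave += 0.05 * rng.standard_normal(t.size)
    pcm = (np.clip(wave / 4.0, -1.0, 1.0) * 32767).astype(np.int16)

    def lossy_roundtrip(samples, keep=0.05):
        """Keep only the strongest FFT bins, zero the rest, transform back."""
        spec = np.fft.rfft(samples.astype(np.float64))
        cutoff = np.quantile(np.abs(spec), 1.0 - keep)
        spec[np.abs(spec) < cutoff] = 0.0
        back = np.round(np.fft.irfft(spec, n=samples.size))
        return np.clip(back, -32768, 32767).astype(np.int16)

    approx = lossy_roundtrip(pcm)
    residual = pcm.astype(np.int32) - approx.astype(np.int32)
    assert np.array_equal(approx.astype(np.int32) + residual, pcm)  # lossless

    def zsize(arr):
        # Widen both arrays to int32 so the byte counts are comparable.
        return len(zlib.compress(arr.astype(np.int32).tobytes(), 9))

    print("original PCM       :", zsize(pcm), "bytes after zlib")
    print("per-sample residual:", zsize(residual), "bytes after zlib")

The interesting question is how far apart those two numbers end up: the residual is lower in amplitude, but it looks a lot like noise, which is exactly the point about frequency-domain codecs moving the bits around.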