Good lossy compression can be used to achieve lossless compression.
(information theory)
The more accurate the lossy compression is, the smaller the difference between the actual data (lossless) and the approximation. The smaller the difference, the fewer bits required to restore the original data.
So a naive approach is to use an LLM to approximate the text (this would need to be deterministic: zero temperature with a preset seed), then subtract that approximation from the original data. Store this diff, then add it back later to restore the original data bit-for-bit.
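A minimal sketch of that idea, with a tiny deterministic character model standing in for the LLM (the order-2 model, training it on the same text, and the tab-separated correction format are all illustrative assumptions, not anything from an actual entry):

    # Predict-then-correct: a deterministic model guesses each character;
    # we store (and zlib-compress) only the places where it guesses wrong.
    # An order-2 character model stands in for the LLM here; in a real entry
    # the model itself ships with the decompressor and counts toward the size.
    import zlib
    from collections import Counter, defaultdict

    ORDER = 2

    def train(text):
        """For each 2-char context, remember the most frequent next char."""
        counts = defaultdict(Counter)
        for i in range(ORDER, len(text)):
            counts[text[i - ORDER:i]][text[i]] += 1
        # Deterministic tie-breaking, so compressor and decompressor agree
        # (the analogue of zero temperature with a preset seed).
        return {ctx: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
                for ctx, c in counts.items()}

    def predict(model, prefix):
        return model.get(prefix[-ORDER:], ' ')

    def compress(text, model):
        """Keep only the positions where the model's prediction is wrong."""
        wrong = [(i, ch) for i, ch in enumerate(text)
                 if i < ORDER or predict(model, text[:i]) != ch]
        blob = '\n'.join(f'{i}\t{ch}' for i, ch in wrong).encode()
        return len(text), zlib.compress(blob, 9)

    def decompress(length, packed, model):
        fixes = {}
        for line in zlib.decompress(packed).decode().split('\n'):
            i, ch = line.split('\t')
            fixes[int(i)] = ch
        out = []
        for i in range(length):
            out.append(fixes.get(i) or predict(model, ''.join(out)))
        return ''.join(out)

    text = "the quick brown fox jumps over the lazy dog. " * 50
    model = train(text)
    length, packed = compress(text, model)
    assert decompress(length, packed, model) == text      # bit-for-bit round trip
    print(f"{len(text)} chars -> {len(packed)} bytes of corrections")

The better the predictor, the fewer corrections survive, and the correction stream is what actually gets stored.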
In psychoacoustics, one of the most important aspects of how lossy audio compression works is that it throws away sounds a human can't even hear, because they're too short or too subtle to be noticed. This is the difference between data and knowledge. Someone with perfect pitch and a good memory knows exactly what Don't Stop Me Now by Queen sounds like, and that memory is a lot smaller than a digitization of an LP record in mint condition.
This person can cover that song, and everyone will be happy, because they have reproduced everything that makes that song what it is. If anything, we might be disappointed because it's a verbatim reproduction (for some reason we prefer cover songs to introduce their own flavor).
If I ask you to play Don't Stop Me Now and you sound like alcoholic Karaoke, you haven't satisfied the request. You've lost. Actually we've all lost, please stop making that sound, and never do that again.
But that's the good thing about this competition: it's fair to everyone. The perfect-pitch version only needs a tiny amount of additional data to decompress back to the exact original, while the alcoholic-karaoke version has to correct for loads of differences, making it much larger.
The size of this correction grows monotonically with the size of the error in both cases, so you basically don't have to care about it: ranking the corrected, lossless versions gives you the same ordering as ranking the lossy versions directly.
So the more accurate version wins, and the competition remains fair and unambiguous (otherwise, how would you decide whether perfect-pitch cover A or perfect-pitch cover B was better?)
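As a toy illustration of that fairness argument (the lyric snippet and the two deliberately mangled "covers" are made up for the example):

    # Two lossy "covers" of the same line, ranked by how many bytes of
    # corrections each one needs to get back to the original exactly.
    import zlib

    original   = "I'm a shooting star leaping through the sky " * 20
    good_cover = original.replace("leaping", "leapin'")        # one small slip
    bad_cover  = original.replace("a", "o").replace("e", "u")  # off-key everywhere

    def correction_size(cover, target):
        """Byte-wise diff against the target, compressed; matches become zeros."""
        diff = bytes(c ^ t for c, t in zip(cover.encode(), target.encode()))
        return len(zlib.compress(diff, 9))

    print("good cover needs", correction_size(good_cover, original), "bytes")
    print("bad  cover needs", correction_size(bad_cover, original), "bytes")

The closer the cover, the cheaper the correction, so the total (cover plus correction) ranks the contestants the same way their lossy quality would.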
Exactly right. There would be no confusion if the Hutter Prize were a competition for compressing human data.
This is a parallel issue to the ones in conversations around understanding. There's a massive gulf between being able to build an ontology about a thing and having an understanding of it: the former requires only a system of categorizing and relating elements, and has nothing to do with correctness per se; the latter is an effective (note: lossy!) compression of causal factors which generate the categories and relations. It's the difference between "Mercury, a star that doesn't twinkle, orbits Earth, the only planet in existence, and we have inferred curves that predict why it goes backwards in its orbit sometimes" and "Mercury and Earth, planets, both orbit the Sun, a star, and all stable orbits are elliptical".
I think Wikipedia is human data. But to say it's human knowledge (that is, that data is synonymous with knowledge in any context) is actually a pretty hardline epistemological stance to take, one that needs to be carefully examined and justified.
Completely agree, and I think this is where a lot of the confusion in the thread is coming from. The prize claims to compress “human knowledge” but it’s actually a fairly rigid natural language text compression challenge.
I'd conjecture that the algorithms that do well (as measured by compression ratio) on this particular corpus will also do well on corpora that contain minimal 'knowledge'.
Compression algorithms that do well might encode a lot of data about syntax, word triple frequencies, and Wikipedia editorial style, but I really doubt they’ll encode much “knowledge”.
while you are simply assuming (without a good reason) that I am saying
{data} ∩ {knowledge} = ∅
If you look up the thread to the example of singing a Queen song, maybe that will help you better understand the angle that commenter and I share about the difference between data and knowledge.
One of the big problems in this use case is that a dump of Wikipedia contains both knowledge and arbitrary noise, and lossless compression has to preserve every single bit of both. It's hard to tease out "good lossy compression" from the mess, because better and better lossy compression doesn't get you arbitrarily close to the original; it only gets you somewhat close.
That's true until the lossy compression alters the data in a way that actually makes it harder to represent. As an example, a bitstream that 'corrects' a decompressed .MP3 file to match the original raw waveform data would be almost as difficult to compress as the raw file itself would be.
It wouldn't be a matter of simply computing and compressing deltas at each sample, because frequency-domain compression moves the bits around.
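You could poke at that claim with a rough experiment like the one below; the keep-the-strongest-FFT-bins step and the synthetic tones-plus-noise signal are crude stand-ins for MP3's psychoacoustic model and real music, so treat the printed numbers as suggestive at best:

    # Frequency-domain lossy round trip, then a per-sample delta:
    # compare how well zlib compresses the original PCM vs. the residual.
    import zlib
    import numpy as np

    rng = np.random.default_rng(0)

    # Fake "music": a few tones plus a little noise, quantized to 16-bit PCM.
    t = np.arange(2 ** 16) / 44100.0
    wave = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 660.0))
    wave += 0.05 * rng.standard_normal(t.size)
    pcm = (np.clip(wave / 4.0, -1.0, 1.0) * 32767).astype(np.int16)

    def lossy_roundtrip(samples, keep=0.05):
        """Keep only the strongest FFT bins, zero the rest, transform back."""
        spec = np.fft.rfft(samples.astype(np.float64))
        cutoff = np.quantile(np.abs(spec), 1.0 - keep)
        spec[np.abs(spec) < cutoff] = 0.0
        back = np.round(np.fft.irfft(spec, n=samples.size))
        return np.clip(back, -32768, 32767).astype(np.int16)

    approx = lossy_roundtrip(pcm)
    residual = pcm.astype(np.int32) - approx.astype(np.int32)
    assert np.array_equal(approx.astype(np.int32) + residual, pcm)  # lossless

    def zsize(arr):
        # Widen both arrays to int32 so the byte counts are comparable.
        return len(zlib.compress(arr.astype(np.int32).tobytes(), 9))

    print("original PCM       :", zsize(pcm), "bytes after zlib")
    print("per-sample residual:", zsize(residual), "bytes after zlib")

The interesting question is how far apart those two numbers end up: the residual is lower in amplitude, but it looks a lot like noise, which is exactly the point about frequency-domain codecs moving the bits around.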