
Sorry, I didn't get back to this until now.

I was thinking of lossless compression; lossy is another can of worms on top of lossless. Lossless compression works, in principle, on the realization that the probability distribution over all possible input strings is not flat. Some strings are more probable, and some are less probable. The reason some strings are more probable is that they result not from random processes, but from some process or algorithm that generates them. Thus they have inherent structure. A random string is one with no structure detectable by any algorithm smaller than the string itself. If the probability distribution of the input strings is not flat, then we can use entropy coding to describe the common case, non-random data, in fewer bits than the input.
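
To make the "non-flat distribution means fewer bits" point concrete, here is a minimal sketch (my own illustration, not from the original argument) that measures the empirical entropy of a byte string. Entropy coders like Huffman or arithmetic coding approach this bound:

    # Empirical Shannon entropy of a byte stream, in bits per byte.
    # A skewed (structured) distribution needs far fewer bits per symbol
    # than the 8 bits a flat distribution would require.
    import math
    from collections import Counter

    def bits_per_symbol(data: bytes) -> float:
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    structured = b"abababababababab" * 64   # highly structured input
    print(bits_per_symbol(structured))      # ~1.0 bit/byte instead of 8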

However, the difference between specific compression algorithms and a "general compression algorithm" lies in the assumptions they make. Most compression algorithms don't consider the probability distribution of the "full" set of input strings; rather, they divide the input string into predictably sized chunks and consider the distribution of those chunks. This is way simpler, and lets them compress somewhat well while still using either a rather "static" distribution (like Morse code) or only simple algorithms (like adaptive Huffman coding) to adapt the distribution to the input data.
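
As a hedged sketch of that chunk-by-chunk style (my example, assuming single bytes as the chunks): split the input into fixed-size symbols and charge each symbol -log2 of its probability under the counts seen so far, which is roughly the ideal code length an adaptive coder would achieve:

    # Adaptive per-symbol model: Laplace-smoothed counts updated as we go,
    # summing the ideal code length (-log2 p) for each byte.
    import math
    from collections import defaultdict

    def adaptive_code_length(data: bytes) -> float:
        counts = defaultdict(lambda: 1)   # one pseudo-count per byte value
        total = 256
        bits = 0.0
        for sym in data:
            bits += -math.log2(counts[sym] / total)  # cost of coding this symbol
            counts[sym] += 1                          # adapt the model afterwards
            total += 1
        return bits

    print(adaptive_code_length(b"abababab" * 100))   # far below 8 bits per byte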

But if we don't restrict ourselves to the world of "compressing a stream, message by message", but allow "intelligent" compression that may use any computable means to achieve a smaller compressed size and can adapt to the whole context of the input, we can see that "message-by-message" entropy coding is only a subset of what we can do. (And we of course _also_ have that subset in our toolbox.) But the true challenge of compression now becomes finding and representing approximations to the "full" distribution of the input data. That involves things like identifying structure in weather data! If the input is large enough to start with, the more complex model might be worth it. And if it isn't, we can intelligently just decide to use DEFLATE.
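
A minimal sketch of that "use the richer model only if it pays off" choice (my illustration; weather_model_compress is a hypothetical placeholder, not a real library call): try DEFLATE and the domain-specific model, keep whichever output is smaller, and record the choice in a one-byte tag so the decompressor knows which path to invert:

    import zlib

    def smart_compress(data: bytes, weather_model_compress=None) -> bytes:
        candidates = [b"\x00" + zlib.compress(data, 9)]          # plain DEFLATE
        if weather_model_compress is not None:
            candidates.append(b"\x01" + weather_model_compress(data))  # structured model
        return min(candidates, key=len)                          # keep the smaller result

    print(len(smart_compress(b"some small input")))              # falls back to DEFLATE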

> what room is there for "intelligence" in a rigid, statically-defined, system?

But as we can see from the above, intelligence that is tasked with compressing stuff surely doesn't need to be "rigid"? The intelligence is in the compressor, not in the compressed result. The compressed result might be assembly code that unpacks itself by performing arbitrary computations, and to achieve good results, the generator of that code must be sufficiently intelligent.
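
To illustrate the "compressed result is itself a program" idea (my own sketch, emitting Python source instead of assembly just to keep it short, and using a fixed zlib call where a smarter generator could emit arbitrary decompression logic):

    import zlib

    def compress_to_program(data: bytes) -> str:
        # The "archive" is a script that reproduces the original bytes when run.
        payload = zlib.compress(data, 9)
        return (
            "import sys, zlib\n"
            f"sys.stdout.buffer.write(zlib.decompress({payload!r}))\n"
        )

    print(compress_to_program(b"hello " * 1000))   # a self-unpacking script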



