While not exactly the idea in your post, this recent paper by Dubois, Bloem-Reddy, Ullrich, and Maddison on "Lossy Compression for Lossless Prediction" (https://arxiv.org/abs/2106.10800) kinda gets at this idea.
From the abstract:
Most data is automatically collected and only ever "seen" by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations...
Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than 1000× on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.
A sort-of related idea is: How small can an image of a human face be and still be recognizable? I remember X-Face images from mail headers years ago -- these were just 48 x 48 pixel, 1-bit images, i.e., just 288 bytes (yes, bytes, not KB or MB). I marveled that the faces were indeed recognizable. I wish I could find an X-Face of a famous person (or even an unknown person) to give as an example here, but this seems to be one of those things that are ungoogleable.
There's a chart in the tiny face paper [.] that compares face size against accuracy. The paper is about detecting any face rather than recognizing a specific one, though. It seems we can shrink down to 24x19 and still maintain 50% accuracy.
I do wonder if all the new AI-based "upscaling" algorithms, which can fill in detail to make an image larger, can be used to produce a compression-decompression algorithm: Perform a heavily lossy compression, then "upscale" the result and see if it returns to a good-enough match. If so, compress even further, until your upscaling no longer gives you a good-enough match.
Like regular JPG encoding, this can be set to an arbitrary level of "good enough."
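Roughly, the encoder becomes a search loop over compression strength. A minimal sketch of that loop, with plain downscaling standing in for the lossy step, bicubic resize standing in for the AI upscaler, and PSNR as the "good enough" test (all three are placeholder choices, not any particular method):

    import numpy as np
    from PIL import Image

    def upscale(small, size):
        # placeholder for a learned super-resolution model
        return small.resize(size, Image.BICUBIC)

    def psnr(a, b):
        # "good enough" test: peak signal-to-noise ratio in dB
        mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
        return float("inf") if mse == 0 else 10 * np.log10(255 ** 2 / mse)

    def compress_until_bad(img, good_enough_db=30.0):
        # keep shrinking until the upscaled reconstruction stops matching
        best, scale = img, 1.0
        while scale > 0.05:
            trial = scale * 0.8
            small = img.resize((max(1, int(img.width * trial)),
                                max(1, int(img.height * trial))))
            if psnr(upscale(small, img.size), img) < good_enough_db:
                break          # one step too far; keep the previous level
            best, scale = small, trial
        return best            # smallest version that still reconstructs OK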
This has been tried in research and works, with typical artefacts such as horses getting zebra-stripes if they stand in tall grass (or something like that ...)
> Dürer's Rhinoceros is the name commonly given to a woodcut executed by German painter and printmaker Albrecht Dürer in 1515.[1] The image is based on a written description and brief sketch by an unknown artist of an Indian rhinoceros that had arrived in Lisbon in 1515.[2] Dürer never saw the actual rhinoceros, which was the first living example seen in Europe since Roman times.
This works, and there's also a fun variant where you then losslessly compress the residual between ground truth and the up-sampled image. Since most of the information was encoded in the NN and already exists at both ends, the data requirements in between are modest.
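A minimal sketch of that residual variant, assuming an RGB image, with bicubic resizing standing in for the neural up-sampler and zlib standing in for a real entropy coder. The reconstruction is bit-exact because the decoder adds the stored residual back onto the same up-sampled prediction:

    import zlib
    import numpy as np
    from PIL import Image

    def encode(img, factor=4):
        # store a heavily downscaled image plus the losslessly compressed
        # residual (original minus the up-sampled prediction)
        small = img.resize((img.width // factor, img.height // factor))
        pred = np.asarray(small.resize(img.size, Image.BICUBIC), np.int16)
        residual = np.asarray(img, np.int16) - pred        # values in [-255, 255]
        packed = zlib.compress((residual + 255).astype(np.uint16).tobytes())
        return small, packed

    def decode(small, packed, size):
        # up-sample the small image the same way and add the residual back
        pred = np.asarray(small.resize(size, Image.BICUBIC), np.int16)
        residual = np.frombuffer(zlib.decompress(packed), np.uint16)
        residual = residual.reshape(pred.shape).astype(np.int16) - 255
        return Image.fromarray((pred + residual).astype(np.uint8))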
Sufficiently good compression is necessarily dimensionality reduction. AI can be jerry-rigged out of compression, and compression can be jerry-rigged out of AI. This truth may be relevant to the evolutionary history of natural intelligence.
Why stop at the AI's limit? We can do better than that. Keep going until a human can no longer recognise the image! (There's always a Mechanical Turk to do the work.)
And why stop when the image is unrecognisable? Why not translate the image of (say) a cat into an image of the word 'cat'? And if you've done that compression, take it a step further - translate the image of the cat into the text string 'cat'. Now you're getting seriously good compression ratios :)
A long time ago I thought that this kind of compression could be interesting for voice chat/phone calls. Locally convert speech to text, send raw text over the wire, and convert text back to speech.
You could fine-tune some existing text-to-speech model to sound like you. Once you transfer that to your friends/frequent contacts you could have voice conversations over very low bandwidth channels.
There are obviously plenty of caveats to this, like communicating sounds that aren’t speech. It could be a useful fallback e.g. if your network quality is seriously degraded then your client switches to TTS mode.
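Back-of-the-envelope on the bandwidth (the speaking rate and characters-per-word figures are rough assumptions, and the codec bitrates are just typical values for comparison):

    # Rough bandwidth comparison: speech sent as text vs. audio codecs.
    words_per_min = 150          # typical conversational speaking rate (assumption)
    chars_per_word = 6           # including the trailing space (assumption)
    bits_per_char = 8            # plain ASCII, before any text compression

    text_bps = words_per_min / 60 * chars_per_word * bits_per_char
    print(f"text transcript: ~{text_bps:.0f} bit/s")     # ~120 bit/s
    print("Codec2 (low-rate speech codec): ~700 bit/s")
    print("Opus voice (typical): ~16000-24000 bit/s")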
I don't think I've ever had a functional connection slower than that. You need packets to be making it through in a timely manner if you want a voice conversation, after all.
There are speech codecs that essentially do this but with a phonetic encoding, instead of "normal" text.
Yes, they are only used for extremely bandwidth-starved situations, but they do work. An example is Codec2, which works down to 0.7 kbit/s.
That's "dictionary" compression still. AI checks for hints to lookup in well... dictionary. If the dictionary changes, your compression results change too.
I mean what the words "cat on laptop" mean to the AI - that "dictionary" - or more like that state of things where F("cat on laptop") returns on specific thing, and then after some time might return another, unless you control that state.
PixelCNN++ does this, kinda. Residual blocks that use downsampled versions of the image. Though in that context, it gets used for lossless compression by entropy coding the likelihoods of each pixel.
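The link between likelihoods and compressed size is just the ideal code length of -log2(p) bits per pixel, which an arithmetic coder driven by the model approaches. A toy illustration with made-up probabilities (not actual PixelCNN++ outputs):

    import numpy as np

    # Ideal code length under a probabilistic model: -log2 p(pixel) bits each.
    # An arithmetic/range coder gets within a fraction of a bit of the total;
    # better models -> higher p for the observed pixels -> fewer bits.
    pixel_probs = np.array([0.50, 0.25, 0.125, 0.125])   # toy model outputs
    bits = -np.log2(pixel_probs)
    print(bits)           # [1. 2. 3. 3.]
    print(bits.sum())     # 9.0 bits total for these four pixels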
This idea - trying to investigate behavior at different scale parameters - is also related to renormalization. But I think you can go further than this - currently you are just preserving that which remains consistent. You should also somehow store the complement of that information - the difference between the original and what you get after downscaling then upscaling the downscaled version again.
I imagine it's going to be mostly the noisy data local to certain regions of the image, some kind of high frequency details. But if you keep track of both, you can make each rescaling bijective, making your 'compression' invertible, hence lossless, in some sense. Whether you get any size savings doing that is another question altogether :^)
> What would image compression look like if designed around other things' perceptions?
For computers, the question is meaningless because they only interact with our outside reality in the form of a "Chinese room" translation task. There's nothing inherently salient about image data to a computer, because the "ground floor" reality of computers is instructions for moving electrical charges around on some chips. But, given computers with built in cameras and having some use for discerning its reality of objects in 3d space, subject to Newtonian physics, and projected down to a 2d CCD array, the ideal compression would be subject to the same pressures as influenced the evolution of the human eyes and brain. Thus it should be similar.
Isn't it funny how similar compression artifacts can be to the visual distortion of psychedelics.
> But, given computers with built in cameras and having some use for discerning its reality of objects in 3d space, subject to Newtonian physics, and projected down to a 2d CCD array, the ideal compression would be subject to the same pressures as influenced the evolution of the human eyes and brain. Thus it should be similar.
You have a good point! Though I wouldn't say the 'same' pressures. We can actually identify differences in those pressures, and use them to speculate.
Eg human computational capacity does well in terms of total operations performed, but individual units are really, really slow compared to electronics. That means human hardware can't run anything that requires a lot of 'serial depth', everything has to be almost embarrassingly parallel to work, or involve a lot of lookup from caches.
Also the design of humans is constrained by what evolution can do. Eg we still have blind spots in our eyes, mostly because evolution can only take 'small steps' (small in the space of genes), and inverting the eye so that the light receptive cells are in front instead of behind the nerves and blood vessels is not a small step in that space.
The power constraints on humans and computers are likely to be different. Humans are also not networked with sub-second latencies to the rest of the world.
(Just to be clear, the examples I gave are second-order corrections to your much more important first-order idea!)
I'll take it one step further. The arrangement of rods and cones in the eye is optimized precisely around the task of capturing the salient details of a projected image. The density falls off exponentially away from the center precisely because the brain is working in log-polar coordinates (or similar). CCD arrays are stuck with the naive 2d grid arrangement.
Though: I can believe that our arrangement is a local optimum, sure. (Local in some genetic sense.)
That still doesn't make me believe that we are anywhere close to a global optimum. Behold the octopus:
> Do these design problems [of optical blind spots] exist because it is impossible to construct an eye that is wired properly, so that the light-sensitive cells face the incoming image? Not at all. Many organisms have eyes in which the neural wiring is neatly tucked away behind the photoreceptor layer. The squid and the octopus, for example, have a lens-and-retina eye quite similar to our own, but their eyes are wired right-side out, with no light-scattering nerve cells or blood vessels in front of the photoreceptors, and no blind spot.
Reminds me of when I, as a 14-year-old barely two years after I started learning programming and having just learned about cyclic redundancy checks (CRC), wondered why they couldn't be used for compression. Take a block of, say, 16 bytes, calculate the CRC code and store that.
To decompress you then just had to find the string of bytes that generated a matching CRC code.
It seemed so brilliant, but something told me it was too good to be true, and surely someone would have come up with this before me... So I tried figuring out why it wouldn't work. Still, took me a day or so before I figured out how long "just" was and that collisions would render it moot anyway.
Learned a lot from that though, as an aspiring programmer.
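For anyone who wants the numbers, the counting argument is short: a 16-byte block has 2^128 possible values but a CRC-32 has only 2^32 possible codes, so each code is shared by about 2^96 different blocks, and the brute-force "decompression" search space is hopeless anyway. Quick sketch:

    # Why CRC-as-compression can't work: pigeonhole counting.
    block_bits = 16 * 8                 # 128-bit input blocks
    crc_bits = 32                       # CRC-32 output

    inputs_per_code = 2**block_bits // 2**crc_bits
    print(f"blocks sharing each CRC value: 2^{block_bits - crc_bits} "
          f"= {inputs_per_code:.3e}")   # ~7.9e28 colliding blocks per code

    # And even ignoring collisions, the brute-force search space is 2^128:
    print(f"candidate blocks to try when 'decompressing': {2**block_bits:.3e}")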
Reminds me of my genius compression algorithm from when I was around 12. Information on the computer is just numbers, right? So why not just divide the number by two over and over until it's in the single digits. Infinite compression. The only thing left to figure out is how to compress all the remainders from the numbers that aren't divisible by two.
Needless to say, I never could solve that problem.
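The remainders are exactly where the scheme dies: halving repeatedly and keeping each remainder just reads off the number's binary digits, so the "leftovers" are the entire original and nothing got compressed. Quick check (halving all the way down to 1 for simplicity):

    # Halve a number repeatedly, keeping the remainders: you get its binary
    # digits back, bit for bit - the "remainders left to compress" carry all
    # of the original information.
    n = 1_000_003
    bits = []
    while n > 1:
        bits.append(n % 2)
        n //= 2
    bits.append(n)                              # the final 1
    print("".join(map(str, reversed(bits))))    # 11110100001001000011
    print(bin(1_000_003)[2:])                   # the same string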
> formats like JPEG are designed around human perception
This isn't really true. JPEG is based on the discrete cosine transform, a Fourier-type transform (JPEG 2000 uses wavelets). It will compress any image by quantizing and discarding transform coefficients, regardless of whether the input image makes sense to humans.
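For concreteness, the lossy step in baseline JPEG is an 8x8 block DCT followed by quantization of the coefficients. Rough sketch below; the uniform quantization step is only a stand-in for JPEG's actual quantization tables:

    import numpy as np
    from scipy.fft import dctn, idctn

    def lossy_block(block, step=40):
        # 8x8 DCT, quantize the coefficients, transform back
        coeffs = dctn(block.astype(float), norm="ortho")
        quantized = np.round(coeffs / step)          # small coefficients -> 0
        return idctn(quantized * step, norm="ortho"), np.count_nonzero(quantized)

    # A smooth block, like most natural-image content: the energy sits in a
    # few low-frequency coefficients, so quantization zeros out nearly all 64.
    xx, yy = np.meshgrid(np.arange(8), np.arange(8))
    block = 100 + 10 * xx + 5 * yy
    recon, kept = lossy_block(block)
    print(kept, "of 64 coefficients kept; max pixel error:",
          int(np.abs(recon - block).max()))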
This reminded me of a weird idea that I had several years ago (which is kind of the opposite of this):
What if there were a format that doesn't encode pixels, but instead uses AI to generate a "description" of the image, from simple shapes up to higher-level objects, and saves that in its own format, with a "decoder" that takes this description and constructs an image similar to the original, at least to some extent?
Of course this would have many drawbacks like encoding micro details and expressions of people, but still, I'd love to see how far the idea can go.
>Fleet Central refused the full video link coming from the Out of Band … Kjet had to settle for a combat link: The screen showed a color image with high resolution. Looking at it carefully, one realized the thing was a poor evocation…. Kjet recognized Owner Limmende and Jan Skrits, her chief of staff, but they looked several years out of style: old video matched with the transmitted animation cues. The actual communication channel was less than four thousand bits per second
>The picture was crisp and clear, but when the figures moved it was with cartoonlike awkwardness. And some of the faces belonged to people Kjet knew had been transferred before the fall of Sjandra Kei. The processors here on the Ølvira were taking the narrowband signal from Fleet Central, fleshing it out with detailed (and out of date) background and evoking the image shown.
This could be useful to reduce Disney’s bandwidth costs. Marvel and Star Wars productions are highly repetitive anyway, so maybe they could ship a preloaded Marvel neural network in the player app and design new movies to take maximum advantage of the clichés already encoded there.
They used to literally reuse the animation in older Disney films, just with different characters on top. You could send that information, the scene and the characters, in significantly fewer bits than it would take as video.
If I’m not mistaken, this is more generalizable as a Silly HashMap Algorithm Idea, which is probably a worthy pursuit. The idea is to find the minimal meaningful key with the highest likelihood of resolving to the correct value. Compression is meaningless for the former (the key) but valuable for the latter (the value).
The smallest compressor is just the (differentiable) classifier itself. The compressed image is just the output class, and the de-compressor generates a random image then back-propagates the class error until it gets an image that has the right class. Maybe you toss in a random seed in the format, so the compressor can spend some more time to guarantee the de-compressor will complete in a reasonable amount of time.
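A toy sketch of that decode-by-backprop idea in PyTorch, with a tiny untrained CNN standing in for a real trained classifier (so the output won't actually look like anything); the point is just the mechanics: the "file" is (class_id, seed), and the decoder starts from seeded noise and optimizes the image toward that class:

    # Decode-by-optimization: the "compressed file" is just (class_id, seed).
    # The decoder seeds an image with noise and runs gradient ascent on the
    # classifier's score for that class.
    import torch
    import torch.nn as nn

    classifier = nn.Sequential(              # stand-in for a trained model
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
    )

    def decompress(class_id, seed, steps=200, lr=0.1):
        torch.manual_seed(seed)                       # shared seed = shared start
        img = torch.randn(1, 3, 32, 32, requires_grad=True)
        opt = torch.optim.Adam([img], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = -classifier(img)[0, class_id]      # maximize the class logit
            loss.backward()
            opt.step()
        return img.detach().clamp(0, 1)

    reconstruction = decompress(class_id=3, seed=42)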
He's a little confused, but he's got the right idea. There is a very natural and straightforward application of ML to compression: the better your model of P(data) gets, the better your compression ratio.
Pretty cool stuff!