Silly Image Compression Idea (snufk.in)
87 points by sn_fk_n on April 10, 2022 | 56 comments



While not exactly the idea in your post, this recent paper by Dubois, Bloem-Reddy, Ullrich, and Maddison on "Lossy Compression for Lossless Prediction" (https://arxiv.org/abs/2106.10800) kinda gets at this idea.

From the abstract:

Most data is automatically collected and only ever "seen" by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations...

Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than 1000× on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.

Pretty cool stuff!


haha! This is exactly what I was going for, thank you :)


A sort-of related idea is: How small can an image of a human face be and still be recognizable? I remember X-face images from mail headers years ago -- these were just 48 x 48 x 1 pixel images, i.e., just 288 bytes (yes, bytes, not KB or MB). I marveled that the faces were indeed recognizable. I wish I could find an example of an X-face of a famous person (or even an unknown person) to give as an example here, but this seems to be one of those things that are ungoogleable.


After some digging ("x-face image usenet" on DDG yielded a link to a page on curlie.org), I found this collection: https://ace.home.xs4all.nl/X-Faces/


There's a chart in the tiny face paper [.] that compares face size against accuracy. The paper is about detecting any face rather than recognizing a specific face, though. Seems we can shrink down to 24x19 and still maintain 50% accuracy.

[.] https://openaccess.thecvf.com/content_cvpr_2017/papers/Hu_Fi... Figure 9


I do wonder if all the new AI-based "upscaling" algorithms, which can fill in detail to make an image larger, can be used to produce a compression-decompression algorithm: Perform a heavily lossy compression, then "upscale" the result and see if it returns to a good-enough match. If so, compress even further, until your upscaling no longer gives you a good-enough match.

Like regular JPG encoding, this can be set to an arbitrary level of "good enough."
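
Something like this search loop, as a rough sketch -- here plain bicubic resizing stands in for a learned upscaler and PSNR stands in for whatever "good enough" metric you'd actually want:

    import numpy as np
    from PIL import Image

    def psnr(a, b):
        # peak signal-to-noise ratio between two 8-bit images
        mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

    def shrink_until_it_breaks(img, min_psnr=30.0):
        original = np.asarray(img)
        best, scale = img, 1.0
        while scale > 0.05:
            scale *= 0.8
            w, h = max(1, int(img.width * scale)), max(1, int(img.height * scale))
            small = img.resize((w, h), Image.BICUBIC)         # the "heavily lossy compression"
            restored = small.resize(img.size, Image.BICUBIC)  # stand-in for the AI upscaler
            if psnr(original, np.asarray(restored)) < min_psnr:
                break                                         # the previous scale was the limit
            best = small
        return best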


This has been tried in research and works, with typical artefacts such as horses getting zebra-stripes if they stand in tall grass (or something like that ...)


That sounds very similar to Dürer's Rhinoceros. Or in general, learned people trying to draw from even detailed descriptions.

https://en.wikipedia.org/wiki/D%C3%BCrer%27s_Rhinoceros

> Dürer's Rhinoceros is the name commonly given to a woodcut executed by German painter and printmaker Albrecht Dürer in 1515.[1] The image is based on a written description and brief sketch by an unknown artist of an Indian rhinoceros that had arrived in Lisbon in 1515.[2] Dürer never saw the actual rhinoceros, which was the first living example seen in Europe since Roman times.


That's pretty much how AI image compression works - you're describing a standard autoencoder (encoder/decoder) architecture with some adaptiveness.
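
For anyone who hasn't seen one, the shape of it is tiny - a bare-bones PyTorch sketch (layer sizes are arbitrary), where the bottleneck activations are the "compressed" representation and you train the whole thing to reproduce its input:

    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self, bottleneck_channels=8):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),         # HxW -> H/2 x W/2
                nn.Conv2d(16, bottleneck_channels, 4, stride=2, padding=1),  # -> H/4 x W/4 bottleneck
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(bottleneck_channels, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            # train with e.g. MSE between forward(x) and x
            return self.decoder(self.encoder(x))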


This works, and there's also a fun variant where you then losslessly compress the residual between ground truth and the up-sampled image. Since most of the information was encoded in the NN and already exists at both ends, the data requirements in between are modest.
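
Roughly like this, with plain resizing standing in for the NN up-sampler (the residual is usually low-entropy, so it compresses well, and adding it back makes the round trip bit-exact):

    import numpy as np
    from PIL import Image

    def encode(img, factor=4):
        small = img.resize((img.width // factor, img.height // factor), Image.BICUBIC)
        approx = np.asarray(small.resize(img.size, Image.BICUBIC), dtype=np.int16)
        residual = np.asarray(img, dtype=np.int16) - approx  # mostly small values
        return small, residual                               # entropy-code both for transmission

    def decode(small, residual, full_size):
        # full_size is (width, height), i.e. the original img.size
        approx = np.asarray(small.resize(full_size, Image.BICUBIC), dtype=np.int16)
        return (approx + residual).astype(np.uint8)          # exact original back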


Sufficiently good compression is necessarily dimensionality reduction. Out of compression, AI can be jury-rigged, and out of AI, compression can be jury-rigged. This truth may be relevant to the evolutionary history of natural intelligence.


Why stop at the 'AI's limit? We can do better than that. Keep going until a human can no longer recognise the image! (There's always a mechanical Turk to do the work.)

And why stop when the image is unrecognisable? Why not translate the image of (say) a cat into an image of the word 'cat'? And if you've done that compression, take it a step further - translate the image of the cat into the text string 'cat'. Now you're getting seriously good compression ratios :)


A long time ago I thought that this kind of compression could be interesting for voice chat/phone calls. Locally convert speech to text, send raw text over the wire, and convert text back to speech.

You could fine tune some existing text to speech model to sound like you. Once you transfer that to your friends/frequent contacts you could have voice conversations over very low bandwidth channels.

There are obviously plenty of caveats to this, like communicating sounds that aren’t speech. It could be a useful fallback e.g. if your network quality is seriously degraded then your client switches to TTS mode.


Well there's already speech-oriented codecs that can go down below a kilobyte per second, and there's this one that throws a neural net at the problem to help squeeze bytes: https://ai.googleblog.com/2021/08/soundstream-end-to-end-neu...

I don't think I've ever had a functional connection slower than that. You need packets to be making it through in a timely manner if you want a voice conversation, after all.


> I don't think I've ever had a functional connection slower than that.

I _may_ still have a 300 baud modem in a box in the garage. Possibly a 1200/75 one too. (1985 sometimes doesn't seem _that_ long ago...)


Well, you could have just talked on the line that modem was using. That's not really the kind of limit I'm worried about.


There are speech codecs that essentially do this but with a phonetic encoding, instead of "normal" text. Yes, they are only used for extremely bandwidth-starved situations, but they do work. An example is Codec2, which works down to 0.7 kbit/s.


Great idea! In fact, look at what Nvidia is doing with video calls (spoiler: exactly what you're saying, but for video):

https://blogs.nvidia.com/blog/2020/10/05/gan-video-conferenc...



That's "dictionary" compression still. AI checks for hints to lookup in well... dictionary. If the dictionary changes, your compression results change too.


You could use an English dictionary and compress the image to words. "Cat on laptop" for example.


I mean what the words "cat on laptop" mean to the AI - that's the "dictionary" - or rather the state of things where F("cat on laptop") returns one specific thing now and might return another later, unless you control that state.


PixelCNN++ does this, kinda. Residual blocks that use downsampled versions of the image. Though in that context, it gets used for lossless compression by entropy coding the likelihoods of each pixel.

This idea - trying to investigate behavior at different scale parameters - is also related to renormalization. But I think you can go further than this - currently you are just preserving that which remains consistent. You should also somehow store the complement of that information - the difference between the original and what you get after downscaling then upscaling the downscaled version again.

I imagine it's going to be mostly the noisy data local to certain regions of the image, some kind of high frequency details. But if you keep track of both, you can make each rescaling bijective, making your 'compression' invertible, hence lossless, in some sense. Whether you get any size savings doing that is another question altogether :^)


> What would image compression look like if designed around other things perceptions?

For computers, the question is meaningless because they only interact with our outside reality in the form of a "Chinese room" translation task. There's nothing inherently salient about image data to a computer, because the "ground floor" reality of computers is instructions for moving electrical charges around on some chips. But, given computers with built in cameras and having some use for discerning its reality of objects in 3d space, subject to Newtonian physics, and projected down to a 2d CCD array, the ideal compression would be subject to the same pressures as influenced the evolution of the human eyes and brain. Thus it should be similar.

Isn't it funny how similar compression artifacts can be to the visual distortion of psychedelics.


> But, given computers with built in cameras and having some use for discerning its reality of objects in 3d space, subject to Newtonian physics, and projected down to a 2d CCD array, the ideal compression would be subject to the same pressures as influenced the evolution of the human eyes and brain. Thus it should be similar.

You have a good point! Though I wouldn't say the 'same' pressures. We can actually identify differences in those pressures, and use those to speculate.

Eg human computational capacity does well in terms of total operations performed, but individual units are really, really slow compared to electronics. That means human hardware can't run anything that requires a lot of 'serial depth', everything has to be almost embarrassingly parallel to work, or involve a lot of lookup from caches.

Also the design of humans is constrained by what evolution can do. Eg we still have blind spots in our eyes, mostly because evolution can only take 'small steps' (small in the space of genes), and inverting the eye so that the light receptive cells are in front instead of behind the nerves and blood vessels is not a small step in that space.

The power constraints on humans and computers are likely to be different. Humans are also not networked with sub-second latencies to the rest of the world.

(Just to be clear, the examples I gave are second-order corrections to your much more important first-order idea!)


I'll take it one step further. The arrangement of rods and cones in the eye is optimized precisely around the task of capturing the salient details of a projected image. The distribution has an exponential form radiating from the center because the brain is working in log-polar coordinates (or similar). CCD arrays are stuck with the naive 2d grid arrangement.

https://www.math.utah.edu/~bresslof/publications/01-3.pdf
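
As a toy illustration of that log-polar sampling (just nearest-neighbour lookups on a numpy image array, with rings spaced exponentially so samples are dense at the "fovea" and sparse at the edge):

    import numpy as np

    def log_polar_samples(img, n_rings=32, n_angles=64):
        h, w = img.shape[:2]
        cy, cx = h / 2.0, w / 2.0
        max_r = min(cy, cx) - 1
        radii = np.exp(np.linspace(0, np.log(max_r), n_rings))  # 1 .. max_r, exponentially spaced
        angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
        ys = np.clip((cy + radii[:, None] * np.sin(angles)).astype(int), 0, h - 1)
        xs = np.clip((cx + radii[:, None] * np.cos(angles)).astype(int), 0, w - 1)
        return img[ys, xs]  # shape (n_rings, n_angles, ...) -- far fewer samples than h*w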


Interesting!

Though: I can believe that our arrangement is a local optimum, sure. (Local in some genetic sense.)

That still doesn't make me believe that we are anywhere close to a global optimum. Behold the octopus:

> Do these design problems [of optical blind spots] exist because it is impossible to construct an eye that is wired properly, so that the light-sensitive cells face the incoming image? Not at all. Many organisms have eyes in which the neural wiring is neatly tucked away behind the photoreceptor layer. The squid and the octopus, for example, have a lens-and-retina eye quite similar to our own, but their eyes are wired right-side out, with no light-scattering nerve cells or blood vessels in front of the photoreceptors, and no blind spot.

From https://www.pbs.org/wgbh/evolution/change/grand/page05.html


Compression with no plan in sight for decompression... at least it's billed as "silly" first and foremost.


I've seen worse; someone presented code for decompression of a fictional data format with no plan in sight for compression.

When I asked how the compression part would be implemented, the author called me stupid for asking this question and told me the project was inspired by https://en.wikipedia.org/wiki/Sloot_Digital_Coding_System


Reminds me of when I, as a 14-year-old barely two years into learning programming and having just learned about cyclic redundancy checks (CRC), wondered why they couldn't be used for compression. Take a block of, say, 16 bytes, calculate the CRC code and store that.

To decompress you then just had to find the string of bytes that generated a matching CRC code.

It seemed so brilliant, but something told me it was too good to be true, and surely someone would have come up with this before me... So I tried figuring out why it wouldn't work. Still, it took me a day or so before I figured out how long "just" was, and that collisions would render it moot anyway.
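
Putting numbers on it (assuming a 32-bit CRC): a 16-byte block is 128 bits, so each CRC value is shared by about 2^96 different blocks - "decompression" couldn't know which one you meant even if you could afford the brute-force search:

    >>> 2 ** (128 - 32)   # candidate 16-byte blocks per 32-bit CRC value
    79228162514264337593543950336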

Learned a lot from that though, as an aspiring programmer.


Reminds me of my genius compression algorithm from when I was around 12. Information on the computer is just numbers, right? So why not just divide the number by two over and over until it's in the single digits. Infinite compression. The only thing left to figure out is how to compress all the remainders from the numbers that aren't divisible by two.

Needless to say, I never could solve that problem.


No no, you need to store the CRC and the output of an object recogniser. A 16 byte CRC plus the word "cat" will have way way fewer collisions!


You are joking, but something a bit like this can actually work with some extra assumptions.

Have a look at compressed sensing!


That is simply amazing, thanks for sharing :)


Compression doesn’t imply decompression… see JPEG, H264, etc…


Maybe you could elaborate… what do you mean? Both of those examples (JPEG, H264) have decompressors for viewing, right?


I think the point is that some compression algorithms are lossy (the original cannot be reproduced by decompression) and others are not.

You can’t restore a jpeg to its original image.


> formats like JPEG are designed around human perception

This isn't really true. JPEG is DCT-based (JPEG 2000 uses wavelets). It will compress any image by discarding transform coefficients, regardless of whether the input image makes sense to humans.
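
E.g. the core transform-coding step looks something like this toy sketch (8x8 block DCT, drop the small coefficients, invert) - it behaves the same whether the block comes from a photo of a face or from pure noise:

    import numpy as np
    from scipy.fft import dctn, idctn

    def lossy_block(block8x8, keep=8):
        coeffs = dctn(block8x8, norm="ortho")
        cutoff = np.sort(np.abs(coeffs).ravel())[-keep]  # magnitude of the keep-th largest coefficient
        coeffs[np.abs(coeffs) < cutoff] = 0              # discard everything smaller
        return idctn(coeffs, norm="ortho")               # blurry-but-close reconstruction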


> What would image compression look like if designed around other things perceptions?

If the other creature's vision is in a different wavelength range, then RGB images aren't applicable to it in the first place.


This reminded me of a weird idea that I had several years ago (which is kind of opposite of this):

What if there were a format that doesn't encode pixels but instead uses AI to generate a "description" of the image, from simple shapes up to higher-level objects, and saves that in its own format, with a "decoder" that takes this description and constructs an image similar to the original, at least to some extent?

Of course this would have many drawbacks like encoding micro details and expressions of people, but still, I'd love to see how far the idea can go.


Vernor Vinge, A Fire Upon The Deep (1992):

>Fleet Central refused the full video link coming from the Out of Band … Kjet had to settle for a combat link: The screen showed a color image with high resolution. Looking at it carefully, one realized the thing was a poor evocation…. Kjet recognized Owner Limmende and Jan Skrits, her chief of staff, but they looked several years out of style: old video matched with the transmitted animation cues. The actual communication channel was less than four thousand bits per second

>The picture was crisp and clear, but when the figures moved it was with cartoonlike awkwardness. And some of the faces belonged to people Kjet knew had been transferred before the fall of Sjandra Kei. The processors here on the Ølvira were taking the narrowband signal from Fleet Central, fleshing it out with detailed (and out of date) background and evoking the image shown.


This could be useful to reduce Disney’s bandwidth costs. Marvel and Star Wars productions are highly repetitive anyway, so maybe they could ship a preloaded Marvel neural network in the player app and design new movies to take maximum advantage of the clichés already encoded there.


They used to literally reuse the animation in older Disney films, just with different characters on top. You could send that information - the scene and the characters - in significantly fewer bits than it would take as video.


Not entirely unlike fractal compression[1]. Was a lot of hype around it way back, but clearly it didn't quite pan out.

[1]: https://en.wikipedia.org/wiki/Fractal_compression


that's roughly the idea for gan-based image compression, check out: https://data.vision.ee.ethz.ch/aeirikur/extremecompression


It's called SVG :)


Not really. But vectorized shapes could certainly help with rasterizing the descriptions.


that's like heraldry, rebuilding an image from a precise/regimented verbal description


sounds like vector graphics?


Nope. But vector graphics can be of great utility when rasterizing the description into a bitmap.


If I’m not mistaken, this is more generalizable as Silly HashMap Algorithm Idea, which is probably a worthy pursuit. The idea is to find the minimal meaningful key with the highest likelihood of resolving to the correct value. Compression is meaningless to the former but valuable for the latter.


The smallest compressor is just the (differentiable) classifier itself. The compressed image is just the output class, and the de-compressor generates a random image then back-propagates the class error until it gets an image that has the right class. Maybe you toss in a random seed in the format, so the compressor can spend some more time to guarantee the de-compressor will complete in a reasonable amount of time.
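
Something like this for the "decompressor" side (a sketch in PyTorch - `classifier` here is a placeholder for whatever differentiable model both ends share):

    import torch
    import torch.nn.functional as F

    def decompress(class_id, classifier, seed=0, steps=500, shape=(1, 3, 224, 224)):
        torch.manual_seed(seed)                      # the seed shipped in the "format"
        img = torch.rand(shape, requires_grad=True)  # start from agreed-upon noise
        opt = torch.optim.Adam([img], lr=0.05)
        target = torch.tensor([class_id])
        for _ in range(steps):
            opt.zero_grad()
            loss = F.cross_entropy(classifier(img), target)  # back-propagate the class error
            loss.backward()
            opt.step()
            img.data.clamp_(0, 1)                    # keep it a valid image
        return img.detach()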


Congrats on reinventing the autoencoder!


This isn't an autoencoder.


IMO it is half of one, just the feed forward encoder and a bottleneck, without the decoder. What would you suggest it to be?


He's a little confused but he's got the right idea. There is a very natural and straightforward application of ML to compression: the better your model of P(data) gets, the better your compression ratio.
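
Concretely, with an entropy coder the cost of a symbol x under your model q is about -log2 q(x) bits, so any improvement in the model shows up directly in the bitstream:

    import math

    def ideal_bits(q_x):
        # arithmetic coding gets within ~1 bit of this total for a whole message
        return -math.log2(q_x)

    print(ideal_bits(0.5))   # 1.0 bit for a 50/50 symbol
    print(ideal_bits(0.99))  # ~0.014 bits when the model is nearly sure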



