These papers, from my quick skim (though I did read the first one fully years ago), seem to show that some images, and to an extent video, can be generated from discrete tokens, but they do not show that an exact image, or that any arbitrary image, can be.
For instance, what combination of tokens must I put in to get _exactly_ the Mona Lisa or The Starry Night? (Though these might be very well represented in the data set. Maybe a lesser-known image would be a better example.)
As I understand it, OC was saying that they can’t produce what they want with any degree of precision since there’s no way to encode that information in discrete tokens.
If you want to know which tokens you need to obtain _exactly_ the Mona Lisa, or any other image, you take the image and put it through your image tokenizer (i.e. encode it). Once you have the sequence of tokens, you can decode it back to an image.
The encoding-decoding round trip is approximately reversible: you only lose some imperceptible "details". The process can be trained with either an L2 loss or a perceptual loss, depending on what you value.
The point being that naturally occurring images are not really information-rich and can be compressed a lot by neural networks of a few GB that have seen billions of pictures. With that strong prior, aka common knowledge, we can indeed paint with words.
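To make the encode/decode round trip concrete, here is a toy sketch of a vector-quantization tokenizer in plain Python. It is only an illustration under loud assumptions: the codebook is random rather than learned, an "image" is just a list of 4-number patches, and the names `make_codebook`, `encode`, `decode` and `mse` are made up for this example, not taken from any of the papers.

```python
import random

def make_codebook(num_codes, patch_dim, seed=0):
    # Assumption: a random codebook stands in for one learned by a real tokenizer.
    rng = random.Random(seed)
    return [[rng.random() for _ in range(patch_dim)] for _ in range(num_codes)]

def sq_dist(a, b):
    # Squared Euclidean distance between two patches.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(patches, codebook):
    # Each patch becomes the index of its nearest codebook vector (a discrete token).
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
        for p in patches
    ]

def decode(tokens, codebook):
    # Each token is looked up in the codebook to rebuild a patch.
    return [list(codebook[t]) for t in tokens]

def mse(a, b):
    # Mean squared error over all patch values (an L2-style reconstruction error).
    total = sum(sq_dist(p, q) for p, q in zip(a, b))
    return total / (len(a) * len(a[0]))

rng = random.Random(1)
codebook = make_codebook(num_codes=16, patch_dim=4)
image = [[rng.random() for _ in range(4)] for _ in range(8)]  # 8 toy "patches"

tokens = encode(image, codebook)   # image -> token sequence
recon = decode(tokens, codebook)   # token sequence -> image
error = mse(image, recon)          # nonzero: the round trip is lossy

print(tokens)
print(error)
```

The tokens for a given image are found by encoding it, and decoding those tokens gives back a close but not identical image. Re-encoding the reconstruction returns the same tokens, so the loss happens once, at quantization. A real tokenizer shrinks that error by learning the codebook and the encoder/decoder on lots of images.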
Maybe I’m not able to articulate my thought well enough.
Taking an existing image and reversing the process to get the tokens that lead to it, then redoing that, doesn’t seem the same as inserting tokens to get a precise novel image.
Especially since, as you said, we’d lose some details: that suggests not all images can be perfectly described and recreated.
I suppose I’ll need to play around with some of those techniques.
After encoding, the models are usually cascaded with either an LLM or a diffusion model.
Natural image -> sequence of tokens: not every possible sequence of tokens is reachable, just as plenty of letter combinations form nonsensical words.
Sequence of tokens -> natural image: if the initial sequence of tokens is nonsensical, the decoded image will be garbage.
So usually you then model the sequence of tokens so that it produces sensible sequences, like you would with an LLM, and you use the LLM to generate more tokens. It also gives you a natural interface to control the generation: you can express in words what modifications should be made to the image. That lets you find the golden sequence of tokens that corresponds to the Mona Lisa by conversing with the LLM, which has been trained to translate from English to visual-word sequences.
Alternatively, instead of an LLM you can use a diffusion model. There the visual words are usually continuous, but you can still shift them iteratively with text using things like "ControlNet" (Stable Diffusion).
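As a rough illustration of "modeling the token sequence so it only produces sensible sequences", here is a toy bigram model in plain Python. It is a deliberately tiny stand-in for the LLM, not how any real system works: the training sequences are invented, and `fit_bigram` and `sample` are hypothetical names for this sketch.

```python
import random
from collections import defaultdict

def fit_bigram(sequences):
    # Count how often each token follows each other token in the training data.
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def sample(counts, start, length, seed=0):
    # Generate a sequence using only transitions that were seen in training.
    rng = random.Random(seed)
    seq = [start]
    while len(seq) < length:
        followers = counts.get(seq[-1])
        if not followers:
            break  # no known continuation for this token
        tokens = list(followers)
        weights = [followers[t] for t in tokens]
        seq.append(rng.choices(tokens, weights=weights)[0])
    return seq

# Assumption: these stand for token sequences produced by encoding real images.
training = [[0, 1, 2, 3], [0, 1, 3, 2], [2, 3, 0, 1]]
model = fit_bigram(training)
generated = sample(model, start=0, length=6)

# Every adjacent pair in the generated sequence was observed in training.
seen_pairs = {(a, b) for seq in training for a, b in zip(seq, seq[1:])}
print(generated)
print(all(pair in seen_pairs for pair in zip(generated, generated[1:])))
```

A sequence drawn uniformly at random would often contain pairs never seen in any encoded image, which is the "garbage" case above. The model restricts generation to the plausible part of token space, and a text-conditioned model additionally lets words steer which plausible sequence you get.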