Peekaboo: Text to Image Diffusion Models Are Zero-Shot Segmentors (arxiv.org)
106 points by tosh on Dec 26, 2022 | 5 comments



The paper is not super-well written; it kind of hides the underlying idea (which is pretty simple) and doesn't present it clearly on page 1. Whatever happened to the good old article-writing rule that you should summarize your idea clearly in Figure 1? :)

My take is that they guess at an alpha mask first, then iterate: generate a composite of the input image (to be segmented) over a uniform background using the proposed alpha, then update the mask by gradient descent on the score of the composite image with respect to the text conditioning.

They add some extra scoring functions to try to suppress "bad" alpha masks.
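
If that's right, the core loop would look something like the sketch below (purely my reading, not the authors' code; `sds_loss` here is a dummy stand-in for a score-distillation loss computed with a frozen Stable Diffusion model):

    import torch

    def sds_loss(composite, prompt):
        # Stand-in: the real thing would noise the composite, denoise it
        # with SD conditioned on `prompt`, and return the score-distillation
        # loss. Here it just returns a dummy differentiable value.
        return composite.mean()

    def segment(image, prompt, steps=200, lr=1e-2):
        # image: (3, H, W) tensor in [0, 1]; optimize an unconstrained
        # logit map so gradient descent can move the mask freely.
        mask_logits = torch.zeros(1, *image.shape[1:], requires_grad=True)
        background = torch.full_like(image, 0.5)  # uniform grey background
        opt = torch.optim.Adam([mask_logits], lr=lr)
        for _ in range(steps):
            alpha = torch.sigmoid(mask_logits)
            composite = alpha * image + (1 - alpha) * background
            loss = sds_loss(composite, prompt)
            # (plus the paper's extra scoring terms to suppress bad masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return torch.sigmoid(mask_logits).detach()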


My reading is that they add noise and then denoise the query image with SD using the text prompt. The part that denoises closest to the original image must be the part specified by the text conditioning. The stuff that gets denoised to some random different thing was apparently not specified by the text prompt.
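
Roughly, in code (just my speculation about the mechanism, not what the paper implements; `denoise_fn` is an assumed wrapper around a text-conditioned SD denoising pass, not a real diffusers API):

    import torch

    def mask_by_reconstruction(image, prompt, denoise_fn,
                               noise_level=0.5, thresh=0.1):
        # Noise the image, denoise it conditioned on the prompt, and treat
        # low reconstruction error as "this pixel is explained by the prompt".
        noisy = image + noise_level * torch.randn_like(image)
        recon = denoise_fn(noisy, prompt)
        err = (recon - image).abs().mean(dim=0)  # per-pixel error, (H, W)
        return (err < thresh).float()            # 1 where the prompt fits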


Thank you!

Your explanation was instantly understandable and (in hindsight) obvious to me.


tldr;

> We ask the question, can pre-trained diffusion models' understanding of separate visual concepts be leveraged to associate natural language to relevant spatial regions of an image?

> An alpha-channel compositing process is used to iteratively generate segmentations for the regions of interest in a given image. The zero-shot functionality is endowed by our score distillation mechanism operating in latent space, that connects the language and vision modalities.

I mean, this is basically pretty cool.

They're masking out a region of the image, then varying the background and running it through to find the best text classification for that region of the image.

Overall, using off-the-shelf Stable Diffusion they can generate reasonably good automatic image segmentation (i.e. this bit is a 'teddy bear', this bit is a 'hand', etc.) on images.

I totally did not understand the bit about magically generating the alpha masks, honestly (anyone care to explain how the MLP magically turns into a mask of a cat?), but overall, very, very cool being able to use the existing model to segment images like this without needing a massive dataset of labelled images!


My reading is that the mask is just an early step (or steps) that then gets averaged and matted out. Backgrounds don't change much after ~30 steps, so you can take those and diff against the final image for the mask?
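
Something like this, if I'm guessing right (pure speculation on my part; `step_images` is assumed to be the sampler's intermediate predictions, which you'd have to capture yourself):

    import torch

    def mask_from_steps(step_images, thresh=0.1):
        # step_images: list of (3, H, W) tensors, one per denoising step.
        early = torch.stack(step_images[:30]).mean(dim=0)  # avg of first ~30
        final = step_images[-1]
        diff = (final - early).abs().mean(dim=0)  # per-pixel change, (H, W)
        return (diff > thresh).float()            # 1 where content moved late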



