Peekaboo: Text to Image Diffusion Models Are Zero-Shot Segmentors (arxiv.org)
106 points by tosh on Dec 26, 2022 | 5 comments



The paper is not super-well written; it kind of hides the underlying idea (which is pretty simple) and doesn't present it clearly on page 1. Whatever happened to the good old article-writing rule that you should summarize your idea clearly in Figure 1? :)

My take is that they guess at an alpha mask first, then iterate: generate a composite of the input image (to be segmented) over a uniform background using the proposed alpha, then update the mask by gradient descent on the score of the composite image with respect to the text conditioning.

They add some extra scoring functions to try to suppress "bad" alpha masks.
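
If that's right, the core loop would look something like the sketch below (purely my reading, not the authors' code; `sds_loss` here is a dummy stand-in for a score-distillation loss computed with a frozen Stable Diffusion model):

    import torch

    def sds_loss(composite, prompt):
        # Stand-in: the real thing would noise the composite, denoise it
        # with SD conditioned on `prompt`, and return the score-distillation
        # loss. Here it just returns a dummy differentiable value.
        return composite.mean()

    def segment(image, prompt, steps=200, lr=1e-2):
        # image: (3, H, W) tensor in [0, 1]; optimize an unconstrained
        # logit map so gradient descent can move the mask freely.
        mask_logits = torch.zeros(1, *image.shape[1:], requires_grad=True)
        background = torch.full_like(image, 0.5)  # uniform grey background
        opt = torch.optim.Adam([mask_logits], lr=lr)
        for _ in range(steps):
            alpha = torch.sigmoid(mask_logits)
            composite = alpha * image + (1 - alpha) * background
            loss = sds_loss(composite, prompt)
            # (plus the paper's extra scoring terms to suppress bad masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return torch.sigmoid(mask_logits).detach()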


My reading is that they add noise and then denoise the query image with SD using the text prompt. The part that denoises closest to the original image must be the part specified by the text conditioning. The stuff that gets denoised to some random different thing was apparently not specified by the text prompt.
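
Roughly, in code (just my speculation about the mechanism, not what the paper implements; `denoise_fn` is an assumed wrapper around a text-conditioned SD denoising pass, not a real diffusers API):

    import torch

    def mask_by_reconstruction(image, prompt, denoise_fn,
                               noise_level=0.5, thresh=0.1):
        # Noise the image, denoise it conditioned on the prompt, and treat
        # low reconstruction error as "this pixel is explained by the prompt".
        noisy = image + noise_level * torch.randn_like(image)
        recon = denoise_fn(noisy, prompt)
        err = (recon - image).abs().mean(dim=0)  # per-pixel error, (H, W)
        return (err < thresh).float()            # 1 where the prompt fits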


Thank you!

Your explanation was instantly understandable and (in hindsight) obvious to me.


tldr;

> We ask the question, can pre-trained diffusion models' understanding of separate visual concepts be leveraged to associate natural language to relevant spatial regions of an image?

> An alpha-channel compositing process is used to iteratively generate segmentations for the regions of interest in a given image. The zero-shot functionality is endowed by our score distillation mechanism operating in latent space, that connects the language and vision modalities.

I mean, this is basically pretty cool.

They're masking out a region of the image, then varying the background and running it through to find the best text classification for that region of the image.

Overall, using off-the-shelf Stable Diffusion they can generate reasonably good automatic image segmentation (i.e. this bit is a 'teddy bear', this bit is a 'hand', etc.) on images.

I totally did not understand the bit about magically generating the alpha masks, honestly (anyone care to explain how the MLP magically turns into a mask of a cat?), but overall, very, very cool being able to use the existing model to segment images like this without needing a massive dataset of labelled images!


My reading is that the mask is just an early step (or steps) that then gets averaged and matted out. Backgrounds don't change much after ~30 steps, so you can take those and diff against the final image for the mask?
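
Something like this, if I'm guessing right (pure speculation on my part; `step_images` is assumed to be the sampler's intermediate predictions, which you'd have to capture yourself):

    import torch

    def mask_from_steps(step_images, thresh=0.1):
        # step_images: list of (3, H, W) tensors, one per denoising step.
        early = torch.stack(step_images[:30]).mean(dim=0)  # avg of first ~30
        final = step_images[-1]
        diff = (final - early).abs().mean(dim=0)  # per-pixel change, (H, W)
        return (diff > thresh).float()            # 1 where content moved late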



