Yeah, I'm having trouble understanding why this isn't already accomplishable with SD + ControlNet. I guess Segment Anything means you don't need to manually segment an image? The example images don't seem particularly better to me than what one could generate without segmenting, using auto-generated canny, scribble, or depth ControlNet inputs.
Segment Anything is orders of magnitude more powerful than anything we've seen before in the realm of image segmentation, and it's the first 'foundation model' in that regard, so this is a real step change in background removal/editing, etc. Hell, we might not even need green screens anymore!
Segmenting really well, the way a human would, has for some reason been really, really hard for a computer to do. I guess that's because it requires understanding what objects actually are. So it's amazing that this part works so well.
And then the ControlNet part seems all the more magical just because you have a REALLY accurate input.
The readme.md has some image examples. If I interpret them correctly, the first one is the input image, the second is the segmented output, and the rest are example outputs using the prompt text shown above the collage.
The visual quality of the output images is not particularly impressive compared to what we've become used to.
What (IMO) it attempts to showcase is how the input image segmentation is used to guide the final image generation. That part is quite impressive. The shapes and "segments" are very well preserved from input to output.
It's a ControlNet model trained on SAM segmentation maps. The final model takes a prompt and a segmentation map as input and generates an image conditioned on both.
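Roughly, inference looks like the usual diffusers ControlNet flow, just with a SAM-style segmentation map as the conditioning image. A minimal sketch (the ControlNet checkpoint name below is a placeholder, not the repo's actual ID):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Load ControlNet weights trained on SAM segmentation maps.
# "someuser/controlnet-sam-seg" is a placeholder, not the repo's actual checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "someuser/controlnet-sam-seg", torch_dtype=torch.float16
)

# Attach it to a base Stable Diffusion model.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The conditioning image is a colorized SAM segmentation map of the source photo.
seg_map = Image.open("sam_segmentation.png")

# The prompt controls content/style; the segmentation map pins down shapes and layout.
image = pipe("a cozy living room, photorealistic", image=seg_map).images[0]
image.save("output.png")
```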
ControlNet is a neural network added to an already-trained model so it can be conditioned on new inputs like canny edges, depth maps, or segmentation maps. ControlNet lets you train on a new condition "easily", without catastrophic forgetting and without a huge dataset. In the repo linked by OP, they trained a ControlNet model on segmentation maps generated by SAM: https://segment-anything.com/
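For context, producing that conditioning image with SAM is straightforward: the official segment_anything package can mask a whole image automatically, and you rasterize the masks into a color map. A rough sketch (the random-color scheme is an illustrative choice, not necessarily what the repo used):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a released SAM checkpoint (vit_h is the largest).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 array.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, each with a boolean "segmentation" mask

# Paint each segment a random color, largest segments first so small ones stay visible.
seg_map = np.zeros_like(image)
for m in sorted(masks, key=lambda m: m["area"], reverse=True):
    seg_map[m["segmentation"]] = np.random.randint(0, 255, size=3)

cv2.imwrite("sam_segmentation.png", cv2.cvtColor(seg_map, cv2.COLOR_RGB2BGR))
```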
Sounds like a good idea. Some recent Stable Diffusion toolkits like ComfyUI are modular so in theory you'd be able to build a pipeline like this inside the application instead of being constrained to one workflow.
They described it in their paper: they created it themselves. They had/have a loop where humans annotate images the system is uncertain about, but the released dataset was generated fully automatically.
I wonder if Segment Anything can work on things that aren't 'real looking', e.g. if I created a semi-abstract piece in the style of Matisse on MidJourney, would it be able to segment the wiggly vase that doesn't look like a vase?
It works fine for me on very abstract stuff. It seems to generalize quite well. I just wish they would release the CLIP encoder for it so you could prompt it with text.
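For what it's worth, the released checkpoints already take point and box prompts via SamPredictor, so you can click-to-segment even on abstract images; it's only the text-prompt path from the paper that isn't available. A minimal point-prompt sketch (the file name and click coordinates are just examples):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and embed the image once; individual prompts are then cheap.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("abstract_vase.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click on the object (label 1 = foreground point).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates of the click
    point_labels=np.array([1]),
    multimask_output=True,  # return candidate masks at several granularities
)

# Keep the highest-scoring mask and save it as a binary image.
best = masks[np.argmax(scores)]
cv2.imwrite("vase_mask.png", (best * 255).astype(np.uint8))
```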
Nice work! Did you figure out the training/fine-tuning code for Segment Anything? There's an ongoing discussion on the official repo[0] but no clear results yet.
Great, that was obviously coming! Stable Diffusion with ControlNet was already so good; with Segment Anything we're really starting to have a fantastic experience. Really cool, I will definitely play with this during the week. Thanks for sharing!