EditAnything: Segment Anything + ControlNet + BLIP2 + Stable Diffusion (github.com/sail-sg)
214 points by pizza on April 10, 2023 | 24 comments



Saw this coming as soon as I saw Segment Anything, but it doesn’t make it any less magical


can someone ELI5 what it does and why it's magical?


Yeah, I'm having trouble understanding why this isn't already accomplishable with SD + ControlNet. I guess Segment Anything means you don't need to manually segment the image? The example images don't seem particularly better to me than what you could generate without segmenting, using auto-generated canny, scribble, or depth ControlNet inputs.


Segment Anything is orders of magnitude more powerful than anything we've seen before in the realm of image segmentation, and the first 'foundation model' in that regard, so this is a real step change in background removal/editing, etc. -- hell, we might not even need green screens anymore!


Segmenting really well, like a human would, has for some reason been really, really hard for a computer to do, I guess because it requires an understanding of what objects actually are. So it's amazing that that part works so well.

And then the ControlNet part seems more magical just because you have a REALLY accurate input.


I don't understand what this does. Examples mention "human prompt" but I don't see it anywhere.


The readme.md has some image examples. If I interpret them correctly, the first one is the input image, the second the segmented output, and the rest are example outputs using the prompt text shown above the collage.

The visual quality of the output images is not particularly impressive compared to what we've become used to.

What (IMO) it attempts to showcase is how the input image segmentation is used to guide the final image generation. That part is quite impressive. The shapes and "segments" are very well preserved from input to output.


"The human prompt and BLIP2 generated prompt build the text instruction." The examples only show the BLIP2 prompt.


It's a ControlNet model trained on SAM segmentation maps. The final model takes a prompt and a segmentation map as input and generates an image conditioned on both.
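
In diffusers terms, using it looks roughly like any other ControlNet pipeline. This is a sketch, not the repo's code: I'm substituting the generic ADE20K seg ControlNet checkpoint where EditAnything would load its own SAM-trained weights, and the prompt and filenames are made up:

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler

    # Placeholder weights: the generic seg ControlNet, not EditAnything's SAM-trained one.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

    seg_map = Image.open("seg_map.png")  # colorized segmentation map (e.g. built from SAM masks)
    prompt = "a cozy living room, photorealistic"
    out = pipe(prompt, image=seg_map, num_inference_steps=30).images[0]
    out.save("edited.png")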


And I thought I was moderately "with it" with regard to AI… Could I get that in ELI5?


https://github.com/lllyasviel/ControlNet

ControlNet is a neural network added to an already-trained model so it can be conditioned on new inputs like canny edges, depth maps, or segmentation maps. ControlNet lets you train on the new condition "easily", without catastrophic forgetting and without a huge dataset. In the repo linked by OP, they trained a ControlNet model on the segmentation maps generated by SAM: https://segment-anything.com/
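
If you want to see what that SAM-generated condition looks like, here's a rough sketch using the official segment_anything package: run the automatic mask generator and paint each mask a color to get a ControlNet-style conditioning image. The checkpoint filename is from the SAM repo; the colorization scheme is just an illustration, not necessarily what EditAnything does:

    import cv2
    import numpy as np
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    # Official ViT-H checkpoint from the segment-anything repo.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
    mask_generator = SamAutomaticMaskGenerator(sam)

    image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
    masks = mask_generator.generate(image)  # list of dicts with a boolean "segmentation" array

    # Paint each mask a random color to build a ControlNet-style conditioning image.
    seg_map = np.zeros_like(image)
    for m in sorted(masks, key=lambda m: m["area"], reverse=True):
        seg_map[m["segmentation"]] = np.random.randint(0, 255, 3)
    cv2.imwrite("seg_map.png", cv2.cvtColor(seg_map, cv2.COLOR_RGB2BGR))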


Sounds like a good idea. Some recent Stable Diffusion toolkits like ComfyUI are modular so in theory you'd be able to build a pipeline like this inside the application instead of being constrained to one workflow.


Example 1: OK. Examples 2 and 3 look like failures.


Where did Meta get the Segment Anything training dataset from? Or, if they made it, how did they generate such a massive dataset of accurate masks?


Actually it's quite a small dataset by current standards (millions of images), so I guess they used humans in the loop.


The relevant part is the masks, and so far it's the biggest dataset in existence, with about 400x more masks than the next largest one.

(excluding non-public ones)


They described it in their paper: they created it themselves. They had/have a loop where humans annotate images their system was uncertain about, but the released dataset was generated fully automatically.


I wonder if Segment Anything works on things that aren't 'real looking', e.g. if I created a semi-abstract piece in the style of Matisse in MidJourney, would it be able to segment a wiggly vase that doesn't look like a vase?


It works fine on a Matisse-like image of a vase, such as this one by Kay Gallwey:

https://www.artsy.net/article/artsy-editorial-guide-painting...


It works fine for me on very abstract stuff. Seems to generalize quite well. Just wish they'd release the CLIP encoder for it so you can prompt it with text.
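
If you want to poke at a specific object rather than auto-mask the whole image, the point-prompt API is the way to try it, something like this (checkpoint name from the official repo; the click coordinates and filename are made up):

    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
    predictor = SamPredictor(sam)

    image = cv2.cvtColor(cv2.imread("abstract_vase.png"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)

    masks, scores, _ = predictor.predict(
        point_coords=np.array([[350, 420]]),  # (x, y) pixel you'd "click" on the vase
        point_labels=np.array([1]),           # 1 = foreground point
        multimask_output=True,                # return SAM's three candidate masks
    )
    vase_mask = masks[np.argmax(scores)]      # boolean HxW mask of the best-scoring proposal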


Cool, thanks. Yeah, hopefully since Meta open-sourced the segmenter, MidJourney is already working on it.


Nice work! Did you figure out the training/fine-tuning code for Segment Anything? There's an ongoing discussion on the official repo [0], but no clear results yet.

[0] https://github.com/facebookresearch/segment-anything/issues/...


I'm wondering about this too.

But I will try it out with Label Studio; even with a classic training workflow, this alone should speed up the process tremendously.


Great, that was obviously coming! Stable Diffusion with ControlNet was already so good; with Segment Anything we're really starting to have a fantastic experience. Really cool, I will definitely play with this during the week. Thanks for sharing!



