Yeah, I'm having trouble understanding why this isn't already accomplishable with SD + ControlNet. I guess Segment Anything means you don't need to manually segment an image? The example images don't seem particularly better to me than what one could generate without segmenting, using auto-generated canny, scribble, or depth ControlNet inputs.
Segment Anything is orders of magnitude more powerful than anything we've seen before in the realm of image segmentation, and it's the first 'foundation model' in that regard, so this is a real step change in background removal/editing, etc. Hell, we might not even need green screens anymore!
Segmenting really well, the way a human would, has for some reason been really, really hard for a computer to do. I guess that's because it requires understanding what objects actually are. So it's amazing that this part works so well.
And then the ControlNet part seems all the more magical just because you have a REALLY accurate input.
The readme.md has some image examples. If I interpret them correctly, the first one is the input image, the second is the segmented output, and the rest are example outputs using the prompt text shown above the collage.
The visual quality of the output images is not particularly impressive compared to what we've become used to.
What (IMO) it attempts to showcase is how the input image segmentation is used to guide the final image generation. That part is quite impressive. The shapes and "segments" are very well preserved from input to output.
It's a ControlNet model trained on SAM segmentation maps. The final model takes a prompt and a segmentation map as input and generates an image conditioned on both.
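Roughly, inference looks like the usual diffusers ControlNet flow, just with a SAM-style segmentation map as the conditioning image. A minimal sketch (the ControlNet checkpoint name below is a placeholder, not the repo's actual ID):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Load ControlNet weights trained on SAM segmentation maps.
# "someuser/controlnet-sam-seg" is a placeholder, not the repo's actual checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "someuser/controlnet-sam-seg", torch_dtype=torch.float16
)

# Attach it to a base Stable Diffusion model.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The conditioning image is a colorized SAM segmentation map of the source photo.
seg_map = Image.open("sam_segmentation.png")

# The prompt controls content/style; the segmentation map pins down shapes and layout.
image = pipe("a cozy living room, photorealistic", image=seg_map).images[0]
image.save("output.png")
```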
ControlNet is a neural network added to an already-trained model so it can be conditioned on new inputs like canny edges, depth maps, or segmentation maps. ControlNet lets you train on a new condition "easily", without catastrophic forgetting and without a huge dataset. In the repo linked by OP, they trained a ControlNet model on segmentation maps generated by SAM: https://segment-anything.com/
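For context, producing that conditioning image with SAM is straightforward: the official segment_anything package can mask a whole image automatically, and you rasterize the masks into a color map. A rough sketch (the random-color scheme is an illustrative choice, not necessarily what the repo used):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a released SAM checkpoint (vit_h is the largest).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 array.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, each with a boolean "segmentation" mask

# Paint each segment a random color, largest segments first so small ones stay visible.
seg_map = np.zeros_like(image)
for m in sorted(masks, key=lambda m: m["area"], reverse=True):
    seg_map[m["segmentation"]] = np.random.randint(0, 255, size=3)

cv2.imwrite("sam_segmentation.png", cv2.cvtColor(seg_map, cv2.COLOR_RGB2BGR))
```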
Sounds like a good idea. Some recent Stable Diffusion toolkits like ComfyUI are modular so in theory you'd be able to build a pipeline like this inside the application instead of being constrained to one workflow.
They described it in their paper: they created it themselves. They had/have a loop where humans annotate images the system is uncertain about, but the released dataset was generated fully automatically.
I wonder if Segment Anything can work on things that aren't 'real looking', e.g. if I created a semi-abstract piece in the style of Matisse on MidJourney, would it be able to segment the wiggly vase that doesn't look like a vase?
It works fine for me on very abstract stuff. It seems to generalize quite well. I just wish they would release the CLIP encoder for it so you could prompt it with text.
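For what it's worth, the released checkpoints already take point and box prompts via SamPredictor, so you can click-to-segment even on abstract images; it's only the text-prompt path from the paper that isn't available. A minimal point-prompt sketch (the file name and click coordinates are just examples):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and embed the image once; individual prompts are then cheap.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("abstract_vase.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click on the object (label 1 = foreground point).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates of the click
    point_labels=np.array([1]),
    multimask_output=True,  # return candidate masks at several granularities
)

# Keep the highest-scoring mask and save it as a binary image.
best = masks[np.argmax(scores)]
cv2.imwrite("vase_mask.png", (best * 255).astype(np.uint8))
```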
Nice work! Did you figure out the training/fine-tuning code for Segment Anything? There's an ongoing discussion on the official repo[0] but no clear results yet.
Great, that was obviously coming! Stable Diffusion with ControlNet was already so good; with Segment Anything we're really starting to have a fantastic experience. Really cool, I will definitely play with this during the week. Thanks for sharing!