Basically, from what I have read about Stable Diffusion, you can teach the model to replicate a particular style of output by incorporating images that illustrate that style into its training data. Once that is done, SD can generate new images in that style because its underlying model was trained on a huge set of captioned images, where the text attached to each image gives it the contextual clues it needs. So when the user enters "A cartoon drawing of a grinning red-headed boy with a gap in his front teeth", the process knows how to parse the requested image parameters. In short, it already understands what a cartoon drawing is from having seen many images captioned as cartoon drawings. It also knows the difference between a grin and a frown or a look of disapproval, can recognize colors, differentiate gender, and handle the other parameters in the text prompt used to build the image.
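
For concreteness, here is a minimal text-to-image sketch using the Hugging Face diffusers library (one common way to run SD; the checkpoint name and sampler settings below are typical defaults, not anything from the article):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a public SD 1.5 checkpoint (any compatible checkpoint works).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "A cartoon drawing of a grinning red-headed boy with a gap in his front teeth"
    # The text encoder turns the prompt into conditioning; the diffusion model then
    # denoises random latents toward an image that matches that conditioning.
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("cartoon_boy.png")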

However, if it has never seen Alfred E. Neuman, it is very unlikely to produce an output that resembles him. This is part of the reason for SD's huge popularity. Not only is it open source, free to use, and fast since it generates at a native 512x512 resolution (from what I read), it also allows fine-tuning of existing models with new images, so the user can easily train it to produce specific features or characteristics that were not in the original training set. You can steer it to produce variations of any type of image you can train it on.
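
As a rough sketch of what "training it on your own images" involves on the data side, something like the following (directory names are hypothetical) center-crops and resizes a folder of photos to the 512x512 resolution the SD 1.x models expect:

    from pathlib import Path
    from PIL import Image

    src_dir, dst_dir = Path("raw_photos"), Path("train_512")
    dst_dir.mkdir(exist_ok=True)

    for path in src_dir.glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        # Center-crop to a square, then resize to 512x512 for fine-tuning.
        side = min(img.size)
        left, top = (img.width - side) // 2, (img.height - side) // 2
        img = img.crop((left, top, left + side, top + side)).resize((512, 512), Image.LANCZOS)
        img.save(dst_dir / path.name)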

As the author mentions in the article, he trained SD to insert his likeness into its outputs by fine-tuning it on only 30 properly prepared images of himself. It worked great after that. Without those images it did not work, because the model had no context for that part of his prompt.

>Yesterday, I used a simple YouTube tutorial and a popular Google Colab notebook to fine-tune Stable Diffusion on 30 cropped 512×512 photos of me. The entire process, start to finish, took about 20 minutes and cost me about $0.40. (You can do it for free but it takes 2-3 times as long, so I paid for a faster Colab Pro GPU.)
>
>The result felt like I opened a door to the multiverse, like remaking that scene from Everything Everywhere All at Once, but with me instead of Michelle Yeoh.
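
After a DreamBooth-style fine-tune like the one he describes, generation works exactly as before, except the prompt references the rare placeholder token the training run bound to the new subject. A hedged sketch (the output directory and the "sks" token are conventional examples, not the author's actual setup):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load the checkpoint directory written by the fine-tuning run (path is hypothetical).
    pipe = StableDiffusionPipeline.from_pretrained(
        "./dreambooth-output", torch_dtype=torch.float16
    ).to("cuda")

    # The fine-tune ties the new subject to a rare token (commonly "sks"), so the
    # prompt uses that token to place the subject in any scene or style.
    prompt = "a photo of sks person as a sci-fi hero, cinematic lighting"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("multiverse_me.png")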

Without knowing anything about him, it would not be able to produce images of him. He uses Garfield as an example of something it does not render well.

>...it really struggles with Garfield and Danny DeVito. It knows that Garfield’s an orange cartoon cat and Danny DeVito’s general features and body shape, but not well enough to recognizably render either of them.

SD is a latent diffusion model trained on an enormous dataset of billions of captioned images scraped from the internet (with captions in over 100 languages), distilled into a relatively lightweight and therefore fast image generation tool. It has to know what something looks like from contextualized a priori data in order to include that object, characteristic, or feature in its output.

Thanks for spurring me to look into this. It looks very interesting, though I am unlikely to find time to work with it myself.
