Some truly impressive results. I'll make my usual point here, the one I make whenever a fancy new (generative) model comes out, and I'm sure some of the other commenters have alluded to it. The examples shown are likely drawn from a set of well-represented (read: lots of data, high bias) input classes for the model. What would be really interesting is how the model generalizes to /object concepts/ it has never seen, and which relate only abstractly to the examples it has seen. Another commenter here mentioned "red square on green square" working, but "large cube on small cube" not working. Humans can infer and understand such abstract concepts from very few examples, and this is something AI isn't as close to as it might seem.
It seems unlikely the model has seen "baby daikon radishes in tutus walking dogs," or cubes made out of porcupine textures, or any number of the other examples the post gives.
It might not have seen that specific combination, but finding an anthropomorphized radish sure is easier than I thought: type "大根アニメ" ("daikon anime") into your search engine and you'll find plenty of results.
Image search for “大根 擬人化” ("anthropomorphized daikon") does return results similar to the AI-generated pictures, e.g. 3rd from the top[0] in my environment, but they're sparse. “大根アニメ” in text search actually gives me results about an old hobbyist anime production group[1], some TV anime[2] with the word in the title... hmm.
Then I found these[3][4] in the Videos tab. Apparently there's a 10-20 year old manga/merch/anime franchise of walking, talking daikon radish characters.
So the daikon part is already represented in the dataset. The AI picked up the prior art and combined it with the dog part, which is still tremendous, but maybe not "figured out the walking daikon on its own" tremendous.
(btw, does anyone know how best to refer to the anime art style in Japanese? It's a bit of a mystery to me)
> does anyone know how best to refer to the anime art style in Japanese?
The term mangachikku (漫画チック, マンガチック, "manga-tic") is sometimes used to refer to the art style typical of manga and anime; it can also refer to exaggerated, caricatured depictions in general. Perhaps anime fū irasuto (アニメ風イラスト, anime-style illustration), while a less colorful expression, would be closer to what you're looking for.
At least for certain types of art, sites such as pixiv and danbooru are useful for training ML models: all the images on them are tagged and classified already.
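As a rough illustration of why that matters: with per-image tag lists you can train a supervised tagger almost directly. Here's a minimal sketch, assuming a hypothetical local dump (an images/ folder plus a tags.json mapping filenames to tag lists) and PyTorch; none of the names below are pixiv's or danbooru's actual API.

    # Minimal sketch: a tagged-image dataset with multi-hot tag labels.
    # Assumes a hypothetical dump: <root>/images/ plus <root>/tags.json
    # of the form {"img001.jpg": ["daikon", "tutu", ...], ...}.
    import json
    from pathlib import Path

    import torch
    from torch.utils.data import Dataset
    from torchvision import transforms
    from PIL import Image

    class TaggedImageDataset(Dataset):
        def __init__(self, root: str):
            self.root = Path(root)
            with open(self.root / "tags.json") as f:
                self.tags = json.load(f)
            self.files = sorted(self.tags)
            # Build a fixed tag vocabulary from the dump.
            vocab = sorted({t for ts in self.tags.values() for t in ts})
            self.tag_index = {t: i for i, t in enumerate(vocab)}
            self.transform = transforms.Compose([
                transforms.Resize((256, 256)),
                transforms.ToTensor(),
            ])

        def __len__(self):
            return len(self.files)

        def __getitem__(self, idx):
            name = self.files[idx]
            image = self.transform(
                Image.open(self.root / "images" / name).convert("RGB"))
            # Multi-hot label vector: one slot per tag in the vocabulary.
            label = torch.zeros(len(self.tag_index))
            for t in self.tags[name]:
                label[self.tag_index[t]] = 1.0
            return image, label

The point being that the hard labeling work is already done by the sites' users, so the pipeline reduces to plumbing.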
If you type different plants and animals into GIS (Google Image Search), you don't even get the right species half the time. If GPT-3 has solved this problem, that would be substantially more impressive than drawing the images.
This is a spot-on point. My prediction is that it wouldn't be able to. Given its difficulty generating the correct number of glasses, it seems it still struggles with systematic generalization and compositionality. As a point of reference, cherry-picking aside, it could model the obscure but probably well-represented "baby daikon radish in a tutu walking a dog," but couldn't model red on green on blue cubes. Maybe more sequential perception, action, and video data, or a System 2-like paradigm, would help, but it remains to be seen.
Yes, I don't really see impressive language (i.e. GPT-3) results here? It seems to morph the images of the nouns in the prompt in an aesthetically pleasing and almost artifact-free way (very cool!).
But it does not seem to 'understand' anything, as some other commenters have claimed. Try '4 glasses on a table' and you will rarely see 4 glasses, even though that is a very well-defined input. I would be more impressed by the language model if it handled a working prompt like: "A teapot that does not look like the image prompt."
I think some of these examples trigger a kind of bias, where we think: "Oh wow, that armchair does look like an avocado!" - but morphing an armchair and an avocado will almost always look like both, because they have similar shapes. And it does not 'understand' what you called 'object concepts'; otherwise it would not produce armchairs you clearly cannot sit in due to the avocado stone (or the stem, in the flower-related 'armchairs').
What I meant is that 'not' is in principle an easy keyword to implement 'conservatively', e.g. by rejecting generated candidates that still contain the negated concept (see the sketch below). But yes, having this in a language model has proven to be very hard.
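For what it's worth, here's a minimal sketch of what I mean by 'conservatively': condition generation only on the positive part of the prompt, then reject candidates where a classifier still detects the negated concept. generate_image() and concept_score() are hypothetical stand-ins (not any real OpenAI API), and the prompt parsing is deliberately crude.

    # Sketch of 'conservative' negation via rejection sampling.
    # generate_image(text) -> image and concept_score(image, concept) -> float
    # are hypothetical callables supplied by the user.
    import re

    def split_negations(prompt):
        """Very crude: handles prompts like '<subject> without <concept>'
        or '<subject>, not <concept>'; a real system needs real parsing."""
        negated = re.findall(r"(?:without|not)\s+(?:a\s+|an\s+|the\s+)?(\w+)", prompt)
        positive = re.split(r"\s*(?:,?\s*not\b|without\b)", prompt)[0]
        return positive.strip(), negated

    def generate_conservatively(prompt, generate_image, concept_score,
                                threshold=0.5, max_tries=32):
        positive, negated = split_negations(prompt)
        for _ in range(max_tries):
            img = generate_image(positive)  # condition only on the positive part
            # Reject the candidate if any negated concept is still detected.
            if all(concept_score(img, c) < threshold for c in negated):
                return img
        return None  # give up rather than violate the 'not'

That's 'conservative' in the sense that it may refuse to answer, but it never returns an image containing the thing you excluded (up to classifier error).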
Edit: Can I ask, what do you find impressive about the language model?
Perhaps the rest of the world is less blasé - rightly or wrongly. I do get reminded of this: https://www.youtube.com/watch?v=oTcAWN5R5-I when I read some comments. I mean... we are telling the computer "draw me a picture of XXX" and it's actually doing it. To me that's utterly incredible.
I'm in the OpenAI beta for GPT-3, and I don't see how to play with DALL-E. Did you actually try "4 glasses on a table"? If so, how? Is there a separate beta? Do you work for OpenAI?
Sounds like the perfect case for a new captcha system. Generate a random phrase to search an image for, show the user those results, ask them to select all images matching that description.
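In pseudocode-ish Python, the flow might look like this. random_phrase(), search_images(), and generate_decoys() are hypothetical helpers standing in for whatever phrase generator, image source, and decoy source you'd actually use; tiles are assumed to be something hashable like URLs.

    # Sketch of the proposed captcha protocol, not an implementation.
    import random

    def build_captcha(random_phrase, search_images, generate_decoys,
                      grid=9, matching=3):
        phrase = random_phrase()                      # e.g. "baby daikon radish in a tutu"
        real = search_images(phrase, limit=matching)  # images that genuinely match
        decoys = generate_decoys(phrase, limit=grid - matching)  # plausible non-matches
        tiles = real + decoys
        random.shuffle(tiles)
        # Record which grid positions hold the real matches.
        answer = {i for i, tile in enumerate(tiles) if tile in real}
        return phrase, tiles, answer

    def check_answer(answer, selected):
        # Pass only if the user selects exactly the matching images.
        return set(selected) == answer

The hard part, of course, is that the "real" images have to actually match the phrase, which circles back to the image-search reliability problem mentioned above.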