I'm incredibly deep in the image / video / diffusion / comfy space. I've read the papers, written controlnets, modified architectures, pretrained, finetuned, etc. All that to say that I've been playing with 4o for the past day, and my opinions on the space have changed dramatically.
4o is a game changer. It's clearly imperfect, but its operating modalities are superior to anything else we've seen.
Have you seen (or better yet, played with) the whiteboard examples? Or the examples of it pulling characters out of reflections and manipulating them? The prompt adherence, text layout, and compositing capabilities are unreal, to the point that this looks like it completely obsoletes inpainting and outpainting.
I'm beginning to think this even obsoletes ComfyUI and the whole space of open source tools once the model improves. Natural language might be able to accomplish everything short of fine adjustments, and if you can also supply the model with reference images and have it understand them, then it can do basically everything. I haven't bumped into anything that makes me question this yet.
They just need to bump the speed and the quality a little. They're back at the top of image gen again.
I'm hoping a Chinese lab or another US company releases an open model capable of these behaviors, because otherwise OpenAI is going to take this ball and run far ahead with it.
Yeah, if we get an open model one could apply a LoRA (or similarly cheap finetuning) to, then even problems like identity reproduction would most likely be solved, as they were for diffusion models. The coherence, not just to the prompt but to any input image(s), is way beyond what I've seen from diffusion models.
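For anyone unfamiliar with why LoRA finetuning is so cheap: it freezes the pretrained weight matrix and trains only a low-rank additive update. A minimal numpy sketch (dimensions, scaling, and names here are illustrative, not any real model's implementation):

```python
# Minimal LoRA sketch: y = x @ (W + scale * A @ B), with W frozen.
# Only A and B are trained, so the parameter count is 2*d*r instead
# of d_in*d_out. All shapes below are toy values for illustration.
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen linear layer W plus a scaled low-rank adapter A @ B."""
    r = A.shape[1]
    scale = alpha / r
    return x @ W + scale * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.standard_normal((d_in, d_out))       # frozen pretrained weight
A = rng.standard_normal((d_in, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d_out))                     # trainable up-projection, init 0
x = rng.standard_normal((2, d_in))

# With B initialized to zero the adapter is a no-op, so finetuning
# starts from exactly the pretrained behavior.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

That's the whole trick: with rank r small (4 to 64 is typical), the adapter is a tiny fraction of the base model's size, which is why identity/style LoRAs for diffusion models train on consumer GPUs.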
I do think they run a "traditional" upscaler on the transformer output, since it sometimes shows errors similar to upscaler artifacts (misinterpreted pixels). So the current decoded resolution is probably quite low, and hopefully future models like GPT-5 will improve on this.