Can you find me a single official source from OpenAI that claims that GPT-4o is generating images pixel-by-pixel inside the context window?
There are lots of clues that this isn't happening: the obvious upscaling call after the image is generated, the fact that the loading animation replays if you refresh the page, and the fact that 4o claims it can't see any image tokens in its context window (it may not know much about itself, but it can definitely see its own context).
I read the post, and I can't see anything in it that says the model is not multi-modal, nor anything that suggests the images are being processed in-context.
And to answer your question, it's very clearly in the linked article. Not sure how you could have read it and missed this:
> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
The 4o model itself is multi-modal; it no longer needs to call out to separate services, which is what the parent is saying.