I read the post, and I can't see anything in it which says the model is not multi-modal, nor anything that suggests the images are being processed in-context.
And to answer your question, it's stated very clearly in the linked article. Not sure how you could have read it and missed this:
> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
The 4o model itself is multi-modal; it no longer needs to call out to separate services, which is exactly what the parent is saying.