I read the post, and I can't see anything in the post which says that the model ...

Tadpole9181 · 2025-03-27T22:58:16 1743116296

I think you're confusing "modal" with "model".

And to answer your question, it's very clearly in the linked article. Not sure how you could have read it and missed:

> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

The 4o model itself is multi-modal, it no longer needs to call out to separate services, like the parent is saying.

wegfawefgawefg · 2025-03-27T05:58:39 1743055119

4o is multimodal, thats the whole point of 4o