
I don't know that ChatGPT's voice mode is using audio as a transformer input directly.

It could just be using speech to text (e.g. Whisper) on your input, and then using its text model on the text of your words. Or has OpenAI said that they aren't doing this?



OpenAI does not provide many details about their models these days, but they do say that "Advanced voice" in ChatGPT operates on audio input directly:

> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.

From https://help.openai.com/en/articles/8400625-voice-mode-faq



