The older formant-based (as opposed to sample-based) speech synthesizers like DECtalk could do this too. You could select one of a half dozen voices (some male, some female), but also the speed, word pronunciation/intonation, get it to sing, etc., because these were all just parameters feeding into the synthesizer.
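To make the "everything is a parameter" point concrete, here's a minimal sketch of what such a control block might look like (the field names are illustrative, not DECtalk's actual command set):

    from dataclasses import dataclass

    # Hypothetical control block for a formant synthesizer; field names
    # are illustrative, not DECtalk's actual commands.
    @dataclass
    class FormantSynthParams:
        voice: str = "Paul"         # one of a small fixed set of built-in voices
        rate_wpm: int = 180         # speaking rate
        base_pitch_hz: float = 120  # average F0 of the voice
        pitch_range: float = 1.0    # 0 = monotone, >1 = exaggerated intonation
        vibrato_hz: float = 0.0     # nonzero only when "singing"

    # "Singing" is just another point in the same parameter space:
    speech = FormantSynthParams()
    song = FormantSynthParams(rate_wpm=90, pitch_range=1.5, vibrato_hz=5.5)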
It would be interesting to hear the details, but what OpenAI seem to have done is build a neural-net-based speech synthesizer which is similarly flexible because it is generating the audio itself (not stitching together samples), conditioned on the voice ("Sky", etc.) it is meant to be mimicking. Dialing the emotion up/down is basically affecting the prosody and intonation, and the singing is mostly extending vowel sounds and adding vibrato. In the demo Brockman refers to the "singing voice", so it's not clear whether they can make any of the 5 (now 4!) voices sing.
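Here's a toy sketch of how an "emotion" knob and singing could map onto a pitch (F0) contour, assuming the synthesizer exposes prosody this way (a guess, not anything OpenAI has confirmed):

    import numpy as np

    def shape_f0(f0, emotion=1.0, sing=False, frame_rate=100, vibrato_hz=5.5):
        """Toy prosody shaping: 'emotion' scales pitch excursions around
        the mean; 'sing' stretches the contour (longer vowels) and adds
        sinusoidal vibrato. Illustrative only."""
        f0 = np.asarray(f0, dtype=float)
        mean = f0.mean()
        shaped = mean + emotion * (f0 - mean)        # dial intonation up/down
        if sing:
            shaped = np.repeat(shaped, 4)            # crude vowel lengthening
            t = np.arange(len(shaped)) / frame_rate
            shaped += 0.03 * mean * np.sin(2 * np.pi * vibrato_hz * t)  # vibrato
        return shaped

    flat = shape_f0([120, 130, 125, 140], emotion=0.2)             # near-monotone
    sung = shape_f0([120, 130, 125, 140], emotion=1.2, sing=True)  # stretched + vibrato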
In any case, it seems the audio is being generated by some such flexible TTS, not just decoded from audio tokens generated by the model (which would anyway imply there was something, basically a TTS, converting text tokens to audio tokens). They also used the same 5 voices in the previous ChatGPT, which wasn't claiming to be omnimodal, so maybe it's basically the same TTS being used.
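The contrast between the two pipeline shapes looks roughly like this (stub functions showing the data flow, not real APIs):

    # Two possible pipeline shapes (stubs, not real APIs).

    def tts_pipeline(model, tts, prompt):
        # Model emits text; a separate flexible TTS renders it as audio,
        # conditioned on a voice identity and style controls.
        text = model.generate_text(prompt)
        return tts.synthesize(text, voice="Sky", emotion=0.8)

    def audio_token_pipeline(model, codec, prompt):
        # "Omnimodal" alternative: the model emits discrete audio tokens
        # directly, and a neural codec decoder turns them into a waveform.
        # The text-to-audio-token mapping inside the model is itself
        # effectively a TTS.
        audio_tokens = model.generate_audio_tokens(prompt)
        return codec.decode(audio_tokens)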