
I'm the author of FakeYou.com, so I have a little experience in this area. (We used to train GlowTTS models ourselves before turning it over to our users, which has had mixed results in terms of quality.)

This appears to be a repackaging of Real-Time-Voice-Cloning [1], albeit with a few additions, such as global style tokens (GSTs).

No matter what the repo claims, your results will depend on high-quality data: lots of it, and with ample fine-tuning. Demo videos are absolutely cherry-picked.

If you're picking this up for a project, HiFi-GAN is pretty much the best vocoder right now. Tacotron still produces great results, though there are lots of other interesting model architectures.

[1] https://github.com/CorentinJ/Real-Time-Voice-Cloning
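For anyone unfamiliar with the split the parent comment implies: modern neural TTS is usually a two-stage pipeline, with an acoustic model (e.g. Tacotron) mapping text to a mel spectrogram and a vocoder (e.g. HiFi-GAN) mapping that spectrogram to a waveform. A structural sketch, with toy stand-in functions instead of real models (the constants are illustrative assumptions, not values from either project):

```python
# Structural sketch of a two-stage neural TTS pipeline.
# The functions below are toy stand-ins, not the real models.

N_MELS = 80          # mel bands per spectrogram frame (a common choice)
FRAMES_PER_CHAR = 5  # rough text-to-frame expansion (illustrative only)
HOP_LENGTH = 256     # waveform samples generated per spectrogram frame

def acoustic_model(text):
    """Stand-in for Tacotron: text -> mel spectrogram (n_frames x n_mels)."""
    return [[0.0] * N_MELS for _ in range(len(text) * FRAMES_PER_CHAR)]

def vocoder(mel):
    """Stand-in for HiFi-GAN: mel spectrogram -> raw waveform samples."""
    return [0.0] * (len(mel) * HOP_LENGTH)

def tts(text):
    return vocoder(acoustic_model(text))

print(len(tts("hello")))  # 5 chars * 5 frames * 256 samples = 6400
```

The practical upshot is that the two stages are swappable: you can keep a trained Tacotron and replace its vocoder with HiFi-GAN, since they only communicate through the mel spectrogram.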




Long ago I found an approach to 3D modeling [1] that used a morphable base model deformed into the desired shape. Would something like this be possible for voice? A voice model obtained from a gigantic set of samples that could be manually tuned to sound more masculine/feminine or higher/lower pitched, and morphed into the timbre of various samples.

[1] https://www.youtube.com/watch?v=pSRA8GpWIrA


By any chance, do you know which technology was used for the TikTok voice? While the voice itself is somewhat annoying I find the quality stunning. Any chance to reach this level with any of the models you mentioned?


Is the quality that good? It seemed to me that text to speech was not a particularly hard problem when you can get someone to read out all the sounds of the language. And that the new advancement we have now is just being able to dump our normal speech audio and build a model based on that.


Not sure. So far, I've found every text-to-speech application quite uncanny. The ones that come with Windows, macOS, Android, and iOS are good, but not quite there. The TikTok one is the first that sounds quite convincing to me.

Siri and Alexa are also good, but I think they don't count: they are probably not purely text to speech, and presumably use a lot of prerecorded phrases, especially for common answers.


i actively dislike the tiktok voice. i find the microsoft and google cloud voices to be much better.


You are not alone. Especially when it is used in contexts that do not fit its overly enthusiastic, upbeat mood, the results range from slightly comical to deeply unsettling. Still, in my opinion it is really good from a technical point of view.


By "lots of [high quality data]" do you mean seconds, kiloseconds, or megaseconds of high-quality voice recordings?


Hours of audio are ideal: clean spoken sentences, zero noise, and uniform microphone quality.

Some of the predominant base data sets used for transfer learning, such as LJSpeech [1], are unfortunately noisy and non-uniform.

[1] https://keithito.com/LJ-Speech-Dataset/
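For anyone inspecting a base set like LJSpeech before fine-tuning on it, its transcript metadata is a pipe-delimited `metadata.csv` with one clip per row (clip ID, raw transcript, normalized transcript). A quick sanity filter might look like the sketch below; the sample rows are inlined and abbreviated for illustration, and the "is the normalized transcript non-empty" check is a trivial stand-in for real cleaning (noise checks, clip-length limits, etc.):

```python
import io

# LJSpeech-style metadata: "clip_id|transcript|normalized_transcript",
# pipe-delimited, no header. Normally you'd open the dataset's
# metadata.csv; here two illustrative rows are inlined.
sample = io.StringIO(
    "LJ001-0001|Printing, in the only sense|printing, in the only sense\n"
    "LJ001-0002|in being comparatively modern.|in being comparatively modern.\n"
)

rows = [line.rstrip("\n").split("|") for line in sample]

# Keep only clips with a non-empty normalized transcript.
clean = [(clip_id, norm) for clip_id, _, norm in rows if norm.strip()]
print(len(clean))  # 2
```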


Thank you very much!


This paper[0] from this year seems to make do with a couple of minutes.

[0] https://davidyao.me/projects/text2vid/


IIRC, enterprise solutions from the big clouds usually ask for at least several hours of studio-quality voice recordings for a custom voice model.


Thanks! Do you think they're using models as good as the ones echelon uses at FakeYou? Maybe he can get by with less data.


The best sounding limited data models have at least thirty minutes of audio data and have a similar pitch and timbre to a base data set. You can get by with less, but it requires finesse.



