I'm the author of FakeYou.com, so I have some experience in this area. (We used to train Glow-TTS models ourselves before turning training over to our users, which has had mixed results in terms of quality.)
This appears to be a repackaging of RealTimeVoiceCloning [1], albeit with a few additions, such as GSTs.
No matter what the repo claims, your results will depend on high-quality data. Lots of it, and with ample fine-tuning. Demo videos are absolutely cherry-picked.
If you're picking this up for a project, HiFi-GAN is pretty much the best vocoder right now. Tacotron still produces great results, though there are lots of other interesting model architectures.
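For intuition about what the vocoder stage actually does: the acoustic model (e.g. Tacotron) predicts a magnitude spectrogram, and the vocoder turns that into a waveform, which requires recovering phase. The classical baseline that neural vocoders like HiFi-GAN outperform is Griffin-Lim phase reconstruction. A minimal numpy-only sketch (linear magnitudes rather than mels, fixed window/hop sizes chosen for illustration):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # windowed short-time Fourier transform -> (frames, n_fft//2 + 1)
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, n_fft=512, hop=128):
    # inverse STFT via windowed overlap-add with window-power normalization
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S, n=n_fft, axis=1)
    out = np.zeros((S.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += win * f
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128):
    # start from random phase, then alternately project onto the set of
    # signals with the given magnitude and the set of consistent spectrograms
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

HiFi-GAN replaces this iterative projection with a learned generator that maps spectrograms to audio in one pass, which is both faster and far less buzzy than Griffin-Lim output.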
Long ago I found an approach to 3D modeling [1] based on a morphable base model that was deformed into the desired shape. Would something like this be possible for voice? A voice model obtained from a gigantic set of samples, that can be manually tuned to sound more masculine/feminine, higher/lower pitched, and that can be morphed into the timbre of various samples.
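Something close to this already exists in speaker-encoder systems like the one Real-Time-Voice-Cloning builds on: each voice is summarized as a fixed-length embedding vector, and blending two voices amounts to interpolating between their vectors before conditioning the synthesizer. A toy sketch (function names are mine, and it assumes embeddings live on the unit sphere, as GE2E-style encoder outputs typically do):

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def morph_speaker(emb_a, emb_b, alpha):
    """Blend two speaker embeddings; alpha=0 gives speaker A, alpha=1 speaker B.
    Re-normalizing keeps the blend on the unit sphere, where the
    synthesizer expects its conditioning vectors to live."""
    blended = (1 - alpha) * l2_normalize(emb_a) + alpha * l2_normalize(emb_b)
    return l2_normalize(blended)
```

Attributes like perceived gender or pitch are not neatly axis-aligned in this space, though, so "manually tuned" sliders usually require finding attribute directions first rather than tweaking raw dimensions.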
By any chance, do you know which technology was used for the TikTok voice? While the voice itself is somewhat annoying I find the quality stunning. Any chance to reach this level with any of the models you mentioned?
Is the quality that good? It seemed to me that text-to-speech was not a particularly hard problem when you could get someone to read out all the sounds of the language, and that the new advancement is just being able to dump in normal speech audio and build a model from that.
Not sure. So far, I've found every text-to-speech application quite uncanny. The ones that come with Windows, macOS, Android, and iOS are good, but not quite there. The TikTok one is the first to sound quite convincing to me.
Siri and Alexa are also good, but I think they don't count, because they are probably not purely text-to-speech and presumably use a lot of prerecorded phrases, especially for common answers.
You are not alone. Especially when it is used in contexts that do not fit its overly enthusiastic, upbeat mood, the results range from slightly comical to deeply unsettling. Still, in my opinion it is really good from a technical point of view.
The best-sounding limited-data models have at least thirty minutes of audio and a pitch and timbre similar to the base dataset. You can get by with less, but it requires finesse.
[1] https://github.com/CorentinJ/Real-Time-Voice-Cloning