
I'm the author of FakeYou.com, so I have a little experience in this area. (We used to train GlowTTS models ourselves before turning it over to our users, which has had mixed results in terms of quality.)

This appears to be a repackaging of Real-Time-Voice-Cloning [1], albeit with a few additions, such as global style tokens (GSTs).

No matter what the repo claims, your results will depend on high-quality data: lots of it, and with ample fine-tuning. Demo videos are absolutely cherry-picked.

If you're picking this up for a project, HiFi-GAN is pretty much the best vocoder right now. Tacotron still produces great results, though there are lots of other interesting model architectures.

[1] https://github.com/CorentinJ/Real-Time-Voice-Cloning
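For anyone unfamiliar with the split the parent comment implies: modern neural TTS is usually a two-stage pipeline, with an acoustic model (e.g. Tacotron) mapping text to a mel spectrogram and a vocoder (e.g. HiFi-GAN) mapping that spectrogram to a waveform. A structural sketch, with toy stand-in functions instead of real models (the constants are illustrative assumptions, not values from either project):

```python
# Structural sketch of a two-stage neural TTS pipeline.
# The functions below are toy stand-ins, not the real models.

N_MELS = 80          # mel bands per spectrogram frame (a common choice)
FRAMES_PER_CHAR = 5  # rough text-to-frame expansion (illustrative only)
HOP_LENGTH = 256     # waveform samples generated per spectrogram frame

def acoustic_model(text):
    """Stand-in for Tacotron: text -> mel spectrogram (n_frames x n_mels)."""
    return [[0.0] * N_MELS for _ in range(len(text) * FRAMES_PER_CHAR)]

def vocoder(mel):
    """Stand-in for HiFi-GAN: mel spectrogram -> raw waveform samples."""
    return [0.0] * (len(mel) * HOP_LENGTH)

def tts(text):
    return vocoder(acoustic_model(text))

print(len(tts("hello")))  # 5 chars * 5 frames * 256 samples = 6400
```

The practical upshot is that the two stages are swappable: you can keep a trained Tacotron and replace its vocoder with HiFi-GAN, since they only communicate through the mel spectrogram.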




Long ago I found an approach to 3D modeling [1] that used a morphable base model deformed into the desired shape. Would something like this be possible for voice? A voice model obtained from a gigantic set of samples that could be manually tuned to sound more masculine/feminine or higher/lower pitched, and morphed into the timbre of various samples.

[1] https://www.youtube.com/watch?v=pSRA8GpWIrA


By any chance, do you know which technology was used for the TikTok voice? While the voice itself is somewhat annoying I find the quality stunning. Any chance to reach this level with any of the models you mentioned?


Is the quality that good? It seemed to me that text to speech was not a particularly hard problem when you can get someone to read out all the sounds of the language. And that the new advancement we have now is just being able to dump our normal speech audio and build a model based on that.


Not sure. So far, I've found every text-to-speech application quite uncanny. The ones that come with Windows, macOS, Android, and iOS are good, but not quite there. The TikTok one is the first that sounds quite convincing to me.

Siri and Alexa are also good, but I think they don't count: they are probably not purely text to speech, and presumably use a lot of prerecorded phrases, especially for common answers.


i actively dislike the tiktok voice. i find the microsoft and google cloud voices to be much better.


You are not alone. Especially when it is used in contexts that do not fit its overly enthusiastic, upbeat mood, the results range from slightly comical to deeply unsettling. Still, in my opinion it is really good from a technical point of view.


By "lots of [high quality data]" do you mean seconds, kiloseconds, or megaseconds of high-quality voice recordings?


Hours of audio are ideal: clean spoken sentences, zero noise, and uniform microphone quality.

Some of the predominant base data sets used for transfer learning, such as LJSpeech [1], are unfortunately noisy and non-uniform.

[1] https://keithito.com/LJ-Speech-Dataset/
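For anyone inspecting a base set like LJSpeech before fine-tuning on it, its transcript metadata is a pipe-delimited `metadata.csv` with one clip per row (clip ID, raw transcript, normalized transcript). A quick sanity filter might look like the sketch below; the sample rows are inlined and abbreviated for illustration, and the "is the normalized transcript non-empty" check is a trivial stand-in for real cleaning (noise checks, clip-length limits, etc.):

```python
import io

# LJSpeech-style metadata: "clip_id|transcript|normalized_transcript",
# pipe-delimited, no header. Normally you'd open the dataset's
# metadata.csv; here two illustrative rows are inlined.
sample = io.StringIO(
    "LJ001-0001|Printing, in the only sense|printing, in the only sense\n"
    "LJ001-0002|in being comparatively modern.|in being comparatively modern.\n"
)

rows = [line.rstrip("\n").split("|") for line in sample]

# Keep only clips with a non-empty normalized transcript.
clean = [(clip_id, norm) for clip_id, _, norm in rows if norm.strip()]
print(len(clean))  # 2
```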


Thank you very much!


This paper[0] from this year seems to make do with a couple of minutes.

[0] https://davidyao.me/projects/text2vid/


IIRC, enterprise solutions from the big clouds usually ask for at least several hours of studio-quality voice recordings for a custom voice model.


Thanks! Do you think they're using models as good as the ones echelon uses at FakeYou? Maybe he can get by with less data.


The best sounding limited data models have at least thirty minutes of audio data and have a similar pitch and timbre to a base data set. You can get by with less, but it requires finesse.



