The Kate sounds a lot better than the Nick (who has that robotic gravel to his voice), but I think there is some question as to how much better it would actually get with more data:
If trained on longer samples, does it start to hallucinate? Learning programs are notorious for rough approximations that fail to scale into useful detail.
The Notes section neatly showcases the characteristics of most of the publish-fast-without-reproducibility "research" hitting arXiv since mid-'17 (I've sketched what those tweaks amount to right after the list):
The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization.
The paper fixed the learning rate to 0.001, but it didn't work for me. So I decayed it.
I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating those two networks mitigates the burden of training.
The authors claimed that the model can be trained within a day, but unfortunately the luck was not mine. However obviously this is much faster than Tacotron as it uses only convolution layers.
The paper didn't mention dropouts. I applied them as I believe it helps for regularization.
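For the curious, the tweaks those notes describe are small. Here's a rough sketch of what they amount to (written in PyTorch for brevity, even though the repo itself is TensorFlow; the class and parameter names are mine, not the repo's):

    import torch
    import torch.nn as nn

    class NormalizedConvBlock(nn.Module):
        """Conv1d -> layer norm -> ReLU -> dropout: the kind of block Text2Mel/SSRN stack."""
        def __init__(self, channels, kernel_size=3, dropout=0.05):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)
            self.norm = nn.LayerNorm(channels)   # the normalization the paper doesn't mention
            self.drop = nn.Dropout(dropout)      # the regularization the paper doesn't mention

        def forward(self, x):                    # x: (batch, channels, time)
            y = self.conv(x)
            y = self.norm(y.transpose(1, 2)).transpose(1, 2)  # normalize over the channel dim
            return self.drop(torch.relu(y))

    block = NormalizedConvBlock(256)
    opt = torch.optim.Adam(block.parameters(), lr=1e-3)             # the paper's fixed 0.001...
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.5)  # ...decayed instead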
Speaking as a person who doesn't know much about programming, how much time does it take to generate 1 min of audio (or, say, 50 words)? Is it possible to integrate this into my browser?
I heard a Google TTS demo (I think it was called DeepMind?) that sounds extremely human-like, and I was wondering if it can be used to turn webpages into speech (I have a few extensions in Chrome, but those voices sound very robotic and it's hard to listen to them after 2 mins).
Anyway, congrats for making this. I'm nowhere near smart enough to understand how it works, just that it's getting better and more human-like every day!
If it's possible to fine-tune on 1 minute of audio in 10 minutes (which very likely means more than 10 full passes over that minute), it should be possible to run this model with real-time throughput (at least on a decent GPU), since synthesis only needs the forward half of each of those training passes.
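To put rough numbers on that (these are assumptions for illustration, not benchmarks from the repo):

    # Back-of-the-envelope estimate: if 10 fwd+bwd passes over 1 minute of audio
    # fit in 10 minutes of wall clock, forward-only synthesis should beat real time.
    audio_seconds    = 60        # 1 minute of audio
    finetune_seconds = 600       # 10 minutes of wall clock
    passes           = 10        # "very likely more than 10" full training passes

    seconds_per_pass = finetune_seconds / passes   # ~60 s per fwd+bwd pass
    forward_fraction = 1 / 3                       # assume the forward pass costs ~1/3 of a training step
    synth_seconds    = seconds_per_pass * forward_fraction

    print(f"~{synth_seconds:.0f} s to synthesize {audio_seconds} s of audio "
          f"(real-time factor ~{synth_seconds / audio_seconds:.2f})")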
The technology is definitely intended for reading websites to users. I'm actually not sure why Google hasn't integrated it into Chrome yet. Maybe they prefer leaving the task to the OS-level accessibility tools.
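If someone did wire it up, the browser side is mostly plumbing. A toy sketch of the "read this page aloud" pipeline (the final TTS call is a placeholder, not a real function from the repo):

    import requests
    from bs4 import BeautifulSoup

    def page_to_text(url):
        """Fetch a page and strip it down to readable text."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "footer"]):  # drop non-readable elements
            tag.decompose()
        return " ".join(soup.get_text().split())

    text = page_to_text("https://example.com")
    # synthesize(text)  # hypothetical hook into the model's inference step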
All things considered, the samples sound great. Maybe not exactly like the real people, but they definitely capture a human feel, versus my Android's robot-lady voice.
Very cool work though!