The Kate sounds a lot better than the Nick (who has that robotic gravel to his voice), but I think there is some question as to how much better it would actually get with more data:
If trained on longer samples, does it start to hallucinate? Learning programs are notorious for rough approximations that fail to scale into useful detail.
The Notes section neatly showcases the characteristics of most of the publish-fast-without-reproducibility "research" hitting arXiv since mid-'17 (I've sketched what those tweaks amount to right after the list):
The paper didn't mention normalization, but without normalization I couldn't get it to work. So I added layer normalization.
The paper fixed the learning rate to 0.001, but it didn't work for me. So I decayed it.
I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess separating those two networks mitigates the burden of training.
The authors claimed that the model can be trained within a day, but unfortunately the luck was not mine. However obviously this is much faster than Tacotron as it uses only convolution layers.
The paper didn't mention dropouts. I applied them as I believe it helps for regularization.
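For the curious, the tweaks those notes describe are small. Here's a rough sketch of what they amount to (written in PyTorch for brevity, even though the repo itself is TensorFlow; the class and parameter names are mine, not the repo's):

    import torch
    import torch.nn as nn

    class NormalizedConvBlock(nn.Module):
        """Conv1d -> layer norm -> ReLU -> dropout: the kind of block Text2Mel/SSRN stack."""
        def __init__(self, channels, kernel_size=3, dropout=0.05):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)
            self.norm = nn.LayerNorm(channels)   # the normalization the paper doesn't mention
            self.drop = nn.Dropout(dropout)      # the regularization the paper doesn't mention

        def forward(self, x):                    # x: (batch, channels, time)
            y = self.conv(x)
            y = self.norm(y.transpose(1, 2)).transpose(1, 2)  # normalize over the channel dim
            return self.drop(torch.relu(y))

    block = NormalizedConvBlock(256)
    opt = torch.optim.Adam(block.parameters(), lr=1e-3)             # the paper's fixed 0.001...
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.5)  # ...decayed instead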
Speaking as a person who doesn't know much about programming, how much time does it take to generate 1 min of audio (or, say, 50 words)? Is it possible to integrate this into my browser?
I heard a Google TTS demo (I think it was called DeepMind?) that sounds extremely human-like, and I was wondering if it can be used to turn webpages into speech (I have a few extensions in Chrome, but those voices sound very robotic and it's hard to listen to them after 2 mins).
Anyway, congrats for making this. I'm nowhere near smart enough to understand how it works, just that it's getting better and more human-like every day!
If it's possible to fine-tune on 1 minute of audio in 10 minutes (which very likely means more than 10 full passes over that minute), it should be possible to run this model with real-time throughput (at least on a decent GPU), since synthesis only needs the forward half of each of those training passes.
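To put rough numbers on that (these are assumptions for illustration, not benchmarks from the repo):

    # Back-of-the-envelope estimate: if 10 fwd+bwd passes over 1 minute of audio
    # fit in 10 minutes of wall clock, forward-only synthesis should beat real time.
    audio_seconds    = 60        # 1 minute of audio
    finetune_seconds = 600       # 10 minutes of wall clock
    passes           = 10        # "very likely more than 10" full training passes

    seconds_per_pass = finetune_seconds / passes   # ~60 s per fwd+bwd pass
    forward_fraction = 1 / 3                       # assume the forward pass costs ~1/3 of a training step
    synth_seconds    = seconds_per_pass * forward_fraction

    print(f"~{synth_seconds:.0f} s to synthesize {audio_seconds} s of audio "
          f"(real-time factor ~{synth_seconds / audio_seconds:.2f})")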
The technology is definitely intended for reading websites to users. I'm actually not sure why Google hasn't integrated it into Chrome yet. Maybe they prefer leaving the task to the OS-level accessibility tools.
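If someone did wire it up, the browser side is mostly plumbing. A toy sketch of the "read this page aloud" pipeline (the final TTS call is a placeholder, not a real function from the repo):

    import requests
    from bs4 import BeautifulSoup

    def page_to_text(url):
        """Fetch a page and strip it down to readable text."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "nav", "footer"]):  # drop non-readable elements
            tag.decompose()
        return " ".join(soup.get_text().split())

    text = page_to_text("https://example.com")
    # synthesize(text)  # hypothetical hook into the model's inference step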
All things considered, the samples sound great. Maybe not exactly like the real people, but they definitely capture a human feel, versus my Android's robot-lady voice.
Very cool work though!