The "chirping" artifacts are quite distracting. I wonder if they could be avoided by randomizing the alignment when the audio segments are finally pieced together?
Post-processing by convolving with a short noise burst removes them pretty well, as that randomizes phase vs. frequency.
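For anyone who wants to try this, here is a minimal sketch of the idea in Python with NumPy. The function names, the 5 ms burst length, and the exponential decay envelope are my own assumptions for illustration, not anything from the paper or the comment above. Convolving with a noise burst multiplies the signal's spectrum by the burst's spectrum, whose phase is effectively random from one frequency to the next. Note that a random burst also has a non-flat magnitude spectrum, so expect some mild coloration along with the phase scrambling.

```python
import numpy as np

def noise_burst(length, decay, rng):
    """Short white-noise burst with an exponential decay, scaled to unit energy."""
    envelope = np.exp(-np.arange(length) / decay)
    burst = rng.standard_normal(length) * envelope
    return burst / np.sqrt(np.sum(burst ** 2))

def smear_phase(audio, sample_rate=16000, burst_ms=5.0, seed=0):
    """Convolve audio with a short noise burst to scramble phase vs. frequency.

    burst_ms and the decay constant are assumed values; tune them by ear.
    """
    rng = np.random.default_rng(seed)
    length = max(1, int(sample_rate * burst_ms / 1000))
    burst = noise_burst(length, decay=length / 4, rng=rng)
    # "full" mode keeps the tail the burst adds at the end of the signal.
    return np.convolve(audio, burst, mode="full")

if __name__ == "__main__":
    # One second of a 440 Hz tone stands in for synthesized speech.
    t = np.arange(16000) / 16000
    tone = np.sin(2 * np.pi * 440 * t)
    out = smear_phase(tone)
    print(len(tone), len(out))
```

The output is longer than the input by the burst length minus one sample. A longer burst scrambles phase more thoroughly but starts to sound like a small amount of reverb.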
If I understood it correctly, the datasets being used are all on the order of 20 hours. Could the results be improved with a different dataset? Say, Mozilla's Common Voice[1]? They are already at 350 hours of labeled speech.
This is scary good. The only way you can tell these aren’t real is that they don’t (yet) model lung capacity, so some phrases are longer than a human would be able to pronounce in one breath. I can’t wait for this stuff to become mainstream.
I think when Tacotron-level speech synthesis is feasible to create on mobiles, offline, some really interesting opportunities for new apps open up. Right now you wouldn't want to listen to a long-read article on the web read by speech synthesis, but the moment a system can create realistic, emotionally-accurate speech (especially if you can match quotes/dialogue to correct, gendered voices), you'd probably consider it when you were on the go.
Unless you write down that password or use it elsewhere, that can be considered secret. Compare that to the difficulty of sampling someone's voice by simply eavesdropping.
You'd need a big dataset, probably audiobooks in your language with full transcripts. Then fiddle around with your choice of model for training. This might be a good place to start: https://github.com/Kyubyong/tacotron
Just know that the voice will be similar to what Kyubyong or others managed to train, meaning it will sound eerily synthetic. That might fit your purposes, but it's probably not enough for consumer applications. Also, from what I played around with, optimizing the synthesis is going to be a big hurdle if you want it done quickly. Or maybe not; I didn't dig that deep into it.