An open source implementation of DeepVoice 3: 2000-Speaker Neural Text-to-Speech (r9y9.github.io)
166 points by allenleein on April 26, 2018 | 24 comments



The "chirping" artifacts are quite distracting. I wonder if they can be avoided by randomizing the alignment of the final piecing together of the audio?

Post-processing by convolving with a short noise burst removes them pretty well, as that randomizes phase vs. frequency.
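In case it's useful, here's roughly what I mean, as a quick numpy/scipy sketch (assuming a mono wav; the filenames and the ~30 ms burst length are just placeholders):

    # Rough sketch: smear phase by convolving the synthesized audio with a
    # short white-noise burst.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    sr, audio = wavfile.read("synthesized.wav")        # mono clip
    audio = audio.astype(np.float32)

    burst_len = int(0.030 * sr)                        # ~30 ms noise burst
    burst = np.random.randn(burst_len).astype(np.float32)
    burst /= np.sqrt(np.sum(burst ** 2))               # unit energy, keeps loudness roughly constant

    smeared = fftconvolve(audio, burst, mode="same")
    smeared /= np.max(np.abs(smeared)) + 1e-9          # normalize to avoid clipping

    wavfile.write("smeared.wav", sr, (smeared * 32767).astype(np.int16))

The effect is basically a very short reverb, which randomizes the phase without touching the magnitude spectrum much.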


I was surprised that the inflection was so good but the chirping was so bad. I'm no expert, but the latter seems like a much less difficult problem.


IIRC, WaveNet deals with residual hissing and chirping through its stack of causal dilated convolution layers.
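Roughly speaking, a causal dilated convolution just left-pads the sequence so each output sample only depends on past samples. A minimal PyTorch sketch (not WaveNet's actual code, just an illustration of the idea):

    # Minimal sketch of a causal dilated 1-D convolution:
    # left-pad the input so output[t] only depends on input[<= t].
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalDilatedConv1d(nn.Module):
        def __init__(self, channels, kernel_size=2, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation    # amount of left padding
            self.conv = nn.Conv1d(channels, channels,
                                  kernel_size, dilation=dilation)

        def forward(self, x):                          # x: (batch, channels, time)
            x = F.pad(x, (self.pad, 0))                # pad only on the left (the past)
            return self.conv(x)

    # Stacking layers with dilations 1, 2, 4, ... grows the receptive field
    # exponentially while staying causal.
    layers = nn.Sequential(*[CausalDilatedConv1d(64, dilation=2 ** i) for i in range(4)])
    out = layers(torch.randn(1, 64, 1000))             # -> (1, 64, 1000)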


Would it be possible to upload the outputs of this convolution? I'd like to hear how it sounds.


Here's the second example convolved with 30ms white noise:

https://drive.google.com/file/d/1WwmCNwWwukhtYXRT4mDHxCUa0sQ...

But it also makes it sound like she's in a closet, so it would be better to fix it at source if there's a way to do that.


If I understood it correctly, the datasets being used are all in the order of ~20 hours. Could the results be improved with a different dataset? Say, Mozilla's Common Voice[1]? They are already at 350 hours of labeled speech.

[1]: https://voice.mozilla.org


Maybe, but Google's Tacotron 2 also used about 24 hours of speech and gives results that are indistinguishable from humans.


Recently I started digging into text-to-speech implementations of which Google seems to have the best know-how.

https://google.github.io/tacotron/publications/global_style_...

Those samples sound already frighteningly human.

Not to disparage DeepVoice 3 in any way! A lot of work must have gone into it.


This is scary good. The only way you can tell these aren’t real is because they don’t (yet) model the lung capacity, so some phrases are longer than a human would be able to pronounce on one breath. I can’t wait for this stuff to become mainstream.


>Google seems to have the best know-how.

Funny how they publish their "research papers", yet no one else is able to implement their engine with even remotely comparable results.


Oh, is it not verifiable?


Wait, are you saying that the very first audio sample on that page is NOT a human?


I think when Tacotron-level speech synthesis is feasible to create on mobiles, offline, some really interesting opportunities for new apps open up. Right now you wouldn't want to listen to a long-read article on the web read by speech synthesis, but the moment a system can create realistic, emotionally-accurate speech (especially if you can match quotes/dialogue to correct, gendered voices), you'd probably consider it when you were on the go.


Does this mean all banks that are using their customers’ voices as passwords are going to have a big problem?

https://www.nuance.com/en-gb/omni-channel-customer-engagemen...


A bigger problem than banks using 8 character strings as passwords?


8? When did they double it? And since when can I use anything but digits? /s


Unless you write down that password or use it elsewhere, that can be considered secret. Compare that to the difficulty of sampling someone's voice by simply eavesdropping.


Hi HN readers, What do I need to learn to make text to speech in my own language?


A big dataset, probably audiobooks in your language with full transcripts. Then fiddle around with your choice of model for training; this might be a good place to start: https://github.com/Kyubyong/tacotron

Just know that the voice will be similar to what Kyubyong and others managed to train, meaning it will sound eerily synthetic. That might fit your purposes, but it's probably not enough for consumer applications. Also, from what I played around with, optimizing synthesis speed is going to be a big hurdle if you want it done quickly. Or not; I didn't dig that deep into it.
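If it helps, most of these Tacotron/DeepVoice-style repos expect LJSpeech-style data: a folder of wav files plus a metadata.csv mapping utterance IDs to transcripts. A rough sketch of building that file (the paths and layout here are just placeholders for whatever your corpus looks like):

    # Rough sketch: build an LJSpeech-style metadata.csv ("id|transcript")
    # from per-utterance wav/txt pairs.
    import os
    import glob

    wav_dir = "my_language_corpus/wavs"   # e.g. audiobook chapters split into utterances
    out_path = "my_language_corpus/metadata.csv"

    with open(out_path, "w", encoding="utf-8") as out:
        for wav_path in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
            utt_id = os.path.splitext(os.path.basename(wav_path))[0]
            txt_path = os.path.join(wav_dir, utt_id + ".txt")
            if not os.path.exists(txt_path):
                continue                   # skip utterances without a transcript
            with open(txt_path, encoding="utf-8") as f:
                transcript = f.read().strip()
            out.write(f"{utt_id}|{transcript}\n")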


Thanks


Not bad. For IVRs and the like, a smoother, more neutral TTS voice (especially one that can be tuned by region) is extremely handy.


Saw this from DeepMind and thought I would share it here since I found it interesting.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

It's supposedly generating 16k samples a second through a neural network, which seems hard to believe, but it gets you a pretty incredible result.
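For context, an autoregressive model like WaveNet runs one forward pass per output sample, so the numbers really are that large (quick back-of-the-envelope, nothing specific to Google's deployment):

    # Back-of-the-envelope: one network evaluation per sample at 16 kHz.
    sample_rate = 16_000            # samples per second of audio
    clip_seconds = 10
    forward_passes = sample_rate * clip_seconds
    print(forward_passes)           # 160000 forward passes for a 10-second clip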


The URL says PyTorch but the installation instructions say TensorFlow.

It seems to be all PyTorch in the code though.


Is it just me, or does Apple's new Siri voice sound much more human than Google's WaveNet? By a huge margin.

https://machinelearning.apple.com/2017/08/06/siri-voices.htm...



