An open source implementation of DeepVoice 3: 2000-Speaker Neural Text-to-Speech

dspig · on April 26, 2018

The "chirping" artifacts are quite distracting. I wonder if they can be avoided by randomizing the alignment of the final piecing together of the audio?

Post-processing by convolving with a short noise burst removes them pretty well, as that randomizes phase vs. frequency.
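For anyone who wants to try this, here is a minimal NumPy sketch of the idea, not the exact processing used for any sample in this thread. The function name `smear_phase`, the unit-energy scaling, and the peak normalization are assumptions made for illustration. The 30 ms default matches the burst length quoted elsewhere in the thread.

```python
import numpy as np

def smear_phase(audio, sample_rate, burst_ms=30.0, seed=0):
    """Convolve mono audio with a short white-noise burst.

    The burst has a random phase at every frequency, so the convolution
    scrambles the phase relationships in the signal. It also smears the
    signal in time by the burst length, which adds a reverb-like colour.
    """
    rng = np.random.default_rng(seed)
    # Burst length in samples (hypothetical default: 30 ms).
    burst_len = max(1, int(round(sample_rate * burst_ms / 1000.0)))
    burst = rng.standard_normal(burst_len)
    # Scale the burst to unit energy so the output level stays comparable.
    burst /= np.linalg.norm(burst)
    out = np.convolve(audio, burst, mode="full")
    # Guard against clipping if the result is later written as PCM audio.
    peak = np.max(np.abs(out))
    if peak > 1.0:
        out = out / peak
    return out
```

The output is longer than the input by `burst_len - 1` samples because of the convolution tail.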

_bxg1 · on April 26, 2018

I was surprised that the inflection was so good but the chirping was so bad. I'm no expert, but the latter seems like a much less difficult problem.

joshumax · on April 26, 2018

IIRC, WaveNet deals with residual hissing and chirping through a causal dilated convolution layer.
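For reference, a rough sketch of what a single causal dilated convolution computes. This is a plain NumPy illustration with a hypothetical function name, not WaveNet's code, and it says nothing about whether such a layer removes the artifacts.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so each output sample
    depends only on the current and past inputs (causal). The dilation
    sets the spacing between the input samples each filter tap sees.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift >= len(x):
            break  # this tap would only see the zero padding
        if shift == 0:
            y += w * x
        else:
            y[shift:] += w * x[:-shift]
    return y
```

Stacking such layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with depth, which is why the trick is used for raw audio.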

skykooler · on April 26, 2018

Would it be possible to upload the outputs of this convolution? I'd like to hear how it sounds.

dspig · on April 26, 2018

Here's the second example convolved with 30ms white noise:

https://drive.google.com/file/d/1WwmCNwWwukhtYXRT4mDHxCUa0sQ...

But it also makes it sound like she's in a closet, so it would be better to fix it at source if there's a way to do that.

rglullis · on April 26, 2018

If I understood it correctly, the datasets being used are all on the order of ~20 hours. Could the results be improved with a different dataset? Say, Mozilla's Common Voice[1]? They are already at 350 hours of labeled speech.

[1]: https://voice.mozilla.org

IshKebab · on April 26, 2018

Maybe, but Google's tacotron2 also used about 24 hours of speech and gives results that are indistinguishable from humans.

tekkk · on April 26, 2018

Recently I started digging into text-to-speech implementations, of which Google seems to have the best know-how.

https://google.github.io/tacotron/publications/global_style_...

Those samples already sound frighteningly human.

Not to disparage DeepVoice 3 in any way! There must have been a lot of work put into it.

throwaway84742 · on April 26, 2018

This is scary good. The only way you can tell these aren’t real is because they don’t (yet) model lung capacity, so some phrases are longer than a human would be able to pronounce in one breath. I can’t wait for this stuff to become mainstream.

romaniv · on April 26, 2018

>Google seems to have the best know-how.

Funny how they publish their "research papers", yet no one else is able to implement their engine with even remotely comparable results.

mkagenius · on April 26, 2018

Oh, is it not verifiable?

heywire · on April 26, 2018

Wait, are you saying that the very first audio sample on that page is NOT a human?

thom · on April 26, 2018

I think when Tacotron-level speech synthesis is feasible to create on mobiles, offline, some really interesting opportunities for new apps open up. Right now you wouldn't want to listen to a long-read article on the web read by speech synthesis, but the moment a system can create realistic, emotionally-accurate speech (especially if you can match quotes/dialogue to correct, gendered voices), you'd probably consider it when you were on the go.

pmuk · on April 26, 2018

Does this mean all banks that are using their customers’ voices as passwords are going to have a big problem?

https://www.nuance.com/en-gb/omni-channel-customer-engagemen...

imustbeevil · on April 26, 2018

A bigger problem than banks using 8 character strings as passwords?

madmulita · on April 26, 2018

8? When did they double it? And since when can I use anything but digits? /s

wimagguc · on April 26, 2018

Unless you write down that password or use it elsewhere, that can be considered secret. Compare that to the difficulty of sampling someone's voice by simply eavesdropping.

zakki · on April 26, 2018

Hi HN readers, what do I need to learn to make text-to-speech in my own language?

tekkk · on April 26, 2018

A big dataset, probably audiobooks in your language with full transcripts. Then fiddle around with your choice of model for training; this might be a good place to start: https://github.com/Kyubyong/tacotron

Just know that the voice will be similar to what Kyubyong or others managed to train, meaning it will sound eerily synthetic. That might fit your purposes, but it's probably not enough for consumer applications. Also, from what I played around with, optimizing the synthesis is going to be a big hurdle if you want it done quickly. Or not; I didn't dig that deep into it.

zakki · on April 26, 2018

Thanks

StudentStuff · on April 26, 2018

Not bad. For IVRs and the like, a smoother, more neutral TTS voice (esp. one that can be tuned based on area) is extremely handy.

jacksmith21006 · on April 26, 2018

Saw this from DeepMind and thought I would share it here, as I found it interesting.

https://cloudplatform.googleblog.com/2018/03/introducing-Clo...

It's supposed to be using 16k samples a second through a NN, which seems hard to believe. But it gets you a pretty incredible result.

nl · on April 26, 2018

The URL says PyTorch but the installation instructions say TensorFlow.

It seems to be all PyTorch in the code though.

just_a_fella · on April 26, 2018

Is it just me, or does Apple's new Siri voice sound much more human than Google's WaveNet? By a huge margin.

https://machinelearning.apple.com/2017/08/06/siri-voices.htm...