The simplest way would be to download something like MaryTTS [1], read the documentation, and train your own voice model. It won't be perfect, but it shouldn't be too hard.

The best results would probably be achieved by implementing DeepMind's WaveNet paper [2], but it might be too much for what you need.
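To give a feel for what implementing it involves: the core building block in the paper is a stack of dilated causal convolutions. Here is a rough NumPy sketch of just that piece, assuming a filter width of 2 and dilations doubling from 1 to 512. The function names are my own, the weights are fixed at 1 purely for illustration, and the real model adds gated activations, residual and skip connections, and a lot of training on top.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], treating x as zero before t=0."""
    n = len(x)
    y = np.zeros(n, dtype=float)
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift >= n:
            continue  # this tap only ever sees the zero padding
        y[shift:] += wk * x[:n - shift]
    return y

def receptive_field(kernel_size, dilations):
    """Number of past samples (including the current one) one output can see."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Illustrative stack: dilations 1, 2, 4, ..., 512 with width-2 filters.
dilations = [2 ** i for i in range(10)]
rf = receptive_field(2, dilations)

# Push a unit impulse through the stack; the samples it reaches
# show how far back a single output can "see".
h = np.zeros(2048)
h[0] = 1.0
for d in dilations:
    h = causal_dilated_conv(h, np.array([1.0, 1.0]), d)

print(rf, np.count_nonzero(h))
```

Ten layers already cover 1024 samples, which is why the dilation trick matters for raw audio, but at 16 kHz that is still only about 64 ms, so you end up stacking several of these.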

I'm not really sure what to suggest in between those two. Some kind of convolutional NN, I guess?

[1] mary.dfki.de

[2] arxiv.org/abs/1609.03499