My reading of the generator diagram (figure 6) is that it generates phoneme probabilities, not waveforms.
You could train a similar system to produce audio from wav2vec's output, though it probably won't sound like the input audio (same accent or voice) unless you expose more features of the input than just phonemes.
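To make the distinction concrete, here's a toy sketch of what "phoneme probabilities" as an output looks like. This is not the paper's code: the phoneme inventory, the logits, and the `decode` helper are all made up for illustration. The point is that each frame is just a distribution over phoneme labels, so nothing about the speaker's voice or accent survives in it.

```python
import math

# Hypothetical toy phoneme inventory (an assumption, not from the paper).
PHONEMES = ["sil", "HH", "AH", "L", "OW"]

def softmax(logits):
    """Turn one frame of raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(frame_logits):
    """Pick the most likely phoneme per frame and merge consecutive repeats."""
    probs = [softmax(frame) for frame in frame_logits]
    best = [PHONEMES[max(range(len(p)), key=p.__getitem__)] for p in probs]
    collapsed = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]
    return probs, collapsed

# Made-up generator output: 5 frames, one score per phoneme.
frame_logits = [
    [0.1, 2.0, 0.0, 0.0, 0.0],
    [0.0, 1.5, 0.3, 0.0, 0.0],
    [0.0, 0.0, 2.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 1.9, 0.2],
    [0.0, 0.0, 0.0, 0.1, 2.5],
]

probs, phonemes = decode(frame_logits)
print(phonemes)  # ['HH', 'AH', 'L', 'OW']
```

Anything that turns this back into audio would have to invent a voice, which is why I'd expect you to need extra input features to get something that sounds like the original speaker.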