(Edit: Wikipedia says that VoCo takes “approximately 20 minutes of the desired target's speech”, and that it was a research prototype.)
There was a thing called Tacotron from a team at Google, in 2018: https://google.github.io/tacotron/publications/speaker_adapt... (In fact, the OP repo and the original CorentinJ/Real-Time-Voice-Cloning apparently rely on Tacotron.)
And there was something from 2019: https://www.ohadf.com/projects/text-based-editing/, https://news.stanford.edu/2019/06/05/edit-video-editing-text...
The latter two seem to need more sample audio than pure real-time approaches.
Overall, to me, a layman, this space appears quieter than ‘deep-faking’ videos, which makes me wonder whether I've missed something.
Maybe big tech orgs (including Adobe) don't want to risk the liability/PR fallout.