(Edit: Wikipedia says that VoCo takes “approximately 20 minutes of the desired target's speech”, and that it was a research prototype.)
There was a thing called Tacotron from a team at Google, in 2018: https://google.github.io/tacotron/publications/speaker_adapt... (In fact, the OP repo and the original CorentinJ/Real-Time-Voice-Cloning apparently rely on Tacotron.)
And there was something from 2019: https://www.ohadf.com/projects/text-based-editing/, https://news.stanford.edu/2019/06/05/edit-video-editing-text...
The latter two seem to need more sample audio than pure real-time approaches.
Overall, to me, a layman, this space appears quieter than ‘deep-faking’ videos, which makes me wonder whether I've missed something.
Maybe big tech orgs (including Adobe) don't want to risk the liability/PR fallout.