Hacker News
Mocking Bird – Realtime Voice Clone for Chinese (github.com/babysor)
163 points by axiomdata316 on Dec 28, 2021 | 64 comments



Some may have missed the news from October 2021 about the 35-million-dollar swindle between the UAE and Hong Kong, in which a bank manager in HK received a call where the voice "of the company director" requested a transfer...

https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-...

https://www.unite.ai/deepfaked-voice-enabled-35-million-bank...

Not only was the voice (simulated through machine learning) trusted: the accompanying emails also went unverified.


https://www.hsbc.com.hk/ways-to-bank/phone/voice-id/ HSBC Hong Kong allows you to use your voice to log in to phone banking without a PIN. What could go wrong.


HSBC pushes it everywhere "for security". This has been insecure since the invention of the tape recorder. HSBC, not even once.


I refuse to accept the terms for Schwab's 'voice authentication' that it asks me about every time I log in.


Well, with the previous voice-faking tech releases I kinda expected that scams would evolve beyond the garbled "Hi ma, I've been in a road accident, need to pay off, please send money to this number". But they escalated quicker than I could have guessed.


My 90-year-old grandma had a fraudster call her last month claiming that her son had been in a car accident and needed money to pay for his emergency surgery.

If that call had instead been a pitch-perfect replication of my dad's voice - or my aunt's - I guarantee you it would have worked.


I never thought about this scenario but now I am kind of scared. I guess that they would need a lot of data from my own voice to train a model like that. I am kind of happy now that my podcast was taken down due to copyright infringement :)


Sir, we called you to have an extended sampling conversation about your car's extended warranty.


I don't have a car, thanks :D


So it has already started: a new generation of swindles using computer-generated voice, video and other artifacts... It didn't take long for the newest tech to be misused...


The issue is with workers who believe they are living in a distant past century, sitting on the idea that current possibilities are "science fiction" - often criminally calibrating the risk against the abilities of the "man in the street" instead of the interested actor: the swindler. As in "But who could and would ever do it?"¹; the answer clearly being: the one really interested in doing it.

I have heard, literally: "Well, if I cannot trust one's email nor one's telephone number [then a scam is like fate]"... From a bank employee. This is dangerously placed ignorance, bred in subcultures where undue laxity was accepted.

(¹ Again, an actual reply given to security experts showing systemic faults to managers.)


This didn't exist 10 years ago as a GitHub project for anyone to download. How is it from a century in the distant past?

Also, email and phone numbers are LITERALLY standard methods of 2FA. So what are you expecting bank employees to use?


2FA, an abbreviation for "two-factor authentication", means that to authenticate a client you require two of the three factors: something secret that they know, something they physically possess, and something they are (biometrics). It's important that you require both of the two factors to authenticate, not either of them. Email and phone numbers are not even one of these three factors, so not only are they not "standard methods of 2FA", they aren't methods of 2FA at all.

Someone who claims they are using 2FA, but actually authenticates with email and/or phone numbers, is committing fraud.

Even if, for example, a phone number were something you physically possessed, authenticating with only the phone number, or with the phone number plus any number of additional physical possessions, wouldn't be 2FA, because you're still only using one factor: "something you have".
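The factor-counting rule above can be sketched in a few lines. This is a toy illustration, not any bank's real logic: the credential names are invented, and (per the hypothetical in the previous paragraph) a phone is granted "possession" status for the sake of argument.

```python
# Toy sketch: 2FA means credentials spanning at least two DISTINCT
# factor categories, not merely two credentials.
FACTOR_OF = {
    "password":     "knowledge",   # something you know
    "pin":          "knowledge",
    "hardware_key": "possession",  # something you have
    "phone":        "possession",  # granted, hypothetically
    "fingerprint":  "inherence",   # something you are
    "voiceprint":   "inherence",
}

def is_two_factor(credentials):
    """True only if the presented credentials span >= 2 distinct factors."""
    factors = {FACTOR_OF[c] for c in credentials if c in FACTOR_OF}
    return len(factors) >= 2

# A phone plus a hardware key is still ONE factor (both "possession"):
print(is_two_factor(["phone", "hardware_key"]))  # False
# A PIN plus a hardware key spans two factors:
print(is_two_factor(["pin", "hardware_key"]))    # True
```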

Historically, voice-based biometrics were a valid form of biometrics, even without a trusted path: you could prompt someone to say something they hadn't said before so that an attacker couldn't play back a recording. That is no longer the case. As https://news.ycombinator.com/item?id=29712024 pointed out, Tacotron made this a plausible threat already in 02018.

What do I expect bank employees to use? Well, starting 34 years ago in 01987, classified voice communications used a STU-III, which authenticates both parties with public key certificates. PGP made that level of security available to everybody 30 years ago in 01991; Git uses it to sign tags, and Debian uses it to sign packages since 02005: https://wiki.debian.org/SecureApt. Every HTTPS website uses something similar, though browsers routinely trust untrustworthy CAs, which vitiates the security of the scheme.

While we can't expect bank employees to be as technically sophisticated as Debian volunteers, I do think it's reasonable to expect them to be less than 15 years behind, particularly when tens of millions of dollars are at stake. I don't believe that this will actually happen with the existing banking institutions; instead, I believe that they will fail and demand bailouts, which will just expand the scope of the disaster.
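To make the contrast with voice recognition concrete, here is a toy sketch of authenticated instructions. STU-III and PGP use public-key certificates; this sketch substitutes a pre-shared secret (stdlib HMAC) just to stay dependency-free, but the principle is the same: the bank verifies a cryptographic tag over the instruction, not the sound of a voice. All names and values are invented.

```python
import hashlib
import hmac

# Hypothetical secret, provisioned out-of-band (public-key certs would
# avoid even this shared secret; HMAC keeps the sketch stdlib-only).
SHARED_SECRET = b"provisioned-out-of-band"

def sign(instruction: bytes) -> str:
    """Compute an authentication tag over the instruction."""
    return hmac.new(SHARED_SECRET, instruction, hashlib.sha256).hexdigest()

def verify(instruction: bytes, tag: str) -> bool:
    """Constant-time check that the tag matches the instruction."""
    return hmac.compare_digest(sign(instruction), tag)

msg = b"transfer 35000000 USD to account 1234"
tag = sign(msg)
print(verify(msg, tag))                                        # True
print(verify(b"transfer 35000000 USD to account 9999", tag))   # False
```

A cloned voice is useless against this scheme; the attacker would need the key, not the timbre.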


> Someone who claims they are using 2FA, but actually authenticates with email and/or phone numbers, is committing fraud.

OK. I'll send you the list of all the companies that have done 2FA with me via email or a telephone number, and you can hit them up for fraud. Good luck! /s


Send the list to the FTC, not to me. Or file suit against them yourself. I don't have standing to do so because I haven't been defrauded.

The fact that some people get away with telling a lie isn't generally a very strong argument that it's not a lie.


It was a joke! Good luck! /s


HN is a place where sarcasm is sometimes intentionally taken at face value because it leads to more interesting discussions.

People who use sarcasm tend to be very proud of it, but the aggressive-defensive undertone hinders productive conversations.


I understand your point of view, which is why I tagged my comment with a "/s".


I'm trying to figure out why you write dates with a leading zero. I can't think of any practical reason. Am I missing something or is it purely a stylistic choice?


I associate it with the Long Now Foundation[1], I think it's used more broadly as a nudge to think long(er) term.

1: https://longnow.org/


Maybe it's octal?


Because you've been able to record people's voices for literally more than a century now.


Your comment is heavily confused. In short: to say something like "But it was his voice, I recognized it, of course I trusted it" is inexcusable, and even more inexcusable today, because there must be awareness of technology - of the extra possibilities it brings (e.g. easier forgery) and of the expectations it no longer grants (e.g. limits in authentication).

If there exists, say, a bank which accepts printed credit tokens, and a technology comes to exist that allows simplified forgery of those printouts, the bank must actively inform itself and act accordingly. The service provider must know it is "not in the century before Gutenberg; those centuries are past" (figuratively: the idea is about the press, not the movable type). If a service provider accepted instructions through postcards, it must know about the ease of forging signatures. If electronic postcards come to exist (they do: e-mail), all involved parties must know that postcards have no sender authentication, and act accordingly.

That someone seems to have access to John's telephone corroborates that they are John; it does not prove it. If someone sounds like John and also has access to John's telephone, that increases the chances that it is John. If only the access to John's telephone is there, it is far from granted that it is John. If an employee asks you for two documents, that is increased security; if the process becomes "now an alternative document suffices", that is decreased security.
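The "corroborates but does not prove" point can be put in Bayesian terms: each (assumed independent) piece of evidence multiplies the odds that the caller is John by its likelihood ratio, and weak evidence, however stacked, falls short of certainty. A toy calculation, with every number invented purely for illustration:

```python
def update_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each (assumed independent) evidence's
    likelihood ratio; return the posterior probability."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Invented prior: 1:1000 odds that an arbitrary caller is John.
prior = 1 / 1000
# "Sounds like John" is worth little once voices can be cloned (LR = 5);
# access to John's phone is stronger but SIM-swappable (LR = 200).
p = update_odds(prior, [5, 200])
print(round(p, 3))  # 0.5 -- corroborated, but nowhere near proof
```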

The employee is supposed first of all to be reliable, which means careful as opposed to lax. In the context of banking, the employee must be aware of lack of sender authentication in emails, of SIM swapping, and yes even of the possibility of "deepfakes" - the same way a hired shepherd should be aware of wolves, it's the job.

The employee's organization must enforce decent, commonsensical security - not "good sense" but really "common sense", because it would be unacceptable for the typical faults one (unfortunately) sees not to be evident as faults to a majority, once exposed. I have seen web-based services whose scope is stipulated in a preliminary contract (access could be limited according to the security the client wanted) and then modifiable in the user's control panel: you may request read-only access by contract - for reporting instead of operation - and then, once in front of the user interface, be two clicks away from granting the user full access. Everyone went facepalm when shown that; everyone but the "carefully" selected "soldiers" of the enterprise.


OK!


this is good


I'm the author of FakeYou.com, so I have a little experience in this area. (We used to train GlowTTS models ourselves before turning it over to our users, which has had mixed results in terms of quality.)

This appears to be a repackaging of RealTimeVoiceCloning [1], albeit with a few additions, such as GSTs (global style tokens).

No matter what the repo claims, your results will depend on high-quality data. Lots of it, and with ample fine-tuning. Demo videos are absolutely cherry-picked.

If you're picking this up for a project, HiFi-GAN is pretty much the best vocoder right now. Tacotron still produces great results, though there are lots of other interesting model architectures.

[1] https://github.com/CorentinJ/Real-Time-Voice-Cloning


Long ago I found an approach to 3D modeling [1] that used a morphable model that was then morphed into the desired shape. Would something like this be possible for voice? A voice model obtained from a gigantic set of samples, that can be manually tuned to sound more masculine/feminine, higher/lower pitched and that can be morphed into the timbre of various samples.

- [1] https://www.youtube.com/watch?v=pSRA8GpWIrA
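Something like this does exist for voice: speaker-encoder systems (including the repo under discussion) represent a voice as an embedding vector, and interpolating between embeddings morphs timbre much as morphable face models interpolate shape. A minimal sketch with made-up 4-dimensional "voices" (real embeddings are typically a few hundred dimensions, and real morphing is not always this linear):

```python
def morph(voice_a, voice_b, t):
    """Linearly interpolate two speaker embeddings; t=0 gives A, t=1 gives B."""
    return [(1 - t) * a + t * b for a, b in zip(voice_a, voice_b)]

# Hypothetical embeddings, invented for illustration:
deep_voice = [0.9, 0.1, 0.3, 0.0]
high_voice = [0.1, 0.9, 0.5, 0.4]
halfway = morph(deep_voice, high_voice, 0.5)
print([round(v, 3) for v in halfway])  # [0.5, 0.5, 0.4, 0.2]
```

Manual knobs like "more masculine/feminine" would amount to moving along learned directions in this embedding space.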


By any chance, do you know which technology was used for the TikTok voice? While the voice itself is somewhat annoying I find the quality stunning. Any chance to reach this level with any of the models you mentioned?


Is the quality that good? It seemed to me that text to speech was not a particularly hard problem when you can get someone to read out all the sounds of the language. And that the new advancement we have now is just being able to dump our normal speech audio and build a model based on that.


Not sure. So far, I have found every text-to-speech application quite uncanny. The ones that come with Windows, macOS, Android, and iOS are good, but not quite there. The TikTok one is the first to sound quite convincing to me.

Siri and Alexa are also good but I think they don't count, because they probably are not purely text to speech but presumably use a lot of prerecorded phrases, especially for the common answers.


i actively dislike the tiktok voice. i find the microsoft and google cloud voices to be much better.


You are not alone. Especially when it is used in contexts that do not fit its overly enthusiastic upbeat mood the results range from slightly comical to deeply unsettling. Still, in my opinion it is really good from a technical point of view.


By "lots of [high quality data]" do you mean seconds, kiloseconds, or megaseconds of high-quality voice recordings?


Hours of audio comprising clean spoken sentences, zero noise, and uniform microphone quality is ideal.

Some of the predominant base data sets used for transfer learning, such as LJSpeech [1], are unfortunately noisy and non-uniform.

[1] https://keithito.com/LJ-Speech-Dataset/
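A common first pass toward that "clean and uniform" ideal is simply filtering the corpus before training. A toy sketch over hypothetical clip metadata (the field names and thresholds are invented; real pipelines measure these from the audio itself):

```python
def usable(clip, min_s=1.0, max_s=10.0, max_noise_db=-40.0):
    """Keep clips of sane length whose measured noise floor is low enough."""
    return (min_s <= clip["duration_s"] <= max_s
            and clip["noise_floor_db"] <= max_noise_db)

corpus = [
    {"id": "a", "duration_s": 3.2, "noise_floor_db": -55.0},  # good
    {"id": "b", "duration_s": 0.4, "noise_floor_db": -60.0},  # too short
    {"id": "c", "duration_s": 5.0, "noise_floor_db": -25.0},  # too noisy
]
kept = [c["id"] for c in corpus if usable(c)]
print(kept)  # ['a']
```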


Thank you very much!


This paper[0] from this year seems to make do with a couple of minutes.

[0] https://davidyao.me/projects/text2vid/


IIRC enterprise solutions from the big clouds usually ask for at least hours of studio quality voice recordings for a custom voice model.


Thanks! Do you think they're using models as good as the ones echelon uses at FakeYou? Maybe he can get by with less data.


The best sounding limited data models have at least thirty minutes of audio data and have a similar pitch and timbre to a base data set. You can get by with less, but it requires finesse.


There was Adobe Voco, which seemed kind of a forerunner: https://www.youtube.com/watch?v=I3l4XLZ59iw, https://en.wikipedia.org/wiki/Adobe_Voco. It purportedly could edit speech like an audio editor and looked like a destroyer of authenticity. And then nothing was heard of it anymore.

(Edit: Wikipedia says that VoCo takes “approximately 20 minutes of the desired target's speech”, and that it was a research prototype.)

There was a thing called Tacotron from a team at Google, in 2018: https://google.github.io/tacotron/publications/speaker_adapt... (In fact, the OP repo and the original CorentinJ/Real-Time-Voice-Cloning apparently rely on Tacotron.)

And there was something from 2019: https://www.ohadf.com/projects/text-based-editing/, https://news.stanford.edu/2019/06/05/edit-video-editing-text...

The latter two seem to need more samples than pure real-time editing.

Overall, to me a layman, this space appears quieter than ‘deep-faking’ videos. Which makes me wonder if I haven't missed something.


I imagine that it's inherently easier to create a fake voice that's convincing over a grainy phone line, than it is to make a convincing deep fake video.

Maybe big tech orgs (including Adobe) don't want to risk the liability/PR fallout.


This is very interesting. People who have had a laryngectomy can't speak post-operation, and being able to say something with their 'real' voice would be psychologically helpful.


Would it? Unless the sound is produced from inside their body in the natural manner, it will sound weird, just like hearing a recording of your own voice played back.


Not necessarily - people close to me were complaining about what they saw as losing a part of their identity (and grandchildren definitely were not happy about the change).

Even for those who learn to speak with an oesophageal voice, being able to map it in real time to something more natural would be great.

To be more specific, they consider themselves to be 'voice mutilated', even when able to speak, so I guess that anything closer to their original voice would be a net benefit.


I suspect that it would be immensely better to sound sort of like yourself than like a standard text-to-speech engine, and those are used extensively.


This is going to be a very abusive advertising technology. We already know about the "cocktail party effect", where one can hear their name spoken by a familiar voice in a crowded space full of other voices. Expect to hear vaguely familiar voices using your name and personal information to craft advertising that pierces your awareness all the time, at odd times, distracting you when you're busy.


It supports only Mandarin, so the description is slightly disingenuous.


For anyone else wondering, the original title was "Clone a voice in 5 seconds to generate arbitrary speech in real-time"


Yes. I reckon “Mocking Bird – Realtime Voice Clone for Mandarin” might’ve been a more appropriate change, but it’s always easy to criticise after the fact…


It's forked from an English version


Interesting, what’s the upstream? Must’ve missed the link.



Ah, looks like it!


12% of the world speaks Mandarin, so it's a good first choice!


This is not about some sort of language leaderboard. If you see the title (as phrased) written in %language%, you’d expect the proposed thing to work in that language.

Anyway, as someone else pointed out it’s a fork, so it’s understandable if docs weren’t fully updated from upstream.


If the title were written in Mandarin I would have no clue at all what this software is about.

I too was slightly disappointed to find in the repo that this version is for Mandarin, but the amount of life time lost by clicking on a Github link does not make me want to complain about it. I got over it pretty quickly.


Speak for yourself if you experienced disappointment. I just clicked through, looked at it, and left a note that, technical merits aside, the title is technically clickbait; that's all.


I'm very interested in this space. Forgive my ignorance, but what makes this fit for Chinese voices, while unfit for English voices?


From README:

> This repository is forked from Real-Time-Voice-Cloning which only support English.

https://github.com/CorentinJ/Real-Time-Voice-Cloning


One of the big things about these projects is the training audio that's paired with text.

The base project spoke English because it had been trained on English text paired with English recordings. This speaks Mandarin because it has been trained on paired Mandarin text and recordings.

Amusingly, if you take one trained on English text/recording pairs and feed it French text, it will speak French with an English accent.
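The pairing described above boils down to a symbol table learned from one language's text; feed the model another language and characters outside the training alphabet fall through awkwardly, which is one reason for the "French with an English accent" effect. A toy encoder (the symbol set is invented for illustration; real front-ends are more elaborate):

```python
# Hypothetical symbol set a model trained on English pairs might use:
SYMBOLS = "abcdefghijklmnopqrstuvwxyz '"
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}
UNKNOWN = len(SYMBOLS)  # fallback ID for anything outside the alphabet

def encode(text):
    """Map text to the IDs the model was trained on; accents fall through."""
    return [SYMBOL_TO_ID.get(ch, UNKNOWN) for ch in text.lower()]

print(encode("cafe"))  # [2, 0, 5, 4]
print(encode("café"))  # [2, 0, 5, 28] -- 'é' was never seen in training
```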


Is there an easy way to make this work for a non-programmer, i.e. without installing a whole python environment? I'm less interested in cloning specific voices than in getting a high-quality text-to-speech program that'll read me arbitrary Mandarin input, which it seems like this should be able to do.


Yeah, Microsoft sells it as SaaS under the Azure label. You can easily play with their web version, then hit a URL to get it on your credit card.

https://azure.microsoft.com/en-us/services/cognitive-service...


Does this mean voice assistants and GPS will one day be made consumer-friendly and support any voice/name we want them to have, rather than this gimmicky, brand-driven marketing malware known as Alexa or Google or Siri?


I have checked the video; the result is still not satisfying.



