The next logical evolution is to make your own Vocaloid using publicly available speeches and synthesize any soundbite you want.
Coming to you 2020: presidential candidate X may or may not have said that he hates women, kicks babies, loves Hitler, and plans to nuke Wales. You don't even care any more because your airwaves have been saturated with both candidates (and their Vocaloids) saying all kinds of crazy shit in disinformation and counter-disinformation and counter-counter-information campaigns and everybody's desensitized.
Someone actually did make soundbanks based on syllables/morae from Trump and Sanders speeches, for the freeware (sadly not open-source) version of Vocaloid [1]. While it is technically extremely difficult to use this technology to approximate English speech (which has a much wider array of possible phonemes than Japanese), the soundbanks can already "cover" some of the more famous Japanese Vocaloid songs [2].
It gets even better when combined with Face2Face, live-editing software that can transfer facial expressions, including lip movements, from actors to e.g. politicians on TV: https://www.youtube.com/watch?v=ohmajJTcpNk
I just wanted to check that none of us have ever said something among friends that we wouldn't want random people or employers hearing about even years later? Even one sentence that could be taken out of context? And we're totally cool about being fired over it? Just checking.
Bridgewater is [in]famous for making radical internal transparency their policy by recording, distributing, and referencing all internal conversations. This tech, if up to their standards, would fit right in there. And it shouldn't be surprising that it's one of the most intense, type-A-attracting environments even compared to other hedge funds.
I just wanted to check if you truly believe he has changed over the last few years, or if he would say it again if there was nothing important at stake. Even taken with all the context you can? Just checking.
You think if this technology becomes widely adopted it will only be used on people you dislike? Or by organizations or governments you support? Would you be "safe" if it were used on you? In all circumstances?
We will judge each circumstance accordingly; we are humans, not robots, after all. As with any judgment, we will need people to be a bit skeptical.
And so far it has been used on people who knew they were being recorded, so don't talk as if it's recording private conversations; that exaggerates the issue.
But nobody uses "has never said anything regrettable" as the litmus test for the Presidency.
On the other hand, "audio searches reveal the presidential candidate is telling a massive lie about always having expressed profound opposition to Controversial Thing X, and actually repeatedly ridiculed the cause in not well-reported meetings prior to becoming the candidate" could and should be an electoral issue.
Yes and no. If I were ever running for POTUS, I'd rather not have anything I might or might not have said about radical Marxism while I was a student coming to define my candidacy. This isn't about the one specific case of Trump: once this cat is out of the bag, there's no putting it back.
How about a middle-aged career politician who is his major party's candidate to run the economy telling a small and sympathetic audience that he was a Marxist and the 2007/8 financial crash was an opportunity he'd been waiting for for a generation? That's an actual example from British politics, albeit involving an MP so known for making unfortunate remarks that there were probably journalists willing to manually trawl through weeks' worth of footage of the fringe meetings he attends to find the most outrageous ones, but there's no doubt a tool like this could yield better results.
For better and (mostly) for worse, dredging up muck from student days has been around since the dawn of politics, and at least in the UK we're comfortable enough about the fact people change for even a Conservative Party chairman to be willing to volunteer they were a radical Marxist in their teens. Being able to rapidly cross-reference everything a politician has said on record on a given subject in recent years is a new and far more useful way of subjecting them to scrutiny.
This is less about the specific case of Trump's sex remarks (which was obviously a case of somebody knowing their secret video was dynamite and waiting for the right time to light the fuse) and more about the difficulty of establishing whether Trump was telling the truth about having opposed the Iraq war all along, which is where efficient search of stuff that's already in the public domain but not necessarily in the public consciousness comes in really, really useful.
This tech, of course, has utility for many professions beyond journalism. Given the expansive nature of our laws, it is impossible for anyone not to break a law at some point. When it's possible to simply search a massive database for any transgression, enforcement becomes an arbitrary endeavor.
My point isn't that the speech is illegal. It's that there are vast classes of criminal conduct which, merited or otherwise, could almost certainly be applied against virtually anyone you cared to target, given pervasive audio monitoring and search.
Cardinal Richelieu wrote some six lines or so on this subject. You should look them up.
(Or perhaps not; the authenticity is, as many things are, disputed. But he carries the blame, which is perhaps the most poetic justice of this point. Or is it injustice? Words, words, words.)
I searched for "oil", "mexico", "automatic weapons", and "isis". For "isis" it found "crisis"; "oil" got a reference to oil, but not at the correct timestamp; "mexico" got several 50% results, none of which were actual matches; "automatic weapons" got "nuclear weapons". So basically 0/4 searches.
Pretty sure that most networks use the closed caption data to find clips. NBC had an indexed system for all of their tapes in prototype in '98 for their whole back catalog. Not sure if it ever went online.
OK, throw this into the mix: many of you reading this have smartphones which are voice controlled, and for which voice control is activated at all times. In the case of Google, that processing must take place on Google's centralised servers. Siri may or may not do centralised processing (and can operate in standalone modes). Microsoft's Cortana, Facebook's "M" (IIRC) and Amazon's Echo are all various stylings of "Stasi in a Glade form factor", as Maciej Cegłowski so memorably put it.[1]
Voice stores distressingly cheaply in terms of space, and with the Internet of (broken) Things (that spy on you), the odds are good of finding yourself surrounded by microphones in the most unexpected locations,[2] controlled by a wide variety of quite probably competing interests.[3] And if they cannot find what they're looking for in the surveillance tape itself, they'll simply manufacture their own evidence using your own phonemes[4] and video.[5]
> In the case of Google, that processing must take place on
> Google's centralised servers
Doesn't recognition of the initialization phrase "OK, Google" take place on the phone? Sending a continuous stream of audio back to Google's servers sounds expensive.
TL;DR A deep learning algorithm for searching words ("freedom", "Obama" etc.) in a recorded audio/video clip.
The title is clickbait, but the claim is rather interesting. It says that it has 80% accuracy for transcribing an audio clip, compared to 20% for speech-to-text.
Hey, I'm the blog post author (Deepgram cofounder too)! Sorry if the accuracy wasn't clear enough. The numbers quoted are for search accuracy, not for speech-to-text accuracy. It's a subtle thing but makes a lot of difference (and it is the thing that matters most when you are in trying-to-find-something mode).
If you ran phone calls or tape recorder audio through a speech-to-text engine, the word accuracy rate would be something like 10-50% (i.e. abysmal). When you try to search for keyphrases like "frolicking kitten", the likelihood of a text match with STT is ~20%.
If you ran that same search with Deepgram, then 80% of the time you'd find what you are looking for. Deepgram doesn't have to guess at what is being said; it takes the inverse approach and matches 'how it sounds' using deep learning voodoo magic™.
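To put rough numbers on why exact-text search over a transcript does worse than the per-word rate suggests, here is a back-of-the-envelope sketch in Python. It assumes every word is transcribed correctly independently of the others, which is a simplification of mine and not something the post claims; `phrase_match_probability` is a made-up helper name.

```python
def phrase_match_probability(word_accuracy: float, phrase_length: int) -> float:
    """Chance that every word of a phrase is transcribed correctly.

    Assumption (not from the post): per-word errors are independent,
    so the per-word accuracies simply multiply.
    """
    return word_accuracy ** phrase_length


if __name__ == "__main__":
    # Per-word accuracies spanning the range quoted for noisy audio.
    for accuracy in (0.10, 0.30, 0.45, 0.50):
        hit_rate = phrase_match_probability(accuracy, phrase_length=2)
        print(f"word accuracy {accuracy:.0%} -> two-word exact match {hit_rate:.1%}")
```

Under that assumption, a two-word phrase like "frolicking kitten" only gets an exact text match about 20% of the time when per-word accuracy is around 45%, and longer phrases fall off faster still.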
Is it intentional that your website has no pricing information or specific details of what one would get after signing up for an API key? Some blog posts reference pricing, but no idea if those are current or what the APIs look like. Not very inviting to play with stuff :/
There are a number of papers tackling a similar task [0][1][2] for anyone who is interested. There isn't enough information to tell exactly what is going on with Deepgram, but one way to approach this would be to construct a shared embedding space for words/phrases and speech. These types of embedding spaces are powerful [3][4][5], but not magic.
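For anyone who wants the shared-embedding idea made concrete, a minimal sketch follows. It is not how Deepgram works (nobody outside the company knows that from the post), and the vectors are toy numbers I made up; in a real system a text encoder and an audio encoder would each have to be trained to produce them. The names `cosine`, `best_segment`, and the segment labels are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_segment(query_vec, segments):
    """Return the (label, vector) pair whose vector is closest to the query."""
    return max(segments, key=lambda seg: cosine(query_vec, seg[1]))


# Toy stand-ins for audio-segment embeddings (made-up numbers).
audio_segments = [
    ("segment_a", [0.9, 0.1, 0.0]),
    ("segment_b", [0.1, 0.8, 0.3]),
    ("segment_c", [0.0, 0.2, 0.9]),
]

# Toy stand-in for the embedding of a typed query phrase (also made up).
query = [0.2, 0.9, 0.2]

label, _ = best_segment(query, audio_segments)
print(label)
```

The point is only that once text and audio live in the same vector space, search becomes a nearest-neighbour lookup and no transcript is needed in between.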
Cool demo, looking forward to seeing more detail about what is going on. However, I would quibble with the STT WER quoted above. Maybe in noisy environments with unknown speakers (and no voice normalization) this is accurate, but the kinds of clean speech in the demo perform really well in modern recognition engines (on benchmark data, to be fair; cf. MSR at 6.3% and IBM at ~6.9%).
Most word searches over speech-to-text work over soft matches (or ideally beam search over the most likely partial phoneme/word-part matches) rather than hard matches, so it seems like a bit of a straw man comparison in this case.
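A crude illustration of hard versus soft matching over a transcript, in Python. This uses character-level similarity from the standard library's `difflib` as a stand-in; real systems would score phoneme lattices or recognizer hypotheses, which this does not attempt. The function names, the sample sentence, and the 0.75 threshold are my own assumptions.

```python
from difflib import SequenceMatcher


def hard_search(words, query):
    """Indices of words that match the query exactly."""
    return [i for i, word in enumerate(words) if word == query]


def soft_search(words, query, threshold=0.75):
    """(index, word, score) for every word similar enough to the query."""
    hits = []
    for i, word in enumerate(words):
        score = SequenceMatcher(None, query, word).ratio()
        if score >= threshold:
            hits.append((i, word, score))
    return hits


# Made-up transcript line for illustration.
transcript = "there is a crisis over oil prices".split()

print(hard_search(transcript, "isis"))
print(soft_search(transcript, "isis"))
```

Here the exact search finds nothing while the soft search returns "crisis", which cuts both ways: soft matching recovers near misses in a noisy transcript, but the threshold also decides how many lookalike false positives come back.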
I thought the figures sounded a little suspect too. If automated speech-to-text was really that bad, then presumably products like Dragon Dictate etc. would never have been successful. Obviously they have the benefit of training, but still, 20% sounds ludicrously low.
I think you need to take into consideration that this is designed to be used on voice where the speaker is not aware that they are speaking to a computer (or that they are being recorded at all).
When people talk to Siri, Cortana, or Dragon, they take unnatural care in the clarity of their speech compared to normal talk only meant for humans. Also, the speaker may not have been speaking directly into a microphone, there may be lots of background noise, etc.
All of these factors probably combine for a much lower accuracy than what Apple, Microsoft, and Google are going to be dealing with in usual cases. Also keep in mind they all have an incentive to inflate their own products' accuracy scores. Not that the same incentive doesn't also exist for this company/product.
I had a friend working in speech recognition at Nuance for several years. I'm trying to remember what sort of accuracy he quoted to me about two years ago. It was something large, like 80%, and getting the last stretch of accuracy was really hard with the techniques they employed back then. That figure may have been for a very specific application, which was a lot of phone menu stuff. He then moved on to Siri at Apple.
20% is very low. We use machine transcription at work, and although the per-word confidence of machine algorithms varies wildly with the quality of the audio and the speaker's mannerisms, it can easily get into the 80% range with good audio, especially televised audio where people are close to microphones and there is not a lot of non-speech noise.
There's little evidence the Trump clip was found using 'new tech'. Most likely somebody at Access Hollywood remembered recording it and it went from there.
After watching the video, Billy Bush seemed more lecherous than Trump after they got off the bus. Not defending Trump, but since there's no context at the start, I wonder if Bush steered the conversation down that road in the first place.
> Marriage has historic, religious and moral content that goes back to the beginning of time, and I think a marriage is as a marriage has always been, between a man and a woman