Deepgram – Find Damning Soundbites

yakult · on Oct 10, 2016

The next logical evolution is to make your own Vocaloid using publicly available speeches, and synthesize any soundbite you want.

Coming to you 2020: presidential candidate X may or may not have said that he hates women, kicks babies, loves Hitler, and plans to nuke Wales. You don't even care any more because your airwaves have been saturated with both candidates (and their Vocaloids) saying all kinds of crazy shit in disinformation and counter-disinformation and counter-counter-information campaigns and everybody's desensitized.

Kronopath · on Oct 10, 2016

Already exists, in some form:

http://talkobamato.me/synthesize.py?speech_key=77ff00cb8af50...

btown · on Oct 10, 2016

Someone actually did make soundbanks based on syllables/morae from Trump and Sanders speeches, for the freeware (sadly not open-source) version of Vocaloid [1]. While it is technically extremely difficult to use this technology to approximate English speech (which has a much wider array of possible phonemes than Japanese), the soundbanks can already "cover" some of the more famous Japanese Vocaloid songs [2].

[1] https://en.wikipedia.org/wiki/Utau - which incidentally was the software that created the voice for that erstwhile Nyan Cat meme

[2] http://store.vocallective.net/album/trumploid-vs-sanderpoid-...

RGamma · on Oct 10, 2016

It gets even better when combined with Face2Face, a live-editing software that can transfer facial expressions including lip movements from actors to e.g. politicians on TV: https://www.youtube.com/watch?v=ohmajJTcpNk

elcapitan · on Oct 10, 2016

"""

Greetings Professor Falken

Hello

A strange game.

The only winning move is not to play.

"""

gaius · on Oct 10, 2016

I just wanted to check that none of us have ever said something among friends that we wouldn't want random people or employers hearing about even years later? Even one sentence that could be taken out of context? And we're totally cool about being fired over it? Just checking.

btown · on Oct 10, 2016

Bridgewater is [in]famous for making radical internal transparency their policy by recording, distributing, and referencing all internal conversations. This tech, if up to their standards, would fit right in there. And it shouldn't be surprising that it's one of the most intense, type-A-attracting environments even compared to other hedge funds.

http://www.businessinsider.com/bridgewater-records-conversat...

http://www.businessinsider.com/ray-dalios-bridgewater-manage...

notgood · on Oct 10, 2016

I just wanted to check if you truly believe he has changed over the last few years or if he would say it again if there was nothing important at stake. Even taken with all the context you can? Just checking.

gaius · on Oct 10, 2016

You think if this technology becomes widely adopted it will only be used on people you dislike? Or by organizations or governments you support? Would you be "safe" if it were used on you? In all circumstances?

notgood · on Oct 10, 2016

We will judge each circumstance accordingly, we are humans not robots after all. Like for any judgment, we will need people to be a bit skeptical.

And so far it has been used on people who knew they were being recorded, so don't talk as if it's recording private conversations; that over-exaggerates the issue.

jklinger410 · on Oct 10, 2016

The problem is not so much the transparency but the irrational nature of American society.

pjc50 · on Oct 10, 2016

Of course we're not cool about that, but I don't think that applies to presidents. The president isn't an "employee".

philh · on Oct 10, 2016

I think it very much applies to presidents. There are vastly more important considerations for a president than "has never said anything regrettable".

notahacker · on Oct 10, 2016

But nobody uses "has never said anything regrettable" as the litmus test for Presidency.

On the other hand "audio searches reveal the presidential candidate is telling a massive lie about always having expressed profound opposition to Controversial Thing X, and actually repeatedly ridiculed the cause in not well-reported meetings prior to becoming the candidate" could and should be an electoral issue.

gaius · on Oct 10, 2016

Yes and no. If I were ever running for POTUS I'd rather not have anything I might or might not have said about radical Marxism while I was a student come to define my candidacy. Because this isn't about the one specific case of Trump; once this cat is out of the bag there's no putting it back.

notahacker · on Oct 10, 2016

How about a middle-aged career politician who is his major party's candidate to run the economy telling a small and sympathetic audience that he was a Marxist and the 2007/8 financial crash was an opportunity he'd been waiting for for a generation? That's an actual example from British politics, albeit an MP so known for making unfortunate remarks that there were probably journalists willing to manually trawl through weeks' worth of footage of fringe meetings he attends to find the most outrageous ones, but there's no doubt a tool like this could yield better results.

For better and (mostly) for worse, dredging up muck from student days has been around since the dawn of politics, and at least in the UK we're comfortable enough about the fact people change for even a Conservative Party chairman to be willing to volunteer they were a radical Marxist in their teens. Being able to rapidly cross-reference everything a politician has said on record on a given subject in recent years is a new and far more useful way of subjecting them to scrutiny.

This is less about the specific case of Trump's sex remarks (which was obviously a case of somebody knowing their secret video was dynamite and waiting for the right time to light the fuse) and more about the difficulty of establishing whether Trump was telling the truth about having opposed the Iraq war all along, which is where efficient search of stuff that's already in the public domain but not necessarily in the public consciousness comes in really, really useful.

golemotron · on Oct 10, 2016

Actually, the President is an employee. In the US system it's important to remember that.

gaius · on Oct 10, 2016

Billy Bush is tho'.

tspike · on Oct 10, 2016

This tech, of course, has utility for many professions beyond journalism. Given the expansive nature of our laws, it is impossible for anyone not to break a law at some point. When it's possible to simply search a massive database for any transgression, enforcement becomes an arbitrary endeavor.

What checks can be put in place for this?

hx87 · on Oct 10, 2016

If nothing else, mutually assured destruction.

sverige · on Oct 10, 2016

We could stop making certain kinds of speech illegal. (Not including "yelling 'fire' in a crowded theater.")

dredmorbius · on Oct 10, 2016

This goes far beyond speech.

Fraud, misrepresentation, drug use, sexual harassment, conspiracy, aiding and abetting, stolen property, just off the top of my head.

Drawn from financial records, statements, documents, phone calls, social graphs, etc.

You're coming across as rather too narrowly focused.

jimmytidey · on Oct 10, 2016

The speech isn't illegal.

In this case someone is indicating that in the past they've sexually assaulted someone.

The speech is just evidence pointing towards the act, and the act is the problem.

dredmorbius · on Oct 10, 2016

My point isn't that the speech is illegal. It's that there are vast classes of criminal conduct which, merited or otherwise, could almost certainly be applied against virtually anyone you cared to, with pervasive audio monitoring and search.

Cardinal Richelieu wrote some six lines or so on this subject. You should look them up.

https://en.wikiquote.org/wiki/Cardinal_Richelieu

(Or perhaps not, the authenticity is, as many things are, disputed. But he carries the blame. Which is perhaps the most poetic justice of this point. Or is it injustice. Words, words, words.)

lmm · on Oct 10, 2016

> Cardinal Richelieu wrote some six lines or so on this subject.

Which tells you that this problem is not new by any means.

dredmorbius · on Oct 10, 2016

Scales and rates matter.

salimmadjd · on Oct 10, 2016

No real "journalist" should be looking for soundbites. It's the job of special interest groups and super PACs.

grogenaut · on Oct 10, 2016

Cursory use of the app failed on all searches:

I searched for oil, mexico, automatic weapons, isis... It found "crisis", a reference to oil without actually being on the correct timestamp, and mexico got several 50% results, none of which were isis. Automatic weapons got "nuclear weapons". So basically 0/4 searches.

Pretty sure that most networks use the closed caption data to find clips. NBC had an indexed system for all of their tapes in prototype in '98 for their whole back catalog. Not sure if it ever went online.

mobiuscog · on Oct 10, 2016

'Journalists' don't need to find damning soundbites - they're given them by the opposition.

dredmorbius · on Oct 10, 2016

Fair point that Opposition Research is a thing.

https://en.wikipedia.org/wiki/Opposition_research

dredmorbius · on Oct 10, 2016

OK, throw this into the mix: many of you reading this have smartphones which are voice controlled, and for which voice control is activated at all times. In the case of Google, that processing must take place on Google's centralised servers. Siri may or may not do centralised processing (and can operate in standalone modes). Microsoft's Cortana, Facebook's "M" (IIRC) and Amazon's Echo are all various stylings of "Stasi in a Glade form factor", as Maciej Cegłowski so memorably put it.[1]

Voice stores distressingly cheaply in terms of space, and with the Internet of (broken) Things (that spy on you), odds are good that you'll find yourself surrounded by microphones in the most unexpected locations,[2] controlled by a wide variety of quite probably competing interests.[3] And if they cannot find what they're looking for in the surveillance tape itself, they'll simply manufacture their own evidence using your own phonemes[4] and video.[5]

________________________________

Notes:

1. https://twitter.com/pinboard/status/732985370204233728

2. http://www.inquisitr.com/3097029/government-surveillance-in-...

3. http://www.locusmag.com/Perspectives/2016/09/cory-doctorowth...

4. http://www.theatlantic.com/technology/archive/2016/09/hackin...

5. https://www.youtube.com/watch?v=ohmajJTcpNk

AlexCoventry · on Oct 10, 2016

  > In the case of Google, that processing must take place on
  > Google's centralised servers

Doesn't recognition of the initialization phrase "OK, google" take place on the phone? Sending a continuous stream of audio back to google servers sounds expensive.

dredmorbius · on Oct 10, 2016

AFAIU (which is little), "OK, Google" is processed locally. Whatever follows is processed remotely.

I should have mentioned voice-activated televisions as a whole 'nother class of attack.

wwweston · on Oct 10, 2016

"Every proclamation guaranteed free ammunition for your enemies..."

manifestsilence · on Oct 10, 2016

"Why do you act like you're the smartest in the room, why do you act like you're the smartest in the room..."

gh1 · on Oct 10, 2016

TL;DR A deep learning algorithm for searching words ("freedom", "Obama" etc.) in a recorded audio/video clip.

The title is clickbait, but the claim is rather interesting. It says that it has 80% accuracy for transcribing an audio clip compared to 20% for speech-to-text.

stephensonsco · on Oct 10, 2016

Hey, I'm the blog post author (Deepgram cofounder too)! Sorry if the accuracy wasn't clear enough. The numbers quoted are for search accuracy, not for speech-to-text accuracy. It's a subtle thing but makes a lot of difference (and it is the thing that matters most when you are in trying-to-find-something mode).

If you ran phone calls or tape recorder audio through a speech-to-text engine then the word accuracy rate is like 10-50% (i.e. abysmal). When you try to search for keyphrases like "frolicking kitten" the likelihood of a text match with STT is ~20%.

If you ran that same search with Deepgram then 80% of the time you'd find what you are looking for since Deepgram doesn't have to guess at what is being said, it takes the inverse approach and matches 'how it sounds' using deep learning voodoo magic™.
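
One rough way to see why exact phrase search over a transcript can be far worse than the per-word accuracy suggests: if each word is transcribed correctly with probability p, and errors are independent (an assumption made here for illustration, not something stated about Deepgram or any STT engine), then an n-word phrase survives intact with probability p^n. The function name below is hypothetical.

```python
def phrase_match_probability(word_accuracy: float, phrase_length: int) -> float:
    """Chance that every word of a phrase is transcribed correctly.

    Assumes word errors are independent, which real recognizers do not
    guarantee; treat the result as a back-of-the-envelope estimate.
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if phrase_length < 1:
        raise ValueError("phrase_length must be at least 1")
    return word_accuracy ** phrase_length


# Word accuracies spanning the 10-50% range quoted for noisy audio.
for accuracy in (0.10, 0.30, 0.45, 0.50):
    chance = phrase_match_probability(accuracy, 2)
    print(f"{accuracy:.0%} word accuracy -> {chance:.1%} two-word phrase match")
```

Under this simplification, a word accuracy of 45% gives about a 20% chance of matching a two-word phrase such as "frolicking kitten" exactly, which is in line with the ~20% figure above.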

detaro · on Oct 10, 2016

Is it intentional that your website has no pricing information or specific details what one would get after signing up for an API key? Some blogposts reference pricing, but no idea if those are current and what the APIs look like. Not very inviting to play with stuff :/

garysieling · on Oct 10, 2016

I've been using Deepgram for building https://www.findlectures.com - I don't know about the marketing site but there is pricing within the app.

mdrzn · on Oct 10, 2016

If you sign up it's 5 hours for free then $5 for 6.7 hours. I figured it out just now.

detaro · on Oct 12, 2016

So their old blogposts are inaccurate...

kastnerkyle · on Oct 10, 2016

There are a number of papers tackling a similar task [0][1][2] for anyone who is interested. There isn't enough information to tell exactly what is going on with Deepgram, but one way to approach this would be to construct a shared embedding space for words/phrases and speech. These types of embedding spaces are powerful [3][4][5], but not magic.

Cool demo, looking forward to seeing more detail about what is going on. However I would quibble with the STT WER quoted above. Maybe in noisy environments with unknown speakers (and no voice normalization) this is accurate, but the kinds of clean speech in the demo perform really well in modern recognition engines (on benchmark data, to be fair; cf. MSR at 6.3% and IBM at ~6.9%).

Most word searches over speech to text work over soft matches (or ideally beam search over most likely partial phoneme/word part matches), rather than hard matches so it seems like a bit of a straw man comparison in this case.

[0] http://research.google.com/pubs/pub42543.html

[1] https://arxiv.org/abs/1510.01032

[2] https://sigport.org/sites/default/files/gloveNNLM_kaudhkhasi...

[3] https://arxiv.org/abs/1502.03044

[4] http://www-personal.umich.edu/~reedscot/files/icml2016.pdf

[5] https://arxiv.org/abs/1411.2539
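
To illustrate the hard-match versus soft-match distinction above, here is a minimal sketch that slides a window over a text transcript and scores each window against the query with character-level fuzzy matching from Python's standard `difflib`. This is only a stand-in: it is not phoneme-level matching or beam search, and it is not how Deepgram or any particular engine works. The function name, threshold, and sample sentence are all made up for the example.

```python
from difflib import SequenceMatcher


def soft_search(transcript: str, query: str, threshold: float = 0.8):
    """Return (start_word_index, window_text, score) for fuzzy matches.

    Each window has as many words as the query; a window is kept when its
    character-level similarity to the query reaches the threshold.
    """
    words = transcript.lower().split()
    query_words = query.lower().split()
    size = len(query_words)
    target = " ".join(query_words)
    hits = []
    for start in range(max(len(words) - size + 1, 0)):
        window = " ".join(words[start:start + size])
        score = SequenceMatcher(None, target, window).ratio()
        if score >= threshold:
            hits.append((start, window, score))
    # Best-scoring windows first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical mis-transcription: "kitten" came out as "kitchen".
garbled = "we saw a frolicking kitchen near the barn"
print("frolicking kitten" in garbled)             # hard match misses it
print(soft_search(garbled, "frolicking kitten"))  # soft match still finds it
```

A hard substring match fails on the garbled transcript, while the soft match still surfaces the right window, which is why comparing against hard matches alone understates what transcript search can do.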

abrookewood · on Oct 10, 2016

I thought the figures sounded a little suspect too - if automated speech-to-text was really that bad, then presumably products like Dragon Dictate etc would never have been successful. Obviously they have the benefit of training, but still, 20% sounds ludicrously low.

CoryG89 · on Oct 10, 2016

I think you need to take into consideration that this is designed to be used on voice where the speaker is not aware that they are speaking to a computer (or that they are being recorded at all).

When people talk to Siri, Cortana, or Dragon, they take unnatural care in the clarity of their speech compared to normal talk only meant for humans. Also the speaker may not have been speaking directly into a microphone, lots of background noise, etc.

All of these factors probably combine for a much lower accuracy than what Apple, Microsoft, and Google are going to be dealing with in usual cases. Also keep in mind they all have incentive to inflate their own products' accuracy scores. Not that the same incentive doesn't also exist for this company/product.

jxramos · on Oct 10, 2016

I had a friend working in speech recognition at Nuance for several years. I'm trying to remember what sort of accuracy he quoted to me about two years ago. It was something large like 80% or something and that getting the last stretch of accuracy was really hard with the techniques they employed back then. That figure may have been for a very specific application, which was a lot of phone menu stuff. He then moved on to Siri at Apple.

dsyko · on Oct 10, 2016

20% is very low. We use machine transcription at work, and although the per-word confidence of machine algorithms varies wildly given the quality of the audio and the speaker's mannerisms, it can easily get into the 80% range with good audio, especially televised audio where people are close to microphones and there is not a lot of non-speech noise.

draugadrotten · on Oct 10, 2016

Microsoft claims 93-94 percent correct, or in other words "6.3% word error rate"

http://www.zdnet.com/article/microsofts-newest-milestone-wor...
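
For reference, word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, so "percent correct" is only a loose reading of it, since insertions also count as errors. A minimal sketch of that calculation using word-level edit distance follows; the function name and sample sentences are made up for the example, and real benchmarks also normalize text (case, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing
            insertion = current[j - 1] + 1   # extra hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```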

theparanoid · on Oct 10, 2016

There's little evidence the Trump clip was found using 'new tech'. Most likely somebody at Access Hollywood remembered recording it and it went from there.

ojbyrne · on Oct 10, 2016

The article seems fairly careful to not claim that.

throwanem · on Oct 10, 2016

The article is a marketing piece for a technology sold by its publisher.

R_haterade · on Oct 10, 2016

Some say Billy Bush himself was involved, beyond just being in the recording.

sverige · on Oct 10, 2016

After watching the video, Billy Bush seemed more lecherous than Trump after they got off the bus. Not defending Trump, but since there's no context at the start, I wonder if Bush steered the conversation down that road in the first place.

tomtoise · on Oct 10, 2016

Well then it backfired horribly.

http://money.cnn.com/2016/10/09/media/billy-bush-nbc-today-s...

R_haterade · on Oct 19, 2016

He's a Bush. He'll land on his feet, I promise.

thr328982 · on Oct 10, 2016

I found something really sexist:

> Women have always been the primary victims of war. Women lose their husbands, their fathers, their sons in combat.

thr328982 · on Oct 10, 2016

Here is another one:

> Marriage has historic, religious and moral content that goes back to the beginning of time, and I think a marriage is as a marriage has always been, between a man and a woman