Researchers achieve speech recognition milestone (microsoft.com)
210 points by gzweig on Sept 14, 2016 | 135 comments



For those not familiar with the NIST 2000 Switchboard evaluation[1]: it is a series of 8kHz audio recordings (i.e. crappy phone-quality samples) of conversations, including things like "uh-huh" and other pause words. So, 6% seems pretty good.

[1] http://www.itl.nist.gov/iad/mig/tests/ctr/2000/h5_2000_v1.3....


Judging by my everyday interactions, a 6% error rate is lower than human error rates in casual conversation.

People regularly ask each other, "sorry, what did you say?", "wait, what did she say?", "would you repeat that please?", "huh?", etc.


The difference is that people are OK with a human asking for clarification, but systems like Siri need to have a near-zero error rate before people will consider them good (a person who has to repeat themselves once every 20 times will consider it bad, or at least not good enough).


I'm not sure people expect super-human performance out of Siri. An important difference is that a human who doesn't understand will say so, and ask to repeat the relevant part (or to choose between two alternatives), conversationally; or they will pick an interpretation which is not the intended one but was an understandable misunderstanding.

Contrast this with speech recognition, which will often substitute words that are nonsensical in context, making it look silly from a human perspective...


I think another important difference is that humans won't get stuck in a loop asking you for clarification the same way several times; after 2 or 3 times they'll typically change behavior. E.g., they'll ask you to spell the word, or repeat the not-understood word back with a questioning tone to signal that they don't understand what that word means.


This could be implemented, though. Based on the part of the sentence that is understood, figure out the most likely words for the missing part and ask a specific question about it to fill the gap.
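A rough sketch of that idea in Python (the candidate list, scores, thresholds, and question template below are all hypothetical, invented for illustration, not any assistant's actual logic):

    # Toy sketch of "fill the gap, then ask a targeted question".
    def clarify(prefix, suffix, candidates):
        """Given the understood parts of an utterance and scored guesses for
        the missing word, either accept the top guess or ask a specific
        question instead of a generic "please repeat"."""
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        best, second = ranked[0], ranked[1]
        if best[1] > 0.9:                      # confident: just proceed
            return f"{prefix} {best[0]} {suffix}".strip()
        if best[1] - second[1] < 0.2:          # two close guesses: disambiguate
            return f'Did you mean "{best[0]}" or "{second[0]}"?'
        return f'Sorry, what was the word after "{prefix}"?'

    # "Call ___ tomorrow", with made-up acoustic scores for the unclear word.
    print(clarify("call", "tomorrow", {"Benny": 0.48, "Betty": 0.44, "Bennie": 0.08}))
    # -> Did you mean "Benny" or "Betty"?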


See, it's not about hard-coding such behavior. I would say that it reaches a human level of understanding if it automatically learns these ways of solving the problem. Asking relevant questions can be hard-coded, but that doesn't equal "understanding" the problem.

I think the Chinese Room thought experiment overlooks this part of "understanding".


Exactly. When SR has a low confidence level it needs to ask you to repeat yourself, not just choose the highest-confidence match and hope for the best.


Siri underlines words it's not sure about. Then if you click it, it gives you a menu of other potential candidates.

Seems like a good approach.


That's a good start, but probably the wrong interface for it: it's "non-native" in the context. A command initiated by voice should present the options by voice.


It's a valid HCI solution to a technical failure mode. Once the software has advanced to the point where the AI is truly conversational, that will be a watershed moment.


That's fine for dictation but of little use when driving or in other eyes-free scenarios.


Also. When. People. Talk. To. Siri. They. Speak. Very. Distinctly. With. Clear. Separation. Between. Words.

Or that is my observation, anyway. I don't use it myself.


I bet Siri's great at understanding what William Shatner says.


The important thing here, IMO, is going to be how the system asks for clarification. Hearing the same canned "I'm sorry, I didn't quite get that, can you repeat?" phrase 20 times in a row is annoying. Having the computer say "I'm sorry, what was that last word?" or "I didn't quite catch that, did you want me to call Benny or Betty?" would be far more acceptable.


Like someone else mentioned, how it makes sense out of words is much more important than a zero error rate.

Understanding rate is less than 10%. If you don't match a keyword it gives a useless web search.

Personally I don't think understanding rate is the whole issue as much as reaction to error (which is partly understanding). You can't say "no, that's not what I said", and Siri et al. never keep enough context to say "huh? What did you say?" or "I didn't get that last part, can you repeat it?"

It's that errors in understanding or accuracy turn the whole thing into a complete shitshow.

One failure and you might as well pull over and type what you want.


Remember, this is with low-quality sound. Accuracy could be much higher under better conditions. Amazon's Echo relies on good hardware as much as software, with an array of good mics.


From talking to a few people who do SR, it's also considerably easier when you know the hardware.

They can cancel out reverb and create very fine tuned waveform profiles for speech.

I think one of the reasons that Siri is slightly better at SR than google is because of the control that Apple has over the hardware.

While Cortana turns sourpuss on me every time I switch headsets.


No, the err rate is not a big deal. What is a big deal is making sense of the words it actually can hear.


One big problem with Siri is that it has zero sense of humor. That is, imho, what makes people feel tired talking to it. It's like talking to a boring civil servant.


>> Judging by my everyday interactions, a 6% error rate is lower than human error rates in casual conversation.

It's better to avoid throwing around numbers like that, but even if that were the case you have to remember that humans understand speech. The speech recognition task performed by AI systems, on the other hand, is more akin to transcription: the system takes in sound as input and produces text as output. Any sort of "understanding" a) is extremely difficult to do well and b) must be performed by a different component of the system (a different algorithm, trained on different data).


The difference is that people understand what you say, don't just map your speech into words.


> People regularly ask each other, "sorry, what did you say?", "wait, what did she say?", "would you repeat that please?", "huh?", etc.

For humans, isn't this due to a combination of factors rather than comprehension alone? Humans who ask "sorry, what did you say?" or "would you repeat that please?" or even just "huh?" usually aren't paying attention at all. It's not a comprehension, sound-quality, or surrounding-noise problem for most, except in situations where the person is not fluent in a particular language, dialect, or accent, or where the surrounding noise versus the person's hearing ability isn't conducive to listening properly.

Most people also tend to judge what the other person is saying and construct a counter-point while listening, which impairs the ability to listen and understand well.

On the other hand, a computer could be expected to be, and made to be, paying attention much better and in a predictable way, which is not possible with humans.

With the other comment reply above stating people's expectations with humans vs. computers, shouldn't we also consider the computer's strengths while making comparisons with humans?


That's mostly because people are thinking about other things. We understand that and anticipate it. If my computer doesn't understand me, it has no excuse as it can't distract itself. It isn't going to hear me next time by "concentrating harder" like a human can. It's going to keep failing.


I have a different experience - many people speak with a mumble or a mushmouth and no amount of concentration helps me disentangle it until I can get them to speak more clearly.


Sure, but if you repeat your utterance there's a good chance that the conditions will have changed the second time around: maybe the background noise will have subsided, or you'll have swallowed that bit you were chewing on, and so on. It makes sense to ask you to repeat a couple of times even if it's a computer you're talking to.


Yes but people know to ask for clarification based on context. You know that they didn't just say they were off to wok their log.


Humans do way better than 6%. I don't know who you're talking to, but I don't know anybody who needs me to repeat anywhere near 6% of what I say.


I'm also not seeing anything close to 6% on any public implementations. The voice mail transcript emails I get are often so bad that it's impossible to discern even the gist of what the caller is talking about.


Does the word error rate account for context? For example, if I heard "clown" instead of "cloud", I would probably treat it as something I misheard in the context of the sentence and correct my understanding accordingly. I guess what I am trying to ask is: if the system reaches 4% (the human error rate for the same sample), can it really be considered on par?


It measures against what humans write down when they listen. Humans generally use context, so I'd expect that they would record "cloud" if that was appropriate in context.

The Microsoft system also does this: it uses language modelling to attempt to model which word is more likely in a given context. This gives them an 18% word error rate reduction (see section 7 of the paper: http://arxiv.org/pdf/1609.03528v1.pdf).
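Concretely, WER is word-level edit distance against the human reference transcript, so context only enters through the words each transcriber or system writes down. A minimal sketch of the standard calculation (not the paper's actual evaluation scripts):

    # Word error rate: word-level Levenshtein distance divided by the
    # length of the reference transcript.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("there is a cloud in the sky", "there is a clown in the sky"))  # 1/7, about 0.143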


All language models attempt to model what word is more likely in a given context :)

What is somewhat unusual about this approach is that they use recurrent neural networks for the language modelling, as opposed to more traditional approaches like a backoff n-gram model.
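For contrast, a toy version of the "traditional" approach mentioned here: a "stupid backoff" bigram model over a made-up corpus. This is only an illustration of the idea, not anything from the paper:

    # Toy "stupid backoff" bigram language model.
    from collections import Counter

    corpus = "there is a cloud in the sky and a cloud over the hill".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    total = sum(unigrams.values())

    def score(prev: str, word: str, alpha: float = 0.4) -> float:
        """P(word | prev): use the bigram estimate if the pair was seen,
        otherwise back off to a discounted unigram estimate."""
        if bigrams[(prev, word)] > 0:
            return bigrams[(prev, word)] / unigrams[prev]
        return alpha * unigrams[word] / total

    # In context, "cloud" scores far higher than the acoustically similar "clown".
    # (A real model would also smooth unseen words instead of scoring them zero.)
    print(score("a", "cloud"), score("a", "clown"))

An RNN language model plays the same role, scoring the next word given its history, but it conditions on the full history through a learned hidden state rather than fixed-length counts.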


Most modern language modelling is RNN based now; eg: https://arxiv.org/abs/1602.02410 (which I think is SOTA).


>> 6% seems pretty good.

That's 6% on the NIST dataset. Typically, results get much worse on real-world datasets, not least because trying to get good results on the same dataset year after year leads to subtle bias. Don't forget this is a dataset that's been around for 16 years, now.


When I worked at a speech recognition startup in the late 90s I was told that anything over 5% is essentially useless for no-touch recognition.


> a series of 8kHz audio recordings (ie. crappy phone quality samples)

I worked in speech recognition for a bit, and 8kHz was the standard audio rate for recordings. No one saw this as an impediment, and it really isn't: 8kHz sampling can capture frequencies up to 4kHz, and the fundamental frequencies of speech are MUCH lower than 4kHz.


At an 8 kHz sampling rate, you can tell /f/ from /s/ only by context (they sound the same).
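A quick numpy check of the point both comments are making: at an 8 kHz sampling rate nothing above 4 kHz survives, and the high-frequency energy that distinguishes fricatives like /f/ and /s/ is largely above that. A 6 kHz tone sampled at 8 kHz produces exactly the same samples as an inverted 2 kHz tone:

    import numpy as np

    fs = 8000                       # sampling rate (Hz)
    t = np.arange(fs) / fs          # one second of sample times
    tone_6k = np.sin(2 * np.pi * 6000 * t)
    tone_2k = np.sin(2 * np.pi * 2000 * t)

    spectrum = np.abs(np.fft.rfft(tone_6k))
    peak_hz = np.argmax(spectrum) * fs / len(tone_6k)
    print(peak_hz)                                    # 2000.0 -- the 6 kHz tone aliased
    print(np.allclose(tone_6k, -tone_2k, atol=1e-9))  # True: the sampled waveforms coincide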


That's very technically impressive, but 6% is still a long way from where it needs to be as anything like a primary interface. Even 1% is pretty high when you consider how many words you can utter in a few minutes, and how many errors that would generate.

Edit: For comparison: http://www.utdallas.edu/~assmann/hcs6367/lippmann97.pdf


The human error rate on this task is 4% (https://pdfs.semanticscholar.org/387e/7349b8e31e316c2a738060...). So you are basically saying that telephones are a useless interface...

What is amazing is how consistently people overestimate human performance.


Happens all the time with autonomous driving. I've heard so many people argue that self-driving cars need to be 100% perfect, but they always overlook that humans aren't anywhere near 100% perfect drivers.

Frankly, we're lousy drivers.


Sorry, what is this based on? I for one don't like the idea of self-driving cars and that has nothing to do with wanting them to be "100% perfect", or forgetting that human drivers are not safe either.

On the one hand, there are very good ethical arguments, for example about who bears the moral responsibility when a self-driving car is involved in a fatal accident.

Further, there is a great risk that self-driving cars will become available long before they are advanced enough to be less of a risk than humans, exactly because people may implicitly trust an automated system to be less error-prone than a human, which is not currently the state of the art.


>Sorry, what is this based on?

Conversations I've had where people have told me that self-driving cars will need to be 100% perfect before they should be used. Ironically, one of those people was an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.

Anyway, based on Google's extensive test results, I'm pretty sure self-driving cars are already advanced enough to be less of a risk than humans. Right now, the sensors seem to be the limiting factor.


Try looking for results where the Google car is driving off-road. There aren't any. That's because it can't drive off-road. It can't, because it needs its environment to be fully mapped. In other words: it's not about the sensors.

This should make, er, sense. Sensing your surroundings is only the first step in taking complex decisions based on those surroundings. The AI field as a whole has not yet solved this, so there's no reason to expect that self-driving cars have.

Seen another way, if self-driving cars could really drive themselves at least as well as humans drive them, we wouldn't have YouTube compilations of robots falling over while trying to turn door knobs.

The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.

>> an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.

Honestly.


>The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.

Google's self-driving car accident statistics say otherwise.

>Honestly.

Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.


>> Google's self-driving car accident statistics say otherwise.

That's an experiment that's been running in a tiny part of one state in one country for a very limited time. I wouldn't count on those numbers, and in any case, see what I said above: the state of the art is not there yet for fully autonomous driving that's better than humans'.

>> Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.

:snorts coffee:


>That's an experiment that's been running in a tiny part of one state in one country for a very limited time.

It's driven over 1 million miles. That's the equivalent of 75 years of driving for the average human. Plenty of data to draw a conclusion from. In all that time, it's been responsible for a single accident. That's way better than human drivers.


Hours driven are one dimension, and an important one. However, there is also the geographical aspect that I pointed out, which may matter more in practice. I mentioned off-road driving. There's also driving on busy roads. The Google car project has not driven 1 million miles in a busy city like NY or SF, nor in heavy traffic conditions.

Then there's the fact that human drivers have to drive in all sorts of weather conditions with all sorts of different vehicles and so on. The Google car, not so much.

But my point is very simple: AI in general is nowhere near producing autonomous robots, yet. Why would Google car (or a similar project) be the exception? What makes cars and driving so different that autonomy is easier to attain?


I agree with your general sentiment, but think those error rates hide a lot as well. Human error might be at x% overall, but when you eliminate malfunctioning humans, broadly defined, it's probably much lower than x%.

The recent death of the Tesla owner, for example, as far as I know, was due to the vehicle accelerating into a semi. This is something that most people would not do even in their worst driving state unless they were intoxicated or seriously mentally impaired. I don't want AI driving errors to be compared to human benchmarks that include people who are seriously intoxicated.

A lot of speech frustration problems, similarly, are not only about poor recognition in general, or lack of appropriate prompting to increase classification certainty, but recognition failures in situations where a human would not have any trouble at all, such as in recognizing names of loved ones, or things that would be clear in context to a human. I.e., maybe humans listening to speech corpora would have x% error rate, but that's strangers listening to the corpora. The real question is, if I listen to a recording of my spouse or coworker having a conversation what's the error rate there?

So, although humans are far from perfect, which is something that's often forgotten, the true AI target is also probably not "humans broadly defined" but rather "functional humans" or something like that. AI research often sets the bar misleadingly low because it's so hard to reach as it is.


The types of mistakes a human and a car are prone to making are different. Neither one has to be a superset of the other. For example, cars are probably better at going round corners at a safe speed while humans can easily misjudge and end up skidding. You could make the opposite argument and say only the most malfunctioning self driving car would choose the wrong speed for a corner yet humans make that error all the time, so humans are even worse than the worst self-driving cars.

Another example. If a self driving car is hit by another car that's running a red light while speeding, we might be more forgiving and say "well nobody could have avoided that accident" but actually we'd be being too soft on the self driving car since it has access to more data and faster reaction times and should probably be expected to avoid that type of crash even when a human can't.


Sorry, but did you get the Tesla story right? The Tesla driver was not paying attention, and the car drove at constant speed into an obstacle of the sort that it is known to not be able to see. I know people are posting all kinds of things on the Internet and HN about this accident, but that doesn't make it true. If an actual self-driving car did the same thing, you'd have a great example. But not this one.


On the flip side, a large percentage of human drivers are unlikely to suddenly run into the nearest stationary object because of a botched software update.


True, but they often run into the nearest stationary object because they're texting, or eating, or putting on makeup (an ex-gf of mine caused TWO accidents that way), or arguing with a passenger, or speeding, or driving tired after a long day of work, or driving after having a few beers but I'm totally fine I swear I'm cool to drive home, etc...

You're absolutely right that there are risks. But honestly, I suspect drunk drivers alone cause more fatal accidents than autonomous cars ever could.


That's the point though... people who want 100% perfect automation have no leg to stand on when presented with a more reliable option than a human. That's not the case with speech recognition yet, and I'd say as a result people would be well justified in demanding at least as much capability as they're capable of themselves.


Really? I expect there to be huge resistance to driverless cars (at first) as people come up with crazy scenarios in which they're certain that their own elite driving skills could save them but the car could not. Then there'll be a reversal as people start to actually accept that the car is a better driver than they are, and surprisingly quickly they'll start saying anyone who wants to drive manually is a dangerous egotist.


Maybe, but the "upside" of humans being such terrible drivers is that we've almost all either gotten into a bad accident, or know someone close who has. The fact that it's not going to be hard to empirically prove a difference between machine reliability in this case, and human reliability may help as well.

Unfortunately... I think the big issue is going to be pure anxiety; a bigger and more immediate form of what a lot of people experience on an airplane. Giving up even the illusion of control is supremely hard for us as a species, in general. Then there's just the fact that as a species we're terrible at risk assessment.

https://www.schneier.com/blog/archives/2006/11/perceived_ris...


http://www.utdallas.edu/~assmann/hcs6367/lippmann97.pdf

I did say "primary interface" as well, which definitely rules out a mediocre phone connection.


Does using speech multimodally with other input methods (touch, gesture, pen, clicker, game controller, even keyboard - may seem silly but speech is potentially faster at some tasks) still count as "primary" if the other input method is used supplementarily to help disambiguate?


I'd say, and let me be clear that I know and accept this to be a relatively arbitrary distinction on my part, that for an input to be "primary" it needs to be able to stand alone (touchscreen input, keyboard and mouse, etc.). That's not to say that it must stand alone, but that when all other options are off the table, that would be your preferred method for text/data entry.

That's a troublesome definition though at least in part for reasons you've brought up or alluded to, which is that a multimodal approach is pretty clearly going to dominate. That said, speech recognition at least stands to replace the keyboard for say, the author of a book or article, if it's good enough.


Compared to voice over internet? Yeah, it's pretty terrible.


Recognizing the words is just the first step. Getting meaning out of those words is what really counts. When on a telephone call I may miss a percentage of what was said, due to poor audio quality, the speaker's accent, etc., but based on context I can still understand the message the speaker is trying to convey most of the time. Transcribing each word with near-perfect accuracy is unnecessary if the layers above that can handle it.


That's true, but I wonder if computers will exceed our technical accuracy first, or begin to actually "understand" things first? I suspect the former.


They don't need to actually understand (whatever that means); they can apply statistical models based on phonetic distances and large corpora of dialogue.
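A toy sketch of the kind of rescoring being described, tying back to the "wok their log" example above: pick among acoustically similar candidates by combining a similarity score with a prior from text. The character-overlap "acoustic" score and the phrase counts are stand-ins invented for illustration; a real system would use phonetic distances and a proper language model:

    from difflib import SequenceMatcher

    heard = "off to wok their log"
    candidates = ["off to wok their log", "off to walk their dog"]

    # Hypothetical phrase counts from a large dialogue corpus.
    corpus_counts = {"off to wok their log": 0, "off to walk their dog": 5000}
    total = 1_000_000

    def score(candidate: str) -> float:
        acoustic = SequenceMatcher(None, heard, candidate).ratio()   # "sounds like what we heard"
        prior = (corpus_counts.get(candidate, 0) + 1) / total        # "is something people say"
        return acoustic * prior

    print(max(candidates, key=score))   # off to walk their dog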


Has that worked so far?


Google voice search will make several guesses as to what I said. It will sometimes enter a nonsensical but similar-sounding phrase to what I intended into the text box, but will then figure out that its first guess is nonsense and return the correct search result.

So just based on observation I would say it works pretty well. On mobile I almost always use voice search, and unless I'm searching for an Italian name or something unusual it almost always hears me correctly. Even in a noisy pub.


That's a good point, I've rarely had issues with Google voice searching.


Of course it has. We went down from 40% to 6.9% error rate in 26 years. It may take a couple of decades to get to 0.1%.


How does it compare to the typical human error rate?


Best I can find is section four here: http://www.utdallas.edu/~assmann/hcs6367/lippmann97.pdf

It seems to be significantly less than 1% - 1.6%


No, you looked at the wrong figure. It's Figure 7 (Switchboard), and the WER for humans is 4%.


No, if you look that's only the CC component of the test, with and without context; not the whole test.


6.3% on Switchboard. This is of course in response to IBM getting 6.6%, which was in turn in response to Baidu getting...

Switchboard is kind of a lame evaluation set. It's narrowband, old, and doesn't contain all that much training data (100s of hours, whereas many newer systems are trained on 1000s or 10Ks of hours). And the quest for a lower Switchboard WER to publish means teams are now throwing extra training data at the problem, or using frankly unlikely-to-be-deployed techniques like speaker adaptation, impractically slow language models, or bidirectional acoustic models (which require the entire utterance before they can emit any results).

I really wish they would have stuck to just publishing a paper explaining what was actually new here (ResNet for acoustic models? Cool!) rather than a "let's see how low we can push this 20-year-old benchmark" paper.


I'm not sure what your complaint is. The paper (on arxiv, linked in the blog post) describes the general techniques used.

Are you saying the benchmark is useless? It's old, yes, but it's extremely valuable to have a benchmark that allows one to assess system performance over time. It gives a good idea of the rate of progress and the distance still to go to match people - after ~16 years, computers are still about a third worse than humans, error-rate wise.


Surely not "useless" but it doesn't reflect the way speech recognition is used today. Unless you're routinely listening in on two humans having a phone call conversation that they don't expect a computer to be hearing, which is what the test set actually contains.

If you looked at modern performance on test sets from the 1980s (like Resource Management or TIDIGITS) you might be under the impression that we'd achieved human-level accuracy years ago, but we clearly haven't. And similarly, what users expect from speech recognition today is in many ways much more demanding than it was in 2000: vocabularies are huge (think about all the words you could say to Google), latency needs to be very low, and no one thinks it's acceptable to require users to perform enrollment any more.

So yes, just like other benchmarks, we should retire them after a few years. The fact that a modern computer could get 100,000 FPS on a video game from 2000 wouldn't be considered a "milestone."


>> Are you saying the benchmark is useless?

I'm not the OP (who replied already) and I don't think old benchmarks are useless, but I'm worried that teams trying to beat a dataset from an old competition for a sufficiently long time will inevitably overfit to that dataset, reducing the accuracy of their published results. That's even more so when the competition's test set is available and there's nothing really keeping it from "creeping" into the training set at some point, maybe between different system versions.

What would really be useful is a sort of ongoing challenge where a training set stays up for a decade at least and the test set is never revealed (but can be used to test systems). Perhaps data could even be renewed every few years as long as new examples can be reliably collected in a similar enough manner with older data.


Speaker adaptation, unlikely to be deployed? There are plenty of really big production systems with deployed speaker adaptation, whether that just be saving CMVN stats or saving i-vectors. I've worked on a couple of them.

w.r.t. run time, though, agreed. Hearing the IBM folks say "... 10" in response to the "what's the RTF" question was funny.

(and, agreed, at this point the switchboard announcements are definitely just marketing.)


RTF as in real time factor - how long does the processing take compared to the length of the recording?


Yup! Good production systems shoot for RTF ~ 1.0. This means that they can usually answer almost as soon as the speech is ended, because recognition is streaming.

And it's _really easy_ to increase accuracy by taking more time, by: building bigger DNN acoustic models; exploring a larger search space of hypotheses; using a slower language model (like an RNN) to rescore hypotheses; considering more possible pronunciations; etc....

(ML is usually a space / time / accuracy trade-off, so if you get phat accuracy gains at the cost of significant slow down, I'm usually unimpressed. The deepmind TTS paper _was_ impressive because it went beyond the best we can do, so even though it was 90 minutes to generate 1 second of speech, it's cool because it shows where we can go. TBH all of these switchboard papers don't do a ton of new stuff, they just get more aggressive about system combination and tuning hyperparameters.)
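To make the numbers concrete: RTF is just processing time divided by audio duration, so RTF 10 means a five-minute recording takes fifty minutes to decode. The timings below are hypothetical:

    def rtf(processing_seconds: float, audio_seconds: float) -> float:
        """Real-time factor: processing time / audio duration."""
        return processing_seconds / audio_seconds

    print(rtf(processing_seconds=300.0, audio_seconds=300.0))   # 1.0  -> streaming-friendly
    print(rtf(processing_seconds=3000.0, audio_seconds=300.0))  # 10.0 -> research-system territory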


The improvement from Switchboard can often carry over to other tasks. This is the task that has been used for 20 years by the speech community, a well-known entity for most people working on speech. It is very useful for comparing notes.


I would love to see a breakdown of the kinds of errors these systems make. WER is an interesting broad stroke, but it doesn't necessarily tell me how useful a given system will be for some given application[0] (unless, of course, it is 0). It'd be even more interesting to see comparative error analysis across the selection of these systems. A 0.06 point improvement is certainly impressive, especially this close to the end of the scale, but I'd be curious to see if it lost anything in getting there. It's one thing if this system is strictly better than its predecessor. It's entirely another if it is now 10% better at recognizing instances of the word 'it' but has lost the ability to distinguish 'their' and 'they're'[1].

---

[0] It is likely that any viability analysis would be on a by-application basis, so I don't pretend I'm asking for an insignificant amount of work here!

[1] a crude, toy, and likely inaccurate example. Not trying to belittle the work.


IBM was asked this, at Interspeech, and answered that a lot of their errors were on very short words ('a', 'the', 'of', etc.)


Open-source Kaldi gives you 7.8%, so Microsoft didn't get that much further.

Also, a major issue with this kind of research is that they combined several systems in order to get the best results. Most practical systems don't use combinations; they are too slow.


So this model won't be free software? Odd, and a bummer...

Also, I'm not sure an error reduction of 20% (1 - 6.3/7.8) should be considered small; it depends on the particular challenge. For example, sentiment analysis only starts to get interesting above 80% on some datasets, since much can be guessed correctly in very naive ways.

Human level on this task is estimated to be ~4%, so we still have quite a lot of ground to cover.


There are 33% fewer errors with the Microsoft solution than with Kaldi... one could say that is quite significant.


Relative decrease in WER is not so significant for lower percentages. How about "we make 6 errors on 100 words but Kaldi makes 7".


It is cool anyone can use CNTK to produce something similar now


23% only
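For what it's worth, the competing percentages in this sub-thread come from measuring the same 1.5-point absolute gap against different baselines (using the 6.3% and 7.8% figures quoted above):

    ms, kaldi = 6.3, 7.8
    print(1 - ms / kaldi)    # ~0.19 -> "the Microsoft system makes ~19% fewer errors than Kaldi"
    print(kaldi / ms - 1)    # ~0.24 -> "Kaldi makes ~24% more errors than the Microsoft system"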


Maybe they can distill the ensemble into a compact and efficient version for production.


Could you please provide a link?



The paper itself can be found at https://arxiv.org/pdf/1609.03528v1.pdf. It is quite interesting to see the failure rate coming in lower than the average human failure rate; can't wait to see how this improves over the coming years.


Speech-to-text still has a long way to go when it comes to foreign accents. Google Now's "OK Google" initializer has about a 3/10 hit rate for my Indian-accented speech.


'OK Google' is a different problem though - it needs to be a low-resource, always-on listener with a low false-positive rate. That's quite a different problem space from general speech-to-text.

I think I get about 50% hit rate with 'OK Google' and I'm a native English speaker :).


I had the weirdest conversation with my phone this morning, and I'm a native English speaker with nothing really identifiable as an accent.

"Okay, Google." Nothing.

"Okay, Google." Nothing.

"Okay, Google fucking work or I am taking a hammer to this fucking phone." "DING!"

Apparently threats of violence still work against our machine overlords.


How about the inverse process -- speech synthesis? Anyone know what the state of the art is in that field? The tech has been getting steadily better but we still seem a ways away from passable machine-generated audiobooks, for example.


Like this one?

WaveNet: A Generative Model for Raw Audio https://deepmind.com/blog/wavenet-generative-model-raw-audio...


That's very cool, thanks for the link! From the comparison, the audio quality is clearly improved and they eliminated the sort of digital "wobble" that I usually associate with TTS. Intonation is still a bit off, though. Will check out their paper.


I've wondered the same thing! How much work is being done in the area of making Siri and OK Google sound more human-like?


Computer generated speech (Siri and voiceover) was one of the areas that Apple said they were applying AI techniques to improve.


From the linked arxiv paper, http://arxiv.org/abs/1609.03528 this is a very interesting use of CNTK to adapt image CNN techniques to speech recognition. Surprising that CNNs worked so well on speech audio. Full disclosure: I am a MSFT employee.
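For readers curious what "image CNN techniques on speech" looks like, here is a toy PyTorch sketch that treats a log-mel spectrogram as a one-channel image and convolves over the time-frequency plane. The layer sizes, the 40-mel input, and the 100 output classes are invented for illustration; this is not the architecture described in the paper:

    import torch
    import torch.nn as nn

    class ToySpeechCNN(nn.Module):
        def __init__(self, n_classes: int = 100):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=(2, 1)),   # pool over frequency, keep time resolution
            )
            self.classifier = nn.Linear(32 * 20, n_classes)  # 40 mel bands -> 20 after pooling

        def forward(self, spectrogram):               # (batch, 1, 40 mel bands, frames)
            x = self.features(spectrogram)            # (batch, 32, 20, frames)
            x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, frames, 32 * 20)
            return self.classifier(x)                 # per-frame class scores

    scores = ToySpeechCNN()(torch.randn(1, 1, 40, 200))
    print(scores.shape)                               # torch.Size([1, 200, 100])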


For the world record, CNTK delivered the best speech recognition system and TensorFlow delivered the best Go player (switched from Torch) :-)


If these speech-to-computer interfaces are so important, why don't we develop a dialect for humans to speak to computers more efficiently, kind of like the Graffiti alphabet on the Palm Pilot but for speech?


OT, but this concept was explored by British author Robert Westall in his 1983 SF dystopia Futuretrack 5:

"Was it just coincidence that computerspeak, which we'd learned with such throat wrenching difficulty, was also a complaining, niggling, nit picking noise that gave any normal human being a pain in the arse?"

Now me, I used Graffiti a ton, as well as the not-quite-so-single-stroke and therefore not-so-patent-encumbered version that MS used on their late PDAs. I even implemented my own version on an early Windows touchscreen netbook (angular delta and Levenshtein distance, match), so much so that it still pollutes my handwriting to this very day. I find it elegant and immediately understandable. Everyone else thinks I'm writing in pigpen cipher.


Or the way we google. It would mean acknowledging that the listener (the computer) is severely limited. Actually it happens when speaking with foreigners with low proficiency in our language.


Acknowledging that the computer is severely limited seems like a good first step. Some people hate thinking of it that way though, and I have no idea why. It's annoying.



Communicating with computers was one specific goal of Loglan!


Here's an example, at least for code. "Using Python to Code by Voice" https://www.youtube.com/watch?v=8SkdfdXWYaI


While we're at it, maybe all my coworkers could speak in an American accent, too. That would just be so much easier.


Like Totally!


I wonder if we will get there gradually, as "power users" of Siri/OkGoogle/Cortana/etc, and features targeted to them, start to emerge. Initially it won't be couched as a new dialect but more like "shortcuts" which will become more sophisticated over time.


This is an interesting question. I wonder if the limitations are due to particularities of English and other common languages, or to particularities of human speech in general.


This poster[1] might be interesting. English spelling-pronunciation mapping is really complex compared to Italian for example. I wonder if Italian would be a better language for speech recognition. Most of speech recognition research is targeted at English, but with similar effort, maybe Italian would enable lower error rates.

[1]: http://web.psych.ualberta.ca/~chrisw/ML3/84.pdf


Didn't google announce some speech breakthrough last week?



Are speech recognition systems also paired with vision recognition systems to determine intent? Seems like that would be where research would be headed.


That's an interesting one. Not only for intent, but for getting the WER down. For example, my mother often mumbles but if I'm in front of her seeing her face I understand her perfectly, but if she's out of my sight where I can't read her lips I have trouble understanding her.


Why is this score a milestone?


Does Microsoft's speech cloud API support this, so that we can use it?


Others, such as Google, do the same: they develop some system in their labs, but put a worse system into production because it has lower resource usage.


No way they're running this system in production. These papers are pure marketing.

I'm sure their accuracy is fairly reasonable, though.


The best [speech] recognition engine in the world, and it hears the wrong word more than 6% of the time. Ouch.

I would have thought the state of the art would be better, given anecdotal evidence from friends who write with speech-to-text programs and love them.

Perhaps some of this is due to deliberately bad audio quality in the switchboard samples.


Mistakes are part of real life, just like how you spelled speach, instead of speech.


The parent comment has a failure rate of 1/38 words, which means a 2.6% failure rate, and this is one of the best commentators in the world, too.


The human error rate on this task is 4% (https://pdfs.semanticscholar.org/387e/7349b8e31e316c2a738060...). What is amazing is how consistently people overestimate human performance.


Even humans make mistakes: speach -> speech, deliberatly -> deliberately. I expect these systems to surpass human ability in the next few years.


My bet is that it's not really about audio quality, but because in order to understand speech you need to fully understand the language. Try to write down something you hear in a foreign language. As someone who learnt English as a foreign language, I can assure you that at first I wasn't even able to separate the words; it just sounded like a continuous stream of sound. Once you learn enough vocabulary you identify some words. Once you identify enough words you are able to fill the gaps using the context. With practice the whole process happens so fast you don't even notice.


I wish I only failed to recognize 6% of the words when listening to crappy teleconference calls...


You listen on teleconference calls?

I like it when people say "Can you repeat that, I was on mute"


Well, we have to put humans into the comparison in order to know exactly how impressive the 6% metric is. I would guess it might be pretty close.


I wouldn't blame the audio quality. It is more that different humans have different kinds of accent, which make the same word sound different. Considering the different accents, I would say a 6% error rate is pretty good compared to where we started.


Why shouldn't the audio quality factor in? Isn't it easier to understand 44.1kHz 16-bit CD-quality audio than 8kHz 8-bit PCM (phone data)?


It might be, but the resource demands of higher-fidelity acoustic models slow processing down. 44.1/16 has an order of magnitude greater bitrate than 8/8.
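The "order of magnitude" checks out with quick arithmetic (mono, uncompressed PCM assumed):

    cd_quality = 44_100 * 16      # 705,600 bits/s
    phone      = 8_000 * 8        #  64,000 bits/s
    print(cd_quality / phone)     # ~11x, about an order of magnitude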


I guess the point of this particular data set is to check performance on low quality phone audio


The base product use case has been to handle phone fidelity for many years. Think: legal dictation, retail digital recording hardware (phones), and medical transcription. Speech-to-text for recordings fed by a Telefunken U-47 is highly niche. :)


Heh. It's not so much because of the microphones that the data sets are THAT narrowband - it's more about phone bandwidth limitations. Even the cheapest electret and MEMS mics have pretty good frequency response, far beyond the 4kHz (the Nyquist frequency at 8kHz) this data set uses.

Now that bandwidth is becoming less of an issue, we will be getting less shitty sounding, wider bandwidth phone audio - https://en.wikipedia.org/wiki/Wideband_audio

Though if I had it my way, U87 or U47, or hey even SM7B would be mandatory for all speech recordings :)



