For those not familiar with the NIST 2000 Switchboard evaluation[1]: it is a series of 8kHz audio recordings (i.e. crappy phone-quality samples) of conversations, including things like "uh-huh" and other filler words. So, 6% seems pretty good.
The difference is that people are OK with a human asking for clarification, but systems like Siri need to have a near-zero error rate before people will consider them good (a person who has to repeat themselves once every 20 times will consider it bad, or at least not good enough).
I'm not sure people expect super-human performance out of Siri. An important difference is that a human who doesn't understand will say so, and ask you to repeat the relevant part (or to choose between two alternatives), conversationally; or they will pick an interpretation which is not the intended one but was an understandable misunderstanding.
Contrast this with speech recognition, which will often substitute words that are nonsensical in context, making it look silly from a human perspective...
I think another important difference is that humans won't get stuck in a loop asking you for clarification the same way several times; after 2 or 3 attempts they'll typically change behavior, e.g. they'll ask you to spell the word, or repeat the word they didn't catch with a questioning tone to signal that they don't understand it.
This could be implemented, though. Based on the part of the sentence that is understood, figure out the most likely words for the missing part and ask a specific question about it to fill the gap.
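A minimal sketch of that idea (the function, threshold, and prompt wording are all made up for illustration; real recognizers expose confidences and n-best lists in their own ways):

    # Sketch: turn a low-confidence word into a specific clarifying question,
    # using the recognizer's per-word confidences and n-best alternatives.
    # The 0.6 threshold and the question wording are arbitrary choices.

    def clarify(words, confidences, alternatives, threshold=0.6):
        for i, (word, conf) in enumerate(zip(words, confidences)):
            if conf >= threshold:
                continue
            options = [w for w in alternatives[i] if w != word]
            if options:
                # Offer the top competing hypothesis instead of guessing silently.
                return f"Did you say '{word}' or '{options[0]}'?"
            if i > 0:
                return f"Sorry, what was the word after '{words[i - 1]}'?"
            return "Sorry, could you repeat that?"
        return None  # confident enough to act without asking

    # Example: "call Benny" vs. "call Betty" with a shaky last word.
    print(clarify(["call", "Benny"], [0.95, 0.45], [[], ["Betty", "Penny"]]))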
See, it's not about hard-coding such behavior. I would say that it reaches a human level of understanding if it automatically learns these ways of solving the problem. Asking relevant questions can be hard-coded, but that doesn't equal "understanding" the problem.
I think the Chinese Room thought experiment overlooks this part of "understanding".
Exactly, when SR has a low confidence level it needs to ask you to repeat yourself, not just choose the highest-confidence match and hope for the best.
That's a good start, but probably the wrong interface for it; it feels "non-native" in that context. A command initiated by voice should present the options by voice.
It's a valid HCI solution to a technical failure mode. Once the software has advanced to the point where the AI is truly conversational, that will be a watershed moment.
The important thing here, IMO, is going to be how the system asks for clarification. Hearing the same canned "I'm sorry, I didn't quite get that, can you repeat?" phrase 20 times in a row is annoying. Having the computer say "I'm sorry, what was that last word?" or "I didn't quite catch that, did you want me to call Benny or Betty?" would be far more acceptable.
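Something like the sketch below would already feel less robotic; the prompts and retry cutoffs are invented for illustration, not taken from any shipping assistant:

    # Sketch: vary the clarification strategy instead of looping on one canned prompt.
    def next_prompt(failed_attempts, ambiguous_pair=None):
        if ambiguous_pair is not None:
            a, b = ambiguous_pair
            return f"I didn't quite catch that, did you want me to call {a} or {b}?"
        if failed_attempts == 0:
            return "I'm sorry, what was that last word?"
        if failed_attempts == 1:
            return "Sorry, I'm still not getting it. Could you say it a bit more slowly?"
        # After repeated failures, change tactics rather than repeating the same line.
        return "Could you spell that word for me?"

    for n in range(3):
        print(next_prompt(n))
    print(next_prompt(0, ("Benny", "Betty")))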
Like someone else mentioned, how it makes sense out of words is much more important than a zero error rate.
Understanding rate is less than 10%. If you don't match a keyword it gives a useless web search.
Personally I don't think understanding rate is the whole issue so much as reaction to error (which is partly understanding). You can't say "no, that's not what I said", and Siri et al. never keep enough context to say "huh? What did you say?" or "I didn't get that last part, can you repeat it?"
It's that errors in understanding or accuracy turn the whole thing into a complete shitshow.
One failure and you might as well pull over and type what you want.
Remember this is with low-quality sound. It could be much higher under better conditions. Amazon's Echo relies on good hardware as much as software, with an array of good mics.
One big problem with Siri is that it has zero sense of humor. That is, imho, what makes people feel tired talking to it. It's like talking to a boring civil servant.
>> Judging by my everyday interactions, a 6% error rate is lower than human error rates in casual conversation.
It's better to avoid throwing around numbers like that, but even if that were the case, you have to remember that humans understand speech. The speech recognition task performed by AI systems, on the other hand, is more akin to transcription: the system takes in sound as input and produces text as output. Any sort of "understanding" a) is extremely difficult to do well and b) must be performed by a different component of the system (a different algorithm, trained on different data).
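In pipeline terms, the split looks roughly like this (a toy sketch; both stages are stand-ins, and real systems blur the boundary):

    # Sketch of the two-component split described above: a recognizer that maps
    # audio to text, and a separate understanding layer (a different model,
    # trained on different data) that maps text to an action.

    def transcribe(audio_samples):
        # Stand-in for the acoustic + language model pipeline.
        return "call betty tomorrow"

    def interpret(text):
        # Stand-in for the natural-language-understanding component.
        tokens = text.split()
        if tokens and tokens[0] == "call":
            return {"intent": "call", "contact": tokens[1]}
        return {"intent": "web_search", "query": text}

    print(interpret(transcribe([0.0] * 8000)))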
> People regularly ask each other, "sorry, what did you say?", "wait, what did she say?", "would you repeat that please?", "huh?", etc.
For humans, isn't this due to a combination of factors rather than comprehension alone? Humans who ask "sorry, what did you say?" or "would you repeat that please?" or even just "huh?" usually aren't paying attention at all. For most people it's not a comprehension, sound quality, or surrounding-noise problem, except in situations where the person isn't fluent in the particular language, dialect, or accent, or where the surrounding noise relative to the person's hearing ability isn't conducive to listening properly.
Most people also tend to judge what the other person is saying and construct a counter-point while listening, which impairs the ability to listen and understand well.
On the other hand, a computer could be expected to be, and made to be, paying attention a lot better and in a predictable way, which is not possible with humans.
Given the other reply above about people's expectations of humans vs. computers, shouldn't we also consider the computer's strengths when making comparisons with humans?
That's mostly because people are thinking about other things. We understand that and anticipate it. If my computer doesn't understand me, it has no excuse as it can't distract itself. It isn't going to hear me next time by "concentrating harder" like a human can. It's going to keep failing.
I have a different experience - many people speak with a mumble or a mushmouth and no amount of concentration helps me disentangle it until I can get them to speak more clearly.
Sure, but if you repeat your utterance there's a good chance that the conditions will have changed the second time around; maybe background noise will have subsided or you'll have swallowed that bit you were chewing on and so on. It makes sense to ask you to repeat a couple of times even if it's a computer you're talking to.
I'm also not seeing anything close to 6% on any public implementations. The voice mail transcript emails I get are often so bad that it's impossible to discern even the gist of what the caller is talking about.
Does the word error rate account for context? For example, if I heard "clown" instead of "cloud", I would probably treat it as something I misheard in the context of the sentence and correct my understanding accordingly. I guess what I am trying to ask is: if the system reaches 4% (the human error rate for the same sample), can it really be considered on par?
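For what it's worth, the metric itself doesn't model context at all; word error rate is just a word-level edit distance against the reference transcript. A rough sketch:

    # Word error rate = (substitutions + deletions + insertions) / reference length,
    # computed with a standard edit-distance DP over word tokens.

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cloud looks heavy", "the clown looks heavy"))  # 0.25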
It measures against what humans write down when they listen. Humans generally use context, so I'd expect that they would record "cloud" if that was appropriate in context.
The Microsoft system also does this: it uses language modelling to attempt to model what word is more likely in a given context. This gives them an 18% word error rate reduction (see section 7 of the paper: http://arxiv.org/pdf/1609.03528v1.pdf).
All language models attempt to model what word is more likely in a given context :)
What is somewhat unusual about this approach is that they use recurrent neural networks for the language modelling, as opposed to more traditional approaches like a backoff n-gram model.
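A toy illustration of how that kind of rescoring works (the scores, the interpolation weight, and the stand-in language model are all invented; this is the general idea, not the paper's implementation):

    # Sketch: rescore the recognizer's n-best hypotheses by combining the acoustic
    # score with a language-model score, so contextually likely words win out.

    import math

    def toy_lm_logprob(text):
        # Stand-in for a backoff n-gram or RNN language model.
        return math.log(0.9) if "cloud looks heavy" in text else math.log(0.1)

    def rescore(n_best, lm_logprob, acoustic_weight=0.7):
        return max(n_best,
                   key=lambda h: acoustic_weight * h["acoustic_logprob"]
                                 + (1 - acoustic_weight) * lm_logprob(h["text"]))["text"]

    n_best = [
        {"text": "the clown looks heavy", "acoustic_logprob": -2.0},
        {"text": "the cloud looks heavy", "acoustic_logprob": -2.1},
    ]
    print(rescore(n_best, toy_lm_logprob))  # "the cloud looks heavy"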
That's 6% on the NIST dataset. Typically, results get much worse on real-world datasets, not least because trying to get good results on the same dataset year after year leads to subtle bias. Don't forget this is a dataset that's been around for 16 years, now.
> a series of 8kHz audio recordings (ie. crappy phone quality samples)
I worked in speech recognition for a bit, and 8kHz was the standard audio rate for recordings. No one saw this as an impediment, and it really isn't: 8kHz can capture frequencies up to 4kHz, and the fundamental frequencies of speech are MUCH lower than 4kHz.
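The arithmetic behind that (the fundamental-frequency ranges below are rough, commonly cited figures):

    # Nyquist: an 8 kHz sampling rate can represent frequencies up to 4 kHz.
    sample_rate_hz = 8000
    nyquist_hz = sample_rate_hz / 2               # 4000 Hz
    # Approximate fundamental frequencies of adult speech:
    typical_f0_hz = {"male": (85, 180), "female": (165, 255)}
    print(nyquist_hz, typical_f0_hz)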
That's very technically impressive, but 6% is still a long way from where it needs to be as anything like a primary interface. Even 1% is pretty high when you consider how many words you can utter in a few minutes, and how many errors that would generate.
Happens all the time with autonomous driving. I've heard so many people argue that self-driving cars need to be 100% perfect, but they always overlook that humans aren't anywhere near 100% perfect drivers.
Sorry, what is this based on? I for one don't like the idea of self-driving cars and that has nothing to do with wanting them to be "100% perfect", or forgetting that human drivers are not safe either.
On the one hand, there are very good ethical arguments, for example about who bears the moral responsibility when a self-driving car is involved in a fatal accident.
Further, there is a great risk that self-driving cars will become available long before they are advanced enough to be less of a risk than humans, exactly because people may implicitly trust an automated system to be less error-prone than a human, which is not currently the state of the art.
Conversations I've had where people have told me that self-driving cars will need to be 100% perfect before they should be used. Ironically, one of those people was an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.
Anyway, based on Google's extensive test results, I'm pretty sure self-driving cars are already advanced enough to be less of a risk than humans. Right now, the sensors seem to be the limiting factor.
Try looking for results where the Google car is driving off-road. There aren't any. That's because it can't drive off-road. It can't, because it needs its environment to be fully mapped. In other words: it's not about the sensors.
This should make, er, sense. Sensing your surroundings is only the first step in making complex decisions based on those surroundings. The AI field as a whole has not yet solved this, so there's no reason to expect that self-driving cars have.
Seen another way, if self-driving cars could really drive themselves at least as well as humans drive them, we wouldn't have compilations of videos on YouTube of robots falling over while trying to turn door knobs.
The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.
>> an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.
>The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.
Google's self-driving car accident statistics say otherwise.
>Honestly.
Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.
>> Google's self-driving car accident statistics say otherwise.
That's an experiment that's been running in a tiny part of one state in one country for a very limited time. I wouldn't count on those numbers, and in any case, see what I say above: the state of the art is not there yet for fully autonomous driving that's better than humans'.
>> Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.
>That's an experiment that's been running in a tiny part of one state in one country for a very limited time.
It's driven over 1 million miles. That's the equivalent of 75 years of driving for the average human. Plenty of data to draw a conclusion from. In all that time, it's been responsible for a single accident. That's way better than human drivers.
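The arithmetic behind the "75 years" figure, assuming the commonly cited ~13,500 miles per year for an average US driver (the exact number varies by source):

    miles_driven = 1_000_000
    avg_miles_per_year = 13_500   # assumption; varies by country and source
    print(miles_driven / avg_miles_per_year)  # ~74 years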
Miles driven are one dimension, and that is indeed important. However, there is also the geographical aspect that I point out, and that may be more important in practice. I mentioned off-road driving. There's also driving on busy roads. The Google car project has not driven for 1 million miles in a busy city like NY or SF or wherever, nor in heavy traffic conditions.
Then there's the fact that human drivers have to drive in all sorts of weather conditions with all sorts of different vehicles and so on. The Google car, not so much.
But my point is very simple: AI in general is nowhere near producing autonomous robots, yet. Why would Google car (or a similar project) be the exception? What makes cars and driving so different that autonomy is easier to attain?
I agree with your general sentiment, but think those error rates hide a lot as well. Human error might be at x% overall, but when you eliminate malfunctioning humans, broadly defined, it's probably much lower than x%.
The recent death of the Tesla owner, for example, as far as I know, was due to the vehicle accelerating into a semi. This is something that most people would not do even in their worst driving state unless they were intoxicated or seriously mentally impaired. I don't want AI driving errors to be compared to human benchmarks that include people who are seriously intoxicated.
A lot of speech frustration problems, similarly, are not only about poor recognition in general, or lack of appropriate prompting to increase classification certainty, but recognition failures in situations where a human would not have any trouble at all, such as in recognizing names of loved ones, or things that would be clear in context to a human. I.e., maybe humans listening to speech corpora would have x% error rate, but that's strangers listening to the corpora. The real question is, if I listen to a recording of my spouse or coworker having a conversation what's the error rate there?
So, although humans are far from perfect, which is something that's often forgotten, the true AI target is also probably not "humans broadly defined" but rather "functional humans" or something like that. AI research often sets the bar misleadingly low because it's so hard to reach as it is.
The types of mistakes a human and a car are prone to making are different. Neither one has to be a superset of the other. For example, cars are probably better at going round corners at a safe speed while humans can easily misjudge and end up skidding. You could make the opposite argument and say only the most malfunctioning self driving car would choose the wrong speed for a corner yet humans make that error all the time, so humans are even worse than the worst self-driving cars.
Another example. If a self driving car is hit by another car that's running a red light while speeding, we might be more forgiving and say "well nobody could have avoided that accident" but actually we'd be being too soft on the self driving car since it has access to more data and faster reaction times and should probably be expected to avoid that type of crash even when a human can't.
Sorry, but did you get the Tesla story right? The Tesla driver was not paying attention, and the car drove at constant speed into an obstacle of the sort that it is known to not be able to see. I know people are posting all kinds of things on the Internet and HN about this accident, but that doesn't make it true. If an actual self-driving car did the same thing, you'd have a great example. But not this one.
On the flip side, a large percentage of human drivers are unlikely to suddenly run into the nearest stationary object because of a botched software update.
True, but they often run into the nearest stationary object because they're texting, or eating, or putting on makeup (an ex-gf of mine caused TWO accidents that way), or arguing with a passenger, or speeding, or driving tired after a long day of work, or driving after having a few beers but I'm totally fine I swear I'm cool to drive home, etc...
You're absolutely right that there are risks. But honestly, I suspect drunk drivers alone cause more fatal accidents than autonomous cars ever could.
That's the point though... people who want 100% perfect automation have no leg to stand on when presented with a more reliable option than a human. That's not the case with speech recognition yet, and I'd say as a result people would be well justified in demanding at least as much capability as they're capable of themselves.
Really? I expect there to be huge resistance to driverless cars (at first) as people come up with crazy scenarios in which they're certain that their own elite driving skills could save them but the car could not. Then there'll be a reversal as people start to actually accept that the car is a better driver than they are, and surprisingly quickly they'll start saying anyone who wants to drive manually is a dangerous egotist.
Maybe, but the "upside" of humans being such terrible drivers is that we've almost all either gotten into a bad accident or know someone close who has. The fact that it won't be hard to empirically demonstrate the difference between machine reliability and human reliability in this case may help as well.
Unfortunately... I think the big issue is going to be pure anxiety; a bigger and more immediate form of what a lot of people experience on an airplane. Giving up even the illusion of control is supremely hard for us as a species, in general. Then there's just the fact that as a species we're terrible at risk assessment.
Does using speech multimodally with other input methods (touch, gesture, pen, clicker, game controller, even keyboard - may seem silly but speech is potentially faster at some tasks) still count as "primary" if the other input method is used supplementarily to help disambiguate?
I'd say, and let me be clear that I know and accept this to be a relatively arbitrary distinction on my part, that for an input to be "primary" it needs to be able to stand alone (touchscreen input, keyboard and mouse, etc.). That's not to say that it must stand alone, but that when all other options are off the table, that would be your preferred method for text/data entry.
That's a troublesome definition, though, at least in part for reasons you've brought up or alluded to: a multimodal approach is pretty clearly going to dominate. That said, speech recognition at least stands to replace the keyboard for, say, the author of a book or article, if it's good enough.
Recognizing the words is just the first step. Getting meaning out of those words is what really counts. When on a telephone call I may miss a percentage of what was said, due to poor audio quality, the speaker's accent, etc., but based on context I can still understand the message the speaker is trying to convey most of the time. Transcribing each word with near-perfect accuracy is unnecessary if the layers above that can handle it.
They don't need to actually understand - whatever that means - they can apply statistical models based on phonetic distances and large corpora of dialogue.
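A crude sketch of that kind of statistical patching (toy vocabulary, toy scores, and plain string similarity standing in for a real phonetic distance; this is not how any particular product does it):

    # Sketch: swap a suspicious word for a phonetically close alternative that a
    # language model finds more plausible in context.

    import difflib

    CANDIDATES = ["cloud", "clown", "crowd", "aloud"]  # toy vocabulary

    def phonetic_proximity(a, b):
        # Stand-in for a real phonetic distance (e.g. over phoneme strings).
        return difflib.SequenceMatcher(None, a, b).ratio()

    def context_score(sentence):
        # Stand-in for a language model's estimate of how plausible the sentence is.
        return 0.9 if "cloud looks" in sentence else 0.1

    def correct(words, suspect_index):
        suspect = words[suspect_index]
        def score(candidate):
            trial = list(words)
            trial[suspect_index] = candidate
            return phonetic_proximity(suspect, candidate) * context_score(" ".join(trial))
        return max(CANDIDATES + [suspect], key=score)

    print(correct(["the", "clown", "looks", "heavy"], 1))  # -> "cloud"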
Google voice search will make several guesses as to what I said. It will sometimes enter a nonsensical but similar-sounding phrase to what I intended into the text box, but will then figure out that its first guess is nonsense and return the correct search result.
So just based on observation I would say it works pretty well. On mobile I almost always use voice search, and unless I'm searching for an Italian name or something weird it almost always hears me correctly. Even in a noisy pub.
[1] http://www.itl.nist.gov/iad/mig/tests/ctr/2000/h5_2000_v1.3....