For those not familiar with the NIST 2000 Switchboard evaluation[1]: it is a series of 8kHz audio recordings (i.e. crappy phone-quality samples) of conversations, including things like "uh-huh" and other filler words. So, 6% seems pretty good.
The difference is that people are OK with a human asking for clarification, but systems like Siri need to have a near-zero error rate before people will consider them good (a person who has to repeat themselves once every 20 times will consider it bad, or at least not good enough).
I'm not sure people expect super-human performance out of Siri. An important difference is that a human who doesn't understand will say so, and ask you to repeat the relevant part (or to choose between two alternatives), conversationally; or they will pick an interpretation which is not the intended one but was an understandable misunderstanding.
Contrast this with speech recognition, which will often substitute words that are nonsensical in context, making it look silly from a human perspective...
I think another important difference is that humans won't get stuck in a loop asking you for clarification the same way several times; after 2 or 3 attempts they'll typically change behavior, e.g. they'll ask you to spell the word, or repeat the word they didn't catch with a questioning tone to signal that they don't understand it.
This could be implemented, though. Based on the part of the sentence that is understood, figure out the most likely words for the missing part and ask a specific question about it to fill the gap.
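A minimal sketch of that idea (the function, threshold, and prompt wording are all made up for illustration; real recognizers expose confidences and n-best lists in their own ways):

    # Sketch: turn a low-confidence word into a specific clarifying question,
    # using the recognizer's per-word confidences and n-best alternatives.
    # The 0.6 threshold and the question wording are arbitrary choices.

    def clarify(words, confidences, alternatives, threshold=0.6):
        for i, (word, conf) in enumerate(zip(words, confidences)):
            if conf >= threshold:
                continue
            options = [w for w in alternatives[i] if w != word]
            if options:
                # Offer the top competing hypothesis instead of guessing silently.
                return f"Did you say '{word}' or '{options[0]}'?"
            if i > 0:
                return f"Sorry, what was the word after '{words[i - 1]}'?"
            return "Sorry, could you repeat that?"
        return None  # confident enough to act without asking

    # Example: "call Benny" vs. "call Betty" with a shaky last word.
    print(clarify(["call", "Benny"], [0.95, 0.45], [[], ["Betty", "Penny"]]))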
See, it's not about hard-coding such behavior. I would say that it reaches a human level of understanding if it automatically learns these ways of solving the problem. Asking relevant questions can be hard-coded, but that doesn't equal "understanding" the problem.
I think the Chinese Room thought experiment overlooks this part of "understanding".
Exactly, when SR has a low confidence level it needs to ask you to repeat yourself, not just choose the highest-confidence match and hope for the best.
That's a good start, but probably the wrong interface for it; it feels "non-native" in that context. A command initiated by voice should present the options by voice.
It's a valid HCI solution to a technical failure mode. Once the software has advanced to the point where the AI is truly conversational, that will be a watershed moment.
The important thing here, IMO, is going to be how the system asks for clarification. Hearing the same canned "I'm sorry, I didn't quite get that, can you repeat?" phrase 20 times in a row is annoying. Having the computer say "I'm sorry, what was that last word?" or "I didn't quite catch that, did you want me to call Benny or Betty?" would be far more acceptable.
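Something like the sketch below would already feel less robotic; the prompts and retry cutoffs are invented for illustration, not taken from any shipping assistant:

    # Sketch: vary the clarification strategy instead of looping on one canned prompt.
    def next_prompt(failed_attempts, ambiguous_pair=None):
        if ambiguous_pair is not None:
            a, b = ambiguous_pair
            return f"I didn't quite catch that, did you want me to call {a} or {b}?"
        if failed_attempts == 0:
            return "I'm sorry, what was that last word?"
        if failed_attempts == 1:
            return "Sorry, I'm still not getting it. Could you say it a bit more slowly?"
        # After repeated failures, change tactics rather than repeating the same line.
        return "Could you spell that word for me?"

    for n in range(3):
        print(next_prompt(n))
    print(next_prompt(0, ("Benny", "Betty")))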
Like someone else mentioned, how it makes sense out of words is much more important than a zero error rate.
Understanding rate is less than 10%. If you don't match a keyword it gives a useless web search.
Personally I don't think understanding rate is the whole issue so much as reaction to error (which is partly understanding). You can't say "no, that's not what I said", and Siri et al. never keep enough context to say "huh? What did you say?" or "I didn't get that last part, can you repeat it?"
It's that errors in understanding or accuracy turn the whole thing into a complete shitshow.
One failure and you might as well pull over and type what you want.
Remember this is with low-quality sound. It could be much higher under better conditions. Amazon's Echo relies on good hardware as much as software, with an array of good mics.
One big problem with Siri is that it has zero sense of humor. That is, imho, what makes people feel tired talking to it. It's like talking to a boring civil servant.
>> Judging by my everyday interactions, a 6% error rate is lower than human error rates in casual conversation.
It's better to avoid throwing around numbers like that, but even if that were the case, you have to remember that humans understand speech. The speech recognition task performed by AI systems, on the other hand, is more akin to transcription: the system takes in sound as input and produces text as output. Any sort of "understanding" a) is extremely difficult to do well and b) must be performed by a different component of the system (a different algorithm, trained on different data).
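In pipeline terms, the split looks roughly like this (a toy sketch; both stages are stand-ins, and real systems blur the boundary):

    # Sketch of the two-component split described above: a recognizer that maps
    # audio to text, and a separate understanding layer (a different model,
    # trained on different data) that maps text to an action.

    def transcribe(audio_samples):
        # Stand-in for the acoustic + language model pipeline.
        return "call betty tomorrow"

    def interpret(text):
        # Stand-in for the natural-language-understanding component.
        tokens = text.split()
        if tokens and tokens[0] == "call":
            return {"intent": "call", "contact": tokens[1]}
        return {"intent": "web_search", "query": text}

    print(interpret(transcribe([0.0] * 8000)))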
> People regularly ask each other, "sorry, what did you say?", "wait, what did she say?", "would you repeat that please?", "huh?", etc.
For humans, isn't this due to a combination of factors rather than comprehension alone? Humans who ask "sorry, what did you say?" or "would you repeat that please?" or even just "huh?" usually aren't paying attention at all. For most people it's not a comprehension, sound quality, or surrounding-noise problem, except in situations where the person isn't fluent in the particular language, dialect, or accent, or where the surrounding noise relative to the person's hearing ability isn't conducive to listening properly.
Most people also tend to judge what the other person is saying and construct a counter-point while listening, which impairs the ability to listen and understand well.
On the other hand, a computer could be expected to be, and made to be, paying attention a lot better and in a predictable way, which is not possible with humans.
Given the other reply above about people's expectations of humans vs. computers, shouldn't we also consider the computer's strengths when making comparisons with humans?
That's mostly because people are thinking about other things. We understand that and anticipate it. If my computer doesn't understand me, it has no excuse as it can't distract itself. It isn't going to hear me next time by "concentrating harder" like a human can. It's going to keep failing.
I have a different experience - many people speak with a mumble or a mushmouth and no amount of concentration helps me disentangle it until I can get them to speak more clearly.
Sure, but if you repeat your utterance there's a good chance that the conditions will have changed the second time around; maybe background noise will have subsided or you'll have swallowed that bit you were chewing on and so on. It makes sense to ask you to repeat a couple of times even if it's a computer you're talking to.
I'm also not seeing anything close to 6% on any public implementations. The voice mail transcript emails I get are often so bad that it's impossible to discern even the gist of what the caller is talking about.
Does the word error rate account for context? For example, if I heard "clown" instead of "cloud", I would probably treat it as something I misheard in the context of the sentence and correct my understanding accordingly. I guess what I am trying to ask is: if the system reaches 4% (the human error rate for the same sample), can it really be considered on par?
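For what it's worth, the metric itself doesn't model context at all; word error rate is just a word-level edit distance against the reference transcript. A rough sketch:

    # Word error rate = (substitutions + deletions + insertions) / reference length,
    # computed with a standard edit-distance DP over word tokens.

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cloud looks heavy", "the clown looks heavy"))  # 0.25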
It measures against what humans write down when they listen. Humans generally use context, so I'd expect that they would record "cloud" if that was appropriate in context.
The Microsoft system also does this: it uses language modelling to attempt to model what word is more likely in a given context. This gives them an 18% word error rate reduction (see section 7 of the paper: http://arxiv.org/pdf/1609.03528v1.pdf).
All language models attempt to model what word is more likely in a given context :)
What is somewhat unusual about this approach is that they use recurrent neural networks for the language modelling, as opposed to more traditional approaches like a backoff n-gram model.
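A toy illustration of how that kind of rescoring works (the scores, the interpolation weight, and the stand-in language model are all invented; this is the general idea, not the paper's implementation):

    # Sketch: rescore the recognizer's n-best hypotheses by combining the acoustic
    # score with a language-model score, so contextually likely words win out.

    import math

    def toy_lm_logprob(text):
        # Stand-in for a backoff n-gram or RNN language model.
        return math.log(0.9) if "cloud looks heavy" in text else math.log(0.1)

    def rescore(n_best, lm_logprob, acoustic_weight=0.7):
        return max(n_best,
                   key=lambda h: acoustic_weight * h["acoustic_logprob"]
                                 + (1 - acoustic_weight) * lm_logprob(h["text"]))["text"]

    n_best = [
        {"text": "the clown looks heavy", "acoustic_logprob": -2.0},
        {"text": "the cloud looks heavy", "acoustic_logprob": -2.1},
    ]
    print(rescore(n_best, toy_lm_logprob))  # "the cloud looks heavy"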
That's 6% on the NIST dataset. Typically, results get much worse on real-world datasets, not least because trying to get good results on the same dataset year after year leads to subtle bias. Don't forget this is a dataset that's been around for 16 years, now.
> a series of 8kHz audio recordings (ie. crappy phone quality samples)
I worked in speech recognition for a bit, and 8kHz was the standard audio rate for recordings. No one saw this as an impediment, and it really isn't: 8kHz can capture frequencies up to 4kHz, and the fundamental frequencies of speech are MUCH lower than 4kHz.
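The arithmetic behind that (the fundamental-frequency ranges below are rough, commonly cited figures):

    # Nyquist: an 8 kHz sampling rate can represent frequencies up to 4 kHz.
    sample_rate_hz = 8000
    nyquist_hz = sample_rate_hz / 2               # 4000 Hz
    # Approximate fundamental frequencies of adult speech:
    typical_f0_hz = {"male": (85, 180), "female": (165, 255)}
    print(nyquist_hz, typical_f0_hz)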
That's very technically impressive, but 6% is still a long way from where it needs to be as anything like a primary interface. Even 1% is pretty high when you consider how many words you can utter in a few minutes, and how many errors that would generate.
Happens all the time with autonomous driving. I've heard so many people argue that self-driving cars need to be 100% perfect, but they always overlook that humans aren't anywhere near 100% perfect drivers.
Sorry, what is this based on? I for one don't like the idea of self-driving cars and that has nothing to do with wanting them to be "100% perfect", or forgetting that human drivers are not safe either.
On the one hand, there are very good ethical arguments, for example about who bears the moral responsibility when a self-driving car is involved in a fatal accident.
Further, there is a great risk that self-driving cars will become available long before they are advanced enough to be less of a risk than humans, exactly because people may implicitly trust an automated system to be less error-prone than a human, which is not currently the state of the art.
Conversations I've had where people have told me that self-driving cars will need to be 100% perfect before they should be used. Ironically, one of those people was an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.
Anyway, based on Google's extensive test results, I'm pretty sure self-driving cars are already advanced enough to be less of a risk than humans. Right now, the sensors seem to be the limiting factor.
Try looking for results where the Google car is driving off-road. There aren't any. That's because it can't drive off-road. It can't, because it needs its environment to be fully mapped. In other words: it's not about the sensors.
This should make, er, sense. Sensing your surroundings is only the first step in making complex decisions based on those surroundings. The AI field as a whole has not yet solved this, so there's no reason to expect that self-driving cars have.
Seen another way, if self-driving cars could really drive themselves at least as well as humans drive them, we wouldn't have compilations of videos on YouTube of robots falling over while trying to turn door knobs.
The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.
>> an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.
>The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.
Google's self-driving car accident statistics say otherwise.
>Honestly.
Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.
>> Google's self-driving car accident statistics say otherwise.
That's an experiment that's been running in a tiny part of one state in one country for a very limited time. I wouldn't count on those numbers, and in any case, see what I say above: the state of the art is not there yet for fully autonomous driving that's better than humans'.
>> Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.
>That's an experiment that's been running in a tiny part of one state in one country for a very limited time.
It's driven over 1 million miles. That's the equivalent of 75 years of driving for the average human. Plenty of data to draw a conclusion from. In all that time, it's been responsible for a single accident. That's way better than human drivers.
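The arithmetic behind the "75 years" figure, assuming the commonly cited ~13,500 miles per year for an average US driver (the exact number varies by source):

    miles_driven = 1_000_000
    avg_miles_per_year = 13_500   # assumption; varies by country and source
    print(miles_driven / avg_miles_per_year)  # ~74 years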
Miles driven are one dimension, and that is indeed important. However, there is also the geographical aspect that I point out, and that may be more important in practice. I mentioned off-road driving. There's also driving on busy roads. The Google car project has not driven for 1 million miles in a busy city like NY or SF or wherever, nor in heavy traffic conditions.
Then there's the fact that human drivers have to drive in all sorts of weather conditions with all sorts of different vehicles and so on. The Google car, not so much.
But my point is very simple: AI in general is nowhere near producing autonomous robots, yet. Why would Google car (or a similar project) be the exception? What makes cars and driving so different that autonomy is easier to attain?
I agree with your general sentiment, but think those error rates hide a lot as well. Human error might be at x% overall, but when you eliminate malfunctioning humans, broadly defined, it's probably much lower than x%.
The recent death of the Tesla owner, for example, as far as I know, was due to the vehicle accelerating into a semi. This is something that most people would not do even in their worst driving state unless they were intoxicated or seriously mentally impaired. I don't want AI driving errors to be compared to human benchmarks that include people who are seriously intoxicated.
A lot of speech frustration problems, similarly, are not only about poor recognition in general, or lack of appropriate prompting to increase classification certainty, but recognition failures in situations where a human would not have any trouble at all, such as in recognizing names of loved ones, or things that would be clear in context to a human. I.e., maybe humans listening to speech corpora would have x% error rate, but that's strangers listening to the corpora. The real question is, if I listen to a recording of my spouse or coworker having a conversation what's the error rate there?
So, although humans are far from perfect, which is something that's often forgotten, the true AI target is also probably not "humans broadly defined" but rather "functional humans" or something like that. AI research often sets the bar misleadingly low because it's so hard to reach as it is.
The types of mistakes a human and a car are prone to making are different. Neither one has to be a superset of the other. For example, cars are probably better at going round corners at a safe speed while humans can easily misjudge and end up skidding. You could make the opposite argument and say only the most malfunctioning self driving car would choose the wrong speed for a corner yet humans make that error all the time, so humans are even worse than the worst self-driving cars.
Another example. If a self driving car is hit by another car that's running a red light while speeding, we might be more forgiving and say "well nobody could have avoided that accident" but actually we'd be being too soft on the self driving car since it has access to more data and faster reaction times and should probably be expected to avoid that type of crash even when a human can't.
Sorry, but did you get the Tesla story right? The Tesla driver was not paying attention, and the car drove at constant speed into an obstacle of the sort that it is known to not be able to see. I know people are posting all kinds of things on the Internet and HN about this accident, but that doesn't make it true. If an actual self-driving car did the same thing, you'd have a great example. But not this one.
On the flip side, a large percentage of human drivers are unlikely to suddenly run into the nearest stationary object because of a botched software update.
True, but they often run into the nearest stationary object because they're texting, or eating, or putting on makeup (an ex-gf of mine caused TWO accidents that way), or arguing with a passenger, or speeding, or driving tired after a long day of work, or driving after having a few beers but I'm totally fine I swear I'm cool to drive home, etc...
You're absolutely right that there are risks. But honestly, I suspect drunk drivers alone cause more fatal accidents than autonomous cars ever could.
That's the point though... people who want 100% perfect automation have no leg to stand on when presented with a more reliable option than a human. That's not the case with speech recognition yet, and I'd say as a result people would be well justified in demanding at least as much capability as they're capable of themselves.
Really? I expect there to be huge resistance to driverless cars (at first) as people come up with crazy scenarios in which they're certain that their own elite driving skills could save them but the car could not. Then there'll be a reversal as people start to actually accept that the car is a better driver than they are, and surprisingly quickly they'll start saying anyone who wants to drive manually is a dangerous egotist.
Maybe, but the "upside" of humans being such terrible drivers is that we've almost all either gotten into a bad accident or know someone close who has. The fact that it won't be hard to empirically demonstrate the difference between machine reliability and human reliability in this case may help as well.
Unfortunately... I think the big issue is going to be pure anxiety; a bigger and more immediate form of what a lot of people experience on an airplane. Giving up even the illusion of control is supremely hard for us as a species, in general. Then there's just the fact that as a species we're terrible at risk assessment.
Does using speech multimodally with other input methods (touch, gesture, pen, clicker, game controller, even keyboard - may seem silly but speech is potentially faster at some tasks) still count as "primary" if the other input method is used supplementarily to help disambiguate?
I'd say, and let me be clear that I know and accept this to be a relatively arbitrary distinction on my part, that for an input to be "primary" it needs to be able to stand alone (touchscreen input, keyboard and mouse, etc.). That's not to say that it must stand alone, but that when all other options are off the table, that would be your preferred method for text/data entry.
That's a troublesome definition, though, at least in part for reasons you've brought up or alluded to: a multimodal approach is pretty clearly going to dominate. That said, speech recognition at least stands to replace the keyboard for, say, the author of a book or article, if it's good enough.
Recognizing the words is just the first step. Getting meaning out of those words is what really counts. When on a telephone call I may miss a percentage of what was said, due to poor audio quality, the speaker's accent, etc., but based on context I can still understand the message the speaker is trying to convey most of the time. Transcribing each word with near-perfect accuracy is unnecessary if the layers above that can handle it.
They don't need to actually understand - whatever that means - they can apply statistical models based on phonetic distances and large corpora of dialogue.
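A crude sketch of that kind of statistical patching (toy vocabulary, toy scores, and plain string similarity standing in for a real phonetic distance; this is not how any particular product does it):

    # Sketch: swap a suspicious word for a phonetically close alternative that a
    # language model finds more plausible in context.

    import difflib

    CANDIDATES = ["cloud", "clown", "crowd", "aloud"]  # toy vocabulary

    def phonetic_proximity(a, b):
        # Stand-in for a real phonetic distance (e.g. over phoneme strings).
        return difflib.SequenceMatcher(None, a, b).ratio()

    def context_score(sentence):
        # Stand-in for a language model's estimate of how plausible the sentence is.
        return 0.9 if "cloud looks" in sentence else 0.1

    def correct(words, suspect_index):
        suspect = words[suspect_index]
        def score(candidate):
            trial = list(words)
            trial[suspect_index] = candidate
            return phonetic_proximity(suspect, candidate) * context_score(" ".join(trial))
        return max(CANDIDATES + [suspect], key=score)

    print(correct(["the", "clown", "looks", "heavy"], 1))  # -> "cloud"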
Google voice search will make several guesses as to what I said. It will sometimes enter a nonsensical but similar-sounding phrase to what I intended into the text box, but will then figure out that its first guess is nonsense and return the correct search result.
So just based on observation I would say it works pretty well. On mobile I almost always use voice search, and unless I'm searching for an Italian name or something weird it almost always hears me correctly. Even in a noisy pub.
[1] http://www.itl.nist.gov/iad/mig/tests/ctr/2000/h5_2000_v1.3....