The difference is that people are OK with a human asking for clarification, but systems like Siri need to have a near-zero error rate before people will consider them good (a user who has to repeat themselves even once every 20 times will consider it bad, or at least not good enough).
I'm not sure people expect super-human performance out of Siri. An important difference is that a human who doesn't understand will say so, and ask you to repeat the relevant part (or to choose between two alternatives), conversationally; or they will pick an interpretation which is not the intended one but was an understandable misunderstanding.
Contrast this with speech recognition, which will often substitute words that are nonsensical in context, making it look silly from a human perspective...
I think another important difference is that humans won't get stuck in a loop asking you for clarification the same way several times; after 2 or 3 times they'll typically change behavior. E.g. they'll ask you to spell the word, or repeat the not-understood word back with a questioning tone to signal that they don't understand what it means.
This could be implemented though. Based on the part of the sentence that is understood, figure out the most likely words for the missing part and ask a specific question about it to fill the gap.
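Something along these lines, as a rough sketch; the per-word confidences and the list of alternatives are assumed inputs here, not any real speech API:

    LOW_CONFIDENCE = 0.6  # arbitrary threshold for this sketch

    def clarifying_question(words, confidences, alternatives):
        """Ask about the weakest word specifically instead of
        making the user repeat the whole sentence."""
        worst = min(range(len(words)), key=lambda i: confidences[i])
        if confidences[worst] >= LOW_CONFIDENCE:
            return None  # confident enough, just act on it
        options = alternatives.get(worst, [])
        if len(options) >= 2:
            return f'Did you say "{options[0]}" or "{options[1]}"?'
        if worst > 0:
            return f'Sorry, I missed the word after "{words[worst - 1]}". What was it?'
        return "Sorry, I missed the first word. What was it?"

    # "set a timer for ??? minutes" where the number came through badly
    print(clarifying_question(
        ["set", "a", "timer", "for", "fifteen", "minutes"],
        [0.9, 0.9, 0.95, 0.9, 0.35, 0.9],
        {4: ["fifteen", "fifty"]},
    ))
    # -> Did you say "fifteen" or "fifty"?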
See, it's not about hard-coding such behavior. I would say that it reaches a human level of understanding if it automatically learns these ways of solving the problem. Asking relevant questions can be hard-coded, but that doesn't equal "understanding" the problem.
I think the Chinese Room thought experiment overlooks this part of "understanding".
Exactly: when SR has a low confidence level it needs to ask you to repeat yourself, not just choose the highest-confidence match and hope for the best.
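Roughly what I mean, as a toy sketch; the ranked (text, confidence) hypotheses and the 0.75 threshold are made up for illustration:

    def act_on(hypotheses, threshold=0.75):
        """hypotheses: list of (text, confidence), best first."""
        best_text, best_score = hypotheses[0]
        if best_score < threshold:
            return ("ask", "Sorry, could you say that again?")
        return ("execute", best_text)

    print(act_on([("call betty", 0.52), ("call benny", 0.48)]))
    # -> ('ask', 'Sorry, could you say that again?') instead of silently calling Betty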
That's a good start, but probably the wrong interface for it; it's "non-native" in the context: a command initiated by voice should present the options by voice.
It's a valid HCI solution to a technical failure mode. Once the software advances to the point where the AI is truly conversational, that will be a watershed moment.
The important thing here, IMO, is going to be how the system asks for clarification. Hearing the same canned "I'm sorry, I didn't quite get that, can you repeat?" phrase 20 times in a row is annoying. Having the computer say "I'm sorry, what was that last word?" or "I didn't quite catch that, did you want me to call Benny or Betty?" would be far more acceptable.
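A quick sketch of that kind of escalation; the prompts and the whole setup are invented for illustration, this isn't how any shipping assistant works:

    # Escalate instead of replaying the same canned prompt every time.
    PROMPTS = [
        "I'm sorry, what was that last word?",
        "Did you want me to call {option_a} or {option_b}?",
        "Could you spell that name for me?",
    ]

    def next_prompt(attempt, option_a=None, option_b=None):
        prompt = PROMPTS[min(attempt, len(PROMPTS) - 1)]
        if "{option_a}" in prompt and option_a and option_b:
            return prompt.format(option_a=option_a, option_b=option_b)
        if "{option_a}" in prompt:
            # No candidates to offer, skip to the next strategy.
            return PROMPTS[-1]
        return prompt

    for attempt in range(3):
        print(next_prompt(attempt, "Benny", "Betty"))
    # -> asks for the last word, then offers Benny/Betty, then asks for a spelling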
As someone else mentioned, how it makes sense of the words is much more important than a zero error rate.
The understanding rate is less than 10%: if you don't match a keyword, it gives you a useless web search.
Personally, I don't think understanding rate is the whole issue so much as the reaction to error (which is partly understanding). You can't say "no, that's not what I said", and Siri et al. never keep enough context to say "huh? What did you say?" or "I didn't get that last part, can you repeat it?"
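A minimal sketch of keeping just enough context to recover from a correction; everything here (the class, the confidence numbers) is invented for illustration:

    class DialogueState:
        def __init__(self):
            self.last_words = []        # what we think the user said
            self.last_uncertain = None  # index of the word we were least sure about

        def hear(self, words, confidences):
            self.last_words = words
            self.last_uncertain = min(range(len(words)), key=lambda i: confidences[i])

        def handle_correction(self):
            # Instead of starting over, re-ask about the part we were least sure of.
            if self.last_uncertain is None:
                return "Sorry, could you say that again?"
            heard = self.last_words[self.last_uncertain]
            return f'Sorry, I heard "{heard}". What should that have been?'

    state = DialogueState()
    state.hear(["text", "ben", "i'm", "running", "late"],
               [0.9, 0.4, 0.9, 0.95, 0.9])
    print(state.handle_correction())  # -> Sorry, I heard "ben". What should that have been?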
It's that errors in understanding or accuracy turn the whole thing into a complete shitshow.
One failure and you might as well pull over and type what you want.
Remember this is with low-quality sound; the rate could be much higher under better conditions. Amazon's Echo relies on good hardware as much as software, with an array of good mics.
One big problem with Siri is that it has zero sense of humor. That is, imho, what makes talking to it so tiring. It's like talking to a boring civil servant.