Researchers achieve speech recognition milestone (microsoft.com)
210 points by gzweig on Sept 14, 2016 | 135 comments



For those not familiar with the NIST 2000 Switchboard evaluation[1]: it is a series of 8kHz audio recordings (i.e. crappy phone-quality samples) of conversations, including things like "uh-huh" and other pause words. So, 6% seems pretty good.

[1] http://www.itl.nist.gov/iad/mig/tests/ctr/2000/h5_2000_v1.3....


Judging by my everyday interactions, a 6% error rate is lower than human error rates in casual conversation.

People regularly ask each other, "sorry, what did you say?", "wait, what did she say?", "would you repeat that please?", "huh?", etc.


The difference is that people are OK with a human asking for clarification, but systems like Siri need to have a near-zero error rate before people will consider them good (a person who has to repeat themselves once every 20 times will consider it bad, or at least not good enough).


I'm not sure people expect super-human performance out of Siri. An important difference is that a human who doesn't understand will say so, and ask to repeat the relevant part (or to choose between two alternatives), conversationally; or they will pick an interpretation which is not the intended one but was an understandable misunderstanding.

Contrast this with speech recognition, which will often substitute words that are nonsensical in context, making it look silly from a human perspective...


I think another important difference is that humans won't get stuck in a loop asking you for clarification the same way several times; after 2 or 3 times they'll typically change behavior. E.g., they'll ask you to spell the word, or repeat the not-understood word back with a questioning tone to signal that they don't understand what that word means.


This could be implemented, though. Based on the part of the sentence that is understood, figure out the most likely words for the missing part and ask a specific question about it to fill the gap.
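A rough sketch of that idea in Python (the candidate list, scores, thresholds, and question template below are all hypothetical, invented for illustration, not any assistant's actual logic):

    # Toy sketch of "fill the gap, then ask a targeted question".
    def clarify(prefix, suffix, candidates):
        """Given the understood parts of an utterance and scored guesses for
        the missing word, either accept the top guess or ask a specific
        question instead of a generic "please repeat"."""
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        best, second = ranked[0], ranked[1]
        if best[1] > 0.9:                      # confident: just proceed
            return f"{prefix} {best[0]} {suffix}".strip()
        if best[1] - second[1] < 0.2:          # two close guesses: disambiguate
            return f'Did you mean "{best[0]}" or "{second[0]}"?'
        return f'Sorry, what was the word after "{prefix}"?'

    # "Call ___ tomorrow", with made-up acoustic scores for the unclear word.
    print(clarify("call", "tomorrow", {"Benny": 0.48, "Betty": 0.44, "Bennie": 0.08}))
    # -> Did you mean "Benny" or "Betty"?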


See, it's not about hard-coding such behavior. I would say that it reaches a human level of understanding if it automatically learns these ways of solving the problem. Asking relevant questions can be hard-coded, but that doesn't equal "understanding" the problem.

I think the Chinese Room thought experiment overlooks this part of "understanding".


Exactly. When SR has a low confidence level it needs to ask you to repeat yourself, not just choose the highest-confidence match and hope for the best.


Siri underlines words it's not sure about. Then if you click it, it gives you a menu of other potential candidates.

Seems like a good approach.


That's a good start, but probably the wrong interface for it: it's "non-native" in the context. A command initiated by voice should present the options by voice.


It's a valid HCI solution to a technical failure mode. Once the software has advanced to the point where the AI is truly conversational, that will be a watershed moment.


That's fine for dictation but of little use when driving or in other eyes-free scenarios.


Also. When. People. Talk. To. Siri. They. Speak. Very. Distinctly. With. Clear. Separation. Between. Words.

Or that is my observation, anyway. I don't use it myself.


I bet Siri's great at understanding what William Shatner says.


The important thing here, IMO, is going to be how the system asks for clarification. Hearing the same canned "I'm sorry, I didn't quite get that, can you repeat?" phrase 20 times in a row is annoying. Having the computer say "I'm sorry, what was that last word?" or "I didn't quite catch that, did you want me to call Benny or Betty?" would be far more acceptable.


Like someone else mentioned, how it makes sense out of words is much more important than a zero error rate.

Understanding rate is less than 10%. If you don't match a keyword it gives a useless web search.

Personally I don't think understanding rate is the whole issue as much as reaction to error (which is partly understanding). You can't say "no, that's not what I said", and Siri et al. never keep enough context to say "huh? What did you say?" or "I didn't get that last part, can you repeat it?"

It's that errors in understanding or accuracy turn the whole thing into a complete shitshow.

One failure and you might as well pull over and type what you want.


Remember, this is with low-quality sound. Accuracy could be much higher under better conditions. Amazon's Echo relies on good hardware as much as software, with an array of good mics.


From talking to a few people who do SR, it's also considerably easier when you know the hardware.

They can cancel out reverb and create very fine tuned waveform profiles for speech.

I think one of the reasons that Siri is slightly better at SR than google is because of the control that Apple has over the hardware.

While Cortana turns sourpuss on me every time I switch headsets.


No, the err rate is not a big deal. What is a big deal is making sense of the words it actually can hear.


One big problem with Siri is that it has zero sense of humor. That is, imho, what makes people feel tired talking to it. It's like talking to a boring civil servant.


>> Judging by my everyday interactions, a 6% error rate is lower than human error rates in casual conversation.

It's better to avoid throwing around numbers like that, but even if that were the case you have to remember that humans understand speech. The speech recognition task performed by AI systems, on the other hand, is more akin to transcription: the system takes in sound as input and produces text as output. Any sort of "understanding" a) is extremely difficult to do well and b) must be performed by a different component of the system (a different algorithm, trained on different data).


The difference is that people understand what you say, don't just map your speech into words.


> People regularly ask each other, "sorry, what did you say?", "wait, what did she say?", "would you repeat that please?", "huh?", etc.

For humans, isn't this due to a combination of factors rather than comprehension alone? Humans who ask "sorry, what did you say?" or "would you repeat that please?" or even just "huh?" usually aren't paying attention at all. It's not a comprehension, sound-quality, or surrounding-noise problem for most, except in situations where the person is not fluent in a particular language, dialect, or accent, or where the surrounding noise versus the person's hearing ability isn't conducive to listening properly.

Most people also tend to judge what the other person is saying and construct a counter-point while listening, which impairs the ability to listen and understand well.

On the other hand, a computer could be expected to be, and made to be, paying attention much better and in a predictable way, which is not possible with humans.

With the other comment reply above stating people's expectations with humans vs. computers, shouldn't we also consider the computer's strengths while making comparisons with humans?


That's mostly because people are thinking about other things. We understand that and anticipate it. If my computer doesn't understand me, it has no excuse as it can't distract itself. It isn't going to hear me next time by "concentrating harder" like a human can. It's going to keep failing.


I have a different experience - many people speak with a mumble or a mushmouth and no amount of concentration helps me disentangle it until I can get them to speak more clearly.


Sure, but if you repeat your utterance there's a good chance that the conditions will have changed the second time around: maybe the background noise will have subsided, or you'll have swallowed that bit you were chewing on, and so on. It makes sense to ask you to repeat a couple of times even if it's a computer you're talking to.


Yes but people know to ask for clarification based on context. You know that they didn't just say they were off to wok their log.


Humans do way better than 6%. I don't know who you're talking to, but I don't know anybody who needs me to repeat anywhere near 6% of what I say.


I'm also not seeing anything close to 6% on any public implementations. The voice mail transcript emails I get are often so bad that it's impossible to discern even the gist of what the caller is talking about.


Does the word error rate account for context? For example, if I heard "clown" instead of "cloud", I would probably treat it as something I misheard in the context of the sentence and correct my understanding accordingly. I guess what I am trying to ask is: if the system reaches 4% (the human error rate for the same sample), can it really be considered on par?


It measures against what humans write down when they listen. Humans generally use context, so I'd expect that they would record "cloud" if that was appropriate in context.

The Microsoft system also does this: it uses language modelling to attempt to model which word is more likely in a given context. This gives them an 18% word error rate reduction (see section 7 of the paper: http://arxiv.org/pdf/1609.03528v1.pdf).
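Concretely, WER is word-level edit distance against the human reference transcript, so context only enters through the words each transcriber or system writes down. A minimal sketch of the standard calculation (not the paper's actual evaluation scripts):

    # Word error rate: word-level Levenshtein distance divided by the
    # length of the reference transcript.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("there is a cloud in the sky", "there is a clown in the sky"))  # 1/7, about 0.143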


All language models attempt to model what word is more likely in a given context :)

What is somewhat unusual about this approach is that they use recurrent neural networks for the language modelling, as opposed to more traditional approaches like a backoff n-gram model.
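For contrast, a toy version of the "traditional" approach mentioned here: a "stupid backoff" bigram model over a made-up corpus. This is only an illustration of the idea, not anything from the paper:

    # Toy "stupid backoff" bigram language model.
    from collections import Counter

    corpus = "there is a cloud in the sky and a cloud over the hill".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    total = sum(unigrams.values())

    def score(prev: str, word: str, alpha: float = 0.4) -> float:
        """P(word | prev): use the bigram estimate if the pair was seen,
        otherwise back off to a discounted unigram estimate."""
        if bigrams[(prev, word)] > 0:
            return bigrams[(prev, word)] / unigrams[prev]
        return alpha * unigrams[word] / total

    # In context, "cloud" scores far higher than the acoustically similar "clown".
    # (A real model would also smooth unseen words instead of scoring them zero.)
    print(score("a", "cloud"), score("a", "clown"))

An RNN language model plays the same role, scoring the next word given its history, but it conditions on the full history through a learned hidden state rather than fixed-length counts.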


Most modern language modelling is RNN based now; eg: https://arxiv.org/abs/1602.02410 (which I think is SOTA).


>> 6% seems pretty good.

That's 6% on the NIST dataset. Typically, results get much worse on real-world datasets, not least because trying to get good results on the same dataset year after year leads to subtle bias. Don't forget this is a dataset that's been around for 16 years, now.


When I worked at a speech recognition startup in the late 90s I was told that anything over 5% is essentially useless for no-touch recognition.


> a series of 8kHz audio recordings (ie. crappy phone quality samples)

I worked in speech recognition for a bit, and 8kHz was the standard audio rate for recordings. No one saw this as an impediment, and it really isn't: 8kHz sampling can capture frequencies up to 4kHz, and the fundamental frequencies of speech are MUCH lower than 4kHz.


At an 8 kHz sampling rate, you can tell /f/ from /s/ only by context (they sound the same).
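A quick numpy check of the point both comments are making: at an 8 kHz sampling rate nothing above 4 kHz survives, and the high-frequency energy that distinguishes fricatives like /f/ and /s/ is largely above that. A 6 kHz tone sampled at 8 kHz produces exactly the same samples as an inverted 2 kHz tone:

    import numpy as np

    fs = 8000                       # sampling rate (Hz)
    t = np.arange(fs) / fs          # one second of sample times
    tone_6k = np.sin(2 * np.pi * 6000 * t)
    tone_2k = np.sin(2 * np.pi * 2000 * t)

    spectrum = np.abs(np.fft.rfft(tone_6k))
    peak_hz = np.argmax(spectrum) * fs / len(tone_6k)
    print(peak_hz)                                    # 2000.0 -- the 6 kHz tone aliased
    print(np.allclose(tone_6k, -tone_2k, atol=1e-9))  # True: the sampled waveforms coincide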


That's very technically impressive, but 6% is still a long way from where it needs to be as anything like a primary interface. Even 1% is pretty high when you consider how many words you can utter in a few minutes, and how many errors that would generate.

Edit: For comparison: http://www.utdallas.edu/~assmann/hcs6367/lippmann97.pdf


The human error rate on this task is 4% (https://pdfs.semanticscholar.org/387e/7349b8e31e316c2a738060...). So you are basically saying that telephones are a useless interface...

What is amazing is how consistently people overestimate human performance.


Happens all the time with autonomous driving. I've heard so many people argue that self-driving cars need to be 100% perfect, but they always overlook that humans aren't anywhere near 100% perfect drivers.

Frankly, we're lousy drivers.


Sorry, what is this based on? I for one don't like the idea of self-driving cars and that has nothing to do with wanting them to be "100% perfect", or forgetting that human drivers are not safe either.

On the one hand, there are very good ethical arguments, for example about who bears the moral responsibility when a self-driving car is involved in a fatal accident.

Further, there is a great risk that self-driving cars will become available long before they are advanced enough to be less of a risk than humans, exactly because people may implicitly trust an automated system to be less error-prone than a human, which is not currently the state of the art.


>Sorry, what is this based on?

Conversations I've had where people have told me that self-driving cars will need to be 100% perfect before they should be used. Ironically, one of those people was an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.

Anyway, based on Google's extensive test results, I'm pretty sure self-driving cars are already advanced enough to be less of a risk than humans. Right now, the sensors seem to be the limiting factor.


Try looking for results where the Google car is driving off-road. There aren't any. That's because it can't drive off-road. It can't, because it needs its environment to be fully mapped. In other words: it's not about the sensors.

This should make, er, sense. Sensing your surroundings is only the first step in taking complex decisions based on those surroundings. The AI field as a whole has not yet solved this, so there's no reason to expect that self-driving cars have.

Seen another way, if self-driving cars could really drive themselves at least as well as humans drive them, we wouldn't have YouTube compilations of robots falling over while trying to turn door knobs.

The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.

>> an ex-gf of mine who caused two car accidents because she was putting on makeup while driving.

Honestly.


>The state of the art in AI is such that self-driving cars are not yet less dangerous than humans.

Google's self-driving car accident statistics say otherwise.

>Honestly.

Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.


>> Google's self-driving car accident statistics say otherwise.

That's an experiment that's been running in a tiny part of one state in one country for a very limited time. I wouldn't count on those numbers, and in any case, see what I said above: the state of the art is not there yet for fully autonomous driving that's better than humans'.

>> Yeah. Weirdest part was, she actually thought she was a GOOD driver. Mostly because of all the times she was able to apply makeup while driving and didn't cause an accident.

:snorts coffee:


>That's an experiment that's been running in a tiny part of one state in one country for a very limited time.

It's driven over 1 million miles. That's the equivalent of 75 years of driving for the average human. Plenty of data to draw a conclusion from. In all that time, it's been responsible for a single accident. That's way better than human drivers.


Hours driven are one dimension, and an important one. However, there is also the geographical aspect that I pointed out, which may matter more in practice. I mentioned off-road driving. There's also driving on busy roads. The Google car project has not driven 1 million miles in a busy city like NY or SF, nor in heavy traffic conditions.

Then there's the fact that human drivers have to drive in all sorts of weather conditions with all sorts of different vehicles and so on. The Google car, not so much.

But my point is very simple: AI in general is nowhere near producing autonomous robots, yet. Why would Google car (or a similar project) be the exception? What makes cars and driving so different that autonomy is easier to attain?


I agree with your general sentiment, but think those error rates hide a lot as well. Human error might be at x% overall, but when you eliminate malfunctioning humans, broadly defined, it's probably much lower than x%.

The recent death of the Tesla owner, for example, as far as I know, was due to the vehicle accelerating into a semi. This is something that most people would not do even in their worst driving state unless they were intoxicated or seriously mentally impaired. I don't want AI driving errors to be compared to human benchmarks that include people who are seriously intoxicated.

A lot of speech frustration problems, similarly, are not only about poor recognition in general, or lack of appropriate prompting to increase classification certainty, but recognition failures in situations where a human would not have any trouble at all, such as in recognizing names of loved ones, or things that would be clear in context to a human. I.e., maybe humans listening to speech corpora would have x% error rate, but that's strangers listening to the corpora. The real question is, if I listen to a recording of my spouse or coworker having a conversation what's the error rate there?

So, although humans are far from perfect, which is something that's often forgotten, the true AI target is also probably not "humans broadly defined" but rather "functional humans" or something like that. AI research often sets the bar misleadingly low because it's so hard to reach as it is.


The types of mistakes a human and a car are prone to making are different. Neither one has to be a superset of the other. For example, cars are probably better at going round corners at a safe speed while humans can easily misjudge and end up skidding. You could make the opposite argument and say only the most malfunctioning self driving car would choose the wrong speed for a corner yet humans make that error all the time, so humans are even worse than the worst self-driving cars.

Another example. If a self driving car is hit by another car that's running a red light while speeding, we might be more forgiving and say "well nobody could have avoided that accident" but actually we'd be being too soft on the self driving car since it has access to more data and faster reaction times and should probably be expected to avoid that type of crash even when a human can't.


Sorry, but did you get the Tesla story right? The Tesla driver was not paying attention, and the car drove at constant speed into an obstacle of the sort that it is known to not be able to see. I know people are posting all kinds of things on the Internet and HN about this accident, but that doesn't make it true. If an actual self-driving car did the same thing, you'd have a great example. But not this one.


On the flip side, a large percentage of human drivers are unlikely to suddenly run into the nearest stationary object because of a botched software update.


True, but they often run into the nearest stationary object because they're texting, or eating, or putting on makeup (an ex-gf of mine caused TWO accidents that way), or arguing with a passenger, or speeding, or driving tired after a long day of work, or driving after having a few beers but I'm totally fine I swear I'm cool to drive home, etc...

You're absolutely right that there are risks. But honestly, I suspect drunk drivers alone cause more fatal accidents than autonomous cars ever could.


That's the point though... people who want 100% perfect automation have no leg to stand on when presented with a more reliable option than a human. That's not the case with speech recognition yet, and I'd say as a result people would be well justified in demanding at least as much capability as they're capable of themselves.


Really? I expect there to be huge resistance to driverless cars (at first) as people come up with crazy scenarios in which they're certain that their own elite driving skills could save them but the car could not. Then there'll be a reversal as people start to actually accept that the car is a better driver than they are, and surprisingly quickly they'll start saying anyone who wants to drive manually is a dangerous egotist.


Maybe, but the "upside" of humans being such terrible drivers is that we've almost all either gotten into a bad accident, or know someone close who has. The fact that it's not going to be hard to empirically prove a difference between machine reliability in this case, and human reliability may help as well.

Unfortunately... I think the big issue is going to be pure anxiety; a bigger and more immediate form of what a lot of people experience on an airplane. Giving up even the illusion of control is supremely hard for us as a species, in general. Then there's just the fact that as a species we're terrible at risk assessment.

https://www.schneier.com/blog/archives/2006/11/perceived_ris...


http://www.utdallas.edu/~assmann/hcs6367/lippmann97.pdf

I did say "primary interface" as well, which definitely rules out a mediocre phone connection.


Does using speech multimodally with other input methods (touch, gesture, pen, clicker, game controller, even keyboard - may seem silly but speech is potentially faster at some tasks) still count as "primary" if the other input method is used supplementarily to help disambiguate?


I'd say, and let me be clear that I know and accept this to be a relatively arbitrary distinction on my part, that for an input to be "primary" it needs to be able to stand alone (touchscreen input, keyboard and mouse, etc.). That's not to say that it must stand alone, but that when all other options are off the table, that would be your preferred method for text/data entry.

That's a troublesome definition though at least in part for reasons you've brought up or alluded to, which is that a multimodal approach is pretty clearly going to dominate. That said, speech recognition at least stands to replace the keyboard for say, the author of a book or article, if it's good enough.


Compared to voice over internet? Yeah, it's pretty terrible.


Recognizing the words is just the first step. Getting meaning out of those words is what really counts. When on a telephone call I may miss a percentage of what was said, due to poor audio quality, the speaker's accent, etc., but based on context I can still understand the message the speaker is trying to convey most of the time. Transcribing each word with near-perfect accuracy is unnecessary if the layers above that can handle it.


That's true, but I wonder if computers will exceed our technical accuracy first, or begin to actually "understand" things first? I suspect the former.


They don't need to actually understand (whatever that means); they can apply statistical models based on phonetic distances and large corpora of dialogue.
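A toy sketch of the kind of rescoring being described, tying back to the "wok their log" example above: pick among acoustically similar candidates by combining a similarity score with a prior from text. The character-overlap "acoustic" score and the phrase counts are stand-ins invented for illustration; a real system would use phonetic distances and a proper language model:

    from difflib import SequenceMatcher

    heard = "off to wok their log"
    candidates = ["off to wok their log", "off to walk their dog"]

    # Hypothetical phrase counts from a large dialogue corpus.
    corpus_counts = {"off to wok their log": 0, "off to walk their dog": 5000}
    total = 1_000_000

    def score(candidate: str) -> float:
        acoustic = SequenceMatcher(None, heard, candidate).ratio()   # "sounds like what we heard"
        prior = (corpus_counts.get(candidate, 0) + 1) / total        # "is something people say"
        return acoustic * prior

    print(max(candidates, key=score))   # off to walk their dog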


Has that worked so far?


Google voice search will make several guesses as to what I said. It will sometimes enter a nonsensical but similar-sounding phrase to what I intended into the text box, but will then figure out that its first guess is nonsense and return the correct search result.

So just based on observation I would say it works pretty well. On mobile I almost always use voice search, and unless I'm searching for an Italian name or something unusual it almost always hears me correctly. Even in a noisy pub.


That's a good point, I've rarely had issues with Google voice searching.


Of course it has. We went down from 40% to 6.9% error rate in 26 years. It may take a couple of decades to get to 0.1%.


How does it compare to the typical human error rate?


Best I can find is section four here: http://www.utdallas.edu/~assmann/hcs6367/lippmann97.pdf

It seems to be significantly less than 1% - 1.6%


No, you looked at the wrong figure. It's Figure 7 (Switchboard), and the WER for humans is 4%.


No, if you look that's only the CC component of the test, with and without context; not the whole test.


6.3% on Switchboard. This is of course in response to IBM getting 6.6%, which was in turn in response to Baidu getting...

Switchboard is kind of a lame evaluation set. It's narrowband, old, and doesn't contain all that much training data (100s of hours, whereas many newer systems are trained on 1000s or 10Ks of hours). And the quest for a lower Switchboard WER to publish means teams are now throwing extra training data at the problem, or using frankly unlikely-to-be-deployed techniques like speaker adaptation, impractically slow language models, or bidirectional acoustic models (which require the entire utterance before they can emit any results).

I really wish they would have stuck to just publishing a paper explaining what was actually new here (ResNet for acoustic models? Cool!) rather than a "let's see how low we can push this 20-year-old benchmark" paper.


I'm not sure what your complaint is. The paper (on arxiv, linked in the blog post) describes the general techniques used.

Are you saying the benchmark is useless? It's old, yes, but it's extremely valuable to have a benchmark that allows one to assess system performance over time. It gives a good idea of the rate of progress and the distance still to go to match people - after ~16 years, computers are still about a third worse than humans, error-rate wise.


Surely not "useless" but it doesn't reflect the way speech recognition is used today. Unless you're routinely listening in on two humans having a phone call conversation that they don't expect a computer to be hearing, which is what the test set actually contains.

If you looked at modern performance on test sets from the 1980s (like Resource Management or TIDIGITS) you might be under the impression that we'd achieved human-level accuracy years ago, but we clearly haven't. And similarly, what users expect from speech recognition today is in many ways much more demanding than it was in 2000: vocabularies are huge (think about all the words you could say to Google), latency needs to be very low, and no one thinks it's acceptable to require users to perform enrollment any more.

So yes, just like other benchmarks, we should retire them after a few years. The fact that a modern computer could get 100,000 FPS on a video game from 2000 wouldn't be considered a "milestone."


>> Are you saying the benchmark is useless?

I'm not the OP (who replied already) and I don't think old benchmarks are useless, but I'm worried that teams trying to beat a dataset from an old competition for a sufficiently long time will inevitably overfit to that dataset, reducing the accuracy of their published results. That's even more so when the competition's test set is available and there's nothing really keeping it from "creeping" into the training set at some point, maybe between different system versions.

What would really be useful is a sort of ongoing challenge where a training set stays up for a decade at least and the test set is never revealed (but can be used to test systems). Perhaps data could even be renewed every few years as long as new examples can be reliably collected in a similar enough manner with older data.


Speaker adaptation, unlikely to be deployed? There are plenty of really big production systems with deployed speaker adaptation, whether that just be saving CMVN stats or saving i-vectors. I've worked on a couple of them.

w.r.t. run time, though, agreed. Hearing the IBM folks say "... 10" in response to the "what's the RTF" question was funny.

(and, agreed, at this point the switchboard announcements are definitely just marketing.)


RTF as in real time factor - how long does the processing take compared to the length of the recording?


Yup! Good production systems shoot for RTF ~ 1.0. This means that they can usually answer almost as soon as the speech is ended, because recognition is streaming.

And it's _really easy_ to increase accuracy by taking more time, by: building bigger DNN acoustic models; exploring a larger search space of hypotheses; using a slower language model (like an RNN) to rescore hypotheses; considering more possible pronunciations; etc....

(ML is usually a space / time / accuracy trade-off, so if you get phat accuracy gains at the cost of significant slow down, I'm usually unimpressed. The deepmind TTS paper _was_ impressive because it went beyond the best we can do, so even though it was 90 minutes to generate 1 second of speech, it's cool because it shows where we can go. TBH all of these switchboard papers don't do a ton of new stuff, they just get more aggressive about system combination and tuning hyperparameters.)
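To make the numbers concrete: RTF is just processing time divided by audio duration, so RTF 10 means a five-minute recording takes fifty minutes to decode. The timings below are hypothetical:

    def rtf(processing_seconds: float, audio_seconds: float) -> float:
        """Real-time factor: processing time / audio duration."""
        return processing_seconds / audio_seconds

    print(rtf(processing_seconds=300.0, audio_seconds=300.0))   # 1.0  -> streaming-friendly
    print(rtf(processing_seconds=3000.0, audio_seconds=300.0))  # 10.0 -> research-system territory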


The improvement from Switchboard can often carry over to other tasks. This is the task that has been used for 20 years by the speech community, a well-known entity for most people working on speech. It is very useful for comparing notes.


I would love to see a breakdown of the kinds of errors these systems make. WER is an interesting broad stroke, but it doesn't necessarily tell me how useful a given system will be for some given application[0] (unless, of course, it is 0). It'd be even more interesting to see comparative error analysis across the selection of these systems. A 0.06 point improvement is certainly impressive, especially this close to the end of the scale, but I'd be curious to see if it lost anything in getting there. It's one thing if this system is strictly better than its predecessor. It's entirely another if it is now 10% better at recognizing instances of the word 'it' but has lost the ability to distinguish 'their' and 'they're'[1].

---

[0] It is likely that any viability analysis would be on a by-application basis, so I don't pretend I'm asking for an insignificant amount of work here!

[1] a crude, toy, and likely inaccurate example. Not trying to belittle the work.


IBM was asked this, at Interspeech, and answered that a lot of their errors were on very short words ('a', 'the', 'of', etc.)


Open-source Kaldi gives you 7.8%, so Microsoft didn't get that much further.

Also, a major issue with this kind of research is that they combined several systems in order to get the best results. Most practical systems don't use combinations; they are too slow.


So this model won't be free software? Odd, and a bummer...

Also, I'm not sure an error reduction of 20% (1 - 6.3/7.8) should be considered small; it depends on the particular challenge. For example, sentiment analysis only starts to get interesting above 80% on some datasets, since much can be guessed correctly in very naive ways.

Human level on this task is estimated to be ~4%, so we still have quite a lot of ground to cover.


There are 33% fewer errors with the Microsoft solution than with Kaldi... one could say that is quite significant.


Relative decrease in WER is not so significant for lower percentages. How about "we make 6 errors on 100 words but Kaldi makes 7".


It is cool anyone can use CNTK to produce something similar now


23% only
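For what it's worth, the competing percentages in this sub-thread come from measuring the same 1.5-point absolute gap against different baselines (using the 6.3% and 7.8% figures quoted above):

    ms, kaldi = 6.3, 7.8
    print(1 - ms / kaldi)    # ~0.19 -> "the Microsoft system makes ~19% fewer errors than Kaldi"
    print(kaldi / ms - 1)    # ~0.24 -> "Kaldi makes ~24% more errors than the Microsoft system"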


Maybe they can distill the ensemble into a compact and efficient version for production.


Could you please provide a link?



The paper itself can be found at https://arxiv.org/pdf/1609.03528v1.pdf. It is quite interesting to see the failure rate coming in lower than the average human failure rate; can't wait to see how this improves over the coming years.


Speech-to-text still has a long way to go when it comes to foreign accents. Google Now's "OK Google" initializer has about a 3/10 hit rate for my Indian-accented speech.


'OK Google' is a different problem though - it needs to be a low-resource, always-on listener with a low false-positive rate. That's quite a different problem space from general speech-to-text.

I think I get about 50% hit rate with 'OK Google' and I'm a native English speaker :).


I had the weirdest conversation with my phone this morning, and I'm a native English speaker with nothing really identifiable as an accent.

"Okay, Google." Nothing.

"Okay, Google." Nothing.

"Okay, Google fucking work or I am taking a hammer to this fucking phone." "DING!"

Apparently threats of violence still work against our machine overlords.


How about the inverse process -- speech synthesis? Anyone know what the state of the art is in that field? The tech has been getting steadily better but we still seem a ways away from passable machine-generated audiobooks, for example.


Like this one?

WaveNet: A Generative Model for Raw Audio https://deepmind.com/blog/wavenet-generative-model-raw-audio...


That's very cool, thanks for the link! From the comparison, the audio quality is clearly improved and they eliminated the sort of digital "wobble" that I usually associate with TTS. Intonation is still a bit off, though. Will check out their paper.


I've wondered the same thing! How much work is being done in the area of making Siri and OK Google sound more human-like?


Computer generated speech (Siri and voiceover) was one of the areas that Apple said they were applying AI techniques to improve.


From the linked arxiv paper, http://arxiv.org/abs/1609.03528 this is a very interesting use of CNTK to adapt image CNN techniques to speech recognition. Surprising that CNNs worked so well on speech audio. Full disclosure: I am a MSFT employee.
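For readers curious what "image CNN techniques on speech" looks like, here is a toy PyTorch sketch that treats a log-mel spectrogram as a one-channel image and convolves over the time-frequency plane. The layer sizes, the 40-mel input, and the 100 output classes are invented for illustration; this is not the architecture described in the paper:

    import torch
    import torch.nn as nn

    class ToySpeechCNN(nn.Module):
        def __init__(self, n_classes: int = 100):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=(2, 1)),   # pool over frequency, keep time resolution
            )
            self.classifier = nn.Linear(32 * 20, n_classes)  # 40 mel bands -> 20 after pooling

        def forward(self, spectrogram):               # (batch, 1, 40 mel bands, frames)
            x = self.features(spectrogram)            # (batch, 32, 20, frames)
            x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, frames, 32 * 20)
            return self.classifier(x)                 # per-frame class scores

    scores = ToySpeechCNN()(torch.randn(1, 1, 40, 200))
    print(scores.shape)                               # torch.Size([1, 200, 100])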


For the world record, CNTK delivered the best speech recognition system and TensorFlow delivered the best Go player (switched from Torch) :-)


If these speech-to-computer interfaces are so important, why don't we develop a dialect for humans to speak to computers more efficiently, kind of like the Graffiti alphabet on the Palm Pilot but for speech?


OT, but this concept was explored by British author Robert Westall in his 1983 SF dystopia Futuretrack 5:

"Was it just coincidence that computerspeak, which we'd learned with such throat wrenching difficulty, was also a complaining, niggling, nit picking noise that gave any normal human being a pain in the arse?"

Now me, I used Graffiti a ton, as well as the not-quite-so-single-stroke and therefore not-so-patent-encumbered version that MS used on their late PDAs. I even implemented my own version on an early Windows touchscreen netbook (angular delta and Levenshtein distance, match), so much so that it still pollutes my handwriting to this very day. I find it elegant and immediately understandable. Everyone else thinks I'm writing in pigpen cipher.


Or the way we google. It would mean acknowledging that the listener (the computer) is severely limited. Actually it happens when speaking with foreigners with low proficiency in our language.


Acknowledging that the computer is severely limited seems like a good first step. Some people hate thinking of it that way though, and I have no idea why. It's annoying.



Communicating with computers was one specific goal of Loglan!


Here's an example, at least for code. "Using Python to Code by Voice" https://www.youtube.com/watch?v=8SkdfdXWYaI


While we're at it, maybe all my coworkers could speak in an American accent, too. That would just be so much easier.


Like Totally!


I wonder if we will get there gradually, as "power users" of Siri/OkGoogle/Cortana/etc, and features targeted to them, start to emerge. Initially it won't be couched as a new dialect but more like "shortcuts" which will become more sophisticated over time.


This is an interesting question. I wonder if the limitations are due to particularities of English and other common languages, or to particularities of human speech in general.


This poster[1] might be interesting. English spelling-pronunciation mapping is really complex compared to Italian for example. I wonder if Italian would be a better language for speech recognition. Most of speech recognition research is targeted at English, but with similar effort, maybe Italian would enable lower error rates.

[1]: http://web.psych.ualberta.ca/~chrisw/ML3/84.pdf


Didn't google announce some speech breakthrough last week?



Are speech recognition systems also paired with vision recognition systems to determine intent? Seems like that would be where research would be headed.


That's an interesting one. Not only for intent, but for getting the WER down. For example, my mother often mumbles but if I'm in front of her seeing her face I understand her perfectly, but if she's out of my sight where I can't read her lips I have trouble understanding her.


Why is this score a milestone?


Does Microsoft's speech cloud API support this, so that we can use it?


Others, such as Google, do the same: they develop some system in their labs, but put a worse system into production because it has lower resource usage.


No way they're running this system in production. These papers are pure marketing.

I'm sure their accuracy is fairly reasonable, though.


The best [speech] recognition engine in the world, and it hears the wrong word more than 6% of the time. Ouch.

I would have thought the state of the art would be better, given anecdotal evidence from friends who write with speech-to-text programs and love them.

Perhaps some of this is due to deliberately bad audio quality in the switchboard samples.


Mistakes are part of real life, just like how you spelled speach, instead of speech.


The parent comment has a failure rate of 1/38 words, which means a 2.6% failure rate, and this is one of the best commentators in the world, too.


The human error rate on this task is 4% (https://pdfs.semanticscholar.org/387e/7349b8e31e316c2a738060...). What is amazing is how consistently people overestimate human performance.


Even humans make mistakes: speach -> speech, deliberatly -> deliberately. I expect these systems to surpass human ability in the next few years.


My bet is that it's not really about audio quality, but because in order to understand speech you need to fully understand the language. Try to write down something you hear in a foreign language. As someone who learnt English as a foreign language, I can assure you that at first I wasn't even able to separate the words; it just sounded like a continuous stream of sound. Once you learn enough vocabulary you identify some words. Once you identify enough words you are able to fill the gaps using the context. With practice the whole process happens so fast you don't even notice.


I wish I only failed to recognize 6% of the words when listening to crappy teleconference calls...


You listen on teleconference calls?

I like it when people say "Can you repeat that, I was on mute"


Well, we have to put humans into the comparison in order to know exactly how impressive the 6% metric is. I would guess it might be pretty close.


I wouldn't blame the audio quality. It is more that different humans have different kinds of accent, which make the same word sound different. Considering the different accents, I would say a 6% error rate is pretty good compared to where we started.


Why shouldn't the audio quality factor in? Isn't it easier to understand 44.1kHz 16-bit CD-quality audio than 8kHz 8-bit PCM (phone data)?


It might be, but the resource demands of higher-fidelity acoustic models slow processing down. 44.1/16 has an order of magnitude greater bitrate than 8/8.
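The "order of magnitude" checks out with quick arithmetic (mono, uncompressed PCM assumed):

    cd_quality = 44_100 * 16      # 705,600 bits/s
    phone      = 8_000 * 8        #  64,000 bits/s
    print(cd_quality / phone)     # ~11x, about an order of magnitude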


I guess the point of this particular data set is to check performance on low quality phone audio


The base product use case has been to handle phone fidelity for many years. Think: legal dictation, retail digital recording hardware (phones), and medical transcription. Speech-to-text for recordings fed by a Telefunken U-47 is highly niche. :)


Heh. It's not so much because of the microphones that the data sets are THAT narrowband - it's more about phone bandwidth limitations. Even the cheapest electret and MEMS mics have pretty good frequency response, far beyond the 4kHz (the Nyquist frequency at 8kHz) this data set uses.

Now that bandwidth is becoming less of an issue, we will be getting less shitty sounding, wider bandwidth phone audio - https://en.wikipedia.org/wiki/Wideband_audio

Though if I had it my way, U87 or U47, or hey even SM7B would be mandatory for all speech recordings :)



