6.3% on Switchboard. This is of course in response to IBM getting 6.6%, which was in turn in response to Baidu getting...
Switchboard is kind of a lame evaluation set. It's narrowband, old, and doesn't contain all that much training data (100s of hours, whereas many newer systems are trained on 1000s or 10Ks of hours). And the quest for a lower Switchboard WER to publish means teams are now throwing extra training data at the problem, or using frankly unlikely-to-be-deployed techniques like speaker adaptation, impractically slow language models, or bidirectional acoustic models (which require the entire utterance before they can emit any results).
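To be concrete about that last point, here's a rough sketch (PyTorch, purely illustrative, nothing from the actual papers) of why a bidirectional acoustic model can't stream: its backward direction can't run until the last frame of the utterance has arrived, while a unidirectional model can emit output chunk by chunk.

    import torch
    import torch.nn as nn

    frames = torch.randn(1, 500, 40)  # (batch, time, features): ~5s of 10ms frames

    uni = nn.LSTM(input_size=40, hidden_size=256, batch_first=True)
    bi = nn.LSTM(input_size=40, hidden_size=256, batch_first=True, bidirectional=True)

    # Streaming: feed the audio chunk by chunk, carrying the hidden state forward,
    # so partial results are available while the user is still talking.
    state = None
    for chunk in frames.split(50, dim=1):  # 0.5s chunks
        out, state = uni(chunk, state)

    # Bidirectional: the reverse direction consumes the whole utterance, so nothing
    # useful comes out until all 500 frames are in hand.
    out_bi, _ = bi(frames)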
I really wish they had stuck to just publishing a paper explaining what was actually new here (ResNet for acoustic models? Cool!) rather than a "let's see how low we can push this 20-year-old benchmark" paper.
I'm not sure what your complaint is. The paper (on arxiv, linked in the blog post) describes the general techniques used.
Are you saying the benchmark is useless? It's old, yes, but it's extremely valuable to have a benchmark that allows one to assess system performance over time. It gives a good idea of the rate of progress and the distance still to go to match people - after ~16 years, computers are still about a third worse than humans, error rate wise.
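For anyone outside the field: "error rate" here is word error rate (WER), basically word-level edit distance between the reference transcript and the hypothesis, normalized by the length of the reference. A toy Python version, just to make the metric concrete:

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edits to turn the first i reference words
        # into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution or match
        return dp[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ~= 0.167

The headline numbers are this quantity computed over the whole test set (total errors divided by total reference words), not an average of per-utterance scores.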
Surely not "useless", but it doesn't reflect the way speech recognition is used today. Unless you're routinely listening in on two humans having a phone conversation that they don't expect a computer to be hearing, which is what the test set actually contains.
If you looked at modern performance on test sets from the 1980s (like Resource Management or TIDIGITS) you might be under the impression that we'd achieved human-level accuracy years ago, but we clearly haven't. And similarly, what users expect from speech recognition today is in many ways much more demanding than it was in 2000: vocabularies are huge (think about all the words you could say to Google), latency needs to be very low, and no one thinks it's acceptable to require users to perform enrollment any more.
So yes, just like other benchmarks, this one should be retired after a few years. The fact that a modern computer could get 100,000 FPS on a video game from 2000 wouldn't be considered a "milestone."
I'm not the OP (who replied already) and I don't think old benchmarks are useless, but I'm worried that teams trying to beat a dataset from an old competition for long enough will inevitably overfit to it, making their published results less meaningful. That's even more so when the competition's test set is available and there's nothing really keeping it from "creeping" into the training set at some point, maybe between different system versions.
What would really be useful is a sort of ongoing challenge where a training set stays up for at least a decade and the test set is never revealed (but can still be used to score systems). Perhaps the data could even be renewed every few years, as long as new examples can be reliably collected in a manner similar enough to the older data.
Speaker adaptation, unlikely to be deployed? There are plenty of really big production systems with deployed speaker adaptation, whether that's just saving CMVN stats or saving i-vectors. I've worked on a couple of them.
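For context, the simplest version of that is per-speaker CMVN: keep running mean/variance stats for each speaker across sessions and normalize their features with them. A toy numpy sketch (not any production code):

    import numpy as np

    class SpeakerCMVN:
        """Per-speaker cepstral mean/variance normalization stats."""
        def __init__(self, dim):
            self.n = 0
            self.sum = np.zeros(dim)
            self.sumsq = np.zeros(dim)

        def accumulate(self, feats):  # feats: (frames, dim), e.g. MFCCs
            self.n += feats.shape[0]
            self.sum += feats.sum(axis=0)
            self.sumsq += (feats ** 2).sum(axis=0)

        def apply(self, feats):
            mean = self.sum / self.n
            var = self.sumsq / self.n - mean ** 2
            return (feats - mean) / np.sqrt(np.maximum(var, 1e-8))

    # Stats persist across sessions, so the system adapts to the speaker over time.
    speaker = SpeakerCMVN(dim=13)
    speaker.accumulate(np.random.randn(1000, 13))          # yesterday's audio
    normalized = speaker.apply(np.random.randn(300, 13))   # today's request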
w.r.t. run time, though, agreed. Hearing the IBM folks say "... 10" in response to the "what's the RTF" question was funny.
(and, agreed, at this point the switchboard announcements are definitely just marketing.)
Yup! Good production systems shoot for RTF ~ 1.0. This means that they can usually answer almost as soon as the speech is ended, because recognition is streaming.
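(For anyone not familiar with the term: RTF, real-time factor, is just processing time divided by audio duration, roughly as in the sketch below, with decode_fn standing in for whatever decoder you're timing. RTF around 1.0 means the recognizer keeps pace with the audio as it streams in.)

    import time

    def real_time_factor(decode_fn, audio, audio_seconds):
        # decode_fn is a placeholder for your decoder; this just times one pass.
        start = time.monotonic()
        decode_fn(audio)
        return (time.monotonic() - start) / audio_seconds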
And it's _really easy_ to increase accuracy by taking more time, by: building bigger DNN acoustic models; exploring a larger search space of hypotheses; using a slower language model (like an RNN) to rescore hypotheses; considering more possible pronunciations; etc....
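As a concrete example of the rescoring trick: run a fast first pass, keep an n-best list, then re-rank it with a slower LM. Roughly like this (the names here are placeholders, not any real toolkit's API):

    def rescore(nbest, slow_lm_logprob, lm_weight=0.5):
        # nbest: list of (hypothesis_text, first_pass_score) pairs
        # slow_lm_logprob: e.g. an RNN LM forward pass, too slow to use in-search
        best_hyp, best_score = None, float("-inf")
        for text, first_pass_score in nbest:
            score = first_pass_score + lm_weight * slow_lm_logprob(text)
            if score > best_score:
                best_hyp, best_score = text, score
        return best_hyp

The slow model only has to score a handful of hypotheses instead of running inside the search, which is also why it adds latency at the end of the utterance.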
(ML is usually a space / time / accuracy trade-off, so if you get phat accuracy gains at the cost of a significant slowdown, I'm usually unimpressed. The DeepMind TTS paper _was_ impressive because it went beyond the best we could do: even at 90 minutes to generate 1 second of speech, it's cool because it shows where we can go. TBH all of these Switchboard papers don't do a ton of new stuff, they just get more aggressive about system combination and tuning hyperparameters.)
Improvements on Switchboard often carry over to other tasks. It's the task the speech community has used for 20 years - a well-known quantity for most people working on speech. It's very good for comparing notes.