Does the word error rate account for context? For example, if I heard "clown" instead of "cloud", I would probably treat it as something I misheard given the context of the sentence and correct my understanding accordingly. I guess what I'm trying to ask is: if the system reaches 4% (the human error rate for the same sample), can it really be considered on par?
It measures against what humans write down when they listen. Humans generally use context, so I'd expect that they would record "cloud" if that was appropriate in context.
The Microsoft system also does this: it uses language modelling to attempt to model which word is more likely in a given context. This gives them an 18% word error rate reduction (see section 7 of the paper: http://arxiv.org/pdf/1609.03528v1.pdf).
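To make the "clown" vs. "cloud" point concrete, here's a toy sketch (all the scores are made up for illustration) of how a recognizer can combine an acoustic score with a language-model score, so that a word that sounds slightly less likely can still win once context is taken into account:

```python
import math

# Hypothetical acoustic log-probabilities: in isolation, the audio
# sounds slightly more like "clown" than "cloud".
acoustic = {"clown": math.log(0.55), "cloud": math.log(0.45)}

# Hypothetical LM log-probabilities for the word in a context like
# "a dark ___ in the sky": the language model strongly prefers "cloud".
language_model = {"clown": math.log(0.01), "cloud": math.log(0.60)}

lm_weight = 1.0  # weight on the language-model score

def total_score(word):
    return acoustic[word] + lm_weight * language_model[word]

best = max(acoustic, key=total_score)
print(best)  # "cloud" wins once context is taken into account
```

Real systems do this over whole lattices of hypotheses rather than a two-word choice, but the basic trade-off is the same.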
All language models attempt to model what word is more likely in a given context :)
What is somewhat unusual about this approach is that they use recurrent neural networks for the language modelling, as opposed to more traditional approaches like a backoff n-gram model.
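For reference, the "traditional" baseline is simple enough to sketch in a few lines. This is a minimal backoff bigram model on a made-up toy corpus (not the paper's RNN LM, and with an arbitrary backoff weight rather than proper discounting):

```python
from collections import Counter

# Toy corpus, purely for illustration.
corpus = "the cloud in the sky the cloud was grey a clown at the circus".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def bigram_prob(prev, word, alpha=0.4):
    """P(word | prev): use the bigram estimate if the pair was seen,
    otherwise back off to a discounted unigram estimate."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

# The corpus contains "the cloud" but never "the clown",
# so after "the" the model assigns "cloud" the higher probability.
print(bigram_prob("the", "cloud") > bigram_prob("the", "clown"))  # True
```

An RNN language model replaces these count-based estimates with a learned hidden state that can, in principle, condition on much longer context than a fixed n-gram window.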