Does the word error rate account for context? For example, if I heard "clown" instead of "cloud", I would probably treat it as something I misheard given the context of the sentence and correct my understanding accordingly. I guess what I'm trying to ask is: if the system reaches 4% (the human error rate for the same sample), can it really be considered on par?
It measures against what humans write down when they listen. Humans generally use context, so I'd expect that they would record "cloud" if that was appropriate in context.
The Microsoft system also does this: it uses language modelling to attempt to model which word is more likely in a given context. This gives them an 18% word error rate reduction (see section 7 of the paper: http://arxiv.org/pdf/1609.03528v1.pdf).
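To make the "clown" vs. "cloud" point concrete, here's a toy sketch (all the scores are made up for illustration) of how a recognizer can combine an acoustic score with a language-model score, so that a word that sounds slightly less likely can still win once context is taken into account:

```python
import math

# Hypothetical acoustic log-probabilities: in isolation, the audio
# sounds slightly more like "clown" than "cloud".
acoustic = {"clown": math.log(0.55), "cloud": math.log(0.45)}

# Hypothetical LM log-probabilities for the word in a context like
# "a dark ___ in the sky": the language model strongly prefers "cloud".
language_model = {"clown": math.log(0.01), "cloud": math.log(0.60)}

lm_weight = 1.0  # weight on the language-model score

def total_score(word):
    return acoustic[word] + lm_weight * language_model[word]

best = max(acoustic, key=total_score)
print(best)  # "cloud" wins once context is taken into account
```

Real systems do this over whole lattices of hypotheses rather than a two-word choice, but the basic trade-off is the same.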
All language models attempt to model what word is more likely in a given context :)
What is somewhat unusual about this approach is that they use recurrent neural networks for the language modelling, as opposed to more traditional approaches like a backoff n-gram model.
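For reference, the "traditional" baseline is simple enough to sketch in a few lines. This is a minimal backoff bigram model on a made-up toy corpus (not the paper's RNN LM, and with an arbitrary backoff weight rather than proper discounting):

```python
from collections import Counter

# Toy corpus, purely for illustration.
corpus = "the cloud in the sky the cloud was grey a clown at the circus".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def bigram_prob(prev, word, alpha=0.4):
    """P(word | prev): use the bigram estimate if the pair was seen,
    otherwise back off to a discounted unigram estimate."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total

# The corpus contains "the cloud" but never "the clown",
# so after "the" the model assigns "cloud" the higher probability.
print(bigram_prob("the", "cloud") > bigram_prob("the", "clown"))  # True
```

An RNN language model replaces these count-based estimates with a learned hidden state that can, in principle, condition on much longer context than a fixed n-gram window.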