I wouldn't blame the audio quality. It's more that different people have different accents, which makes the same word sound different. Considering that variation, I'd say a 6% error rate is pretty good compared to where we started.
It might be, but the resource demands of higher-fidelity acoustic models slow processing down. 44.1 kHz/16-bit audio has an order of magnitude greater bitrate than 8 kHz/8-bit.
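Back-of-the-envelope, assuming uncompressed mono PCM (raw bitrate is just sample rate × bit depth; the snippet below only illustrates the arithmetic):

    # Raw PCM bitrate in bits per second: sample_rate * bit_depth * channels.
    def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
        return sample_rate_hz * bit_depth * channels

    cd_quality = pcm_bitrate(44_100, 16)    # 705,600 bit/s
    phone_quality = pcm_bitrate(8_000, 8)   #  64,000 bit/s
    print(f"44.1 kHz / 16-bit: {cd_quality:,} bit/s")
    print(f"8 kHz / 8-bit:     {phone_quality:,} bit/s")
    print(f"ratio: {cd_quality / phone_quality:.1f}x")  # ~11x, i.e. an order of magnitude

So every second of CD-quality audio carries roughly eleven times the raw data of phone-quality audio, before any feature extraction even starts.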
The base product use case has been handling phone-fidelity audio for many years. Think: legal dictation, retail digital recording hardware (phones), and medical transcription. Speech-to-text for recordings fed by a Telefunken U-47 is highly niche. :)
heh. It's not so much the microphones that make the data sets THAT narrowband; it's phone bandwidth limitations. Even the cheapest electret and MEMS mics have pretty good frequency response, far beyond the 4 kHz limit (the Nyquist frequency of the 8 kHz sample rate) this data set uses.
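A minimal sketch of what that band limit does, assuming numpy/scipy and using a hypothetical 6 kHz test tone: anything above the 4 kHz Nyquist frequency simply can't survive an 8 kHz sample rate, no matter how good the mic is.

    import numpy as np
    from scipy.signal import resample_poly

    rate = 44_100
    t = np.arange(rate) / rate  # one second of audio
    # A 6 kHz tone: well within cheap-mic response, outside phone bandwidth.
    tone = np.sin(2 * np.pi * 6_000 * t)

    # 8000/44100 reduces to 80/441. resample_poly applies an anti-aliasing
    # low-pass filter before decimating, so content above 4 kHz is removed.
    narrowband = resample_poly(tone, up=80, down=441)
    print(len(narrowband))           # 8000 samples: one second at 8 kHz
    print(np.abs(narrowband).max())  # near zero: the 6 kHz tone was filtered out

The same thing happens in the phone network itself: the codec band-limits the signal, so the training data inherits that ceiling regardless of the capture hardware.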