
I wouldn't blame the audio quality. It's more that different speakers have different accents, which makes the same word sound different. Considering the range of accents, I would say a 6% error rate is pretty good compared to where we started.



Why shouldn't the audio quality factor in? Isn't it easier to understand 44.1 kHz, 16-bit CD-quality audio than 8 kHz, 8-bit PCM (phone data)?


It might be, but the resource demands of higher-fidelity acoustic models slow processing down. 44.1/16 has an order of magnitude greater bitrate than 8/8.
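
For concreteness, a back-of-the-envelope check of that claim (assuming uncompressed mono PCM and ignoring any container or codec overhead):

    # Uncompressed mono PCM bitrate: sample rate (Hz) * bit depth
    cd_bps    = 44_100 * 16   # 705,600 bit/s
    phone_bps = 8_000 * 8     # 64,000 bit/s
    print(cd_bps / phone_bps) # ~11x, i.e. roughly an order of magnitude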


I guess the point of this particular data set is to check performance on low-quality phone audio.


The base product use case has been to handle phone fidelity for many years. Think: legal dictation, retail digital recording hardware (phones), and medical transcription. Speech-to-text for recordings fed by a Telefunken U-47 is highly niche. :)


heh. It's not so much the microphones that make the data sets THAT narrowband - it's more the phone network's bandwidth limitations. Even the cheapest electret and MEMS mics have pretty good frequency response, far beyond the 4 kHz Nyquist limit of the 8 kHz sampling rate this data set uses.
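
A minimal Python sketch of that band-limiting in practice, assuming scipy is available and using synthetic noise as a stand-in for real speech:

    import numpy as np
    from scipy.signal import resample_poly

    SRC_RATE = 44_100  # e.g. a full-bandwidth studio capture
    DST_RATE = 8_000   # narrowband telephone sampling rate

    # Hypothetical input: one second of noise standing in for speech.
    x = np.random.default_rng(0).standard_normal(SRC_RATE).astype(np.float32)

    # 44100 * 80 / 441 == 8000; resample_poly's anti-aliasing filter
    # discards everything above the 4 kHz Nyquist limit before decimating,
    # which is exactly what the phone channel does to the signal.
    phone_like = resample_poly(x, up=80, down=441)
    print(len(x), len(phone_like))  # 44100 -> 8000 samples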

Now that bandwidth is becoming less of an issue, we will be getting less shitty-sounding, wider-bandwidth phone audio - https://en.wikipedia.org/wiki/Wideband_audio

Though if I had my way, a U87 or U47, or hey, even an SM7B, would be mandatory for all speech recordings :)



