
> Speech recognition is useful.

Now try to mute a video on YouTube and understand what's being said from the automatic subtitles.

If you do it in English, be aware that it's the best-performing language and all the others are even worse.



For some reason, YouTube is not using a very good STT system now. The lack of sentence punctuation is particularly annoying. Transcriptions by Whisper and Gemini 1.5 Pro are much better. From a couple of weeks ago:

https://news.ycombinator.com/item?id=41199567#41201773

I expect that YouTube will up their transcription game soon, too.


I've tried Whisper too. I made this: https://codeberg.org/ltworf/srtgen

Basically it's kinda useful for placing the time tags, but I need to manually fix each and every sentence. Sometimes I need to fix the time tags as well.
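
For anyone wondering what the time tags look like in an .srt file, here is a minimal sketch. It assumes the transcription step yields segments as (start, end, text) tuples with times in seconds; that shape and the names `srt_timestamp` and `to_srt` are hypothetical, for illustration only, and not necessarily how srtgen or Whisper do it.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT time tag: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples (assumed shape) as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        cues.append(f"{index}\n{time_line}\n{text.strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

The formatting is the easy part. Fixing a wrong sentence or a misplaced time tag still means editing the segments by hand before rendering them.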

I only mentioned YouTube because it's more popular and easy to test.


Sometimes speech-to-text machine learning models give very good results; however, I think the key is that:

1. It's overwhelmingly more useful than the [no text] it was replacing, particularly for the deaf or if you want to search for keywords in a video.

2. When it fails, it tends to do so in ways that trigger human suspicion and oversight.

Those aren't necessarily true of some of the things people are shoehorning LLMs into these days, which is why I'm a lot more pessimistic about that technology.


Just today, I received a note from a gas technician, partly handwritten. For the life of me I couldn't make out what he was saying. I asked ChatGPT and, surprisingly, it understood. Rereading the original note, I'm very sure it was correct.



