Considering that if you DO use VAD (voice activity detection), it's the best open weights voice recognition model by a very wide margin, it's quite good. I'd be willing to be that commercial products that "don't have this problem" are using VAD as well, and that this is well known to them. But Whisper is just the weights, and I suppose a simple reference implementation, not a full product.
> What good is a speech recognition tool that literally hears imaginary voices?
Well, if it is supposed to work after silence detection, then it is good for speech recognition I guess. It's like blaming a wheel why is it circular, you can't sit on it. It's a part of a larger machine.
On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.
The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.
Yes, you are holding it wrong. The good of it is that it does not output imaginary voices when used with VAD.
Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for itself
So if a tool has a process to have it perform at its best then it's a problem?
Do you also moan that before applying glue to a surface or it won't stick? Or if you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?
What good is a speech recognition tool that literally hears imaginary voices?