I’m pretty sure I heard somewhere that the ear does autocorrelation rather than a Fourier transform, but I’m not sure how correct that is.

EDIT: Scene_Cast2’s comment below says it’s Empirical Mode Decomposition, not autocorrelation.

Anyway, I see no reason why spectrograms have to be fuzzy: a wide window can locate frequencies very precisely while smoothing out fast variations in amplitude, which sounds pretty similar to how we hear things.

(Interestingly, when analysing the voice, linguists tend to do the opposite: they use a narrow window, which smears out frequencies and so makes the resonance bands (formants) more obvious, while still allowing visualisation of fast glottal vibrations.)
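
To put rough numbers on the wide-versus-narrow window trade-off, here is a minimal NumPy sketch. Everything in it is my own illustration rather than anything from the thread: the function name `mainlobe_width_hz`, the 8 kHz sample rate, the 1 kHz test tone, the Hann window and the two window lengths are all assumed values. It measures how wide the spectral peak of a pure tone comes out (at half its maximum power) for a given window length.

```python
import numpy as np

def mainlobe_width_hz(window_len, fs=8000.0, tone_hz=1000.0, nfft=65536):
    """Half-power width (Hz) of the spectral peak of a Hann-windowed pure tone.

    All parameters are illustrative assumptions, not values from the thread.
    """
    t = np.arange(window_len) / fs
    frame = np.hanning(window_len) * np.sin(2 * np.pi * tone_hz * t)
    # Zero-pad heavily so the peak shape is sampled finely in frequency.
    power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    # Keep only the bins at or above half of the peak power.
    in_peak = freqs[power >= power.max() / 2]
    return in_peak.max() - in_peak.min()

fs = 8000.0
for window_len in (64, 1024):  # "narrow" vs "wide" analysis window
    width = mainlobe_width_hz(window_len, fs=fs)
    duration_ms = 1000.0 * window_len / fs
    print(f"{window_len:5d} samples ({duration_ms:6.1f} ms): peak is about {width:.1f} Hz wide")
```

The wide window should give a much narrower peak (sharp in frequency) but averages over a much longer stretch of time, while the narrow window does the reverse, which is the smearing that makes broad resonance bands stand out.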