
> Rendering live audio is quite demanding: the system has to deliver n seconds of audio data every n seconds to the audio hardware. If it doesn’t, the buffer runs dry, and the user hears a nasty glitch or crackle: that’s the hard transition from audio, to silence.

I think you can do better when a buffer runs dry. Instead of outputting silence, you could keep outputting the same frequency spectrum you were producing right before the event. That way you would not hear any cracks or pops.

And of course you can fade out the effect when the buffer stays dry for more than a few seconds.

Obviously, you'd have to do some filtering when the audio resumes, because that transition can also introduce cracks.
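Roughly what I have in mind, as a numpy sketch (the names, fade length, and crossfade length are all made up, so treat it as an illustration rather than a recipe):

    import numpy as np

    FADE_OUT_BUFFERS = 50   # reach silence after ~50 missed buffers
    CROSSFADE = 64          # samples blended when real audio resumes

    def conceal(last_good, missed_count):
        # hold the magnitude spectrum of the last good buffer, but randomise
        # the phases so we aren't just looping the buffer verbatim
        mags = np.abs(np.fft.rfft(last_good))
        phases = np.random.uniform(0, 2 * np.pi, mags.shape)
        fake = np.fft.irfft(mags * np.exp(1j * phases), n=len(last_good))
        gain = max(0.0, 1.0 - missed_count / FADE_OUT_BUFFERS)
        return fake * gain

    def resume(last_fake, real):
        # the "filtering when the audio resumes": a short crossfade so the
        # return of real samples doesn't introduce its own crack
        ramp = np.linspace(0.0, 1.0, CROSSFADE)
        out = real.copy()
        out[:CROSSFADE] = (1 - ramp) * last_fake[-CROSSFADE:] + ramp * real[:CROSSFADE]
        return out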




There is really no such thing as 'instantaneous frequency response'. For any frequency to meaningfully exist, you need data for the corresponding period. E.g. if the audio contains content down to 20 Hz, you need at least 1/40th to 1/20th of a second of data for that to materialize.

Put another way - what you are proposing is looping the buffer, which is what some devices do; portable CD players were kinda notorious for it, and it doesn't sound much better than cracks or pops. Computers also have a tendency to fall into buffer looping when the system hangs (which is likely just the failure mode of Realtek codecs).


> There is really no such thing as 'instantaneous frequency response'

Yes, that's true; I'm proposing something that uses an approximation of it.

Consider it from a different angle: the inner ear essentially performs a Fourier transform. At every moment the "instantaneous" spectrum determines which hair cells are triggered. Now what I propose is to keep triggering those same hair cells (and not any others) when the buffer runs dry. The exact way of accomplishing this is left as an exercise (though taking an FFT over short windows could be a good approximation).
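For instance, the "instantaneous" spectrum could be estimated from just the last few milliseconds of good audio, something like this (the window length and sample rate are guesses, and it assumes the buffer is at least that long):

    import numpy as np

    def instantaneous_spectrum(last_good, sample_rate=48000, window_ms=10.0):
        # estimate which "hair cells" are active from a short Hann-windowed
        # slice at the very end of the last good audio
        n = int(sample_rate * window_ms / 1000)   # 480 samples at 48 kHz
        tail = last_good[-n:] * np.hanning(n)
        return np.abs(np.fft.rfft(tail))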


> The exact way of accomplishing this is left as an exercise

Perhaps you should undertake this exercise and let us know how it sounds :)

EDIT: In my experience with audio, when I have a bug that introduces even the slightest discontinuity (or even just a cusp) in the audio, well short of a pop to silence, I can still hear a "weirdness". Ears are pretty attuned to things that sound unnatural. I'm not confident that essentially "forging" the audio is going to sound natural.


What if you train a deep neural network on the song so far, so it can generate plausible-sounding music whenever the buffer drops?

You can even hang intentionally to generate original music!

(/s, please don't)


As another comment mentioned, this is done by conferencing software to deal with packet loss. It sounds like they either loop the previous DCT frame and gradually fade out, or feed the time domain output into a reverb, then cross-fade from 100% dry to 100% wet if the buffer is about to run out (someone on HN mentioned a while back that this approach was patented).
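A time-domain caricature of the loop-and-fade variant might look like this (my guess, not any particular product's algorithm; real implementations work on the codec's frames, and the constants here are invented):

    import numpy as np

    DECAY = 0.7    # each concealed frame is quieter than the last
    XFADE = 32     # samples crossfaded at the seam of every repeat

    def conceal_lost_frame(prev_frame, n_lost):
        # replay the previous good frame, smoothing the join and fading out
        ramp = np.linspace(0.0, 1.0, XFADE)
        frame = prev_frame.copy()
        frame[:XFADE] = (1 - ramp) * prev_frame[-XFADE:] + ramp * prev_frame[:XFADE]
        return frame * (DECAY ** (n_lost + 1))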

You could maybe make an argument that it would be useful in live music settings to prevent a bad situation from sounding even worse, and maybe you'd put it in some audio software so you can sort of still enjoy playing music on a crappy system. But really, it's best to have hardware and software that can 100% guarantee keeping up with audio processing.


I like the idea, but one problem is that you usually encounter a buffer underrun when the CPU can't keep up, so adding an extra step would require something like leaving enough processing headroom in each buffer to halt the real work early and run the approximator.
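Something like a per-buffer deadline check, in other words (process_block and approximate are placeholders, and the 25% headroom figure is pulled out of thin air):

    import time

    HEADROOM = 0.25   # fraction of the buffer period reserved for the fallback

    def render(block, buffer_period_s, process_block, approximate):
        deadline = time.monotonic() + buffer_period_s * (1.0 - HEADROOM)
        out = process_block(block, deadline)   # real DSP, expected to watch the deadline
        if out is None or time.monotonic() > deadline:
            out = approximate(block)           # cheap spectrum-freeze fallback
        return out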



"Output the same frequency spectrum" - takes some considerable processing! We're usually working in time-domain, not frequency-domain. I think you're saying do a load of processing to fill the gap when you don't have time to do processing?


Audio software often works in the frequency domain too. CPUs have optimized (SIMD) instructions for this kind of work (video codecs use them too).

Also, the article speaks of multithreaded software, where deadlines can be missed because of complicated dependencies. The end stage where you correct for missing samples can work independently of them in its own thread.


For the record, I think this is a terrible idea.

However, the way I'd do it if needed: keep the two most recent good buffers. When you need to synthesize, start running a phase vocoder based on the hop between those two. You get frozen sinusoids and some random noise for the bins that don't have one, and almost no CPU use on the happy path of no underruns.
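A minimal sketch of that freeze, assuming non-overlapping buffers (hop size equal to buffer size) and no windowing, so expect some seams at buffer boundaries:

    import numpy as np

    class SpectralFreeze:
        def __init__(self, prev_good, last_good):
            X1, X2 = np.fft.rfft(prev_good), np.fft.rfft(last_good)
            self.mags = np.abs(X2)
            self.phase = np.angle(X2)
            self.dphase = np.angle(X2) - np.angle(X1)  # per-bin phase advance per hop
            self.n = len(last_good)

        def next_buffer(self):
            # advance every frozen sinusoid by one hop's worth of phase
            self.phase = self.phase + self.dphase
            return np.fft.irfft(self.mags * np.exp(1j * self.phase), n=self.n)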

Still, don't do it :)


> For the record, I think this is a terrible idea.

It really depends on who you are asking. Some people just hate those loud cracks and pops, and would love to have something that filters them out naturally.


So you always calculate the frequency spectrum just in case of a dropout, instead of getting the actual code right?

Sorry, but if you ever find yourself in such a situation, stop, pause, make yourself a tea, and consider how wise the thing you are doing really is.


I actually think it's actively harmful to hide problems that can be otherwise fixed. If the CPU is too busy to keep filling the audio buffer, the solution is to increase the buffer size to put less stress on the scheduler. I recently reduced my buffer size in Ableton Live, but I knew I had to increase it because I could hear pops. If these pops were being covered up, I wouldn't have realized my buffer size was too small and I'd be unknowingly introducing subtle artifacts into every recording.


Ok, but a large buffer size means more latency.

Also, the settings you use for development do not have to be the same as those used in production.


> Ok, but a large buffer size means more latency.

Depending on the use-case, that may not be a problem. Not all things are latency-sensitive.


With anything besides headphones there will always be latency anyway: roughly 1 ms per foot of distance from the speaker (sound travels about 1,100 feet per second).


Based on nothing more than user experience, conferencing software like Zoom tends to do something that sounds quite like what you're describing, complete with the fade to silence after about half a second.

So it makes sense for certain live situations, but it wouldn't be desirable in studio recordings.


Assuming it's a viable strategy with regard to processing resources (hint: it's not, for anything more than toys), you will still have audible artifacts, especially around transients. Filtering and additional processing will only alter the signal even more.


> Instead of outputting silence, you could keep outputting the same frequency spectrum you were producing right before the event.

How does one detect whether it's a musical silence or a buffer underrun?


You'd probably do this at a layer where you have access to the buffer stats to know that the buffer is nearly empty.



