Audio Processing for Dummies (michaelfbryan.com)
395 points by gkbrk on Nov 10, 2019 | 47 comments



I wrote this crate [1], a compressor in Rust, which is the opposite of a noise gate: gain reduction is applied after the signal passes a threshold, instead of being applied when it falls below one.

If you want a really great approach to noise gating: a fixed threshold is fine, but it works better when you apply it to the difference of two envelope followers - one with a short attack and long release (tracks the input) and one with a long attack and short release (tracks the noise floor). It takes a bit to set up, but it's a stupid simple way to get extremely effective gating and it's easy to fine-tune for your application. A lot of Voice Activity Detection (VAD) works this way; it's just a matter of tuning the coefficients and thresholds for your input.
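Roughly, in Rust (a sketch of my own rather than code from the crate above - the time constants and the margin are arbitrary and would need tuning for your input):

    /// One-pole envelope follower with separate attack and release times.
    struct EnvelopeFollower {
        attack: f32,  // smoothing coefficient used while the level is rising
        release: f32, // smoothing coefficient used while the level is falling
        env: f32,
    }

    impl EnvelopeFollower {
        /// Convert a time constant in seconds into a one-pole coefficient.
        fn coeff(time_s: f32, sample_rate: f32) -> f32 {
            (-1.0 / (time_s * sample_rate)).exp()
        }

        fn new(attack_s: f32, release_s: f32, sample_rate: f32) -> Self {
            Self {
                attack: Self::coeff(attack_s, sample_rate),
                release: Self::coeff(release_s, sample_rate),
                env: 0.0,
            }
        }

        fn process(&mut self, sample: f32) -> f32 {
            let level = sample.abs();
            let c = if level > self.env { self.attack } else { self.release };
            self.env = c * self.env + (1.0 - c) * level;
            self.env
        }
    }

    /// Gate that opens when the signal envelope rises far enough above
    /// the tracked noise floor.
    struct AdaptiveGate {
        signal: EnvelopeFollower,      // short attack, long release: tracks the input
        noise_floor: EnvelopeFollower, // long attack, short release: tracks the floor
        margin: f32,                   // linear amplitude margin above the floor
    }

    impl AdaptiveGate {
        fn new(sample_rate: f32) -> Self {
            Self {
                signal: EnvelopeFollower::new(0.005, 0.250, sample_rate),
                noise_floor: EnvelopeFollower::new(1.0, 0.010, sample_rate),
                margin: 0.05,
            }
        }

        fn is_open(&mut self, sample: f32) -> bool {
            let sig = self.signal.process(sample);
            let floor = self.noise_floor.process(sample);
            sig - floor > self.margin
        }
    }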

Also useful references for envelope following are the DAFX text [2], Will Pirkle's textbook on audio effects in C++ [3], and Zölzer's text [4].

[1] https://github.com/m-hilgendorf/rusty-compressor

[2] https://www.amazon.com/DAFX-Digital-Effects-Udo-Z%C3%B6lzer/...

[3] https://www.amazon.com/Designing-Audio-Effect-Plugins-C/dp/1...

[4] https://www.amazon.com/Digital-Audio-Signal-Processing-Z%C3%...

(pdfs can be found around the internet)


The examples used in the OP are helped by having an RF squelch to zero out the noise floor. If there were no squelch, the difficulty of finding a good static (har har) threshold would have been much more apparent.


Can you explain what you mean by "squelch"? I'm assuming it's a kind of filter but I can't find it in the code.


Interesting project. In my experience as a re-recording mixer for film, the best denoisers currently on the market work as a sort of spectral noise gate: instead of running one noise gate over the whole audio spectrum, you split it into frequency bands, each with its own threshold. These thresholds are tuned by “learning” the spectral shape of the background noise itself, and usually the user can still modify the weights with a curve (e.g. if you want to gate high frequencies more).

This has the benefit of producing more natural results while keeping speech understandable. It might interest you, because with a 2000x speed margin you could still crank the algorithm up.
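To make the idea concrete, here is a crude sketch of per-bin gating using the rustfft crate (assuming a recent version of it; this is nothing like the actual product implementations, and it skips the windowing and overlap-add you would need to avoid frame-boundary artifacts):

    use rustfft::{num_complex::Complex, FftPlanner};

    /// Attenuate bins whose magnitude is less than `factor` times the learned
    /// noise magnitude for that bin. `noise_mag` would be measured beforehand
    /// from a noise-only section ("learning" the noise profile).
    fn spectral_gate_frame(frame: &mut [f32], noise_mag: &[f32], factor: f32) {
        let n = frame.len();
        let mut planner = FftPlanner::<f32>::new();
        let fft = planner.plan_fft_forward(n);
        let ifft = planner.plan_fft_inverse(n);

        let mut bins: Vec<Complex<f32>> =
            frame.iter().map(|&s| Complex::new(s, 0.0)).collect();
        fft.process(&mut bins);

        for (bin, &noise) in bins.iter_mut().zip(noise_mag) {
            if bin.norm() < factor * noise {
                // Attenuate instead of hard-zeroing so it sounds less "gated".
                bin.re *= 0.1;
                bin.im *= 0.1;
            }
        }

        ifft.process(&mut bins);
        // rustfft does not normalise, so divide by N after the round trip.
        for (out, bin) in frame.iter_mut().zip(&bins) {
            *out = bin.re / n as f32;
        }
    }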


I like the idea of breaking things up by frequency! Luckily my application (radio transmissions) is quite amenable to this naive noise gate, but I can imagine for more complex audio (e.g. if the radio doesn't have a built-in squelch) having tighter control over what triggers the gate would greatly improve quality.


Multi-band compressors also help in this domain, choosing which frequency band to attenuate or boost. For multi-channel audio, phase becomes important too. Not for nothing is it a typical trick in mastering to restrict low frequencies to mono.


I always thought the primary reason for that was that stereo low frequencies, when recorded onto vinyl records, hugely increase the risk of the needle 'jumping' out of the groove and skipping into another one.


Is an audio stream really just a bunch of samples of volume? I had always assumed you needed to record different wavelengths of sound to capture different pitches. Is my mental model completely wrong? How do you get such variety in sound just from recording the volume?


PCM doesn't make any special case for different wavelengths. It's simply the air pressure over time (the pressure on your ears, on the speaker's membrane, etc.). It doesn't directly correspond to the volume ("f(t)=1" is as silent as "f(t)=0").

So yes, an (uncompressed) audio stream is just a sequence of samples of pressure (or displacement, depending on the medium).

The variety you get comes from the high sampling rate: sampling fast enough lets the stream represent vibrations of the air pressure at many different frequencies at once.
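To make that concrete, here's a tiny sketch (the numbers are arbitrary): one second of "audio" at 44.1 kHz is nothing more than 44,100 pressure values, and a 440 Hz tone is just those values tracing out a sine wave.

    use std::f32::consts::TAU;

    fn main() {
        let sample_rate = 44_100.0_f32;
        // One second of "audio": each entry is just the instantaneous pressure
        // (here, a 440 Hz sine wave) measured 44,100 times per second.
        let samples: Vec<f32> = (0..44_100)
            .map(|n| (TAU * 440.0 * n as f32 / sample_rate).sin())
            .collect();
        println!("first few samples: {:?}", &samples[..5]);
    }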


>Is an audio stream really just a bunch of samples of volume?

Yes.

> I had always assumed you needed to record different wavelengths of sound to capture different pitches

Why? Look at a mono vinyl record. It has a single groove, which is the soundwave carved into the vinyl. Digital audio is the same, except you store a digital image of the groove.

>Is my mental model completely wrong?

Yes.

>How do you get such variety in sound just from recording the volume?

By recording the volume 44100 times per second.


super simplified explanation

Samples of pressure (or displacement, minor phase difference), so kinda yeah.

How does one get so much variety? By sampling really often. If you sample 40,000 times per second you actually get the full information for all pitches between 0 and 20,000 Hz.

If you don’t sample it often enough you don’t get all the information needed to store the sound. So yes we kinda indirectly do capture all the different pitches.


Sound propagates by displacing the medium it's traveling in. So at every point in time we can only sample the displacement (amplitude/volume). Our cochlear hair cells do a Fourier transform to decompose the sound into component frequencies for the brain to analyze. Similarly, an audio stream is just samples of the amplitude over time, and we need to do additional processing to extract the frequencies.


Yes, it's only a stream of samples.

The magic is the linearity of the signals. They are additive. The different signals pass through the same stream of samples without interacting.

If you add the volumes of a 400 Hz sine wave and a 300 Hz sine wave, you can transfer them together in one stream of samples, then separate them later.

Mathematically, for any linear operation f (like the Fourier transform used to pull them apart): f(a·x + b·y) = a·f(x) + b·f(y)
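A toy illustration of that additivity (my own sketch, numbers arbitrary): mix a 300 Hz and a 400 Hz sine into one stream of samples, then recover each one's amplitude by correlating against a reference sine/cosine at that frequency - effectively computing a single DFT bin by hand:

    use std::f64::consts::TAU;

    /// Estimate the amplitude of the `freq` Hz component of `samples` by
    /// correlating against a sine and a cosine at that frequency.
    fn amplitude_at(samples: &[f64], freq: f64, sample_rate: f64) -> f64 {
        let (mut re, mut im) = (0.0, 0.0);
        for (n, &s) in samples.iter().enumerate() {
            let phase = TAU * freq * n as f64 / sample_rate;
            re += s * phase.cos();
            im += s * phase.sin();
        }
        // Scale so that a unit-amplitude sine reports roughly 1.0.
        2.0 * (re * re + im * im).sqrt() / samples.len() as f64
    }

    fn main() {
        let sample_rate = 44_100.0;
        // One stream of samples carrying a 300 Hz and a 400 Hz sine at once.
        let samples: Vec<f64> = (0..44_100)
            .map(|n| {
                let t = n as f64 / sample_rate;
                0.3 * (TAU * 300.0 * t).sin() + 0.7 * (TAU * 400.0 * t).sin()
            })
            .collect();

        // Each component can be pulled back out of the combined stream.
        println!("300 Hz: {:.2}", amplitude_at(&samples, 300.0, sample_rate)); // ~0.30
        println!("400 Hz: {:.2}", amplitude_at(&samples, 400.0, sample_rate)); // ~0.70
    }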


When a speaker cone is moving in and out at 100 Hz you will hear a sine bass sound. But when the cone is also vibrating at 10,000 Hz while moving in and out at 100 Hz, there will be another pitch on top of the bass sound.

So different wavelengths added together will tell your brain there are different pitches at the same time.

When all those different wavelengths are using the same volume at the same time you can imagine that all those added waves will become a mess. That is what is called noise.

But when a high pitch wave is 'traveling' on top of a low pitch wave your brain can distinguish the different pitches.


Suppose you're at a rock concert. You hear several string instruments, drums, maybe a keyboard, and a voice or two all together at the same time.

What reaches your ears is continuous differences in the air pressure. That's more-or-less a single changing 'amplitude'. From that, your brain can pull apart the fundamental and overtone pitches from each source ... and sort them out so you can identify each one. Amazing technology ... almost indistinguishable from magic!

These days lots of people haven't seen audio on an oscilloscope ... which shows changes in amplitude with time. Look at this video. You'll see that at each point in time (left to right), there's ONE amplitude, whether he plays one string or more.

https://www.youtube.com/watch?v=Dd5dgajeIOA


You were probably mixing up the time domain (just samples representing the desired speaker-membrane position) with the frequency domain (the energies of a bunch of sines at different frequencies, which you get by running an FFT on the samples).

Both exist and are being used, but samples are the default in uncompressed audio.


> How do you get such variety in sound just from recording the volume?

https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampli...


You don't record each "wave" (nor is it even possible); you record the final result of all the waves superposed together, just like how your ears work.


>You don't record each "wave" (nor is it even possible)

Depends on what you mean by 'record'. Generally the air pressure is sampled instantaneously, but it can be broken down into frequency components before recording/perceiving. That's kind of how our ears work and in DSP you do an FFT (which you can then pass through an iFFT to synthesize the original signal).


It has been a while since I studied DSP, so I'm not sure if the FFT can mathematically recover all the original sine waves (the FT can). But I don't think it is possible practically, at least.


Yes, if you use the FFT or any discrete FT, you can perfectly recreate the signal up to the Nyquist frequency. To perfectly recreate the whole signal, you'd need infinite Fourier coefficients.


Thanks


If you want to learn more, this article has a good intro to sampling theorem, among other things: http://people.xiph.org/~xiphmont/demo/neil-young.html


When the sample rate is fixed, different pitches show up as waveforms with a different number of samples per cycle. Each sample is a constant time step apart, so you can count the number of samples (= length of time) in a period. Frequency is inversely proportional to period.
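A toy sketch of that counting idea (it assumes a clean, single-pitch signal; real pitch detection needs much more care): measure the spacing between upward zero crossings and invert it.

    /// Estimate the frequency of a clean periodic signal by averaging the
    /// spacing between upward zero crossings.
    fn estimate_frequency(samples: &[f32], sample_rate: f32) -> Option<f32> {
        // Indices where the waveform crosses zero going upwards.
        let mut crossings = Vec::new();
        for i in 1..samples.len() {
            if samples[i - 1] <= 0.0 && samples[i] > 0.0 {
                crossings.push(i);
            }
        }
        if crossings.len() < 2 {
            return None;
        }
        // Average number of samples per period over all observed periods,
        // then invert: frequency = sample_rate / period.
        let span = (crossings[crossings.len() - 1] - crossings[0]) as f32;
        let period = span / (crossings.len() - 1) as f32;
        Some(sample_rate / period)
    }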


Thanks for all the explanations! Super helpful!


The interesting thing with this example in particular is that the audio output from the radio is already very well-gated thanks to an RF squelch (i.e. muted until a high enough RF level or pilot tone is detected - otherwise it would sound like a loud FM radio tuned to an empty band between transmissions, and the gate wouldn't work). So internally there's already a voltage signal somewhere to indicate mute/unmute, with better precision than deriving it from the audio.

Of course it's a fun software exercise, but if you were building a hardware product out of it you'd be better off splitting the audio into distinct transmissions based on the squelch signal rather than an audio gate...

As it happens some modern avionic radio systems let you individually play back the most recent few messages for exactly the reasons the author describes (it's easy to miss)


> The interesting thing with this example in particular is that the audio output from the radio is already very well-gated thanks to an RF squelch

Author here. Luckily for my purposes I can rely on the radio already having a squelch, so I could get away with using the naive solution.

If it was the raw audio I'd probably need to implement something more sophisticated such as splitting the audio by frequency and giving each band a noise threshold (as atoav mentioned).


While this is an article about programming, I'd like to mention that if you need to remove background noise from an audio clip, Audacity has an excellent filter for this. Of course that's not anywhere as adaptable as doing it in code yourself.


If you don't care about open source and don't mind spending money, Izotope RX7 has much better denoising tools than Audacity. It's actually the main reason I now have Windows installed on my desktop (alongside Ubuntu).


I know a lot of things about Audacity can be a bit clunky, but the quality of the denoise tool in Audacity is actually extremely good. I read its code one time, and I was impressed.

I'm not familiar with RX7; maybe it gives better audio quality, but it's probably also got a better interface/workflow, which is definitely worth a lot in a professional setting.


Audacity is open source though so one could always dig into the code if they wanted.

Plus I don’t think it’s fair to say “not anywhere as acceptable as doing it in code yourself”. I understand the point of your disclaimer but as someone who’s been writing code for three decades, I really love the fact that we no longer have to write our own tools.


I mean adaptable. I don't know how much you could automate with Audacity in this regard. With your own code, you can.


Apparently Audacity supports automation from a variety of different languages [1], though I’ve not tried it myself so I can’t comment on how good it is. However, you’re right that if you’re writing software that depends on such a tool, it’s better to have something you can compile into your software (or at least an .so / .dll you can redistribute).

[1] https://wiki.audacityteam.org/wiki/Automation


Are there companies/projects working on cleaning up radio signals like that in real time with digital signal processing?

I used to work in the music industry, where there are many other real-time processing tools that can be applied to radio voice to increase clarity (e.g. removing the static noise under speech, re-equalizing the limited bandwidth to increase perceptibility, de-reverberation, etc.).

Considering such radio communications are often used in mission-critical scenarios, one would think clarity of speech would be a factor to consider.


There is software to do this. You can even configure your streaming / recording tool of choice to use that processed audio directly.

For the first couple of video courses I made, my workflow was:

- Open REAPER (a software DAW with a business model like Sublime Text)

- Create and save a noise gate sample based on my specific room noise using its "ReaFIR" VST plugin until it was good

- Use a piece of software (ASIO Link Pro) to act as a virtual audio switch

- Open my video recording tool of choice (Camtasia)

- Configure Camtasia's microphone to be a virtual device that ASIO Link Pro created on my system

So now the end result is that the audio being recorded live has already been processed by REAPER and has no background noise. I also added effects like a compressor and EQ to offset my very thin-sounding microphone at the time. Totally seamless with no latency issues. You could do the same thing with OBS, but OBS is even easier since you can directly use VST plugins, so you don't even need REAPER or ASIO Link Pro running. You can download REAPER's VSTs separately.

Nowadays I just use hardware to do the same thing in real time so I don't need to think about it and I have flexibility in the recording tools I use, although the above approach works fine once you have it all configured. You would just open REAPER before your recording tool and it was ready to go.

Personally I would never go back to editing in post production for recording voices with a microphone. The amount of time the above saves is massive because it's all done on the fly with zero human intervention once it's set up. There's no complex import / export / import flow needed or fiddling with things for every recording.


Digital communication has a wholly different set of requirements than audio processing. With audio it is good enough to sound good, but digital communication has far more specific metrics of performance.

Digital communication is also far more structured, which makes it possible to implement a large number of techniques for improving signal quality.

Almost all digital communication systems include something along the lines of: adaptive equalization, carrier tracking, symbol clock tracking, forward error correction, and many, many more techniques.


I think franky47 was referring to using DSP in the audio chain to clean it up, not digital modes for communications (P25/DMR/Tetra).


I agree that the technical requirements for transmission are very different than in a professional audio context. What I meant is that, once the signal has been acquired, there could be some additional processing to make it "sound better" (i.e. increase speech intelligibility), independently of the transport method.


Where can one read about these techniques and are there open source implementations?


I’ve only learned about this through my engineering studies, so I can only really recommend textbooks. I found Proakis’ Digital Communications quite good, but it doesn’t go very deep.

I don’t know of any online resources – but I’d love to have the time to write some; signal processing for communication is fascinating and has a wide impact on everyday life.

Also, wrt. implementations: I think GNU Radio might be a good place to look, but honestly the actual implementation of these algorithms is often very simple; it is the theory behind them that gets hairy.


You can dig into the Bluetooth specification and everything around it, for example.


Is this a good example of how things are done in Rust, or is it over-engineered?

I'm pretty sure you should be able to code a thing like a noise gate in 10-15 lines of code.
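Something like this, say - a bare-bones sketch (my own, not the article's code) that just zeroes anything under an amplitude threshold; handling release times and splitting the stream into separate clips is where the extra lines go:

    /// Zero out every sample whose amplitude is below `threshold`.
    /// No attack/release/hold, so it will chatter on real-world audio.
    fn hard_gate(samples: &mut [f32], threshold: f32) {
        for sample in samples.iter_mut() {
            if sample.abs() < threshold {
                *sample = 0.0;
            }
        }
    }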

... and then there's this rather odd statement:

> If we want to use the NoiseGate in realtime applications we’ll need to make sure it can handle typical sample rates.

Maybe I just think it's funny. Back in the early 2000s I used to worry about that, and even back then it's... 44,100 samples per second. A whole second to process no more than 44,100 numbers, maybe twice that if you have stereo. Unless you're doing something really weird (and Rust should be pretty performant, right?), that shouldn't make the machine even blink.


I understand that implementing stuff in Rust can just be fun on its own, but, I mean, in case somebody really just wants to solve the problem Michael had here:

sox -V3 N11379_KSCK.wav part.wav silence 1 3.0 0.1% 1 0.3 0.1% : newfile : restart


Very interesting. I am starting to work on bird sounds and bird ID. Hope this can help. Thanks.


That's awesome! One very simple thing you can do with that is pass the signal through a bandpass filter (say 2 kHz-10 kHz) before evaluating power levels to minimize false positives from wind and other environmental noises.
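For instance, a rough sketch of that kind of pre-filter (using the standard RBJ "Audio EQ Cookbook" biquad formulas; the cutoffs and Q here are arbitrary): cascade a 2 kHz high-pass with a 10 kHz low-pass and measure power on the result.

    use std::f32::consts::PI;

    /// Direct-form-I biquad filter.
    struct Biquad {
        b0: f32, b1: f32, b2: f32,
        a1: f32, a2: f32,
        x1: f32, x2: f32,
        y1: f32, y2: f32,
    }

    impl Biquad {
        fn from_coeffs(b0: f32, b1: f32, b2: f32, a0: f32, a1: f32, a2: f32) -> Self {
            Self {
                b0: b0 / a0, b1: b1 / a0, b2: b2 / a0,
                a1: a1 / a0, a2: a2 / a0,
                x1: 0.0, x2: 0.0, y1: 0.0, y2: 0.0,
            }
        }

        /// RBJ cookbook low-pass.
        fn low_pass(cutoff: f32, q: f32, sample_rate: f32) -> Self {
            let w0 = 2.0 * PI * cutoff / sample_rate;
            let alpha = w0.sin() / (2.0 * q);
            let c = w0.cos();
            Self::from_coeffs(
                (1.0 - c) / 2.0, 1.0 - c, (1.0 - c) / 2.0,
                1.0 + alpha, -2.0 * c, 1.0 - alpha,
            )
        }

        /// RBJ cookbook high-pass.
        fn high_pass(cutoff: f32, q: f32, sample_rate: f32) -> Self {
            let w0 = 2.0 * PI * cutoff / sample_rate;
            let alpha = w0.sin() / (2.0 * q);
            let c = w0.cos();
            Self::from_coeffs(
                (1.0 + c) / 2.0, -(1.0 + c), (1.0 + c) / 2.0,
                1.0 + alpha, -2.0 * c, 1.0 - alpha,
            )
        }

        fn process(&mut self, x: f32) -> f32 {
            let y = self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
                - self.a1 * self.y1 - self.a2 * self.y2;
            self.x2 = self.x1;
            self.x1 = x;
            self.y2 = self.y1;
            self.y1 = y;
            y
        }
    }

    /// Band-limit the signal to roughly 2-10 kHz before computing average power.
    fn band_limited_power(samples: &[f32], sample_rate: f32) -> f32 {
        let mut hp = Biquad::high_pass(2_000.0, 0.707, sample_rate);
        let mut lp = Biquad::low_pass(10_000.0, 0.707, sample_rate);
        let sum_sq: f32 = samples
            .iter()
            .map(|&s| lp.process(hp.process(s)))
            .map(|y| y * y)
            .sum();
        sum_sq / samples.len() as f32
    }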

Another method that can be effective for the identification itself is to convert the signal to a spectrogram and classify the resulting images with a neural net.


Vote for the second, but you won't need a neural net; a sum-of-squared-differences match against a set of reference spectra would probably get you 99% of the way there.


Isn't this really easy to do on these clips because there is no noise when there is no audio transmission? It would be awesome if you showed some examples where there is a bunch of noise between audio transmissions.



