Also check out Codec2, which is open source as well and offers really good quality down to 700 bit/s. It has been ported to small MCUs such as the ESP32 and STM32, and is also supported by Arduino libraries.
Yes, that is done on purpose. The goal of the project is to be used in HF/VHF voice communications so it makes sense. That low bandwidth usage allows it to be employed from ordinary HAM gear down to cheap LoRa modules, a feature that opens huge possibilities like building point to point or multipoint encrypted communications with portable devices not tied to cellphone towers, which in some areas of the world would be so useful these days.
>employed from ordinary HAM gear down to cheap LoRa modules, a feature that opens huge possibilities like building point to point or multipoint encrypted communications
I hope you aren't suggesting that encryption should be used on amateur bands.
It's not legal. Most countries prohibit encryption of Amateur Radio transmissions in most cases. Some countries have exceptions such as emergency communications or satellite control. [0]
Is there an open codec that concentrates on low CPU usage? I'm fine with it not being very bandwidth efficient.
Opus is a very good codec, but it's not amazing CPU-wise. I work on a VR world, and audio encoding is usually our most limiting factor when running on a VPS. We have the capability to negotiate codecs, so the high CPU/low bandwidth use case is already covered.
What I'm looking for specifically:
* Low CPU usage
* Support for high bitrate, suitable for music and sounds other than voice
You can bootleg your own fast lossless codec by doing delta-encoding on the raw PCM to get a lot of zeros and then feeding it through an off-the-shelf fast compressor like snappy/lz4/zstandard/etc. It won't get remotely close to dedicated audio algorithms, but I wouldn't be surprised if you cut your data size by a factor of 2-4 at essentially no CPU cost compared to raw uncompressed audio.
Looks like my initial estimate of 2-4 was way off (when FLAC achieves ~2 this should've been a red flag), but you do get a ~1.36x reduction in size at basically memory read speed.
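For reference, a minimal sketch of that first attempt (plain per-channel deltas fed straight to a fast compressor); the `samples` array and the zstd Python binding are assumptions on my part:

import numpy as np
import zstd  # the "zstd" pip package; lz4 or snappy would work similarly

# samples: int16 PCM, shape (num_frames, num_channels), assumed to be loaded already.
deltas = np.diff(samples, prepend=samples.dtype.type(0), axis=0)  # per-channel deltas, mostly near zero
compressed = zstd.ZSTD_compress(deltas.tobytes())
print(len(samples.tobytes()) / len(compressed))  # this is where the ~1.36x figure above comes from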
Using an encoding of the second-order differences that stores -127 <= d <= 127 in 1 byte and everything else as an escape byte plus 2 value bytes (for 16-bit input), I got a ratio of ~1.50 for something that can still operate entirely at RAM speed:
import numpy as np
import zstd  # the "zstd" pip package

# samples: 16-bit PCM as an int16 numpy array of shape (num_frames, num_channels).
orig = samples.tobytes()
deltas = np.diff(samples, prepend=samples.dtype.type(0), axis=0)       # Per-channel deltas.
delta_deltas = np.diff(deltas, prepend=samples.dtype.type(0), axis=0)  # Per-channel second-order differences.

# Most differences are small: encode those using 1 byte each, and use 3 bytes
# (an escape byte plus 2 value bytes) for larger ones. Interleave channels and encode.
flat = delta_deltas.ravel()
small = np.sum((flat >= -127) & (flat <= 127))
bootleg = np.zeros(small + (len(flat) - small) * 3, dtype=np.uint8)
i = 0
for dda in flat:
    if -127 <= dda <= 127:
        bootleg[i] = dda + 127  # 0..254; 255 is reserved as the escape marker.
        i += 1
    else:
        bootleg[i] = 255  # Escape marker.
        bootleg[i + 1] = (int(dda) + 2**15) % 256
        bootleg[i + 2] = (int(dda) + 2**15) // 256
        i += 3

compressed_bootleg = zstd.ZSTD_compress(bootleg.tobytes())
print(len(orig), len(compressed_bootleg))

# Decode, using only the decompressed byte stream.
decompressed_bootleg = zstd.ZSTD_uncompress(compressed_bootleg)
result = []
i = 0
while i < len(decompressed_bootleg):
    if decompressed_bootleg[i] < 255:
        result.append(decompressed_bootleg[i] - 127)
        i += 1
    else:
        lo = decompressed_bootleg[i + 1]
        hi = decompressed_bootleg[i + 2]
        result.append(256 * hi + lo - 2**15)
        i += 3

decompressed_delta_deltas = np.array(result, dtype=samples.dtype).reshape(delta_deltas.shape)
decompressed_deltas = np.cumsum(decompressed_delta_deltas, axis=0, dtype=samples.dtype)
decompressed = np.cumsum(decompressed_deltas, axis=0, dtype=samples.dtype)
assert np.array_equal(samples, decompressed)
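For completeness, one way the `samples` array used in the snippets above could be produced, and how the final ratio would be checked; the soundfile library and the filename are assumptions on my part, any int16 (frames, channels) array will do:

import numpy as np
import soundfile as sf  # assumed; any loader yielding int16 PCM works

samples, sample_rate = sf.read("input.wav", dtype="int16", always_2d=True)
samples = np.ascontiguousarray(samples)  # shape (num_frames, num_channels), dtype int16
# ...run the snippet above, then:
# print(len(orig) / len(compressed_bootleg))  # the ~1.50 ratio quoted above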
While I'd also like a low-computation codec that can save space, the historical use cases unfortunately assume spending a lot more CPU to save a lot of bandwidth, so there's little research in this area. There's also no real incentive to make an audio equivalent of ProRes or DNxHD: if you're editing audio, SSDs have become so fast that you'll run into CPU problems first.
No, G.722 is still a wideband speech codec. Its audio bandwidth only goes up to 7 kHz, while the uncompressed audio this thread began with goes up to 22 kHz. With G.722 you're losing most overtones, or even all overtones from the top of a piano. Please don't use G.722 for music apart from on-hold muzak.
I did a prototype of a 3D low-latency server-side mixing system, based on a hypothetical 4k clients at 48 kHz, each receiving a mix of the 64 loudest clients, using Opus forced to CELT mode only with 256-sample stereo frames at 128 kbps. It worked well, using only 6 cores for that workload. The mixing was trivial, and the decode and encode of 4k streams was entirely doable. The issue at that rate was 1.5M network packets a second. If I were to revisit it, I'd look at using a simple MDCT-based codec with a simple psychoacoustic model based on MPC (minus CVD), modified for shorter frames and for MDCT behaviour versus PQMF behaviour, without any Huffman or entropy coding, and put that codec on the GPU. Small tests I did using a 1080 Ti indicated ~1M clients could be decoded, mixed and encoded (same specs as above); the problem is then how to handle ~370M network packets a second :)
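A rough sketch of the "mix the 64 loudest clients" idea described above (not the actual prototype); the frame size, array shapes, and the RMS loudness measure are assumptions:

import numpy as np

FRAME = 256   # stereo samples per frame, as in the prototype described above
TOP_K = 64    # number of loudest clients mixed per listener

def mix_loudest(decoded_frames, top_k=TOP_K):
    # decoded_frames: float32 array of shape (num_clients, FRAME, 2),
    # one decoded stereo frame per client, normalized to [-1, 1].
    rms = np.sqrt(np.mean(decoded_frames.astype(np.float64) ** 2, axis=(1, 2)))
    loudest = np.argsort(rms)[-top_k:]         # indices of the top_k loudest clients
    mix = decoded_frames[loudest].sum(axis=0)  # trivial sum; a real mixer would apply per-listener gains
    return np.clip(mix, -1.0, 1.0)             # naive limiter to keep the mix in range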
Edit: Had high hopes for High Fidelity, and came very close to asking for a job there ;) Shame it’s kaput, didn’t know that :(
Those are interesting ideas, thanks! I'll have to try and play with that.
High Fidelity the company is still around, but they pivoted radically multiple times. Initially their plan was social VR of sorts. Then they tried to make a corporate product for meetings and such, and gave up on that right before COVID-19 hit!
And after that they ripped out all the 3D and VR and scaled down to a 2D, overhead spatial audio web thing. Think something like Zoom, only you have an icon that you can move around to get closer or further to other people.
The original code still lives on; we picked it up and are working on improvements. Feel free to visit our Discord (see my profile).
Apparently the RP1 team handles bigger crowd loads through muxing on the server, but I'm not sure exactly how that works out for spatial audio. There is a Kent Bye Voices of VR podcast discussing how they got 4k users in the same shard.
Why is your VPS doing the encoding rather than the clients? Are you combining talkers into one source to handle crowds and avoid N^2 or something, and need to re-encode after combining?
It's a community-led continuation of High Fidelity, a dead commercial project. They made their own proprietary codec with excellent performance, which we can't use, and managed to have a couple thousand people in the same server.
I'm skeptical; those are almost always going to be implemented in hardware, so the complexity of a software encoder isn't a design concern.
There is some correlation between the cost of a hardware implementation and complexity of a software implementation. SBC is a very simple codec, but AptX and LC3 might not be much better than Opus.
If you are OK with moderately high bitrates, you might prefer something simpler like an ADPCM scheme. ADPCM is pretty easy to implement, certainly a lot less math-heavy than MDCT-based schemes, and it achieves good quality at a somewhat higher bitrate (I have no data, but I'd guess around 200-250%).
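To give a flavour of how simple this can be, here is a rough IMA-style ADPCM sketch (4 bits per 16-bit sample, mono). The step table is generated to roughly follow the IMA growth curve rather than copied from the spec, and all names are illustrative; encoder and decoder only need to agree on the table:

import numpy as np

STEP_TABLE = [min(32767, int(7 * 1.1 ** i)) for i in range(89)]  # approximate IMA-style step sizes
INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8]

def _apply(code, predictor, index):
    # Shared by encoder and decoder: apply one 4-bit code to the predictor state.
    step = STEP_TABLE[index]
    diff = step >> 3
    if code & 4:
        diff += step
    if code & 2:
        diff += step >> 1
    if code & 1:
        diff += step >> 2
    predictor = predictor - diff if code & 8 else predictor + diff
    predictor = max(-32768, min(32767, predictor))
    index = max(0, min(88, index + INDEX_TABLE[code & 7]))
    return predictor, index

def adpcm_encode(samples):
    # samples: iterable of int16 values; returns a list of 4-bit codes.
    predictor, index = 0, 0
    codes = []
    for s in samples:
        step = STEP_TABLE[index]
        diff = int(s) - predictor
        code = 8 if diff < 0 else 0   # sign bit
        diff = abs(diff)
        if diff >= step:
            code |= 4
            diff -= step
        if diff >= step >> 1:
            code |= 2
            diff -= step >> 1
        if diff >= step >> 2:
            code |= 1
        predictor, index = _apply(code, predictor, index)  # keep encoder state in sync with the decoder
        codes.append(code)
    return codes

def adpcm_decode(codes):
    predictor, index = 0, 0
    out = []
    for code in codes:
        predictor, index = _apply(code, predictor, index)
        out.append(predictor)
    return np.array(out, dtype=np.int16)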
I really wish something like that would come out. Same with modern tech in cassette and MiniDisc form. Something new, but with that kind of hardware. I love physical media!
Subjectively beating Opus on quality-to-bit rate is quite impressive, but I noticed the samples had some interesting audible artifacts. I wonder where these come from, and if they're related in any way to this codec using machine learning techniques.
It'd be interesting to see what the lift would be to get encoding & decoding running in WebAssembly/wasm. Further, it'd be really neat to try to take something like the tflite_model_wrapper[1] and get it backed by something like tfjs-tflite[2], perhaps running atop, for example, tfjs-backend-webgpu[3].
Longer run, the web-nn[4] spec should hopefully simplify/bake some of these libraries into the web platform and make running inference much easier. But there's still an interesting challenge & question that I'm not sure how to tackle: how to take native code, compile it to wasm, but have some of the implementation provided elsewhere.
At the moment, Lyra V2 can already use XNNPACK[4], which does have a pretty good wasm implementation. But trying to swap out implementations, so for example we might be able to use the GPU or other accelerators, could still have some good benefits on various platforms.
Things missing that I think would have to be added before this could become a widely used standard:
* Variable bitrate. For many uses, the goal isn't 'fill this channel', but instead 'transmit this audio stream at the best quality possible'. That means filling the channel sometimes, but other times transmitting less data (i.e. when there is silence, or when the entropy of the speech being sent is low, for example because the user is saying something very predictable); see the sketch after this list.
* Handling all types of audio. Even something designed for phone calls will occasionally be asked by users to transmit music, sound effects, etc. The codec should do an acceptable job at those other tasks.
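As a toy illustration of the silence case mentioned above, here is a hedged sketch of a DTX-style gate that only hands audible frames to the encoder; the threshold and function names are made up for the example:

import numpy as np

SILENCE_THRESHOLD_DB = -50.0  # arbitrary gate threshold for this illustration

def frames_to_send(frames):
    # frames: float32 array of shape (num_frames, frame_len), normalized to [-1, 1].
    # Yields (index, frame) for audible frames and (index, None) for silent ones,
    # where a real system would send only a tiny silence/comfort-noise marker.
    for i, frame in enumerate(frames):
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        level_db = 20 * np.log10(rms)
        yield (i, frame if level_db > SILENCE_THRESHOLD_DB else None)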
That's true, but in this era of high quality voice and video calls, it's not that uncommon for someone to want to play a song or even a live instrument, so some capability for handling that intelligently seems important.
I believe the limitation was actually not the mics. You can fit a lot more phone calls into a given bandwidth if you heavily restrict the bandwidth of each phone call.
As an ML person I totally get how the NN works to accomplish this, and it's very cool. What's really cool is how they get this to work in real time with little to no latency.
https://www.rowetel.com/?page_id=452
https://github.com/deulis/ESP32_Codec2
https://www.arduino.cc/reference/en/libraries/codec2/