Computing Adler32 Checksums at 41 GB/s (wooo.sh)
98 points by wooosh on Aug 7, 2022 | 27 comments



Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )


Here's another SIMD implementation, with commentary: https://github.com/google/wuffs/blob/main/std/adler32/common...

Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.

"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...


Ooh, now that is very interesting. I would really love to see how this speeds up the runtime of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things up significantly then it's possible there'll be a huge shakeup!

[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...

[1] https://github.com/veluca93/fpnge

[2] https://twitter.com/richgel999/status/1485976101692358656


Unfortunately I haven't had the time to do a proper benchmark, and the fpng test executable only decodes/encodes a single image, which produces very noisy, inconclusive results. However, I'm under the impression that it doesn't make a large difference in terms of overall time.

fpnge (which I wasn’t aware of until now) appears to already be using a very similar (identical?) algorithm, so I suspect the relative performance of fpng and fpnge would not be significantly impacted by this change.


As someone who has been recently optimising fpnge: Adler32 computation is pretty much negligible in terms of overall runtime. The Huffman coding and filter search take up most of the time. (IIRC fpng doesn't do any filter search, but its Huffman encoding isn't vectorized, so I'd expect that to dominate fpng's runtime.)


If image encode/decode speed is the only concern, libjpeg-turbo is going to be orders of magnitude faster than any of these lossless schemes. With JPEG, you can encode 1080p bitmaps in <10 milliseconds (per thread) on any consumer PC made in the last decade.

The frequency domain is a really powerful place to operate in when you are dealing with this amount of data.


That's not true. libjpeg-turbo is ~50 MB/s last I tried - plus it's not lossless. fjxl and fpnge are basically an order of magnitude faster than that. libjpeg-turbo isn't even the fastest JPEG codec - you should check out the (relatively obscure) libmango - roughly 1 Gbps decode on a 2020 MacBook Pro - or nvJPEG for GPU-based JPEG decoding. Supposedly there are even faster GPU-based decoders than nvJPEG, too.


> GPU-based

How does this impact the overall latency of encoding a single image?


I've written an open-source driver for the decoding side of the nvjpg module found in the Tegra X1 (i.e. an earlier hardware revision than the one in the A100).

I did some quick benchmarks against libjpeg-turbo, if that can give you an idea. I expect encoding performance would be similar.

https://github.com/averne/oss-nvjpg#performance


Probably quite a bit, but I don't know. The typical use case is to load up thousands of JPEGs at once to get good throughput despite the copy overhead. You can see the benchmark against libjpeg-turbo here: https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...


Note that libdeflate has used essentially the same method since 2016 (https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32...), though I recently switched it to use a slightly different method (https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/ad...) that performs more consistently across different families of x86 CPUs.


Does anyone have any recommendations for checksum algorithms in greenfield systems? There seems to be lots of innovation in cryptographically secure hash functions, but I have a greenfield project where I need checksums and don't care about cryptographic properties. Is CRC32c still a good choice, or has the industry moved on?


What are your requirements? Tamper resistance? Error detection? Error correction? Speed vs. space trade-offs?


Just error detection for corrupt or partial disk writes.
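
For that use case, CRC32C is still a reasonable default: it has good error-detection properties and is hardware-accelerated on modern x86 (SSE4.2) and ARMv8. A minimal sketch using the SSE4.2 intrinsics (my own helper, not from any particular library; compile with -msse4.2):

    #include <nmmintrin.h>  // SSE4.2: _mm_crc32_*
    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    // CRC-32C (Castagnoli), the polynomial the SSE4.2 crc32 instruction
    // implements, with the standard bit-inverted pre/post-conditioning.
    uint32_t crc32c(uint32_t crc, const void* data, size_t len) {
        const uint8_t* p = (const uint8_t*)data;
        crc = ~crc;
        while (len >= 8) {
            uint64_t v;
            std::memcpy(&v, p, 8);  // unaligned-safe 8-byte load
            crc = (uint32_t)_mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *p++);
        return ~crc;
    }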


While micro-optimizations are interesting, there are two questions left unanswered:

- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.

- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?


If checksum calculation were any substantial portion of image decoding, I think that would be a strong case for simply not checking the checksum.

If you put corrupted data into a PNG decoder, I don't think it's awfully important to most users whether they get a decode error or a garbled image out.


This was actually considered, and other libraries do ignore checksums, or at least have options to:

https://github.com/richgel999/fpng/issues/9


> diminishing returns especially due to it working faster than the speed of my RAM (2667 MT/s * 8 = ~21 GB/s).

That sounds kinda slow; is there only one DIMM in the slots? I remember benchmarking 40 GiB/s read speed on an older system that had two dual-rank DIMMs (4 ranks in total).

I'd expect 3200 MT/s × (64 data lines) × (2 memory channels) ≈ 48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.

Great writeup either way.


Yes, this is on a single 8 GB 2667 MT/s DIMM in a laptop.

edit: For dual channel RAM, I would suspect the throughput depends on how the kernel decides to map physical memory to virtual addresses.


The memory is already mapped by the BIOS/EFI firmware, before the kernel takes control.

By default, whenever the memory modules in all the different channels have the same size, e.g. two 8 GB modules, the firmware maps the modules with interleaved addresses, ensuring double throughput for 2 channels, or triple/quadruple/etc. for workstation/server motherboards with more memory channels.


I hope this brilliant work has been merged into the relevant open source libraries.

Something that's unfair about the world is that work like this could reach billions of people and save a million dollars' worth of time and electricity annually, yet it is done gratis.

It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…


Effective hiring managers are always paying attention, hoping to notice the people doing this excellent stuff and to ask whether "now" or "soon" is the right time to offer them high-paying jobs.

I doubt my current garage band could afford the OP just this moment, but I sure wish we could!


> I doubt my current garage band could afford the OP just this moment, but I sure wish we could!

Well, I intend to finish high school at a minimum before pursuing employment :)


Someone as far ahead of the curve as you clearly are might enjoy the second chapter of Coders at Work, which is an interview with Brad Fitzpatrick. bradfitz wrote everything from memcached to big chunks of Tailscale and much in between.

He studied CS at university but would have been bored to sleep if he hadn't had something else going on, so he founded and ran LiveJournal simultaneously.

I bet someone on this thread knows him and I bet he’d take the time to offer some pointers to an up-and-comer like yourself. I’ve never met him, but I did some business with Six Apart in a previous life and people say he’s a really nice guy.

Either way, keep it up!


It works well in science (think the Nobel Prizes), and there's certainly more than enough money floating around in the tech community for it to work.


I love this kind of writeup. This is my idea of fun: speedups.


zlib-ng also has adler32 implementations optimized for various architectures: https://github.com/zlib-ng/zlib-ng

Might be interesting to benchmark their implementation too to see how it compares.



