Computing Adler32 Checksums at 41 GB/s (wooo.sh)
98 points by wooosh on Aug 7, 2022 | 27 comments



Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )


Here's another SIMD implementation, with commentary: https://github.com/google/wuffs/blob/main/std/adler32/common...

Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.

"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...


Ooh, now that is very interesting. I would really love to see how this speeds up the runtime of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things up significantly then it's possible there'll be a huge shakeup!

[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...

[1] https://github.com/veluca93/fpnge

[2] https://twitter.com/richgel999/status/1485976101692358656


Unfortunately I haven't had the time to do a proper benchmark, and the fpng test executable only decodes/encodes a single image, which produces very noisy, inconclusive results. However, I'm under the impression that it doesn't make a large difference in terms of overall time.

fpnge (which I wasn’t aware of until now) appears to already be using a very similar (identical?) algorithm, so I suspect the relative performance of fpng and fpnge would not be significantly impacted by this change.


As someone who has been recently optimising fpnge: Adler32 computation is pretty much negligible in terms of overall runtime. The Huffman coding and filter search take up most of the time. (IIRC fpng doesn't do any filter search, but its Huffman encoding isn't vectorized, so I'd expect that to dominate fpng's runtime.)


If image encode/decode speed is the only concern, libjpeg-turbo is going to be orders of magnitude faster than any of these lossless schemes. With JPEG, you can encode 1080p bitmaps in <10 milliseconds (per thread) on any consumer PC made in the last decade.

The frequency domain is a really powerful place to operate in when you are dealing with this amount of data.


That's not true. libjpeg-turbo is ~50 MB/s last I tried - plus it's not lossless. fjxl and fpnge are basically an order of magnitude faster than that. libjpeg-turbo isn't even the fastest JPEG codec - you should check out the (relatively obscure) libmango - roughly 1 Gbps decode on a 2020 MacBook Pro - or nvJPEG for GPU-based JPEG decoding. Supposedly there are even faster GPU-based decoders than nvJPEG, too.


> GPU-based

How does this impact the overall latency of encoding a single image?


I've written an open-source driver for the decoding side of the nvjpg module found in the Tegra X1 (i.e. an earlier hardware revision than the one in the A100).

I did some quick benchmarks against libjpeg-turbo, if that can give you an idea. I expect encoding performance would be similar.

https://github.com/averne/oss-nvjpg#performance


Probably quite a bit, but I don't know. The typical use case is to load up thousands of JPEGs at once to get good throughput despite the copy overhead. You can see the benchmark against libjpeg-turbo here: https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...


Note that libdeflate has used essentially the same method since 2016 (https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32...), though I recently switched it to use a slightly different method (https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/ad...) that performs more consistently across different families of x86 CPUs.


Does anyone have any recommendations for checksum algorithms in greenfield systems? There seems to be lots of innovation in cryptographically secure hash functions, but I have a greenfield project where I need checksums and don't care about cryptographic properties. Is CRC32c still a good choice, or has the industry moved on?


What are your requirements? Tamper resistance? Error detection? Error correction? Speed vs. space trade-offs?


Just error detection for corrupt or partial disk writes.
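
For that use case, CRC32C is still a reasonable default: it has good error-detection properties and is hardware-accelerated on modern x86 (SSE4.2) and ARMv8. A minimal sketch using the SSE4.2 intrinsics (my own helper, not from any particular library; compile with -msse4.2):

    #include <nmmintrin.h>  // SSE4.2: _mm_crc32_*
    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    // CRC-32C (Castagnoli), the polynomial the SSE4.2 crc32 instruction
    // implements, with the standard bit-inverted pre/post-conditioning.
    uint32_t crc32c(uint32_t crc, const void* data, size_t len) {
        const uint8_t* p = (const uint8_t*)data;
        crc = ~crc;
        while (len >= 8) {
            uint64_t v;
            std::memcpy(&v, p, 8);  // unaligned-safe 8-byte load
            crc = (uint32_t)_mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *p++);
        return ~crc;
    }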


While micro-optimizations are interesting, there are two questions left unanswered:

- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.

- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?


If checksum calculation were any substantial portion of image decoding, I think that would be a strong case for simply not checking the checksum.

If you put corrupted data into a PNG decoder, I don't think it's awfully important to most users whether they get a decode error or a garbled image out.


This was actually considered, and other libraries do ignore checksums, or at least have options to:

https://github.com/richgel999/fpng/issues/9


> diminishing returns especially due to it working faster than the speed of my RAM (2667 MT/s * 8 = ~21 GB/s).

That sounds kinda slow; is there only one DIMM in the slots? I remember benchmarking 40 GiB/s read speed on an older system that had two dual-rank DIMMs (4 ranks in total).

I'd expect 3200 MT/s × (64 data lines) × (2 memory channels) ≈ 48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.

Great writeup either way.


Yes, this is on a single 8 GB 2667 MT/s DIMM in a laptop.

edit: For dual channel RAM, I would suspect the throughput depends on how the kernel decides to map physical memory to virtual addresses.


The memory is already mapped by the BIOS/EFI firmware, before the kernel takes control.

By default, whenever the memory modules in all the different channels have the same size, e.g. two 8 GB modules, the firmware maps the modules with interleaved addresses, ensuring double throughput for 2 channels, or triple/quadruple/etc. for workstation/server motherboards with more memory channels.


I hope this brilliant work has been merged into the relevant open source libraries.

Something that's unfair about the world is that work like this could reach billions of people and save a million dollars' worth of time and electricity annually, yet it is done gratis.

It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…


Effective hiring managers are always paying attention, hoping to notice the people doing this excellent stuff and to ask whether "now" or "soon" is the right time to offer them high-paying jobs.

I doubt my current garage band could afford the OP just this moment, but I sure wish we could!


> I doubt my current garage band could afford the OP just this moment, but I sure wish we could!

Well, I intend to finish high school at a minimum before pursuing employment :)


Someone as far ahead of the curve as you clearly are might enjoy the second chapter of Coders at Work, which is an interview with Brad Fitzpatrick. bradfitz wrote everything from memcached to big chunks of Tailscale and much in between.

He studied CS at university but would have been bored to sleep if he hadn't had something else going on, so he founded and ran LiveJournal simultaneously.

I bet someone on this thread knows him and I bet he’d take the time to offer some pointers to an up-and-comer like yourself. I’ve never met him, but I did some business with Six Apart in a previous life and people say he’s a really nice guy.

Either way, keep it up!


It works well in science (think the Nobel Prizes), and there's certainly more than enough money floating around in the tech community for it to work.


I love this kind of writeup. This is my idea of fun: speedups.


zlib-ng also has adler32 implementations optimized for various architectures: https://github.com/zlib-ng/zlib-ng

Might be interesting to benchmark their implementation too to see how it compares.



