Ooh now that is very interesting. I would really love to see how this speeds up the run-time of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things up significantly then it's possible there'll be a huge shakeup!
Unfortunately I haven’t had the time to do a proper benchmark, and the fpng test executable only decodes/encodes a single image which produces very noisy/inconclusive results. However, I’m under the impression that it doesn’t make a large difference in terms of overall time.
fpnge (which I wasn’t aware of until now) appears to already be using a very similar (identical?) algorithm, so I suspect the relative performance of fpng and fpnge would not be significantly impacted by this change.
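For reference, here's a minimal scalar sketch of the structure that gets vectorized: the inner byte loop plus deferred modulo reduction (reduce once per ~5.5 KB chunk instead of per byte), which the SIMD versions replace with wide byte sums. This is just an illustration, not fpng's or fpnge's actual code:

    #include <stdint.h>
    #include <stddef.h>

    #define ADLER_MOD  65521u
    /* Largest chunk for which the 32-bit accumulators cannot overflow
       before reducing (same bound zlib uses). */
    #define ADLER_NMAX 5552u

    uint32_t adler32_sketch(const uint8_t *p, size_t len, uint32_t adler)
    {
        uint32_t s1 = adler & 0xffffu;  /* running sum of bytes        */
        uint32_t s2 = adler >> 16;      /* running sum of running sums */

        while (len > 0) {
            size_t chunk = len < ADLER_NMAX ? len : ADLER_NMAX;
            len -= chunk;
            while (chunk--) {           /* <- the loop SIMD replaces   */
                s1 += *p++;
                s2 += s1;
            }
            s1 %= ADLER_MOD;            /* reduce only once per chunk  */
            s2 %= ADLER_MOD;
        }
        return (s2 << 16) | s1;
    }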
Speaking as someone who has recently been optimising fpnge: Adler32 computation is pretty much negligible in terms of overall runtime.
The Huffman coding and filter search take up most of the time. (IIRC fpng doesn't do any filter search, but its Huffman encoding isn't vectorized, so I'd expect that to dominate fpng's runtime.)
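(In case "filter search" is unclear: PNG lets every scanline choose one of five byte predictors, and the usual heuristic is to try all five and keep the one with the smallest sum of absolute residuals. A rough sketch of that, assuming a raw row of n bytes with bpp bytes per pixel; hypothetical helper names, not fpnge's actual code:)

    #include <stdint.h>
    #include <stdlib.h>
    #include <stddef.h>

    static uint8_t paeth(uint8_t a, uint8_t b, uint8_t c) {
        int p  = a + b - c;
        int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
        if (pa <= pb && pa <= pc) return a;
        return (pb <= pc) ? b : c;
    }

    /* Returns the filter type (0..4) chosen for this row; prev is the
       previous (unfiltered) row, or NULL for the first row. */
    int choose_filter(const uint8_t *row, const uint8_t *prev,
                      size_t n, size_t bpp)
    {
        uint64_t best_cost = UINT64_MAX;
        int best = 0;
        for (int f = 0; f < 5; f++) {
            uint64_t cost = 0;
            for (size_t i = 0; i < n; i++) {
                uint8_t a = i >= bpp ? row[i - bpp] : 0;             /* left       */
                uint8_t b = prev ? prev[i] : 0;                      /* above      */
                uint8_t c = (prev && i >= bpp) ? prev[i - bpp] : 0;  /* above-left */
                uint8_t pred;
                switch (f) {
                    case 0:  pred = 0; break;                      /* None    */
                    case 1:  pred = a; break;                      /* Sub     */
                    case 2:  pred = b; break;                      /* Up      */
                    case 3:  pred = (uint8_t)((a + b) / 2); break; /* Average */
                    default: pred = paeth(a, b, c); break;         /* Paeth   */
                }
                uint8_t filt = (uint8_t)(row[i] - pred); /* filtered byte as stored  */
                int v = (int8_t)filt;                    /* treat as signed for cost */
                cost += (uint64_t)(v < 0 ? -v : v);      /* minimum sum of abs       */
            }
            if (cost < best_cost) { best_cost = cost; best = f; }
        }
        return best;
    }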
If image encode/decode speed is the only concern, libjpeg-turbo is going to be orders of magnitude faster than any of these lossless schemes. With JPEG, you could encode 1080p bitmaps in <10 milliseconds (per thread) on any consumer PC made in the last decade.
The frequency domain is a really powerful place to operate in when you are dealing with this amount of data.
That's not true. libjpeg-turbo is ~50 MB/s last I tried, plus it's not lossless. fjxl and fpnge are basically an order of magnitude faster than that. libjpeg-turbo isn't even the fastest JPEG codec; you should check out the (relatively obscure) libmango (roughly 1 Gbps decode on a 2020 MacBook Pro) or nvJPEG for GPU-based JPEG decoding. Supposedly there are even faster GPU-based decoders than nvJPEG, too.
I've written an open-source driver for the decoding side of the nvjpg module found in the Tegra X1 (i.e., an earlier hardware revision than the one in the A100).
I did some quick benchmarks against libjpeg-turbo, if that can give you an idea. I expect encoding performance would be similar.
Probably quite a bit, I don't know. The typical use case is to load up thousands of JPEGs at once to get good throughput despite copy overhead. You can see here the benchmark against jpeg-turbo: https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...
Does anyone have any recommendations for checksumming algorithms in greenfield systems? It seems like there’s lots of innovation in cryptographically secure hashing functions, but I have a greenfield project where I need checksums and don’t care about crypto properties. Is CRC32c still a good choice, or has the industry moved on?
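(For context on why CRC32c keeps coming up: x86 has had a dedicated instruction for it since SSE4.2, so even a naive serial loop like the sketch below runs at a few bytes per cycle. Hypothetical helper name, needs -msse4.2, and a real project would want a portable fallback; just an illustration, not a recommendation.)

    #include <nmmintrin.h>   /* SSE4.2 _mm_crc32_* intrinsics */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* CRC-32C (Castagnoli); pass crc = 0 for a fresh checksum. */
    uint32_t crc32c_sse42(const uint8_t *p, size_t len, uint32_t crc)
    {
        crc = ~crc;
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, p, 8);                       /* unaligned-safe load */
            crc = (uint32_t)_mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *p++);          /* tail bytes */
        return ~crc;
    }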
While micro-optimizations are interesting, there are two questions left unanswered:
- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.
- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?
If checksum calculation were a substantial portion of image decoding, I think that would be a strong case for simply not checking the checksum.
If you put corrupted data into a PNG decoder, I don't think it's awfully important to most users whether they get a decode error or a garbled image out.
> diminishing returns especially due to it working faster than the speed of my RAM (2667 MT/s * 8 = ~21 GB/s).
That sounds kinda slow; is there only one DIMM in the slots? I remember benchmarking 40 GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).
I'd expect 3200 Mbit/s per pin * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop, and a lot more with overclocked RAM.
The memory is already mapped by the BIOS/EFI firmware, before the kernel takes control.
By default, whenever the memory modules in the different channels have the same size (e.g. two 8 GB modules), the firmware maps them with interleaved addresses, which doubles throughput with 2 channels, or triples/quadruples/etc. it on workstation/server motherboards with more memory channels.
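(A toy illustration of what that interleaving buys: assume, hypothetically, that the controller alternates channels every 64-byte cache line; the real granularity and address bits are chipset-specific. Consecutive lines then land on alternating channels, so a streaming read keeps both busy at once:)

    #include <stdint.h>

    /* Toy model only: channel chosen by the parity of the cache-line index. */
    static inline int channel_for_address(uint64_t phys_addr)
    {
        return (int)((phys_addr >> 6) & 1);   /* bit 6 selects channel 0 or 1 */
    }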
I hope this brilliant work has been merged into the relevant open source libraries.
Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars’ worth of time and electricity annually, yet it is being done gratis.
It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…
Effective hiring managers are always paying attention, hoping to notice the people doing this excellent stuff and to ask whether “now” or “soon” is the right time to offer them high-paying jobs.
I doubt my current garage band could afford the OP just this moment, but I sure wish we could!
Someone as far ahead of the curve as you clearly are might enjoy the second chapter of Coders at Work, which is an interview with Brad Fitzpatrick. bradfitz wrote everything from memcached to big chunks of Tailscale and much in between.
He studied CS at university but would have been bored to sleep if he didn’t have something else going on, so he founded and ran LiveJournal at the same time.
I bet someone on this thread knows him and I bet he’d take the time to offer some pointers to an up-and-comer like yourself. I’ve never met him, but I did some business with Six Apart in a previous life and people say he’s a really nice guy.