
Seems like you can get this sort of speed-up even without the 256 constraint. Just run the sped-up version, but after each set of 256 iterations, dump all the 8-bit counters into the 64-bit final result.


Some people already mentioned this in the r/cpp discussion. Small correction: 256 is not the correct number of iterations, since if all elements in that slice are even, your 8-bit counter will wrap around to zero, which can lead to a wrong answer. What you want is 255 iterations.

I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
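In the meantime, here's a rough scalar sketch of that 255-iteration scheme (untested; assuming the input is a std::vector<uint8_t>):

    // Rough sketch, not the blog's actual code: flush the 8-bit counter
    // into a 64-bit total every 255 elements so it can never wrap.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    uint64_t count_even(const std::vector<uint8_t>& v) {
        uint64_t total = 0;
        size_t i = 0;
        while (i < v.size()) {
            size_t block_end = std::min(v.size(), i + 255);
            uint8_t counter = 0;  // at most 255 increments per block, no wrap
            for (; i < block_end; ++i)
                counter += static_cast<uint8_t>(v[i] % 2 == 0);
            total += counter;
        }
        return total;
    }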


255 is not ideal either, because it leaves a partial vector of either 15 or 31 elements at the end. 256 - V, where V is the vector size, is better: 240 for SSE2/NEON or 224 for AVX2.

This is still lower than optimal, because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value, via _mm_sad_epu8 and vpadal_u8 respectively. That allows 255 iterations in the inner loop instead of 15 or 7.
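Roughly what I have in mind, as an untested SSE2 sketch (assuming a plain uint8_t buffer and x86-64):

    // Untested sketch: per-lane 8-bit even-counts for up to 255 vectors,
    // then _mm_sad_epu8 against zero widens the 16 byte lanes into two
    // 64-bit partial sums before anything can wrap.
    #include <emmintrin.h>
    #include <cstddef>
    #include <cstdint>

    uint64_t count_even_sse2(const uint8_t* data, size_t n) {
        uint64_t total = 0;
        const __m128i zero = _mm_setzero_si128();
        const __m128i one  = _mm_set1_epi8(1);
        size_t i = 0;
        while (n - i >= 16) {
            size_t vecs = (n - i) / 16;
            if (vecs > 255) vecs = 255;       // keep 8-bit lanes from wrapping
            __m128i counts = zero;
            for (size_t v = 0; v < vecs; ++v, i += 16) {
                __m128i x = _mm_loadu_si128(
                    reinterpret_cast<const __m128i*>(data + i));
                __m128i is_even = _mm_cmpeq_epi8(_mm_and_si128(x, one), zero);
                counts = _mm_sub_epi8(counts, is_even);  // 0xFF lanes -> +1
            }
            __m128i sums = _mm_sad_epu8(counts, zero);   // two u64 partial sums
            total += static_cast<uint64_t>(_mm_cvtsi128_si64(sums));
            total += static_cast<uint64_t>(
                _mm_cvtsi128_si64(_mm_srli_si128(sums, 8)));
        }
        for (; i < n; ++i)                    // scalar tail
            total += (data[i] % 2 == 0);
        return total;
    }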


Great observations, thanks!

I wrote the code that you suggested (LMK if I understood your points): https://godbolt.org/z/jW4o3cnh3

And here's the benchmark output, on my machine: https://0x0.st/8SsG.txt (v1 is std::count_if(), v2 is the optimization from my blog post, and v3 is what you suggested).

v2 is faster, but v3 is still quite fast.


Yeah, the difference you're seeing is likely due to the inner loop doing so few iterations, in addition to not being unrolled. A hand-rolled version doing 63 iterations of a 4x-unrolled loop should be able to saturate the execution core (it'll be bottlenecked by load throughput). But it'll be tough to get the compiler to generate that without intrinsics.
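Something like this untested sketch of the 4x-unrolled shape (assuming a uint8_t buffer; the tail is handled scalar for brevity):

    // Untested sketch: four independent 8-bit accumulators, 63 iterations
    // each (max count 63 per lane, so no wrap), i.e. 252 vectors (4032 bytes)
    // per flush via _mm_sad_epu8.
    #include <emmintrin.h>
    #include <cstddef>
    #include <cstdint>

    static inline __m128i even_mask(__m128i x, __m128i one, __m128i zero) {
        return _mm_cmpeq_epi8(_mm_and_si128(x, one), zero);  // 0xFF where even
    }

    uint64_t count_even_unrolled(const uint8_t* data, size_t n) {
        uint64_t total = 0;
        const __m128i zero = _mm_setzero_si128();
        const __m128i one  = _mm_set1_epi8(1);
        size_t i = 0;
        while (n - i >= 64 * 63) {
            __m128i c0 = zero, c1 = zero, c2 = zero, c3 = zero;
            for (int k = 0; k < 63; ++k, i += 64) {
                const __m128i* p = reinterpret_cast<const __m128i*>(data + i);
                c0 = _mm_sub_epi8(c0, even_mask(_mm_loadu_si128(p + 0), one, zero));
                c1 = _mm_sub_epi8(c1, even_mask(_mm_loadu_si128(p + 1), one, zero));
                c2 = _mm_sub_epi8(c2, even_mask(_mm_loadu_si128(p + 2), one, zero));
                c3 = _mm_sub_epi8(c3, even_mask(_mm_loadu_si128(p + 3), one, zero));
            }
            __m128i s = _mm_add_epi64(_mm_sad_epu8(c0, zero), _mm_sad_epu8(c1, zero));
            s = _mm_add_epi64(s, _mm_sad_epu8(c2, zero));
            s = _mm_add_epi64(s, _mm_sad_epu8(c3, zero));
            total += static_cast<uint64_t>(_mm_cvtsi128_si64(s));
            total += static_cast<uint64_t>(_mm_cvtsi128_si64(_mm_srli_si128(s, 8)));
        }
        // A real version would vectorize the remainder too.
        for (; i < n; ++i)
            total += (data[i] % 2 == 0);
        return total;
    }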


The fastest approach, I would think, is to process 255 vectors at a time, using an accumulation vector where the k-th byte represents the number of evens seen in the k-th byte position of the vectors in that 255-vector chunk. But I don't know how to make autovectorization do that; I could only do it with intrinsics.
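FWIW, here's an untested NEON sketch of roughly that idea, using the vpadal widening mentioned upthread (assumes a uint8_t buffer and AArch64 for vaddlvq_u16):

    // Untested sketch: the k-th byte of 'counts' holds how many of the
    // processed vectors had an even byte in lane k; widened into u16 lanes
    // every <=255 vectors, and flushed to the 64-bit total every 128 blocks
    // so the u16 lanes can't wrap either.
    #include <arm_neon.h>
    #include <cstddef>
    #include <cstdint>

    uint64_t count_even_neon(const uint8_t* data, size_t n) {
        uint64_t total = 0;
        const uint8x16_t one  = vdupq_n_u8(1);
        const uint8x16_t zero = vdupq_n_u8(0);
        uint16x8_t wide = vdupq_n_u16(0);
        size_t i = 0, blocks = 0;
        while (n - i >= 16) {
            size_t vecs = (n - i) / 16;
            if (vecs > 255) vecs = 255;              // 8-bit lanes can't wrap
            uint8x16_t counts = zero;
            for (size_t v = 0; v < vecs; ++v, i += 16) {
                uint8x16_t x = vld1q_u8(data + i);
                uint8x16_t is_even = vceqq_u8(vandq_u8(x, one), zero);
                counts = vsubq_u8(counts, is_even);  // 0xFF lanes -> +1
            }
            wide = vpadalq_u8(wide, counts);         // widen into u16 lanes
            if (++blocks == 128) {
                total += vaddlvq_u16(wide);          // AArch64 horizontal add
                wide = vdupq_n_u16(0);
                blocks = 0;
            }
        }
        total += vaddlvq_u16(wide);
        for (; i < n; ++i)                           // scalar tail
            total += (data[i] % 2 == 0);
        return total;
    }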



