
I agree server CPUs are underprovisioned for memBW. Each core's share is 2-4 GB/s, whereas a single core could easily drive 10 GB/s (Intel) or 20+ GB/s (AMD).
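
For a rough sense of the per-core share, here is the back-of-the-envelope arithmetic as a small C sketch. The channel count, core count, and DDR5-4800 speed are illustrative assumptions, not a description of any particular part:

  /* Per-core share of socket memory bandwidth (illustrative numbers only). */
  #include <stdio.h>

  int main(void) {
      double per_channel = 4800e6 * 8 / 1e9;   /* DDR5-4800: 4800 MT/s x 8 bytes = 38.4 GB/s */
      double per_socket  = 8 * per_channel;    /* assume 8 channels -> ~307 GB/s */
      int    cores       = 96;                 /* assume a 96-core socket */
      printf("per-core share: %.1f GB/s\n", per_socket / cores);   /* ~3.2 GB/s */
      return 0;
  }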

I also agree that "some" workloads (for example, low-arithmetic-intensity ones) will not benefit from a second L1 read port.

But surely there are other workloads, right? If I want to issue one FMA per cycle, streaming from two arrays, doesn't that require sustaining two loads per cycle?
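
Concretely, the kind of loop I have in mind looks something like this (a sketch using AVX-512 intrinsics; the function and array names are placeholders, and a real version would use several accumulators to hide FMA latency):

  #include <immintrin.h>
  #include <stddef.h>

  /* Each iteration performs one 512-bit FMA fed by two fresh 64-byte loads,
     so sustaining 1 FMA/cycle requires sustaining 2 loads/cycle. */
  float dot(const float *a, const float *b, size_t n) {   /* n: multiple of 16 */
      __m512 acc = _mm512_setzero_ps();
      for (size_t i = 0; i < n; i += 16) {
          __m512 va = _mm512_loadu_ps(a + i);              /* load 1 */
          __m512 vb = _mm512_loadu_ps(b + i);              /* load 2 */
          acc = _mm512_fmadd_ps(va, vb, acc);              /* 1 FMA */
      }
      return _mm512_reduce_add_ps(acc);
  }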






In the ideal situation where both arrays fit in L1 cache and are already resident there, yes. However, in typical real-world situations they will not fit in L1, and what happens after the reads are issued will look like this:

  * Some time passes
  * Load 1 finishes
  * Some time passes
  * Load 2 finishes
  * FMA executes

As we are doing an FMA over arrays, this is presumably part of a tight loop. During the first few loop iterations, the CPU core's memory prefetcher will recognize that you have two linear access patterns and that your code is likely to request the next parts of both arrays. The prefetcher will then begin issuing loads before your code does, and when the CPU issues a load that the prefetcher has already started, it simply waits on the in-flight result as if it had issued the load itself.

Internally, the CPU is pipelined, so even if it can only issue 1 load per cycle and there are two loads to issue, it does not wait for the first load to finish; it issues the second load on the next cycle, and that load likewise ends up waiting on data the prefetcher requested early. It therefore does not really matter whether the two AVX-512 loads are issued in 1 cycle or 2: the issuing happens during time we are already spending waiting for the loads to complete, thanks to the prefetcher starting them early.

This reasoning implicitly assumes that the loads finish serially rather than in parallel, and it might seem more natural to expect them to finish in parallel. In reality they finish serially, because the hardware itself is serial: on the 9800X3D, the physical lines connecting memory to the CPU can only send 128 bits at a time (128 bits that matter for this reasoning, anyway; we can ignore things like transparent ECC). An AVX-512 load therefore has to wait for 4x 128 bits to be sent over those lines. Even if you issue two AVX-512 reads in a single cycle, one will always finish first and you will still need to wait for the second one.
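
Spelled out as arithmetic (the 128-bit transfer width is the figure from above; everything else is just division):

  /* DRAM-side transfer count for AVX-512 loads over a 128-bit path. */
  #include <stdio.h>

  int main(void) {
      int bits_per_transfer = 128;                        /* 2 x 64-bit DDR5 channels */
      int bits_per_load     = 512;                        /* one AVX-512 load */
      int per_load = bits_per_load / bits_per_transfer;   /* 4 transfers */
      printf("transfers per AVX-512 load: %d\n", per_load);
      printf("transfers for two loads:    %d, one after another\n", 2 * per_load);
      return 0;
  }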

I realize I did not address the L2 and L3 caches, but much like system RAM, neither will keep up with 2 AVX-512 loads per cycle (or 1, for that matter). So when the data is in L2 or L3, what happens is similar to loads coming from system memory, just with less time spent waiting.

It could be that the loop finishes a few cycles faster with the 2-loads-per-cycle version (because it might help the memory prefetcher recognize the linear access pattern a few cycles sooner). But if your loop takes 1 billion cycles to execute, you are not going to notice a savings of a few cycles, which is why I think being able to issue 2 AVX-512 loads per cycle instead of 1 does not matter very much.

Does my explanation make sense?


OK, we agree that L1-resident workloads see a benefit. I also agree with your analysis if the loads actually come from memory.

Let's look at a more interesting case. We have a dataset bigger than L3. We touch a small part of it with one kernel; that part is now in L1. Next we run a second kernel for which every load of that part is an L1 hit. With two L1 ports, the second kernel is now twice as fast.

Even better, we can work on a larger piece of the data, sized so that it still fits in L2, and do the above for each L1-sized chunk of that piece. Sure, the initial load from L2 does not happen at 2x64 bytes per cycle. But there are still many L1 hits, and I'm measuring effective FMA throughput that is _50 times_ as high as the memory bandwidth would allow when only streaming from memory. It's simply a matter of arranging for reuse to be possible, which admittedly does not work with single-pass algorithms like a checksum.
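
A minimal sketch of what I mean, assuming hypothetical kernel1/kernel2 functions and a TILE size chosen so one tile fits comfortably in L1:

  #include <stddef.h>

  #define TILE 4096                        /* 4096 floats = 16 KiB, assumed to fit in L1 */

  void kernel1(float *x, size_t n);        /* first pass: pulls the tile into L1 */
  void kernel2(float *x, size_t n);        /* second pass: its loads are now L1 hits */

  void process(float *data, size_t n) {    /* n assumed to be a multiple of TILE */
      for (size_t i = 0; i < n; i += TILE) {
          kernel1(data + i, TILE);         /* pays the memory/L2 cost once per tile */
          kernel2(data + i, TILE);         /* reuses the same 16 KiB from L1, where a
                                              second L1 read port can double throughput */
      }
  }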

Do you find this reasoning convincing?



