Typical desktops have two 64-bit DIMMs on two channels (64 bits wide each), or one channel (128 bits wide).
The M1 Macs seem to be 8 channels x 16 bits, which is the same total bus width as a desktop (although running the RAM at 4266 MHz is much higher than usual). The big win is you can have 8 cache misses in flight instead of 2. With 8 cores, 16 GPU cores, and 16 ML cores, I suspect the M1 has more in-flight cache misses than most.
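A quick sketch of the peak-bandwidth arithmetic behind that comparison (the configurations and transfer rates are the ones mentioned above; treating MHz as MT/s here is an assumption):

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers: int) -> float:
    """Peak DRAM bandwidth in GB/s: (bus width in bytes) x (transfers/sec)."""
    return bus_width_bits / 8 * mega_transfers * 1e6 / 1e9

# Desktop: 2 channels x 64 bits at 3200 MT/s.
desktop = peak_bandwidth_gbs(2 * 64, 3200)   # 51.2 GB/s
# M1: 8 channels x 16 bits at 4266 MT/s -- same 128-bit total width.
m1 = peak_bandwidth_gbs(8 * 16, 4266)        # ~68.3 GB/s

print(f"desktop: {desktop:.1f} GB/s, M1: {m1:.1f} GB/s")
```

Same total bus width, so the M1's edge in raw bandwidth comes entirely from the higher transfer rate; the channel count matters for parallel misses, not peak GB/s.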
The DDR4 bus is 64-bit, how can you have a 128-bit channel??
Single channel DDR4 is still 64-bit, it's only using half of the bandwidth the CPU supports. This is why everyone is perpetually angry at laptop makers that leave an unfilled SODIMM slot or (much worse) use soldered RAM in single-channel.
> The big win is you can have 8 cache misses in flight instead of 2
Only if your cache line is that narrow (16 bits), I think? Which might have downsides of its own.
> The DDR4 bus is 64-bit, how can you have a 128-bit channel??
Less familiar with what's normal on laptops, but most desktop chips from AMD and Intel have two 64-bit channels.
> Which might have downsides of its own.
Typically for each channel you send an address (actually a row and a column), wait for the DRAM latency, and then get a burst of transfers (one per bus cycle) of the result. So for a 16-bit-wide channel @ 3.2 GHz with a 128-byte cache line you get 64 transfers, one every 0.3125 ns, for a total of 20 ns.
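The burst timing above works out like this (a minimal sketch using the same figures: 16-bit channel, 3.2 GT/s, 128-byte line):

```python
CHANNEL_WIDTH_BYTES = 2      # 16-bit channel = 2 bytes per transfer
TRANSFER_RATE_HZ = 3.2e9     # 3.2 GT/s
CACHE_LINE_BYTES = 128

transfers = CACHE_LINE_BYTES // CHANNEL_WIDTH_BYTES   # 64 transfers per line
transfer_period_ns = 1e9 / TRANSFER_RATE_HZ           # 0.3125 ns per transfer
burst_ns = transfers * transfer_period_ns             # 20 ns for the burst

print(transfers, transfer_period_ns, burst_ns)        # 64 0.3125 20.0
```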
Each channel operates independently, so multiple channels can each have a cache miss in flight. Otherwise nobody would bother with independent channels and just stripe them all together.
Here's a graph of cache line throughput vs number of threads.
So going from 1 to 2 threads you see an increase in throughput: the multiple channels are helping. 4 threads is the same as 2, so maybe the L2 cache has a bottleneck. But 8 threads is clearly better than 4.
It's pretty common for hardware to support both. On Zen 1 Epycs, for instance, some software preferred the consistent latency of striped memory over the NUMA-aware setup with separate channels, where the closer DIMMs have lower latency and the farther DIMMs have higher.
I've seen similar on Intel servers, but not recently. This isn't typically something you can change at runtime, though, just at boot time, at least as far as I've seen.
But doesn't that only help if you have parallel threads doing independent 16 bit requests? If you're accessing a 64 bit value, wouldn't it still need to occupy four channels?
Depends. Cache lines are typically 64-128 bytes long, and depending on various factors a line might sit on one memory channel or be spread across multiple memory channels, somewhat like a RAID-0 disk. I've seen servers (Opterons, I believe) that would allow mapping memory per channel or across channels based on settings in the BIOS. Generally, non-NUMA-aware OSs ran better with striped memory and NUMA-aware OSs ran better non-striped.
So striping a cache line across multiple channels does increase bandwidth, but not by much. If the DRAM latency is 70 ns (not uncommon) and your memory is running at 3.2 GHz on a single 64-bit-wide channel, you get 128 bytes in 16 transfers. 16 transfers at 3.2 GHz = 5 ns. So you get a cache line back in 75 ns. With two independent 64-bit channels you can get 2 cache lines per 75 ns.
So now with a 128-bit-wide channel (twice the bandwidth) you wait 70 ns, then get 8 transfers @ 3.2 GHz = 2.5 ns. So you get a cache line back in 72.5 ns. Clearly not a big difference.
So the question becomes: for a complicated OS with a ton of cores, do you want one cache line per 72.5 ns (the striped config) or two cache lines per 75 ns (the non-striped config)?
In the 8-channel x 16-bit config (assuming the same bus speed and latency) you get 8 cache lines per 90 ns. I'm not sure what magic Apple has, but I'm seeing very low memory latencies on the M1, on the order of 33 ns! With all cores busy I'm seeing throughput of roughly one cache line per 11 ns.
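The three configurations above can be compared with a small model: time per cache line is DRAM latency plus burst time, and independent channels return lines in parallel. This is a sketch assuming the same 70 ns latency, 3.2 GT/s rate, and 128-byte line used in the discussion:

```python
def cache_line_time_ns(channel_width_bits: int, line_bytes: int = 128,
                       latency_ns: float = 70.0, rate_hz: float = 3.2e9) -> float:
    """Latency to return one cache line over a channel of the given width."""
    transfers = line_bytes / (channel_width_bits / 8)   # transfers in the burst
    return latency_ns + transfers / rate_hz * 1e9       # latency + burst time

configs = [
    ("1 x 128-bit (striped)", 128, 1),   # one line per 72.5 ns
    ("2 x 64-bit",            64,  2),   # two lines per 75 ns
    ("8 x 16-bit",            16,  8),   # eight lines per 90 ns
]
for name, width, channels in configs:
    t = cache_line_time_ns(width)
    print(f"{name}: {channels} line(s) per {t:.1f} ns "
          f"-> {channels / t * 1e3:.1f} lines/us")
```

So the narrow-channel config pays a slightly longer burst per line but returns four times as many lines concurrently, which is the trade-off the whole thread is about.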
I believe modern superscalar architectures can run instructions out of order if they don't rely on the same data, so while stalled on a cache miss, the processor can read ahead in the code and potentially find other memory to prefetch. I may be wrong about the specifics, but these are the types of tricks that modern CPUs employ to achieve higher speed.
Sure, but generally a cache-line miss will quickly stall the pipeline. You might have a few non-dependent instructions in flight, but for a CPU running at 3+ GHz, waiting 70 ns is an eternity. Doubly so when it can execute multiple instructions per cycle.