It does seem the benchmark has its data in cache, based on the timings.

If the benchmark were only measuring L1 latency, what would that imply about the ‘scaling by inverse clock speed’ bit? My guess: chips with higher clock rates get penalised, because (a) it is harder to decrease latencies (memory, pipeline depth, etc.) in absolute terms than it is to raise the clock and do non-memory work faster; and (b) if you’re waiting 5 ns to read some data, that wait costs you more cycles, and so hurts you more after the scaling, the higher your clock speed. That the M1 wins after the scaling despite the higher clock rate suggests to me that either it has a big memory-latency advantage, or some non-memory advantage in scheduling or branch prediction that gets more useful instructions retired per cycle.
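To make (b) concrete, here’s a back-of-the-envelope sketch. The 5 ns latency and both clock speeds are made-up numbers purely for illustration, not measurements from the benchmark:

    # Why dividing a score by clock speed penalises higher clocks
    # when the latency being measured is fixed in wall-clock time.
    # The 5 ns latency and clock speeds below are illustrative only.
    LATENCY_NS = 5.0  # hypothetical load-to-use latency in nanoseconds

    for name, ghz in [("3.2 GHz chip", 3.2), ("5.0 GHz chip", 5.0)]:
        stall_cycles = LATENCY_NS * ghz  # cycles spent waiting on the load
        print(f"{name}: {stall_cycles:.0f} stall cycles per access")

    # 3.2 GHz chip: 16 stall cycles per access
    # 5.0 GHz chip: 25 stall cycles per access

The same 5 ns wait costs the faster-clocked chip more cycles, so any score scaled by inverse clock speed shifts against it.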

But maybe I’m interpreting it the wrong way.


