> My understanding of a prefetcher is to get the data ASAP. It typically prefetches from cache, not from memory (if it prefetches from mem it's so slow it's useless).
A prefetcher runs at whatever level it's attached to (L1/L2) and fetches from whatever level actually holds the data. So the L2 prefetcher may be grabbing from L3, or it may be grabbing from SDRAM.
> ram latency 36 cycles + 57 NS / 62 cycles + 100NS depending

That's the RAM latency ON RANDOM ACCESS. If you extend an access to fetch the next sequential line because you think it will be used, you don't pay any of that latency penalty-- you might need to strobe a column access, but words keep streaming out. For this reason sequential prefetch is particularly powerful. Even if we're only retrieving from a single SDRAM channel, it's just another ~3ns to continue on and retrieve the next line. (DDR4-2400 is 2400 MT/s, and a 64-byte line over a 64-bit channel takes 8 transfers: 8/2400000000 = 3.3ns.)
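To make that arithmetic concrete, here's the back-of-the-envelope version as a tiny program. The transfer math is straight from the DDR4-2400 numbers above; the 57ns random-access figure is just the illustrative number quoted in this thread, not a measurement:

```c
#include <stdio.h>

int main(void) {
    /* DDR4-2400: 2400 mega-transfers/second on a 64-bit (8-byte) channel. */
    double transfers_per_sec = 2400e6;
    double line_bytes = 64.0;   /* one cache line */
    double bus_bytes  = 8.0;    /* bytes moved per transfer */

    double transfers = line_bytes / bus_bytes;               /* 8 transfers */
    double stream_ns = transfers / transfers_per_sec * 1e9;  /* ~3.3 ns */

    /* Illustrative random-access cost from the thread; the real number
       varies by part and by how far the request has to travel. */
    double random_ns = 57.0;

    printf("next sequential line: %.1f ns\n", stream_ns);
    printf("fresh random access:  ~%.0f ns or worse\n", random_ns);
    return 0;
}
```

Roughly a 17x difference per line, which is the whole argument for extending an access to stream the next line speculatively.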
> I see, but I think it does better these days, it'll remember a stride and prefetch by the next stride. Even striding backwards through mem (cache)! But it won't cross a page boundary.
Sure, the even/odd access extender is just one very simple prefetcher found in modern Intel processors, which I included for illustration. And we're completely ignoring software prefetch.
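Software prefetch, for reference, is just a hint instruction the compiler will emit for you. A minimal sketch using the GCC/Clang builtin (the prefetch distance of 64 elements is an arbitrary illustrative choice, it's a tuning knob in practice):

```c
#include <stddef.h>

/* Sum an array while asking the hardware to start pulling in data
   several cache lines ahead. __builtin_prefetch compiles to e.g.
   PREFETCHT0 on x86; args are (addr, rw: 0=read, locality: 3=high). */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 3);
        s += a[i];
    }
    return s;
}
```

For a plain sequential walk like this the hardware prefetchers usually win anyway; software prefetch earns its keep on access patterns the hardware can't predict.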
Go ahead, do the experiment. Run a memory-heavy workload and look at the cache miss rates. Then turn off prefetch and see what you get. On most workloads you'll see a lot more misses. ;)

https://github.com/deater/uarch-configure/blob/master/intel-...
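On many Intel cores the hardware prefetchers are control bits in MSR 0x1a4 (Intel has published the bit layout, and it's what tooling like the repo linked above manipulates). A minimal sketch of the "turn it off" half of the experiment using the Linux msr driver; the bit assignments below are worth verifying against your exact part:

```c
/* Toggle Intel hardware prefetchers via MSR 0x1a4, then measure with
   e.g. `perf stat -e cache-misses ./your_workload`.
   Bits (per Intel's prefetcher-control disclosure; verify for your CPU):
     bit 0: L2 hardware prefetcher
     bit 1: L2 adjacent-cache-line prefetcher
     bit 2: DCU (L1) streaming prefetcher
     bit 3: DCU IP (stride) prefetcher
   Needs root and `modprobe msr`. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MSR_MISC_FEATURE_CONTROL 0x1a4

int main(int argc, char **argv) {
    /* 0xf = all four prefetchers off; 0x0 = all back on. */
    uint64_t val = (argc > 1) ? strtoull(argv[1], NULL, 0) : 0xf;
    int fd = open("/dev/cpu/0/msr", O_RDWR);  /* core 0; repeat per core */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    if (pwrite(fd, &val, sizeof val, MSR_MISC_FEATURE_CONTROL) != sizeof val) {
        perror("wrmsr");
        return 1;
    }
    close(fd);
    return 0;
}
```

Run the same workload under `perf stat` before and after, and compare miss counts.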
That thread is orthogonal, but what's there supports exactly what I'm saying: prefetch improves effective bandwidth to SDRAM at all layers.
A second, successive streamed fetch is basically free from a latency perspective. If you miss and have to go to memory, there's a very good chance L2 will prefetch the next line into a stream buffer, so you won't miss all the way to SDRAM next time.
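You can see the stream-buffer effect from userspace with a crude timing comparison: a sequential walk the prefetchers can follow versus a dependent pointer chase they can't. A sketch, with sizes and the 8-byte `long` assumption picked for an LP64 system:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)   /* 16M longs = 128 MB, well past any cache */

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    long *a = malloc(N * sizeof *a);
    srand(1);
    for (long i = 0; i < N; i++) a[i] = i;
    /* Sattolo shuffle: one big random cycle, so every load in the
       chase depends on the previous one and no prefetcher can help. */
    for (long i = N - 1; i > 0; i--) {
        long j = rand() % i;
        long t = a[i]; a[i] = a[j]; a[j] = t;
    }

    double t0 = now_ns();
    long s = 0;
    for (long i = 0; i < N; i++) s += a[i];   /* sequential: prefetch-friendly */
    double t1 = now_ns();
    long p = 0;
    for (long i = 0; i < N; i++) p = a[p];    /* dependent chase: prefetch-proof */
    double t2 = now_ns();

    printf("sequential: %.2f ns/elem (checksum %ld)\n", (t1 - t0) / N, s);
    printf("chase:      %.2f ns/elem (end %ld)\n", (t2 - t1) / N, p);
    free(a);
    return 0;
}
```

Compile with -O2; the printed results keep both loops from being optimized away. The sequential walk should come in at a few ns per element, the chase at something close to full random-access RAM latency.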
It's reached the point where the stream prefetchers now hint to the memory controller that a queued access is a prefetch, so the memory controller can choose, based on contention, whether to service it or not.
Most of what you're talking about is L1 prefetch; I agree that if an L1 prefetch misses all the way to RAM, you're probably screwed. The fancy strategies you mention are mostly L1 prefetch strategies. But L2 has its own prefetcher, and it's there to hide memory latency and increase effective use of memory bandwidth...
While we're talking about it... even the SDRAM itself has a "prefetcher" for burst access ;) Though calling it that is kind of an abuse of the term.