*>They are not solely reliant on process node shrinks for performance uplifts li...

ryao · 2025-01-08T11:04:35 1736334275

Those standards are plumbing to connect things to the CPU. The last major innovations that Intel had in the CPU itself were implementing CISC in RISC with programmable microcode in the Pentium and SMT in the Pentium 4. Everything else has been fairly incremental and they were reliant on their process node advantage to stay on top. There was Itanium too, but that effort was a disaster. It likely caused Intel to stop innovating and just rely on its now defunct process node advantage.

Intel’s strategy after it adopted EM64T (Intel’s NIH syndrome name for amd64) from AMD could be summarized as “increase realizable parallelism through more transistors and add more CISC instructions to do key work loads faster”. AVX512 was that strategy’s zenith and it was a disaster for them since they had to cut clock speeds when AVX-512 operations ran while AMD was able to implement them without any apparent loss in clock speed.

You might consider the more recent introduction of E cores to be an innovation, but that was a copy of ARM’s big.little concept. The motivation was not so much to save power as it was for ARM but to try to get more parallelism out of fewer transistors since their process advantage was gone and the AVX-512 fiasco had showed that they needed a new strategy to stay competitive. Unfortunately for Intel, it was not enough to keep them competitive.

Interestingly, leaks from Intel indicate that Intel had a new innovation in development called Royal Core, but Pat Gelsinger cancelled it last year before he “resigned”. The cancellation reportedly lead to Intel’s Oregon design team resigning.

menaerus · 2025-01-08T21:16:25 1736370985

> AVX512 was that strategy’s zenith and it was a disaster for them since they had to cut clock speeds when AVX-512 operations ran while AMD was able to implement them without any apparent loss in clock speed.

AMD up until zen 5 didn't have a full AVX-512 support so not exactly a fair comparison. Intel designs don't suffer from that issue AFAIU for couple of iterations already.

But I agree with you, I always thought and I still do that Intel has a very strong CPU core design but where AMD changed the name of the game IMHO is the LLC cache design. Hitting as much as ~twice lower LLC latency is insane. To hide this big of a difference in latency, Intel has to pack larger L2+LLC cache sizes.

Since LLC+CCX design scales so well AMD is also able to pack ~50% more cores per die, something Intel can't achieve even with the latest Granite Rapids design.

These two reasons let alone are big things for data center workloads so I really wonder how Intel is going to battle that.

ryao · 2025-01-09T04:04:08 1736395448

AVX-512 is around a dozen different ISA extensions. AMD implemented the base AVX-512 and more with Zen 4. This was far more than Intel had implemented in skylake-X where their problems started. AMD added even more extensions with Zen 5, but they still do not have the full AVX-512 set of extensions implemented in a single CPU and neither does Intel. Intel never implemented every single AVX-512 extension in a single CPU:

https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

It also took either 4 or 6 years for Intel to fix its downclocking issues, depending on whether you count Rocket Lake as fixing a problem that started in enterprise CPUs, or require Sapphire Rapids to have been released to consider the problem fixed:

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...

menaerus · 2025-01-09T08:22:10 1736410930

Ok, fair enough, I didn't explain myself very well. What I more specifically meant is that AMD up until zen5 could not

  (1) drive 2x AVX-512 computations
  (2) handle 2x AVX-512 memory loads + 1x AVX-512 memory store

in the same clock.

The latter makes a big impact wrt available memory BW per core, at least when it comes to the workloads whose data is readily available in L0 cache. Intel in these experiments is crushing AMD by a large factor simply because their memory controller design is able to sustain 2x64B loads + 1x64B stores in the same clock. E.g. 642 GB/s (Golden Cove) vs 334 GB/s (zen4) - this is a big difference and this is something that Intel had for ~10 years whereas AMD was able to solve this with zen5, basically only with the end of 2024.

Former one limits the theoretical FLOPS/core capabilities since single AVX-512 FMA operation in zen4 is implemented as two AVX2 uops occupying both FMA slots per clock. This is also big and, again, this is something where Intel had a lead up until zen5.

Wrt downclocking issues, they had a substantial impact with Skylake implementation but with Ice Lake this was a solved issue and this was in 2019. I'm cool with having ~97% of max freq budget available with heavy AVX-512 workloads.

OTOH AMD is also very thin with this sort of information and some experiments show that turbo boost clock frequency on zen4 lowers from one CCD to another CCD [1]. It seems like zen5 exhibits similar behavior [2].

So, although AMD is displaying continuous innovation for the past several years this is only because they had a lot to improve. Their pre-zen (2017) designs were basically crap and could not compete with Intel who OTOH had a very strong CPU design for decades.

I think that the biggest difference in CPU core design really is in the memory controller - this is something Intel will need to find an answer to since AMD matched all the Intel strengths that it was lacking with zen5.

[1] https://chipsandcheese.com/p/amds-zen-4-part-3-system-level-...

[2] https://chipsandcheese.com/p/amds-ryzen-9950x-zen-5-on-deskt...

ryao · 2025-01-09T17:16:40 1736443000

System memory is not able to sustain such memory bandwidth so it seems like a moot point to me. Intel’s CPUs reportedly cannot sustain such memory bandwidth even when it is available:

https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...

menaerus · 2025-01-09T18:28:02 1736447282

Not sure I understood you. You think that AVX-512 workload and store-load BW are irrelevant because main system memory (RAM) cannot keep up with the speed of CPU caches?

ryao · 2025-01-10T05:30:51 1736487051

I think the benefits of more AVX-512 stores and loads per cycle is limited because the CPU is bottlenecked internally as shown in the slides from TACC I linked:

https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...

Your 642 GB/s figure should be for a single Golden Cove core, and it should only take 3 Golden Cove cores to saturate the 1.6 TB/sec HBM2e in Xeon Max, yet internal bottlenecks prevented 56 Golden Cove cores from reaching the 642 GB/s read bandwidth you predicted a single core could reach when measured. Peak read bandwidth was 590 GB/sec when all 56 cores were reading.

According to the slides, peak read bandwidth for a single Golden Cove core in the sapphire rapids CPU that they tested is theoretically 23.6GB/sec and was measured at 22GB/sec.

Chips and Cheese did read bandwidth measurements on a non-HBM2e version of sapphire rapids:

https://chipsandcheese.com/p/a-peek-at-sapphire-rapids

They do not give an exact figure for multithreaded L3 cache bandwidth, but looking at their chart, it is around what TACC measured for HBM2e. For single threaded reads, it is about 32 GB/sec from L3 cache, which is not much better than it was for reads from HBM2e and is presumably the effect of lower latencies for L3 cache. The Chips and Cheese chart also shows that Sapphire Rapids reaches around 450 GB/sec single threaded read bandwidth for L1 cache. That is also significantly below your 642 GB/sec prediction.

The 450 GB/sec bandwidth out of L1 cache is likely a side effect of the low latency L1 accesses, which is the real purpose of L1 cache. Reaching that level of bandwidth out of L1 cache is not likely to be very useful, since bandwidth limited operations will operate on far bigger amounts of memory than fit in cache, especially L1 cache. When L1 cache bandwidth does count, the speed boost will last a maximum of about 180ns, which is negligible.

What bandwidth CPU cores should be able to get based on loads/stores per clock and what bandwidth they actually get are rarely ever in agreement. The difference is often called the Von Neumann bottleneck.

menaerus · 2025-01-10T17:19:16 1736529556

> Your 642 GB/s figure should be for a single Golden Cove core

Correct.

> That is also significantly below your 642 GB/sec prediction.

Not exactly the prediction. It's an extract from one of the Chips and Cheese articles. In particular, the one that covers the architectural details of Golden Cove core and not Sapphire Rapids core. See https://chipsandcheese.com/p/popping-the-hood-on-golden-cove

From that article, their experiment shows that Golden Cove core was able to sustain 642 GB/s in L1 cache with AVX-512.

> They do not give an exact figure for multithreaded L3 cache bandwidth,

They quite literally do - it's in the graph in "Multi-threaded Bandwidth" section. 32-core Xeon Platinum 8480 instance was able to sustain 534 GB/s from L3 cache.

> The Chips and Cheese chart also shows that Sapphire Rapids reaches around 450 GB/sec single threaded read bandwidth for L1 cache.

If you look closely into my comment you're referring to you will see that I explicitly referred to Golden Cove core and not to the Sapphire Rapids core. I am not being pedantic here but they're actually different things.

And yes, Sapphire Rapids reach 450 GB/s in L1 for AVX-512 workloads. But SPR core is also clocked @3.8Ghz which is much much less than what the Golden Cove core is clocked at - @5.2GHz. And this is where the difference of ~200 GB/s comes from.

> Reaching that level of bandwidth out of L1 cache is not likely to be very useful, since bandwidth limited operations will operate on far bigger amounts of memory than fit in cache, especially L1 cache

With that said, both Intel and AMD are limited by the system memory bandwidth and both are somewhere in the range of ~100ns per memory access. The actual BW value will depend on the number of cores per chip but the BW is roughly the same since it heavily depends on the DDR interface and speed.

Does that mean that both Intel and AMD are basically of the same compute capabilities for workloads that do not fit into CPU cache?

And AMD just spent 7 years of their engineering efforts to implement what now looks like a superior CPU cache design and vectorized (SIMD) execution capabilities only to be applicable very few (mostly unimportant in grand scheme of things) workloads that actually fit into the CPU cache?

I'm not sure I follow this reasoning but if true then AMD and Intel have nothing to compete against each other since by the logic of CPU caches being limited in applicability, their designs are equally good for the most $$$ workloads.

janwas · 2025-01-10T20:44:04 1736541844

It is not that the entire working set has to fit within SRAM. Kernels that reuse portions of their inputs several times, such as matmul, can be compute bound and there AMD's AVX-512 shines.

menaerus · 2025-01-11T08:09:01 1736582941

Parent comment I am responding to is arguing that CPU caches are not that relevant because the CPU for bigger workloads is anyways bottlenecked by the system memory BW. And thus, AVX-512 is irrelevant because it can only provide compute boost for a very small fraction of time (reciprocal to the size of the L1 cache).

I am in disagreement with that obviously.

ryao · 2025-01-11T08:37:25 1736584645

Your description of what I told you is nothing like what I wrote at all. Also, the guy here is telling you that AVX-512 shines on compute bound workloads, which is effectively what I have been saying. Try going back and rereading everything.

menaerus · 2025-01-11T09:05:09 1736586309

Sorry, that's exactly what you said and the reason why we are having this discussion in the first place. I am guilty of being too patient with trolls such as yourself. If you're not a troll, then you're clueless or detached from reality. You're just spitting a bunch of incoherent nonsense and moving goalposts when lacking an argument.

ryao · 2025-01-11T10:06:02 1736589962

I am a well known OSS developer with hundreds of commits in OpenZFS and many commits in other projects like Gentoo and the Linux kernel. You keep misreading what I wrote and insist that I said something I did not. The issue is your lack of understanding, not mine.

I said that supporting 2 AVX-512 reads per cycle instead of 1 AVX-512 read per cycle does not actually matter very much for performance. You decided that means I said that AVX-512 does not matter. These are very different things.

If you try to use 2 AVX-512 reads per cycle for some workload (e.g. checksumming, GEMV, memcpy, etcetera), then you are going to be memory bandwidth bound such that the code will run no faster than if it did 1 AVX-512 read per cycle. I have written SIMD accelerated code for CPUs and the CPU being able to issue 2 SIMD reads per cycle would make zero difference for performance in all cases where I would want to use it. The only way 2 AVX-512 reads per cycle would be useful would be if system memory could keep up, but it cannot.

janwas · 2025-01-11T10:31:50 1736591510

I agree server CPUs are underprovisioned for memBW. Each core's share is 2-4 GB/s, whereas each could easily drive 10 GB/s (Intel) or 20+ (AMD).

I also agree "some" (for example low-arithmetic-intensity) workloads will not benefit from a second L1 read port.

But surely there are other workloads, right? If I want to issue one FMA per cycle, streaming from two arrays, doesn't that require maintaining two loads per cycle?

ryao · 2025-01-11T11:37:55 1736595475

In an ideal situation where your arrays both fit in L1 cache and are in L1 cache, yes. However, in typical real world situations, you will not have them fit in L1 cache and then what will happen after the reads are issued will look like this:

  * Some time passes
  * Load 1 finishes
  * Some time passes
  * Load 2 finishes
  * FMA executes

As we are doing FMA on arrays, this is presumably part of a tight loop. During the first few loop iterations, the CPU core’s memory prefetcher will figure out that you have two linear access patterns and that your code is likely to request the next parts of both arrays. The memory prefetcher will then begin issuing loads before your code does and when the CPU issues a load that has already been issued by the prefetcher, it will begin waiting on the result as if it had issued the load. Internally, the CPU is pipelined, so if it can only issue 1 load per cycle, and there are two loads to be issued, it does not wait for the first load to finish and instead issues the second load on the next cycle. The second load will also begin waiting on a load that was done early by the prefetcher. It does not really matter whether you are issuing the AVX-512 loads in 1 cycle or 2 cycles, because the issue of the loads will occur in the time while we are already waiting for the loads to finish thanks to the prefetcher beginning the loads early.

There is an inherent assumption in this that the loads will finish serially rather than in parallel, and it would seem reasonable to think that the loads will finish in parallel. However, in reality, the loads will finish serially. This is because the hardware is serial. On the 9800X3D, the physical lines connecting the memory to the CPU can only send 128-bits at a time (well, 128-bits that matter for this reasoning; we are ignoring things like transparent ECC that are not relevant for our reasoning). An AVX-512 load needs to wait for 4x 128-bits to be sent over those lines. The result is that even if you issue two AVX-512 reads in a single cycle, one will always finish first and you will still need to wait for the second one.

I realize I did not address L2 cache and L3 cache, but much like system RAM, neither of those will keep up with 2 AVX-512 loads per cycle (or 1 for that matter), so what will happen when things are in L2 or L3 cache will be similar to what happens when loads come from system memory although with less time spent waiting.

It could be that you will end up with the loop finishing a few cycles faster with the 2 AVX-512 read per cycle version (because it could make the memory prefetcher realize the linear access pattern a few cycles faster), but if your loop takes 1 billion cycles to execute, you are not going to notice a savings of a few cycles, which is why I think being able to issue 2 AVX-512 loads instead of 1 in a single cycle does not matter very much.

Does my explanation make sense?

janwas · 2025-01-11T14:08:18 1736604498

OK, we agree that L1-resident workloads see a benefit. I also agree with your analysis if the loads actually come from memory.

Let's look at a more interesting case. We have a dataset bigger than L3. We touch a small part of it with one kernel. That is now in L1. Next we do a second kernel where each of the loads of this part are L1 hits. With two L1 ports, the latter is now twice as fast.

Even better, we can work on larger parts of the data such that it still fits in L2. Now, we're going to do the above for each L1-sized piece of the L2. Sure, the initial load from L2 isn't happening as fast as 2x64 bytes per cycle. But still, there are many L1 hits and I'm measuring effective FMA throughput that is _50 times_ as high as the memory bandwidth would allow when only streaming from memory. It's simply a matter of arranging for reuse to be possible, which admittedly does not work with single-pass algorithms like a checksum.

Do you find this reasoning convincing?

ryao · 2025-01-11T00:22:53 1736554973

> They quite literally do - it's in the graph in "Multi-threaded Bandwidth" section. 32-core Xeon Platinum 8480 instance was able to sustain 534 GB/s from L3 cache.

They do not. The chip has 105MB L3 cache and they tested on 128MB of memory. This exceeds the size of L3 cache and thus, it is not a proper test of L3 cache.

> If you look closely into my comment you're referring to you will see that I explicitly referred to Golden Cove core and not to the Sapphire Rapids core. I am not being pedantic here but they're actually different things.

Sapphire Rapids uses Golden Cove cores.

> And yes, Sapphire Rapids reach 450 GB/s in L1 for AVX-512 workloads. But SPR core is also clocked @3.8Ghz which is much much less than what the Golden Cove core is clocked at - @5.2GHz. And this is where the difference of ~200 GB/s comes from.

This would explain the discrepancy between your calculation and the L1 cache performance, although being able to get that level of bandwidth only out of L1 cache is not very useful for the reasons I stated.

> I'm not sure I follow this reasoning but if true then AMD and Intel have nothing to compete against each other since by the logic of CPU caches being limited in applicability, their designs are equally good for the most $$$ workloads.

You seem to view CPU performance as being determined by memory bandwidth rather than computational ability. Upon being correctly told L1 cache memory bandwidth does not matter since the bottleneck is system memory, you assume that only system memory performance matters. That would be true if the primary workload of CPUs were memory bandwidth bound workloads, but it is not since the primary workloads of CPUs is compute bound workloads. Thus, how fast CPUs read from memory does not really matter for CPU workloads.

The purpose of a CPU’s cache is to reduce the von Neumann bottleneck by cutting memory access latency. That way the CPU core spends less time waiting before it can use the data and it can move on to a subsequent calculation. How much memory throughput CPUs get from L1 cache is irrelevant to CPU performance outside of exceptional circumstances. There are exceptional circumstances where cache memory bandwidth matters, but they are truly exceptional since any importan workload where memory bandwidth matters is offloaded to a GPU because a GPU often has 1 to 2 orders of magnitude more memory bandwidth than a CPU.

That said, it would be awesome if the performance of a part could be determined by a simple synthetic benchmark such as memory bandwidth, but that is almost never the case in practice.

menaerus · 2025-01-11T08:45:30 1736585130

> They do not. The chip has 105MB L3 cache and they tested on 128MB of memory. This exceeds the size of L3 cache and thus, it is not a proper test of L3 cache.

First, you claimed that there was no L3 BW test. Now, I am not even sure if you're trolling me or lacking knowledge or what at this point?

Please do tell what you consider a "proper test of L3 cache"? And why do you consider their test invalid?

I am curious because triggering 32 physical core threads to run over 32 independent chunks of data (totaling 3G and not 128M) seems like a pretty valid read BW experiment to me.

> Sapphire Rapids uses Golden Cove cores.

Right, but you missed the part that former is configured for the server market and the latter for the client market. Two different things, two different chips, different memory controllers if you wish. That's why you cannot compare one to each other directly without caveats.

Chips and Cheese are actually guilty of doing that but it's because they're lacking more HW to compare against. So some figures that you find in their articles can be misleading if you are not aware of it.

> You seem to view CPU performance as being determined by memory bandwidth rather than computational ability.

But that's what you said trying to refute the fact why Intel was in a lead over AMD up until zen5? You're claiming that AVX-512 workloads and load-store BW are largely irrelevant because CPUs are anyway bottlenecked by the system memory bandwidth.

> That would be true if the primary workload of CPUs were memory bandwidth bound workloads, but it is not since the primary workloads of CPUs is compute bound workloads. Thus, how fast CPUs read from memory does not really matter for CPU workloads.

I am all ears to hear what datacenter workloads you have in mind that are CPU-bound?

Any workload besides the most simplest one is at some point bound by the memory BW.

> The purpose of a CPU’s cache is to reduce the von Neumann bottleneck by cutting memory access latency.

> That way the CPU core spends less time waiting before it can use the data and it can move on to a subsequent calculation.

> How much memory throughput CPUs get from L1 cache is irrelevant to CPU performance outside of exceptional circumstances.

You're contradicting your own claims by saying that cache is there to hide (cut) the latency but then you continue to say that this is irrelevant. Not sure what else to say here.

> but they are truly exceptional since any importan workload where memory bandwidth matters is offloaded to a GPU because a GPU often has 1 to 2 orders of magnitude more memory bandwidth than a CPU.

99% of the datacenter machines are not attached to the GPU. Does that mean that 99% of datacenter workloads are not "truly exceptional" for whatever the definition of that formulation and they are therefore mostly CPU bound?

Or do you think they might be memory-bound but are missing out for not being offloaded to the GPU?

ryao · 2025-01-11T09:15:28 1736586928

> First, you claimed that there was no L3 BW test.

I claimed that they did not provide figures for L3 cache bandwidth. They did not.

> Now, I am not even sure if you're trolling me or lacking knowledge or what at this point?

You should be grateful that a professional is taking time out of his day to explain things that you do not understand.

> Please do tell what you consider a "proper test of L3 cache"? And why do you consider their test invalid?

You cannot measure L3 cache performance by measuring the bandwidth on a region of memory larger than the L3 cache. What they did is a partially cached test and it does not necessarily reflect the true L3 cache performance.

> I am curious because triggering 32 physical core threads to run over 32 independent chunks of data (totaling 3G and not 128M) seems like a pretty valid read BW experiment to me.

You just described a generic memory bandwidth test that does not test L3 cache bandwidth at all. Chips and Cheese’s graphs show performance at different amounts of memory to show the performance of the memory hierarchy. When they exceed the amount of cache at a certain level, the performance transitions to different level. They did benchmarks on different amounts of memory to get the points in their graph and connected them to get a curve.

> Right, but you missed the part that former is configured for the server market and the latter for the client market. Two different things, two different chips, different memory controllers if you wish. That's why you cannot compare one to each other directly without caveats.

The Xeon Max chips with its HBM2e memory is the one place where 2 AVX-512 loads per cycle could be expected to be useful, but due to internal bottlenecks they are not.

Also, for what it is worth, Intel treats AVX-512 as a server only feature these days, so if you are talking about Intel CPUs and AVX-512, you are talking about servers.

> But that's what you said trying to refute the fact why Intel was in a lead over AMD up until zen5? You're claiming that AVX-512 workloads and load-store BW are largely irrelevant because CPUs are anyway bottlenecked by the system memory bandwidth.

I never claimed AVX-512 workloads were irrelevant. I claimed doing more than 1 load per cycle on AVX-512 was not very useful for performance.

Intel losing its lead in the desktop space to AMD is due to entirely different reasons than how many AVX-512 loads per cycle AMD hardware can do. This is obvious when you consider that most desktop workloads do not touch AVX-512. Certainly, no desktop workloads on Intel CPUs touch AVX-512 these days because Intel no longer ships AVX-512 support on desktop CPUs.

To be clear, when you can use AVX-512, it is useful, but the ability to do 2 loads per cycle does not add to the usefulness very much.

> I am all ears to hear what datacenter workloads you have in mind that are CPU-bound?

This is not a well formed question. See my remarks further down in this reply where I address your fabricated 99% figure for the reason why.

> Any workload besides the most simplest one is at some point bound by the memory BW.

Simple workloads are bottlenecked by memory bandwidth (e.g. BLAS levels 1 and 2). Complex workloads are bottlenecked by compute (e.g. BLAS level 3). A compiler for example is compute bound, not memory bound.

> You're contradicting your own claims by saying that cache is there to hide (cut) the latency but then you continue to say that this is irrelevant. Not sure what else to say here.

There is no contradiction. The cache is there to hide latency. The TACC explanation of how queuing theory applies to CPUs makes it very obvious that memory bandwidth is inversely proportional to memory access times, which is why the cache has more memory bandwidth than system RAM. It is a side effect of the actual purpose, which is to reduce memory latency. That is an attempt to reduce the von Neumann bottleneck.

To give a concrete example, consider linked lists. Traversing a linked list requires walking random memory locations. You have a pointer to the first item on the list. You cannot go to the second item without reading the first. This is really slow. If the list is frequently accessed to be in cache, then the cache will hide the access times and make this faster.

> 99% of the datacenter machines are not attached to the GPU. Does that mean that 99% of datacenter workloads are not "truly exceptional" for whatever the definition of that formulation and they are therefore mostly CPU bound?

99% is a number you fabricated. Asking if something is CPU bound only makes sense when you have a GPU or some other accelerator attached to the CPU that needs to wait on commands from the CPU. When there is no such thing, asking if it is CPU bound is nonsensical. People instead discuss being compute bound, memory bandwidth bound or IO bound. Technically, there are three ways to be IO bound, which are memory, storage and network. Since I was already discussing memory bandwidth bound work loads, my inclusion of IO bound as a category refers to the other two subcategories.

By the way, while memory bandwidth bound workloads are better run on GPUs than CPUs, that does not mean all workloads on GPUs are memory bandwidth bound. Compute bound workloads with minimal branching are better done on GPUs than CPUs too.

markhahn · 2025-01-08T15:26:54 1736350014

e cores are more like atom - intel owes no credit to arm.

wtallis · 2025-01-08T22:34:41 1736375681

Intel's E cores are literally derived from the Atom product line. But the practice of including a heterogeneous mix of CPU core types was developed and proven and made mainstream within the ARM ecosystem before being hastily adopted by Intel as an act of desperation (dragging Microsoft along for the ride).