They measured about 94.9 GB/s of DRAM bandwidth for the Core Ultra 7 258V. Aren't Intel going to respond to the 200 GB/s bandwidth of the M1 Pro, introduced 3 years ago? Not to mention the 400 GB/s of the Max and the 800 GB/s of the Ultra?
Most of the bandwidth comes from cache hits, but for those rare workloads larger than the caches, Apple's products may be 2-8x faster?
AMD Strix Halo, to be launched in early 2025, will have a 256-bit memory interface for LPDDR5x of 8 or 8.5 GHz, so it will match M1 Pro.
However, Strix Halo, which has a much bigger GPU, is designed for a maximum power consumption for CPU+GPU of 55 W or more (up to 120 W), while Lunar Lake is designed for 17 W, which explains the choices for the memory interfaces.
Sorry, I meant the frequency of the transfers, not that of some synchronization signal, so 8000 or 8500 MT/s, as you say.
However that should have been obvious, because there will be no LPDDR5x of 16000 MT/s. That throughput might be reached in the future, but a different memory standard will be needed, perhaps a derivative of the MRDIMMs that are beginning to be used in servers (with multiplexed ranks).
MHz and MT/s are really the same unit. What differs is the quantity being measured, e.g. the frequency of oscillations vs. the frequency of transfers. I do not agree with the practice of giving multiple names to a unit of measurement in order to suggest what kind of quantity has been measured. The number of different quantities that can be measured with the same unit is very large, so if this practice were applied consistently there would be a huge number of unit names. I believe that the right way is to use a unique unit name, but to always specify separately which quantity has been measured, because a numeric value and a unit alone are never sufficient information without also knowing which quantity was measured.
Lunar Lake is very clearly a response to the M1, not its larger siblings: the core counts, packaging, and power delivery changes all line up with the M1 and successors. Lunar Lake isn't intended to scale up to the power (or price) ranges of Apple's Pro/Max chips. So this is definitely not the product where you could expect Intel to start using a wider memory bus.
And there's very little benefit to widening the memory bus past 128-bit unless you have a powerful GPU to make good use of that bandwidth. There are comparatively few consumer workloads for CPUs that are sufficiently bandwidth-hungry.
Is the full memory bandwidth actually available to the CPU on M-series CPUs? Because that would seem like a waste of silicon to me, to have 200+ GB/s of past-LLC bandwidth for eight cores or so.
200+ GB/s (IIRC AnandTech measured 240 GB/s) per cluster.
Pro is one cluster, Max is two clusters and Ultra is four clusters, so the aggregate bandwidth is 200, 400 and 800 GB/s respectively[*].
The bandwidth is also shared with GPU and NPU cores within the same cluster, so on combined loads it is plausible that the memory bus may become fully saturated.
[*] Starting with the M3, Apple has split the Pro models into more-Pro and less-Pro versions with differing memory bus widths.
I think the number of people interested in running ML models locally might be greatly overestimated [here]. There is no killer app in sight that needs to run locally. People work and store their stuff in the cloud. Most people just want a lightweight laptop, and AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them. Production quality models are pretty much cloud only, and I don’t think open source models, especially ones viable for local inference, will close the gap anytime soon. I’d like all of those things to be different, but I think that’s just the way things are.
Of course there are enthusiasts, but I suspect that they prefer and will continue to prefer dedicated inference hardware.
I have some difficulty estimating how heavy Recall’s workload is, but either way, I have little faith in Microsoft’s ability to implement this feature efficiently. They struggle with much simpler features, such as search. I wouldn’t be surprised if a lot of people disable the feature to save battery life and improve system performance.
Huh? All the files are local, and models are gonna be made to have a lot of them, or ultimately all of them, in the "context window". You can't really have an AI for your local documents on the cloud because the cloud doesn't have access. Same logic for businesses. The use case follows from data availability and barriers.
We've observed the same on web pages, where more and more functionality gets pushed to the frontend. One could push the last (x) layers of the neural net, for example, to the frontend, for lower expense and, if rightly engineered, better speed and scalability.
AIs will be local, super-AIs still in the cloud.
Local AIs will be proprietary and they will have strings to the mothership
The strings will have business value, both in relation to the consumer and for further AI training.
From what I’ve seen, most people tend to use cloud document platforms. Microsoft has made Office into one. This has been the steady direction for the last few years; they’ll keep pushing for it, because it gives them control. Native apps with local files is an inconvenient model for them. This sadly applies to most other types of apps. On many of these cloud platforms, you can’t even download the project files.
> You can't really have an AI for your local documents on the cloud because the cloud doesn't have access
Yes, up to the cloud it all goes. They don’t mind, they can charge you for it. Microsoft literally openly wants to move the whole OS to the cloud.
> Same logic for businesses
Businesses hate local files. They’re a huge liability. When firing people it’s very convenient that you can just cut someone off by blocking their cloud credentials.
> We've observed the same on web pages where more and more functionality gets pushed to the frontend
It will never go all the way.
> One could push the last (x) layers of the neural net for example, to the frontend, for lower expense and if rightly engineered, better speed and scalability
I’ll believe it when I see it. I don’t think the incentive is there. Sounds like a huge complicating factor. It’s much simpler to keep everything running in the cloud, and software architects strongly prefer simple designs. How much do these layers weigh? How many MB/GB of data will I need to transfer? How often? Does that really give me better latency than just transferring a few KBs of the AI’s output?
> I think the number of people interested in running ML models locally might be greatly overestimated [here]. There is no killer app in sight that needs to run locally. People work and store their stuff in the cloud. Most people just want a lightweight laptop, and AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them. Production quality models are pretty much cloud only, and I don’t think open source models, especially ones viable for local inference, will close the gap anytime soon. I’d like all of those things to be different, but I think that’s just the way things are.
> Of course there are enthusiasts, but I suspect that they prefer and will continue to prefer dedicated inference hardware.
Local ML isn't a CPU workload. The NPUs in mobile processors (both laptop and smartphone) are optimized for low power and low precision, which limit how much memory bandwidth they can demand. So as I said, demand for more memory bandwidth depends mainly on how powerful the GPU is.
M3 Pro is 150 GB/s (and that should be compared to Lunar Lake's nominal memory bandwidth of 128 GB/s) and the cheapest model with it starts at $2000 ($2400 if you want 36 GB of RAM).
At those price levels, PC laptops have discrete GPUs with their own RAM with 256 GB/s and up just for the GPU.
True, but I thought Intel might start using more channels to make that metric look less unbalanced in Apple's favour. Especially now that they are putting RAM on package.
Not really, the killer is latency, not throughput. It's very rare that a CPU actually runs out of memory bandwidth. It's much more useful for the GPU.
95 GB/s is ~24 GB/s per core; at 4.8 GHz that's 40 bits per core per cycle. You would have to be doing basically nothing useful with the data to be able to get through that much bandwidth.
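A quick back-of-the-envelope check of those numbers (a sketch; the 95 GB/s, 4 P-cores and 4.8 GHz figures are simply the ones quoted in this thread):

```c
/* Per-core bandwidth arithmetic, using the figures quoted above. */
#include <stdio.h>

int main(void) {
    double bw_gbs   = 95.0;  /* measured DRAM bandwidth, GB/s           */
    int    cores    = 4;     /* P-cores assumed to share that bandwidth */
    double freq_ghz = 4.8;   /* per-core clock, GHz                     */

    double gbs_per_core  = bw_gbs / cores;          /* ~23.8 GB/s */
    double bytes_per_clk = gbs_per_core / freq_ghz; /* ~4.9 bytes */
    printf("%.1f GB/s per core, %.1f bytes (~%.0f bits) per core per cycle\n",
           gbs_per_core, bytes_per_clk, bytes_per_clk * 8);
    return 0;
}
```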
For scientific/technical computing, which uses a lot of floating-point operations and a lot of array operations, when memory limits the performance the limit is almost always caused by the memory throughput and almost never by the memory latency (in correctly written programs, which allow the hardware prefetchers to do their job of hiding the memory latency).
The resemblance to the behavior of GPUs is not a coincidence: the GPUs are also mostly doing array operations.
So the general rule is that the programs dominated by array operations are sensitive mostly to the memory throughput.
This can be seen in the different effect of the memory bandwidth on the SPECint and SPECfp benchmark results, where the SPECfp results are usually greatly improved when memory with a higher throughput is used, unlike the SPECint results.
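For a concrete picture of the kind of array kernel being described, here is a minimal STREAM-triad-style loop (just a sketch, not taken from any benchmark suite). With arrays much larger than the last-level cache, its runtime is set almost entirely by DRAM throughput, and the stride-1 accesses let the hardware prefetchers hide the DRAM latency.

```c
/* Bandwidth-bound array kernel (STREAM-triad style). When n is large enough
 * that the three arrays do not fit in the last-level cache, runtime is set
 * by DRAM throughput; the sequential access pattern lets the hardware
 * prefetchers hide DRAM latency. */
#include <stddef.h>

void triad(double *a, const double *b, const double *c, double s, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];   /* 2 loads + 1 store + 1 FMA per element */
}
```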
You are right that it's a limiting factor in general for that use case, just not in the case of this specific chip: this chip has far fewer cores per memory channel, so latency will be the limiting factor. Even then, I assure you that no scientific workload is going to be consuming 40 bits/clock/core. It's just a staggering amount of memory bandwidth; no correctly written program would hit this, you'd need to have abysmal cache hit ratios.
This processor has two memory channels over 4 P-cores. Something like an EPYC-9754 has 12 channels over 128 cores.
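Rough per-core numbers behind that comparison (assumptions: the ~95 GB/s measurement quoted above for Lunar Lake's 4 P-cores, and 12 channels of DDR5-4800, about 460 GB/s, for the 128-core EPYC 9754):

```c
/* Per-core DRAM bandwidth, under the assumptions stated above. */
#include <stdio.h>

int main(void) {
    printf("Lunar Lake: %.1f GB/s per P-core\n", 95.0 / 4);    /* ~23.8 */
    printf("EPYC 9754 : %.1f GB/s per core\n",   460.8 / 128); /* ~3.6  */
    return 0;
}
```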
I agree that for CPU-only tasks, Lunar Lake has ample available memory bandwidth, but high memory latency.
However, the high memory bandwidth is intended mainly for the benefit of its relatively big GPU, which might have been able to use even higher memory throughputs.
40 bits per clock in an 8-wide core gets you 5 bits per instruction, and we have AVX512 instructions to feed, with operand sizes 100x that (and there are multiple operands).
Modern chips do face the memory wall. See e.g. here (though about Zen 5), where they conclude in the same vein: "A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth."
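Spelling out the arithmetic with the numbers used in this thread (nothing here is measured, just the quoted figures):

```c
/* 40 bits/clock of DRAM bandwidth per core, an 8-wide core, and
 * 512-bit AVX-512 loads, as discussed above. */
#include <stdio.h>

int main(void) {
    double dram_bits_per_clk = 40.0;   /* per core, from the 95 GB/s figure */
    double issue_width       = 8.0;    /* instructions per clock            */
    double avx512_load_bits  = 512.0;

    printf("%.1f bits of DRAM bandwidth per instruction slot\n",
           dram_bits_per_clk / issue_width);        /* 5 bits     */
    printf("one full-width load's worth of DRAM bandwidth every %.0f clocks\n",
           avx512_load_bits / dram_bits_per_clk);   /* ~13 clocks */
    return 0;
}
```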
The throughput of the AVX-512 computation instructions is matched to the throughput of loads from the L1 cache memory, on all CPUs.
Therefore to reach the maximum throughput, you must have the data in the L1 cache memory. Because L1 is not shared, the throughput of the transfers from L1 scales proportionally with the number of cores, so it can never become a bottleneck.
So the most important optimization target for the programs that use AVX-512 is to ensure that the data is already located in L1 whenever it is needed. To achieve this, one of the most important things is to use memory access patterns that will trigger the hardware prefetchers, so that they will fill the L1 cache ahead of time.
The main memory throughput is not much lower than that of the L1 cache, but the main memory is shared by all cores, so if all cores want data from the main memory at the same time, the performance can drop dramatically.
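A minimal illustration of "memory access patterns that will trigger the hardware prefetchers" (the function names are made up for this sketch): a stride-1 sweep is easy for the prefetchers to predict, so they fill L1 ahead of the loop, while pointer chasing gives them nothing to predict and every miss pays the full DRAM latency.

```c
/* Prefetcher-friendly vs. prefetcher-hostile access patterns (sketch). */
#include <stddef.h>

/* Sequential sweep: the stride-1 pattern is detected by the hardware
 * prefetchers, which bring the data into L1 ahead of time. */
double sum_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: the next address is only known after the previous load
 * completes, so the prefetchers cannot help. */
struct node { struct node *next; double v; };

double sum_list(const struct node *p) {
    double s = 0.0;
    for (; p != NULL; p = p->next)
        s += p->v;
    return s;
}
```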
The processors that hit this wall have many, many cores per memory channel. It's just not realistic for this to be a problem with 2 channels of DDR5 feeding 4 cores.
These cores cannot process 8 AVX512 instructions at once; in fact they can't execute AVX512 at all, as it's disabled on consumer Intel chips.
Also, AVX instructions operate on registers, not on memory, so you cannot have more than one register being loaded at once.
If you are running at ~4 instructions per clock, to actually saturate 40 bits per clock with 64-bit loads you'd need 1/6 of instructions to hit main memory (not cache)!
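The fraction works out like this (using the 40 bits/clock, 64-bit loads and ~4 IPC figures from above):

```c
/* Fraction of instructions that would have to be loads missing every
 * cache, given the figures quoted above. */
#include <stdio.h>

int main(void) {
    double loads_per_clock = 40.0 / 64.0;         /* 0.625 DRAM loads/clock */
    double fraction        = loads_per_clock / 4; /* out of ~4 IPC          */
    printf("%.3f of all instructions (~1/%.0f)\n", fraction, 1.0 / fraction);
    return 0;
}
```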
There might be a chicken-and-egg situation here - one often hears that there’s no point having wider SIMD vectors or more ALU units, as they would spend all their time waiting for the memory anyway.
The width and count of the SIMD execution units are matched to the load throughput from the L1 cache memory, which is not shared between cores.
Any number of cores with any count and any width of SIMD functional units can reach the maximum throughput, as long as it can be ensured that the data can be found in the L1 cache memories at the right time.
So the limits on the number of cores and/or on SIMD width and count are completely determined by whether, in the applications of interest, it is possible to bring the data from the main memory to the L1 cache memories at the right times.
This is what must be analyzed in discussions about such limits.
CPUs generally achieve around 4-8 FLOPS per cycle. That means 256-512 bits per cycle. We're all doing AI, which means matrix multiplications, which means frequently rereading data sets bigger than the cache and doing one MAC with each piece of data read.
The importance of the matrix multiplication algorithm is precisely due to the fact that it is the main algorithm where the ratio between computational operations and memory transfers can be very large; therefore the memory bandwidth is not a bottleneck for it.
The right way to express a matrix multiplication is not the one wrongly taught in schools, with scalar products of vectors, but as a sum of tensor (outer) products, where the k-th column vector of the first matrix is paired with the k-th row vector of the second matrix.
Computing a tensor product of two vectors, with the result accumulated in registers, requires a number of memory loads equal to the sum of the lengths of the vectors, but a number of FMA operations equal to the product of the lengths (i.e. for square matrices of size NxN, there are 2N loads and N^2 FMA operations for one tensor product, which multiplied by the N tensor products gives 2N^2 loads and N^3 FMA operations for the whole matrix multiplication).
Whenever the lengths of both vectors are no less than 2 and at least one length is no less than 3, the product is greater than the sum. With greater vector lengths, the ratio between product and sum grows very quickly, so when the CPU has enough registers to hold the partial sums, the ratio between the counts of FMA operations and of memory loads can be very large.
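A minimal scalar sketch of that tensor-product (outer-product) formulation, with a small register tile (matmul_outer, the 4x4 tile size and the row-major layout are just choices made for this example; a real kernel would use SIMD intrinsics and a tile sized to the register file):

```c
/* Matrix multiplication as a sum of outer (tensor) products:
 * C += sum over k of A[:,k] * B[k,:].
 * A 4x4 tile of C is accumulated in locals, so each k-step does
 * 4 + 4 = 8 loads but 4 * 4 = 16 FMAs. */
#include <stddef.h>

/* C is MxN, A is MxK, B is KxN, all row-major; M and N multiples of 4. */
void matmul_outer(double *C, const double *A, const double *B,
                  size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i += 4) {
        for (size_t j = 0; j < N; j += 4) {
            double acc[4][4] = {{0}};               /* register tile of C */
            for (size_t k = 0; k < K; k++) {
                double a[4], b[4];
                for (int r = 0; r < 4; r++) a[r] = A[(i + r) * K + k];
                for (int c = 0; c < 4; c++) b[c] = B[k * N + j + c];
                for (int r = 0; r < 4; r++)         /* one rank-1 update */
                    for (int c = 0; c < 4; c++)
                        acc[r][c] += a[r] * b[c];
            }
            for (int r = 0; r < 4; r++)
                for (int c = 0; c < 4; c++)
                    C[(i + r) * N + j + c] += acc[r][c];
        }
    }
}
```

Growing the tile towards the limit of the register file is what pushes the FMA-to-load ratio up, exactly as described above.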
Is it though? The matmul of two NxN matrices takes N^3 MACs and 2*N^2 memory accesses. So the larger the matrices, the more the arithmetic dominates (with some practical caveats, obviously).