> Arm also believes that Blackhawk will provide “great” LLM performance. I will assume that this has to do with big CPU IPC performance improvements as Arm says that its Cortex CPU is the #1 AI target for developers. I didn’t run the developer or app survey myself, but this makes sense to me as most AI inference in the datacenter is run on the CPU. The NPU and GPU can be an efficient way to run AI, but a CPU is the easiest and most pervasive way, which is why developers target it. A higher-performing CPU obviously helps here, but as the world moves increasingly to smaller language models, Arm’s platform with higher-performing CPU and GPU combined with its tightly integrated ML libraries and frameworks will likely result in a more efficient experience on devices.
I feel like they are conflating 2 unrelated things?
There is "CPU inference", which is indeed pervasive and huge, and there is "generative AI LLM inference", which is absolutely anemic on CPU, especially outside of tiny-context toy model demos, in spite of the popularity of llama.cpp and other CPU platforms.
I would hope smartphone users aren't running anything but extremely tiny LLMs on CPU. The IGP is not just theoretically but currently massively more efficient than any 2024 CPU could be.
4-bit quantized llama gets borderline-acceptable performance on an iPad CPU[1], and there's no reason they can't further optimise CPU perf for this task. IIUC, LLM inference is mostly constrained by memory size and bandwidth (hence the aggressive quantization), rather than raw compute.
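A rough back-of-the-envelope sketch of that bandwidth argument (the 50 GB/s figure below is an assumed phone-class number, not a measurement):

```python
# If decoding is memory-bandwidth-bound, every generated token has to stream
# the full weight set, so tokens/s is roughly bandwidth / model size.
# The 50 GB/s bandwidth is an assumed, purely illustrative figure.

def model_bytes(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in bytes (ignores KV cache and overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8

def tokens_per_second_ceiling(bandwidth_gb_s: float, params_billion: float, bits: int) -> float:
    """Upper bound on decode speed if the weights are re-read once per token."""
    return bandwidth_gb_s * 1e9 / model_bytes(params_billion, bits)

print(f"7B @ 4-bit, 50 GB/s: ~{tokens_per_second_ceiling(50, 7, 4):.0f} tok/s ceiling")
print(f"7B @ fp16,  50 GB/s: ~{tokens_per_second_ceiling(50, 7, 16):.0f} tok/s ceiling")
```

Quantization cuts the bytes streamed per token, which is why it moves the needle more than extra ALUs would.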
I guess it depends on how you're defining "great" LLM performance and "CPU performance supremacy versus Apple"
Personally I would say "borderline-acceptable performance" with a 7B parameter model at 4-bit quantisation is nowhere near "great" or "supremacy". You can fit that on ~7GB of RAM, using your grandpa's 2016-era nvidia 1080.
To take the ARM-CPU-LLM-performance crown from Apple you have to beat a fully equipped M2 Max, which provides 96GB of memory with 400GB/s of bandwidth. That's enough to run 70B parameter models in 4-bit quantisation, and it can go toe-to-toe with a dual-nvidia-4090 system that draws a kilowatt of power.
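A minimal sketch of that sizing claim (the ~20% overhead factor for KV cache and runtime buffers is my assumption, not a measured figure):

```python
# Approximate RAM needed for a quantized model: params * bytes-per-weight,
# plus an assumed ~20% for KV cache and runtime overhead.

def footprint_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough weight footprint in GB for a quantized model."""
    return params_billion * bits / 8 * overhead

print(f"7B  @ 4-bit: ~{footprint_gb(7, 4):.1f} GB")    # fits an 8 GB GTX 1080
print(f"70B @ 4-bit: ~{footprint_gb(70, 4):.1f} GB")   # fits comfortably in 96 GB
print(f"70B @ fp16 : ~{footprint_gb(70, 16):.0f} GB")  # would not fit at all
```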
Low-bit-depth quantization puts GPUs at quite a disadvantage, since they can't use their massively parallel floating-point units and often spend a lot of time packing and unpacking numbers.
However, minor tweaks to the design of GPUs could probably make them 5x quicker at directly processing quantized data (a 4-bit multiplier is so small you can afford to do thousands of them per clock cycle in a tiny silicon area), and I'm sure the next generation of GPUs will have that.
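To make the packing/unpacking point concrete, here is a small NumPy sketch (my illustration, not how any particular GPU kernel is actually written) of the extra shuffling a 4-bit weight load involves before the multiply can even start:

```python
import numpy as np

# Two 4-bit weights live in each byte, so before any multiply the kernel has
# to shift/mask them apart and map them back to a signed range. Hardware with
# native int4 multipliers would skip this entire step.

rng = np.random.default_rng(0)
weights_int4 = rng.integers(0, 16, size=8, dtype=np.uint8)   # raw values 0..15
packed = (weights_int4[0::2] << 4) | weights_int4[1::2]      # 2 weights per byte

# Unpacking: the work referred to above as "packing and unpacking numbers".
hi = (packed >> 4) & 0x0F
lo = packed & 0x0F
unpacked = np.empty(8, dtype=np.int8)
unpacked[0::2], unpacked[1::2] = hi, lo
unpacked -= 8                                                 # recentre to -8..7

assert np.array_equal(unpacked + 8, weights_int4)
print(packed.nbytes, "bytes packed ->", unpacked.nbytes, "bytes after unpack")
```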
Isn't a big part of the strategy to have a fast core, do all the work quickly, and spend most of the time asleep? If you can do your work quicker you spend more time in sleep and save power? (There's a rough sketch of that trade-off below.)
I don't really care about my phone being faster, but I would always take better battery life.
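A toy version of that "race to idle" bookkeeping, with completely made-up power and duration numbers; whether racing actually wins depends on how much extra power the faster core burns while active:

```python
# All wattages and durations here are assumed for illustration only.

def energy_joules(active_w: float, active_s: float, idle_w: float, window_s: float) -> float:
    """Energy over a fixed window: one active burst, then idle for the rest."""
    return active_w * active_s + idle_w * (window_s - active_s)

WINDOW_S, IDLE_W = 10.0, 0.05                 # 10 s window, 50 mW idle (assumed)

# Fast core: finishes the task in 1 s at 3 W. Slow core: 4 s at 1 W.
print(f"race to idle : {energy_joules(3.0, 1.0, IDLE_W, WINDOW_S):.2f} J")
print(f"slow and long: {energy_joules(1.0, 4.0, IDLE_W, WINDOW_S):.2f} J")
# If the fast burst needed, say, 6 W instead of 3 W, the slow core would win.
```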
No, but I have noticed that modified smartphone processors (like Apple’s M series) are now faster and more efficient than traditional desktop and laptop processors. I welcome their introduction into Windows laptops.
I’ve also noticed that there are emerging AI applications that cannot be run on-device by today’s smartphones. I would very much like an LLM Siri that doesn’t need a network connection.
What does “a synergy of mobile design and desktop design” mean? It’s the same cores, but more of them. IIRC, they’ve made cache structures larger as well.
All of these terms are largely marketing as far as I’m aware. Yes, Intel is behind in lithography, but not by as much as the difference between 3nm and 10nm would suggest, and I’m pretty sure the transistor/feature sizes are not 1/3 the size of Intel’s processes.
It will be interesting to see what happens with the Intel 18A process and who builds chips with them. Intel could also rapidly regain the performance crown if these are good processes too.
Nope, my phones are like my computers nowadays, they only get replaced when they break down.
Even in terms of OS features, there's hardly anything that makes me feel like updating to the latest Android version, other than Google finally acknowledging Java isn't going anywhere and starting to follow up on language updates (Android 12 onwards).
Naturally something that OEMs aren't that happy about.
I went from an iPhone 11 Pro to a 15 Pro recently, and four years' difference is definitely noticeable, but it's not a huge deal (honestly I upgraded mostly for 5G and the 120Hz screen).
I just upgraded from a Pixel 3 to a Pixel 8, and while I've read that it's a slow phone, I don't see any performance issue.
I also bought a 120€ tablet with a Helio G99 CPU and it's so much faster than my old Galaxy Tab S3 and Lenovo Tab4 8 Plus.
To me it feels like we have reached the point where these devices are fast enough, much like when laptops matured: performance was no longer an issue, and upgrades were made when the device no longer worked or Microsoft made it obsolete by not allowing Windows 11 to run on it.
I think the people who say the Pixel 8 is slow are gamers. Even a cheap phone is fast enough for the vast majority of people unless they play 3D games on it.
I'm not totally sure that someone coming from a Pixel 3 has the experience necessary to determine what is fast or slow in this day and age. There could've been enormous leaps forward (there have) and just as enormous regressions (alleged) that you simply missed.
I am also using a Xiaomi Poco X3 NFC and a Gigaset GX6 [0], but not as a daily driver. 8 tablets and 3 phones which are currently turned on, and two phones which are turned off (Pixel 3 and Moto G7 Plus). I don't think I lack experience.
No, but this is the way it goes, new software will be written that will push current phone hardware to its limits, and then we'll start the cycle over again.
> Unfortunately the memory limit on iOS Safari is rather low. It’s limited to 384MB on version 15, it’s lower on earlier versions, and it’s probably device specific as well.
> As I said it depends on the device as well but it's always between 200-400MB
It's a decision, not a hardware limitation. It helps keep web bloat in check, but it's a worse experience for the user. I'm sure Apple's happy to suggest a native app without such limits, though.
In general, iPhones have less RAM than roughly corresponding Android devices. This is because iOS apps on average take less RAM to run: iOS manages memory more tightly, and Android uses Java, which is very memory hungry.
A web app though should be a situation where the memory usage will be roughly equivalent between the two platforms, so that seems reasonable.
If a 6GB-RAM iPhone runs out of memory, then how can it be better at memory management than a Samsung phone with 3GB of RAM, which does not run out of memory?
My problem is 16MB (8000×5000px) images in the web browser, which the iPhone cannot handle. The most powerful phone every year, but 16MB is too much.
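Rough arithmetic on why one of those images is far heavier in memory than the file size suggests (assuming the browser decodes to a standard 4-bytes-per-pixel RGBA bitmap, which is an assumption on my part):

```python
# A ~16 MB compressed file decodes to a much larger in-memory bitmap.
width, height, bytes_per_pixel = 8000, 5000, 4       # RGBA decode assumed
decoded_mb = width * height * bytes_per_pixel / 1024**2

print(f"compressed file: ~16 MB")
print(f"decoded bitmap : ~{decoded_mb:.0f} MB")       # ~153 MB for a single image
# Two of those plus the page itself already brush against a ~384 MB tab cap.
```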
That’s the same person who said that Safari ran out of memory on stuff their Android doesn’t.
Their point wasn’t “iPhones have more RAM than comparable Android devices” (as you say, that comparison is entirely unreasonable based on those devices), it was “this is the RAM on the devices where I observed this behaviour”. If the iPhone in question has more RAM than the Android, yet still OOMs where the Android doesn’t, that’s a pretty strong argument that iOS isn’t magically thriftier than Android at managing memory.
Oryon is based on the IP Qualcomm gained from the acquisition of Nuvia, who developed this IP under a very narrow license they got from ARM to develop for the server market.
Qualcomm is trying to transfer everything Nuvia developed under that specific narrow ARM license over to Qualcomm's broad license and use it for "powering flagship smartphones, next-generation laptops, and digital cockpits, as well as Advanced Driver Assistance Systems, extended reality and infrastructure networking solutions".
ARM has filed a lawsuit arguing that this was never within the scope of Nuvia's license; the court ruling is still pending on that one...
The only form of settlement I can imagine is for Qualcomm to transfer the Nuvia IP to ARM and be allowed to license it back on favorable terms.
Which would probably lead to it being integrated into a next-gen Blackhawk architecture, merging both paths and making Qualcomm a licensee of Blackhawk.
Seems so. The Snapdragon 8 Gen 3 benchmarks that dropped yesterday with the Samsung S24 Ultra put it on par with or above Apple's A17 Pro, or whatever their highest-end phone SoC is.
To be fair, Geekbench is a relatively poor representation of real world performance. It's heavily used since it's one of the few benchmarks that works across all major platforms and architectures, making comparisons easy, but not necessarily realistic.
ARM needs to gain about 30% in a single generation to be competitive with Apple/Qualcomm processors. It's going to be interesting to see if they can achieve this. Ultimately, competition is good for the consumer. Android phones are so far behind Apple, it's not funny anymore.
The Cortex-X4, running at 3.3GHz on N4, gets ~2250 in GB6: ~681.8/GHz.
The Apple A17, running at 3.8GHz on N3, gets ~2950 in GB6: ~776.3/GHz.
On a clock-for-clock basis, the A17 is only about 14% faster. Consider that the X4 had a 15% IPC uplift, and that resulted in a real-world 11% performance improvement on GB6. And they are claiming the X5 will have the largest YoY IPC uplift, which I think we can take to mean 15-20%, so a Cortex-X5 would have similar if not slightly better performance than the A17 on a clock-for-clock basis.
And it would be good enough for Microsoft / ARM PC.
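Reproducing that arithmetic and projecting the claimed uplift (the X5 figures are pure extrapolation from ARM's "largest YoY IPC uplift" claim, not measurements):

```python
# GB6 single-core scores and clocks quoted above.
x4_score, x4_clock = 2250, 3.3     # Cortex-X4 on N4
a17_score, a17_clock = 2950, 3.8   # Apple A17 Pro on N3

x4_per_ghz = x4_score / x4_clock
a17_per_ghz = a17_score / a17_clock
print(f"X4 : {x4_per_ghz:.1f} pts/GHz")
print(f"A17: {a17_per_ghz:.1f} pts/GHz (~{a17_per_ghz / x4_per_ghz - 1:.0%} ahead per clock)")

for uplift in (0.15, 0.20):        # assumed X5 IPC uplift range
    print(f"X5 @ +{uplift:.0%} IPC: {x4_per_ghz * (1 + uplift):.1f} pts/GHz")
```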
Apple lets their CPUs and GPUs run at a higher max frequency/power draw, but you're absolutely right that in clock-to-clock performance they're only marginally better in both CPU and GPU performance.
If you clock CPUs low enough, many CPUs from the past could match Apple in clock-to-clock performance. This is because performance does not scale linearly with clock speed.
"The absolute top end Android processors get about 76% of the performance of Apple's processors" isn't the strong argument you may think it is tbh. Per clock performance is uninteresting (it's not like performance is linear with clock speed anyway, and target clock speed is a huge part of the CPU, so I don't even understand why you'd want to adjust for it even in principle in this comparison).
Per-clock performance is THE metric. Apple can't sustain those peak clocks for more than a few seconds before throttling down. Once both chips are running at a sustainable 2-2.5GHz, the IPC starts mattering a lot.
If Apple can't sustain those clock speeds for long, that's reflected in the benchmark result. Benchmarks and real-world performance are the only metrics which matter in the end.
And higher clock speed doesn't proportionally improve either real world metrics or benchmark results, so "benchmark score divided by clock speed" is a useless metric.
The CPU peaked at 14 watts in multicore Geekbench. That's close to the peak CPU power consumption of the entire M1 chip, in devices many times larger than an iPhone.
GeekerWan had it throttling 200-300MHz when simply running SPECint/SPECfp. It essentially throttles down to the same speed as the iPhone 14 at slightly higher wattage.
For mobile devices, real-world peak CPU performance hasn't gotten much better than my aging iPhone 12 because most of the extra performance has come at the expense of heat/power.
I assume this comment is just here to supply various vaguely CPU-related technical info in case someone is curious, not to argue against my point? Because if it's the former then that's fine, but if it's the latter it doesn't really hit the mark
>it's not like performance is linear with clock speed anyway
It doesn't scale well beyond a certain clock speed, which is why extreme overclocking doesn't get you a linear performance boost. But at the clock ranges of mobile SoCs it is about as linear as it gets, i.e. clocking your SoC from 3GHz to 3.3GHz will get you somewhere around a ~10% improvement in most workloads.
>and target clock speed is a huge part of the CPU
Target clock speed is a huge consideration for power usage.
As you suggest, the performance benefit of increased clock speed is highly workload specific. I agree. It's roughly linear for some workloads up to a point, and highly non-linear for others (especially memory-bandwidth-constrained workloads, which is relevant given that we're comparing SoCs, not stand-alone CPUs).
Yes, target clock speed is a big consideration for power usage, but so are fab process, memory architecture, etc. Adjusting for clock speed makes no sense, though adjusting for power consumption might.