ARM's "Blackhawk" CPU Is a Plan to Have the Best Smartphone CPU Core (moorinsightsstrategy.com)
85 points by ksec on Jan 17, 2024 | 84 comments


> Arm also believes that Blackhawk will provide “great” LLM performance. I will assume that this has to do with big CPU IPC performance improvements as Arm says that its Cortex CPU is the #1 AI target for developers. I didn’t run the developer or app survey myself, but this makes sense to me as most AI inference in the datacenter is run on the CPU. The NPU and GPU can be an efficient way to run AI, but a CPU is the easiest and most pervasive way, which is why developers target it. A higher-performing CPU obviously helps here, but as the world moves increasingly to smaller language models, Arm’s platform with higher-performing CPU and GPU combined with its tightly integrated ML libraries and frameworks will likely result in a more efficient experience on devices.

I feel like they are conflating 2 unrelated things?

There is "CPU inference" which is indeed pervasive and huge, and there is "generative ai llm inference" which is absolutely anemic on CPU, especially outside of tiny-context toy model demos, in spite of the popularity of llama.cpp and other cpu platforms .

I would hope smartphone users aren't running anything but extremely tiny LLMs on CPU. The IGP is not just theoretically but currently massively more efficient than any 2024 CPU could be.


4-bit quantized llama gets borderline-acceptable performance on an iPad CPU[1], and there's no reason they can't further optimise CPU perf for this task. IIUC, LLM inference is mostly constrained by memory size and bandwidth (hence the aggressive quantization), rather than raw compute.

[1] https://github.com/ggerganov/llama.cpp/discussions/844
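A rough back-of-envelope that illustrates the bandwidth point (the numbers below are illustrative assumptions, not measurements of any specific device): at batch size 1, every generated token has to stream essentially the whole weight file through memory once, so tokens/s is capped by bandwidth divided by model size.

    # Sketch: bandwidth-bound decoding ceiling at batch size 1.
    # Assumption: each generated token reads roughly all the weights once.
    def max_tokens_per_sec(params_billion, bits_per_weight, bandwidth_gb_s):
        weights_gb = params_billion * bits_per_weight / 8
        return bandwidth_gb_s / weights_gb

    # Illustrative numbers only, not vendor specs:
    print(max_tokens_per_sec(7, 4, 100))   # 7B @ 4-bit, ~100 GB/s memory -> ~29 tok/s ceiling
    print(max_tokens_per_sec(7, 16, 100))  # same model at fp16 -> ~7 tok/s ceiling

This is why quantization helps twice: it shrinks the model to fit in RAM and it cuts the bytes moved per token.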


I guess it depends on how you're defining "great" LLM performance and "CPU performance supremacy versus Apple".

Personally I would say "borderline-acceptable performance" with a 7B-parameter model at 4-bit quantisation is nowhere near "great" or "supremacy". You can fit that in ~7GB of RAM, on your grandpa's 2016-era Nvidia GTX 1080.

To take the ARM-CPU-LLM-performance crown from Apple you have to beat a fully equipped M2 Max, which provides 96GB of memory with 400GB/s of bandwidth. That's enough to run 70B parameter models in 4-bit quantisation, and it can go toe-to-toe with a dual-nvidia-4090 system that draws a kilowatt of power.
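The same back-of-envelope for the 70B / M2 Max numbers above (400 GB/s and 96 GB come from the parent comment; the rest is estimation):

    weights_gb = 70 * 4 / 8     # 70B params at 4 bits/weight -> ~35 GB
    print(weights_gb < 96)      # fits, with room left for KV cache and the OS
    print(400 / weights_gb)     # ~11 tokens/s if fully bandwidth-bound

So the gap described here is largely a memory-capacity and memory-bandwidth story rather than a core-IPC one.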


Apple achieves that performance on the IGP.

Once the context grows, even the M2 Max's CPU alone really starts to chug.


Low-bit-depth quantization puts GPUs at quite a disadvantage, since they can't use their massively parallel floating-point units directly and often spend a lot of time packing and unpacking numbers.

However, minor tweaks to the design of GPUs could probably make them 5x quicker at directly processing quantized data (a 4-bit multiplier is so small you can afford to do thousands of them per clock cycle in a tiny silicon area), and I'm sure the next generation of GPUs will have that.
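For what it's worth, here is a toy sketch of what the packing/unpacking overhead looks like for 4-bit weights (the layout is illustrative; real kernels such as llama.cpp's Q4 formats use block-wise scales and different layouts, but the extra shift/mask/convert work per weight is the same idea):

    # Two 4-bit weights packed into one byte, plus a per-block scale.
    # Each weight costs a mask, a shift, a recentre and a convert before
    # the actual multiply, which is the overhead described above.
    def unpack_q4(packed_byte, scale):
        lo = (packed_byte & 0x0F) - 8          # low nibble, recentred to [-8, 7]
        hi = ((packed_byte >> 4) & 0x0F) - 8   # high nibble
        return lo * scale, hi * scale          # dequantized weights

    print(unpack_q4(0xA3, 0.05))   # -> (-0.25, 0.1)

Dedicated int4 datapaths skip most of that work, which is the kind of "minor tweak" being suggested.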


Are there situations where GPUs are compute bottlenecked at batch size 1? For the time being at least, that's how local models will be used.


Everyone's using staged speculative decoding and beam search now, so effectively N=250 or so to get a bit more performance out.
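A toy illustration of the draft-and-verify idea (the acceptance model and numbers are made up; real staged/speculative decoders are more involved): the large model verifies several drafted tokens per forward pass instead of generating one at a time, which raises the effective batch size on the big model.

    import random

    def tokens_per_target_pass(total=1000, draft_len=8, accept_prob=0.7):
        # Toy assumption: each drafted token is independently accepted with
        # probability accept_prob; the first rejection ends the run.
        produced, passes = 0, 0
        while produced < total:
            passes += 1                      # one batched verification pass
            accepted = 0
            for _ in range(draft_len):
                if random.random() < accept_prob:
                    accepted += 1
                else:
                    break
            produced += accepted + 1         # +1 token from the target model itself
        return produced / passes

    print(tokens_per_target_pass())          # ~3 tokens per big-model pass instead of 1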


At long contexts.


> in spite of the popularity of llama.cpp and other cpu platforms

Just to clarify: llama.cpp has also added some GPU acceleration over time.


Yes, but not anything usable on IGPs.


Is this finally what the Sophia-Antipolis team has been working on? IIRC, the last few years have all been Austin.


Interesting, when did it move?


Arm's teams work in parallel: one does a longer-term design while the other is shipping.


In all seriousness - has anyone noticed a lack of performance on their phone recently?


Isn't a big part of the strategy to have a fast core, do all the work quickly, and spend most of the time in sleep? If you can do your work quicker you spend more time asleep and save power.

I don't really care about my phone being faster, but I would always take better battery life.
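Race-to-idle in toy numbers (all made up, just to show the shape of the argument): a core that burns more power but finishes sooner can still win on total energy once idle time is counted.

    def energy_joules(active_w, active_s, idle_w, window_s):
        return active_w * active_s + idle_w * (window_s - active_s)

    # One second of wall-clock time, illustrative power figures:
    slow = energy_joules(1.0, 0.8, 0.05, 1.0)   # 1 W core busy for 0.8 s -> 0.81 J
    fast = energy_joules(2.0, 0.3, 0.05, 1.0)   # 2 W core busy for 0.3 s -> 0.635 J
    print(slow, fast)

If the faster core needs disproportionately more power per unit of work, the trade flips.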


Race to sleep!


"hurry up and wait" has never felt so good.


All CPUs idle at the same speed.


But some enter/leave idle faster than others.


The Cortex X cores have probably gone past the sweet spot into the zone where they use more energy to go faster.


No, but I have noticed that modified smartphone processors (like Apple’s M series) are now faster and more efficient than traditional desktop and laptop processors. I welcome their introduction into Windows laptops.

I’ve also noticed that there are emerging AI applications that cannot be run on-device by today’s smartphones. I would very much like an LLM Siri that doesn’t need a network connection.


Calling the Apple Mx "a modified phone CPU" is doing it a disservice. It's a synergy of mobile design with desktop design.


What does "a synergy of mobile design and desktop design" mean? It's the same cores, but more of them. IIRC, they've made the cache structures larger as well.


Most Intel processors are still made on 10nm, which Intel deceptively calls “Intel 7”.

Even the best 10nm processor can’t physically compete with a 3nm one on efficiency.


All of these terms are largely marketing as far as I'm aware. Yes, Intel is behind in lithography, but not by as much as the difference between 3nm and 10nm would suggest, and I'm pretty sure the transistor/feature sizes are not 1/3 the size of Intel's processes.

It will be interesting to see what happens with the Intel 18A process and who builds chips on it. Intel could also rapidly regain the performance crown if it turns out to be a good process.


Intel 10nm is actually denser than TSMC N7. It has also had two massive performance upgrades, which should put it around, and maybe ahead of, TSMC N6.


So Ryzen? Same fabs and nodes as Apple.


Nope, my phones are like my computers nowadays: they only get replaced when they break down.

Even in terms of OS features, there's hardly anything that makes me feel like updating to the latest Android version, other than Google finally acknowledging Java isn't going anywhere and starting to follow up on language updates (Android 12 onwards).

Naturally something that OEMs aren't that happy about.


I went from an iPhone 11 Pro to a 15 Pro recently, and four years' difference is definitely noticeable, but it's not a huge deal (honestly I upgraded mostly for 5G and the 120Hz screen).


I just upgraded from a Pixel 3 to a Pixel 8, and while I've read that it's a slow phone, I don't see any performance issue.

I also bought a 120€ tablet with a Helio G99 CPU and it's so much faster than my old Galaxy Tab S3 and Lenovo Tab4 8 Plus.

To me it feels like we have reached the point where these devices are fast enough, much like when laptops matured: performance was no longer an issue, and upgrades happened when the device no longer worked or Microsoft made it obsolete by not allowing Windows 11 to run on it.


I think the people who say the Pixel 8 is slow are gamers. Even a cheap phone is fast enough for the vast majority of people unless they play 3D games on it.


I'm not totally sure that someone coming from a Pixel 3 has the experience necessary to determine what is fast or slow in this day and age. There could've been enormous leaps forward (there have) and just as enormous regressions (alleged) that you simply missed.


I am also using a Xiaomi Poco X3 NFC and a Gigaset GX6 [0], but not as daily drivers. I have 8 tablets and 3 phones currently turned on, and two phones turned off (Pixel 3 and Moto G7 Plus). I don't think I lack experience.

[0] https://www.gigaset.com/hq_en/gigaset-gx6/


No, but this is the way it goes, new software will be written that will push current phone hardware to its limits, and then we'll start the cycle over again.


No, serious question: how/when would I even notice that? I rarely do resource-intensive stuff on my phone.


Yes, my iPhone 14 runs out of memory in Safari.

It never happens with the same web app on a 5-year-old Samsung...


This unsourced SO answer says:

> Unfortunately the memory limit on iOS Safari is rather low. It’s limited to 384MB on version 15, it’s lower on earlier versions, and it’s probably device specific as well.

> As I said it depends on the device as well but it's always between 200-400MB

It's a decision, not a hardware limitation. It helps keep web bloat in check, but it's a worse experience for the user. I'm sure Apple's happy to suggest a native app without such limits, though.


Apple/WebKit support said that a 16MB image is gigantic and not meant for the web. (It's a big-image JS viewer with zooming and panning.)

But it's no problem for a very old low-end Android phone, or for IE11 on a 10-year-old laptop...


In general, iPhones have less RAM than roughly corresponding Android devices. This is because iOS apps on average take less RAM to run: iOS manages memory more tightly, and Android uses Java, which is memory hungry.

A web app, though, should be a situation where memory usage is roughly equivalent between the two platforms, so that seems reasonable.


Was this written by an LLM trained on social media or what? That's not how any of it works.


Probably, it does not make any sense.

If a 6GB-RAM iPhone runs out of memory, then how can it be better at memory management than a Samsung phone with 3GB of RAM, which does not run out of memory?

My problem is 16MB (8000×5000px) images in the web browser, which the iPhone cannot handle. The most powerful phone every year, but 16MB is too much.
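Rough math on why the browser balks at that image (assuming the usual 4 bytes per pixel once decoded; the 16MB figure is the compressed file size):

    width, height = 8000, 5000
    print(width * height * 4 / 1e6)   # 160.0 MB decoded, before tiles or canvas copies

Against a per-tab budget in the few-hundred-MB range, as quoted above, a single decoded bitmap like that is most of the allowance.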


The original poster didn't specify what the RAM was on the two phones...


I do development on iOS. This is exactly how this works.


The Samsung A5 (2017) has 3GB of RAM vs 6GB in the iPhone 14.


Those are the phones? Well then, that's weird!


I don't think comparing a 2017 phone to a 2023 one is a fair comparison by any stretch.

If you want something more similar I would suggest the standard Samsung Galaxy S23, which has 8GB compared to the iPhone's 6GB.


That’s the same person who said that Safari ran out of memory on stuff their Android doesn’t.

Their point wasn’t “iPhones have more RAM than comparable Android devices” (as you say, that comparison is entirely unreasonable based on those devices), it was “this is the RAM on the devices where I observed this behaviour”. If the iPhone in question has more RAM than the Android, yet still OOMs where the Android doesn’t, that’s a pretty strong argument that iOS isn’t magically thriftier than Android at managing memory.


How is it not comparable?

The 2017 phone does NOT run out of memory, but the 2023 model RUNS out of memory?

The newer phone is worse.


Ah, my bad, I misinterpreted it.


The market for this core will be a lot smaller since Qualcomm will be using Oryon instead. I guess Mediatek will be happy.


Remains to be seen.

Oryon is based on the IP Qualcomm gained from the acquisition of Nuvia, who developed that IP under a very narrow license they got from ARM to develop for the server market.

Qualcomm is trying to transfer everything Nuvia developed under that specific narrow ARM license to Qualcomm's broad license and use it for "powering flagship smartphones, next-generation laptops, and digital cockpits, as well as Advanced Driver Assistance Systems, extended reality and infrastructure networking solutions"

ARM has filed a lawsuit arguing that this was never in the scope of Nuvia's license; the court ruling is still pending on that one...

The related court-filing is worth a read: https://s3.documentcloud.org/documents/22273195/arm-v-qualco...


Obviously the lawsuit has to be settled before Qualcomm can release Oryon, but at this point they've invested so much in it that I think they will settle.


The only form of settlement I can imagine is Qualcomm transferring the Nuvia IP to ARM and being allowed to license it back on favorable terms.

Which would probably lead to it being integrated into a nextgen Blackhawk architecture, merging both paths and making Qualcomm a licensee of Blackhawk.

But hey, I'm no lawyer ;)


I predict they'll pay the server royalty that Nuvia negotiated plus some penalty.


What this news suggests is that Cortex X5 will be similar to M3 and hence competitive against Oryon.


Is Qualcomm actually catching up with Apple on CPU? (I haven’t been tracking it).


Seems so. The Snapdragon 8 Gen 3 benchmarks that dropped yesterday with the Samsung S24 Ultra put it on par with or above Apple's A17 Pro, or whatever their most high-end phone SoC is.


Single-core still lags way behind. The Snapdragon 8 Gen 3 has the same single-core performance as a 3-year-old A15.

They are staying even with / besting Apple in multicore by adding more cores than Apple does.


To be fair, Geekbench is a relatively poor representation of real world performance. It's heavily used since it's one of the few benchmarks that works across all major platforms and architectures, making comparisons easy, but not necessarily realistic.


This is regarding reference CPU designs from ARM. Qualcomm plans to use the Oryon cores from their Nuvia acquisition.


Yes, Qualcomm Oryon is tied with Apple A17/M3 although it's coming out 9-16 months later.


The Oryon hasn't come out yet, so it's hard to tell what's really happening with it.


Can we really make confident predictions for chips that are a year out from release?


It's finished and benchmarks are out there.



Performance also depends on how well the cores integrate with the manufacturing processes of TSMC and Samsung.


But, will it have mainline support?


The CPU would, but I think the GPU will remain a blob.


ARM would need to gain about 30% in a single generation to be competitive with Apple/Qualcomm processors. It's going to be interesting to see if they can achieve this. Ultimately, competition is good for the consumer. Android phones are so far behind Apple that it is not funny anymore.


They are not far off.

The Cortex X4, running at 3.3GHz on N4, gets ~2250 in GB6, i.e. ~681.8 points/GHz.

The Apple A17, running at 3.8GHz on N3, gets ~2950 in GB6, i.e. ~776.3 points/GHz.

On a clock-for-clock basis, the A17 is only about 14% faster. Consider that the X4 had a 15% IPC uplift, which resulted in a real-world 11% performance improvement in GB6. And they are claiming the X5 will have the largest YoY IPC uplift, which I think we can take to be 15%+ or 20%, so a Cortex X5 would have similar, if not slightly better, performance than the A17 on a clock-for-clock basis.

And it would be good enough for Microsoft / ARM PC.
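The arithmetic behind those per-GHz figures, plus the projection under the 15-20% uplift assumption from the comment above:

    x4_score, x4_ghz = 2250, 3.3      # Cortex X4 GB6 figures from the comment above
    a17_score, a17_ghz = 2950, 3.8    # Apple A17 GB6 figures from the comment above

    x4_per_ghz = x4_score / x4_ghz            # ~682 points/GHz
    a17_per_ghz = a17_score / a17_ghz         # ~776 points/GHz
    print(a17_per_ghz / x4_per_ghz)           # ~1.14 -> A17 ~14% faster per clock

    for uplift in (0.15, 0.20):               # the claimed "largest YoY uplift" range
        print(x4_per_ghz * (1 + uplift))      # ~784 and ~818 points/GHz, i.e. A17-class

Whether per-GHz is the right way to compare is argued below, but this is where the "similar to A17 clock-for-clock" claim comes from.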


Clockspeed is influenced by design choices. ARM can't just crank up the clock speed to match Apple's chip.

Another thing I'd like to see is perf/watt.


Apple lets their CPUs and GPUs run at a higher max frequency/power draw, but you're absolutely right that in clock-to-clock performance they're only marginally better in both CPU and GPU performance.


If you clock CPUs low enough, many CPUs in the past could match Apple in clock-to-clock performance. This is because clock speed does not scale with performance.


>This is because clock speed does not scale with performance.

You may want to have a word with AMD and Intel about that.

>many CPUs in the past could match Apple in clock-to-clock performance.

Are we talking about synthetic benchmarks or real-world workloads?


Clock speed does not scale linearly with performance.

Synthetic. Something like Geekbench.


"The absolute top end Android processors get about 76% of the performance of Apple's processors" isn't the strong argument you may think it is tbh. Per clock performance is uninteresting (it's not like performance is linear with clock speed anyway, and target clock speed is a huge part of the CPU, so I don't even understand why you'd want to adjust for it even in principle in this comparison).


Per-clock performance is THE metric. Apple can't sustain those peak clocks for more than a few seconds before throttling down. Once both chips are running at a sustainable 2-2.5GHz, the IPC starts mattering a lot.


If Apple can't sustain those clock speeds for long, that's reflected in the benchmark result. Benchmarks and real-world performance are the only metrics which matter in the end.

And higher clock speed doesn't proportionally improve either real world metrics or benchmark results, so "benchmark score divided by clock speed" is a useless metric.


Geekerwan has a great iPhone 15 Pro review that covers this.

https://www.youtube.com/watch?v=iSCTlB1dhO0

The CPU peaked out at 14 Watts in multicore Geekbench. That's close to the peak CPU power consumption of the entire M1 chip in devices many times larger than an iPhone.

Geekerwan had it throttling by 200-300MHz when simply running SPECint/SPECfp. It essentially throttles down to the same speed as the iPhone 14 at slightly higher wattage.

For mobile devices, real-world peak CPU performance hasn't gotten much better than my aging iPhone 12's, because most of the extra performance has come at the expense of heat/power.


I assume this comment is just here to supply various vaguely CPU-related technical info in case someone is curious, not to argue against my point? Because if it's the former then that's fine, but if it's the latter it doesn't really hit the mark


Just want to say thank you. Sometimes I get tired and I can't be bothered to do their homework to get all the links for them.

But I am glad on HN people will fill the gap.


>it's not like performance is linear with clock speed anyway

It doesn't scale well beyond a certain clock speed, which is why extreme overclocking doesn't get you a linear performance boost. But at the ranges mobile SoCs run at, it's about as linear as it gets, i.e. clocking your SoC from 3GHz to 3.3GHz will get you somewhere around ~10% improvement in most workloads.

>and target clock speed is a huge part of the CPU

Target clock speed is a huge consideration for power usage.


As you suggest, the performance benefit of increased clock speed is highly workload-specific. I agree. It's roughly linear for some workloads up to a point, and highly non-linear for others (especially memory-bandwidth-constrained workloads, which is relevant given that we're comparing SoCs, not stand-alone CPUs).

Yes, target clock speed is a big part of power usage, but so are fab process, memory architecture, etc. Adjusting for clock speed makes no sense, though adjusting for power consumption might.


So the A17 still benchmarks over 30% faster than the Cortex X4 in single-threaded tasks. Sounds about right based on my subjective experience.



