
What?!

Of course it does in real life. Unless you're working on very small amounts of data, cache latencies (where Intel chips, non-Atom at any rate, generally have much lower latencies), cache prefetchers, and branch prediction units (where Intel is generally 5-6 years ahead) can make the difference between the FP units being constantly busy and regularly stalling while waiting for data.




In mainstream raw-flops workloads (things like LAPACK), a correct implementation re-uses the data from each load many times, so the FPUs are not "stalled waiting for data". Unless the software implementation is terrible, the memory hierarchy does not pose a significant bottleneck for these tasks, and even older ARM designs like the Cortex-A9 can achieve > 75% FPU utilization, comparable to x86 cores.
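To make the data re-use point concrete, here is a minimal blocked matrix-multiply sketch in C (illustrative only; not how any tuned BLAS is actually written, and N/B are made-up values): each BxB tile loaded from memory feeds on the order of B multiply-adds, which is why the FPUs stay busy even behind a modest cache.

    /* Hedged sketch of data re-use in dense linear algebra.
     * N and B are illustrative, not tuned for any specific core. */
    #include <stddef.h>

    #define N 1024   /* matrix dimension (assumed, for illustration) */
    #define B 64     /* block size chosen so a few BxB tiles fit in cache */

    void gemm_blocked(const double *A, const double *Bm, double *C)
    {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t jj = 0; jj < N; jj += B)
                    /* each element of the A and Bm tiles loaded here is
                     * re-used roughly B times by the inner loops below */
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t k = kk; k < kk + B; k++) {
                            double a = A[i * N + k];
                            for (size_t j = jj; j < jj + B; j++)
                                C[i * N + j] += a * Bm[k * N + j];
                        }
    }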

There are more specialized HPC workloads (sparse matrix computation, for example) where gather and scatter operations are critical, and the efficiency of the memory hierarchy comes much more into play (but in these cases even current x86 designs are stalled waiting for data). There are also streaming workloads (which you seem to reference) where you have O(1) FPU ops per data element; these stress raw memory throughput and prefetch. However, one doesn't typically use either class to make a general claim about which core is "more competitive (FLOPS-wise)", precisely because they are so dependent on other parts of the system.
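For contrast, here are rough sketches of those two other classes of workload (illustrative, not taken from any benchmark suite): a STREAM-style triad with O(1) flops per element streamed, and a CSR sparse matrix-vector product whose indexed gather is at the mercy of the cache/memory hierarchy on any core, x86 included.

    #include <stddef.h>

    /* Streaming: 2 flops per 3 doubles moved -- memory-bandwidth bound,
     * so the FPUs mostly wait regardless of how strong they are. */
    void triad(double *a, const double *b, const double *c,
               double scalar, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }

    /* Sparse matrix-vector product in CSR form: x[col[j]] is a gather
     * whose cost is dominated by cache latency and prefetch quality. */
    void spmv_csr(const double *val, const size_t *col,
                  const size_t *row_ptr, const double *x,
                  double *y, size_t nrows)
    {
        for (size_t r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (size_t j = row_ptr[r]; j < row_ptr[r + 1]; j++)
                sum += val[j] * x[col[j]];
            y[r] = sum;
        }
    }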


fair enough - I guess I might be biased in my interpretation of what "normal" use-cases are...


I'm talking about real-world usage - not benchmarks that only stretch the NEON FP units in unrealistic situations.


What "real-world usage" are you talking about, specifically?

EDIT: looking at your comment history, you seem to be focused on VFX tasks, which tend to be entirely bound by the memory hierarchy; even on x86 the FPU spends most of its time waiting for data. For a workload like that, you absolutely want to buy the beefiest cache/memory system you can, but that shouldn't be confused with a processor being more competitive "flops-wise".


I don't know why you guys are arguing; he wouldn't have much use for this ARM part anyway. Where it will excel is in CRUD web apps. The ARM part is there just to shuttle data between the 10GbE and the 128GB of RAM.



