
What?!

Of course it does in real life. Unless you're working on very small amounts of data, cache latencies (where Intel chips, non-Atom at any rate, generally have much lower latencies), cache prefetchers, and branch prediction units (where Intel is generally 5-6 years ahead) can make the difference between the FP units being constantly busy and regularly stalling while waiting for data.




In mainstream raw-flops workloads (things like LAPACK), a correct implementation re-uses the data from each load many times, so the FPUs are not "stalled waiting for data". Unless the software implementation is terrible, the memory hierarchy does not pose a significant bottleneck for these tasks, and even older ARM designs like the Cortex-A9 can achieve > 75% FPU utilization, comparable to x86 cores.
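To make the data re-use point concrete, here is a minimal blocked matrix-multiply sketch in C (illustrative only; not how any tuned BLAS is actually written, and N/B are made-up values): each BxB tile loaded from memory feeds on the order of B multiply-adds, which is why the FPUs stay busy even behind a modest cache.

    /* Hedged sketch of data re-use in dense linear algebra.
     * N and B are illustrative, not tuned for any specific core. */
    #include <stddef.h>

    #define N 1024   /* matrix dimension (assumed, for illustration) */
    #define B 64     /* block size chosen so a few BxB tiles fit in cache */

    void gemm_blocked(const double *A, const double *Bm, double *C)
    {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t jj = 0; jj < N; jj += B)
                    /* each element of the A and Bm tiles loaded here is
                     * re-used roughly B times by the inner loops below */
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t k = kk; k < kk + B; k++) {
                            double a = A[i * N + k];
                            for (size_t j = jj; j < jj + B; j++)
                                C[i * N + j] += a * Bm[k * N + j];
                        }
    }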

There are more specialized HPC workloads (sparse matrix computation, for example) where gather and scatter operations are critical, and the efficiency of the memory hierarchy comes much more into play (but in these cases even current x86 designs are stalled waiting for data). There are also streaming workloads (which you seem to reference) where you have O(1) FPU ops per data element; these stress raw memory throughput and prefetch. However, one doesn't typically use either class to make a general claim about which core is "more competitive (FLOPS-wise)", precisely because they are so dependent on other parts of the system.
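For contrast, here are rough sketches of those two other classes of workload (illustrative, not taken from any benchmark suite): a STREAM-style triad with O(1) flops per element streamed, and a CSR sparse matrix-vector product whose indexed gather is at the mercy of the cache/memory hierarchy on any core, x86 included.

    #include <stddef.h>

    /* Streaming: 2 flops per 3 doubles moved -- memory-bandwidth bound,
     * so the FPUs mostly wait regardless of how strong they are. */
    void triad(double *a, const double *b, const double *c,
               double scalar, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }

    /* Sparse matrix-vector product in CSR form: x[col[j]] is a gather
     * whose cost is dominated by cache latency and prefetch quality. */
    void spmv_csr(const double *val, const size_t *col,
                  const size_t *row_ptr, const double *x,
                  double *y, size_t nrows)
    {
        for (size_t r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (size_t j = row_ptr[r]; j < row_ptr[r + 1]; j++)
                sum += val[j] * x[col[j]];
            y[r] = sum;
        }
    }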


fair enough - I guess I might be biased in my interpretation of what "normal" use-cases are...


I'm talking about real-world usage - not benchmarks that only stretch the NEON FP units in unrealistic situations.


What "real-world usage" are you talking about, specifically?

EDIT: looking at your comment history, you seem to be focused on VFX tasks, which tend to be entirely bound by the memory hierarchy; even on x86 the FPU spends most of its time waiting for data. For a workload like that, you absolutely want to buy the beefiest cache/memory system you can, but that shouldn't be confused with a processor being more competitive "flops-wise".


I don't know why you guys are arguing; he wouldn't have much use for this ARM part anyway. Where it will excel is in CRUD web apps. The ARM part is there just to shuttle data between the 10GbE and the 128GB of RAM.



