Fujitsu Switches Horses for Post-K Supercomputer, Will Ride ARM into Exascale

monocasa · on July 1, 2016

> Designing a manycore CPU, which the Post-K processor will almost certainly be, with a simpler RISC core as its base, is inherently more efficient than trying to do that with a more complex architecture like SPARC64.

Honestly, ARM is a very complex architecture these days. There's somewhere in the neighborhood of a thousand instructions. It's very close to x86 in complexity. Tons of cruft built up over the past thirty years too (although it was a cleaner architecture than x86 to start off with so it has that going for it).

trsohmers · on July 1, 2016

While the conversion from x86 to Intel's internal microcode obviously adds complexity, internally Intel has been a RISC load/store architecture since 2006. Since ARM moved to ARMv8, I groaningly refer to v8 as "x86 v2" with most implementations with ridiculous pipelines and OoO execution. The only actually good ARMv8 implementation I've seen doesn't actually implement ARMv8 at all internally... NVIDIA's (now basically dead) Denver architecture does Transmeta style codemorphing to support ARMv8 instructions while having a much simpler (and efficient) internal architecture.

scriptdevil · on July 2, 2016

Denver isn't close to dead. I work in the Denver team at NVIDIA. If you mean market adoption, yes - Nexus 9 is the only major device to have shipped with Denver cores as of today. But development is still very active internally.

lgeek · on July 2, 2016

Somewhat off topic, but I wish NVIDIA published more information about Denver's DBO and the reasoning behind some design decisions. For example the company line is to just generate 'decent code' and let the DBO take care of it [0], but that clearly doesn't always work in practice [1], which requires a lot of trial & error and some reverse engineering to generate good code for it. Also, the decision to always invalidate the translation when writing to an executable page (or whatever other super slow operation is done in that case) seems odd (and affects the performance of some JIT approaches) since ARM software is expected to explicitly flush the caches when modifying code anyway.

[0] https://www.youtube.com/watch?v=oEuXA0_9feM

[1] https://github.com/ssvb/tinymembench/wiki/Nexus-9-(Tegra-TK1...

lgeek · on July 2, 2016

> Since ARM moved to ARMv8, I groaningly refer to v8 as "x86 v2"

I'll have to disagree, I think ARM mostly did the right thing with the A64 instruction set. The AArch32 instruction sets were just getting ridiculous (just having multiple instruction sets for one architecture is pretty messed up), with A32, T32 and a bunch of other extensions (ThumbEE, Jazelle) which pretty much no one used. In addition, the A32 and T32 instruction sets have grown organically over time, with new instructions being crammed in any unused / invalid space, which is how we ended up with one encoding doing this one thing, except if one register is the SP or PC, then that is a whole other instruction or with immediate values being split up over 2, 3, 4 fields. Also, a few features had turned out to be pretty bad ideas: for example access to the PC as a general purpose register is overly complicated to implement in OOO cores (verification cost) and comes with performance penalties, but it still had to be supported; the IT instruction is another fun one, especially in the context of dealing with exceptions.

A64 pretty much takes the set of instructions which has emerged over time, removes the bits which are not suitable any more or were bad ideas in retrospect, fits them cleanly in a new encoding space and removes some of the flexibility which was getting in the way (e.g. VFP having either 16 or 32 double precision registers). I expect AArch32 to be gone from most high performance implementations in a few years, with a software DBT option remaining available for legacy software which still hasn't been ported, which is when the implementers will be able to fully benefit from having a simplified AArch64 compared to AArch32.

mrpippy · on July 2, 2016

> I expect AArch32 to be gone from most high performance implementations in a few years

Apple has been requiring iOS apps to include 64-bit binaries for more than a year, and in iOS 10 they're publicly shaming 32-bit apps [1] on launch. Wouldn't be surprised at all if 32-bit support is gone from iOS and Apple's SoCs by iOS 12 / 2018.

[1] http://appleinsider.com/articles/16/06/15/ios-10-warns-users...

monocasa · on July 1, 2016

Eh, OoOE and pipelines in the dozens of stages make a lot of sense at the gate count we're talking about. Sure, you won't get the best pure performance/watt, but you'll get a nice balance of raw performance and perf/watt if you play your cards right (which I'd argue Apple did with their cores, and ARM is finally catching up with their A73s it looks like). And I'm not a super huge fan of the raw ISA of the chip being hidden from you a la DAISY/Crusoe/Denver. And that's coming from someone who's haevily inspired by their techniques while writing a high performance machine to machine JIT. I'd love raw access to their underlying VLIW code though (maybe similar to 90s era Apple nanokernel with machine tranlator that allows you to start a native process context).

bogomipz · on July 1, 2016

When you say "raw ISA" and "underlying VLIW code"are you referring to the microcops/mirocde or is this something specific to ARM chips?

monocasa · on July 2, 2016

I'm talking about IBM's DAISY, Transmeta's Crusoe, and Nvidia's Denver. These are more or less fairly standard VLIW machines that are designed to only run a JIT compiler for another machine architecture (PowerPC, x86, and ARM respectively). This JITed code is the only other code running on the machine, so they basically look like native versions of the architectures they emulate.

Edit: These machines also do a lot of cool stuff under the hood. DAISY's native binary format looks more like a CFG than we pure instruction steam. The Crusoe has some really sweet exposed speculation hardware. I'm sure Denver does some really cool stuff too, there's just not a whole lot of publicly available information on it.

bogomipz · on July 2, 2016

Ah neat, I think I remember hearing about the transmeta one back when. These weren't terribly successful though right?

What is the acronym "CFG", I wasn't able to find that anywhere.

grandinj · on July 2, 2016

Control flow graph

setpatchaddress · on July 1, 2016

What would you rather have seen ARM do for a 64-bit ISA expansion? (Genuine question; I have no opinion.)

trsohmers · on July 1, 2016

Going through them would be a book in itself (The ARMv8 reference manual is itself a couple thousand pages, which I have only gone through a fraction of). I think RISC-V is the closest thing to what I would like in a simple general purpose ISA, even though it is not perfect (no branch delay slots, no post modifies/register writebacks, don't like the immediate encoding). It is a hell of a lot better than the mess ARM made, and the basic (IMAFD) instruction reference can be kept to a single page ;)

stephencanon · on July 1, 2016

> no branch delay slots

We fought that battle 15 years ago, and delay slots lost (for general-purpose CPUs).

rayiner · on July 1, 2016

Wait you like branch delay slots?!

trsohmers · on July 1, 2016

I love branch delay slots, and they are a key feature for getting highest performance in the architecture I am working on... if you have things like guaranteed latency and static scheduling (like we do with the REX Neo architecture), delay slots are VERY useful in tight loops. I understand not including them in a general purpose design with branch prediction, but I'm not a fan of branch prediction and its effects on determinism.

coderjames · on July 2, 2016

I'd never heard of the Rex Neo until your post. Looks very interesting!

Symmetry · on July 1, 2016

From the perspective of being implemented in a modern, high performance architecture the extra complexity caused by thousands of instructions is tiny. Here are some properties of the ISA that matter far more:

Ease of decode - You want your instruction stream to be dense to fit easily in I-cache. You want it to be self-synchronizing for easy multiple decode. You want the input registers to be at predictable offsets from the instruction start to speed up fetching them.

Data Dependancies - The design of your OoO execution engine will depend on the maximum number of outputs and especially inputs an instruction can have. Flags are inputs too. If some instructions can only partially update your flags (like x86) then the flags count as multiple inputs now.

Atomic execution - Handling precise exceptions from timer interrupts or page faults or whatever mean you have to go back to some coherent state. This is easier if your instructions only touch memory at most once each. Especially difficult is multiple writes to memory in an instruction but read-modify-write is also bad. Read-modify and modify-write are pretty ok but still add a bit of complexity.

And one can argue about whether it's better to have 16 or 32 registers but with x86-64 everybody these days has one of those and is good enough. Well, actually SPARC with its register windows has way more than 32 and that might be Fujitsu's biggest motivation in moving to ARM.

snaky · on July 2, 2016

> The ForwardCom project includes both a new instruction set architecture and the corresponding ecosystem of software standards, application binary interface (ABI), memory management, development tools, library formats and system functions

http://www.agner.org/optimize/blog/read.php?i=634

honkhonkpants · on July 2, 2016

That's pretty nice. It's become more and more irritating to program Intel's vector hacks, and presumably they will soon give us kilobit-long registers and vector operations upon those, which isn't going to make the instruction coding any more compact.

digi_owl · on July 1, 2016

I get the impression that there are no free lunch to be had in computing, and we are effectively hitting the end point unless we have overlooked some way to spread a workload across multiple cores that makes it all scale linearly.

In effect thermodynamics (or some relative) once more have us by the balls...

cordite · on July 1, 2016

In the general space of computing, what kinds of alternatives to the Xeon Phi are there that want to do their own mini-supercomputer tasks? (very subjective on task definition, but for now let's suppose things that are hard to do on a GPU effectively).

It seems like the Parallella tried to get close, but did not succeed in breaking into the market. (On Amazon, looks to be marketed more for a high powered raspberry pi alternative...)

Which seems to be more useful? The power to compile same code with alternate flags (Xeon Phi style); cross-compile to other arch's (x86 host -> ARM binary); JIT bytecode, LLVM bitcode, or Intermediate-Representation formats?

What kind of access patterns would be most common for a hobbyist or an enterprise to cater to? For example, one issue the Xeon Phi has is memory controller contention, which makes it less optimal for less structured relational analysis.

hangonhn · on July 1, 2016

I don't get why people keep reporting this as ARM unseating x86 when in this specific case its ARM replacing SPARC. Is there a perspective I am missing?

trhway · on July 1, 2016

interesting, to say the least, choice for Fujitsu, i mean they do have SPARC/RISC experience and given that the current top dog in the world is Alpha/RISC based https://en.wikipedia.org/wiki/Sunway_TaihuLight , it seem strange to make such a huge bet on completely new architecture - Top500 has no ARM systems at least at the top of the list.

Beside big picture, there are such pesky details as "Fortran for ARM" HPC compiler. All these bearded and not so guys who walked Sun hallways for years ... The Intel HPC compilers also well established. What is available for ARM in that department?

beautifulpeople · on July 2, 2016

I think you're missing the big picture here. Looks like Cray has been looking at ARM for awhile for Fast Forward 2 (US DoE I think, announced 2014: http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsA...). Looks like Barcelona Supercomputing Center also deployed ARM based test systems quite awhile ago (http://nvidianews.nvidia.com/news/barcelona-supercomputing-c..., http://gizmodo.com/5898633/can-an-arm-based-supercomputer-be...).

gnufx · on July 2, 2016

For what it's worth, there's current ARM propaganda [huge PDF!] at http://emit.tech/EMiT2016/Adeniyi-Jones-EMiT2016-Barcelona.p... (and incidentally something related to a more interesting sort of ARM-based system from the same meeting: http://emit.tech/EMiT2016/Navaridas-EMiT2016-Barcelona.pdf). That may not be terribly relevant to Fujitsu, though.

gnufx · on July 2, 2016

In what way is the Sunway Alpha-based? Dongarra says the ISA isn't.

This hardly looks "completely new" compared with Sunway, for instance, especially given work with it in Europe.

Fujitsu have their own compiler -- amongst their own essentially everything for K -- but GCC is "well-established" in HPC and you can see the Fortran-based HPC-type software in Fedora and Debian aarch64. (Yes, there's a lot more to it than that.)

timthorn · on July 2, 2016

Such pesky details haven't been forgotten: http://www.nag.co.uk/content/nag-broaden-64-bit-armv8-ecosys... https://www.nag.co.uk/content/nag-core-arm-performance-libra...

gnufx · on July 2, 2016

I'd be surprised if that's relevant to Fujitsu, though.

I wonder how much better the library is than the free alternatives.

undersuit · on July 1, 2016

The perspective you're missing is the x86 didn't replace SPARC in this case?