
The chicken-and-egg thing also applies to GPUs, btw.: Nvidia & PGI have supported GPU computing in Fortran for ~8 years, since the early days of CUDA.


That's a good point. Hierarchical parallelism is becoming increasingly important, so having one language that can be used both within-node and between-node is very convenient, and could add to the lock-in factor.


Good point, and this is btw. exactly where Nvidia is heading. There will be a point in the future where you just program kernels and/or map/reduce functions and/or library functions and then call them to execute on a GPU cluster, passing in a configuration for the network topology, the node-level topology (how many GPUs, how they are connected) and the chip-level topology (grid + block size).
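
A rough sketch of what that could look like from the programmer's side, assuming a hypothetical Topology struct and launch_on_node helper (nothing like this is a shipping NVIDIA API; the device count stands in for node-level topology and the grid/block shape for chip-level topology):

    #include <cuda_runtime.h>

    // Hypothetical description of the node-level topology (how many GPUs)
    // and the chip-level topology (grid + block shape) for each launch.
    struct Topology {
        int num_gpus;
        dim3 grid;
        dim3 block;
    };

    // Plain map-style kernel; the grid-stride loop makes it correct for
    // any launch shape the topology config picks.
    __global__ void scale(float* data, int n, float factor) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] *= factor;
    }

    // Slice the data across the GPUs in the node and launch with the
    // configured shape. Assumes `data` is in managed (unified) memory so
    // every GPU in the node can dereference the same pointer.
    void launch_on_node(float* data, int n, float factor, const Topology& t) {
        int per_gpu = (n + t.num_gpus - 1) / t.num_gpus;
        for (int dev = 0; dev < t.num_gpus; ++dev) {
            cudaSetDevice(dev);
            int offset = dev * per_gpu;
            int count  = (offset + per_gpu > n) ? n - offset : per_gpu;
            if (count > 0)
                scale<<<t.grid, t.block>>>(data + offset, count, factor);
        }
        for (int dev = 0; dev < t.num_gpus; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }
    }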

The address space will be shared across the whole cluster, supported by an interconnect that’s so fast that most researchers can just stop caring about communication / data locality (see how DGX-2 works).
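
The single-node version of that shared address space already exists in today's CUDA as managed (unified) memory. A minimal sketch using standard CUDA calls (nothing DGX-specific, sizes purely illustrative), with cudaMemPrefetchAsync as the optional locality hint you'd only reach for when you do still care where the data lives:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void inc(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;

        // One allocation, one pointer, valid on the host and on every GPU
        // in the node -- the runtime migrates pages instead of you copying.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        // Optional locality hint; leave it out and pages migrate on demand.
        cudaMemPrefetchAsync(data, n * sizeof(float), /*dstDevice=*/0);

        inc<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();

        std::printf("data[0] = %f\n", data[0]);  // readable on the CPU again
        cudaFree(data);
        return 0;
    }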


> The address space will be shared across the whole cluster, supported by an interconnect that’s so fast that most researchers can just stop caring about communication / data locality

There will always be people who will care because locality will always matter (thanks, physics). Improvements in technology may make it easier and cheaper to solve today's problems, but as technology improves we simply begin to tackle new, more difficult problems.

Today's chips provide more performance than whole clusters from 20 years ago and can perform yesterday's jobs on a single chip. But that doesn't mean clusters stopped being a thing.

See also The Myth of RAM, http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
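
The linked series argues that the cost of a random access grows roughly with the square root of the working-set size once you blow through the caches. A quick pointer-chasing sketch makes the effect easy to see on most machines (plain host C++, compiles fine under nvcc; sizes and step counts are illustrative):

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    // Sattolo's algorithm: a random permutation that is one big cycle, so
    // the pointer chase visits every element before repeating.
    std::vector<size_t> single_cycle(size_t n) {
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), 0);
        std::mt19937_64 rng{42};
        for (size_t i = n - 1; i > 0; --i) {
            std::uniform_int_distribution<size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }
        return next;
    }

    // Every load depends on the previous one, so prefetching can't hide the
    // latency; larger working sets fall out of successive cache levels.
    double ns_per_access(size_t n, size_t steps = size_t{1} << 23) {
        std::vector<size_t> next = single_cycle(n);
        size_t i = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t s = 0; s < steps; ++s) i = next[i];
        auto t1 = std::chrono::steady_clock::now();
        volatile size_t sink = i;  // keep the chase from being optimized away
        (void)sink;
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    }

    int main() {
        for (size_t n = size_t{1} << 12; n <= (size_t{1} << 27); n <<= 3)
            std::printf("%10zu elements: %6.2f ns/access\n", n, ns_per_access(n));
    }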


I do think there’s a paradigm shift coming. It’s a combination of the ongoing shift from latency-oriented to throughput-oriented design with the capabilities shown by the new interconnects, especially NVLink/NVSwitch. This already lets DGX-2 cover a fair amount of what would otherwise have to be programmed for midsized clusters - if it can be made to scale one more order of magnitude (i.e. ~10 DGX) I think there’s not much left that wouldn’t fit there but would fit something like Titan. Not much is so embarrassingly parallel that communication overhead doesn’t constrain it, and if it isn’t constrained, you again don’t care much about data locality, as it becomes trivial (e.g. a compute-intensive map function).


C++ and Fortran support on CUDA was one of the big reasons why OpenCL was left behind.

OpenCL now at least supports C++14, but driver support still doesn’t seem to be quite there, from what I gather reading the interwebs.
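
For a concrete sense of what that C++ support buys in practice, here is the kind of templated device code CUDA accepts and a plain C kernel language couldn't express (a generic sketch; d_in/d_out are assumed device pointers, not from any particular codebase):

    #include <cuda_runtime.h>

    // Templates and functors work directly in device code, so one kernel
    // covers float/double and any element-wise operation.
    template <typename T, typename Op>
    __global__ void transform(const T* in, T* out, int n, Op op) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = op(in[i]);
    }

    struct Square {
        template <typename T>
        __device__ T operator()(T x) const { return x * x; }
    };

    // e.g.  transform<<<(n + 255) / 256, 256>>>(d_in, d_out, n, Square{});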



