New GPU-Accelerated Supercomputers Change the Balance of Power on the TOP500 (top500.org)
113 points by conductor on June 28, 2018 | 70 comments



What's the definition of "one supercomputer" for the purposes of TOP500?

For example, why doesn't one of Google's warehouses qualify? Or the whole of Google, for that matter. A bit of googling didn't turn up anything very satisfactory.


One thing Google et al. are missing from a typical supercomputer is InfiniBand-style interconnects. They provide integration with parallel data libraries like MPI and offer "3D" networking that takes into account the physical distance between nodes, and they can do single-rack mesh networking to avoid the overhead of switching. Despite Google having lots of compute power, they probably can't leverage it in the way the LINPACK benchmark needs.


The interconnects have gotten waaaayyy better in datacenters over the past five years or so when compared to Infiniband. Stuff like FPGAs doing data plane routing, and all of the "converged Ethernet" standards like RoCE have really narrowed the gap between Ethernet and Infiniband.


However much Ethernet has changed, it's not appropriate for typical HPC use (cf. IB, OPA, Aries, ...). I investigated Cisco's UCS at some length, for instance, which they had persuaded an important person to buy, and that's at least partly the baby of an MPI person. (I recall InfiniBand was designed as a "datacentre" thing.)


Can you expand on why?


HPL is not especially sensitive to network latency. There are many data centers that could run HPL and get on the list, but don't care to pull that (relatively expensive) stunt. Among scientific applications, a significant fraction really depend on the high-end networks while others would be fine without.


In the past, EC2-based clusters have made the Top500. Somewhat unique then because they were the only virtualized systems, but those were just 10-gig Ethernet, and HPL runs really well on those. In all cases we did it on a relatively small number of machines before they had been publicly launched (essentially we used HPL as a stress test).

(Work at AWS)


Look at the % of peak for Ethernet systems vs. the various InfiniBand speeds, or Intel Omni-Path, and you'll see that there's a significant difference between Ethernet, InfiniBand, and Omni-Path.


This is just wrong. 100 gigabit ethernet switches are commonplace in the datacenters of companies like google, facebook, etc.

Facebook even open sourced their second generation 100 gigabit switch design years ago.

Combine this with the fact that LINPACK is not a benchmark that involves a lot of communication, relatively speaking. If any of the big players cared, they could probably destroy the top LINPACK numbers instantly.

There are HPC benchmarks that they would have a harder time with, just not LINPACK.


Thanks, this is interesting. It would be somehow satisfying if their LINPACK results actually couldn't be beaten by Google et al. (And their real workloads too.)

But how tightly can you really connect 27000 GPUs? Would be curious if anyone has a more technical article handy about what's different.


The list of top supercomputers isn't a list of which systems have the most ALUs that you can shove floats through (though that is definitely a strong correlate). The difficult part in HPC is actually keeping those ALUs fed with floats. In large HPC applications, communication is the principal bottleneck to scaling up [1]. Communication patterns for HPC applications also tend to be bursty, with everybody sending at the same time, which makes it very easy to saturate a typical star-like Ethernet network configuration (supercomputers typically use a torus- or mesh-style interconnect).

For GPUs, one trick you can do is GPU-to-GPU communication that bypasses the CPU. I don't believe the hardware that extends this to CPU-less transfers across different nodes is common on non-HPC systems.

[1] One of the main criticisms of LINPACK as a benchmark is that it is a low-communication benchmark. Essentially, you're doing O(n^3) computation on O(n^2) communication. In many benchmarks, such as grid simulation, the ratio of computation to communication is constant with respect to size.
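
To make that concrete, a quick back-of-the-envelope sketch in Python (the 2/3 n^3 flop count is the standard one for the LU factorization HPL performs; the O(n^2) communication volume is just an order-of-magnitude stand-in, counted in matrix entries):

    # Rough flops-to-communication ratio for an HPL-style dense solve.
    # LU factorization of an n x n matrix costs ~(2/3) n^3 flops, while the
    # data exchanged between processes scales like O(n^2) matrix entries.
    for n in (10_000, 100_000, 1_000_000):
        flops = (2 / 3) * n ** 3
        comm = n ** 2            # order-of-magnitude estimate, in entries
        print(f"n = {n:>9,}: ~{flops / comm:,.0f} flops per communicated entry")

The ratio grows linearly with n, so a big enough problem can hide a slow network; a grid/stencil code has no such luxury.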


> One of the main criticisms of LINPACK as a benchmark is that it is a low-communication benchmark. Essentially, you're doing O(n^3) computation on O(n^2) communication. In many benchmarks, such as grid simulation, the ratio of computation to communication is constant with respect to size.

This is critical and poorly communicated to most people outside the HPC world.


> But how tightly can you really connect 27000 GPUs?

Not all that well currently. Nvidia and others are working on GPU-specific interconnects [0], but they don't have anywhere near the scale of traditional interconnects, which had supported hundreds of thousands of nodes by the late '90s. One of the big challenges in modern supercomputer programming is in fact keeping the GPUs hot, which can often mean offloading work that needs high memory usage to CPUs.

Unfortunately my knowledge here is a little dated; I interned at Los Alamos National Lab from 2008 to 2012, when they were doing a lot of rearchitecting of old codes for Roadrunner, the first petascale computer. It used Cell chips in accelerator cards and foreshadowed a lot of the challenges in GPU programming, but did not fully elucidate them. For instance, we didn't have CUDA!

If I had to guess, the first exascale computer is going to be the one that solves the GPU interconnect problem at scale.

0: https://www.nvidia.com/en-us/data-center/nvlink/


Roadrunner was... special... in that regard, requiring even more effort and bizarreness than the typical HPC GPU setup. I remember that the pre-install plan for Roadrunner Linpack was a 50-page document. Also, it's worth noting that GPU HPC computing was already in full swing around the same time: CUDA was first released in June 2007, about a year before Roadrunner's first Top500 entry.


Exactly. First there was nvswitch, which dramatically increased the bandwidth over pcie. But that didn't scale to a large number of GPUs. Then there was nvswitch, which solved the scaling problem inside a node. I wouldn't be surprised if the next leap is something like nvlink cables between nodes that don't need traditional routing capabilities.


First there was nvlink, rather.


Infiniband used to be great but in an era of 40 or 100 gig ethernet it's not particularly special anymore.


Right.

As https://en.wikipedia.org/wiki/InfiniBand#cite_note-20 says

"2009: of the top 500 supercomputers in the world, Gigabit Ethernet is the internal interconnect technology in 259 installations, compared with 181 using InfiniBand.[20]"


There are (at least) two parameters that are relevant for the fabric, and latency is the critical one for HPC. Have you seen a materials science computation, of the sort that typically takes much of the time on HPC systems, running on 1GbE and SDR IB, even on three nodes? It sometimes happened on our system by mistake in the 2009 era.

HPL is simply not a good measure of the utility of a system for HPC, which is generally accepted amongst HPC people, who deserve to be listened to. There's a good reason you don't see petascale DFT calculations on one of these "cloud providers" which comprise so much of the top 500. (That might work on Azure, or something else with an InfiniBand-ish fabric, but I've still not seen it.)

You can, and to some extent should, model network effects with systems like Simgrid or Dimemas.


And that was almost a decade ago.


I believe anything that can perform the LINPACK benchmark is eligible, though to qualify the owner of the computer would have to voluntarily run the benchmark and submit their results. Google has chosen not to submit any results, probably because they have better things to do with their warehouses than run benchmarks.


The difference between a supercomputer and a data center is how "connected" the computations are; a supercomputer optimizes the communication between nodes. To put it another way, a data center does a lot of work but, most of the time, for different applications (services) whose dependencies are "sparse".


Because they don't submit results. You have to enter to win.

Tangentially, Top500 results are based on one benchmark (latency of enormous double precision matrix triangular factorization), which is relatively far removed from what Google is optimizing for.


Definition of supercomputer varies, but TOP500 is based on LINPACK. I am not aware of Google running any LINPACK benchmarks on their hardware (except maybe in Cloud VMs)?

As others have said, the classic Google warehouses weren't really supercomputers; they were more like massive clusters with high cross-sectional bandwidth but very high latency, and they didn't run an MPI stack.


It's all in the Interconnects. The hard part of supercomputing is moving data, not computing.


What makes you think google doesn't have good enough interconnects in their data centers? infiniband is not that impressive anymore

"2009: of the top 500 supercomputers in the world, Gigabit Ethernet is the internal interconnect technology in 259 installations, compared with 181 using InfiniBand.[20] "


or the hyperscale clouds (AWS, Azure, GCP).


TOP500 doesn't include distributed systems. Essentially, every computer on TOP500 is a single computer that you can log onto. By contrast, Google's data warehouse would qualify as a large cluster of individual systems.

Note that not all supercomputers are on TOP500. Blue Waters is perhaps the most notable one to not bother reporting its performance (it would probably have been #1 had it done so when it came out, and today it would fall around 13th or so).


I'm not sure that's true. At the very least, EC2 made a showing with C3 instances that made it to #64 in 2013.

https://www.top500.org/system/178321


CC2 was #42, which was cool just because of the number.

https://aws.amazon.com/blogs/aws/next-generation-cluster-com...


What you are talking about is called "single system image", that is, a single address space, globally visible storage, etc.

This concept is pretty much completely dead in the supercomputing world, and has been for decades.


The number one supercomputer on the TOP500 list, Summit, is able to majority-attack about 95% of cryptocurrencies that are GPU-mined: https://twitter.com/zorinaq/status/1007005472505978880 That's one advantage that ASIC-mined currencies have over them. Specialized chips raise the security bar so high that the pre-existing installed base of GPUs cannot attack them.


That sounds good in theory, but I can't help but wonder if this has actually contributed to the huge power draw. Sure they're more efficient, but due to their limited availability, it encourages the big players to consolidate, knowing the barriers to entry are very high for would be competitors. This results in a technological arms race among the biggest players, confident that there will be no added competitors who will come out of nowhere. As they add to their holdings, they also necessarily increase their power consumption, making for an opportunity cost of that otherwise cheap power to other uses.


The trend of power consumption is no different between GPUs and ASICs. Either way, miners will always be competing to add more and more capacity.


Well, my point was that the confidence of not facing serious new competition encourages even further investment than the alternative would. This is because their future returns are more predictable.


Supercomputers are expensive because of the investment in interconnect & i/o bandwidth & latency.

So using one for a trivially parallelizable, low-communication task like crypto mining would be wasteful - very low bang for the buck.

(There are other cryptanalysis workloads that benefit though, eg parallel number field sieve).


You are right that Summit is overbuilt if it was going to be used just for raw crypto mining hashpower. However, even so, it would be quite profitable... The last double spend attack on BTG stole ~$20M. There is no reason to think it couldn't steal $50M or so. Repeat the attack on a handful of other cryptocurrencies and you would quickly recoup the cost of Summit ($200M)...


Genuinely curious about what would make this stealing? Isn't it just playing the crypto game better than the competition? What makes it criminal?


At least for Intel processors, one can run the LINPACK benchmark using pre-built binaries provided by Intel: https://software.intel.com/en-us/articles/intel-linpack-benc...

There must be something similar for AMD processors too, but I can't find it with some quick duckduckgo. Perhaps someone else can link it?

Just a silly thing to compare your PC with the big dogs.


I wish you could buy a simple computer where all processing is integrated. The cores form a pyramid, with a few really fast ones on top and tons and tons of slow ones below. Everything is exposed through a very low-level raw API, with no speculation. All abstractions like speculation are layers on top, à la Vulkan.


This might be a good time to ask: my main reservation about TensorFlow is that it's a subset of general purpose computing, so will always be limited to niches like AI or physics simulations or protein folding. If we look at something like MATLAB (or GNU Octave) as general-purpose vector computing, then we need some kind of bridge between the two worlds. I couldn't find much other than this:

https://www.quora.com/How-can-I-connect-Matlab-to-TensorFlow

Does anyone have any ideas for moving towards something more general?


If you are not confined by the MATLAB platform, then there are a few options. For example, Julia [1] is a general-purpose numerical computing language that has more or less similar syntax with MATLAB for vector/matrix computation, and there is a Julia package TensorFlow.jl [2] that allows you to call TensorFlow in Julia. There are also quite a few packages in development to adapt Julia to GPU-based computation.

And, to be fair, the NumPy/SciPy stack of Python can also be seen as a general-purpose vector computing platform. My feeling is that MATLAB had its days in the 90s. It's just that the most cutting-edge technologies do not seem to be developed in MATLAB any more.

[1] https://github.com/JuliaLang/julia

[2] https://github.com/malmaud/TensorFlow.jl


Maybe something like Jupyter or plain Python with GPU enabled numpy is what you’re looking for?

TensorFlow is a library and not a language; it's meant to be used from a host language that is Turing complete. Its goal is to make construction of graphs of vector evaluations easier and more performant, not really to provide a general-purpose computation environment; it's assumed you're calling it from one.

So, if you’re using Tensorflow, you normally have general purpose computing available to you, with the option to bake your vector tasks into graphs easily and/or speed them up.

Jupyter is becoming a decent alternative to Matlab, and you have many options for running vector computations from python, with or without a GPU.


I don't know about TensorFlow in particular, but there are little-known methods of running "general purpose" parallel programs on GPUs. Specifically, H. Dietz's MOG, "MIMD on GPU". It's a shame the project hasn't gotten more attention, imo.

http://aggregate.org/MOG/

See: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy for explanations of terms.


Sorry for my late reply, I just wanted to thank you because your comment is an exceptionally good example of what I was trying to get at with my longwinded explanations. Compiling MIMD to SIMD is the future of programming, although it seems that companies will try every other course of action before they realize that.


What do you mean by "TensorFlow is a subset of general purpose computing, and thus will always be limited to niches"? It's not clear to me at all what one could mean by this. Doesn't TensorFlow have to use matrix math deep down (just like any other digital computing system)?


I'm not sure if you or albertzeyer asked first, but what I meant by that is that MATLAB is similar to any other C-like language, except that it uses the vector as its primitive instead of something like an integer or float. That's really all there is to it. Other than a few details about notation, every major concept of MATLAB stems from that and is easily understood and predictable. MATLAB (or non-proprietary analogs like GNU Octave) lets you write C-like code, and its runtime deals with architecture and optimization details internally (so there are no limitations on vector size or number of samplers or anything like that).

Whereas things like TensorFlow, CUDA, OpenCL, OpenGL etc seem to deal more with DSP processing of buffer(s). They all have their own abstractions and lingo which work extremely well for certain use cases, kind of like domain-specific languages (DSLs).

The end result is that it's trivial (at least in theory) to go from a TensorFlow implementation to a MATLAB implementation. But it's very difficult to go the other direction. Another way to think of this is that any solution written in TensorFlow can be run by MATLAB, but the reverse is not necessarily true. Trying to run MATLAB code within TensorFlow might encounter hardware limitations or other restrictions that make the code run thousands of times slower.

Now I could be wrong about this - maybe they truly are equivalent. But until I'm able to transpile MATLAB code directly to TensorFlow or OpenCL or whatever and have it be performant, I'm going to continue working under this assumption.


In TensorFlow, a vector (or a tensor) is also one of the fundamental datatypes.

> The end result is that it's trivial (at least in theory) to go from a TensorFlow implementation to a MATLAB implementation. But it's very difficult to go the other direction.

I would say just the opposite. From Matlab to TF should be trivial, but the other way not so much. Or can you give me an example in Matlab which would be hard to translate to TF?

I can give you one in TF which would be very hard in Matlab: E.g. how to implement async SGD, with a parameter server and all that logic?


I'm much more familiar with MATLAB than I am with TensorFlow so I might be wrong about the lowest-level computation stuff. Part of it is an architecture issue. I can imagine writing a transpiler from MATLAB that automatically dices up the matrix calls to work over a distributed system, maybe with something like Go or Elixir running under the hood and sending the jobs in batches to other processors, then joining all of those results, which abstracts away most concurrency issues.

I find it more difficult to think about TensorFlow this way, because to me it seems more similar to OpenCL or OpenGL, where you are dealing with multiple cores reading from a single memory space and transpiling the TensorFlow code to something like a shader. That type of code works more like SIMD or VLIW and has trouble with things like branching or dynamic codepaths or even reading/writing from/to a computer's main memory.

So what I'm looking for is a way to write at the high-level abstraction of MATLAB syntax and then have the runtime dice that up internally as TensorFlow or Elixir or whatever (that part doesn't matter to me as much). Maybe Julia can do that but I haven't learned it yet. But I want to stay away from the boilerplate - things like binding buffers or carrying the mental burden of having separate vector and system memories.


Yeah, I am pretty certain you're not right in this understanding. I am not the most knowledgeable, but basically all I know is scientific computing, so here are my two cents:

A C-like language does not use the "vector" as its "primitive" (it's not clear what you mean by primitive, so I am interpreting it as "machine type"), since integers, floats, and characters are the "primitives" (machine types), which are the things the hardware itself "knows" how to store, using a consistent system ultimately involving groups of bits.

On top of these primitives, single-type arrays (e.g. basic C arrays) are built, for instance a string is an array of characters.

Using C/typical hardware, a simple array is literally stored in memory "contiguously". An array needs only two pieces of information to define it: the memory address of the first element, and how many elements there are. Thus, if you want the 6th element, the machine literally finds the first element and then moves five memory "blocks" forward (C arrays can only have a single type, so each element needs the same amount of space to store), and gives you back what it finds there. The simplest/fastest interpretations of a C array don't even care about how many elements there are, so if your array has 5 things but you ask for the 6th thing, you just get back the data stored where the 6th thing should have been. Typically, this is garbage, but maybe it's super valuable (e.g. the first character of a stored password is put there... an "overflow vulnerability"... to get the second character, ask for the 7th element, etc.).

Okay, going off topic. Back to the point.

Other data structures that can be built upon primitives are structs, enums, etc., but the array is the one relevant to us because it's easy to think of an array as a linear algebra vector. C does not implement dot products natively, but one can easily write a function called "dot_product" that takes two lists of integers or floats and returns a scalar integer or float. Some higher-level languages (e.g. MATLAB) do exactly this, and save you the work of implementing all of linear algebra again.

Matrices are more tricky: for usual non-parallelized hardware, they are still stored contiguously as an array of primitives, but some extra data might go along with this array to tell you how many elements you have to pass before you're onto the next row. Again, MATLAB just provides this sugar-coating.
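
You can see that layout from Python, for example (NumPy strides are in bytes; just a sketch, nothing MATLAB-specific):

    import numpy as np

    a = np.arange(6, dtype=np.int32)   # six contiguous 4-byte elements
    print(a.strides)                   # (4,): step 4 bytes to the next element
    m = a.reshape(2, 3)                # same buffer, viewed as a 2x3 matrix
    print(m.strides)                   # (12, 4): 12 bytes to the next row, 4 to the next column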

So where do GPUs come in?

Well, think of a simple grayscale image on a monitor, and note that it is built up of little pixels: this image can be thought of as a matrix of integers that range from 0 to 255 inclusive (256 values total), where entry (i, j) corresponds to how light/dark pixel (i, j) is. If your monitor is a square that contains 1000 pixels by 1000 pixels, you can represent it by a 1000x1000 matrix. Using C-like languages on typical hardware, you are basically representing your 1000x1000 matrix as a 1,000,000-long array. Imagine you want to transform each pixel (independently of the others) by applying some function you have written. On typical non-parallelized hardware, you would go through each of the million entries in that array and apply your function, one after the other (serially).

This is the sort of operation you typically have to do when you want to transform images on a screen. You can imagine that doing it serially would get more and more tiresome as your images get more detailed, your monitors get higher resolution, etc., so purpose-built hardware called GPUs was created, which can represent a matrix as a true machine type: a true "primitive", where memory is actually (i, j) addressable. You can pass matrices to the thing and get matrices back. If you can do matrices, you can also obviously do vectors. Most importantly, you can give the GPU instructions that say "this function should be applied to each pixel, and it doesn't care about other pixels when transforming one", and the GPU will apply that function in parallel. It will do 1,000,000 calculations at once (assuming it can store a 1000x1000 matrix; otherwise it might have to get into some other abstractions, but now we are going off topic).

Eventually people figured that any problem that can ultimately be thought of as "I have X data points, and I want to apply a function on each data point independent of the other" could be run efficiently on a GPU.
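
For example, here's roughly what that per-pixel transform looks like using Numba's CUDA support (mentioned further down); a sketch only, and it assumes a CUDA-capable GPU:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def invert(img, out):
        i, j = cuda.grid(2)                # each thread owns one (i, j) pixel
        if i < img.shape[0] and j < img.shape[1]:
            out[i, j] = 255 - img[i, j]    # independent of every other pixel

    img = np.random.randint(0, 256, (1000, 1000)).astype(np.uint8)
    out = np.empty_like(img)
    threads = (16, 16)
    blocks = ((img.shape[0] + 15) // 16, (img.shape[1] + 15) // 16)
    invert[blocks, threads](img, out)      # ~1,000,000 pixel updates in parallel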

Machine learning is just one application where you need to deal with lots of matrices.

People have come up with languages that know about matrices as a machine type (early examples being graphics shader languages), and TensorFlow fits in somewhere in this space, with in-built sugar-coating for things the creators thought were "relevant". I have never used TensorFlow though, so someone else can probably give you more detail.

--------------------------------------

Some things I have not mentioned:

* vectorization: basically, how to do vector operations "smarter" on non-parallelized hardware, something that many languages (e.g. MATLAB, NumPy) and hardware now support

--------------------------------------

Anyway: "transpiling" vector/matrix operations from MATLAB/NumPy to TensorFlow/OpenCL/Cuda is a breeze conceptually (but I bet it's kind of boring problem for advanced programmers). If a transpiler doesn't exist, it's probably because no one has put in the work to open source it. One example of a Python+NumPy to Cuda "transpiler" is Continuum's Numba: http://numba.pydata.org/numba-doc/0.38.0/cuda/index.html

The devs there are also thinking about OpenCL, and some progress was made. This is the current state of that task: https://github.com/numba/numba/pull/582

SPIR-V is basically a standardized "middle language" that "transpilers" can use. What you do is translate from language X -> SPIR-V -> OpenCL/Vulkan/whatever -> hardware drivers -> machine language

And SPIR-V is work in progress: https://en.wikipedia.org/wiki/Standard_Portable_Intermediate...

There is also similar work for MATLAB, but it's done by the owners: https://www.mathworks.com/matlabcentral/answers/25973-matlab...

Also, I have to echo other comments about how no one really cares about MATLAB that much in sci-comp, probably because it's proprietary and expensive. Heavy work is usually done in Fortran (legacy code), C/C++, or experimental languages like Julia, or languages that are more flexible/open-source like Python (NumPy/SciPy are just wrappers around lots of C/Fortran).

People who use MATLAB tend to be those who were just "introduced to it" (e.g. through school) and have never hit any limitations that require them to switch to more flexible languages. Or they don't even realize that more flexible languages exist? It's a matter of comfort/goals (I don't advocate that everything should be done in C, and MATLAB is great for quickly doing many things, but many other things that have been done in MATLAB would have been less painful/faster if they had used, say, Python+NumPy, because then you can use everything the Python ecosystem has to offer and aren't limited to the MATLAB ecosystem).


>say, Python+NumPy, because then you can use everything the Python ecosystem has to offer, and aren't limited to the MATLAB ecosystem

That has an implicit assumption that Python's ecosystem is better for this kind of work than MATLAB which just isn't true in many areas of scientific computing. Take differential equations for example. MATLAB has matcont which is a great bifurcation analysis library which is unrivaled by Python (PyDSTool is a toy in comparison). Python doesn't really have usable DDE solvers. Python's ODE (SciPy, Sundials wrappers, etc) solvers are much less developed than MATLAB's, doing okay since it's wrapping some standard software but with a lot less flexibility than MATLAB. And you can keep going. And to top it off, compilation with Numba or Cython doesn't work well in the case of differential equations since there are a lot of small function calls. So it's not so clear that Python is good at scientific computing at all, and in fact in some regions like differential equations it's quite a step downwards. Generally, the ecosystem in this area is well developed in Julia + the commercial offerings (MATLAB, Maple, Mathematica).

That said, Python's ecosystem outside of scientific computing is so much better than MATLAB's. However, at that point I would think Julia is a good choice.


> MATLAB has matcont which is a great bifurcation analysis library which is unrivaled by Python (PyDSTool is a toy in comparison).

I know lots of people who work with dynamical systems (math bio), and most people don't use MATLAB's matcont, as it is a toy compared to XPP/AUT. People tolerate XPP/AUT's archaic user interface for its power, and can also export data easily to Python for graphing.

Also, the DDE thing is no longer true: https://aip.scitation.org/doi/10.1063/1.5019320

> Python's ODE (SciPy, Sundials wrappers, etc) solvers are much less developed than MATLAB's

What do you mean by "much less developed"? ODE solving is an "old problem" in the sense that it has been optimally addressed in C/Fortran/C++, and it's just better to make wrappers around that existing code. ODEPACK/SUNDIALS/PETSc are great examples of cutting-edge standard ODE/PDE solvers which have Python wrappers, and if you've got wrappers around them, you're going to be hard pressed to find anything better.

Then there are people who are developing new ways of integrating ODEs numerically, and for them, MATLAB is only good as a first pass/prototype thing. A professor I know working on something like this doesn't bother using MATLAB for it (C++/Python).

> compilation with Numba or Cython doesn't work well in the case of differential equations since there are a lot of small function calls

Again, I have no clue what you mean by this. I just had a paper published which modelled a system involving hundreds of ODEs, with every involved function in the dynamics being compiled using numba's "nopython" mode. The stuff is blazing fast, only about 10 times slower than a C implementation. The structure of the program looked like this:

scipy.integrate.odeint -> f -> sub-function ... -> sub-function ...

`f` (the function to compute derivatives at each time point) and `sub-function`(s) (which are, I guess, the small function calls you were referring to?) were compiled using numba's nopython mode. It was super easy. You can have a look at the relevant code here: https://github.com/bzm3r/numba-ncc/blob/master/core/dynamics...

So I don't get what you mean by "Numba/Cython doesn't work well in the case of DEs since there are a lot of small function calls".
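
For what it's worth, a stripped-down version of that structure looks like this (toy right-hand side, not the model from the paper):

    import numpy as np
    from numba import njit
    from scipy.integrate import odeint

    @njit
    def interaction(y, k):          # a "sub-function", compiled in nopython mode
        return -k * y

    @njit
    def f(y, t, k):                 # derivative function called by the integrator
        return interaction(y, k)

    y0 = np.ones(200)               # a couple hundred ODEs
    t = np.linspace(0.0, 10.0, 101)
    sol = odeint(f, y0, t, args=(0.5,))
    print(sol.shape)                # (101, 200)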

Note that MAPLE/Mathematica are not in the same camp as MATLAB, since they are primarily symbolic mathematics engines. Both are fantastic though, I agree.

No complaints about Julia. Awesome stuff. Just like Rust. I am not a huge Python fan, in the sense that I am actively moving away from it to "better tools", but I still think Python mostly beats MATLAB.


>I know lots of people who work with dynamical systems (math bio), and most people don't use MATLAB's matcont, as it is a toy compared to XPP/AUT. People tolerate XPP/AUT's archaic userface for its power, and can also export data easily to Python for grahping.

No, XPP/AUT is almost strictly less powerful than matcont. It can recognize and handle a much smaller set of bifurcations. Even PyCont (part of PyDSTool) can do some things XPP/AUT can't (though there it's much more of a tradeoff). XPP/AUT is good enough for most math bio though since these higher order bifurcations are much more rare to actually find in models.

>Also, the DDE thing is no longer true: https://aip.scitation.org/doi/10.1063/1.5019320

That's matching dde23, which is only non-stiff with constant lags. Still very, very simple and cannot handle a lot of DDEs. Mathematica and MATLAB handle state-dependent DDEs. Maple, Julia, and Fortran via Hairer's RADAR5 handle stiff state-dependent DDEs.

>The stuff is blazing fast, only about 10 times slower than a C implementation.

This is the overhead I was mentioning. I say it's slow since it's 10x slower than the C implementation. If that's fast enough for you, that's fine, but there's still a lot to be gained there.

>Note that MAPLE/Mathematica are not in the same camp as MATLAB, since they are primarily symbolic mathematics engines. Both are fantastic though, I agree.

Look at their differential equation solver merits in full detail and you'll see that there's a ton of things these cover that Python libraries don't. I was surprised at first too, but they aren't just symbolic engines. Maple has some of the best stuff for stiff DDEs, for example, and Mathematica's Verner + interpolation setup is very modern and matches the Julia stuff, while MATLAB/SciPy etc. are still using Dormand-Prince (dopri5, ode45).

I am not saying Python's libraries aren't fine. They definitely are fine if you don't need every little detail and if you don't need every lick of speed. But as you said, it's leaving that 10x on the table. Also, a lot of its integrators don't allow complex numbers. Also, it doesn't have access to much IMEX and exponential integrator stuff. So the Python libraries are fine, but they are far from the state of the art.


Hmm. Interesting stuff. I am surprised that Julia has come along so far. Where can I read more about Julia's integrators?


It's all in the docs for DifferentialEquations.jl. For example, here's the page for the first order ODE solvers: http://docs.juliadiffeq.org/latest/solvers/ode_solve.html .


> does not use the "vector" as its "primitive"

Well, doesn't the SIMD architecture make vectors "primitives", even if in some limited sense?


As I said:

-------------------------------------

Some things I have not mentioned:

* vectorization: basically, how to do vector operations "smarter" on non-parallelized hardware[SIMD?], something that many languages (e.g. MATLAB, NumPy) and hardware now support

-------------------------------------


Why do you think TensorFlow is a subset of general-purpose computing? What do you think is missing? I think nothing is really missing, only that it's maybe more difficult to perform certain kinds of tasks. But compared to Matlab/Octave, I don't really see much lacking (in the platform). I would even say the opposite: the Matlab/Octave platform seems to me like a subset of what TensorFlow offers.


I wasn't sure who was first, so see my reply to bmer here: https://news.ycombinator.com/item?id=17420821


> niches like AI or physics simulations or protein folding

These are 99% of HPC workloads. Nothing `niche` about them.


At work, we have a bunch of vectorized computations that we run in Tensorflow, as it's a convenient way to get GPU-optimized code, so that is still an option (albeit an awkward one).

You could also use something like CUDA, or OpenGL to do this; there are some Python libraries to do basic numerical work, such as PyCUDA or gnumpy.
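
For example, a plain matrix product in TensorFlow will land on the GPU with no extra ceremony; a minimal sketch, assuming TF 2.x eager execution (with the 1.x graph API you'd wrap this in a session):

    import numpy as np
    import tensorflow as tf

    a = tf.constant(np.random.rand(2048, 2048), dtype=tf.float32)
    b = tf.constant(np.random.rand(2048, 2048), dtype=tf.float32)
    c = tf.matmul(a, b)             # placed on the GPU if one is visible
    print(float(tf.reduce_sum(c)))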


CUDA libraries fit your description I guess.


Since TFA talks about deep learning so much, I wonder how many of the applications run on these machines actually are deep learning, or can make use of the tensor cores in some other way.


A lot of people are using GPUs for many things other than ML. The big advantage is the number of cores, and people who run on supercomputers write algorithms that are highly parallelized (otherwise what's the point). GPUs are getting fast enough that their core counts are giving them an edge. Also, the memory on them is MUCH faster than that on a CPU, but the cost is that you have less of it (around 20 GB compared to 256 GB).

As far as the TPUs go, one big advantage for ML is that they are float16/float32 (normal being f32/f64), which is fine since in ML you care very little about precision, and they are optimized for tensor calculations. For anything where you don't need that resolution and are doing tensor stuff (lots of math/physics does tensor stuff), these will give you an advantage. (I'm not aware of anyone using them for things other than ML, but I wouldn't be surprised if people did.) But other things need more precision, and those won't use the TPUs (AFAIK).


All modern GPUs support f16.


This, and lots of things besides ML don't need high precision.


Given that the top one is at Oak Ridge National Lab, my guess would be that they're not exploring deep learning. They've got other applications in mind.


ORNL, like everyone else, is studying ML. They are a research lab. But there are a lot of other applications that they are interested in. These GPUs do help with the traditional research that they perform.


One other point not mentioned in other comments: some work was presented at GTC on using tensor cores for a low-precision solution followed by iterative refinement to an fp64-equivalent solution. IIRC, a 2-4x speedup for fp64 dense system solvers.
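
The idea, sketched in plain NumPy (single precision standing in for the tensor-core part; a real implementation would reuse the low-precision LU factors rather than re-solving each time):

    import numpy as np

    def solve_refined(A, b, iters=3):
        A32 = A.astype(np.float32)
        x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
        for _ in range(iters):
            r = b - A @ x                                   # residual in double precision
            d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single precision
            x += d.astype(np.float64)
        return x

    n = 500
    A = np.random.rand(n, n) + n * np.eye(n)                # well-conditioned test matrix
    b = np.random.rand(n)
    print(np.linalg.norm(A @ solve_refined(A, b) - b))      # close to double-precision accuracy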


My guess would be the vast majority. In addition to being an area that has everyone's interest right now, the hardware is getting more and more specialized, so it just doesn't benefit general-purpose computing. Just as FPU enhancements target a fraction of computing tasks, GPUs target an even smaller fraction, and tensor cores / 16-bit FP etc. smaller still.



