Folks might find the research paper [1] the author wrote while at Harvard more informative. This is a great high-level description, but if you want more detail, I recommend the paper.
That was unfortunately short-lived. ACM announced on March 30, 2020 that they would open their DL for 90 days due to the pandemic [1]. I don't believe there was an extension, so it expired on June 30, 2020.
Note that there is a new (substantially accelerated) "autoscheduler" implementation in TVM (one of the competitors in the linked article); details can be found in https://arxiv.org/abs/2006.06762
CUDA:

    c[i] = a[i] + b[i]
    i += 1

Triton:

    c[i:i+16] = a[i:i+16] + b[i:i+16]
    i += 16
The 16 in this example is the "block size", and could be anything. But this notion of expressing computation over blocks of dense data seems to be the big difference from other approaches.
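For a more concrete picture, here is roughly what that vector add looks like as an actual Triton kernel, written against the Python API from the Triton tutorials. Treat it as a sketch rather than gospel; exact names (tl.program_id, tl.arange, tl.constexpr) can differ a little between versions.

    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        # Each program instance handles one block of BLOCK elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n                       # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

The body is written in terms of whole blocks (like the c[i:i+16] pseudo-code above), and the compiler decides how those blocks get mapped onto threads, registers, and shared memory.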
A very exciting result of the incredible performance that Triton achieves is the ability to fuse NN operations such as Matrix Multiply + LeakyReLU + Batch Norm. Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory, and then your Batch Norm would read the LeakyReLU output out of memory again.
The ability to write very fast kernels, and especially to fuse them together to avoid unnecessary memory round-trips, is a big deal!
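To make the fusion point concrete, here is a toy sketch (a hypothetical bias + LeakyReLU kernel, not the actual Matrix Multiply + LeakyReLU + Batch Norm case) of what fusing an activation into the same Triton kernel looks like. The intermediate stays in registers instead of taking a round trip through global memory:

    import triton
    import triton.language as tl

    @triton.jit
    def bias_leaky_relu(x_ptr, bias_ptr, out_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        b = tl.load(bias_ptr + offs, mask=mask)
        y = x + b                             # bias add
        y = tl.where(y > 0, y, 0.01 * y)      # LeakyReLU, fused in registers
        tl.store(out_ptr + offs, y, mask=mask)

With cuBLAS-style libraries, each of those steps would be its own kernel launch with a full read and write of the tensor in between.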
> Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory,
> PS: The name Triton was coined in mid-2019 when I released my PhD paper on the subject (http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tille...). I chose not to rename the project when the Triton inference server came out a year later since it's the only thing that ties my helpful PhD advisors to the project.
Are we sure that marketing article was not changed after the fact? The Nvidia Triton Release notes show 'Starting in release 20.03, TensorRT Inference Server is now called Triton Inference Server.' [0]
It kind of looks like the article just changed the heading later, since the Docker images it refers to are all still called 'tensorrtserver', which was the project's name before the renaming in release 20.03.
The OpenAI Triton author says on Reddit: 'You can also look at the github history and you will see that there is no mention of the "Triton inference server" up until version 2.0, which wasn't out in 2019 (I ran `git reset --hard v1.9.0 ; grep -ir "triton" .`)' [1]
The name also clashed with Triton [0], a demoscene group famous for its Crystal Dreams 2 demo, which featured a zooming Mandelbrot on a 386 PC: https://youtu.be/BLMUfBikxTY?t=182
My first thought exactly. This will cause nothing but confusion and Triton (the inference server) is well integrated into the space.
So it's especially weird to see it coming from OpenAI, and not a more random startup. It honestly makes no sense they would deliberately do this, unless there is some secret cult of Triton that is going on in the Bay Area world of AI/ML.
> CPUs and AMD GPUs are not supported at the moment, but we welcome community contributions aimed at addressing this limitation.
That's disappointing. My biggest frustration is that every ML library (PyTorch, Keras, etc.) is tied to CUDA/Nvidia, so I take a huge performance hit when running them on my Mac.
I'm really surprised AMD hasn't pushed much harder into this space given their aggressive targeting of the data center with Epyc. I too use mostly Macs during the day, and any ML projects are always relegated to Nvidia machines in my rack / on Amazon.
It's pretty much the OpenCL model, except it compiles into AVX2 / AVX-512 code. Very similar to CUDA / OpenCL style programming. It's not single-source like CUDA, but it largely accomplishes the same programming model IMO.
OpenMP 4.5+ looks very promising to me (particularly the "simd" keyword associated with for-loops). But open-source implementations of OpenMP are somewhat lackluster... at least last time I checked it out.
Maybe it's time I revisit it. I've always thought the OpenMP spec was being written by highly competent programmers. They seem to "get" what is needed. But the question is whether I can get my hands on any OpenMP implementation to actually play with.
I should really find time to do comparisons with the compilers to hand, including XL, on the NAS benchmarks, as I've never seen that, though it must have been done.
I think we have a "community" version of XL, i.e. no support, like basically everything else. I wasn't aware there was anything much wrong with GCC and libgomp or libomp, but then I haven't measured.
The task-based parallelism in LLVM leaves much to be desired, however. Ideally, you'd want a more efficient implementation.
But yeah, good enough to play with. But maybe not good enough to achieve high levels of performance. The SIMD stuff is probably simple enough to implement... maybe I should check out how well LLVM works with the OMP SIMD keywords.
Can you comment on your experience (or contact me) regarding implementation efficiency? We have recently implemented task-based parallelism in the J language with OpenMP [0]. Improvements or critiques are appreciated. SIMD instructions there have been coded directly rather than via pragmas.
I can't say that my critiques are based on personal experience. They're mostly based on microbenchmarks that other people have written about. I am probably a bit out of date, since it's been a while since I last played with OpenMP.
I'm looking at the benchmarks I used to look at, and they're all from 2014 or earlier. So maybe I really should double-check modern implementations. We all know GCC 4.x and LLVM 3.x were an eternity ago, so I probably should revisit their performance.
In LLVM or in libomp?
I don't know what omp simd is likely to get you over autovectorization. I know of cases where it was thought necessary (-fopenmp-simd, without -fopenmp) but wasn't with recent GCC.
"#pragma omp declare simd" applies over a function call, which then allows that function to be used inside of a "#pragma omp for simd" loop.
A few keywords here and there really help the autovectorizer achieve closer to CUDA-like environments (like... actually having your SIMD code extend "through" a function call, so you can start splitting up the work a bit better).
I took the example program from the OpenMP standard and built it with GCC 11 -Ofast. -fopt-info said the relevant loop was vectorized. Adding -fopenmp gave more vectorization messages from elsewhere, but I don't have time to figure out the difference from the tree dump (not being good with assembler). Doubtless the directives can help, but you do need to get them right, and I trust GCC more than me!
Your project is well engineered, but no matter how many times you post it here, you won't get real traction without a different development model, such as building on GitHub. This is especially true in a high-churn, high-cost field like ML. I also think you are staying anonymous unnecessarily.
Some user feedback: I tried to get the ResNet50 example to work, but I gave up after 2 hours. There really should be an end-to-end example like "./resnet50_example.py monkey.jpg".
A few points where I struggled:
- What are the names in the ResNet50Params struct? I looked at a few popular ResNet50 implementations, but I could not find any correspondence, and since the names have been sorted, the order of the members might as well be random. I thought about matching the parameters by shape instead, but the chance of getting that correct is almost zero. Why no natural sort? Or even better, a naming scheme that sorts in execution order, for example with zero-prefixed numbers like "layer05".
- 6 MB C files are somewhat ridiculous. There should at least be a download link somewhere. Copy & pasting from the website would take forever because of scrolling. Also it is impossible to verify that that code monstrosity is not doing anything evil.
- Go seems like an odd choice when almost every machine learning project these days is developed in Python.
- GCC throws a few hundred warnings:
ResNet50.c: In function ‘ResNet50ThreeArrangeDats7Callee1’:
ResNet50.c:57009:15: warning: unused variable ‘rel27’ [-Wunused-variable]
57009 | ptrdiff_t rel27 = j62-0;
| ^~~~~
and
ResNet50.c: In function ‘ResNet50NetCreate’:
ResNet50.c:77852:20: warning: taking address of packed member of ‘struct ResNet50Params’ may result in an unaligned pointer value [-Waddress-of-packed-member]
77852 | params1->bn1Means,
| ~~~~~~~^~~~~~~~~~
- You would get more feedback if there was a platform to discuss such issues (e.g. GitHub) instead of hijacking random threads on HackerNews.
- EDIT: I just realized I could probably just write a parser for that graph file format. This would be much easier if it was something standard like JSON instead.
- EDIT2: I think some shapes in the graph file definition are incorrect (possibly reversed per block?). For example, tensor "one1" should have "ToChannels=64" instead of "ToChannels=256".
That's a full ResNet50 conversion from the corresponding Caffe protobuf. He also wrote a small script to convert TensorFlow networks.
GCC warns about perfectly valid, correct C. It even warns about switch-case fallthrough. All of the generated code is fully correct. Turn the warnings off.
Please don't copy and paste from the website! Download the NN-512 executable and run it.
Thanks, that code does indeed work! And it is significantly faster than PyTorch (15 vs 60 milliseconds). Maybe I will use it in the future if AVX-512 becomes more widespread.
Sure, but if this abstraction layer becomes popular, then it becomes much easier to support other GPUs without requiring client libraries to change, which is a much harder problem.
A big reason why CUDA is popular with compiler writers is that the PTX assembly-ish language is well documented and reasonable.
Compilers generate PTX, then the rest of the CUDA infrastructure turns the PTX into Turing machine code, or Ampere machine code, or Pascal machine code.
In theory, SPIR-V should do the same job, but it's just not as usable right now. In the meantime, getting it to work on PTX is easier, and then there's probably hope (in the far future) of moving to SPIR-V if that ever actually takes off.
I'm not a developer on Triton, but that'd be my expectation.
AMD's ROCm 4.0 now supports cooperative groups, which is probably one of the last major holdouts for CUDA compatibility.
There are still 64-lane wavefronts (on AMD CDNA cards) instead of 32-lane warps (in CUDA). But AMD even has 4x4 half-float matrix multiplication instructions in ROCm (for MI100, the only card that supports the matrix-multiplication / tensor instructions).
---------
I think CUDA vs OpenCL is over. ROCm from AMD has its restrictions, but... it really is easier to program than OpenCL. It's a superior model: having a single language that supports both CPU and GPU code is just easier than switching between C++ and OpenCL (where data structures can't be shared as easily).
-----------
The main issue with AMD is that they're cutting support for their older cards. The cheapest card you can get that supports ROCm is Vega56 now... otherwise you're basically expected to go for the expensive MI-line (MI50, MI100).
> The main issue with AMD is that they're cutting support for their older cards.
No, their main issue is not properly supporting ROCm in general. No Windows support at all? It still feels like they don't know whether they want to continue investing in ROCm long term.
I'd assume that they're gonna support ROCm as long as the Frontier deployment at Oak Ridge National Labs is up. ORNL isn't exactly a customer you want to piss off.
The new contest is not CUDA vs. OpenCL but CUDA vs. Vulkan Compute. As support for Vulkan in hardware becomes more widespread, it makes more and more sense to just standardize on it for all workloads. The programming model is quite different between the two (kernels vs. shaders) and OpenCL 2.x has quite a few features that are not in Vulkan, but the latest version of OpenCL has downgraded many of these to extensions.
CUDA programmers choose CUDA because when you make a struct FooBar{}; in CUDA, it works on both the CPU side and the GPU side.
Vulkan / OpenCL / etc. don't have any data-structure sharing like that with the host code. It's a point of contention that makes anything more complicated than a 3-dimensional array hard to share.
Yeah, Vulkan / OpenCL have all sorts of pointer-sharing arrangements (Shared Virtual Memory) and whatnot. But it's difficult to use in practice, because they keep the concept of "GPU" code separate from "CPU" code.
---------
When you look at things such as Triton, you see that people want to unify CPU and GPU code into a single code base. Look at these Triton examples: they're just Python code, sitting inside the rest of the CPU-side Python code.
I think people are realizing that high-level code can flow between the two execution units (CPU or GPU) without changing the high-level language. The compiler works hard to generate code for both systems, but it's better for the compiler to do that work than for the programmer to work on integration.
Is it practical for code to flow between these two execution units? In my understanding, the architectures of a GPU and a CPU are so fundamentally different that it doesn't really make sense. You do want to have a single data structure representation, however.
> In my understanding, the architectures of a GPU and a CPU are so fundamentally different that it doesn't really make sense.
Yes and no. You find surprising bits of code that can be shared.
Read through a high-performance GPU project like "GPU perft" (https://github.com/ankan-ban/perft_gpu/blob/master/chess.h). The "perft" problem is the problem of counting the number of board positions reachable in X moves. (Perft(15) is "how many board positions exist within 15 moves from the start of chess?". Perft(1) is 20, and Perft(2) is 400, then 3+ gets a bit complicated!).
You can see that the fundamentals of bitboard manipulation (using a 64-bit number to represent the 64 squares of a chessboard) remain the same whether you're on a CPU or a GPU.
In this case, you want the code to be shared between the two sides. Why write the code twice? Both CPUs and GPUs are very good at 64-bit integer manipulation.
------------
The actual search (how to branch off, coalesce results, coordinate threads) is extremely different between CPU and GPU. So of course, you want that code to be written in a CPU-specific, or GPU-specific manner.
But the question of "where can this Bishop move??" (the sliding piece attack subroutine) is identical on CPU or GPU.
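To make "bitboard manipulation" concrete, here's a toy, plain-Python version of a sliding-piece (bishop) attack routine over a 64-bit occupancy bitboard. Real engines do this in CUDA/C++ (often with magic-bitboard lookups instead of loops), but the point stands: it's pure 64-bit integer arithmetic that is the same on a CPU or a GPU.

    MASK64 = (1 << 64) - 1  # emulate a uint64_t bitboard

    def bishop_attacks(square, occupancy):
        """Bitboard of squares a bishop on `square` (0..63) attacks, given blockers."""
        attacks = 0
        rank, file = divmod(square, 8)
        for dr, df in ((1, 1), (1, -1), (-1, 1), (-1, -1)):   # the four diagonals
            r, f = rank + dr, file + df
            while 0 <= r < 8 and 0 <= f < 8:
                sq = 1 << (r * 8 + f)
                attacks |= sq
                if occupancy & sq:      # the ray stops at the first blocker
                    break
                r, f = r + dr, f + df
        return attacks & MASK64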
Sounds similar to business logic shared between the client and server of a web app. Thanks.
I just wonder if this alone is enough to make a unified programming environment the preferred way. On the web we had Meteor.js, which tried this approach to unify client and server with JavaScript. The shared code was of similar types. Meteor never became the preferred way to write web apps. I'm wondering if the same is true for GPU + CPU programming.
This analogy is insufficient to capture what is going on.
In web dev, the client and server are doing fundamentally different things (e.g. UX vs data storage), and while sharing business logic is interesting, it doesn't solve a huge problem.
CPU vs GPU is vaguely analogous to different web browsers on the client in the old days, when they were very different. They have different performance characteristics, so shims worked but had a performance impact. But they were widely used because it was too hard to do yourself.
I don't think this is accurate. CPU and GPU are not trying to execute the same code with different performance characteristics. GPU and CPU are fundamentally doing different things. In fact, the analogy of shared business logic works perfectly.
The other poster is right. The CPU and GPU are just two execution resources available to the programmer, no different than spinning up a new thread with pthread_create (well... a bit more complicated, but you get the gist).
Even in video games, where GPUs do graphics and CPUs do physics, the lines have been blurred with TressFX (hair physics on GPU instead of CPU) or PhysX (collision physics done on GPU instead of CPU).
GPUs are general purpose computers now. You can program them to do whatever you want. So when the programmer is writing code, it pays to experiment a little: maybe run the code on the GPU... see if the GPU really is slower or faster. You might be surprised.
Ditto on the reverse: maybe some bit of GPU code would be better run on the CPU.
--------
This guess-and-check, program-and-profile methodology is the same as any other high-performance code people write. You throw away assumptions and just test the heck out of all your ideas, because you're surprised far more often than not in this field.
As such, having the ability to quickly move code from GPU-side to CPU-side (and vice versa) for these tests is extremely beneficial. You don't really know which device will run your code faster until you actually test it. (Though your instincts get better the more experience you gain)
GPUs are (basically) hardware devices for doing vast numbers of matrix manipulations very very fast in parallel, with limitations on the size of the matrix.
CPUs can do matrix manipulations, with less parallelism.
Things like TressFX use the tridiagonal matrix algorithm [1] to solve a linear system representing the hair constraint system. This could be implemented on a CPU as well, doing exactly the same thing with the same algorithm. It would just run slower!
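For reference, the tridiagonal matrix algorithm (Thomas algorithm) is short enough to sketch in plain Python; a GPU implementation does the same forward-elimination / back-substitution arithmetic, just over many independent systems in parallel.

    def thomas_solve(a, b, c, d):
        """Solve a tridiagonal system: a = sub-diagonal (a[0] unused),
        b = diagonal, c = super-diagonal (c[-1] unused), d = right-hand side."""
        n = len(d)
        cp, dp = [0.0] * n, [0.0] * n
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                      # forward elimination
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):             # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x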
No, they are trying to execute exactly the same code. To quote the parent:
You can see that the fundamentals of bitboard manipulation (using a 64-bit number to represent the 64 squares of a chessboard) remain the same whether you're on a CPU or a GPU.
In this case, you want the code to be shared between the two sides. Why write the code twice? Both CPUs and GPUs are very good at 64-bit integer manipulation.
The ceremony around co-ordinating threads and memory access is different, but the code that is being run is exactly the same.
The exact same business logic (helper functions), yes of course. My point is that the high-level algorithm is going to be fundamentally different. Just like the high-level algorithm on the browser (render and handle UI interaction) is different than the fundamental server-side algorithm (render HTML from database requests), and yet there might be shared helper functions. Does this make sense?
Differentiation isn't really the key algorithm. In fact, I believe that happens at compile time, not run time. So not only is it not key, it isn't even happening at run time.
> ROCm from AMD has its restrictions, but... it really is easier to program than OpenCL. It's a superior model: having a single language that supports both CPU and GPU code is just easier than switching between C++ and OpenCL (where data structures can't be shared as easily).
ROCm isn't a programming language or programming model - it's a collection of tools and libraries. Which language are you specifically referencing here? HIP?
Two reasons also. Some of the fastest research supercomputers in the US have been Nvidia-based. Sitting at a national lab, it's laughable how much more support there is from Nvidia. Getting a DGX box is easy - there is nothing comparable from AMD's side.
And the less said about Intel, the better. Smart, motivated people I know have left Argonne recently because of the Aurora shitshow.
Aurora was supposed to be up and running in early 2018. It seems like it won't be functional even in 2021. By the way, a years-long delay like this has never happened when it comes to US Leadership Computing Facilities.
Intel, last I heard, has written off close to 300 million dollars on Aurora.
Do you know if there's any post-mortems about what went wrong there?
I mean, I have the gist. Intel bet big on Xeon Phi, but that didn't seem to offer enough performance. Intel then switched over to this GPU strategy (see Intel Xe), but that required them to rework virtually everything from scratch.
No one has actually said any of what's in my previous paragraph. But it's blatantly obvious: the Xeon Phi was being advertised very strongly by Intel up to the point that Aurora's design was reworked. Suddenly, Xeon Phi was cut, and Intel started talking about Xe (including oneAPI, and all sorts of stuff the HPC market cares about). I'm confident enough at reading between the lines that I'm willing to assume this in the absence of hard facts :-)
I'm kind of curious about the play-by-play, if at all possible. What test showed up that made Argonne National Laboratory decide that the Xeon Phi model wasn't going to work? Was it possible to come to this conclusion any sooner? I realize this sort of stuff is rarely published, but I guess that's what makes me interested in those juicy details.
My friends were at more junior levels, so I don't know the full details either. However, know this: a leadership-class computing purchase by a national lab is a very complicated thing. Many people have their hands in the decision-making pie - all the way up to the Secretary of Energy.
Secondly, the DOE has a strategy (mostly rightly, in my opinion) of not putting all its computation eggs in a single company's basket. Thus national labs compete among themselves, as do companies. Notice the cadence of computing purchases: IBM/Nvidia -> Intel/Intel -> AMD/AMD.
Thirdly, Intel had promised that, using oneAPI, existing GPU-optimized simulation codes could be translated to Intel GPUs with minimal effort. The idea had merit back in 2016-2017: I have a CUDA-based simulation code, and it was supposed to be almost a matter of recompiling for Intel, with some minimal effort on my part. That it didn't work out - well, hindsight is 20/20.
> ROCm from AMD has its restrictions, but... it really is easier to program than OpenCL
That's really not the point. OpenCL is a standard that - at least in principle - is supposed to be supported on multiple platforms by multiple vendors. ROCm is AMD-only, and even that is questionable since it didn't exist 6 or 7 years ago, and who knows - they might drop it, like they changed their earlier focus.
Also, CUDA has a much richer ecosystem than AMD ROCm (I'm sad to say; as I'm not a fan of NVIDIA).
> OpenCL is a standard that - at least in principle - is supposed to be supported on multiple platforms by multiple vendors.
As was C++ AMP (which was actually pretty good as a language, IMO). Just because it's a standard doesn't mean it's going to be used.
OpenCL 2.0 was very poorly implemented: almost no one used any of its advanced features. To the point that OpenCL 3.0 is resetting back to OpenCL 1.2 as the baseline.
Only Intel really supported OpenCL 2.1 or OpenCL 2.2. The entire OpenCL 2.x branch was squandered for years with tepid responses from NVidia and AMD (yes, AMD had better OpenCL 2.0 support, but its debugger didn't work, and all of the code was tested on OpenCL 1.2 only. No one cared.)
I dare say that OpenCL 2.x was about as "standardized" and respected as C++ AMP. Just because it's an open standard doesn't mean that it's actually useful. Any serious OpenCL programmer stuck with OpenCL 1.2, including both AMD and NVidia OpenCL programmers.
Which other hardware vendor provides the level of performance that Nvidia's GPUs provide?
Wasn't relying on the benevolence of a single manufacturer (or a couple of them) just as true in the 90s as it is in the 2020s?
Well, yes and no. Neither AMD nor NVIDIA is committed to open, libre standards - they've both been "betraying" OpenCL (just like Apple and Intel, I should mention). So we have AMD's not-open (?) ROCm and NVIDIA's not-open CUDA.
Also, if we only cared about open standards, we'd simply not use a GPU at all and settle for RISC-V chips :-)
AMD's MI100 is slightly faster than NVidia's A100 for double-precision FLOPS at a slightly lower cost. Good enough for Oak Ridge National Labs (the Frontier supercomputer), to say the least.
NVidia is faster at 4x4 16-bit matrix multiplications (common in Tensor / Deep Learning stuff), but the MI100 still has 4x4 16-bit matrix multiplication instructions and acceleration. It's not far behind, and the greater 64-bit FLOPS are enough to win in scientific fields.
Frontier is going to be AMD GPUs for sure. Whether it's MI100 or the next-generation MI200 is still not really known and subject to rumors. I'm currently under the assumption that it's MI100, but now that you press me on that detail I admit that I'm somewhat assuming. Frontier will be using ROCm in any case.
El Capitan (Lawrence Livermore National Laboratory) is MI100 and specified as such. So I'll "retreat" my argument to that more easily verified position. Still a top supercomputer in the world, and more is known about El Capitan than about Frontier.
Yes, I agree with the basic point. Where is the spec for El Capitan, then? I don't remember ever finding it, and I rather assumed it would wait for possible lessons from the Frontier hardware.
It's possibly worth saying that these things presumably are competitive with NVLink-based multi-GPU, and appear to support unified memory, but I don't know how that compares with AC922s. Not that UM is relevant for what I know of usage on our system....
I have plenty of links, but they're all from non-primary-source news sites, like ServeTheHome or (worse) Wccftech. Those are more "casual" blogs that aren't worth quoting IMO, and are closer to rumor mills (even if they sometimes have good information).
I can't find a single spec from AMD's website or ORNL, or the Department of Energy (primary sources: the ones we know would tell the truth on this matter).
Which is an MI100 test that ORNL did in "preparation" for Frontier. It doesn't seem like either Frontier or El Capitan has had its specs officially released yet.
I believe this is more of an optimization layer to be utilized by libraries like TensorFlow and JAX. More of a simplification of the interaction with traditional CUDA instructions.
I imagine these libraries, and possibly some users, would implement libraries on top of this language and reap some of the optimization benefit without having to maintain low-level CUDA-specific code.
XLA is a domain-specific compiler for linear algebra. Triton generates and compiles an intermediate representation for tiled computation. This IR allows more general functions and also claims higher performance.
Without reading the paper, I think you have it a little backwards - the IR doesn't itself allow for more general functions. More general functions are possible (in theory) because the frontend (this Triton language) is decoupled from the backend (CUDA) through the IR as an interface. In this way the Triton IR is no less domain-specific than XLA, because both are IRs that represent sequences of operators that run on a GPU (or TPU or whatever). I guess in theory Triton could be eschewing all of e.g. cuDNN, but most likely it's not, as NVIDIA's closed-source kernels perform best on their closed-source hardware.
Edit: should've read the post before commenting. Looks like they are in fact using LLVM's PTX backend (i.e. generating CUDA kernels from scratch). Kudos to them.
What I got from reading the docs and the blog post was that this is a competitor to torch.jit and numba.cuda.jit - write Pythonic GPGPU kernels faster, without sacrificing speed.
So the code looks (apart from the pointers) similar to Numba, which feels much closer to high-level NumPy/PyTorch code. Are there huge advantages in the Triton model compared to Numba that I don't see? Or is there a big performance gap?
For me, Numba was always the easiest way to get some new idea running on CUDA, and most of the time it was fast enough.
Did anybody find a performance comparison between Numba and Triton?
Unlike Numba, Triton kernels operate on blocks, with explicit loads and stores of blocks. This is what enables the analysis that automates coalescing, shared memory management, etc.
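To illustrate the difference (a rough sketch, not a benchmark): a Numba CUDA kernel is written from the point of view of a single scalar thread, while the equivalent Triton kernel (like the vector-add sketch earlier in the thread) is written over whole blocks.

    from numba import cuda

    @cuda.jit
    def add_numba(x, y, out):
        i = cuda.grid(1)          # each thread handles one scalar element
        if i < out.size:
            out[i] = x[i] + y[i]

In Numba you reason about individual threads and manage shared memory yourself (e.g. via cuda.shared.array); in Triton you write block-level loads and stores and the compiler takes care of coalescing and shared memory.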
I guess Triton is not for you if Numba is fast enough.
I wonder if this can be used for graphics programming. Shaders are notoriously hard to write correctly, and this seems like it might provide an easier gateway than GLSL.
This guy developed Triton for his PhD thesis, and OpenAI hired him to continue working on it. Doesn't really seem fair to give all the innovation credit to OpenAI.
This sounds like basically hand-holding for Python programmers to write simple NN-operations. I'm sure it's convenient and useful, but it's still glorified glue code.
No, this is a DSL that allows people who normally write CUDA to do so with fewer lines of code, and end up with a faster kernel. By embedding it inside Python, you don't need to write your own lexer/parser.
[1] https://dl.acm.org/doi/abs/10.1145/3315508.3329973