Folks might find the research paper [1] the author wrote while at Harvard more informative. This is a great high-level description, but if you want more detail, I recommend the paper.
That was unfortunately short-lived. ACM announced on March 30, 2020 that they would open their DL for 90 days due to the pandemic [1]. I don't believe there was an extension, so it expired on June 30, 2020.
Note that there is a new (substantially accelerated) "autoscheduler" implementation in TVM (one of the competitors in the linked article); details can be found in https://arxiv.org/abs/2006.06762
CUDA:

    c[i] = a[i] + b[i]
    i += 1

Triton:

    c[i:i+16] = a[i:i+16] + b[i:i+16]
    i += 16
The 16 in this example is the "block size", and could be anything. But this notion of expressing computation over blocks of dense data seems to be the big difference from other approaches.
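For a more concrete picture, here is roughly what that vector add looks like as an actual Triton kernel, written against the Python API from the Triton tutorials. Treat it as a sketch rather than gospel; exact names (tl.program_id, tl.arange, tl.constexpr) can differ a little between versions.

    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
        # Each program instance handles one block of BLOCK elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n                       # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

The body is written in terms of whole blocks (like the c[i:i+16] pseudo-code above), and the compiler decides how those blocks get mapped onto threads, registers, and shared memory.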
A very exciting result of the incredible performance that Triton achieves is the ability to fuse NN operations such as Matrix Multiply + LeakyReLU + Batch Norm. Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory, and then your Batch Norm would read the LeakyReLU output out of memory again.
The ability to write very fast kernels, and especially to fuse them together to avoid unnecessary memory round-trips, is a big deal!
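To make the fusion point concrete, here is a toy sketch (a hypothetical bias + LeakyReLU kernel, not the actual Matrix Multiply + LeakyReLU + Batch Norm case) of what fusing an activation into the same Triton kernel looks like. The intermediate stays in registers instead of taking a round trip through global memory:

    import triton
    import triton.language as tl

    @triton.jit
    def bias_leaky_relu(x_ptr, bias_ptr, out_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        b = tl.load(bias_ptr + offs, mask=mask)
        y = x + b                             # bias add
        y = tl.where(y > 0, y, 0.01 * y)      # LeakyReLU, fused in registers
        tl.store(out_ptr + offs, y, mask=mask)

With cuBLAS-style libraries, each of those steps would be its own kernel launch with a full read and write of the tensor in between.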
> Previously, you needed to rely on cuBLAS for fast hand-written Matrix Multiply kernels, and then your LeakyReLU would need to read that result out of memory,
> PS: The name Triton was coined in mid-2019 when I released my PhD paper on the subject (http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tille...). I chose not to rename the project when the Triton inference server came out a year later since it's the only thing that ties my helpful PhD advisors to the project.
Are we sure that marketing article was not changed after the fact? The Nvidia Triton Release notes show 'Starting in release 20.03, TensorRT Inference Server is now called Triton Inference Server.' [0]
It kind of looks like the article just changed the heading later, since the Docker images it refers to are all still called 'tensorrtserver', which was the project's name before the renaming in release 20.03.
The OpenAI Triton author says on Reddit: 'You can also look at the github history and you will see that there is no mention of the "Triton inference server" up until version 2.0, which wasn't out in 2019 (I ran `git reset --hard v1.9.0 ; grep -ir "triton" .`)' [1]
The name also clashed with Triton [0], a demoscene group famous for its Crystal Dreams 2 demo, which featured a zooming Mandelbrot on a 386 PC: https://youtu.be/BLMUfBikxTY?t=182
My first thought exactly. This will cause nothing but confusion and Triton (the inference server) is well integrated into the space.
So it's especially weird to see it coming from OpenAI, and not a more random startup. It honestly makes no sense they would deliberately do this, unless there is some secret cult of Triton that is going on in the Bay Area world of AI/ML.
> CPUs and AMD GPUs are not supported at the moment, but we welcome community contributions aimed at addressing this limitation.
That's disappointing. My biggest frustration is that every ML library (PyTorch, Keras, etc.) is tied to CUDA/Nvidia, so I take a huge performance hit when running them on my Mac.
I'm really surprised AMD hasn't pushed much harder into this space given their aggressive targeting of the data center with Epyc. I too use mostly Macs during the day, and any ML projects are always relegated to Nvidia machines in my rack / on Amazon.
It's pretty much the OpenCL model, except it compiles into AVX2 / AVX-512 code. Very similar to CUDA / OpenCL style programming. It's not single-source like CUDA, but it largely accomplishes the same programming model IMO.
OpenMP 4.5+ looks very promising to me (particularly the "simd" keyword associated with for-loops). But open-source implementations of OpenMP are somewhat lackluster... at least last time I checked it out.
Maybe it's time I revisit it. I've always thought the OpenMP spec was being written by highly competent programmers. They seem to "get" what is needed. But the question is whether I can get my hands on any OpenMP implementation to actually play with.
I should really find time to do comparisons with the compilers to hand, including XL, on the NAS benchmarks, as I've never seen that, though it must have been done.
I think we have a "community" version of XL, i.e. no support, like basically everything else. I wasn't aware there was anything much wrong with GCC and libgomp or libomp, but then I haven't measured.
The task-based parallelism in LLVM leaves much to be desired, however. Ideally, you'd want a more efficient implementation.
But yeah, good enough to play with. But maybe not good enough to achieve high levels of performance. The SIMD stuff is probably simple enough to implement... maybe I should check out how well LLVM works with the OMP SIMD keywords.
Can you comment on your experience (or contact me) regarding implementation efficiency? We have recently implemented task-based parallelism in the J language with OpenMP [0]. Improvements or critiques are appreciated. SIMD instructions there have been coded directly rather than via pragmas.
I can't say that my critiques are based on personal experience. They're mostly based on microbenchmarks that other people have written about. I am probably a bit out of date, since it's been a while since I last played with OpenMP.
I'm looking at the benchmarks I used to look at, and they're all from 2014 or earlier. So maybe I really should double-check modern implementations. We all know GCC 4.x and LLVM 3.x were an eternity ago, so I probably should revisit their performance.
In LLVM or in libomp?
I don't know what omp simd is likely to get you over autovectorization. I know of cases where it was thought necessary (-fopenmp-simd, without -fopenmp) but wasn't with recent GCC.
"#pragma omp declare simd" applies over a function call, which then allows that function to be used inside of a "#pragma omp for simd" loop.
A few keywords here and there really help the autovectorizer achieve closer to CUDA-like environments (like... actually having your SIMD code extend "through" a function call, so you can start splitting up the work a bit better).
I took the example program from the OpenMP standard and built it with GCC 11 -Ofast. -fopt-info said the relevant loop was vectorized. Adding -fopenmp gave more vectorization messages from elsewhere, but I don't have time to figure out the difference from the tree dump (not being good with assembler). Doubtless the directives can help, but you do need to get them right, and I trust GCC more than me!
Your project is well engineered, but no matter how many times you post it here, you won't get real traction without a different development model, such as building on GitHub. This is especially true in a high-churn, high-cost field like ML. I also think you are staying anonymous unnecessarily.
Some user feedback: I tried to get the ResNet50 example to work, but I gave up after 2 hours. There really should be an end-to-end example like "./resnet50_example.py monkey.jpg".
A few points where I struggled:
- What are the names in the ResNet50Params struct? I looked at a few popular ResNet50 implementations, but I could not find any correspondence, and since the names have been sorted, the order of the members might as well be random. I thought about matching the parameters by shape instead, but the chance of getting that correct is almost zero. Why no natural sort? Or even better, a naming scheme that sorts in execution order, for example with zero-prefixed numbers like "layer05".
- 6 MB C files are somewhat ridiculous. There should at least be a download link somewhere. Copy & pasting from the website would take forever because of scrolling. Also it is impossible to verify that that code monstrosity is not doing anything evil.
- Go seems like an odd choice when almost every machine learning project these days is developed in Python.
- GCC throws a few hundred warnings:
ResNet50.c: In function ‘ResNet50ThreeArrangeDats7Callee1’:
ResNet50.c:57009:15: warning: unused variable ‘rel27’ [-Wunused-variable]
57009 | ptrdiff_t rel27 = j62-0;
| ^~~~~
and
ResNet50.c: In function ‘ResNet50NetCreate’:
ResNet50.c:77852:20: warning: taking address of packed member of ‘struct ResNet50Params’ may result in an unaligned pointer value [-Waddress-of-packed-member]
77852 | params1->bn1Means,
| ~~~~~~~^~~~~~~~~~
- You would get more feedback if there was a platform to discuss such issues (e.g. GitHub) instead of hijacking random threads on HackerNews.
- EDIT: I just realized I could probably just write a parser for that graph file format. This would be much easier if it was something standard like JSON instead.
- EDIT2: I think some shapes in the graph file definition are incorrect (possibly reversed per block?). For example, tensor "one1" should have "ToChannels=64" instead of "ToChannels=256".
That's a full ResNet50 conversion from the corresponding Caffe protobuf. He also wrote a small script to convert TensorFlow networks.
GCC warns about perfectly valid, correct C. It even warns about switch-case fallthrough. All of the generated code is fully correct. Turn the warnings off.
Please don't copy and paste from the website! Download the NN-512 executable and run it.
Thanks, that code does indeed work! And it is significantly faster than PyTorch (15 vs 60 milliseconds). Maybe I will use it in the future if AVX-512 becomes more widespread.
Sure, but if this abstraction layer becomes popular, then it becomes much easier to support other GPUs without requiring client libraries to change, which is a much harder problem.
A big reason why CUDA is popular with compiler writers is that the PTX assembly-ish language is well documented and reasonable.
Compilers generate PTX, then the rest of the CUDA infrastructure turns the PTX into Turing machine code, or Ampere machine code, or Pascal machine code.
In theory, SPIR-V should do the same job, but it's just not as usable right now. In the meantime, getting it to work on PTX is easier, and then there's probably hope (in the far future) of moving to SPIR-V if that ever actually takes off.
I'm not a developer on Triton, but that'd be my expectation.
AMD's ROCm 4.0 now supports cooperative groups, which is probably one of the last major holdouts for CUDA compatibility.
There are still 64-lane wavefronts (on AMD CDNA cards) instead of 32-lane warps (in CUDA). But AMD even has 4x4 half-float matrix multiplication instructions in ROCm (for MI100, the only card that supports the matrix-multiplication / tensor instructions).
---------
I think CUDA vs OpenCL is over. ROCm from AMD has its restrictions, but... it really is easier to program than OpenCL. It's a superior model: having a single language that supports both CPU and GPU code is just easier than switching between C++ and OpenCL (where data structures can't be shared as easily).
-----------
The main issue with AMD is that they're cutting support for their older cards. The cheapest card you can get that supports ROCm is Vega56 now... otherwise you're basically expected to go for the expensive MI-line (MI50, MI100).
> The main issue with AMD is that they're cutting support for their older cards.
No, their main issue is not properly supporting ROCm in general. No Windows support at all? It still feels like they don't know whether they want to continue investing in ROCm long term.
I'd assume that they're gonna support ROCm as long as the Frontier deployment at Oak Ridge National Labs is up. ORNL isn't exactly a customer you want to piss off.
The new contest is not CUDA vs. OpenCL but CUDA vs. Vulkan Compute. As support for Vulkan in hardware becomes more widespread, it makes more and more sense to just standardize on it for all workloads. The programming model is quite different between the two (kernels vs. shaders) and OpenCL 2.x has quite a few features that are not in Vulkan, but the latest version of OpenCL has downgraded many of these to extensions.
CUDA programmers choose CUDA because when you make a struct FooBar{}; in CUDA, it works on both the CPU side and the GPU side.
Vulkan / OpenCL / etc. don't have any data-structure sharing like that with the host code. It's a point of contention that makes anything more complicated than a 3-dimensional array hard to share.
Yeah, Vulkan / OpenCL have all sorts of pointer-sharing arrangements (Shared Virtual Memory) and whatnot. But it's difficult to use in practice, because they keep the concept of "GPU" code separate from "CPU" code.
---------
When you look at things such as Triton, you see that people want to unify CPU and GPU code into a single code base. Look at these Triton examples: they're just Python code, sitting inside the rest of the CPU-side Python code.
I think people are realizing that high-level code can flow between the two execution units (CPU or GPU) without changing the high-level language. The compiler works hard to generate code for both systems, but it's better for the compiler to do that work than for the programmer to work on integration.
Is it practical for code to flow between these two execution units? In my understanding, the architectures of a GPU and a CPU are so fundamentally different that it doesn't really make sense. You do want to have a single data structure representation, however.
> In my understanding, the architectures of a GPU and a CPU are so fundamentally different that it doesn't really make sense.
Yes and no. You find surprising bits of code that can be shared.
Read through a high-performance GPU project like "GPU perft" (https://github.com/ankan-ban/perft_gpu/blob/master/chess.h). The "perft" problem is the problem of counting the number of board positions reachable in X moves. (Perft(15) is "how many board positions exist within 15 moves from the start of chess?". Perft(1) is 20, and Perft(2) is 400, then 3+ gets a bit complicated!).
You can see that the fundamentals of bitboard manipulation (using a 64-bit number to represent the 64 squares of a chessboard) remain the same whether you're on a CPU or a GPU.
In this case, you want the code to be shared between the two sides. Why write the code twice? Both CPUs and GPUs are very good at 64-bit integer manipulation.
------------
The actual search (how to branch off, coalesce results, coordinate threads) is extremely different between CPU and GPU. So of course, you want that code to be written in a CPU-specific, or GPU-specific manner.
But the question of "where can this Bishop move??" (the sliding piece attack subroutine) is identical on CPU or GPU.
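To make "bitboard manipulation" concrete, here's a toy, plain-Python version of a sliding-piece (bishop) attack routine over a 64-bit occupancy bitboard. Real engines do this in CUDA/C++ (often with magic-bitboard lookups instead of loops), but the point stands: it's pure 64-bit integer arithmetic that is the same on a CPU or a GPU.

    MASK64 = (1 << 64) - 1  # emulate a uint64_t bitboard

    def bishop_attacks(square, occupancy):
        """Bitboard of squares a bishop on `square` (0..63) attacks, given blockers."""
        attacks = 0
        rank, file = divmod(square, 8)
        for dr, df in ((1, 1), (1, -1), (-1, 1), (-1, -1)):   # the four diagonals
            r, f = rank + dr, file + df
            while 0 <= r < 8 and 0 <= f < 8:
                sq = 1 << (r * 8 + f)
                attacks |= sq
                if occupancy & sq:      # the ray stops at the first blocker
                    break
                r, f = r + dr, f + df
        return attacks & MASK64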
Sounds similar to business logic shared between the client and server of a web app. Thanks.
I just wonder if this alone is enough to make a unified programming environment the preferred way. On the web we had Meteor.js, which tried this approach to unify client and server with JavaScript. The shared code was of similar types. Meteor never became the preferred way to write web apps. I'm wondering if the same is true for GPU + CPU programming.
This analogy is insufficient to capture what is going on.
In web dev, the client and server are doing fundamentally different things (e.g. UX vs data storage), and while sharing business logic is interesting, it doesn't solve a huge problem.
CPU vs GPU is vaguely analogous to different web browsers on the client in the old days, when they were very different. They have different performance characteristics, so shims worked but had a performance impact. But they were widely used because it was too hard to do yourself.
I don't think this is accurate. CPU and GPU are not trying to execute the same code with different performance characteristics. GPU and CPU are fundamentally doing different things. In fact, the analogy of shared business logic works perfectly.
The other poster is right. The CPU and GPU are just two execution resources available to the programmer, no different than spinning up a new thread with pthread_create (well... a bit more complicated, but you get the gist).
Even in video games, where GPUs do graphics and CPUs do physics, the lines have been blurred with TressFX (hair physics on GPU instead of CPU) or PhysX (collision physics done on GPU instead of CPU).
GPUs are general purpose computers now. You can program them to do whatever you want. So when the programmer is writing code, it pays to experiment a little: maybe run the code on the GPU... see if the GPU really is slower or faster. You might be surprised.
Ditto on the reverse: maybe some bit of GPU code would be better run on the CPU.
--------
This guess-and-check, program-and-profile methodology is the same as any other high-performance code people write. You throw away assumptions and just test the heck out of all your ideas, because you're surprised far more often than not in this field.
As such, having the ability to quickly move code from GPU-side to CPU-side (and vice versa) for these tests is extremely beneficial. You don't really know which device will run your code faster until you actually test it. (Though your instincts get better the more experience you gain)
GPUs are (basically) hardware devices for doing vast numbers of matrix manipulations very very fast in parallel, with limitations on the size of the matrix.
CPUs can do matrix manipulations, with less parallelism.
Things like TressFX use the tridiagonal matrix algorithm [1] to solve a linear system representing the hair constraint system. This could be implemented on a CPU as well, doing exactly the same thing with the same algorithm. It would just run slower!
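For reference, the tridiagonal matrix algorithm (Thomas algorithm) is short enough to sketch in plain Python; a GPU implementation does the same forward-elimination / back-substitution arithmetic, just over many independent systems in parallel.

    def thomas_solve(a, b, c, d):
        """Solve a tridiagonal system: a = sub-diagonal (a[0] unused),
        b = diagonal, c = super-diagonal (c[-1] unused), d = right-hand side."""
        n = len(d)
        cp, dp = [0.0] * n, [0.0] * n
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                      # forward elimination
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):             # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x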
No, they are trying to execute exactly the same code. To quote the parent:
You can see that the fundamentals of bitboard manipulation (using a 64-bit number to represent the 64 squares of a chessboard) remain the same whether you're on a CPU or a GPU.
In this case, you want the code to be shared between the two sides. Why write the code twice? Both CPUs and GPUs are very good at 64-bit integer manipulation.
The ceremony around co-ordinating threads and memory access is different, but the code that is being run is exactly the same.
The exact same business logic (helper functions), yes of course. My point is that the high-level algorithm is going to be fundamentally different. Just like the high-level algorithm on the browser (render and handle UI interaction) is different than the fundamental server-side algorithm (render HTML from database requests), and yet there might be shared helper functions. Does this make sense?
Differentiation isn't really the key algorithm. In fact, I believe that happens at compile time, not run time. So not only is it not key, it isn't even happening at run time.
> ROCm from AMD has its restrictions, but... it really is easier to program than OpenCL. It's a superior model: having a single language that supports both CPU and GPU code is just easier than switching between C++ and OpenCL (where data structures can't be shared as easily).
ROCm isn't a programming language or programming model - it's a collection of tools and libraries. Which language are you specifically referencing here? HIP?
Two reasons also. Some of the fastest research supercomputers in the US have been Nvidia-based. Sitting at a national lab, it's laughable how much more support there is from Nvidia. Getting a DGX box is easy - there is nothing comparable from AMD's side.
And the less said about Intel, the better. Smart, motivated people I know have left Argonne recently because of the Aurora shitshow.
Aurora was supposed to be up and running in early 2018. It seems like it won't be functional even in 2021. By the way, a years-long delay like this has never happened when it comes to US Leadership Computing Facilities.
Intel, last I heard, has written off close to 300 million dollars on Aurora.
Do you know if there's any post-mortems about what went wrong there?
I mean, I have the gist. Intel bet big on Xeon Phi, but that didn't seem to offer enough performance. Intel then switched over to this GPU strategy (see Intel Xe), but that required them to rework virtually everything from scratch.
No one has actually said any of what's in my previous paragraph. But it's blatantly obvious: the Xeon Phi was being advertised very strongly by Intel up to the point that Aurora's design was reworked. Suddenly, Xeon Phi was cut, and Intel started talking about Xe (including oneAPI, and all sorts of stuff the HPC market cares about). I'm confident enough at reading between the lines that I'm willing to assume this in the absence of hard facts :-)
I'm kind of curious about the play-by-play, if at all possible. What test showed up that made Argonne National Laboratory decide that the Xeon Phi model wasn't going to work? Was it possible to come to this conclusion any sooner? I realize this sort of stuff is rarely published, but I guess that's what makes me interested in those juicy details.
My friends were at more junior levels, so I don't know the full details either. However, know this: a leadership-class computing purchase by a national lab is a very complicated thing. Many people have their hands in the decision-making pie - all the way up to the Secretary of Energy.
Secondly, the DOE has a strategy (mostly rightly, in my opinion) of not putting all its computation eggs in a single company's basket. Thus national labs compete among themselves, as do companies. Notice the cadence of computing purchases: IBM/Nvidia -> Intel/Intel -> AMD/AMD.
Thirdly, Intel had promised that, using oneAPI, existing GPU-optimized simulation codes could be translated to Intel GPUs with minimal effort. The idea had merit back in 2016-2017: I have a CUDA-based simulation code, and it was supposed to be almost a matter of recompiling for Intel, with some minimal effort on my part. That it didn't work out - well, hindsight is 20/20.
> ROCm from AMD has its restrictions, but... it really is easier to program than OpenCL
That's really not the point. OpenCL is a standard that - at least in principle - is supposed to be supported on multiple platforms by multiple vendors. ROCm is AMD-only, and even that is questionable since it didn't exist 6 or 7 years ago, and who knows - they might drop it, like they changed their earlier focus.
Also, CUDA has a much richer ecosystem than AMD ROCm (I'm sad to say; as I'm not a fan of NVIDIA).
> OpenCL is a standard that - at least in principle - is supposed to be supported on multiple platforms by multiple vendors.
As was C++ AMP (which was actually pretty good as a language, IMO). Just because it's a standard doesn't mean it's going to be used.
OpenCL 2.0 was very poorly implemented: almost no one used any of its advanced features. To the point that OpenCL 3.0 is resetting back to OpenCL 1.2 as the baseline.
Only Intel really supported OpenCL 2.1 or OpenCL 2.2. The entire OpenCL 2.x branch was squandered for years with tepid responses from NVidia and AMD (yes, AMD had better OpenCL 2.0 support, but its debugger didn't work, and all of the code was tested on OpenCL 1.2 only. No one cared.)
I dare say that OpenCL 2.x was about as "standardized" and respected as C++ AMP. Just because it's an open standard doesn't mean that it's actually useful. Any serious OpenCL programmer stuck with OpenCL 1.2, including both AMD and NVidia OpenCL programmers.
Which other hardware vendor provides the level of performance that Nvidia's GPUs provide?
Wasn't relying on the benevolence of a single manufacturer (or a couple of them) just as true in the 90s as it is in the 2020s?
Well, yes and no. Neither AMD nor NVIDIA is committed to open, libre standards - they've both been "betraying" OpenCL (just like Apple and Intel, I should mention). So we have AMD's not-open (?) ROCm and NVIDIA's not-open CUDA.
Also, if we only cared about open standards, we'd simply not use a GPU at all and settle for RISC-V chips :-)
AMD's MI100 is slightly faster than NVidia's A100 for double-precision FLOPS at a slightly lower cost. Good enough for Oak Ridge National Labs (the Frontier supercomputer), to say the least.
NVidia is faster at 4x4 16-bit matrix multiplications (common in Tensor / Deep Learning stuff), but the MI100 still has 4x4 16-bit matrix multiplication instructions and acceleration. It's not far behind, and the greater 64-bit FLOPS are enough to win in scientific fields.
Frontier is going to be AMD GPUs for sure. Whether it's MI100 or the next-generation MI200 is still not really known and subject to rumors. I'm currently under the assumption that it's MI100, but now that you press me on that detail I admit that I'm somewhat assuming. Frontier will be using ROCm in any case.
El Capitan (Lawrence Livermore National Laboratory) is MI100 and specified as such. So I'll "retreat" my argument to that more easily verified position. Still a top supercomputer in the world, and more is known about El Capitan than about Frontier.
Yes, I agree with the basic point. Where is the spec for El Capitan, then? I don't remember ever finding it, and I rather assumed it would wait for possible lessons from the Frontier hardware.
It's possibly worth saying that these things presumably are competitive with NVLink-based multi-GPU, and appear to support unified memory, but I don't know how that compares with AC922s. Not that UM is relevant for what I know of usage on our system....
I have plenty of links, but they're all from non-primary-source news sites, like ServeTheHome or (worse) Wccftech. Those are more "casual" blogs that aren't worth quoting IMO, and are closer to rumor mills (even if they sometimes have good information).
I can't find a single spec from AMD's website or ORNL, or the Department of Energy (primary sources: the ones we know would tell the truth on this matter).
Which is an MI100 test that ORNL did in "preparation" for Frontier. It doesn't seem like either Frontier or El Capitan has had its specs officially released yet.
I believe this is more of an optimization layer to be utilized by libraries like TensorFlow and JAX. More of a simplification of the interaction with traditional CUDA instructions.
I imagine these libraries, and possibly some users, would implement libraries on top of this language and reap some of the optimization benefit without having to maintain low-level CUDA-specific code.
XLA is a domain-specific compiler for linear algebra. Triton generates and compiles an intermediate representation for tiled computation. This IR allows more general functions and also claims higher performance.
Without reading the paper, I think you have it a little backwards - the IR doesn't itself allow for more general functions. More general functions are possible (in theory) because the frontend (this Triton language) is decoupled from the backend (CUDA) through the IR as an interface. In this way the Triton IR is no less domain-specific than XLA, because both are IRs that represent sequences of operators that run on a GPU (or TPU or whatever). I guess in theory Triton could be eschewing all of e.g. cuDNN, but most likely it's not, as NVIDIA's closed-source kernels perform best on their closed-source hardware.
Edit: should've read the post before commenting. Looks like they are in fact using LLVM's PTX backend (i.e. generating CUDA kernels from scratch). Kudos to them.
What I got from reading the docs and the blog post was that this is a competitor to torch.jit and numba.cuda.jit - write Pythonic GPGPU kernels faster, without sacrificing speed.
So the code looks (apart from the pointers) similar to Numba, which feels much closer to high-level NumPy/PyTorch code. Are there huge advantages in the Triton model compared to Numba that I don't see? Or is there a big performance gap?
For me, Numba was always the easiest way to get some new idea running on CUDA, and most of the time it was fast enough.
Did anybody find a performance comparison between Numba and Triton?
Unlike Numba, Triton kernels operate on blocks, with explicit loads and stores of blocks. This is what enables the analysis that automates coalescing, shared memory management, etc.
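To illustrate the difference (a rough sketch, not a benchmark): a Numba CUDA kernel is written from the point of view of a single scalar thread, while the equivalent Triton kernel (like the vector-add sketch earlier in the thread) is written over whole blocks.

    from numba import cuda

    @cuda.jit
    def add_numba(x, y, out):
        i = cuda.grid(1)          # each thread handles one scalar element
        if i < out.size:
            out[i] = x[i] + y[i]

In Numba you reason about individual threads and manage shared memory yourself (e.g. via cuda.shared.array); in Triton you write block-level loads and stores and the compiler takes care of coalescing and shared memory.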
I guess Triton is not for you if Numba is fast enough.
I wonder if this can be used for graphics programming. Shaders are notoriously hard to write correctly, and this seems like it might provide an easier gateway than GLSL.
This guy developed Triton for his PhD thesis, and OpenAI hired him to continue working on it. Doesn't really seem fair to give all the innovation credit to OpenAI.
This sounds like basically hand-holding for Python programmers to write simple NN-operations. I'm sure it's convenient and useful, but it's still glorified glue code.
No, this is a DSL that allows people who normally write CUDA to do so with fewer lines of code, and end up with a faster kernel. By embedding it inside Python, you don't need to write your own lexer/parser.
[1] https://dl.acm.org/doi/abs/10.1145/3315508.3329973