
You're right that most people only use a small subset of CUDA: we prioritized support for features based on what was needed for various open-source projects, as a way to capture the most common things first.

A complete API comparison table is coming soon, I believe. :D

In a nutshell:

- DPX: Yes.

- Shuffles: Yes, including the PTX versions, with all their weird/wacky/insane arguments (see the sketch below).

- Atomics: Yes, except the 128-bit atomics NVIDIA added very recently.

- MMA: In development, though of course we can't fix the fact that NVIDIA's hardware in this area is just better than AMD's, so don't expect performance to be as good in all cases.

- TMA: On the same branch as MMA, though it'll just be using AMD's async copy instructions.
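
To make the shuffle entry concrete, here's a minimal sketch of the kind of warp-shuffle code this covers; the kernel and names are illustrative, not taken from our test suite:

    // Warp-level butterfly sum using __shfl_xor_sync; launch with 32 threads.
    __global__ void warp_sum(const float* in, float* out) {
        float v = in[threadIdx.x];
        // Exchange partial sums across lanes, halving the stride each step.
        for (int mask = 16; mask > 0; mask >>= 1)
            v += __shfl_xor_sync(0xffffffffu, v, mask);
        if (threadIdx.x == 0) *out = v;  // lane 0 now holds the warp total
    }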

> mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.

We plan to publish a compatibility table showing which instructions are supported, but a list of the instructions used to produce each PTX instruction is not, in general, meaningful. The inline PTX handler works by converting the PTX block to LLVM IR at the start of compilation (at the same time the rest of your code gets turned into IR), so it then "compiles forward" with the rest of the program. As a result, the actual instructions chosen vary on a case-by-case basis at the whims of the optimiser. This design in principle produces better performance than a hypothetical solution that turned PTX asm into AMD asm, because it conveniently eliminates the optimisation barrier an asm block typically represents. Care, of course, is taken to handle the wacky memory consistency concerns this implies!
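
For concreteness, an inline PTX block like the one below gets parsed into LLVM IR rather than passed through as an opaque string (the kernel around it is illustrative, not from our codebase):

    __global__ void add_via_ptx(const int* a, const int* b, int* c) {
        int i = threadIdx.x;
        int r;
        // This PTX is lowered to IR alongside the rest of the kernel, so the
        // optimiser sees through it instead of hitting an asm barrier.
        asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a[i]), "r"(b[i]));
        c[i] = r;
    }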

We're documenting which ones are expected to perform worse than on NVIDIA, though!



Have you seen anyone productively using TMA on Nvidia or async instructions on AMD? I’m currently looking at a 60% throughput degradation for 2D inputs on H100: https://github.com/ashvardanian/scaling-democracy/blob/a8092...
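
For reference, the async staging pattern I mean looks roughly like this; a simplified sketch using the cooperative-groups API, not the actual kernel from the link:

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    __global__ void stage_tile(const float* src, float* dst) {
        __shared__ float tile[256];  // illustrative tile size
        auto block = cg::this_thread_block();
        // Cooperatively copy the tile to shared memory (cp.async on Ampere+,
        // bypassing the register round-trip of a plain load/store).
        cg::memcpy_async(block, tile, src, sizeof(tile));
        cg::wait(block);  // block until the async copies have landed
        dst[block.thread_rank()] = tile[block.thread_rank()];
    }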


> You're right that most people only use a small subset of cuda

This is true first and foremost for the host-side API. In my experience on StackOverflow and the NVIDIA forums, I'm often the first and only person to ask about any number of nooks and crannies of the CUDA Driver API, hitting issues which nobody seems to have stumbled onto before; or at least, nobody who stumbled onto them wrote anything public about it.
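
To illustrate what I mean by nooks and crannies: even the plain happy path of the Driver API is something most CUDA users never touch, since the Runtime API hides it. A minimal sketch, with error handling omitted and placeholder file/kernel names:

    #include <cuda.h>

    int main(void) {
        CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoad(&mod, "kernel.cubin");          // placeholder path
        cuModuleGetFunction(&fn, mod, "my_kernel");  // placeholder name
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();
        cuCtxDestroy(ctx);
        return 0;
    }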


Oh yes, we found all kinds of bugs in NVIDIA's CUDA implementation during this project :D.

There's a bunch of pretty obscure functions in the device-side APIs too: some esoteric math functions, old SIMD "intrinsics" that are mostly irrelevant with modern compilers, etc.
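
An example of those legacy SIMD intrinsics, in case they're unfamiliar: __vadd4 adds four packed 8-bit lanes within a single 32-bit register (the kernel around it is illustrative):

    __global__ void packed_add(const unsigned* a, const unsigned* b, unsigned* c) {
        int i = threadIdx.x;
        // Four byte-wise adds per word, with no carry between lanes; modern
        // compilers often generate this pattern themselves, hence "mostly irrelevant".
        c[i] = __vadd4(a[i], b[i]);
    }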



