
You're right that most people only use a small subset of CUDA: we prioritized support for features based on what was needed for various open-source projects, as a way to capture the most common things first.

A complete API comparison table is coming soon, I believe. :D

In a nutshell:

- DPX: Yes.

- Shuffles: Yes, including the PTX versions, with all their weird/wacky/insane arguments (see the sketch below).

- Atomics: Yes, except the 128-bit atomics NVIDIA added very recently.

- MMA: In development, though of course we can't fix the fact that NVIDIA's hardware in this area is just better than AMD's, so don't expect performance to be as good in all cases.

- TMA: On the same branch as MMA, though it'll just be using AMD's async copy instructions.
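
To make the shuffle entry concrete, here's a minimal sketch of the kind of warp-shuffle code this covers; the kernel and names are illustrative, not taken from our test suite:

    // Warp-level butterfly sum using __shfl_xor_sync; launch with 32 threads.
    __global__ void warp_sum(const float* in, float* out) {
        float v = in[threadIdx.x];
        // Exchange partial sums across lanes, halving the stride each step.
        for (int mask = 16; mask > 0; mask >>= 1)
            v += __shfl_xor_sync(0xffffffffu, v, mask);
        if (threadIdx.x == 0) *out = v;  // lane 0 now holds the warp total
    }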

> mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.

We plan to publish a compatibility table showing which instructions are supported, but a list of the instructions used to produce each PTX instruction is not, in general, meaningful. The inline PTX handler works by converting the PTX block to LLVM IR at the start of compilation (at the same time the rest of your code gets turned into IR), so it then "compiles forward" with the rest of the program. As a result, the actual instructions chosen vary on a case-by-case basis at the whims of the optimiser. This design in principle produces better performance than a hypothetical solution that turned PTX asm into AMD asm, because it conveniently eliminates the optimisation barrier an asm block typically represents. Care, of course, is taken to handle the wacky memory consistency concerns this implies!
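
For concreteness, an inline PTX block like the one below gets parsed into LLVM IR rather than passed through as an opaque string (the kernel around it is illustrative, not from our codebase):

    __global__ void add_via_ptx(const int* a, const int* b, int* c) {
        int i = threadIdx.x;
        int r;
        // This PTX is lowered to IR alongside the rest of the kernel, so the
        // optimiser sees through it instead of hitting an asm barrier.
        asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a[i]), "r"(b[i]));
        c[i] = r;
    }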

We're documenting which ones are expected to perform worse than on NVIDIA, though!



Have you seen anyone productively using TMA on Nvidia or async instructions on AMD? I’m currently looking at a 60% throughput degradation for 2D inputs on H100: https://github.com/ashvardanian/scaling-democracy/blob/a8092...
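
For reference, the async staging pattern I mean looks roughly like this; a simplified sketch using the cooperative-groups API, not the actual kernel from the link:

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    __global__ void stage_tile(const float* src, float* dst) {
        __shared__ float tile[256];  // illustrative tile size
        auto block = cg::this_thread_block();
        // Cooperatively copy the tile to shared memory (cp.async on Ampere+,
        // bypassing the register round-trip of a plain load/store).
        cg::memcpy_async(block, tile, src, sizeof(tile));
        cg::wait(block);  // block until the async copies have landed
        dst[block.thread_rank()] = tile[block.thread_rank()];
    }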


> You're right that most people only use a small subset of cuda

This is true first and foremost for the host-side API. In my experience on StackOverflow and the NVIDIA forums, I'm often the first and only person to ask about any number of nooks and crannies of the CUDA Driver API, hitting issues which nobody seems to have stumbled onto before; or at least, nobody who stumbled onto them wrote anything public about it.
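
To illustrate what I mean by nooks and crannies: even the plain happy path of the Driver API is something most CUDA users never touch, since the Runtime API hides it. A minimal sketch, with error handling omitted and placeholder file/kernel names:

    #include <cuda.h>

    int main(void) {
        CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoad(&mod, "kernel.cubin");          // placeholder path
        cuModuleGetFunction(&fn, mod, "my_kernel");  // placeholder name
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, NULL, NULL);
        cuCtxSynchronize();
        cuCtxDestroy(ctx);
        return 0;
    }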


Oh yes, we found all kinds of bugs in NVIDIA's CUDA implementation during this project :D.

There's a bunch of pretty obscure functions in the device-side APIs too: some esoteric math functions, old SIMD "intrinsics" that are mostly irrelevant with modern compilers, etc.
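
An example of those legacy SIMD intrinsics, in case they're unfamiliar: __vadd4 adds four packed 8-bit lanes within a single 32-bit register (the kernel around it is illustrative):

    __global__ void packed_add(const unsigned* a, const unsigned* b, unsigned* c) {
        int i = threadIdx.x;
        // Four byte-wise adds per word, with no carry between lanes; modern
        // compilers often generate this pattern themselves, hence "mostly irrelevant".
        c[i] = __vadd4(a[i], b[i]);
    }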



