
It's a matter of perspective. If you think of the GPU as a separate computer, you're right. If you think of it as a coprocessor, then the use of RPC is just an implementation detail of the system call mechanism, not a semantically different thing.
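To make "implementation detail" concrete, here's a minimal, made-up sketch of the shape of such a mechanism (none of these names come from the actual implementation being discussed): the kernel posts a request into pinned memory, the host services it, and the kernel spins on the reply.

    // Toy device-to-host "syscall" over a single shared request slot.
    // Everything here (Request, the opcode value, etc.) is invented for
    // illustration; it assumes unified addressing, so the pinned host
    // pointer is usable from the device.
    #include <atomic>
    #include <cstdio>
    #include <cuda_runtime.h>

    struct Request {
      int opcode;           // which "syscall" is being asked for (1 = add one)
      int arg;              // request payload
      int result;           // filled in by the host
      volatile int state;   // 0 = idle, 1 = submitted, 2 = completed
    };

    __global__ void uses_a_syscall(Request *req, int *out) {
      // One thread issues the call; the rest of the block waits at the
      // barrier, as it would for any other long-latency operation.
      if (threadIdx.x == 0) {
        req->opcode = 1;
        req->arg = 41;
        __threadfence_system();       // make the payload visible to the CPU
        req->state = 1;               // submit
        while (req->state != 2) { }   // spin until the CPU completes it
        __threadfence_system();       // order the result read after the flag
        *out = req->result;
      }
      __syncthreads();
    }

    int main() {
      Request *req = nullptr;
      cudaHostAlloc((void **)&req, sizeof(Request), cudaHostAllocMapped);
      req->state = 0;

      int *out = nullptr;
      cudaMallocManaged((void **)&out, sizeof(int));

      // The launch is asynchronous, so the host falls straight into its
      // "kernel mode" service loop while the GPU kernel runs.
      uses_a_syscall<<<1, 32>>>(req, out);

      volatile Request *vreq = req;
      while (vreq->state != 1) { }                          // wait for a submission
      vreq->result = vreq->arg + 1;                         // do the actual work on the CPU
      std::atomic_thread_fence(std::memory_order_seq_cst);  // publish result before the flag
      vreq->state = 2;                                      // complete

      cudaDeviceSynchronize();
      printf("result = %d\n", *out);  // prints: result = 42
      return 0;
    }

Whether that post-and-spin counts as "a syscall" or "an RPC" is exactly the perspective question.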

When an old-school 486SX delegates a floating point instruction to a physically separate 487DX coprocessor, is it executing an instruction or doing an RPC? If it's RPC, does the same instruction start being a real instruction when you replace your 486SX with a 486DX, with its integrated FPU? The program can't tell the difference!




A 486SX never delegates floating point instructions; the 487 is a full 486DX that disables the SX and takes over entirely. You're thinking of the 386 and older.


> It's a matter of perspective. If you think of the GPU as a separate computer, you're right.

This perspective is a function of exactly one thing: do you care about the performance of your program? If not, then sure, indulge in whatever abstract perspective you want ("it's magic, I just press buttons and the lights blink"). But if you don't care about perf, why are you using a GPU at all? So for people who aren't just randomly running code on a GPU for shits and giggles, the distinction between "syscall" and syscall is very significant.

People who say these things don't program GPUs for a living. There are no abstractions unless you don't care about your program's performance (in which case, again, why are you using a GPU at all?).


The "proper syscall" isn't a fast thing either; the context switch blows out your caches. Part of why I like the name syscall is that it's an indication not to put it on the fast path.

The implementation behind this puts a lot of emphasis on performance, though the protocol was heavily simplified during upstreaming. Running over PCIe instead of on the APU systems makes things rather laggy too. The design is roughly a mashup of io_uring and occam, made much more annoying by the GPU scheduler constraints.
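Very roughly, and with invented names rather than anything resembling the upstream layout, the io_uring side of that mashup is a pair of single-producer/single-consumer rings in shared memory, one per direction:

    // Sketch only: invented names and sizes, not the upstream protocol.
    // The device produces into the submit ring, the host consumes, does the
    // work, and produces into the complete ring. As in io_uring, each index
    // is only ever advanced by one side; cross-side visibility needs
    // system-scope fences/atomics, omitted here for brevity.
    #include <cstdint>

    constexpr uint32_t kRingEntries = 64;   // power of two so masking works

    struct Packet {
      uint16_t opcode;     // which call is being requested
      uint64_t args[6];    // inline arguments, SQE-style
      uint64_t result;     // filled in on the completion side
    };

    struct Ring {
      uint32_t head;       // advanced by the consumer
      uint32_t tail;       // advanced by the producer
      Packet   slots[kRingEntries];
    };

    struct SharedQueue {
      Ring submit;         // device -> host
      Ring complete;       // host -> device
    };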

The two authors of this thing probably count as people who program GPUs for a living for what it's worth.


Not everything in every program is performance critical. A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs. That's as much BS on GPU as it is on CPU. In CPU land, we moved past this sophomoric attitude decades ago. The GPU world might catch up one day.

Are you planning on putting fopen() in an inner loop or something? LOL


The whole reason CUDA/GPUs are fast is that they explicitly don’t match the architecture of CPUs. The truly sophomoric attitude is that all compute devices should work like CPUs. The point of CUDA/GPUs is to provide a different set of abstractions than CPUs that enable much higher performance for certain problems. Forcing your GPU to execute CPU-like code is a bad abstraction.

Your comment about putting fopen in an inner loop really betrays that. Every thread in your GPU kernel is going to have to wait for your libc call. You’re really confused if you’re talking about hot loops in a GPU kernel.
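To put that in code (a toy sketch; slow_host_call is a made-up stand-in for a libc routine like fopen that has to be serviced by the CPU): one thread taking the slow path holds up every thread that synchronizes with it.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in for a CPU-serviced libc call. It just burns cycles so the
    // example is self-contained; the real thing is a round trip to the host.
    __device__ void slow_host_call() {
      long long start = clock64();
      while (clock64() - start < 100000) { }
    }

    __global__ void scale(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      if (threadIdx.x == 0) {
        slow_host_call();            // one thread takes the slow path...
      }
      __syncthreads();               // ...and the whole block waits for it

      if (i < n) data[i] *= 2.0f;
    }

    int main() {
      int n = 1 << 20;
      float *data = nullptr;
      cudaMallocManaged((void **)&data, n * sizeof(float));
      for (int i = 0; i < n; ++i) data[i] = 1.0f;

      scale<<<(n + 255) / 256, 256>>>(data, n);
      cudaDeviceSynchronize();

      printf("data[0] = %f\n", data[0]);   // 2.000000
      cudaFree(data);
      return 0;
    }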


> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs.

You're talking to the wrong people; this is definitely not true in general.


Genuinely asking: where else should ML engineers focus their time, if not on looking at datapath bottlenecks in either kernel execution or the networking stack?


The point is that you should focus on the bottlenecks, not on making every random piece of code "as fast as possible". And that sometimes other things (maintainability, comprehensibility, debuggability) are more important than maximum possible performance, even on the GPU.


That's fair, but I didn't understand the OP to be claiming above that "cudaheads" aren't looking at their performance bottlenecks before deciding where to put the work, just that they're looking at the problem incorrectly (and, e.g., maybe should prioritize redesigns over squeezing perf out of flawed approaches).


> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs

I don't know what a "cudahead" is, but if you're gonna build up a strawman just to chop it down, have at it. It doesn't change anything about my point: these aren't syscalls because there's no sys. I mean, the dev here literally spells it out correctly, so I don't understand why there's any debate.



