> One thing this gives you is syscall on the gpu i wish people in our industry w...

JonChesterfield · 2024-12-14T21:10:33 1734210633

Well, I called it syscall because it's a void function of 8 u64 arguments which your code stumbles into, gets suspended, then restored with new values for those integers. That it's a function instead of an instruction doesn't change the semantics. My favourite of the uses of that is to pass six of those integers to the x64 syscall operation.

This isn't misnaming. It's a branch into a trampoline that messes about with shared memory to give the effect of the x64 syscall you wanted, or some other thing that you'd rather do on the cpu.

There's a gpu thing called trap which is closer in behaviour to what you're thinking of but it's really annoying to work with.

Side note, RPC has a terrible rep for introducing failure modes into APIs, but that's completely missing here because pcie either works or your machine is gonna have to reboot. There are no errors on the interface that can be handled by the application.

almostgotcaught · 2024-12-14T21:47:56 1734212876

> Well, I called it syscall because it's a void function of 8 u64 arguments which your code stumbles into, gets suspended, then restored with new values for those integers

I'm put it really simply: is there a difference (in perf, semantics, whatever) between using this "syscalls" to implement fopen on GPU and using a syscall to implement fopen on CPU? Note that's a rhetorical question because we both already know that the answer is yes. So again you're just playing slight of hand in calling them syscalls and I'll emphasize: this is a slight of hand that the dev himself doesn't play (so why would I take your word over his).

JonChesterfield · 2024-12-14T22:17:47 1734214667

Wonderfully you don't need to trust my words, you've got my code :)

If semantics are different, that's a bug/todo. It'll have worse latency than a CPU thread making the same kernel request. Throughput shouldn't be way off. The GPU writes some integers to memory that the CPU will need to read, and then write other integers, and then load those again. Plus whatever the x64 syscall itself does. That's a bunch of cache line invalidation and reads. It's not as fast as if the hardware guys were on board with the strategy but I'm optimistic it can be useful today and thus help justify changing the hardware/driver stack.

The whole point of libc is to paper over the syscall interface. If you start from musl, "syscall" can be a table of function pointers or asm. Glibc is more obstructive. This libc open codes a bunch of things, with a rpc.h file dealing with synchronising memcpy of arguments to/from threads running on the CPU which get to call into the Linux kernel directly. It's mainly carefully placed atomic operations to keep the data accesses well defined.

There's also nothing in here which random GPU devs can't build themselves. The header files are (now) self contained if people would like to use the same mechanism for other functionality and don't want to handroll the data structure. The most subtle part is getting this to work correctly under arbitrary warp divergence on volta. It should be an out of the box thing under openmp early next year too.

almostgotcaught · 2024-12-14T23:44:03 1734219843

> Wonderfully you don't need to trust my words, you've got my code :)

My friend it's so incredibly bold of you to claim credit for this work when

1. Joe presented it

2. Joe's name is the only name on the git blame

3. I know Joe and I know he did the lion's share of the work

And so I'll repeat: Joe himself calls it rpc so I'm gonna keep calling it rpc and not syscall.

jhuber6 · 2024-12-15T00:05:14 1734221114

The RPC implementation in LLVM is an adaptation of Jon's original state machine (see https://github.com/JonChesterfield/hostrpc). It looks very different at this point, but we collaborated on the initial design before I fleshed out everything else. Syscall or not is a bit of a semantic argument, but I lean more towards syscall 'inspired'.

JonChesterfield · 2024-12-15T00:35:02 1734222902

Here's the algorithm https://doi.org/10.1145/3458744.3473357. My paper with Joseph on the implementation is at https://doi.org/10.1007/978-3-031-40744-4_15.

The syscall layer this runs on was written at https://github.com/JonChesterfield/hostrpc, 800 commits from May 2020 until Jan 2023. I deliberately wrote that in the open, false paths and mistakes and all. Took ages for a variety of reasons, not least that this was my side project.

You'll find the upstream of that scattered across the commits to libc, mostly authored by Joseph (log shows 300 for him, of which I reviewed 40, and 25 for me). You won't find the phone calls and offline design discussions. You can find the tricky volta solution at https://reviews.llvm.org/D159276 and the initial patch to llvm at https://reviews.llvm.org/D145913.

GPU libc is definitely Joseph's baby, not mine, and this wouldn't be in trunk if he hadn't stubbornly fought through the headwinds to get it there. I'm excited to see it generating some discussion on here.

But yeah, I'd say the syscall implementation we're discussing here has my name adequately written on it to describe it as "my code".

rowanG077 · 2024-12-14T22:40:04 1734216004

Why does a perf difference factor into it? There is no requirement for a syscall to be this fast or else it isn't a syscall. If you have a hot loop you shouldn't be putting a syscall in it, not even on the CPU.

saagarjha · 2024-12-15T12:49:25 1734266965

It's not a syscall any more than a call into a Wine thunk is a syscall. Sure, it implements and emulates a syscall. But it's not a syscall.

quotemstr · 2024-12-14T20:54:57 1734209697

It's a matter of perspective. If you think of the GPU as a separate computer, you're right. If you think of it as a coprocessor, then the use of RPC is just an implementation detail of the system call mechanism, not a semantically different thing.

When an old school 486SX delegates a floating point instruction to a physically separate 487DX coprocessor, is it executing an instruction or doing an RPC? If RPC, does the same instruction start being a real instruction when you replace your 486SX with a 486DX, with an integrated GPU? The program can't tell the difference!

saati · 2024-12-15T18:37:32 1734287852

A 486SX never delegates floating point instructions, the 487 is a full 486DX that disables the SX and fully takes over, you are thinking of 386 and older.

almostgotcaught · 2024-12-14T21:03:47 1734210227

> It's a matter of perspective. If you think of the GPU as a separate computer, you're right.

this perspective is a function of exactly one thing: do you care about the performance of your program? if not then sure indulge in whatever abstract perspective you want ("it's magic, i just press buttons and the lights blink"). but if you don't care about perf then why are you using a GPU at all...? so for people that aren't just randomly running code on a GPU (for shits and giggles), the distinction is very significant between "syscall" and syscall.

people who say these things don't program GPUs for a living. there are no abstractions unless you don't care about your program's performance (in which case why are you using a GPU at all).

JonChesterfield · 2024-12-14T21:26:12 1734211572

The "proper syscall" isn't a fast thing either. The context switch blows out your caches. Part of why I like the name syscall is it's an indication to not put it on the fast path.

The implementation behind this puts a lot of emphasis on performance, though the protocol was heavilt simplfied in upstreaming. Running on pcie instead of the APU systems makes things rather laggy too. Design is roughly a mashup of io_uring and occam, made much more annoying by the GPU scheduler constraints.

The two authors of this thing probably count as people who program GPUs for a living for what it's worth.

quotemstr · 2024-12-14T21:07:53 1734210473

Not everything in every program is performance critical. A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs. That's as much BS on GPU as it is on CPU. In CPU land, we moved past this sophomoric attitude decades ago. The GPU world might catch up one day.

Are you planning on putting fopen() in an inner loop or something? LOL

oivey · 2024-12-14T21:39:11 1734212351

The whole reason CUDA/GPUs are fast is that they explicitly don’t match the architecture of CPUs. The truly sophomoric attitude is that all compute devices should work like CPUs. The point of CUDA/GPUs is to provide a different set of abstractions than CPUs that enable much higher performance for certain problems. Forcing your GPU to execute CPU-like code is a bad abstraction.

Your comment about putting fopen in an inner loop really betrays that. Every thread in your GPU kernel is going to have to wait for your libc call. You’re really confused if you’re talking about hot loops in a GPU kernel.

saagarjha · 2024-12-15T12:50:22 1734267022

> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs.

You're talking to the wrong people; this is definitely not true in general.

nickysielicki · 2024-12-14T21:23:01 1734211381

genuinely asking: where else should ML engineers focus their time, if not on looking at datapath bottlenecks in either kernel execution or the networking stack?

lmm · 2024-12-15T01:40:29 1734226829

The point is that you should focus on the bottlenecks, not on making every random piece of code "as fast as possible". And that sometimes other things (maintainability, comprehensibility, debuggability) are more important than maximum possible performance, even on the GPU.

nickysielicki · 2024-12-15T02:27:06 1734229626

That's fair, but I didn't understand OP to be claiming above that "cudaheads" aren't looking at their performance bottlenecks before driving work, just that they're looking at the problem incorrectly (and eg: maybe should prioritize redesigns over squeezing perf out of flawed approaches.)

almostgotcaught · 2024-12-14T21:36:16 1734212176

> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs

I don't know what a "cudahead" is but if you're gonna build up a strawman just to chop it down have at it. Doesn't change anything about my point - these aren't syscalls because there's no sys. I mean the dev here literally spells it out correctly so I don't understand why there's any debate.