
Unfortunately, the comment you are responding to is more correct on this than I think you are. The Python jab was stupid, though: a lot of high-performance code gets written in libraries like numpy (which call into C 99.99% of the time) or pytorch (JIT-compiled before executing), both of which keep Python out of the critical path.
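
To make that concrete, here is a minimal sketch (the sizes and timings are purely illustrative, and it assumes numpy's usual BLAS backend) of why the interpreter stays off the critical path when numpy does the work:

    import numpy as np
    import time

    N = 2048
    a = np.random.rand(N, N)
    b = np.random.rand(N, N)

    # One Python-level call; the multiply itself runs in optimized
    # native code (BLAS), so the interpreter never touches the hot loop.
    t0 = time.perf_counter()
    c = a @ b
    print("numpy matmul:", time.perf_counter() - t0, "s")

    # A pure-Python triple loop keeps the interpreter in the inner loop
    # and is drastically slower per element (run at a much smaller size
    # so it finishes quickly).
    M = 128
    x, y = np.random.rand(M, M), np.random.rand(M, M)
    z = np.zeros((M, M))
    t0 = time.perf_counter()
    for i in range(M):
        for j in range(M):
            s = 0.0
            for k in range(M):
                s += x[i, k] * y[k, j]
            z[i, j] = s
    print("pure-Python matmul:", time.perf_counter() - t0, "s")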

The problem with this research is that many similar ideas have been tried before in the contexts of supercomputers or CPU compilers. They ultimately all fail because they end up (1) being more work to program in, and (2) not being any faster because real life happens. Networks drop packets, clock frequencies jitter, and all sorts of non-determinism happens when you have large scale. A static scheduler forces you to stall the whole program for any of these faults. All the gain you got by painstakingly optimizing things goes away.

PhD theses, in the best case, are the basis for new applications. Most of them amount to nothing, and this one belongs on that pile. The sad part is that it isn't the student's fault: the professor sent them down a research direction that is guaranteed to amount to nothing of use. This one is on the professor.




I don't think the paper is about statically scheduled architectures. In fact, they mention it's for modern accelerators, which switch between threads dynamically rather than stalling. The scheduling being referred to seems to mean the order in which instructions are fed to a potentially dynamic scheduler, so that caches and the like are used efficiently.

So I'm not sure you can dismiss it as a thesis which will amount to nothing on the basis that static scheduling is a bad idea!

I could easily have missed something though. It's not a particularly clear or succinct write-up and I have only read some of it. If it does say that it only works for strictly deterministic in-order architectures somewhere please can you point out where?


If you read the actual github, linked from the article, you will realize that it is a manual static scheduler. Example here:

https://github.com/exo-lang/exo/blob/main/examples/avx2_matm...

The hardware still does what it does, but this is a tool for manually ordering operations.
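
For anyone skimming the link: this is not Exo syntax, just a plain-Python sketch of what "manually ordering operations" buys you in a matmul loop nest; the linked example expresses the analogous transformations in Exo itself.

    # Not Exo's API -- an illustration of manually reordering a loop nest.

    def matmul_ijk(A, B, C, n):
        # Naive order: the innermost loop strides down a column of B,
        # which is cache-unfriendly for row-major storage.
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i][j] += A[i][k] * B[k][j]

    def matmul_ikj(A, B, C, n):
        # Same arithmetic, statically reordered: the innermost loop now
        # walks rows of B and C contiguously. The chip's dynamic
        # scheduler is untouched; only the order the work is handed to
        # it changes.
        for i in range(n):
            for k in range(n):
                aik = A[i][k]
                for j in range(n):
                    C[i][j] += aik * B[k][j]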

I did, in fact, read the materials before dismissing them as useless, since this is very relevant to what I do professionally.


It's me who hadn't read all of the materials. But in fact I think we both agree about what the tool does, which is scheduling an instruction stream ahead of time.

I'm confused, because that approach seems identical to what every mainstream compiler I know of does. GCC/Clang also schedule instructions statically, and the right schedule will improve performance. Why won't it work here? What kind of dynamic scheduling do you think it needs in order to be useful? Something like a tracing JIT? Or would they need to implement the reordering in hardware and reschedule as instructions execute?


The issue is that it takes manual programmer effort for no gain over what a compiler gives you.

The selling point of most of these tools is that you can be smarter than the compiler, and this one (despite the example) is selling static scheduling of compute kernels, too. That would normally be up to the OS/driver with some flexibility.


Most CUDA kernels are hand-tuned anyway, so it's not clear this is a lot more effort for the programmer. Most compilers can't perform these kinds of loop transformation and tiling optimizations for you; the CUDA compiler certainly doesn't. So it is actually both possible and very worthwhile to try to be smarter than the compiler in this case.
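
As a rough sketch of the kind of tiling transformation meant here (pure Python for readability; the tile size of 64 is an arbitrary illustrative choice, not something from the paper):

    TILE = 64  # illustrative; the right value depends on cache/shared-memory size

    def matmul_tiled(A, B, C, n):
        # Process TILE x TILE blocks so the operands stay resident in
        # cache (or shared memory on a GPU) while they are reused,
        # instead of streaming the whole matrices through on every pass.
        for i0 in range(0, n, TILE):
            for j0 in range(0, n, TILE):
                for k0 in range(0, n, TILE):
                    for i in range(i0, min(i0 + TILE, n)):
                        for k in range(k0, min(k0 + TILE, n)):
                            aik = A[i][k]
                            for j in range(j0, min(j0 + TILE, n)):
                                C[i][j] += aik * B[k][j]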

In terms of scheduling compute kernels, the project doesn't remove the ability of the OS/driver to schedule execution. It's only affecting the instruction sequence within the kernel, which is something OS/drivers don't typically control. They retain full control over when the kernel is executed by the hardware.

(PTX is sometimes compiled to SASS by the Nvidia driver rather than ahead of time. That does allow the driver to reschedule instructions, but using this project doesn't prevent it from doing that. The driver will compile PTX emitted from this project in the same way as any other.)


Yes, micro-optimization with manual instruction scheduling is usually a great idea. No, macro-optimization with manual instruction scheduling is usually a bad idea. That it "doesn't remove" something isn't an argument that this is a good idea when the selling point is that it enables removal.

The "novel" idea here is the macro-optimization, with some overtures about making the micro-optimization easier - as you likely know, this is not true since the complexity here is more about understanding how to do better, not about what language features you use to make those changes.



