
Unfortunately, the comment you are responding to is more correct on this than I think you are. The Python jab was stupid, though: a lot of high-performance code gets written in libraries like numpy (which call into C 99.99% of the time) or pytorch (JIT-compiled before executing), both of which keep Python out of the critical path.
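
To make that concrete, here is a minimal sketch (the sizes and timings are purely illustrative, and it assumes numpy's usual BLAS backend) of why the interpreter stays off the critical path when numpy does the work:

    import numpy as np
    import time

    N = 2048
    a = np.random.rand(N, N)
    b = np.random.rand(N, N)

    # One Python-level call; the multiply itself runs in optimized
    # native code (BLAS), so the interpreter never touches the hot loop.
    t0 = time.perf_counter()
    c = a @ b
    print("numpy matmul:", time.perf_counter() - t0, "s")

    # A pure-Python triple loop keeps the interpreter in the inner loop
    # and is drastically slower per element (run at a much smaller size
    # so it finishes quickly).
    M = 128
    x, y = np.random.rand(M, M), np.random.rand(M, M)
    z = np.zeros((M, M))
    t0 = time.perf_counter()
    for i in range(M):
        for j in range(M):
            s = 0.0
            for k in range(M):
                s += x[i, k] * y[k, j]
            z[i, j] = s
    print("pure-Python matmul:", time.perf_counter() - t0, "s")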

The problem with this research is that many similar ideas have been tried before in the contexts of supercomputers or CPU compilers. They ultimately all fail because they end up (1) being more work to program in, and (2) not being any faster because real life happens. Networks drop packets, clock frequencies jitter, and all sorts of non-determinism happens when you have large scale. A static scheduler forces you to stall the whole program for any of these faults. All the gain you got by painstakingly optimizing things goes away.

PhD theses, in the best case, are the basis for new applications. Most of them amount to nothing, and this one belongs on that pile. The sad part is that it isn't the student's fault: the professor sent them down a research direction that is guaranteed to amount to nothing of use. This one is on the professor.




I don't think the paper is about statically scheduled architectures. In fact, they mention it's for modern accelerators, which switch between threads dynamically rather than stalling. The scheduling being referred to seems to mean the order in which instructions are fed to a potentially dynamic scheduler, so that caches and the like are used efficiently.

So I'm not sure you can dismiss it as a thesis which will amount to nothing on the basis that static scheduling is a bad idea!

I could easily have missed something though. It's not a particularly clear or succinct write-up and I have only read some of it. If it does say that it only works for strictly deterministic in-order architectures somewhere please can you point out where?


If you read the actual github, linked from the article, you will realize that it is a manual static scheduler. Example here:

https://github.com/exo-lang/exo/blob/main/examples/avx2_matm...

The hardware still does what it does, but this is a tool for manually ordering operations.
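
For anyone skimming the link: this is not Exo syntax, just a plain-Python sketch of what "manually ordering operations" buys you in a matmul loop nest; the linked example expresses the analogous transformations in Exo itself.

    # Not Exo's API -- an illustration of manually reordering a loop nest.

    def matmul_ijk(A, B, C, n):
        # Naive order: the innermost loop strides down a column of B,
        # which is cache-unfriendly for row-major storage.
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i][j] += A[i][k] * B[k][j]

    def matmul_ikj(A, B, C, n):
        # Same arithmetic, statically reordered: the innermost loop now
        # walks rows of B and C contiguously. The chip's dynamic
        # scheduler is untouched; only the order the work is handed to
        # it changes.
        for i in range(n):
            for k in range(n):
                aik = A[i][k]
                for j in range(n):
                    C[i][j] += aik * B[k][j]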

I did, in fact, read the materials before dismissing them as useless, since this is very relevant to what I do professionally.


It's me who hadn't read all of the materials. But in fact I think we both agree about what the tool does, which is scheduling an instruction stream ahead of time.

I'm confused, because that approach seems identical to what every mainstream compiler I know of does. GCC/Clang also schedule instructions statically, and the right schedule will improve performance. Why won't it work here? What kind of dynamic scheduling do you think it needs in order to be useful? Something like a tracing JIT? Or would they need to implement the reordering in hardware and reschedule as instructions execute?


The issue is that it takes manual programmer effort for no gain over what a compiler gives you.

The selling point of most of these tools is that you can be smarter than the compiler, and this one (despite the example) is selling static scheduling of compute kernels, too. That would normally be up to the OS/driver with some flexibility.


Most CUDA kernels are hand-tuned anyway, so it's not clear this is a lot more effort for the programmer. Most compilers can't perform these kinds of loop transformation and tiling optimizations for you; the CUDA compiler certainly doesn't. So it is actually both possible and very worthwhile to try to be smarter than the compiler in this case.
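
As a rough sketch of the kind of tiling transformation meant here (pure Python for readability; the tile size of 64 is an arbitrary illustrative choice, not something from the paper):

    TILE = 64  # illustrative; the right value depends on cache/shared-memory size

    def matmul_tiled(A, B, C, n):
        # Process TILE x TILE blocks so the operands stay resident in
        # cache (or shared memory on a GPU) while they are reused,
        # instead of streaming the whole matrices through on every pass.
        for i0 in range(0, n, TILE):
            for j0 in range(0, n, TILE):
                for k0 in range(0, n, TILE):
                    for i in range(i0, min(i0 + TILE, n)):
                        for k in range(k0, min(k0 + TILE, n)):
                            aik = A[i][k]
                            for j in range(j0, min(j0 + TILE, n)):
                                C[i][j] += aik * B[k][j]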

In terms of scheduling compute kernels, the project doesn't remove the ability of the OS/driver to schedule execution. It's only affecting the instruction sequence within the kernel, which is something OS/drivers don't typically control. They retain full control over when the kernel is executed by the hardware.

(PTX is sometimes compiled to SASS by the Nvidia driver rather than ahead of time. That does allow the driver to reschedule instructions, but using this project doesn't prevent it from doing that. The driver will compile PTX emitted from this project in the same way as any other.)


Yes, micro-optimization with manual instruction scheduling is usually a great idea. No, macro-optimization with manual instruction scheduling is usually a bad idea. That it "doesn't remove" something isn't an argument that this is a good idea when the selling point is that it enables removal.

The "novel" idea here is the macro-optimization, with some overtures about making the micro-optimization easier - as you likely know, this is not true since the complexity here is more about understanding how to do better, not about what language features you use to make those changes.



