
Because that's not how machine learning models work. Machine learning as a field goes through a nearly complete revolution annually, and every major new model is a special snowflake with its own unique cases.

Writing high-performance software that handles all of them is next to impossible, because it's precisely the tailoring to the unique features of a given model that provides the high performance.




That's not how I think it works. ML is a small number of operations applied to very large blocks of data: tensors. You can build all kinds of complex formulas from that small set of tensor operations, but the (relative) speed is determined by how efficiently those few operations are implemented, not by how complicated the formulas built from them are.
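
To make that concrete, here's a rough NumPy sketch (shapes and names are made up for illustration): a whole attention step built from nothing but matmul, a scale, and a softmax, so its speed comes down to how fast those few primitives are.

  import numpy as np

  def softmax(x, axis=-1):
      # numerically stable softmax: subtract the max, exponentiate, normalize
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  def attention(q, k, v):
      # a "complex formula" made of two matmuls, one scale, one softmax
      scores = (q @ k.T) / np.sqrt(k.shape[-1])
      return softmax(scores) @ v

  q = np.random.randn(8, 64)    # 8 queries, 64-dim
  k = np.random.randn(16, 64)   # 16 keys
  v = np.random.randn(16, 64)   # 16 values
  out = attention(q, k, v)      # shape (8, 64)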


You're half right. First, tensor operations are only a small part of modern ML. Second, how you plug all those small operations together is where most of the performance difference between implementations comes from these days.

Different hardware offers a variety of different small operations that do almost the same thing. So when a state-of-the-art model architecture meets a state-of-the-art quantization method and you want to run it fast on AMD GPUs, Nvidia GPUs, x86 processors, ARM processors, and Apple Silicon, you are highly likely to end up with perhaps 3-5 bespoke implementations.

This happens every few months in ML. Meanwhile, hardware is innovating and balkanizing at the same time. Now we have Google Silicon, Huawei Silicon, and Intel Arc GPUs. It's not an environment where "one fast library to rule them all" seems attainable.
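
To sketch what "a handful of bespoke implementations" tends to look like (all names here are made up, not any real library's API): the same logical op gets one implementation per backend and a dispatcher picks between them.

  import numpy as np

  # hypothetical per-backend kernel registry; in a real engine each entry
  # would be a hand-tuned CUDA/ROCm/Metal/NEON kernel, not NumPy
  MATMUL_KERNELS = {}

  def register(backend):
      def wrap(fn):
          MATMUL_KERNELS[backend] = fn
          return fn
      return wrap

  @register("cuda")
  def matmul_cuda(a, b):
      return a @ b              # stand-in for a Tensor Core kernel

  @register("metal")
  def matmul_metal(a, b):
      return a @ b              # stand-in for a simdgroup-matrix kernel

  @register("cpu")
  def matmul_cpu(a, b):
      return a @ b              # stand-in for an AVX/NEON kernel

  def matmul(a, b, backend="cpu"):
      # fall back to the portable path if no bespoke kernel exists
      return MATMUL_KERNELS.get(backend, MATMUL_KERNELS["cpu"])(a, b)

  y = matmul(np.random.randn(4, 8), np.random.randn(8, 3), backend="cuda")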


Ok, but in the end you're just evaluating a graph, and I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written. So it makes more sense to work on a backend that you can use for any type of model than to hand-optimize everything.
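
Here's a toy version of what I mean by "evaluating a graph" (entirely made up, not any real compiler IR): the model is just nodes and edges, and a backend is a table that says how each op runs on that hardware.

  import numpy as np

  # toy compute graph: each node is (op name, list of input node names)
  GRAPH = {
      "x":   ("input",  []),
      "w":   ("input",  []),
      "mm":  ("matmul", ["x", "w"]),
      "out": ("relu",   ["mm"]),
  }

  # a "backend" is just a mapping from op names to implementations
  CPU_BACKEND = {
      "matmul": lambda a, b: a @ b,
      "relu":   lambda a: np.maximum(a, 0),
  }

  def evaluate(graph, backend, inputs, target):
      cache = dict(inputs)          # leaf values provided by the caller
      def eval_node(name):
          if name not in cache:
              op, args = graph[name]
              cache[name] = backend[op](*(eval_node(a) for a in args))
          return cache[name]
      return eval_node(target)

  y = evaluate(GRAPH, CPU_BACKEND,
               {"x": np.random.randn(2, 4), "w": np.random.randn(4, 3)},
               "out")               # shape (2, 3)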


>I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written.

No, that's exactly the problem. Compilers can't, because the GPU hardware and the algorithms involved are such rapidly moving targets. Bespoke, hardware-specific quantization, inference, attention, and kernel compilation are the only way to squeeze out the performance users are looking for.
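
As a small example of the kind of thing that keeps moving under a compiler's feet, here's plain symmetric int8 weight quantization in NumPy; the schemes people actually race to ship (int4, group-wise, outlier-aware, and so on) differ in exactly the details that demand new hand-written kernels.

  import numpy as np

  def quantize_int8(w):
      # symmetric per-tensor quantization: one scale for the whole tensor
      scale = np.abs(w).max() / 127.0
      q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return q.astype(np.float32) * scale

  w = np.random.randn(256, 256).astype(np.float32)
  q, scale = quantize_int8(w)
  err = np.abs(dequantize(q, scale) - w).max()   # small reconstruction error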

Creating one fast implementation for all models on all hardware would be like writing one GPU driver for all GPUs and all OSs. It just isn't going to work, and if it somehow does, it isn't going to be fast on all hardware.



