
Because that's not how machine learning models work. Machine learning as a field goes through a nearly complete revolution annually, and every major new model is a special snowflake with its own unique cases.

Writing high-performance software that handles all of them is next to impossible, because it's precisely the tailoring to the unique features of a given model that provides the high performance.




That's not how I think it works. ML is a small number of operations applied to very large blocks of data: tensors. You can build all kinds of complex formulas from that small set of tensor operations, but the (relative) speed is determined by how efficiently those few operations are implemented, not by how complicated the formulas built from them are.
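
To make that concrete, here's a rough NumPy sketch (shapes and names are made up for illustration): a whole attention step built from nothing but matmul, a scale, and a softmax, so its speed comes down to how fast those few primitives are.

  import numpy as np

  def softmax(x, axis=-1):
      # numerically stable softmax: subtract the max, exponentiate, normalize
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  def attention(q, k, v):
      # a "complex formula" made of two matmuls, one scale, one softmax
      scores = (q @ k.T) / np.sqrt(k.shape[-1])
      return softmax(scores) @ v

  q = np.random.randn(8, 64)    # 8 queries, 64-dim
  k = np.random.randn(16, 64)   # 16 keys
  v = np.random.randn(16, 64)   # 16 values
  out = attention(q, k, v)      # shape (8, 64)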


You're half right. First, tensor operations are only a small part of modern ML. Second, how you plug all those small operations together is where most of the performance difference between implementations comes from these days.

Different hardware offers a variety of different small operations that do almost the same thing. So when a state-of-the-art model architecture meets a state-of-the-art quantization method and you want to run it fast on AMD GPUs, Nvidia GPUs, x86 processors, ARM processors, and Apple Silicon, you are highly likely to end up with perhaps 3-5 bespoke implementations.

This happens every few months in ML. Meanwhile, hardware is innovating and balkanizing at the same time. Now we have Google Silicon, Huawei Silicon, and Intel Arc GPUs. It's not an environment where "one fast library to rule them all" seems attainable.
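
To sketch what "a handful of bespoke implementations" tends to look like (all names here are made up, not any real library's API): the same logical op gets one implementation per backend and a dispatcher picks between them.

  import numpy as np

  # hypothetical per-backend kernel registry; in a real engine each entry
  # would be a hand-tuned CUDA/ROCm/Metal/NEON kernel, not NumPy
  MATMUL_KERNELS = {}

  def register(backend):
      def wrap(fn):
          MATMUL_KERNELS[backend] = fn
          return fn
      return wrap

  @register("cuda")
  def matmul_cuda(a, b):
      return a @ b              # stand-in for a Tensor Core kernel

  @register("metal")
  def matmul_metal(a, b):
      return a @ b              # stand-in for a simdgroup-matrix kernel

  @register("cpu")
  def matmul_cpu(a, b):
      return a @ b              # stand-in for an AVX/NEON kernel

  def matmul(a, b, backend="cpu"):
      # fall back to the portable path if no bespoke kernel exists
      return MATMUL_KERNELS.get(backend, MATMUL_KERNELS["cpu"])(a, b)

  y = matmul(np.random.randn(4, 8), np.random.randn(8, 3), backend="cuda")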


Ok, but in the end you're just evaluating a graph, and I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written. So it makes more sense to work on a backend that you can use for any type of model than to hand-optimize everything.
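
Here's a toy version of what I mean by "evaluating a graph" (entirely made up, not any real compiler IR): the model is just nodes and edges, and a backend is a table that says how each op runs on that hardware.

  import numpy as np

  # toy compute graph: each node is (op name, list of input node names)
  GRAPH = {
      "x":   ("input",  []),
      "w":   ("input",  []),
      "mm":  ("matmul", ["x", "w"]),
      "out": ("relu",   ["mm"]),
  }

  # a "backend" is just a mapping from op names to implementations
  CPU_BACKEND = {
      "matmul": lambda a, b: a @ b,
      "relu":   lambda a: np.maximum(a, 0),
  }

  def evaluate(graph, backend, inputs, target):
      cache = dict(inputs)          # leaf values provided by the caller
      def eval_node(name):
          if name not in cache:
              op, args = graph[name]
              cache[name] = backend[op](*(eval_node(a) for a in args))
          return cache[name]
      return eval_node(target)

  y = evaluate(GRAPH, CPU_BACKEND,
               {"x": np.random.randn(2, 4), "w": np.random.randn(4, 3)},
               "out")               # shape (2, 3)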


>I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written.

No, that's exactly the problem. Compilers can't, because the GPU hardware and the algorithms involved are such rapidly moving targets. Bespoke, hardware-specific quantization, inference, attention, and kernel compilation are the only way to squeeze out the performance users are looking for.
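
As a small example of the kind of thing that keeps moving under a compiler's feet, here's plain symmetric int8 weight quantization in NumPy; the schemes people actually race to ship (int4, group-wise, outlier-aware, and so on) differ in exactly the details that demand new hand-written kernels.

  import numpy as np

  def quantize_int8(w):
      # symmetric per-tensor quantization: one scale for the whole tensor
      scale = np.abs(w).max() / 127.0
      q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
      return q, scale

  def dequantize(q, scale):
      return q.astype(np.float32) * scale

  w = np.random.randn(256, 256).astype(np.float32)
  q, scale = quantize_int8(w)
  err = np.abs(dequantize(q, scale) - w).max()   # small reconstruction error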

Creating one fast implementation for all models on all hardware would be like writing one GPU driver for all GPUs and all OSs. It just isn't going to work, and if it somehow does, it isn't going to be fast on all hardware.



