Oh shit, I took a closer look and you’re right. The repo was also helpfully updated with a note to this effect: “The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation”.
No Joi in my pocket just yet :(
Because of this, I re-checked my claims about the Whisper speed-up from the Neural Engine, and that does look legit: at least 6x. So the Neural Engine does have the chops for this workload; it just isn’t being used in this repo. It may not be LLaMA, but I sure hope someone gets an LLM running on the ANE sooner rather than later.
Our investigations indicate that it might not be possible to achieve an ANE performance improvement over the CPU for LLM decoder inference with a batch size of 1 [0]. Just to be clear: I'm no expert in Core ML / ANE, so these conclusions could be totally wrong.
The Neural Engine across the M1 and M2 series is also, sadly, very limited.
I bought one thinking I could exploit it for Stable Diffusion and other tasks, but found that most libraries say to use the GPU for faster generation. Not only is the Neural Engine the same on the M2 Pro (meaning I upgraded from my base-model M1 for no reason), it also doesn't scale at all, except on the M1 Ultra, where it's doubled simply because two dies are bridged together.
The Neural Engine can generate 512x512 images pretty easily, but it takes a while even compared to using the GPU on a base-model M1 Mac mini. It's kinda crazy. I'm looking into ways to improve it and take advantage of the Neural Engine in the future, but the current situation is very limited. Even Apple's official implementation and the Core ML libraries seem to prefer that you run them on Metal.
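For what it's worth, the knob that decides ANE vs. GPU is the compute-unit preference you pass when loading a Core ML model. Here's a minimal Python sketch with coremltools (the Unet.mlpackage path is just a placeholder, not Apple's actual pipeline code):

    # Sketch: steering a Core ML model toward the Neural Engine or the GPU.
    # Assumes coremltools >= 6 and a compiled .mlpackage on disk (placeholder path).
    import coremltools as ct

    # Prefer CPU + Neural Engine...
    model_ane = ct.models.MLModel(
        "Unet.mlpackage",
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )

    # ...or CPU + GPU (Metal), which is what most pipelines seem to default to today.
    model_gpu = ct.models.MLModel(
        "Unet.mlpackage",
        compute_units=ct.ComputeUnit.CPU_AND_GPU,
    )

Core ML treats this as a preference rather than a guarantee: layers the ANE can't handle fall back to other units, which is presumably part of why ANE timings can disappoint.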
The optimization in this case only seems to refer to the 4-bit model loading method (to be friendlier to the arm64 CPU).
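For context, "4-bit" here means block-wise weight quantization: weights are stored as small integers plus a per-block scale and dequantized on the fly, which shrinks memory traffic for the CPU. A rough numpy sketch of the general idea (not the repo's exact Q4 layout, and without the nibble-packing a real implementation would do):

    # Rough sketch of block-wise 4-bit weight quantization (illustrative only).
    import numpy as np

    BLOCK = 32  # values per block: one float scale + 32 signed 4-bit integers

    def quantize_q4(w: np.ndarray):
        """Quantize a 1-D float array whose length is a multiple of BLOCK."""
        blocks = w.reshape(-1, BLOCK)
        # Pick the scale so the largest magnitude in each block maps to +/-7.
        scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
        scale[scale == 0] = 1.0
        q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
        return q, scale.astype(np.float32)

    def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        """Reconstruct approximate float weights from the ints and scales."""
        return (q.astype(np.float32) * scale).reshape(-1)

    # A 4096-weight row goes from 16 KB of fp32 to roughly 2.5 KB of nibbles + scales.
    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_q4(w)
    print("max abs error:", np.abs(w - dequantize_q4(q, s)).max())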
GeoHot has tinygrad running LLaMA on Metal (but only the 7B model); that's the closest I've seen to taking advantage of Apple Silicon.
A Neural Engine implementation would be awesome