Isn't the new trend to train in lower precision anyway?

neilmovva · 2025-08-23T21:19:42 1755983982

Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes, this works, especially if we're talking about MXFP8 as supported on Blackwell [0].
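To make the accumulation point concrete, here is a minimal sketch. NumPy has no FP8 type, so this uses float16 as a stand-in for the low-precision format (an assumption for illustration; real FP8 has even fewer mantissa bits, so the effect shows up sooner). The helper name `dot_with_accumulator` is made up for this example. It computes the same dot product twice, changing only the precision of the running sum.

```python
import numpy as np

def dot_with_accumulator(a, b, acc_dtype):
    """Dot product where products and the running sum are kept in acc_dtype."""
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        product = acc_dtype(x) * acc_dtype(y)
        # Round the running sum back to the accumulator precision each step.
        acc = acc_dtype(acc + product)
    return acc

n = 4096
# Inputs are stored in low precision in both cases (float16 standing in for FP8).
a = np.ones(n, dtype=np.float16)
b = np.ones(n, dtype=np.float16)

low = dot_with_accumulator(a, b, np.float16)   # low-precision accumulator
high = dot_with_accumulator(a, b, np.float32)  # FP32 accumulator

# float16 cannot represent 2049, so the low-precision sum stalls at 2048,
# while the FP32 accumulator reaches the exact answer of 4096.
print(float(low), float(high))
```

The inputs are identical in both runs; only the accumulator differs. Once the running sum is large relative to each new term, a narrow accumulator rounds the additions away, which is why the products get summed in FP32 even when the operands are 8-bit.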

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on the 5090.

[0]: https://arxiv.org/abs/2506.08027
[1]: https://arxiv.org/abs/2502.20586

laidoffamazon · 2025-08-23T22:11:51 1755987111

Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was performing low-precision pretraining rather than just finetuning further. I didn't realize you still need accumulation at a higher precision anyway.

storus · 2025-08-23T20:52:28 1755982348

Only GPU-poors run Q-GaLore and similar tricks.

Twirrim · 2025-08-24T00:00:09 1755993609

Even the large cloud AI services are focusing on this, because it drives down the average "cost per query", or whatever you want to call it. For inference, arguably even more than for training, the smaller and more efficient they can get it, the better for their bottom line.

storus · 2025-08-24T09:17:39 1756027059

For inference of course; the OP I replied to mentioned training though.