Well, those smaller floats require less bandwidth to transfer back and forth as well. Perhaps not a reduction linear in the size of the float, since smaller floats may require more iterations and/or more nodes in the model graph to get an equivalent result.
But rest assured there's an improvement, it's not like people would be doing it if there wasn't any benefit!
The impact on bandwidth is the main reason smaller is better, I believe, certainly when it's the bottleneck. I'm only really familiar with CPUs, but with, say, FP16 you might convert back to FP32 when you do the actual multiplication (so conversion plus multiplication is actually slower), but because you're moving half the data in and out you still get a huge speedup.
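To make that concrete, here's a minimal sketch of the pattern: the arrays live in memory as 16-bit floats and each element is widened to FP32 just before the multiply, so the arithmetic is full precision while the memory traffic is halved. It assumes a compiler/target with the `_Float16` type (e.g. recent GCC or Clang on x86-64); the `dot_f16` helper and the sizes are just for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Dot product over half-precision storage: each element is widened to
 * float32 right before the multiply, so compute is FP32 but the data
 * read from memory is half the size of a float[] of the same length. */
static float dot_f16(const _Float16 *a, const _Float16 *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += (float)a[i] * (float)b[i];   /* convert, then multiply in FP32 */
    return acc;
}

int main(void) {
    enum { N = 4 };
    _Float16 a[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    _Float16 b[N] = {0.5f, 0.5f, 0.5f, 0.5f};

    printf("dot = %f\n", dot_f16(a, b, N));          /* expect 5.0 */
    printf("bytes per element: %zu vs %zu\n",
           sizeof(_Float16), sizeof(float));         /* 2 vs 4 */
    return 0;
}
```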
I can't remember which research paper it was, but it showed that even if you do float32 multiplications, if you keep the data in bfloat16 (by simply truncating the lower mantissa bits) and pack it, you still get speedups, since matrix multiplication is bound by both compute and cache access. If you can optimize on the cache side of things, the speedups are definitely there.
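For reference, here's a rough sketch of that truncate-to-bfloat16 trick: a bfloat16 is just the top 16 bits of an IEEE-754 float32, so you halve the storage (and cache footprint) by dropping the low mantissa bits, and widen back to float32 only for the multiply itself. The helper names (`bf16_from_f32`, `f32_from_bf16`) are made up for illustration; real libraries often round rather than truncate.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint16_t bf16;   /* bfloat16 stored as raw bits */

/* Drop the low 16 mantissa bits of a float32 (truncation, no rounding). */
static bf16 bf16_from_f32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16)(bits >> 16);
}

/* Widen back to float32 by padding the low 16 bits with zeros. */
static float f32_from_bf16(bf16 h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    /* Data sits in memory as 16-bit values (half the cache footprint)... */
    bf16 a = bf16_from_f32(3.14159f);
    bf16 b = bf16_from_f32(2.71828f);

    /* ...but the multiplication itself is an ordinary float32 multiply. */
    float prod = f32_from_bf16(a) * f32_from_bf16(b);
    printf("bf16 product ~ %f (exact f32: %f)\n", prod, 3.14159f * 2.71828f);
    return 0;
}
```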