Well, those smaller floats require less bandwidth to transfer back and forth as well. Perhaps not a reduction linear in the size of the float, since smaller floats may require more iterations and/or more nodes in the model graph to get an equivalent result.
But rest assured there's an improvement, it's not like people would be doing it if there wasn't any benefit!
The impact on bandwidth is the main reason smaller is better, I believe, certainly when it's the bottleneck. I'm only really familiar with CPUs, but with, say, FP16 you might convert back to FP32 when you do the actual multiplication (so conversion plus multiplication is actually slower), but because you're moving half the data in and out you still get a huge speedup.
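To make that concrete, here's a minimal sketch of the pattern: the arrays live in memory as 16-bit floats and each element is widened to FP32 just before the multiply, so the arithmetic is full precision while the memory traffic is halved. It assumes a compiler/target with the `_Float16` type (e.g. recent GCC or Clang on x86-64); the `dot_f16` helper and the sizes are just for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Dot product over half-precision storage: each element is widened to
 * float32 right before the multiply, so compute is FP32 but the data
 * read from memory is half the size of a float[] of the same length. */
static float dot_f16(const _Float16 *a, const _Float16 *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += (float)a[i] * (float)b[i];   /* convert, then multiply in FP32 */
    return acc;
}

int main(void) {
    enum { N = 4 };
    _Float16 a[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    _Float16 b[N] = {0.5f, 0.5f, 0.5f, 0.5f};

    printf("dot = %f\n", dot_f16(a, b, N));          /* expect 5.0 */
    printf("bytes per element: %zu vs %zu\n",
           sizeof(_Float16), sizeof(float));         /* 2 vs 4 */
    return 0;
}
```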
I can't remember which research paper it was, but it showed that even if you do float32 multiplications, if you keep the data in bfloat16 (by simply truncating the lower mantissa bits) and pack it, you still get speedups, since matrix multiplication is bound by both compute and cache access. If you can optimize on the cache side of things, the speedups are definitely there.
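For reference, here's a rough sketch of that truncate-to-bfloat16 trick: a bfloat16 is just the top 16 bits of an IEEE-754 float32, so you halve the storage (and cache footprint) by dropping the low mantissa bits, and widen back to float32 only for the multiply itself. The helper names (`bf16_from_f32`, `f32_from_bf16`) are made up for illustration; real libraries often round rather than truncate.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint16_t bf16;   /* bfloat16 stored as raw bits */

/* Drop the low 16 mantissa bits of a float32 (truncation, no rounding). */
static bf16 bf16_from_f32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16)(bits >> 16);
}

/* Widen back to float32 by padding the low 16 bits with zeros. */
static float f32_from_bf16(bf16 h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    /* Data sits in memory as 16-bit values (half the cache footprint)... */
    bf16 a = bf16_from_f32(3.14159f);
    bf16 b = bf16_from_f32(2.71828f);

    /* ...but the multiplication itself is an ordinary float32 multiply. */
    float prod = f32_from_bf16(a) * f32_from_bf16(b);
    printf("bf16 product ~ %f (exact f32: %f)\n", prod, 3.14159f * 2.71828f);
    return 0;
}
```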