A (16, 1) posit gives you at most 2 extra fraction bits of precision over float16, and (16, 2) gives you 1 extra bit. The numerical distribution a posit assumes is slightly better, but it's still not well matched to the distribution of values you actually see in NNs, so it's not going to get you that much more. You hardly need anything above 10.0, because people like hyperparameters that look like 0.01 or whatever, and with typical floating point you could multiply all the values by 10^-3 or 10^3 (or in float32, even 10^10 or 10^-10) and things would work just as well.
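For a back-of-the-envelope check of those numbers (assuming the standard posit layout: a sign bit, at least two regime bits, es exponent bits, and whatever is left over for the fraction in the best case near 1.0):

```python
# Best-case fraction width of an (nbits, es) posit vs. float16's 10 explicit
# mantissa bits, assuming the standard sign/regime/exponent/fraction layout.
def max_posit_fraction_bits(nbits: int, es: int) -> int:
    return nbits - 1 - 2 - es  # sign bit, minimal 2-bit regime, es exponent bits

FLOAT16_MANTISSA_BITS = 10  # explicit fraction bits in IEEE binary16

for es in (1, 2):
    frac = max_posit_fraction_bits(16, es)
    print(f"(16, {es}) posit: up to {frac} fraction bits, "
          f"{frac - FLOAT16_MANTISSA_BITS} more than float16")
# (16, 1) posit: up to 12 fraction bits, 2 more than float16
# (16, 2) posit: up to 11 fraction bits, 1 more than float16
```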
A posit is also more expensive in hardware, because the adders, multipliers and shifters are bigger (provisioned for the maximum fractional precision).
The trick is still in the scaling. I've done full training in (16, 1) posit (everything in (16, 1), no float32 or shadow floating-point values, like this paper) for some of these convnets. It doesn't work well out of the box without scaling tricks, and with those tricks it ends up roughly the same as float16 in my experience. It simply doesn't add that much precision in such a reduced space (that 1 or 2 extra bits at best).
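For what it's worth, here is a minimal sketch of the kind of scaling trick I mean: plain loss scaling, as used in float16 mixed-precision training. The toy gradient function and the scale factor are made up for illustration.

```python
import numpy as np

# Loss scaling sketch: scale the loss up before the low-precision step so tiny
# gradients survive, then unscale in higher precision afterwards.
LOSS_SCALE = np.float32(2.0 ** 16)

def toy_grad(loss):
    return loss * np.float32(1e-8)  # a gradient far below fp16's subnormal range

loss = np.float32(0.5)
naive  = np.float16(toy_grad(loss))               # flushes to 0.0 in fp16
scaled = np.float16(toy_grad(loss * LOSS_SCALE))  # lands in fp16's normal range
grad   = np.float32(scaled) / LOSS_SCALE          # unscale in higher precision

print(naive)  # 0.0
print(grad)   # ~5e-09
```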
What they benchmarked on in this paper is ancient history too; I'm not sure how relevant those models are to modern practice these days.
The big advantage of Posit16 over Float16 is way less overflow potential. Float16 has one less bit of precision than Posit16 near 1.0, and it also maxes out around 65,000, which can cause a lot of overflows/NaNs. BFloat16 has a really big range (up to 10^38), but only a 7-bit mantissa. Posit16 gives you higher accuracy than BFloat16 over the entire range of Float16, while being almost as resistant to overflow as BFloat16. Yes, you lose a lot of accuracy for the big posits, but in an NN context, often all you care about for the giant values is that they're large, not exactly how large. The hardware is a little more expensive, but it has a lot fewer special cases and you can share a lot more operations with integer hardware (although none of the expensive stuff).
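To put numbers on the overflow point (numpy float16 for the first two lines; bfloat16's max is computed from its layout, since stock numpy doesn't ship a bfloat16 dtype):

```python
import numpy as np

# Where float16 gives up.
print(np.finfo(np.float16).max)  # 65504.0 -- the "maxes out around 65,000" above
print(np.float16(70000.0))       # inf: a value bfloat16 or posit16 keeps finite

# bfloat16 has an 8-bit exponent and 7-bit mantissa, so its largest finite
# value is (2 - 2**-7) * 2**127.
print((2 - 2.0 ** -7) * 2.0 ** 127)  # ~3.39e+38
```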
I had not known about this, thanks for sharing. It certainly seems very interesting and applicable to the problem at hand. Large values are important to keep (rather than turning into NaNs), and in my experience we generally don't need the precision.
Plus, when they're that big, they're sort of a wrecking ball to whatever weights they touch anyway, so you might as well save the precision for the cleanup steps afterwards, where it really counts (at whatever number of bits works best, of course) :D
This paper feels a few years behind, I think; bfloat16 and fp16 are both natively supported in hardware at this point.
We're down to fp8 now with NVIDIA's latest hardware; this conversation is way behind where it is in a few other places. FP8 shouldn't even be a huge issue (at least for mixed precision at first). It's things like the 4-bit datatypes where things really and truly get spicy, IMO.
Partly because the few bits they save on things like NaN encodings give them a noticeable boost at those low precisions.
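Roughly what that buys, using the OCP/NVIDIA FP8 definitions as a reference (a sketch, not taken from the paper): E5M2 keeps IEEE-style inf/NaN handling, while E4M3 drops infinities and keeps only a single NaN mantissa pattern, which nearly doubles its finite range.

```python
# Largest finite value if the top exponent code is reserved for inf/NaN,
# IEEE-style: (2 - 2**-m) * 2**emax with emax = 2**(e-1) - 1.
def ieee_style_max(exp_bits, man_bits):
    emax = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** emax

print(ieee_style_max(5, 2))  # 57344.0 -- E5M2, which keeps IEEE-style inf/NaN
print(ieee_style_max(4, 3))  # 240.0   -- what E4M3 would max out at IEEE-style
print(1.75 * 2.0 ** 8)       # 448.0   -- actual E4M3 max: one NaN code, no inf
```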
But the best way to go is to explicitly fit the distribution of values you are expecting: precisely what Meta did when they introduced their own 16-bit format.
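As a generic illustration of that idea (not Meta's actual format), here is per-tensor scale calibration: pick the scale from the values you observe, so the representable fp16 range is spent where the distribution actually sits.

```python
import numpy as np

# Calibrate a per-tensor scale from the observed values before casting to fp16,
# so tiny tensors don't get stuck in fp16's coarse subnormal range.
def calibrated_fp16(x):
    scale = np.float32(np.abs(x).max() / 1000.0)  # park the observed max near 1000.0
    return (x / scale).astype(np.float16), scale

x = np.random.normal(0.0, 1e-6, size=4096).astype(np.float32)  # tiny values
q, scale = calibrated_fp16(x)

raw_err    = np.abs(x - x.astype(np.float16).astype(np.float32)).max()
scaled_err = np.abs(x - q.astype(np.float32) * scale).max()
print(raw_err, scaled_err)  # the calibrated version keeps far more of the signal
```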