The precision used should match the requirements of the dataset, the training process, and the available compute. There are practical uses for 16-bit FP training.
"Our findings demonstrate that pure 16-bit floating-point neural networks can achieve similar or even better performance than their mixed-precision and 32-bit counterparts." This is a very deceptive statement. Take 100 initialization states and train a FP16 vs a FP32 network, and you'll find FP32 will have an accuracy advantage. It's certainly possible to conclude this if a small sample of networks are trained. This paper goes on to state, "Lowering the precision of real numbers used for the neural network’s weights to fixed-point, as shown in [11], leads to a significant decrease in accuracy.", while later concluding, "we have shown that pure 16-bit networks can perform on par with, if not better than, mixed-precision and 32-bit networks in various image classification tasks." The results certainly do, but that doesn't really give an accurate evaluation of what's really going on here. A FP64 network can fall into a local minima and be outperformed by a PF16 network, but is it correct to say the FP16 network is better. I'm getting a lot of mixed signals.
I feel like, "significant implications" is quite a stretch.
A few concerns: besides Figure 3, the other results do not provide side-by-side test vs. validation accuracy to demonstrate that the networks are not overfit, and the only mention of normalization is the custom batch normalization operation.
This may be more of a rant about the current state of ML, but in a perfect world we wouldn't use GPUs (or would at least enforce deterministic calculations), results would be replicable, we'd train hundreds if not thousands of networks before drawing conclusions, we'd better understand how to visualize network accuracy and overfitting, and all datasets would be free of bias and accurately generalize the problem being modelled. We can dream.
This is generally incorrect. Pure 16-bit training usually matches FP32 with almost no sweat, especially with bfloat16, and any additional noise tends to have a positive regularization effect.
I train directly in _pure_ fp16/bf16 with no issues, and the benefits greatly outweigh the tradeoffs. On smaller networks, I use no gradient clipping whatsoever.
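For reference, "pure" here means something like the sketch below: weights, activations, and gradients all in bf16, with no autocast, no loss scaling, and no clipping. It's a toy illustration with fake data, not lifted from any particular codebase.

```python
# Minimal sketch of "pure" bf16 training: parameters, activations, and
# gradients all live in bfloat16 -- no autocast, no GradScaler, no clipping.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model = model.to(dtype=torch.bfloat16)              # weights stored as bf16
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784, dtype=torch.bfloat16)      # fake batch
y = torch.randint(0, 10, (64,))

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)                     # forward pass entirely in bf16
    loss.backward()                                 # bf16 gradients
    opt.step()                                      # update applied to bf16 weights
```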
FP32 has almost no uses outside of bizarrely intricate simulation-type workloads, in which case FP64 is still generally important anyway.
I appreciate your input on bfloat. I've always been under the impression that precision matters a lot when trying to avoid local minima/maxima if the landscape of the error function is jagged, but I suppose there's a good argument that any floating-point format can be used if the data, learning rate, network structure, etc. are molded to match. Perhaps it's just my perspective, or maybe there really isn't enough discourse on the FP format being an equally or more important factor to consider than just its effect on compute and memory requirements.
The use of FP64 could aid against vanishing gradients and just general information loss in deep networks, but that's probably comparable to using an atomic bomb to power a wind turbine. It certainly works, but is it the best way to go about it?
I personally think the use of mixed precision in deep networks will become more common as time goes on. I'm doubtful that all of a network really benefits from having large amounts of precision.
Well, if I could guide the focus a bit: it's not so much the precision of the floating-point values as the structure of information flow and expressivity in the network. Gradients are going to die basically regardless of precision; you're maybe saving yourself a few steps, but if you're at the point of using precision to stave off dead gradients, that's several orders of magnitude less efficient than a decent architectural solution.
My personal belief, based on experience, is that training in pure FP8 is maybe possible with some hacks, but that the point where we need mixed precision to stabilize things probably kicks in somewhere around 3 to 6/7 bits (a wide range, sorry). I could be wrong though; maybe there is some really cool discrete training method out there that I'm not aware of.
A good way to prevent information loss in neural networks is to minimize all of your subpath lengths. You also want a really short shortest path for information from your first to your final layer. That will do a lot.
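Concretely, skip connections are the usual way to get that: the identity branch gives every block a length-one path, so the shortest route from the first layer to the last stays short no matter how deep the stack gets. Rough PyTorch-flavored sketch:

```python
# The skip connection gives each block a length-1 route, so the shortest
# input->output path stays short even as the number of blocks grows.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)   # identity path keeps the subpath short

deep_net = nn.Sequential(*[ResidualBlock(128) for _ in range(50)])
out = deep_net(torch.randn(8, 128))  # gradients can flow along the skips
```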
Also, as far as things being jagged -- remember that floating point only loses a lot of precision on large numbers, which should be really coarse anyways. Having large, perfectly precise numbers would mean we are likely overfitting. Small and detailed means we can afford high precision. Think of it as a beneficial tradeoff, like only being able to know position and momentum to some exchangeable extent in quantum mechanics. If we impose that tradeoff on our precision, we get some nice benefits in the end.
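You can see that coarseness directly by poking at the format (plain IEEE fp16 here):

```python
# fp16 is fine-grained near zero and coarse at large magnitude: around 1.0
# it resolves ~1e-3 differences, but around 10000 the spacing between
# representable values is 8, so nearby numbers collapse together.
import torch

f16 = torch.float16
print(torch.tensor(1.001, dtype=f16) == torch.tensor(1.0, dtype=f16))        # False
print(torch.tensor(10003.0, dtype=f16) == torch.tensor(10000.0, dtype=f16))  # True
print(torch.finfo(f16).eps)  # 0.0009765625, the relative spacing at 1.0
```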
Hope that helps sort of expound on the subject a bit more, feel free to let me know if you have any questions and much love! <3 :))) :D :)
From what I can tell, the architecture is more important anyway, and having smaller (lower-precision) but more numerous parameters gives the model more chances to figure out the optimal effective architecture on its own.
My best understanding is that the architecture is predetermined, which fixes the number of parameters up front?
I do think, however, that shallower bit depths will require slightly deeper networks over time to compensate. Sorta makes sense when you think about it a bit. :) <3 :DDDD :)
"Our findings demonstrate that pure 16-bit floating-point neural networks can achieve similar or even better performance than their mixed-precision and 32-bit counterparts." This is a very deceptive statement. Take 100 initialization states and train a FP16 vs a FP32 network, and you'll find FP32 will have an accuracy advantage. It's certainly possible to conclude this if a small sample of networks are trained. This paper goes on to state, "Lowering the precision of real numbers used for the neural network’s weights to fixed-point, as shown in [11], leads to a significant decrease in accuracy.", while later concluding, "we have shown that pure 16-bit networks can perform on par with, if not better than, mixed-precision and 32-bit networks in various image classification tasks." The results certainly do, but that doesn't really give an accurate evaluation of what's really going on here. A FP64 network can fall into a local minima and be outperformed by a PF16 network, but is it correct to say the FP16 network is better. I'm getting a lot of mixed signals.
I feel like, "significant implications" is quite a stretch.
A few concerns: Besides figure 3, other results do not provide side-by-side test vs validation accuracy to attempt demonstrate the network is not overfit, and the only mention of normalization was the custom batch normalization operation.
This may more be a rant about the current state of ML, but in a perfect world, we wouldn't use GPUs/would enforce deterministic calculations, results would be replicable, we'd train hundreds if not thousands of networks to draw conclusions from, we'd better understand how to visualize network accuracies and overfitting, and all datasets would be free of bias and accurately generalize the problem attempting to be modelled. We can dream.