My guess is that very small and very large values in the weights are already trained away by the regularisation of the cost function, so insignificant changes to a network's weights don't tend to produce significant changes in its output.

You gain more by being able to run gradient descent faster than by having higher-precision floats.
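For example, here's a rough numpy sketch of my own (the layer sizes, weight scale, and random seed are arbitrary): round a small network's weights down to float16 and the output barely moves.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # One hidden layer with modest weights, as weight decay tends to produce.
    W1 = rng.normal(scale=0.5, size=(64, 32))
    W2 = rng.normal(scale=0.5, size=(32, 1))
    x = rng.normal(size=(1, 64))

    def forward(W1, W2, x):
        return sigmoid(sigmoid(x @ W1) @ W2)

    full = forward(W1, W2, x)
    # Simulate lower precision by round-tripping the weights through float16.
    half = forward(W1.astype(np.float16).astype(np.float64),
                   W2.astype(np.float16).astype(np.float64), x)

    # The difference is tiny relative to the output scale, despite
    # throwing away most of the mantissa bits.
    print(abs(full - half).max())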



Absolutely. And beyond weight regularization, for any weighted sum followed by a sigmoid or other squashing function, large weights simply tend to saturate the squashing function, so there is very little gradient (quickly effectively zero) to be gained by increasing the weight value past that point.
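To put numbers on that, a quick sketch of my own with the plain logistic sigmoid, whose derivative is sigma(z) * (1 - sigma(z)):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # As the pre-activation z grows, the sigmoid flattens out and its
    # gradient collapses toward zero, so larger weights buy almost nothing.
    for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
        s = sigmoid(z)
        print(f"z = {z:5.1f}  sigmoid = {s:.8f}  gradient = {s * (1 - s):.2e}")

    # At z = 10 the gradient is already ~4.5e-05; at z = 20 it is ~2.1e-09,
    # i.e. effectively zero.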



