Bert has an 15% masking rate, seems co-related, also 90% is what works well when...

viig99 on Nov 11, 2019 | parent | context | favorite | on: Failing 15% of the time is the best way to learn, ...

Bert has an 15% masking rate, seems co-related, also 90% is what works well when you are trying to do label smoothing using entropy minimisation, what's going on!