
Basically every activation function throws away half of the dynamic range at every neuron: ReLU, for instance, maps the entire negative half of a roughly zero-mean pre-activation to zero (and saturating functions like tanh compress the tails). Compounded across a deep network, that's a lot of discarded signal.
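
A quick sanity check of the ReLU case, as a minimal NumPy sketch. It assumes pre-activations are roughly zero-mean Gaussian (as they tend to be after normalization or careful init), which is where the "half" figure comes from:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)  # stand-in for zero-mean pre-activations

    relu = np.maximum(x, 0.0)
    # About half of all values land exactly at zero: the entire
    # negative half of the input range has been collapsed away.
    print(f"fraction zeroed: {(relu == 0.0).mean():.3f}")  # ~0.500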

You make a good point about LayerNorm; it's probably even worse, since normalization discards the mean and the overall scale of every activation vector outright.
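
To illustrate the LayerNorm point, here's a minimal sketch (plain LayerNorm over the last axis, no learned gain/bias; the `layer_norm` helper is my own for demonstration). Shifting or rescaling the input by any amount yields the same output, so that information is simply gone:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalize to zero mean and unit variance over the last axis.
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    x = np.array([1.0, -2.0, 3.0, 0.5])
    for scale in (1.0, 10.0, 1000.0):
        # Identical output for every scale and shift: the vector's
        # magnitude and offset are normalized away entirely.
        print(scale, layer_norm(scale * x + 7.0))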



