Great article... However, the proliferation of quantization (8-bit, 4-bit, 3-bit, 2-bit, etc.) so normies like myself can run transformer-based models on consumer-grade hardware has changed this math significantly. It has also changed the landscape for text generation at such a pace that it's nearly impossible to keep up.
I don't look at any model the same after running head-to-head comparisons of full precision versus 4-bit quantization on my machine. There is little to no perceptible difference between models built from the same base weights. BUT!!! I am now able to run models on my home computer that required a DGX a few weeks ago, thanks to quantization. These models are better in every way from my POV. I am now more interested in what I can "do" with the models vs. just getting them to run. 30B at 4 bits is the sweet spot for my setup: roughly 30B params × 4 bits ≈ 15 GB of weights (before overhead), versus ~60 GB at 16-bit.
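For anyone who wants to try this, here's a minimal sketch of 4-bit loading via Hugging Face transformers + bitsandbytes, which is one common route (llama.cpp/GGUF-style quantization is another; I'm not claiming it's what any particular setup uses). The model id is just a placeholder:

```python
# Minimal sketch: load a causal LM with 4-bit quantized weights.
# Assumes transformers, accelerate, and bitsandbytes are installed
# and a CUDA GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder repo id, swap in your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type (from the QLoRA paper)
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit precision
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across GPU/CPU as VRAM allows
)

prompt = "Quantization lets consumer GPUs"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The key design point is that only the stored weights are 4-bit; activations and compute stay in 16-bit, which is a big part of why the quality drop is hard to perceive.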