
eli5 quant?


Quant is short for "quantization" here.

LLMs are parameterized by a ton of weights: when we say something like 400B, we mean the model has 400 billion parameters. In modern LLMs those parameters are basically always 16-bit floating point numbers.

It turns out you can get nearly as good results by reducing the precision of those numbers, for instance by using 4 bits per parameter instead of 16, meaning each parameter can only take on one of 16 possible values instead of one of 65,536.
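Here's a minimal sketch of what that looks like in practice - round-to-nearest symmetric quantization to 4 bits and back, in NumPy. Just an illustration; real schemes like GPTQ or llama.cpp's K-quants use per-group scales and cleverer rounding:

    import numpy as np

    # Toy "weight tensor" standing in for one layer of an LLM.
    weights = np.random.randn(8).astype(np.float16)

    # Symmetric 4-bit quantization: map floats onto the 16 integer
    # levels -8..7 using a per-tensor scale from the max magnitude.
    bits = 4
    levels = 2 ** (bits - 1)                 # 8, so codes run -8..7
    scale = np.abs(weights).max() / (levels - 1)

    q = np.clip(np.round(weights / scale), -levels, levels - 1).astype(np.int8)
    dq = q.astype(np.float16) * scale        # what the model computes with

    print(weights)  # original 16-bit values
    print(q)        # 4-bit integer codes (stored in an int8 here)
    print(dq)       # reconstructed values: close, but not identical

The per-weight reconstruction error is small, and across billions of weights the model mostly shrugs it off - or doesn't, as the rest of the thread argues.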


Most claims of "nearly as good results" are massively overblown.

Even the so called "good" quants of huge models are extremely crippled.

Nothing is ever free, and even going from 16 to 8-bit will massively reduce the quality of your model, no matter what their hacked benchmarks claim.

No, it doesn't help because of "free regularization" either. Dropout and batch norm were also placebo BS that didn't actually help, back in the day when they were still being used.


Interestingly enough, Llama3 suffers more performance loss than Llama2 did at identical quantizations. https://arxiv.org/abs/2404.14047

There's some speculation that a net trained for more epochs on more data learns to pack more information into the weights, and so does worse when weight data is degraded.


Quantization is reducing the number of bits used to store each parameter of a machine learning model.

Put simply, a parameter is a number that determines how likely it is that something will occur, e.g. if the number is < 0.5, say "goodbye"; otherwise say "hello".

Now, if the parameter is a 32-bit (unsigned) integer it can have a value of 0-4,294,967,295, i.e. 4,294,967,296 distinct values.

If you were using this 32-bit value to represent physical objects, you could represent 4,294,967,296 objects (each object gets its own number).

However, a lot of the time in machine learning you find after training that a parameter doesn't need to represent nearly that many distinct "things". Say this parameter represented types of fruit (Google says there are over 2,000 types of fruit, but let's say exactly 2,000). Then 4,294,967,296 / 2,000 means roughly 2.1 million distinct values could be assigned to each fruit, which is a huge waste! The ideal case would be a number that represents just 0-2,000 in the smallest possible way.

This is where quantization comes in: the size of the number used to store each parameter is reduced, saving memory at the expense of a small hit to model accuracy. It's known that many models don't take a large accuracy hit from this, meaning the way the parameter is used inside the model doesn't really need, or take advantage of, being able to represent so many values.

So what we do is, say, reduce that 32-bit number to 16, 8, or 4 bits. We go from being able to represent billions of distinct values/states down to maybe 16 (with 4-bit quantization), then benchmark the model against the larger 32-bit version - often finding that whatever training decided to use that parameter for doesn't really need an incredibly granular value.
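To put numbers on that, a quick back-of-the-envelope sketch - the 400B parameter count is just the example from upthread, not any specific model:

    # Distinct representable states and raw weight storage per bit width,
    # for a hypothetical 400-billion-parameter model (example from upthread).
    params = 400e9

    for bits in (32, 16, 8, 4):
        states = 2 ** bits                 # representable values per parameter
        gb = params * bits / 8 / 1e9       # bits -> bytes -> GB of weights
        print(f"{bits:2d}-bit: {states:>13,} states, ~{gb:,.0f} GB")

That's 1,600 GB of weights at 32-bit down to 200 GB at 4-bit, which is why quantization is often the difference between a model fitting on your hardware or not.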




