The term is perfectly fine to use here because choosing a quantization strategy to deploy already has enough variables:
- quality for your specific application
- time to first token
- inter-token latency
- memory usage (varies even for a given bits per weight)
- generation of hardware required to run
Of those, the hardest to measure is consistently "quality for your specific application".
It's so hard to measure robustly that many will take significantly worse performance on the other fronts just to not have to try to measure it... which is how you end up with full precision deployments of a 405b parameter model: https://openrouter.ai/meta-llama/llama-3.1-405b-instruct/pro...
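To see why that's so expensive, here's a back-of-the-envelope sketch of the weight memory alone for a 405B-parameter model at a few bit widths (assumptions: weights only, ignoring KV cache, activations, and quantization overhead like scales and zero-points):

```python
# Rough weight-memory footprint of a 405B-parameter model.
# Weights only; KV cache, activations, and per-format overhead
# (scales, zero-points) are deliberately ignored.
PARAMS = 405e9

def weight_gb(bits_per_weight: float) -> float:
    # params * bits -> bytes -> GB
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bpw in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_gb(bpw):,.1f} GB")
```

At fp16 that's roughly 810 GB of weights before you serve a single token, versus around 200 GB at 4 bits, which is why "just run it at full precision" translates directly into multiples more hardware.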
When people are paying multiples more for compute to side-step a problem, language and technology that allow you to erase it from the equation are valid.