> There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.
I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16 and 8 bit:
It's true if you're doing training. But for inference, severe quantization is mostly okay. And even when running inference with a quantized model, there are internal parts of a transformer where you want to upcast the low-bit inputs and do the calculation in 16 bits, like the dot-product similarity between vectors.
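Roughly what that looks like, as a minimal numpy sketch (the int8 values and scales here are made up for illustration; real kernels do this fused on the GPU, typically accumulating in fp16 or fp32):

```python
import numpy as np

# Hypothetical int8-quantized vectors plus their (made-up) dequantization scales.
a_q = np.array([12, -7, 98, 3], dtype=np.int8)
b_q = np.array([-45, 20, 17, 66], dtype=np.int8)
scale_a, scale_b = 0.02, 0.015

# Accumulating in int8 would overflow almost immediately, so the dot product
# is carried out in a wider type (float32 here) and the scales applied once at the end.
dot = np.dot(a_q.astype(np.float32), b_q.astype(np.float32)) * scale_a * scale_b
print(dot)
```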
Cool stuff. I looked at https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29 and noticed that the fp8 support is only for the tensor cores and not the CUDA side. Does that mean training with an H100 GPU in fp8 mode would use some software ecosystem that's not the vast existing CUDA one? Or am I just misunderstanding CUDA cores vs tensor cores?
PS, as a joke, they should implement GPU fluint8 and get baked in non-linearity for the activation function without even using a non-linear function, https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt half decent: The hidden power of imprecise lines" by suckerpinch)
The problem with 8bit at the moment is massive performance degradation with bitsandbytes. Recent improvements in 4bit inference mean that 8bit is now a massive laggard (although there's no reason to expect this won't be resolved).
AFAIK for over-parameterized models, performing quantization or any other form of compression won't reduce accuracy by much (don't quote me on this though).
Nonetheless people do tend to use 16 bit huggingface models, and if you do go to 8 bits and it's wrong, you're never quite sure if it's the quant or the model.
The article is right: 8-bit (and especially 4-bit) is atypical for deep learning models. It depends heavily on the number of parameters (larger models can handle more quantization) and can even depend on specific training hyperparameters (mainly dropout & weight decay, which can induce sparsity).
Thing is, even when the impact from 4-bit is substantial, the larger parameter count it allows on the same hardware more than makes up for it. E.g. llama-30b is better at 4-bit than any derivative of llama-13b, no matter how fine-tuned or quantized.
> Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.
No, 4bit quantization is the typical case.
At 4bit you can fit twice the parameters of 8bit in the same space for far better performance/perplexity/quality.
Running LLMs higher than 4bit is atypical and almost always sub-optimal (compared to running a model half the size in 8bit).
Even pretraining and finetuning in 4bit is likely to become the norm soon as fp4 becomes more well understood.
You're just wrong. You're looking at the wrong numbers. The perplexity score of a model with twice the parameters in half the bits (4bit) is FAR LOWER (ie better).
If you are limited to X RAM and have two 16bit models of size 4X and 2X then the 4X model in 4bit will always be far superior to the 2X model in 8bit, with far lower perplexity.
Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit perplexity of 5.9069. That is over 0.54 lower perplexity for the same RAM amount by using 4bit! That is MASSIVE!
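The back-of-the-envelope weight memory works out like this (ignoring KV cache and other overhead; the perplexity figures are the ones above):

```python
def weight_gb(n_params_billion, bits):
    """Approximate weight storage in GB: parameters * bits / 8, ignoring overhead."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

# Roughly the same ballpark of RAM, very different perplexity.
print(f"13B @ 4-bit: {weight_gb(13, 4):.1f} GB, perplexity 5.3607")
print(f" 7B @ 8-bit: {weight_gb(7, 8):.1f} GB, perplexity 5.9069")
```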
Another factor is that larger models degrade less when quantized.
You have to wonder if running a huge model, say, 300B parameters at 2-bit quantization might be "optimal" in that it would fit into a single A100 or H100 GPU and likely outperform an 80B parameter 8-bit model...
Not sure here. The LLaMA models, yes: all weights fit in the small range of -2.0 to 2.0.
But some other models have crazier numbers, with even crazier outliers among them; you might have a weight of 12.00 sitting in a long array of typical small numbers around 0.00.
I've read a story about an attempt to quantize the RWKV model to 4/5 bits that fell short due to the presence of outlier weights.
The author said somewhere that bigger models had worse perplexity because of this.
There's also research showing that the perplexity hit from quantization is smaller at higher parameter counts. E.g. a 65b parameter model sees barely any impact at all when going from 16bit to 4bit.
Well, if you have a fixed RAM size, you're better off with the largest model you can fit at 4 bits (13B 4b is way better than 7B 16b despite being half the size).
No it isn't, quantization is not free. You lose a significant amount of performance that you are not measuring properly in automated benchmarks when you quantize to that level.
You can see it in real time when you take most LLMs and compare them at different quantization levels. I can see the degradation quite badly even in the largest llama, even at 8 bits.
Quantization is not free, but VRAM is even less free.
If you have X amount of VRAM and can fit either a 16bit model of size 2X quantized to 8bit, or a model of size 4X quantized to 4bit, then the 4X model in 4bit is ALWAYS superior, with lower perplexity and better performance.
You LOSE performance by using a smaller model in 8bit vs a larger model in 4bit.
Can somebody please explain how quantization below 8 bit works? Since a byte is the smallest addressable unit I think, is the dimensionality of the weights somehow reduced?
[Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.
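A rough sketch of that idea for one weight tensor (a simple per-tensor symmetric scheme; real schemes like those in bitsandbytes or llama.cpp use per-block scales and other tricks):

```python
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)            # original fp32 weights

# Map the float range onto 4-bit signed integers [-8, 7] with a single scale factor.
scale = np.abs(w).max() / 7
w_q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # values fit in 4 bits, stored in int8 here

# At inference time the integers are rescaled back (or the scale is folded into
# the surrounding matmul) -- the "fancy stuff" mentioned above.
w_approx = w_q.astype(np.float32) * scale
print(np.abs(w - w_approx).max())                        # quantization error
```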
Software can address units of any size by packing and unpacking bits from bytes (or more likely words) in the underlying implementation. I don't know about any specific NN implementation here, just commenting in general that the size of the addressable unit and the size of your reads and writes can be completely independent. I routinely use bit-packing data compression techniques in CUDA, for example.
Generally, since the memory is byte addressable, you load data which is packed into bytes. It is the compute instructions that use the specified bits needed.
So in this case one would load a byte which holds two 4-bit values, and then you would have a 4-bit ADD or MAC which operates on them.
If you don't have those instructions, then you need to sign/zero-extend or convert the smaller bit widths to 8/16/32 bits, whichever is available.
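For example, packing and unpacking two 4-bit values per byte might look like this (a generic illustration, not any specific kernel's layout):

```python
import numpy as np

vals = np.array([3, 12, 7, 0, 15, 9], dtype=np.uint8)   # 4-bit values (0..15)

packed = (vals[0::2] << 4) | vals[1::2]                  # two values per stored byte
high, low = packed >> 4, packed & 0x0F                   # unpack before computing
unpacked = np.empty_like(vals)
unpacked[0::2], unpacked[1::2] = high, low
assert np.array_equal(unpacked, vals)
```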
> llama.cpp which runs a 13 billion parameter model on a 6GB GPU
I think that's a typo there too: the 13B model needs like 10G of memory at 4 bits; it's the 7B one that fits into 6G. Well, unless you do the split thing with some layers on the CPU, I guess.
Does training cost scale linearly with model size and token count? If so, that suggests a lower bound of $600k to train the 13-billion-parameter model. (Still roughly the same order of magnitude.)
[Author] Mosaic must be getting some kind of sweetheart deals on A100 80GB and A100 40GB. The prices they are quoting are not what, say, the AWS on-demand prices are. They quote $2 per GPU-hour for the A100 40GB and $2.50 for the A100 80GB. That's literally half the AWS on-demand rate for A100s here: https://aws.amazon.com/ec2/instance-types/p4/
And these are impossible to get. We tried to get some for Anyscale and were told there was no on-demand availability, and the lead time for reserved instances (ouch on the price! You're talking a quarter of a million dollars a year for one machine at list) was measured in weeks.
Once you take the model size and hefty sweetheart deals into account, you're within 10%. Mosaic does have some nice whitebox optimizations, but nothing that radically changes the equation.
A100-40GB is like $1.10 on LambdaLabs, on demand. Their availability is horrific on singles, but I've seen 8x instances pop up more often than not. And you can rent A100s for a buck a pop interruptible from other clouds, plenty of availability. $2 doesn't seem like much of a sweetheart deal.
I have a suggested modification. You are mixing references in your document.
Re: '~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens
The LLaMa paper mentions it took them 21 days to train LLaMa using 2048 GPUs A100 80GB GPUs.'
The LLaMA-13B model took 2.75 days of 2048xA100 (135,168 GPU-hours) with 1 trillion tokens. The 21 days for 1.4 trillion was for LLaMA-65B.
I would suggest using the LLaMA-13B numbers since those are the most relevant for this section, or at least modifying "21 days to train LLaMa" to "21 days to train LLaMA-65B" for clarity.
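For what it's worth, plugging the LLaMA-13B GPU-hours into the per-GPU-hour prices discussed elsewhere in the thread gives a rough sanity check (hypothetical rates; ignores interconnect, storage, and failed runs):

```python
gpu_hours_13b = 135_168   # 2048 A100s x ~2.75 days for LLaMA-13B on 1T tokens (from above)

# Assumed rates: the ~$2.50/GPU-hour Mosaic-style quote, and roughly double that
# for AWS on-demand, per the pricing discussion above. The gist's ~$1M figure
# assumes 1.4T tokens, so scale these up accordingly.
for label, dollars_per_gpu_hour in [("~$2.50/GPU-hr", 2.50), ("~$5.00/GPU-hr", 5.00)]:
    print(f"{label}: ~${gpu_hours_13b * dollars_per_gpu_hour / 1e6:.2f}M for 1T tokens")
```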
There are some unique assumptions being made in parts of the gist:
> 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding
> 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries
I don't know how useful these numbers are if you take away the assumption that self-hosted will work as well as the API.
> 10x: Throughput improvement from batching LLM requests
I see that the write-up mentions memory being a caveat to this, but it also depends on the card specs. The memory bandwidth and TFLOPs offered by, say, a 4090 are superior to a 3090's while it has the same amount of VRAM. The token-length caveat mentioned in the gist itself also makes the 10x claim not a useful rule of thumb.
> This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in a neural information retrieval system costs about 5x less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!
Only in the narrow use case of a strict look-up. This seems to exaggerate the cost difference while the two have completely different trade-offs.
I think it would be helpful to add fine-tuning costs for an open source model (think LLaMA to Alpaca).
From the phrasing around fine-tuning right now, it seems like it's using OpenAI's fine-tuning API to determine that cost, but it's not very clear.
Also this would be helpful for other foundation models if that doesn't already exist - how much VRAM to run Stable Diffusion v2.1 at different resolutions, running Whisper or Bark for audio, etc.
They mention that they could finetune a 6B model for $7. Obviously the number depends on the amount of data and the model size but it's probably not going to be a significant expense in practice.
That is how I understood it: a token is, on average, 3/4 of a word ("token to word"). So if you buy 1000 tokens, you effectively get 750 words.
I'm surprised not to see anything about data-to-parameter ratios for optimal scaling. My superficial understanding per the Chinchilla paper is to target 20 to 1.
I'm also confused about this:
> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens
This is apparently related to the LLaMa paper, but that paper seems to cite 1.0T tokens (rather than 1.4T tokens) for the 13B model. Also, if 20 to 1 is in fact optimal for the data-to-parameter ratio, then using a 100 to 1 ratio doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration. Or, perhaps, my superficial understanding here leads me to miss some important distinctions.
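For concreteness, the ratios work out like this (simple arithmetic on the numbers quoted above):

```python
params = 13e9
for tokens in (1.0e12, 1.4e12):                       # paper's 1.0T vs the gist's 1.4T
    print(f"{tokens/1e12:.1f}T tokens -> {tokens/params:.0f} tokens per parameter")
print(f"Chinchilla-optimal at ~20:1 would be ~{20 * params / 1e12:.2f}T tokens")
```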
The Chinchilla paper only addresses the contrived use case of a model that is trained once and never used for inference. Since most of the real world compute cost will be in inference, Chinchilla seems to offer little practical guidance.
[Author] Good luck trying to use clusters of Lambda machines. Lambda labs are cheap for a reason: their API is not very featureful (we looked at them and we saw they didn't even support machine tagging). If you're looking for a box or two, lambda labs is fine. If you're looking for 1,000, not so much.
Plus they don't actually have any A100s available at the moment (2023-05-17).
CoreWeave is a nice middle ground. You can at least get the A100 machines into a k8s cluster.
Okay, well, that is just your experience. If you are brand new in this industry, which is undergoing absolutely massive shortages of GPUs right now, you probably will not be able to easily source GPUs. It might not seem fair, but why would Lambda help someone they've never heard of, who will move on to the next fad as quickly as possible, over their long-term existing customers?
The amount of get-off-my-lawn grognardness that LLM activity inspires is really ridiculous.
I really would ask you to take a second look at the spirit of your comment and think carefully about how much you really understand about the work being done on top of LLMs and if it justifies this kind of response.
I had the same reaction as the OP. I’m not a data scientist by trade or title, but I would personally be a little offended. If you designed the Porsche 911, would you not be offended by the shade tree mechanic who simply knows how to change the oil calling himself a Porsche designer/engineer?
There are people making applications based on LLMs. You may quibble with the term LLM Developer, but to sneer or roll your eyes at it as if it were prima facie inaccurate or laughable is unjustified.
I’m confused. If I am an LLM developer why do I need to know the cost per token? That’s not the GPU cost, that’s a business decision from a company.
If I am an LLM user maybe that’s relevant but prone to being out of date. I’m not going to use this page as the source of truth on that anyways.
Since the article seems to be targeted at developers who use LLMs to e.g. generate Embeddings for semantic search, the title is about as accurate as saying a software engineer is a “keyboard developer” because they use a keyboard.
This is the first time I've heard this term, and when I Google "LLM developer" in an incognito tab on a different device, this article is one of the first results.
Seems like we should first establish what exactly is an LLM developer.
> When I was at Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know.
The personal plug and appeal to authority of "When I was at Google" is unnecessary. "Numbers every Engineer should know" is public and literally linked there. It's a weird way to start an engineering blog post and makes it feel like marketing of one's resume. Then again, I guess that's what most of these engineering blog posts are nowadays.
Indeed, Jeff Dean is a legend, and needing to add the "legendary engineer" qualifier detracts from that point. Let these things speak for themselves.
I think you're being somewhat uncharitable here. There's nothing wrong with adding a personal detail here or there, and nothing wrong with giving credit to those who deserve it. I don't see any reason to bikeshed the short, inessential details included in the blogger's prose.
Most of this applies to people developing applications that depend on LLMs. Some of it also applies to people using LLMs for other purposes. Very little of it is applicable to someone developing LLMs.
- Human Reading Speed (English): ~250 words per minute
- Human Speaking Speed (English): ~150 words per minute
Should be treated like the Doherty Threshold [1] for generative content.
[1] https://lawsofux.com/doherty-threshold/