Hacker News
Numbers every LLM developer should know (github.com/ray-project)
428 points by richardliaw on May 17, 2023 | 103 comments



I would add the following two numbers if you're generating realtime text or speech for human consumption:

- Human Reading Speed (English): ~250 words per minute

- Human Speaking Speed (English): ~150 words per minute

Should be treated like the Doherty Threshold [1] for generative content.

[1] https://lawsofux.com/doherty-threshold/
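
As a rough back-of-the-envelope (assuming the article's ~1.3 tokens per word; the numbers below are just that approximation), this translates into the sustained generation rate needed to keep pace with a human:

    # Tokens/sec needed to keep pace with a human, assuming ~1.3 tokens/word.
    TOKENS_PER_WORD = 1.3

    def tokens_per_second(words_per_minute):
        return words_per_minute * TOKENS_PER_WORD / 60

    print(tokens_per_second(250))  # reading:  ~5.4 tokens/s
    print(tokens_per_second(150))  # speaking: ~3.3 tokens/s

Anything comfortably above those rates feels instant; anything below feels like waiting.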


Human reading speed varies by a factor of 10 or more between individuals, while speaking speed is much more consistent.


My own reading speed varies by a factor of 5 day to day, depending on how much reading I've been doing, how much sleep I've gotten, etc.


Plus, whether I am reading light fiction versus technical documentation.


> speaking speed is much more consistent.

Is it? I've noticed a huge variance in speaking speed in the US, but it tends to vary more between regions rather than individuals.


There are exceptions for languages where the rapidity of speech really varies according to context, such as Spanish.


But I'd say LLMs produce content faster than I can read or write it, because they can produce content which is really dense.

Ask GPT-4 a question and then answer it yourself. Maybe your answer will be as good or better than GPT-4's but GPT-4 writes its answer a lot faster.


It certainly doesn't produce content as fast as I can read it.


Only if you use gpt-4. gpt-3.5-turbo is much faster, and gpt-4 is only going to get faster as GPUs get faster.


Yep. I use GPT-4 extensively and exclusively, and the comment I was replying to mentioned GPT-4. I can't wait for it to get faster.


Bing also uses GPT-4 and it is very fast. Microsoft spends more on compute.


It doesn't exclusively use GPT-4. You might be right that their GPT-4 is much faster, but you're also not always seeing GPT-4 with them.


I'm pretty sure it at least mostly uses GPT-4.


afaict OpenAI's instances are massively overloaded; you can see this in the 32k-context model actually being faster in practice rather than slower


Dense content? Not in my experience. It seems overly verbose to me.


Prompt it to be information dense in its response then.


So I have to specifically ask for it; it's not at all the default.


I get it, but it is just about infinitely configurable to your specific needs so it doesn't bother me too much what the default response is.


> There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.

I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16 and 8 bit:

https://textsynth.com/technology.html

8-bit uses half as much memory and doubles the throughput for limited quality loss.
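
For what it's worth, with the Hugging Face transformers + bitsandbytes integration the tradeoff is roughly a one-flag change. A sketch (the model name is illustrative, and you'd load one or the other, not both):

    # fp16 vs. int8 loading via transformers + bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM

    name = "huggyllama/llama-7b"  # illustrative

    model_fp16 = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, device_map="auto")  # ~2 bytes/param
    model_int8 = AutoModelForCausalLM.from_pretrained(
        name, load_in_8bit=True, device_map="auto")          # ~1 byte/param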


It's true if you're doing training. But for inference, severe quantization is mostly okay. And even when running inference with a quantized model, there are some internal parts of a transformer where you want to upcast the low-bit inputs and do the calculation in 16 bits, such as the dot-product similarity between vectors.


Even that is being tackled by newer GPU architectures. For example, NovelAI is currently training an LLM in fp8 precision, using H100 GPUs.[1]

[1] https://blog.novelai.net/anlatan-acquires-hgx-h100-cluster-4...

https://blog.novelai.net/text-model-progress-is-going-good-8...


Cool stuff. I looked at https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29 and I noticed that the fp8 support is only for the tensor cores and not the CUDA side. Does that mean training with H100 GPUs in fp8 mode would use some software ecosystem that's not the vast existing CUDA one? Or am I just misunderstanding CUDA cores vs tensor cores?

PS, as a joke, they should implement GPU fluint8 and get baked in non-linearity for the activation function without even using a non-linear function, https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt half decent: The hidden power of imprecise lines" by suckerpinch)


You can access the tensor cores from CUDA; in practice you might generate the code with something like OpenAI Triton.


The problem with 8bit at the moment is massive performance degradation with bitsandbytes. Recent improvements in 4bit inference mean that 8bit is now a massive laggard (although there's no reason to expect this won't eventually be resolved).


AFAIK for over-parameterized models, performing quantization or any other form of compression won't reduce accuracy by much (don't quote me on this though).


[Author] Fair point. Adjusted the language.

Nonetheless people do tend to use 16 bit huggingface models, and if you do go to 8 bits and it's wrong, you're never quite sure if it's the quant or the model.


The article is right: 8-bit (and especially 4-bit) is atypical for deep learning models. How well it works depends heavily on the number of parameters (larger models can tolerate more quantization) and can even depend on specific training hyperparameters (mainly dropout and weight decay, which can induce sparsity).


Thing is, even when the impact from 4-bit is substantial, the larger parameter count it allows on the same hardware more than makes up for it. E.g. llama-30b is better at 4-bit than any derivative of llama-13b, no matter how fine-tuned or quantized.


> Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical.

No, 4bit quantization is the typical case.

At 4bit you can fit twice the parameters of 8bit in the same space for far better performance/perplexity/quality.

Running LLMs at more than 4 bits is atypical and almost always sub-optimal: for the same memory, a model run at 8 bits loses to a model twice the size run at 4 bits.

Even pretraining and finetuning in 4bit is likely to become the norm soon as fp4 becomes better understood.


[Author] Completely disagree. Any analysis shows that perplexity gets worse at 4 bits. Have a look at llama.cpp's results here:

https://github.com/ggerganov/llama.cpp#quantization

4 bit has a perplexity score 0.13 or so higher.


You're just wrong. You're looking at the wrong numbers. The perplexity score of a model with twice the parameters in half the bits (4bit) is FAR LOWER (ie better).

If you are limited to X RAM and have two 16bit models of size 4X and 2X then the 4X model in 4bit will always be far superior to the 2X model in 8bit, with far lower perplexity.

Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit perplexity of 5.9069. That is over 0.54 lower perplexity for the same RAM amount by using 4bit! That is MASSIVE!
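
The back-of-the-envelope memory math (weights only, ignoring activations, KV cache and per-group scales) is what makes this an apples-to-apples comparison:

    # Rough weight memory for a given parameter count and bit width.
    def weight_gb(params_billion, bits):
        return params_billion * bits / 8   # e.g. 13B at 4 bits = 6.5 GB

    print(weight_gb(13, 4))   # ~6.5 GB: 13B at 4-bit
    print(weight_gb(7, 8))    # ~7.0 GB: 7B at 8-bit -- same ballpark of RAM
    print(weight_gb(7, 16))   # ~14 GB:  7B at fp16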


Another factor is that larger models degrade less when quantized.

You have to wonder if running a huge model, say, 300B parameters at 2-bit quantization might be "optimal" in that it would fit into a single A100 or H100 GPU and likely outperform an 80B parameter 8-bit model...


Not sure here. For the LLaMA models, yes: all the weights fit in the small range between -2.0 and 2.0.

But some other models have much crazier numbers, with even crazier outliers among them; you might have a weight of 12.00 sitting in a long array of typical small values around 0.00.

I've read about an attempt to quantize an RWKV model to 4/5 bits that fell short due to the presence of outlier weights.

The author said somewhere that the bigger models had worse perplexity because of this.


A stupid question but...what about a 16x model in 1bit?


Have binary neural networks been implemented for Transformers yet?


Another factor in favor of 8 bits is that inference will generally be about 2x faster than for a model twice the size at 4 bits.


There's also research showing that the perplexity hit from quantization is smaller at higher parameter counts. E.g. a 65B parameter model barely sees any impact at all when going from 16-bit to 4-bit.


Not necessarily; you can see here how bigger models can have much worse perplexity at 4-bit due to weight outliers:

https://github.com/saharNooby/rwkv.cpp/issues/12

For the LLaMA models, yeah, it's a different story.


Well, if you have a fixed RAM size, you're better off with the largest model you can fit at 4 bits (13B at 4-bit is way better than 7B at 16-bit despite taking about half the memory).


No it isn't, quantization is not free. You lose a significant amount of performance that you are not measuring properly in automated benchmarks when you quantize to that level.

You can see it in real time when you take most LLMs and compare them at different quantization levels. I can see the degradation quite badly even in the largest LLaMA, even at 8 bits.


Quantization is not free, but VRAM is even less free.

If you have X amount of VRAM and can fit a 16bit model of size 2X in 8bit or a model of size 4X in 4bit then the 4X model in 4bit is ALWAYS superior with lower perplexity and better performance.

You LOSE performance by using a smaller model in 8bit vs a larger model in 4bit.


If you take a model and quantize it it's obviously going to get worse, but what if you train it again after that?


Can somebody please explain how quantization below 8 bit works? Since a byte is the smallest addressable unit I think, is the dimensionality of the weights somehow reduced?


[Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.

More detail than you probably wanted: https://huggingface.co/blog/hf-bitsandbytes-integration
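
A minimal sketch of the core idea in that post (absmax int8 quantization; real implementations quantize per block or row and handle outliers separately), to make "approximate the weights using fewer bits" concrete:

    import numpy as np

    def quantize_absmax_int8(w):
        # Scale so the largest-magnitude weight maps to 127, then round to int8.
        scale = np.abs(w).max() / 127.0
        return np.round(w / scale).astype(np.int8), scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(16).astype(np.float32)
    q, s = quantize_absmax_int8(w)
    print(np.abs(w - dequantize(q, s)).max())  # small reconstruction error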


The latest release of bitsandbytes uses a new fp4 format. 4-bit floating point scaling results in much lower perplexity than int4.

Also note that for a fixed memory (RAM) size, 4bit (even int4) is always superior, resulting in lower perplexity than 8bit.

E.g. LLaMA-13B int4 is far better/lower perplexity than LLaMA-7B fp8 while using the same amount of RAM.


Software can address units of any size, by packing and unpacking bits from bytes (or more likely words) in the underlying implementation. I don't know about any specific NN implementation here, just commenting in general that the size of the addressable unit and the size of your reads and writes can be completely independent. I routinely use bit-packing data compression techniques in CUDA, for example.


Generally, since the memory is byte-addressable, you load data that is packed into bytes. It is the compute instructions that use the specific bit widths needed.

So in this case one would load a byte which holds two 4-bit values, and then a 4-bit ADD or MAC would operate on them.

If you don't have those instructions, then you need to sign/zero-extend or convert the smaller bit widths to 8/16/32 bits, whichever is available.
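
A toy illustration of the packing/unpacking (two unsigned 4-bit values per byte; real kernels pack into 32-bit words and fuse the unpack into the matmul):

    import numpy as np

    def pack_4bit(vals):
        v = np.asarray(vals, dtype=np.uint8)   # values must be 0..15, even count
        return (v[0::2] << 4) | v[1::2]        # two nibbles per byte

    def unpack_4bit(packed):
        p = np.asarray(packed, dtype=np.uint8)
        out = np.empty(p.size * 2, dtype=np.uint8)
        out[0::2], out[1::2] = p >> 4, p & 0x0F
        return out

    x = np.array([3, 15, 0, 9], dtype=np.uint8)
    assert (unpack_4bit(pack_4bit(x)) == x).all()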


Go see for yourself :)

https://github.com/ggerganov/llama.cpp/blob/master/examples/...

There are too many schemes right now, with 4_0 and 5_1 being really popular among LLM geeks.


I believe it's locally (inner-loop or simd op) up-cast to float8/float16/int8, but I haven't looked at the internals of llama.cpp myself


> llama.cpp which runs a 13 billion parameter model on a 6GB GPU

I think that's a typo there too: the 13B model needs something like 10GB of memory at 4 bits; it's the 7B one that fits into 6GB. Well, unless you do the split thing with some layers on the CPU, I guess.



Yeah that's the split layer mode I mentioned. With 6G one can do about 18 layers, which is less than half of the 40 total for 13B.


> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

MosaicML claims they trained a 7 billion parameter model on 1 trillion tokens with a budget of $200k.

https://www.mosaicml.com/blog/mpt-7b

Does training cost scale linearly with model size and token count? If so, that suggests a lower bound of $600k to train the 13 billion params model. (Still roughly the same magnitude)
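
As a sanity check you can use the standard ~6 * params * tokens FLOPs approximation (the utilization and $/GPU-hour figures below are my own assumptions, not numbers from either source):

    # Back-of-envelope: training FLOPs ~= 6 * N * D, then convert to GPU-hours.
    flops = 6 * 13e9 * 1.4e12                 # ~1.1e23 FLOPs
    a100_bf16_peak = 312e12                   # FLOP/s
    utilization = 0.4                         # assumed
    gpu_hours = flops / (a100_bf16_peak * utilization) / 3600
    print(gpu_hours)                          # ~240k A100-hours

    for dollars_per_hour in (1.1, 2.0, 4.1):  # Lambda-ish, "sweetheart", AWS-ish
        print(dollars_per_hour * gpu_hours)   # ~$270k, ~$490k, ~$1M

Under that rule the cost is indeed roughly linear in both parameters and tokens, and MosaicML's $200k for 7B x 1T is consistent with it at around $2 per GPU-hour.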


[Author] Mosaic must be getting some kind of sweetheart deals on A100 80GB and A100 40GB. The prices they are quoting are not what, say, the AWS on-demand prices are. They quote $2 per GPU-hour for A100 40GB and $2.50 for A100 80GB. That's literally half the AWS on-demand rate for A100s here: https://aws.amazon.com/ec2/instance-types/p4/

And these are impossible to get. We tried to get some for Anyscale, and we were told there were no on-demand available and lead time for reserved (ouchie on the price! You're talking a quarter of a million dollars a year for one machine at list) was in weeks.

Once you take the model size and hefty sweetheart deals into account, you're within 10%. Mosaic does have some nice whitebox optimizations, but nothing that radically changes the equation.


A100-40GB is like $1.10 on LambdaLabs, on demand. Their availability is horrific on singles, but I've seen 8x instances pop up more often than not. And you can rent A100s for a buck a pop interruptible from other clouds, plenty of availability. $2 doesn't seem like much of a sweetheart deal.


There is no possible way for anyone buying $1M worth of compute to get list pricing.


Thanks for putting this together.

I have a suggested modification. You are mixing references in your document.

Re: '~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens The LLaMa paper mentions it took them 21 days to train LLaMa using 2048 GPUs A100 80GB GPUs.'

The LLaMA-13B model took 2.75 days of 2048xA100 (135,168 GPU-hours) with 1 trillion tokens. The 21 days for 1.4 trillion was for LLaMA-65B.

I would suggest using the LLaMa-13B numbers since those are the most relevant for this section, or at least modify "21 days to train LLaMa" to "21 days to train LLaMa-65B" for clarity.


RANDOM THOUGHT:

I wonder when we are getting Docker for LLMs ... a Modelfile?

FROM "PAAMA/16b"

APPLY "MNO/DATASET"

Each layer could be a LoRA-adapter-like thing, maybe.

maybe when AI chips are finally here.
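
The closest thing today is probably loading a base model and applying LoRA adapters on top with the peft library. A rough sketch of the idea, reusing the hypothetical names from above:

    # "FROM" / "APPLY" sketched with transformers + peft (names are hypothetical).
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("PAAMA/16b")    # FROM "PAAMA/16b"
    model = PeftModel.from_pretrained(base, "MNO/DATASET")      # APPLY "MNO/DATASET"
    # Each Modelfile "layer" could map to another adapter stacked on top.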


SQLFlow[0] looks sort of like that:

    SELECT * FROM iris.train
    TO TRAIN DNNClassifier
    WITH model.hidden_units = [10, 10], model.n_classes = 3, train.epoch= 10
    COLUMN sepal_length, sepal_width, petal_length, petal_width
    LABEL class
    INTO sqlflow_models.my_dnn_model;
No idea how well it works.

[0]: https://sql-machine-learning.github.io/


PyTorch tutorial looks similar (lower on the page)

https://pytorch.org/tutorials/beginner/pytorch_with_examples...


> 40-90%: Amount saved by appending “Be Concise” to your prompt

Looks to me like "numbers every LLM user needs to know".


I think parts of the write-up are great.

There are some unique assumptions being made in parts of the gist.

> 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding

> 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries

I don't know how useful these numbers are if you take away the assumption that self-hosted will work as well as the API.

> 10x: Throughput improvement from batching LLM requests

I see that the write-up mentions memory being a caveat to this, but it also depends on the card specs. The memory bandwidth / TFLOPs offered by, say, a 4090 are superior while it has the same amount of VRAM as a 3090. The token-length caveat mentioned in the gist itself makes the 10x claim not a very useful rule of thumb.


> This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in a neural information retrieval system costs about 5x less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!

Only in the narrow use case of a strict look-up. This seems to exaggerate the cost difference while the two approaches have completely different trade-offs.


I think that it would be helpful to add fine-tuning costs for an open-source model (think LLaMA to Alpaca).

From the phrasing around fine-tuning right now, it seems like it's using OpenAI's fine-tuning API to determine that cost, but it's not very clear.

Also this would be helpful for other foundation models if that doesn't already exist - how much VRAM to run Stable Diffusion v2.1 at different resolutions, running Whisper or Bark for audio, etc.


They mention that they could finetune a 6B model for $7. Obviously the number depends on the amount of data and the model size but it's probably not going to be a significant expense in practice.


How come the token to word ratio is smaller than 1 if tokens are either words or part of words? Shouldn't you expect more tokens than words?


That is how I understood it: a token is on average 3/4 of a word ("token to word"). So if you buy 1000 tokens you would get effectively 750 words.
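
You can check the ratio yourself with OpenAI's tiktoken tokenizer (a quick sketch; the exact ratio depends heavily on the text and the encoding):

    # Words-per-token ratio; cl100k_base is the GPT-3.5/GPT-4 encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Numbers every LLM developer should know, and a few they can look up."
    print(len(text.split()) / len(enc.encode(text)))  # roughly 0.75 for English prose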


[Author] Fair point -- I clarified the language and gave a concrete example. Hope that helps!


It's the token to word multiplier, yeah. i.e. x tokens = 0.75x words.


I think all the ratios given are x:1 and they tell you x.


It’s the other way around.

1 GPT4 token is equivalent to 50 GPT3.5 tokens.

1 token is equivalent to 0.75 words.


That would make it 0.75 tokens to 1 word right?


lol, yes, I'm glad they clarified because I understood it correctly then made the mistake GP did when I replied to them.


I'm surprised not to see anything about data-to-parameter ratios for optimal scaling. My superficial understanding per the Chinchilla paper is to target 20 to 1.

I'm also confused about this:

> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

This is apparently related to the LLaMa paper, but that paper seems to cite 1.0T tokens (rather than 1.4T tokens) for the 13B model. Also, if 20 to 1 is in fact optimal for the data-to-parameter ratio, then using a 100 to 1 ratio doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration. Or, perhaps, my superficial understanding here leads me to miss some important distinctions.
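
For concreteness (my arithmetic, not anything from the paper or the gist):

    # Chinchilla-style 20:1 target vs. what the gist's numbers imply for 13B.
    params = 13e9
    print(20 * params)       # ~2.6e11 -> ~260B tokens at a 20:1 ratio
    print(1.0e12 / params)   # ~77:1  with the 1.0T tokens the LLaMA paper cites for 13B
    print(1.4e12 / params)   # ~108:1 with the 1.4T figure used in the gist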


The Chinchilla paper only addresses the contrived use case of a model that is trained once and never used for inference. Since most of the real world compute cost will be in inference, Chinchilla seems to offer little practical guidance.


> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

The LLaMA paper mentioned 135,168 A100-hours for training the 13 billion parameter model on 1 trillion tokens, which works out to ~$150k at Lambda Labs on-demand pricing.


[Author] Good luck trying to use clusters of Lambda machines. Lambda labs are cheap for a reason: their API is not very featureful (we looked at them and we saw they didn't even support machine tagging). If you're looking for a box or two, lambda labs is fine. If you're looking for 1,000, not so much.

Plus they don't actually have any A100s available at the moment (2023-05-17).

CoreWeave is a nice middle ground. You can at least get the A100 machines into a k8s cluster.


Okay, well, that is just your experience. If you are brand new to an industry that is undergoing absolutely massive GPU shortages right now, you probably will not be able to easily source GPUs. It might not seem fair, but why would Lambda help someone they have never heard of, who will move on to the next fad as quickly as possible, over their long-term existing customers?


Excellent! Thank you so much for making/posting this


[Author] You're welcome -- glad it was useful!


> 1.3: Average tokens per word

this is so US centric :-(

for billions of people, arguably the majority of the world, that’s incorrect


> Running an LLM query through a GPU is very high latency: it may take, say, 5 seconds, with a throughput of 0.2 queries per second.

Why?


Talks about throughput but doesn't mention memory I/O speed, which should be a bottleneck for LLMs
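
Rough intuition for why, assuming a memory-bandwidth-bound decode (my illustrative numbers):

    # Autoregressive decoding reads essentially all the weights once per generated
    # token, so single-stream speed is bounded by memory bandwidth, not FLOPs.
    model_bytes = 13e9 * 2          # 13B params at fp16
    bandwidth = 2.0e12              # ~2 TB/s (A100 80GB HBM)
    print(bandwidth / model_bytes)  # ~77 tokens/s upper bound for one sequence

Which is also why batching helps so much: the same weight reads get amortized across many requests.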


Are there any open source host-your-own LLMs that have licensing that allows for commercial use?



Dolly from Databricks is one at least


[Author] TL;DR: open-source LLMs are coming.

Dolly's not that great -- I've hit lots of issues using it, to be honest.

MosaicML has a nice commercially usable model here: https://www.mosaicml.com/blog/mpt-7b

I think they're one of the leading ones (bias: they're kinda competitors to my employer Anyscale, but you gotta say something's good when it is).

RedPajama is leading an effort to build a fully open source model similar to LLaMA. https://www.together.xyz/blog/redpajama


Vicuna-13b is on Apache License 2.0.


Vicuna is a delta model that you have to apply on top of LLaMA.


How does one get the original LLaMA weights? I tried the form that Meta has, no dice. Also tried some torrents, no luck there either.


I filled the form and got the weights a few weeks later, but I work for a research organisation.

I thought the torrents were super active.


> LLM Developer

This is the fastest I've rolled my eyes in a long time!


The amount of get-off-my-lawn grognardness that LLM activity inspires is really ridiculous.

I really would ask you to take a second look at the spirit of your comment and think carefully about how much you really understand about the work being done on top of LLMs and if it justifies this kind of response.


I had the same reaction as the OP. I’m not a data scientist by trade or title, but I would personally be a little offended. If you designed the Porsche 911, would you not be offended by the shade tree mechanic who simply knows how to change the oil calling himself a Porsche designer/engineer?


There are people making applications based on LLMs. You may quibble with the term LLM Developer, but to sneer or roll your eyes at it as if it were prima facie inaccurate or laughable is unjustified.


Well, he was a web3 developer 6 months ago and an NFT dev 12 months ago, so forgive us for not taking this week's flavor all that seriously.


Context matters. Is a "web developer" someone who makes web pages, or works on a browser rendering engine?


I’m confused. If I am an LLM developer why do I need to know the cost per token? That’s not the GPU cost, that’s a business decision from a company.

If I am an LLM user, maybe that's relevant, but it's prone to being out of date. I’m not going to use this page as the source of truth on that anyway.

Since the article seems to be targeted at developers who use LLMs to e.g. generate Embeddings for semantic search, the title is about as accurate as saying a software engineer is a “keyboard developer” because they use a keyboard.


> LLM developer

This is the first time I heard this term, and when I Google search "LLM developer" in an incognito tab, different device, this article is one of the first results.

Seems like we should first establish what exactly is an LLM developer.

> When I was at Google, there was a document put together by Jeff Dean, the legendary engineer, called Numbers every Engineer should know.

The personal plug and appeal to authority of "When I was at Google" is unnecessary. "Numbers every Engineer should know" is public and literally linked there. It's a weird way to start an engineering blog post and makes it feel like marketing of one's resume. Then again, I guess that's what most of these engineering blog posts are nowadays.

Indeed Jeff Dean is a legend and needing to add the "legendary engineer" qualifier detracts from this point. Let these things speak for themselves.


I think you're being somewhat uncharitable here. There's nothing wrong with adding a personal detail here or there, and nothing wrong with giving credit to those who deserve it. I don't see any reason to bikeshed the short, inessential details included in the blogger's prose.


The term "LLM developer" is clear enough from context.


I’m still not sure if the advice applies to people developing LLMs, or to software developers using LLMs in their daily job to produce code.


Most of this applies to people developing applications that depend on LLMs. Some of it also applies to people using LLMs for other purposes. Very little of it is applicable to someone developing LLMs.



