
If you are interested in implementing LLaMA yourself or learning, I noticed that the reference code by Facebook is some of the cleanest, easiest-to-read ML code I've seen in a while. https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of background knowledge to understand what you are reading, but I was pleasantly surprised.
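
For a sense of the structure, here's a rough sketch of what one decoder block in that file looks like (this is illustrative, not the reference code; rotary embeddings, the KV cache and model parallelism are omitted and the names are my own):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    class DecoderBlock(nn.Module):
        def __init__(self, dim, n_heads, hidden_dim):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, dim // n_heads
            self.wq = nn.Linear(dim, dim, bias=False)
            self.wk = nn.Linear(dim, dim, bias=False)
            self.wv = nn.Linear(dim, dim, bias=False)
            self.wo = nn.Linear(dim, dim, bias=False)
            self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # SwiGLU gate
            self.w2 = nn.Linear(hidden_dim, dim, bias=False)
            self.w3 = nn.Linear(dim, hidden_dim, bias=False)
            self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)

        def forward(self, x):
            b, t, d = x.shape
            h = self.attn_norm(x)
            q, k, v = (w(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                       for w in (self.wq, self.wk, self.wv))
            attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.wo(attn.transpose(1, 2).reshape(b, t, d))
            h = self.ffn_norm(x)
            return x + self.w2(F.silu(self.w1(h)) * self.w3(h))

The whole model is essentially an embedding lookup, a stack of blocks like that, a final norm and an output projection back to the vocabulary.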

By comparison, the Stable Diffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, unused experiments, etc. that can make it hard to follow what is going on.

Last weekend I got the "main loop" of the transformer working in pure CPU Rust code, following the reference code. My crappy code is just very, very slow, as I focused on getting it to run, not making it fast. The tokenizer uses Google's SentencePiece https://github.com/google/sentencepiece but luckily, for inference, it seems that you just need to be able to parse the tokenizer model file, not understand how it was created; I was able to strip out the protobuf files from that repository, add them to my Rust project and read the tokens.
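
If you just want to poke at the tokenizer before porting it, the official sentencepiece Python bindings parse the same model file (assuming you have tokenizer.model from the weights download):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    ids = sp.encode("Hello world", out_type=int)
    print(ids)              # token ids for the prompt
    print(sp.decode(ids))   # round-trips back to the original text
    print(sp.vocab_size())  # 32000 for LLaMA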

I am optimistic that someone will make a high-quality CPU or CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.




My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd

Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...

At least with my hardware this runs at "[size of model]/[speed of SSD reads]" tokens per second, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.

At 125GB and a 2GB/s read speed (the largest model, and what I get from my SSD) that's about 60 seconds per token (1 day per 1440 words), which isn't exactly practical. That is really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
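
Back-of-envelope, in case anyone wants to check the numbers:

    model_bytes = 125e9     # 65B model, ~125 GB of fp16 weights
    ssd_read = 2e9          # ~2 GB/s sequential read
    sec_per_token = model_bytes / ssd_read                # ~62.5 s per token
    print(sec_per_token, 24 * 3600 / sec_per_token)       # ~60 s/token, ~1400 tokens/day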

You could probably optimize quite a bit for batch throughput if you're ok with the latency though.


Yeah, it does seem like there's a fundamental limit on how fast you can go even if you engineer the data juggling to perfection. My guess is that every pass through the transformer has to visit every weight, and if those weights cannot fit in your fastest memory, then it's going to spend time transferring data from the SSD or whatever is lower in your memory hierarchy.

The quantization used in the post luckily seems to work somewhat well; I'm also wondering if some new clever ways will be invented that reduce the amount of data you need to juggle. Maybe e.g. not just using 4-bit weights but also compressing them in some way, sorting the weights or something.
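
For illustration, plain round-to-nearest 4-bit packing (not GPTQ itself, which picks the quantized values more cleverly) looks something like this; two weights share a byte, which is where the 4x saving over fp16 comes from:

    import numpy as np

    def quantize_4bit(w):
        scale = np.abs(w).max() / 7.0                       # map weights into [-7, 7]
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        nibbles = (q & 0x0F).astype(np.uint8)               # keep only the low nibble
        packed = nibbles[0::2] | (nibbles[1::2] << 4)       # two values per byte
        return packed, scale

    def dequantize_4bit(packed, scale, n):
        lo = (packed & 0x0F).astype(np.int8)
        hi = ((packed >> 4) & 0x0F).astype(np.int8)
        lo[lo > 7] -= 16                                    # sign-extend the 4-bit values
        hi[hi > 7] -= 16
        q = np.empty(n, dtype=np.int8)
        q[0::2], q[1::2] = lo, hi
        return q.astype(np.float32) * scale

    w = np.random.randn(1024).astype(np.float32)
    packed, scale = quantize_4bit(w)
    print(w.nbytes, packed.nbytes)                          # 4096 vs 512 bytes
    print(np.abs(dequantize_4bit(packed, scale, w.size) - w).max())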


Huffman encoding the weights (treating each 16-bit float as a symbol) could reduce the weights to ~85% of their original size (I calculated this exactly before, but am going from memory). You could maybe get a bit more than that with arithmetic coding (if you managed to decode fast enough), but it shouldn't be that much more.
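
For anyone who wants to reproduce that estimate: the Shannon entropy of the 16-bit patterns lower-bounds what Huffman or arithmetic coding can achieve, so something like this (a sketch, applied per weight tensor) gives the compressed/original ratio:

    import numpy as np
    import torch

    def entropy_ratio(weights: torch.Tensor) -> float:
        # treat every 16-bit pattern in the fp16 tensor as one symbol
        symbols = (weights.detach().to(torch.float16).cpu().contiguous()
                   .numpy().view(np.uint16).ravel())
        counts = np.bincount(symbols, minlength=2**16).astype(np.float64)
        p = counts[counts > 0] / counts.sum()
        bits_per_symbol = -(p * np.log2(p)).sum()
        return bits_per_symbol / 16.0   # ~0.85 would match the figure above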

Once you start including lossy steps like quantization though it's much less clear. At some point you just reach "knowledge distillation is an open problem".


Perhaps there is an instance of Amdahl's law lurking in the midst?


Won't the 65b model (almost) fit into 128GB RAM? Or into 128GB RAM and 24GB VRAM?


LLaMA-65B fits in 32GB of VRAM using state of the art GPTQ quantization with no output performance loss.

https://github.com/qwopqwop200/GPTQ-for-LLaMa


So if I'm reading this right, 65B at 4bit would consume around 20GB of VRAM and ~130GB of system RAM?


LLaMA doesn't require any system RAM to run.

It requires some very minimal system RAM to load the model into VRAM and to compile the 4bit quantized weights.

But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's 30B which needs 20GB of VRAM.)
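
Back-of-envelope for those numbers (4-bit weights are half a byte per parameter; the KV cache, activations and quantization scales come on top):

    for params in (30e9, 65e9):
        gib = params * 0.5 / 2**30
        print(f"{params / 1e9:.0f}B -> {gib:.1f} GiB of weights")
    # 30B -> ~14 GiB, 65B -> ~30 GiB, consistent with the ~20GB and ~32GB
    # figures once the overhead is included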


The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.


Judging from downloads of the 4bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.

I would not personally call compilation of software part of its "use case." Its use case is text generation.


Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.

Or it is probably possible to make it work slowly using a swapfile on Linux.


Closer to 38-40GB VRAM (and hardly any RAM).


Yes (I just don't have that much ram)

I have a separate branch that streams weights from RAM, at which point I think I was only seeing negligible performance loss compared to storing the weights in VRAM. The bottleneck was compute, not GPU bandwidth.
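
For anyone curious, the usual way to do that (a sketch, not the actual code in the branch) is to keep the weights in pinned host memory and copy the next layer's weights on a side CUDA stream while the current layer computes, so the transfer hides behind the compute:

    import torch

    def run_layers(x, cpu_weights):
        # pin once so the host->device copies can be asynchronous
        cpu_weights = [w.pin_memory() for w in cpu_weights]
        copy_stream = torch.cuda.Stream()
        with torch.cuda.stream(copy_stream):
            nxt = cpu_weights[0].to("cuda", non_blocking=True)
        for i in range(len(cpu_weights)):
            torch.cuda.current_stream().wait_stream(copy_stream)  # copy for layer i is done
            w = nxt
            if i + 1 < len(cpu_weights):
                with torch.cuda.stream(copy_stream):               # prefetch the next layer
                    nxt = cpu_weights[i + 1].to("cuda", non_blocking=True)
            x = x @ w      # stand-in for the real per-layer forward pass
        return x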


The 65B model only needs just over 32GB of VRAM to run. It does not need system RAM to run/use if you use pre-quantized weights which you can find many places already.

No need to quantize yourself (besides, it takes almost a day to do 4-bit GPTQ quantization on 3x A6000).


Quantizing is a lossy process; you can't really claim to be running the 65B LLaMA model at that point (though the 65B GPTQ LLaMA does look like it might be very useful).


The GPTQ paper https://arxiv.org/abs/2210.17323 claims "negligible accuracy degradation relative to the uncompressed baseline".


Are you sure? I think it took a mere 2 hours to do 4-bit GPTQ quantization of LLaMA-65B on 1x RTX 3090. But I may be mistaken.


I noticed the fastText code was also surprisingly clean and readable C++. Whatever moral and other flaws the Meta business model might have in general, they seem to have a consistently excellent track record when it comes to publicly available libraries and tools.


Very nice post, good lead. It makes me curious... I wonder what LLaMA would look like implemented on top of the newly released OpenXLA[1]! Is that even a sensible ask? I feel like it could be an informative exercise that would aid in understanding the landscape of tooling.

[1] https://opensource.googleblog.com/2023/03/openxla-is-ready-t... https://news.ycombinator.com/item?id=35078410


We have it running as part of SHARK (which is built on IREE). https://github.com/nod-ai/SHARK/tree/main/shark/examples/sha...


I’m pretty sure the code you linked is just simplified for publication. I think it’s interesting to read, I just don’t think it’s what they actually used to train and develop the algorithm.


This is only the model code, which defines the shape and how to do a forward pass.

It isn't the training code, but it's unlikely that the model code they actually used is much different.


There are little hints strewn throughout the code that suggest it is indeed "trimmed" from a larger codebase.


tinygrad by geohot, also linked in this thread, has similar properties that make it good for learning: it's a couple hundred LoC to integrate LLaMA support.

https://github.com/geohot/tinygrad/tree/llama


Just don’t copy their sampler. It’s trashcan-tier. Bad defaults and no repetition penalty.


I think tuning the sampler temperature and using top-k over top-p sounds ad hoc and shouldn’t be necessary for a solid model. Do you have any reason for suggesting those changes in particular? Especially since top-p, or nucleus sampling, is meant to be an improvement over top-k.
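
For reference, the recipe most LLaMA front-ends settle on combines all of these knobs; a sketch of it (standard technique, not tinygrad's actual sampler):

    import torch

    def sample(logits, prev_tokens, temperature=0.8, top_k=40, top_p=0.95, rep_penalty=1.1):
        logits = logits.clone()                       # 1-D tensor over the vocabulary
        for t in set(prev_tokens):                    # repetition penalty
            logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
        logits = logits / temperature
        if top_k > 0:                                 # keep only the k most likely tokens
            kth = torch.topk(logits, top_k).values[-1]
            logits[logits < kth] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        # top-p: drop tokens once the higher-ranked ones already cover p of the mass
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        drop = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
        probs[sorted_idx[drop]] = 0.0
        probs = probs / probs.sum()
        return torch.multinomial(probs, 1).item()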



