
If you are interested in implementing LLaMA yourself or learning, I noticed that the reference code by Facebook is some of the cleanest, easiest-to-read ML code I've seen in a while. https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of background knowledge to understand what you are reading, but I was pleasantly surprised.
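
For a sense of the structure, here's a rough sketch of what one decoder block in that file looks like (this is illustrative, not the reference code; rotary embeddings, the KV cache and model parallelism are omitted and the names are my own):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    class DecoderBlock(nn.Module):
        def __init__(self, dim, n_heads, hidden_dim):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, dim // n_heads
            self.wq = nn.Linear(dim, dim, bias=False)
            self.wk = nn.Linear(dim, dim, bias=False)
            self.wv = nn.Linear(dim, dim, bias=False)
            self.wo = nn.Linear(dim, dim, bias=False)
            self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # SwiGLU gate
            self.w2 = nn.Linear(hidden_dim, dim, bias=False)
            self.w3 = nn.Linear(dim, hidden_dim, bias=False)
            self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)

        def forward(self, x):
            b, t, d = x.shape
            h = self.attn_norm(x)
            q, k, v = (w(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                       for w in (self.wq, self.wk, self.wv))
            attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.wo(attn.transpose(1, 2).reshape(b, t, d))
            h = self.ffn_norm(x)
            return x + self.w2(F.silu(self.w1(h)) * self.w3(h))

The whole model is essentially an embedding lookup, a stack of blocks like that, a final norm and an output projection back to the vocabulary.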

By comparison, the Stable Diffusion torch code in the diffusers and transformers Python libraries has lots of conditionals, unused experiments, etc. that can make it hard to follow what is going on.

Last weekend I got the "main loop" of the transformer working in pure CPU Rust code, following the reference code. My crappy code is just very, very slow, as I focused on getting it to run, not making it fast. The tokenizer uses Google's SentencePiece https://github.com/google/sentencepiece but luckily, for inference, it seems that you just need to be able to parse the tokenizer model file, not understand how it was created; I was able to strip out the protobuf files from that repository, add them to my Rust project and read the tokens.
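
If you just want to poke at the tokenizer before porting it, the official sentencepiece Python bindings parse the same model file (assuming you have tokenizer.model from the weights download):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    ids = sp.encode("Hello world", out_type=int)
    print(ids)              # token ids for the prompt
    print(sp.decode(ids))   # round-trips back to the original text
    print(sp.vocab_size())  # 32000 for LLaMA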

I am optimistic that someone will make a high-quality CPU or CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.




My code for this is very much not high quality, but I have a CPU + GPU + SSD combination: https://github.com/gmorenz/llama/tree/ssd

Usage instructions in the commit message: https://github.com/facebookresearch/llama/commit/5be06e56056...

At least with my hardware this runs at "[size of model]/[speed of SSD reads]" tokens per second, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.

At 125GB and a 2GB/s read speed (the largest model, and what I get from my SSD) that's about 60 seconds per token (1 day per 1440 words), which isn't exactly practical. That is really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
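
Back-of-envelope, in case anyone wants to check the numbers:

    model_bytes = 125e9     # 65B model, ~125 GB of fp16 weights
    ssd_read = 2e9          # ~2 GB/s sequential read
    sec_per_token = model_bytes / ssd_read                # ~62.5 s per token
    print(sec_per_token, 24 * 3600 / sec_per_token)       # ~60 s/token, ~1400 tokens/day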

You could probably optimize quite a bit for batch throughput if you're ok with the latency though.


Yeah, it does seem like there's a fundamental limit on how fast you can go even if you engineer the data juggling to perfection. My guess is that every pass through the transformer has to visit every weight, and if those weights cannot fit in your fastest memory, then it's going to spend time transferring data from the SSD or whatever is lower in your memory hierarchy.

The quantization used in the post luckily seems to work somewhat well; I'm also wondering if some new clever ways will be invented that reduce the amount of data you need to juggle. Maybe e.g. not just using 4-bit weights but also compressing them in some way, sorting the weights or something.
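
For illustration, plain round-to-nearest 4-bit packing (not GPTQ itself, which picks the quantized values more cleverly) looks something like this; two weights share a byte, which is where the 4x saving over fp16 comes from:

    import numpy as np

    def quantize_4bit(w):
        scale = np.abs(w).max() / 7.0                       # map weights into [-7, 7]
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        nibbles = (q & 0x0F).astype(np.uint8)               # keep only the low nibble
        packed = nibbles[0::2] | (nibbles[1::2] << 4)       # two values per byte
        return packed, scale

    def dequantize_4bit(packed, scale, n):
        lo = (packed & 0x0F).astype(np.int8)
        hi = ((packed >> 4) & 0x0F).astype(np.int8)
        lo[lo > 7] -= 16                                    # sign-extend the 4-bit values
        hi[hi > 7] -= 16
        q = np.empty(n, dtype=np.int8)
        q[0::2], q[1::2] = lo, hi
        return q.astype(np.float32) * scale

    w = np.random.randn(1024).astype(np.float32)
    packed, scale = quantize_4bit(w)
    print(w.nbytes, packed.nbytes)                          # 4096 vs 512 bytes
    print(np.abs(dequantize_4bit(packed, scale, w.size) - w).max())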


Huffman encoding the weights (treating each 16-bit float as a symbol) could reduce the weights to ~85% of their original size (I calculated this exactly before, but am going from memory). You could maybe get a bit more than that with arithmetic coding (if you managed to decode fast enough), but it shouldn't be that much more.
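
For anyone who wants to reproduce that estimate: the Shannon entropy of the 16-bit patterns lower-bounds what Huffman or arithmetic coding can achieve, so something like this (a sketch, applied per weight tensor) gives the compressed/original ratio:

    import numpy as np
    import torch

    def entropy_ratio(weights: torch.Tensor) -> float:
        # treat every 16-bit pattern in the fp16 tensor as one symbol
        symbols = (weights.detach().to(torch.float16).cpu().contiguous()
                   .numpy().view(np.uint16).ravel())
        counts = np.bincount(symbols, minlength=2**16).astype(np.float64)
        p = counts[counts > 0] / counts.sum()
        bits_per_symbol = -(p * np.log2(p)).sum()
        return bits_per_symbol / 16.0   # ~0.85 would match the figure above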

Once you start including lossy steps like quantization though it's much less clear. At some point you just reach "knowledge distillation is an open problem".


Perhaps there is an instance of Amdahl's law lurking in the midst?


Won't the 65b model (almost) fit into 128GB RAM? Or into 128GB RAM and 24GB VRAM?


LLaMA-65B fits in 32GB of VRAM using state of the art GPTQ quantization with no output performance loss.

https://github.com/qwopqwop200/GPTQ-for-LLaMa


So if I'm reading this right, 65B at 4bit would consume around 20GB of VRAM and ~130GB of system RAM?


LLaMA doesn't require any system RAM to run.

It requires some very minimal system RAM to load the model into VRAM and to compile the 4bit quantized weights.

But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's 30B which needs 20GB of VRAM.)
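
Back-of-envelope for those numbers (4-bit weights are half a byte per parameter; the KV cache, activations and quantization scales come on top):

    for params in (30e9, 65e9):
        gib = params * 0.5 / 2**30
        print(f"{params / 1e9:.0f}B -> {gib:.1f} GiB of weights")
    # 30B -> ~14 GiB, 65B -> ~30 GiB, consistent with the ~20GB and ~32GB
    # figures once the overhead is included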


The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.


Judging from downloads of the 4bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.

I would not personally call compilation of software part of its "use case." Its use case is text generation.


Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.

Or it is probably possible to make it work slowly using a swapfile on Linux.


Closer to 38-40GB VRAM (and hardly any RAM).


Yes (I just don't have that much ram)

I have a separate branch that streams weights from RAM, at which point I think I was only seeing negligible performance loss compared to storing the weights in VRAM. The bottleneck was compute, not GPU bandwidth.
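
For anyone curious, the usual way to do that (a sketch, not the actual code in the branch) is to keep the weights in pinned host memory and copy the next layer's weights on a side CUDA stream while the current layer computes, so the transfer hides behind the compute:

    import torch

    def run_layers(x, cpu_weights):
        # pin once so the host->device copies can be asynchronous
        cpu_weights = [w.pin_memory() for w in cpu_weights]
        copy_stream = torch.cuda.Stream()
        with torch.cuda.stream(copy_stream):
            nxt = cpu_weights[0].to("cuda", non_blocking=True)
        for i in range(len(cpu_weights)):
            torch.cuda.current_stream().wait_stream(copy_stream)  # copy for layer i is done
            w = nxt
            if i + 1 < len(cpu_weights):
                with torch.cuda.stream(copy_stream):               # prefetch the next layer
                    nxt = cpu_weights[i + 1].to("cuda", non_blocking=True)
            x = x @ w      # stand-in for the real per-layer forward pass
        return x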


The 65B model only needs just over 32GB of VRAM to run. It does not need system RAM to run/use if you use pre-quantized weights which you can find many places already.

No need to quantize yourself (besides, it takes almost a day to do 4-bit GPTQ quantization on 3x A6000).


Quantizing is a lossy process; you can't really claim to be running the 65B LLaMA model at that point (though the 65B GPTQ LLaMA does look like it might be very useful).


The GPTQ paper https://arxiv.org/abs/2210.17323 claims "negligible accuracy degradation relative to the uncompressed baseline".


Are you sure? I think it took a mere 2 hours to do 4-bit GPTQ quantization of LLaMA-65B on 1x RTX 3090. But I may be mistaken.


I noticed the fastText code was also surprisingly clean and readable C++. Whatever moral and other flaws the Meta business model might have in general, they seem to have a consistently excellent track record when it comes to publicly available libraries and tools.


Very nice post, good lead. It makes me curious... I wonder what LLaMA would look like implemented on top of the newly released OpenXLA[1]! Is that even a sensible ask? I feel like it could be an informative exercise that would aid in understanding the landscape of tooling.

[1] https://opensource.googleblog.com/2023/03/openxla-is-ready-t... https://news.ycombinator.com/item?id=35078410


We have it running as part of SHARK (which is built on IREE). https://github.com/nod-ai/SHARK/tree/main/shark/examples/sha...


I’m pretty sure the code you linked is just simplified for publication. I think it’s interesting to read, I just don’t think it’s what they actually used to train and develop the algorithm.


This is only the model code, which defines the shape and how to do a forward pass.

It isn't the training code, but it's unlikely that the model code they actually used is much different.


There are little hints strewn throughout the code that suggest it is indeed "trimmed" from a larger codebase.


tinygrad by geohot, also linked in this thread, has similar properties that make it good for learning: it's a couple hundred LoC to integrate LLaMA support.

https://github.com/geohot/tinygrad/tree/llama


Just don’t copy their sampler. It’s trashcan-tier. Bad defaults and no repetition penalty.


I think tuning the sampler temperature and using top-k over top-p sounds ad hoc and shouldn’t be necessary for a solid model. Do you have any reason for suggesting those changes in particular? Especially since top-p, or nucleus sampling, is meant to be an improvement over top-k.
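
For reference, the recipe most LLaMA front-ends settle on combines all of these knobs; a sketch of it (standard technique, not tinygrad's actual sampler):

    import torch

    def sample(logits, prev_tokens, temperature=0.8, top_k=40, top_p=0.95, rep_penalty=1.1):
        logits = logits.clone()                       # 1-D tensor over the vocabulary
        for t in set(prev_tokens):                    # repetition penalty
            logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
        logits = logits / temperature
        if top_k > 0:                                 # keep only the k most likely tokens
            kth = torch.topk(logits, top_k).values[-1]
            logits[logits < kth] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        # top-p: drop tokens once the higher-ranked ones already cover p of the mass
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        drop = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
        probs[sorted_idx[drop]] = 0.0
        probs = probs / probs.sum()
        return torch.multinomial(probs, 1).item()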



