
Vicuna's GitHub says that applying the delta takes 60GB of CPU RAM? Is that what you meant by a large swap file?

On that note, why is any RAM needed? Can't the files be loaded and diffed chunk by chunk?
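For illustration, here is a rough sketch of what a chunked merge could look like if both checkpoints were split into matching shards (the shard count and file names below are made up, not Vicuna's actual layout or scripts):

    # Only one shard pair is resident in RAM at a time.
    import torch

    SHARDS = 3  # assumed shard count
    for i in range(1, SHARDS + 1):
        base = torch.load(f"llama-13b-shard-{i:05d}.bin", map_location="cpu")
        delta = torch.load(f"vicuna-delta-shard-{i:05d}.bin", map_location="cpu")
        merged = {name: base[name] + delta[name] for name in delta}
        torch.save(merged, f"vicuna-13b-shard-{i:05d}.bin")
        del base, delta, merged  # free before loading the next shard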

Edit: The docs for running Koala (a similar model) locally say this (about converting LLaMA to Koala):

>To facilitate training very large language models that do not fit into the main memory of a single machine, EasyLM adopts a streaming format for model checkpoints. The streaming checkpoint format is implemented in checkpoint.py. During checkpointing, the StreamingCheckpointer simply flattens a nested state dictionary into a single-level dictionary and streams the key-value pairs to a file one by one using messagepack. Because it streams the tensors one by one, the checkpointer only needs to gather one tensor from the distributed accelerators into main memory at a time, hence saving a lot of memory.

https://github.com/young-geng/EasyLM/blob/main/docs/checkpoi...

https://github.com/young-geng/EasyLM/blob/main/docs/koala.md

Presumably the same technique can be used with Vicuna.
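To make the idea concrete, here is a minimal sketch of that kind of streaming checkpoint (this is not EasyLM's actual code; see checkpoint.py in the repo for the real StreamingCheckpointer, and the function names here are made up):

    import msgpack
    import numpy as np

    def save_streaming(flat_state, path):
        """Write a flat {name: array} dict to disk one tensor at a time."""
        with open(path, "wb") as f:
            packer = msgpack.Packer()
            for name, tensor in flat_state.items():
                arr = np.asarray(tensor)
                # only this single tensor has to be materialized in host memory
                f.write(packer.pack((name, (arr.dtype.str, list(arr.shape), arr.tobytes()))))

    def load_streaming(path):
        """Yield (name, array) pairs back, again one tensor at a time."""
        with open(path, "rb") as f:
            for name, (dtype, shape, buf) in msgpack.Unpacker(f, raw=False):
                yield name, np.frombuffer(buf, dtype=dtype).reshape(shape)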



btw I got 4-bit quantized Vicuna working on my 16GB laptop and the results seem very good, perhaps the best I've gotten running locally so far


Did you have to diff LLaMA? Did you use EasyLM?


I found it ready-made for download, here https://huggingface.co/eachadea/ggml-vicuna-13b-4bit
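For reference, a hedged example of running a ggml 4-bit file like that from Python via the llama-cpp-python bindings (an assumption on my part; the commenter may have used llama.cpp directly, and the model file name below is illustrative):

    from llama_cpp import Llama

    # a 4-bit 13B ggml file is roughly 8GB, so it fits comfortably in 16GB of RAM
    llm = Llama(model_path="ggml-vicuna-13b-4bit.bin", n_ctx=2048)
    out = llm("### Human: Explain what a swap file is.\n### Assistant:", max_tokens=128)
    print(out["choices"][0]["text"])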



