Vicuna's GitHub says that applying the delta takes 60GB of CPU RAM? Is that what you meant by a large swap file?
On that note, why is any RAM needed? Can't the files be loaded and diffed chunk by chunk?
Edit: The docs for running Koala (a similar model) locally say this (about converting LLaMA to Koala):
>To facilitate training very large language models that does not fit into the main memory of a single machine, EasyLM adopt a streaming format of model checkpoint. The streaming checkpointing format is implemented in checkpoint.py. During checkpointing, the StreamingCheckpointer simply flatten a nested state dictionary into a single level dictionary, and stream the key, value pairs to a file one by one using messagepack. Because it streams the tensors one by one, the checkpointer only needs to gather one tensor from the distributed accelerators to the main memory at a time, hence saving a lot of memory.
https://github.com/young-geng/EasyLM/blob/main/docs/checkpoi...
https://github.com/young-geng/EasyLM/blob/main/docs/koala.md
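
As a rough sketch of the flatten-and-stream idea the quote describes (this is not EasyLM's actual on-disk format or API, just an illustration using msgpack and numpy): the writer only ever holds one tensor in host memory at a time, and the reader yields tensors one by one instead of materializing the whole state dict:

    import msgpack
    import numpy as np

    def flatten(tree, prefix=""):
        # Flatten a nested dict of arrays into (dotted_key, array) pairs.
        for k, v in tree.items():
            key = k if not prefix else prefix + "." + k
            if isinstance(v, dict):
                yield from flatten(v, key)
            else:
                yield key, v

    def save_streaming(tree, path):
        # Pack and write one (key, tensor) record at a time, so only a
        # single tensor has to sit in host memory during checkpointing.
        packer = msgpack.Packer()
        with open(path, "wb") as f:
            for key, arr in flatten(tree):
                arr = np.asarray(arr)
                f.write(packer.pack((key, str(arr.dtype), list(arr.shape),
                                     arr.tobytes())))

    def load_streaming(path):
        # Yield (key, tensor) pairs one by one instead of loading it all.
        with open(path, "rb") as f:
            for key, dtype, shape, buf in msgpack.Unpacker(
                    f, max_buffer_size=2**31 - 1):
                yield key, np.frombuffer(buf, dtype=dtype).reshape(shape)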
Presumably the same technique can be used with Vicuna.
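
If the base LLaMA weights and the delta were both stored in a streamable format like the one sketched above, the delta could indeed be applied one tensor at a time. (The actual Vicuna delta-applying script works on full HuggingFace checkpoints loaded into memory, so this is only a hypothetical; it reuses load_streaming from the sketch above.)

    import msgpack

    def apply_delta_streaming(base_path, delta_path, out_path):
        # Walk both streamed checkpoints in lockstep, add each delta tensor
        # to the matching base tensor, and immediately write the result.
        # Peak memory is roughly one tensor from each file, not two models.
        packer = msgpack.Packer()
        with open(out_path, "wb") as out:
            for (key_b, base), (key_d, delta) in zip(
                    load_streaming(base_path), load_streaming(delta_path)):
                assert key_b == key_d, f"layout mismatch: {key_b} vs {key_d}"
                merged = base + delta  # Vicuna = LLaMA weights + delta
                out.write(packer.pack((key_b, str(merged.dtype),
                                       list(merged.shape), merged.tobytes())))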