Vicuna's GitHub says that applying the delta takes 60GB of CPU RAM? Is that what you meant by a large swap file?
On that note, why is any RAM needed? Can't the files be loaded and diffed chunk by chunk?
Edit: The docs for running Koala (a similar model) locally say this (about converting LLaMA to Koala):
>To facilitate training very large language models that does not fit into the main memory of a single machine, EasyLM adopt a streaming format of model checkpoint. The streaming checkpointing format is implemented in checkpoint.py. During checkpointing, the StreamingCheckpointer simply flatten a nested state dictionary into a single level dictionary, and stream the key, value pairs to a file one by one using messagepack. Because it streams the tensors one by one, the checkpointer only needs to gather one tensor from the distributed accelerators to the main memory at a time, hence saving a lot of memory.
https://github.com/young-geng/EasyLM/blob/main/docs/checkpoi...
https://github.com/young-geng/EasyLM/blob/main/docs/koala.md
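
As a rough sketch of the flatten-and-stream idea the quote describes (this is not EasyLM's actual on-disk format or API, just an illustration using msgpack and numpy): the writer only ever holds one tensor in host memory at a time, and the reader yields tensors one by one instead of materializing the whole state dict:

    import msgpack
    import numpy as np

    def flatten(tree, prefix=""):
        # Flatten a nested dict of arrays into (dotted_key, array) pairs.
        for k, v in tree.items():
            key = k if not prefix else prefix + "." + k
            if isinstance(v, dict):
                yield from flatten(v, key)
            else:
                yield key, v

    def save_streaming(tree, path):
        # Pack and write one (key, tensor) record at a time, so only a
        # single tensor has to sit in host memory during checkpointing.
        packer = msgpack.Packer()
        with open(path, "wb") as f:
            for key, arr in flatten(tree):
                arr = np.asarray(arr)
                f.write(packer.pack((key, str(arr.dtype), list(arr.shape),
                                     arr.tobytes())))

    def load_streaming(path):
        # Yield (key, tensor) pairs one by one instead of loading it all.
        with open(path, "rb") as f:
            for key, dtype, shape, buf in msgpack.Unpacker(
                    f, max_buffer_size=2**31 - 1):
                yield key, np.frombuffer(buf, dtype=dtype).reshape(shape)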
Presumably the same technique can be used with Vicuna.
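
If the base LLaMA weights and the delta were both stored in a streamable format like the one sketched above, the delta could indeed be applied one tensor at a time. (The actual Vicuna delta-applying script works on full HuggingFace checkpoints loaded into memory, so this is only a hypothetical; it reuses load_streaming from the sketch above.)

    import msgpack

    def apply_delta_streaming(base_path, delta_path, out_path):
        # Walk both streamed checkpoints in lockstep, add each delta tensor
        # to the matching base tensor, and immediately write the result.
        # Peak memory is roughly one tensor from each file, not two models.
        packer = msgpack.Packer()
        with open(out_path, "wb") as out:
            for (key_b, base), (key_d, delta) in zip(
                    load_streaming(base_path), load_streaming(delta_path)):
                assert key_b == key_d, f"layout mismatch: {key_b} vs {key_d}"
                merged = base + delta  # Vicuna = LLaMA weights + delta
                out.write(packer.pack((key_b, str(merged.dtype),
                                       list(merged.shape), merged.tobytes())))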