
It's a sparse (mixture-of-experts) model, which means you only need to load the "experts" that are actually selected for a given input.
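A minimal sketch of the idea: a gating network picks the top-k experts per token, and only those experts' weights ever need to be resident. All names here are illustrative, not from any real framework; the LRU cache stands in for keeping recently used experts in VRAM.

```python
from functools import lru_cache
import random

NUM_EXPERTS = 8   # toy value; real MoE models have far more
TOP_K = 2         # experts activated per token

@lru_cache(maxsize=4)            # "VRAM": only recently used experts stay loaded
def load_expert(idx: int):
    # Stand-in for paging an expert's weights (e.g. ~400 MB) in from disk.
    return {"id": idx, "weights": [idx] * 3}

def route(token: str) -> list[int]:
    # Stand-in gating network: deterministically pick top-k expert indices.
    rng = random.Random(sum(ord(c) for c in token))
    return rng.sample(range(NUM_EXPERTS), TOP_K)

def forward(token: str) -> list[int]:
    experts = [load_expert(i) for i in route(token)]
    # Only the selected experts' weights were touched for this token.
    return [e["id"] for e in experts]

print(forward("hello"))
```

The point is that `load_expert` is only ever called for the routed indices, so the working set per token is k experts, not all of them.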



But you still need to page in the weights from disk to the GPU at each layer, right?


You should only need the weights for the experts you want to run. The experts clock in at around 400 MB each (based on the 800 GB figure given elsewhere). A 24 GB GPU could fit around 60 experts, so it might be usable with a couple of old M40s.



