
It's a sparse (mixture-of-experts) model, which means you only need to load the "experts" that are actually selected for a given input.
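A minimal sketch of the idea: a gating network picks the top-k experts per token, and only those experts' weights ever need to be resident. All names here are illustrative, not from any real framework; the LRU cache stands in for keeping recently used experts in VRAM.

```python
from functools import lru_cache
import random

NUM_EXPERTS = 8   # toy value; real MoE models have far more
TOP_K = 2         # experts activated per token

@lru_cache(maxsize=4)            # "VRAM": only recently used experts stay loaded
def load_expert(idx: int):
    # Stand-in for paging an expert's weights (e.g. ~400 MB) in from disk.
    return {"id": idx, "weights": [idx] * 3}

def route(token: str) -> list[int]:
    # Stand-in gating network: deterministically pick top-k expert indices.
    rng = random.Random(sum(ord(c) for c in token))
    return rng.sample(range(NUM_EXPERTS), TOP_K)

def forward(token: str) -> list[int]:
    experts = [load_expert(i) for i in route(token)]
    # Only the selected experts' weights were touched for this token.
    return [e["id"] for e in experts]

print(forward("hello"))
```

The point is that `load_expert` is only ever called for the routed indices, so the working set per token is k experts, not all of them.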



But you still need to page in the weights from disk to the GPU at each layer, right?


You should only need the weights for the experts you want to run. The experts clock in at around 400 MB each (based on the 800 GB figure given elsewhere). A 24 GB GPU could fit around 60 experts, so it might be usable with a couple of old M40s.



