sp332 on Nov 20, 2023 | on: Switch Transformers C – 2048 experts (1.6T params ...)
It's a sparse mixture-of-experts model, which means you only need to load the "experts" the router actually selects for a given input.
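A minimal sketch of the idea in plain NumPy (not the actual Switch Transformers code): with top-1 "Switch" routing, each token goes to exactly one expert, so only the experts that get picked for this batch ever need their weights touched. The in-memory `experts` dict standing in for per-expert weight files is an assumption for illustration.

    # Top-1 routing sketch: only the selected experts' weights are used.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, n_tokens = 64, 8, 16

    # Hypothetical expert store: imagine each matrix lives on disk until needed.
    experts = {i: rng.standard_normal((d_model, d_model)) for i in range(n_experts)}

    tokens = rng.standard_normal((n_tokens, d_model))
    router = rng.standard_normal((d_model, n_experts))

    choice = np.argmax(tokens @ router, axis=-1)  # top-1 expert per token
    needed = set(choice.tolist())                 # often far fewer than n_experts
    print(f"batch routed to {len(needed)}/{n_experts} experts: {sorted(needed)}")

    out = np.empty_like(tokens)
    for i in needed:                              # load/compute only these experts
        mask = choice == i
        out[mask] = tokens[mask] @ experts[i]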
ntonozzi on Nov 20, 2023
But you still need to page in the weights from disk to the GPU at each layer, right?
juliangoldsmith on Nov 21, 2023
You should only need the weights for the experts you want to run. The experts clock in at around 400 MB each (the 800 GB figure given elsewhere, split across 2048 experts). A 24 GB GPU could fit around 60 experts, so it might be usable with a couple of old M40s.
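A back-of-the-envelope check of those figures (the 800 GB total comes from elsewhere in the thread and the 2048 experts from the title; both are thread claims, not measured numbers):

    total_gb, n_experts, gpu_gb = 800, 2048, 24
    per_expert_mb = total_gb * 1024 / n_experts    # = 400 MB per expert
    fits = int(gpu_gb * 1024 // per_expert_mb)     # ~61 experts in 24 GB
    print(f"{per_expert_mb:.0f} MB/expert, ~{fits} experts fit in {gpu_gb} GB")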