That way, at inference time you get the speed of a 36B-param model because you're only "using" 36B params per token, but the next token might (and frequently does) need a different set of experts than the one before it. If that new set of experts is already loaded (i.e. you preloaded all 132B params into GPU VRAM), there's no overhead, and you just keep running at 36B speed no matter which experts get picked.
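To make that concrete, here's a toy sketch of what a MoE layer's router does per token. All the numbers are made up (16 experts, top-4 per token, tiny hidden size) and it's pure numpy, so it's not any real model's code, just the shape of the idea:

    # Toy per-token expert routing in a MoE layer (hypothetical sizes).
    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, top_k, d_model = 16, 4, 64
    router_w = rng.standard_normal((d_model, n_experts))            # router weights
    experts  = rng.standard_normal((n_experts, d_model, d_model))   # one weight matrix per expert

    def moe_layer(x):
        """x: (d_model,) hidden state for ONE token."""
        scores = x @ router_w                     # one router logit per expert
        chosen = np.argsort(scores)[-top_k:]      # top-k experts for THIS token
        gates  = np.exp(scores[chosen])
        gates /= gates.sum()                      # softmax over the chosen experts
        # Only the chosen experts' weights are touched -- that's the "36B active" part.
        out = sum(g * (experts[e] @ x) for g, e in zip(gates, chosen))
        return out, chosen

    for t in range(3):                            # three consecutive "tokens"
        token = rng.standard_normal(d_model)
        _, chosen = moe_layer(token)
        print(f"token {t}: experts {sorted(chosen.tolist())}")

Run it and the three tokens will typically print different expert sets, which is exactly why all 16 experts' weights need to already be resident if you don't want to reload weights between tokens.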
You could theoretically load in only 36B at a time, but then you'd be severely bottlenecked by having to reload those 36B params, potentially for every new token! Even on top-of-the-line consumer GPUs that would slow you down to ~seconds per token instead of tokens per second :)
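Rough back-of-envelope for why, assuming fp16 weights (2 bytes/param) and something like a PCIe 4.0 x16 link at roughly 32 GB/s in one direction:

    active_params   = 36e9   # params touched per token
    bytes_per_param = 2      # fp16
    pcie_bandwidth  = 32e9   # bytes/s, rough PCIe 4.0 x16 figure

    seconds_per_token = active_params * bytes_per_param / pcie_bandwidth
    print(seconds_per_token)  # ~2.25 s per token, just moving weights

That's ~2 seconds per token spent purely on shuttling weights over the bus, before any actual compute, which is why everyone keeps the full 132B resident instead.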