
MoE reduces compute cost for inference at scale, but not for training. You still have to train the whole model (plus the router).
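For a rough sense of the inference-side saving (illustrative numbers only, assuming a Mixtral-style layout with 8 experts per layer and top-2 routing):

    # Back-of-the-envelope, approximate Mixtral-8x7B-style numbers
    total_params  = 46.7e9   # all experts + attention + embeddings
    active_params = 12.9e9   # top-2 experts per layer + shared weights
    print(f"per-token FLOPs scale with ~{active_params / total_params:.0%} of the total parameters")

So a dense model of the same total size would cost roughly 3-4x more per token at inference time.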



It’s absolutely beneficial when training because the forward pass and backpropagation still only touch the experts that were activated (rough sketch below).

The Mistral guys specifically mention that training speed (due to not needing as much compute) was one of the reasons Mixtral was released so soon after Mistral 7B.
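A minimal sketch of what "only the activated experts" means (PyTorch-flavored, names and shapes are mine, not Mixtral's actual code):

    import torch, torch.nn as nn, torch.nn.functional as F

    class MoEFFN(nn.Module):
        # One transformer block's FFN slot, replaced by n_experts small FFNs plus a router.
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)])
            self.router = nn.Linear(d_model, n_experts)
            self.top_k = top_k

        def forward(self, x):                       # x: (tokens, d_model)
            weights, idx = self.router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                tok, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
                if tok.numel() == 0:
                    continue    # expert untouched this step: no forward compute, no gradient
                out[tok] += weights[tok, slot, None] * expert(x[tok])
            return out

Experts that receive no tokens in a batch do no forward work and get no gradient that step, which is where the training-time saving (and the need for load-balancing losses) comes from.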


With an MoE you only need to train a smaller model, which you can then combine into an x8 and finetune/train the router. Mistral used their 7B base to make Mixtral, Qwen's new MoE uses their 1.8B model upscaled to 2.7B, and pretty sure Grok also trained a smaller model first.


Very incorrect! The "8x7B" in the name regularly confuses people into some similar conclusion, but there are not eight 7B "experts" in Mixtral 8x7B. It's more apt to think of all 256 FFNs (8 experts per layer x 32 layers) as the "experts," since each expert FFN on a given layer has no relation to the expert FFNs on other layers. You need to train them all within the MoE architecture; combining existing models ("clown car MoE") works, but isn't gaining anything from the architecture/sparsity.
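To make the "256 FFNs" point concrete (hypothetical naming, just to show that experts live per layer rather than as eight standalone 7B models):

    # Each decoder layer has its own router and its own 8 expert FFNs;
    # an expert in layer 0 has nothing to do with any expert in layer 1.
    n_layers, n_experts = 32, 8        # Mixtral-8x7B-style counts
    layers = [
        {"attention": f"layer{i}.attn (shared)",
         "router":    f"layer{i}.router",
         "experts":   [f"layer{i}.expert{e}.ffn" for e in range(n_experts)]}
        for i in range(n_layers)
    ]
    print(sum(len(layer["experts"]) for layer in layers))   # 256 expert FFNs total

Attention and embeddings are shared rather than replicated per expert, which is also why the total parameter count is roughly 47B rather than 8 x 7B = 56B.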


Sorry, could you expand on this a bit further? Are you saying that for an MoE, you want to train the exact same model and then just finetune the feed-forward networks differently for each of them? And you're saying that separately training 8 different models would not be efficient - do we have evidence for that?


You're only correct about Qwen's MoE. I presume that Chinese model builders feel more pressure to be efficient about using their GPU time because of sanctions.



