
MoE reduces compute cost for inference at scale, but not for training. You still have to train the whole model (plus the router).
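For a rough sense of the inference-side saving (illustrative numbers only, assuming a Mixtral-style layout with 8 experts per layer and top-2 routing):

    # Back-of-the-envelope, approximate Mixtral-8x7B-style numbers
    total_params  = 46.7e9   # all experts + attention + embeddings
    active_params = 12.9e9   # top-2 experts per layer + shared weights
    print(f"per-token FLOPs scale with ~{active_params / total_params:.0%} of the total parameters")

So a dense model of the same total size would cost roughly 3-4x more per token at inference time.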



It’s absolutely beneficial when training because the forward pass and backpropagation still only touch the experts that were activated (rough sketch below).

The Mistral guys specifically mention that training speed (due to not needing as much compute) was one of the reasons Mixtral was released so soon after Mistral 7B.
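A minimal sketch of what "only the activated experts" means (PyTorch-flavored, names and shapes are mine, not Mixtral's actual code):

    import torch, torch.nn as nn, torch.nn.functional as F

    class MoEFFN(nn.Module):
        # One transformer block's FFN slot, replaced by n_experts small FFNs plus a router.
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)])
            self.router = nn.Linear(d_model, n_experts)
            self.top_k = top_k

        def forward(self, x):                       # x: (tokens, d_model)
            weights, idx = self.router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                tok, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
                if tok.numel() == 0:
                    continue    # expert untouched this step: no forward compute, no gradient
                out[tok] += weights[tok, slot, None] * expert(x[tok])
            return out

Experts that receive no tokens in a batch do no forward work and get no gradient that step, which is where the training-time saving (and the need for load-balancing losses) comes from.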


With an MoE you only need to train a smaller model, which you can then combine into an x8 and finetune/train the router. Mistral used their 7B base to make Mixtral, Qwen's new MoE uses their 1.8B model upscaled to 2.7B, and pretty sure Grok also trained a smaller model first.


Very incorrect! The "8x7B" in the name regularly confuses people into some similar conclusion, but there are not eight 7B "experts" in Mixtral 8x7B. It's more apt to think of all 256 FFNs (8 experts per layer x 32 layers) as the "experts," since each expert FFN on a given layer has no relation to the expert FFNs on other layers. You need to train them all within the MoE architecture; combining existing models ("clown car MoE") works, but isn't gaining anything from the architecture/sparsity.
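To make the "256 FFNs" point concrete (hypothetical naming, just to show that experts live per layer rather than as eight standalone 7B models):

    # Each decoder layer has its own router and its own 8 expert FFNs;
    # an expert in layer 0 has nothing to do with any expert in layer 1.
    n_layers, n_experts = 32, 8        # Mixtral-8x7B-style counts
    layers = [
        {"attention": f"layer{i}.attn (shared)",
         "router":    f"layer{i}.router",
         "experts":   [f"layer{i}.expert{e}.ffn" for e in range(n_experts)]}
        for i in range(n_layers)
    ]
    print(sum(len(layer["experts"]) for layer in layers))   # 256 expert FFNs total

Attention and embeddings are shared rather than replicated per expert, which is also why the total parameter count is roughly 47B rather than 8 x 7B = 56B.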


Sorry, could you expand on this a bit further? Are you saying that for an MoE, you want to train the exact same model and then just finetune the feed-forward networks differently for each of them? And you're saying that separately training 8 different models would not be efficient - do we have evidence for that?


You're only correct about Qwen's MoE. I presume that Chinese model builders feel more pressure to be efficient about using their GPU time because of sanctions.



