They also did bandwidth scaling to handle work around the nerfed H800 interconnects.
> efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths
> The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component.
> efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths
> The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component.
(I know some of those words)
https://arxiv.org/html/2412.19437v1