It is model-dependent. I've seen NVLink help when comparing against a PCIe 3 connection, with a small batch size and no gradient accumulation.
Once you have a larger batch size and gradient accumulation, I believe DDP won't be improved much by NVLink: the all-reduce traffic for the gradients becomes small compared to your computation time.
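To make that reasoning concrete, here is a rough back-of-envelope sketch, not a benchmark. It assumes a ring all-reduce (each GPU moves about 2*(N-1)/N times the gradient size), fp32 gradients, no overlap between communication and the backward pass (so it overestimates the communication share), and that gradients are only synchronized once per accumulation cycle (in DDP that means wrapping the non-final micro-batches in `no_sync()`). The function names, bandwidth figures, model size and per-micro-batch compute time are all made-up placeholders; plug in your own measurements.

```python
def allreduce_seconds(num_params, num_gpus, bandwidth_gb_s, bytes_per_param=4):
    """Estimated time for one ring all-reduce of the gradients.

    Assumption: each GPU sends/receives 2 * (N - 1) / N times the payload,
    and bandwidth_gb_s is the effective per-GPU link bandwidth in GB/s.
    """
    payload_bytes = num_params * bytes_per_param
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (bandwidth_gb_s * 1e9)


def comm_fraction(num_params, num_gpus, bandwidth_gb_s,
                  compute_s_per_micro_batch, accum_steps=1):
    """Share of one optimizer step spent in all-reduce (no overlap assumed)."""
    comm = allreduce_seconds(num_params, num_gpus, bandwidth_gb_s)
    # One all-reduce per optimizer step, but accum_steps forward/backward passes.
    compute = compute_s_per_micro_batch * accum_steps
    return comm / (comm + compute)


# Placeholder inputs, purely illustrative (not measured values).
NUM_PARAMS = 300_000_000      # hypothetical model size
NUM_GPUS = 2
COMPUTE_S = 0.25              # hypothetical forward+backward time per micro-batch
LINKS = {"slow link (PCIe-like)": 10.0, "fast link (NVLink-like)": 100.0}  # GB/s

for accum_steps in (1, 8):
    for name, bandwidth in LINKS.items():
        share = comm_fraction(NUM_PARAMS, NUM_GPUS, bandwidth, COMPUTE_S, accum_steps)
        print(f"accum_steps={accum_steps:>2}  {name:<24} comm share = {share:.1%}")
```

The point of the sketch is the trend: a larger batch size raises the compute time per micro-batch and gradient accumulation multiplies it per all-reduce, so the communication share shrinks on both links and the gap between them matters less.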