
> we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

Interesting dig at IB. RoCE is the right solution since it is an open standard and, more importantly, available without a 52+ week lead time.



Yeah, and RoCE isn't single vendor. I'm not sure IB scales to the relevant cluster sizes, either.


Is NVLink just not scalable enough here?


I don't know. I haven't actually worked with IB in this specific space (or at all since before Nvidia acquired MLNX). My experience with RoCE/IB was for a storage cluster backend in the late 2010s.



