Indeed, no extra synchronisation is needed here due to the nature of the hardware (threads in a warp can't get out of sync with each other).

Even on NVIDIA, you could've written this without the asm, as discussed above!



Yeah, I think __all_sync was added to CUDA as an intrinsic after this snippet was written. The divergent code before this was plain-ish CUDA, and this snippet ensures the threads wait on the comparison vote before recursing.
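
Roughly, the pattern looks like the sketch below (not the original snippet; the kernel body and names are made up for illustration). Each lane does its comparison, the warp votes on it with __all_sync (CUDA 9+, a full-warp mask assumed), and only then does the code recurse:

    __device__ void refine(float* vals, int depth)
    {
        // per-lane comparison
        bool converged = fabsf(vals[threadIdx.x]) < 1e-6f;

        // warp-wide vote on the comparison; this intrinsic replaces the
        // inline-PTX vote and makes every lane wait on the vote result
        // before anything recurses
        if (__all_sync(0xffffffffu, converged))
            return;

        vals[threadIdx.x] *= 0.5f;   // stand-in for the divergent work
        refine(vals, depth + 1);     // recurse only after the vote settles
    }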

So in the AMD version, the compiler correctly realized that the synchronization was on the comparison, and adds the AMD equivalent right before it. That seems like a straightforward transform here (a rough HIP-side sketch follows below).

It'd be interesting to understand which NVIDIA primitives map over and which don't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, so I'd expect most of our PTX to port for similar reasons. The story is a bit different for the libraries we call. E.g., cuDF probably has little compute-tier PTX directly, but it will call NVIDIA libs and use weird IO bits like cuFile / GPU Direct Storage.
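
For reference, the HIP side of that transform would look roughly like this (a sketch under the same assumptions as the CUDA version above, not anyone's actual code): HIP exposes the wavefront vote as __all, with no mask argument, and the toolchain can place whatever synchronisation it needs right before the comparison, as described above.

    __device__ void refine(float* vals, int depth)
    {
        // per-lane comparison
        bool converged = fabsf(vals[threadIdx.x]) < 1e-6f;

        // wavefront-wide vote on the comparison (HIP's __all); the compiler
        // can insert the AMD-side synchronisation just before this point
        if (__all(converged))
            return;

        vals[threadIdx.x] *= 0.5f;   // stand-in for the divergent work
        refine(vals, depth + 1);     // recurse only after the vote settles
    }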



