There is a more glaring issue: ROCm doesn't even work well on most AMD devices nowadays, and HIP performance deteriorates on the same hardware compared to ROCm.
If you want to write a very efficient CUDA kernel for a modern datacenter NVIDIA GPU (read: H100), you need to write it with the hardware in mind (and preferably in hand, since H100 and RTX 4090 behave very differently in practice). So I don't think the difference between AMD and NVIDIA is as big as everyone perceives.
https://github.com/ROCm/HIPIFY