I’m very curious about your experience doing audio on the GPU. What kind of worst-case latency are you able to get? Does it tend to be pretty deterministic or do you need to keep a lot of headroom for occasional latency spikes? Is the latency substantially different between integrated vs discrete GPUs?
Short answer: it has been a big pain in the butt. The GPU hardware is mostly really great, but the drivers and APIs were not designed for such a low-latency use case. Scheduling kernel execution carries a large fixed latency overhead (large by audio standards, anyway). I've had to do a lot of fun optimization to reduce the runtime of the kernel itself, and a lot of less-fun, evil dark-magic optimization to, e.g., trick macOS into raising the GPU clock speed.
Long answer: I've written a fair bit about this on my devlog. You might check out these tags:
Thanks for the extra info! I read through some of your entries on GPU optimization, and it definitely seems like it's been a journey. Thanks for blazing the trail.