This is crazy insightful, thanks! I’d really love to learn how to get to this level of understanding, but can’t seem to figure out what curriculum I’d follow where I’d end up with this level of technical competence.
You need to understand how the GPU architecture works on an abstract level. Try to understand the SIMT (Single Instruction, Multiple Threads) principle.
Doing some shader programming or writing a CUDA kernel could be a nice exercise.
In a nutshell, if you want to add two vectors with a hundred elements, instead of looping from 0 to 99 you launch a function called a "kernel" (or "shader" in graphics programming) 100 times in parallel, once per thread, and each instance gets a different index.
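To make the vector-add idea concrete, here is a minimal CUDA sketch. The names `vector_add` and `add_on_gpu` and the block size of 128 are my own choices for illustration, and error checking is left out to keep it short, so treat it as a starting point rather than production code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The kernel: every thread runs this same function, but with a different index.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {            // the grid may contain more threads than elements
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: copies inputs to the GPU, launches the kernel,
// and copies the result back. Return codes are ignored for brevity.
std::vector<float> add_on_gpu(const std::vector<float>& a,
                              const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;                         // threads per block (assumed value)
    int blocks = (n + threads - 1) / threads;  // round up so every element is covered
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    std::vector<float> c(n);
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}

int main() {
    std::vector<float> a(100), b(100);
    for (int i = 0; i < 100; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * i;
    }
    std::vector<float> c = add_on_gpu(a, b);
    printf("c[0] = %.1f, c[99] = %.1f\n", c[0], c[99]);
    return 0;
}
```

Note that there is no loop over the elements anywhere: the `<<<blocks, threads>>>` launch replaces it, and the `if (i < n)` guard handles the leftover threads in the last block.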
Then research how this is realized in hardware with "warps" (NVIDIA) or "wavefronts" (on AMD, I think): groups of threads that execute the same instruction in lockstep. How the cache works is also very important here. Sadly, the information on the internet is relatively sparse on this.
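If you want to see the warp grouping for yourself, a small experiment like the following might help. It is only a sketch: `record_warps` and `warp_of_threads` are made-up names, it assumes a single one-dimensional block, and it relies on CUDA's built-in `warpSize`, which is 32 on current NVIDIA GPUs.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes down which warp it belongs to.
// In a 1D block, consecutive thread indices are packed into warps in order.
__global__ void record_warps(int* warp_id, int n) {
    int t = threadIdx.x;
    if (t < n) {
        warp_id[t] = t / warpSize;  // warpSize is a CUDA built-in variable
    }
}

// Hypothetical host helper: launches one block of n threads (n <= 1024)
// and returns the warp index of every thread. No error checking.
std::vector<int> warp_of_threads(int n) {
    int* d_warp;
    cudaMalloc(&d_warp, n * sizeof(int));
    record_warps<<<1, n>>>(d_warp, n);

    std::vector<int> warp(n);
    cudaMemcpy(warp.data(), d_warp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_warp);
    return warp;
}

int main() {
    std::vector<int> warp = warp_of_threads(100);
    // Print the warp of a few threads around the warp boundaries.
    int samples[] = {0, 31, 32, 63, 64, 95, 96, 99};
    for (int t : samples) {
        printf("thread %2d -> warp %d\n", t, warp[t]);
    }
    return 0;
}
```

With 100 threads and a warp size of 32, the hardware needs four warps, and the last one only has four active threads (96 to 99). That is why block sizes are usually chosen as a multiple of the warp size.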