What is interesting here is that, in their current implementations, they aren't very beneficial [1][2].
[1] https://arxiv.org/pdf/1806.05713.pdf [2] https://www.sciencedirect.com/topics/computer-science/scatte... (recommends using these instructions outside of the main loop)
I vaguely remember that the first implementations of scatter/gather instructions were no faster than sequentially accessing the separate memory locations.
Because of that, AMD's much higher core count may come in handy: each thread will have less memory to access.
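To make the comparison concrete, here is a minimal sketch in plain C of the "sequential access" baseline, i.e. what a gather computes when written as ordinary scalar loads. The name gather_scalar is hypothetical and only for illustration; it is not taken from either reference.

```c
#include <stddef.h>

/* Scalar equivalent of a gather: dst[i] = base[idx[i]] for every lane.
   gather_scalar is a hypothetical name used only for illustration. */
static void gather_scalar(float *dst, const float *base,
                          const int *idx, size_t lanes)
{
    for (size_t i = 0; i < lanes; ++i)
        dst[i] = base[idx[i]];   /* one independent indexed load per lane */
}
```

A hardware gather such as AVX2's _mm256_i32gather_ps produces the same result for 8 float lanes in a single instruction, but every lane still needs its own memory access, which is one plausible reason the early implementations did not beat a loop like this.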