It's not the vector instructions, it's the careful scheduling of instructions so that you spend only as much time as necessary manipulating pointers when what you want is to crunch actual data, all while respecting dependency chains and memory stall times. (Hyperthreading helps a lot with the latter; see Nvidia Maxas (Nervana Systems now) for details on how a flexible number of threads lets you weigh hiding memory load stalls against register pressure, which causes more data shuffling.)
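
To make the dependency-chain and pointer-overhead point concrete, here is a minimal scalar sketch in C. It isn't taken from Maxas or any real kernel; the function names `sum_naive` and `sum_unrolled` and the unroll factor of 4 are just assumptions for illustration. The first loop has one long chain of dependent adds and updates the index once per element; the second keeps four independent accumulators and updates the index once per four elements.

```c
#include <stddef.h>

/* Naive sum: each add depends on the previous one (a single long
   dependency chain), and the index is updated once per element. */
float sum_naive(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Unrolled by 4 (an arbitrary, illustrative factor) with independent
   accumulators: the four chains do not depend on each other, so their
   adds can overlap, and there is one index update per four elements. */
float sum_unrolled(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Combine the partial sums, then handle the leftover elements. */
    float acc = (a0 + a1) + (a2 + a3);
    for (; i < n; i++)
        acc += x[i];
    return acc;
}
```

Note that the unrolled version adds the floats in a different order, so results can differ in the last bits from the naive one; that is also why a compiler generally won't do this transformation for floating point on its own unless you relax the math rules. More accumulators also mean more live registers, which is the same stall-hiding vs. register-pressure tradeoff mentioned above.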