I really like that more folks are looking at the vector instructions in various ARM chips, but I worry about gross generalizations like "However, for many problems, the total lack of a movemask instruction, and the weakness of the equivalent to the pshufb, might limit the scope of what is possible with ARM NEON."
I would much prefer taking each of the various 'cpu eater' type applications and looking at it as a unit. So pick DSP, or convolution, or bin packing, or list searching, and look at what you can do vs. what needs help. A lot of the improvements in the x86 architecture came about because a design engineer at Intel or AMD read a clear statement of the problem and the challenges with the solution.
That said, and to justinjlynn's comment, I would really love a top-to-bottom 'Unicode string processing on current processors' write-up. So many people are doing nothing but scripting these days that string hacking is a big part of what interpreters do (rather than, say, raw floating point performance back in the day).
The problem with that is that Unicode string processing is highly data-dependent. You may not get good speedups, because most of the performance improvement from putting things in hardware comes from parallelism in the data path, and data-dependent control flow limits how much of that parallelism you can actually use.
A good approach to getting speedups is probably to assume ASCII, vectorize the hell out of that, and fall back to multibyte processing where that fails. In UTF-8, checking for multibyte characters comes down to checking whether the high bit of any byte is set, which should be fast and easy for the branch predictor to deal with.
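To make the "assume ASCII, fall back otherwise" idea concrete, here is a rough sketch in portable C. It is not NEON or SSE code: it ORs the input together 8 bytes at a time as a stand-in for a real vector loop, then tests the high bits once at the end. The names `is_all_ascii` and `count_code_points` are made up for illustration, and the fallback assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every byte in s[0..n) has its
 * high bit clear (i.e. the buffer is pure ASCII), else 0. */
static int is_all_ascii(const unsigned char *s, size_t n)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t acc = 0;
    size_t i = 0;

    assert(s != NULL || n == 0);

    /* Main loop: OR together 8 bytes per iteration. memcpy keeps the
     * load well-defined for unaligned input. */
    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);
        acc |= word;
    }

    /* Tail: fold in the remaining 0..7 bytes one at a time. */
    for (; i < n; i++)
        acc |= s[i];

    /* Any set high bit means at least one non-ASCII byte was seen. */
    return (acc & high_bits) == 0;
}

/* Hypothetical example of the fast path / fallback split: count code
 * points in a buffer that is assumed to be valid UTF-8. */
static size_t count_code_points(const unsigned char *s, size_t n)
{
    size_t count = 0;

    if (is_all_ascii(s, n))
        return n; /* fast path: one byte per code point */

    /* Slow path: count every byte that is not a continuation
     * byte (continuation bytes look like 10xxxxxx). */
    for (size_t i = 0; i < n; i++)
        count += (s[i] & 0xC0) != 0x80;

    return count;
}
```

The branch on `is_all_ascii` is taken once per buffer rather than once per byte, which is the part that should be kind to the branch predictor. A real implementation would likely check in chunks so that one stray multibyte character does not push the whole buffer onto the slow path.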