To expand on the UTF-8 point - assuming you are using "UCP"-type definitions of whitespace, where you start looking for "Medium Mathematical Space" (codepoint 0x205f), "Ideographic Space" (codepoint 0x3000) and a good couple dozen other friendly spaces - you can do a trick similar to what I describe, if you really think that SIMD is going to pay off here (hey, maybe you're doing a lot of "Six-Per-Em Space" processing). It's still potentially doable in SIMD, and the alternative you'd be comparing against is, well, what? One character at a time?

You will need more shuffles and/or more buckets.
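
To make the "more buckets" point a bit more concrete, here is a rough Python sketch (not SIMD code, just bookkeeping) that groups the UTF-8 encodings of the whitespace codepoints by their lead byte. I'm assuming the Unicode White_Space property is the "UCP" definition in play; your regex engine may use a slightly different set. The names `WHITE_SPACE` and `buckets_by_lead_byte` are made up for illustration.

```python
from collections import defaultdict

# Assumption: "UCP whitespace" means the Unicode White_Space property
# (25 codepoints). A given engine may define it differently.
WHITE_SPACE = (
    list(range(0x0009, 0x000E))          # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]   # SPACE, NEL, NBSP, OGHAM SPACE MARK
    + list(range(0x2000, 0x200B))        # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def buckets_by_lead_byte(codepoints):
    """Group UTF-8 encodings by lead byte (hypothetical helper).

    Single-byte (ASCII) encodings all go into one "ascii" bucket;
    multi-byte encodings are keyed by their first byte, e.g. "0xE2".
    """
    groups = defaultdict(list)
    for cp in codepoints:
        encoded = chr(cp).encode("utf-8")
        key = "ascii" if len(encoded) == 1 else f"0x{encoded[0]:02X}"
        groups[key].append(encoded)
    return dict(groups)


if __name__ == "__main__":
    for key, encodings in buckets_by_lead_byte(WHITE_SPACE).items():
        print(key, [e.hex(" ") for e in encodings])
```

Beyond the ASCII bucket, the multi-byte spaces land under four lead bytes (0xC2, 0xE1, 0xE2, 0xE3), with most of them under 0xE2. So a SIMD classifier has to look at the lead byte plus one or two continuation bytes, which is where the extra shuffles and buckets come from.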