
To expand on the UTF-8 point: assuming you are using "UCP"-style definitions of whitespace, where you start looking for "Medium Mathematical Space" (code point U+205F), "Ideographic Space" (code point U+3000), and a good couple dozen other friendly spaces, you can do a trick similar to what I describe, if you really think SIMD is going to pay off here (hey, maybe you're doing a lot of "Six-Per-Em Space" processing). It's still potentially doable in SIMD, and the alternative you'd be comparing against is, well, what? One char at a time?

You will need more shuffles and/or more buckets.
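
To make "more buckets" a bit more concrete, here is a rough scalar sketch in Python, not real SIMD and not necessarily the exact trick from the parent comment. It assumes valid UTF-8 input and the Unicode White_Space set, and it treats each bucket as a product of allowed first, second, and third bytes. The names (`WHITESPACE_BUCKETS`, `build_tables`, `whitespace_starts`) are made up for illustration. Each byte position gets one table lookup, and the three results are ANDed together, which is roughly the shape a shuffle-based version would take.

```python
# Each bucket is (allowed byte 0, allowed byte 1, allowed byte 2).
# None means "don't care", used for sequences shorter than 3 bytes.
WHITESPACE_BUCKETS = [
    ({0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}, None, None),  # ASCII whitespace
    ({0xC2}, {0x85, 0xA0}, None),                         # U+0085, U+00A0
    ({0xE1}, {0x9A}, {0x80}),                             # U+1680
    ({0xE2}, {0x80},                                      # U+2000..U+200A,
     set(range(0x80, 0x8B)) | {0xA8, 0xA9, 0xAF}),        # U+2028, U+2029, U+202F
    ({0xE2}, {0x81}, {0x9F}),                             # U+205F
    ({0xE3}, {0x80}, {0x80}),                             # U+3000
]


def build_tables(buckets):
    """Build one 256-entry bitmask table per byte position (0, 1, 2).

    Bit k of tables[pos][b] is set when byte value b is acceptable at
    position pos for bucket k.
    """
    tables = [[0] * 256 for _ in range(3)]
    for bit, sets in enumerate(buckets):
        for pos, allowed in enumerate(sets):
            for b in range(256):
                if allowed is None or b in allowed:
                    tables[pos][b] |= 1 << bit
    return tables


T0, T1, T2 = build_tables(WHITESPACE_BUCKETS)


def whitespace_starts(data):
    """Return byte offsets where a whitespace code point starts.

    Assumes `data` is valid UTF-8. Two zero bytes of padding let the
    last positions look ahead without running off the end.
    """
    padded = data + b"\x00\x00"
    return [
        i
        for i in range(len(data))
        # A position matches if some bucket accepts all three bytes.
        if T0[padded[i]] & T1[padded[i + 1]] & T2[padded[i + 2]]
    ]


if __name__ == "__main__":
    sample = "a b\u205fc\u3000d\u2006e".encode("utf-8")
    print(whitespace_starts(sample))
```

In an actual SIMD version the 256-entry tables would presumably be replaced by 16-entry nibble lookups (PSHUFB-style), and a set like the ASCII one does not factor into a single low-nibble/high-nibble product, so it would have to be split. That is one place the extra shuffles and buckets come from.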



