For a short needle and a haystack that fits in a vector register, use SSE/AVX2 string search.
For a short needle and a long haystack, use SSE to find the first byte of the needle, jumping a whole vector-register width at a time, then compare at each candidate position. If that produces too many false positives, fall back to SSE/AVX2 search for the whole needle.
For a long needle, use a Rabin-Karp rolling hash to find a likely match, then verify.
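The Rabin-Karp part, at least, is easy to sketch in scalar Go. This is just the idea, not any particular library's implementation; the multiplier (the FNV prime, which Go's stdlib happens to use too) and the exact window arithmetic are my choices:

```go
package main

import "fmt"

const primeRK = 16777619 // hash multiplier; the FNV prime is a common choice

// rabinKarp finds the first occurrence of needle in haystack by keeping a
// rolling hash over a needle-length window and only doing a full string
// comparison when the hashes collide.
func rabinKarp(haystack, needle string) int {
	n := len(needle)
	if n == 0 {
		return 0
	}
	if n > len(haystack) {
		return -1
	}
	// Hash the needle and the first window; pow = primeRK^n is used to
	// subtract out the byte that slides off the back of the window.
	var hashNeedle, h uint32
	pow := uint32(1)
	for i := 0; i < n; i++ {
		hashNeedle = hashNeedle*primeRK + uint32(needle[i])
		h = h*primeRK + uint32(haystack[i])
		pow *= primeRK
	}
	if h == hashNeedle && haystack[:n] == needle {
		return 0
	}
	for i := n; i < len(haystack); i++ {
		// Slide the window: bring in haystack[i], drop haystack[i-n].
		h = h*primeRK + uint32(haystack[i]) - pow*uint32(haystack[i-n])
		if h == hashNeedle && haystack[i+1-n:i+1] == needle {
			return i + 1 - n
		}
	}
	return -1
}

func main() {
	fmt.Println(rabinKarp("aaaaaaaaab", "ab")) // 8
}
```

The verify step matters: hash collisions are rare but possible, so a hash hit is only ever a "likely match".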
The string search optimization rabbit hole can go pretty deep, especially because different implementations might win for different values passed to the function. Like, I'm guessing Go tries that search for just the first byte because, empirically, it helps pretty often, but it's not inherently faster for every possible needle and haystack (think needle "ab" against haystack "aaaaaaaaaaaaaa...ab"), hence the escape hatch after too many false matches.
It's kind of wild what clever machinery gets invoked when you try to do simple things.
Whoa, that's by Wojciech Muła, and I've used his pydvi2svg project. Neat coincidence.
Only sort of related to these links, but one thing I know very little about is whether you can use tricks like these to speed up related search-y ops that aren't exactly strstr or strchr: for example, finding an occurrence of any of a set of strings, or finding chars that need escaping for JSON or HTML/XML output. Of course, it makes sense that most of the work goes to the most-used fundamental functions; I'm just wondering out loud.
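For the JSON escaping case specifically, you can get part of the way there without real SIMD using SWAR tricks (SIMD within a register). A sketch, assuming the escape set is quotes, backslashes, and control bytes below 0x20; a real SIMD version would do the same comparisons 16 or 32 bytes at a time:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	lo = 0x0101010101010101 // every byte 0x01
	hi = 0x8080808080808080 // every byte 0x80
)

// hasZero reports whether any byte of x is zero: the classic
// bit-twiddling test, with no false positives.
func hasZero(x uint64) bool {
	return (x-lo) & ^x & hi != 0
}

// needsJSONEscape checks 8 bytes at a time for '"', '\\', or control
// bytes below 0x20, i.e. the characters a JSON string encoder must
// escape under the assumption stated above.
func needsJSONEscape(s []byte) bool {
	i := 0
	for ; i+8 <= len(s); i += 8 {
		x := binary.LittleEndian.Uint64(s[i:])
		// XOR broadcasts the target byte; a zero byte means a match.
		if hasZero(x^(lo*'"')) || hasZero(x^(lo*'\\')) {
			return true
		}
		// "Any byte < n" test (valid for n <= 128), per the classic
		// bit-twiddling recipes; here n = 0x20.
		if (x-lo*0x20) & ^x & hi != 0 {
			return true
		}
	}
	// Scalar tail for the last few bytes.
	for ; i < len(s); i++ {
		if s[i] == '"' || s[i] == '\\' || s[i] < 0x20 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsJSONEscape([]byte("plain ascii text, nothing special"))) // false
	fmt.Println(needsJSONEscape([]byte(`she said "hi" to everyone`)))         // true
}
```

The nice property is that the common case (no escapes needed) stays on the wide, branch-light path; you only drop to per-byte work once something in a block trips the test.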
His work on Base64 decoding might be a start. In many cases it's a matter of how small an FSM or transducer is.
For multiple-string matching, I was looking at doing Aho-Corasick for long strings in a SIMD-friendly way, but it's more a matter of building an FSM with relatively few branch points, so that SIMD can digest long label matches quickly, just as it does with long strcmp's.
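For reference, here's the plain scalar Aho-Corasick FSM (no SIMD anywhere; just to show the trie-plus-failure-links structure that a SIMD-friendly variant would have to flatten into long, branch-light label runs):

```go
package main

import "fmt"

// A minimal Aho-Corasick automaton over bytes: a trie whose nodes carry
// failure links (the longest proper suffix that is also a trie prefix)
// and output sets.
type node struct {
	next [256]*node
	fail *node
	out  []string
}

func build(patterns []string) *node {
	root := &node{}
	for _, p := range patterns {
		cur := root
		for i := 0; i < len(p); i++ {
			b := p[i]
			if cur.next[b] == nil {
				cur.next[b] = &node{}
			}
			cur = cur.next[b]
		}
		cur.out = append(cur.out, p)
	}
	// BFS to set failure links, shallowest nodes first.
	var queue []*node
	for b := 0; b < 256; b++ {
		if root.next[b] != nil {
			root.next[b].fail = root
			queue = append(queue, root.next[b])
		}
	}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for b := 0; b < 256; b++ {
			child := cur.next[b]
			if child == nil {
				continue
			}
			f := cur.fail
			for f != root && f.next[b] == nil {
				f = f.fail
			}
			if f.next[b] != nil {
				child.fail = f.next[b]
			} else {
				child.fail = root
			}
			// Inherit matches that end at the failure target.
			child.out = append(child.out, child.fail.out...)
			queue = append(queue, child)
		}
	}
	return root
}

// search reports every pattern occurrence, in order of match end.
func search(root *node, text string) []string {
	var hits []string
	cur := root
	for i := 0; i < len(text); i++ {
		b := text[i]
		for cur != root && cur.next[b] == nil {
			cur = cur.fail
		}
		if cur.next[b] != nil {
			cur = cur.next[b]
		}
		hits = append(hits, cur.out...)
	}
	return hits
}

func main() {
	root := build([]string{"he", "she", "his", "hers"})
	fmt.Println(search(root, "ushers")) // [she he hers]
}
```

The branch-point observation shows up here as node fan-out: most nodes in a dictionary trie have exactly one child, and those unary chains are the long label matches that SIMD could eat a vector at a time.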