> The following strcmp and strlen functions were written and published when the processors with SSE 4.2 support were not available yet. Later they were tested on real hardware and found to be correct.
Too many programmers often think of programming as an online operation - online with stackoverflow, the manual, npm install, and so on -- and in a world of always-online it's hard to justify "wasting" precious mental-space preserving the ability to program in a wifi-free zone like an airplane or on the underground, but you need the ability to think about your software to program on the chip that doesn't exist yet, so this may be interesting to meditate over from time to time.
There was a fun MySQL bug[1] a while back involving similar functions and SSE(I think) instructions where the MySQL check was something like:
if((my_bool)memcmp(password, hash) == 0)
Except that the SSE function would return nonzero of > 256 which would truncate to 0. Since the hash was seeded with time you could just spin-lock your way to entry on any system of that arch.
So while they may have been "correct" there's always fun implications when you deploy them in the wild.
Intel provides a tool that emulates instructions which are unsupported by the host CPU, so it's likely that the code was tested before hardware became available.
For short needle + haystack that fits in a vector reg, use SSE/AVX2 string search.
For short needle and long haystack, use SSE to find the first byte of the needle jumping a whole vector-register width at a time, then compare. If that produces too many false positives, fall back to SSE/AVX2 search for the whole needle.
For long needle, use a Rabin-Karp rolling hash to find a likely match, then verify.
The string search optimization rabbit hole can go pretty deep, especially because different implementations might win for different values passed to the function. Like, I'm guessing Go tries that search for just the first byte because, empirically, it helps pretty often, but it's not inherently faster for every possible needle and haystack (think needle ab haystack aaaaaaaaaaaaaa...ab), hence the escape hatch after too many false matches.
It's kind of wild what clever machinery gets invoked when you try to do simple things.
Whoa, that's by Wojciech Muła, and I've used his pydvi2svg project. Neat coincidence.
Only sort of related to these links, but one thing I know very little about is whether you can use tricks like these to speed up related search-y ops that aren't exactly strstr or strchr. For example, finding an occurrence of any of a set of strings, or chars needing escaping for JSON or HTML/XML output. Of course, makes sense that most of the work goes to the most used fundamental functions; I'm just wondering out loud.
His work on Base64 decoding might be a start. In many cases it's a matter of how small an FSM or transducer is.
For multiple string match I was looking at doing Aho-Corasick for long strings in a SIMD-friendly way, but it's more a matter of an FSM with relatively few branch points, so that SIMD can digest long label matches quickly just as it does with long strcmp's.
These all look to read beyond the end of the string, which can cause a crash. Don't they need a scalar handling of <16 byte suffixes, or at least aligning to avoid crossing a page?
Other features of SSE 4.2
The strings do not need to be aligned.
The processor properly handles end-of-the-string case for zero-terminated strings and Pascal-style strings.
You can use the instructions with Unicode characters, signed or unsigned bytes.
Four aggregation operations can be used to implement a wide range of string-processing functions.
So presumably not, looks like the processor handles the end-of-string stuff natively.
This seems impossible, and besides there's code like `MovDqU xmm0, dqword[edx]` which will certainly SIGSEGV if crossing a page boundary into unmapped address space.
It's supposed to be a comparison, but it doesn't set the flags with comparison results. Instead it uses them to signal if it's found an unequal byte, or hit the end of the strings. The latter is equality, which is one condition we want, but it doesn't say where in the 16 bytes it found the terminator. In fact it sets the register (ecx) which otherwise would be the offset to compare bytes, to 16.
That's all fine if you just want to return string comparison, but if you're trying to find their longest common prefix it's a whole new comparison to find the 0x00's.
> The following strcmp and strlen functions were written and published when the processors with SSE 4.2 support were not available yet. Later they were tested on real hardware and found to be correct.
Too many programmers often think of programming as an online operation - online with stackoverflow, the manual, npm install, and so on -- and in a world of always-online it's hard to justify "wasting" precious mental-space preserving the ability to program in a wifi-free zone like an airplane or on the underground, but you need the ability to think about your software to program on the chip that doesn't exist yet, so this may be interesting to meditate over from time to time.