Implementing strcmp, strlen, and strstr using SSE 4.2 instructions (2008)

geocar · on Sept 8, 2017

My favourite part:

> The following strcmp and strlen functions were written and published when the processors with SSE 4.2 support were not available yet. Later they were tested on real hardware and found to be correct.

Too many programmers often think of programming as an online operation - online with stackoverflow, the manual, npm install, and so on -- and in a world of always-online it's hard to justify "wasting" precious mental-space preserving the ability to program in a wifi-free zone like an airplane or on the underground, but you need the ability to think about your software to program on the chip that doesn't exist yet, so this may be interesting to meditate over from time to time.

vvanders · on Sept 8, 2017

There was a fun MySQL bug[1] a while back involving similar functions and SSE(I think) instructions where the MySQL check was something like:

   if((my_bool)memcmp(password, hash) == 0)

Except that the SSE function would return nonzero of > 256 which would truncate to 0. Since the hash was seeded with time you could just spin-lock your way to entry on any system of that arch.

So while they may have been "correct" there's always fun implications when you deploy them in the wild.

[1] - https://nvd.nist.gov/vuln/detail/CVE-2012-2122

jsheard · on Sept 8, 2017

Intel provides a tool that emulates instructions which are unsupported by the host CPU, so it's likely that the code was tested before hardware became available.

https://software.intel.com/en-us/articles/intel-software-dev...

> Intel is releasing this Intel SDE so that developers can gain familiarity with our upcoming instruction set extensions.

twotwotwo · on Sept 8, 2017

Go's implementation seems to mostly live in https://golang.org/src/strings/strings_amd64.go and https://github.com/golang/go/blob/master/src/runtime/asm_amd... (search 'indexbyte' and 'indexshortstr'). At least some of the stuff to use newer instruction sets was added by Ilya Tocar of Intel in https://go-review.googlesource.com/c/go/+/22337 . It looks like:

For short needle + haystack that fits in a vector reg, use SSE/AVX2 string search.

For short needle and long haystack, use SSE to find the first byte of the needle jumping a whole vector-register width at a time, then compare. If that produces too many false positives, fall back to SSE/AVX2 search for the whole needle.

For long needle, use a Rabin-Karp rolling hash to find a likely match, then verify.

The string search optimization rabbit hole can go pretty deep, especially because different implementations might win for different values passed to the function. Like, I'm guessing Go tries that search for just the first byte because, empirically, it helps pretty often, but it's not inherently faster for every possible needle and haystack (think needle ab haystack aaaaaaaaaaaaaa...ab), hence the escape hatch after too many false matches.

It's kind of wild what clever machinery gets invoked when you try to do simple things.

kwillets · on Sept 9, 2017

There's a pretty good discussion and benchmarks here: http://0x80.pl/articles/simd-strfind.html

and here: http://0x80.pl/articles/simd-friendly-karp-rabin.html

twotwotwo · on Sept 9, 2017

Whoa, that's by Wojciech Muła, and I've used his pydvi2svg project. Neat coincidence.

Only sort of related to these links, but one thing I know very little about is whether you can use tricks like these to speed up related search-y ops that aren't exactly strstr or strchr. For example, finding an occurrence of any of a set of strings, or chars needing escaping for JSON or HTML/XML output. Of course, makes sense that most of the work goes to the most used fundamental functions; I'm just wondering out loud.

kwillets · on Sept 9, 2017

His work on Base64 decoding might be a start. In many cases it's a matter of how small an FSM or transducer is.

For multiple string match I was looking at doing Aho-Corasick for long strings in a SIMD-friendly way, but it's more a matter of an FSM with relatively few branch points, so that SIMD can digest long label matches quickly just as it does with long strcmp's.

vinayan3 · on Sept 8, 2017

According to this one lecture the speed ups vary from 1.47x - 13.8x. http://ispass.org/ispass2011/slides/4_2.pdf

Another set of benchmarks comparing simple strcmp to SSE 4.2: https://www.linkedin.com/pulse/simple-fast-string-comparison...

millstone · on Sept 8, 2017

These all look to read beyond the end of the string, which can cause a crash. Don't they need a scalar handling of <16 byte suffixes, or at least aligning to avoid crossing a page?

Twirrim · on Sept 8, 2017

It mentions on the site:

  Other features of SSE 4.2
  
      The strings do not need to be aligned.
      The processor properly handles end-of-the-string case for zero-terminated strings and Pascal-style strings.
      You can use the instructions with Unicode characters, signed or unsigned bytes.
      Four aggregation operations can be used to implement a wide range of string-processing functions.

So presumably not, looks like the processor handles the end-of-string stuff natively.

millstone · on Sept 9, 2017

This seems impossible, and besides there's code like `MovDqU xmm0, dqword[edx]` which will certainly SIGSEGV if crossing a page boundary into unmapped address space.

I also found this SO question which claims that trivial testcases break these implementations https://stackoverflow.com/questions/21714827/movdqu-instruct...

This code is just horrendously broken. Why has nobody else noticed it?

chrisseaton · on Sept 8, 2017

Aren't the reads aligned? Is it possible for an aligned 16 byte block to cross a page boundary?

millstone · on Sept 8, 2017

"The strings do not need to be aligned" and the code doesn't take special pains to align the reads.

wyldfire · on Sept 8, 2017

Also of interest would be whether any popular x86 libc's actually leverage this already.

I did a quick skim (disclaimer it's sometimes hard to rule out that an SSE optimized version is in there somewhere).

* musl 1.1.16: no

* glibc 2.26: yes (also an avx2 optimized one)

* Apple Libc 1158.50.2: optimized (non-SSE2) versions of several str* funcs exist but strcpy() has an >=SSE impl

DangerousPie · on Sept 8, 2017

Page says "Created 9 years ago", so this should probably say (2008).

LukeShu · on Sept 8, 2017

If you mouse-over it, it will tell you the exact date: 2008-12-21

    <time title="21 Dec 2008, 18:19" datetime="2008-12-21T18:19:19+06:00">9&nbsp;years ago</time>

sctb · on Sept 8, 2017

Thanks, updated!

kwillets · on Sept 8, 2017

PCMPISTRI, what an instruction.

It's supposed to be a comparison, but it doesn't set the flags with comparison results. Instead it uses them to signal if it's found an unequal byte, or hit the end of the strings. The latter is equality, which is one condition we want, but it doesn't say where in the 16 bytes it found the terminator. In fact it sets the register (ecx) which otherwise would be the offset to compare bytes, to 16.

That's all fine if you just want to return string comparison, but if you're trying to find their longest common prefix it's a whole new comparison to find the 0x00's.

kobeya · on Sept 8, 2017

Sure, but now you can compare 16 bytes per cycle.

kwillets · on Sept 8, 2017

It's 16/3 bytes per cycle.

The newer SIMD stuff seems to be factoring it out into bytewise compare and CLZ on a mask.

yuhong · on Sept 9, 2017

They were touting SSE4.2 for XML parsing. I wonder what about HTML parsing too.