Pruning spaces from strings quickly on ARM processors (lemire.me)
102 points by ingve on July 3, 2017 | 28 comments



[ full disclosure: Intel employee, working in software that does exactly this sort of thing ]

I would suggest that hunting down a character class in general - i.e. whitespace - is itself an 'algorithmic' problem, and a mildly interesting one. If you have a fast permute handy, this can be done on both x86 and ARM.

We have some code here that does this, but it's not the easiest to understand (this is the run-time only):

https://github.com/01org/hyperscan/blob/master/src/nfa/shuft...

The idea is that you represent a character class by intersecting two PSHUFB (or whatever floats your boat) results. So if you want to do a real regex-style \s, you will probably have to use one "bucket" to represent the low-order stuff (like \r and \n) and one for \x20. You then use 1 PSHUFB to look up your bucket for the high 4 bits (aka nibble) and the other PSHUFB to look up the low 4 bits.
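A minimal sketch of that two-shuffle trick for the class \s = {\t, \n, \v, \f, \r, ' '}, using SSE intrinsics: bucket bit 0 covers the low control characters 0x09-0x0D, bucket bit 1 covers 0x20. The table contents here are my own illustration, not Hyperscan's.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* A byte is in the class iff the lookup on its low nibble and the
   lookup on its high nibble share at least one bucket bit. */
__attribute__((target("ssse3")))
static uint16_t whitespace_mask(__m128i input) {
    const __m128i lo_table = _mm_setr_epi8(
        2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0);
    const __m128i hi_table = _mm_setr_epi8(
        1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    const __m128i nib = _mm_set1_epi8(0x0F);

    /* One PSHUFB indexed by the low nibble, one by the high nibble.
       (PSHUFB zeroes lanes whose index byte has its high bit set,
       hence the explicit nibble masks.) */
    __m128i lo = _mm_shuffle_epi8(lo_table, _mm_and_si128(input, nib));
    __m128i hi = _mm_shuffle_epi8(
        hi_table, _mm_and_si128(_mm_srli_epi16(input, 4), nib));

    /* Intersect the two lookups; zero means "not in the class". */
    __m128i miss = _mm_cmpeq_epi8(_mm_and_si128(lo, hi),
                                  _mm_setzero_si128());
    return (uint16_t)~_mm_movemask_epi8(miss); /* 1 bit per match */
}
```

The returned mask has one bit per input byte, set where the byte is whitespace.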

This can't represent all character classes - it's possible to run out of buckets. There's an obvious way of doing all character classes with 3 shuffles (see truffle.c in the same directory).

As another commenter says, it's a lot harder to do this for UTF-8 as you are looking for bigger data items, variable-sized data items, and potentially a more generous definition of whitespace.


To expand on the UTF-8 point - assuming you are using "UCP" type definitions of whitespace where you start looking for "Mathematical Space" (codepoint 0x205f) and "Ideographic Space" (codepoint 0x3000) and a good couple dozen other friendly spaces.... you can do a trick similar to what I describe if you really think that SIMD is going to pay off here (hey, maybe you're doing a lot of "6 EM Space" processing). It's still potentially doable in SIMD and the comparison is, well, what? One-char-at-a-time?

You will need more shuffles and/or more buckets.


I really like that more folks are looking at the vectorized instructions in various ARM chips but I worry about gross generalizations like "However, for many problems, the total lack of a movemask instruction, and the weakness of the equivalent to the pshufb, might limit the scope of what is possible with ARM NEON."

I would much prefer taking each of the various 'cpu eater' type applications and looking at it as a unit: DSP, convolution, bin packing, list searching - and seeing what you can do versus what needs help. A lot of the improvements in the x86 architecture came about because a design engineer at Intel or AMD read a clear statement of the problem and the challenges with the solution.

That said, and to justinjlynn's comment, I would really love a top-to-bottom 'Unicode string processing on current processors' treatment. So many people are doing nothing but scripting these days that string hacking is a big part of interpreters (rather than, say, raw floating-point performance as in the old days).


The problem with that is that Unicode string processing is highly data dependent. You may not get good speedups because most performance improvement from putting things in hardware comes from parallelism in the data path.

A good approach to getting speedups is probably by assuming ascii, vectorizing the hell out of that, and falling back to multibyte processing where that fails. Checking for multibyte characters comes down to checking if the high bit of any byte is set, which should be fast and easy for the branch predictor to deal with.
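That high-bit check can be done a word at a time; a minimal sketch (my own, not from any particular library):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 16-byte block is pure ASCII iff no byte has its high bit set.
   Reading the block as two 64-bit words turns that into a single
   mask test per word. */
static int block_is_ascii(const unsigned char *p) {
    uint64_t a, b;
    memcpy(&a, p, 8);     /* memcpy keeps the loads alignment-safe */
    memcpy(&b, p + 8, 8);
    return ((a | b) & 0x8080808080808080ULL) == 0;
}
```

The fast path vectorizes blocks that pass this test; anything that fails falls back to the multibyte-aware slow path.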


The NEON coprocessor runs at the same CPU clock, but 20 cycles behind the main CPU. It has its own instruction queue, and as long as the queue isn't full the main CPU will simply forward instructions to it. If you want to read from the wide registers though, you need to wait the 20 cycles for the coprocessor to catch up.


A quick google suggests this was true for the Cortex-A8 (released in 2005, over a decade ago), but is unlikely to be true for any implementation since, and I would definitely not expect it to be true for a modern 64-bit implementation like the one the article author tested.


Thanks for the reply. After much sifting I finally found modern numbers.

  General-purpose -> NEON: 6 cycles
  NEON -> General-purpose: 8 cycles
So yes, definitely much closer.

http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Co...


> Most of the benefit from this approach would come if you can often expect streams of 16 bytes to contain no white space character. This seems like a good guess in many applications.

The average word length in English is substantially less than 16, so it is hard to see how this would help.


parse concatenated zero-terminated strings, parse space-separated base64 strings, find the commas in CSVs, discard sensor data that doesn't meet the threshold, do run length encoding on sorted data, ...

There are plenty of applications for this algorithm or variations of it. But you are right that the presented application doesn't seem immediately useful.


How often do you want to pull the spaces out of English text though? I'd think this would be used more for something like Base64 encoded binaries where you want to simplify the decoder by preprocessing the text.


(x<32)?1:0 asks for a branch. Presumably the compiler will remove it.

However, you can just write (x<32), which is defined to be zero or one by the language spec. This shouldn't save anything on ARM, but who knows what the compiler will do.

Similarly, the scalar version might be faster if you do manual loop unrolling. It probably still won't beat Neon, but it makes for a better "scalar vs vectorized" bake-off.
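For reference, a branchless scalar prune along those lines might look like this (a sketch; the article's own code may differ, and here any byte value <= 32 counts as prunable):

```c
#include <assert.h>
#include <stddef.h>

/* Copy every byte to the write cursor unconditionally, then advance
   the cursor only for kept bytes - no branch in the loop body. Note
   that '\0' is also <= 32 and would be pruned under this definition. */
static size_t despace_scalar(char *bytes, size_t n) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        char c = bytes[i];
        bytes[pos] = c;
        pos += ((unsigned char)c > 32); /* 0 or 1, per the C spec */
    }
    return pos;
}
```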


Maybe I’m misunderstanding your post, but any half-decent compiler should compile (x<32)?1:0 and (x<32) to exactly the same code, regardless of platform.


The C specification requires that (x<32) is 0 or 1, but that doesn't mean it is necessarily possible to do it branch-free on any particular computer.


Maybe there are cpu's where it isn't possible, but I don't think there are any non-joke systems where it isn't possible.

There are just too many ways of doing it. E.g. on most architectures there'll be a straightforward way of doing it using an add/sub operator and taking advantage of overflow/underflow to set a carry flag, and then another operator or three to get the least significant bit set to the carry flag.

Or you can do a straight lookup into a table on any architecture that supports a load from memory and an add, though it'd of course be hugely wasteful.

On any architecture that also supports a right shift / integer division plus AND, you could make that use less memory: shift, mask, look up (any set bit => 0, no set bits => 1), and then AND the results together with an initial 1.

And many, many more. I have a hard time picturing what an instruction set that would make it impossible to do that without branches would even look like.

Of course that doesn't guarantee it will be the most efficient choice, and of course also doesn't guarantee that a compiler will know how to do it.
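Two of the techniques described above, sketched in C (illustrative only, not from the thread's codebase):

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free (x < 32) via subtract-and-borrow: for x < 32 the
   32-bit difference is negative, so its top bit is set. */
static int lt32_borrow(uint8_t x) {
    return (uint32_t)((int32_t)x - 32) >> 31;
}

/* Branch-free (x < 32) via shifting: for an 8-bit value, x < 32
   exactly when the top three bits are clear, i.e. x >> 5 is zero. */
static int lt32_shift(uint8_t x) {
    return (x >> 5) == 0;
}
```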


In this particular instance, 64-bit ARM (A64) has a conditional-set instruction, so "int cmp(int x) { return x < 32; }" compiles straightforwardly to "cmp w0, #0x1f; cset w0, le; ret". For 32-bit ARM you can use the conditional-execution you get on almost all instructions, so gcc emits "cmp r0, #31; movgt r0, #0; movle r0, #1; bx lr".


These tricks won't have a great time with Unicode encodings.


I do a lot of low-level systems programming; think rasterizers, drivers, etc. There are numerous occasions where I've needed to hand-code some sort of work-around, or to paper over a gap in the functionality provided by the hardware. In C/++, this isn't really a thing: the language only has so much flexibility in what it can say.

When I really have to get into things, though, I end up writing assembly for the truly hot-loops. In that case, as fugly as x86/64 is, it is one of the best ISAs for exactly expressing the algorithm I have in mind. Personally, I find the most advanced NEON specs are about a decade behind the x86 vector specs. Generally, while the ARM implementations can keep up cycle-for-cycle on commodity parts for scalar, algorithmically, x86/64 blows it out of the water. I think the article cites about 5x? In the worst case, I've seen places where ARM falls entirely out of the I$ and thrashes its D$ & I$ on scalar work-arounds for gaps in its NEON set, where x86/64 manages to just squeak by completely vectorized. In those cases I see 50x (or more) delta, even though the spec sheet would imply that ARM is "just as good" as x86/64.

I like Mike Abrash's quote on this (to paraphrase): "assembly isn't faster than C++; assembly is more expressive than C++".


Back in the MS-DOS days, when I had lots of fun using TASM, the common way to get high performance out of C and other higher-level languages on 8- and 16-bit computers was to use inline Assembly everywhere.

The typical pattern for 80% of the application:

    void func(args)
    {
        asm {
        }
    }
Literally a poor man's macro assembler.

Now, after 40 years of optimizing-compiler improvements, we have younger generations thinking that C is like writing Assembly and that C compilers were always blazing fast.

So when I see statements about other compilers generating slower code than C ones, I remember those days and smile.


These tricks work just fine with UTF-8.


Unicode is more than just UTF-8.


Unicode can be encoded as UTF-8, and as long as you're okay with defining "whitespace" as code points less than 32, the algorithm here will handle it correctly.
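One way to see why: every byte of a UTF-8 multi-byte sequence has its high bit set (lead bytes and continuation bytes alike), so a byte-wise "less than 33" test can never fire inside a multi-byte character. A small sketch (my own illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove bytes <= 32 from a UTF-8 buffer in place. Because all bytes
   of a multi-byte sequence are >= 0x80, this never splits a
   multi-byte character; it only ever removes ASCII control bytes
   and spaces. Returns the new length. */
static size_t prune_utf8(char *s, size_t n) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if ((unsigned char)s[i] > 32) s[pos++] = s[i];
    }
    return pos;
}
```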


The beautiful thing about UTF-8 is that a lot of simple, seemingly-ASCII algorithms will work correctly with Unicode: as long as you define "correctly" in a way that ignores complicated Unicode semantics.

And we can often do that: for example a programming language could just ban non-ASCII characters everywhere but in string literals.

But sadly, we do often need to care about all those complicated unicode character classes etc. And if we are talking explicitly about Unicode, then by default we should be considering the whole standard, not just the bits we like.


But, boy, it could speed up Fortran 66 compilation!


The CDC STAR-100 included an instruction[0] for parsing Fortran.

[0] https://groups.google.com/d/msg/comp.arch/QhQDFCcny7w/wVOpIh...


Unless the hot spot in your code is removing spaces from strings, these tricks are just academic and not something you would ever do yourself. You'd use a well tested, safe string handling library in real code.


I am not an academic. I have done this myself. Of course, we wrote a regex library...

These tricks are in general well worth learning if you do any performance-critical code, as often the boundary of the way they are implemented in someone's library is not useful for the actual task (i.e. you may have a fast 'find a space' function, but it might not be fast by the time you painfully iterate through each of its results removing spaces one by one). Sadly, a lot of this sort of processing doesn't compose very well unless you have access to an omniscient optimizer.

But yes, if this isn't your hot spot, don't do this thing in particular. It's still possible you might learn something useful by reading about it.


It's mostly interesting because it's a subproblem of many interesting problems. "get a bunch of data, discard uninteresting/invalid data, do stuff with the rest" is something that easily becomes a hotspot. "Removing spaces" is the "discard uninteresting/invalid data" part, in the easy case that the data is a bunch of bytes and there is some arbitrary threshold that decides what to keep (not unusual with sensor data of all kinds).

So I'm not sure "academic" is quite the right word, but I agree that worrying about Unicode is not the point here.


Someone has to implement those libraries.

Also, this space remover looks a lot like a CSV parser, and CSV is a fine high level format if your data is a table (and you care at all about performance).



