Binary Search Eliminates Branch Mispredictions (2012) (pvk.ca)
89 points by luu on Oct 25, 2014 | 18 comments



Someone tell me if there's an error in my takeaway:

For short sequences, linear searches can be faster than binary searches because a short sequence of mostly predictable branches (no,no,no,no,yes) is usually faster than an even shorter sequence of unpredictable branches (yes,no,yes). To work around this, the author restricts his searches to compile-time constant sequence lengths so that GCC can unroll the fixed-size binary searches into short, fixed, linear series of branchless conditional move instructions. That setup executes reliably faster than optimized linear searches for in-cache sequences of size 8, 16, 32 and 64.
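
Concretely, I picture the fixed-size search looking something like this (my own sketch of the idea, not the author's code; the function name and the size of 32 are just for illustration):

    #define N 32    /* compile-time constant, power of two */

    /* Returns the index of the last element <= key (assumes
       sorted[0] <= key).  With N fixed, GCC can fully unroll the loop
       and turn each if into a CMP/CMOVcc pair instead of a branch. */
    static unsigned fixed_search(const unsigned *sorted, unsigned key)
    {
        const unsigned *low = sorted;
        unsigned step;

        for (step = N / 2; step > 0; step /= 2)
            if (low[step] <= key)
                low += step;

        return (unsigned)(low - sorted);
    }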

However, if the data is uncached, then memory latency dominates ALU execution time, and the memory prefetcher's preference for forward, linear scans leads to linear searches being faster than branchless binary searches because of the difference in memory access patterns. Interestingly, the prefetcher on the author's machine appears to prefer forward scans over backward scans even at a sub-cache-line level.


Yes, I think your first paragraph is the correct takeaway, but I'm not sure about the second.

Using Haswell as the example (because it's current and I'm more familiar with it right now), the linear search comparisons are really fast: the processor can handle 2 correctly-predicted branches-not-taken per cycle. But the single branch misprediction at the end costs about 15 cycles, plus the latency of reloading from L1 any registers where the load had already been speculatively executed (about another 5 cycles).

The conditional moves are slower than simple comparisons, but still very fast: the CMP/CMOVcc pair can have a throughput of 1 per cycle if unrolled, so half as fast as a CMP/JMPcc that is correctly predicted as not taken. But there is no penalty at the end of the loop, so the overall speed is better for small sizes out to a 64-byte cacheline or so. I haven't tested, but I think the branchless approach has extended its lead in recent years: he's using Westmere, which is pre-Sandy Bridge, which is pre-Ivy Bridge, which is pre-Haswell.
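
To make the comparison concrete, the two inner-loop shapes I have in mind look roughly like this (my own sketch, not code from the article; the names are mine):

    /* Branchy linear search: the compare/branch pairs are nearly free
       while predicted not-taken, but the final taken branch eats the
       mispredict penalty. */
    unsigned linear_branchy(const unsigned *v, unsigned n, unsigned key)
    {
        unsigned i;
        for (i = 0; i < n; i++)
            if (v[i] > key)
                break;
        return i;               /* index of the first element > key */
    }

    /* Branch-free equivalent for a sorted table: every element is
       examined, but there is no data-dependent branch to mispredict
       (compiles to CMP plus SETcc/ADD or CMOV). */
    unsigned linear_branchless(const unsigned *v, unsigned n, unsigned key)
    {
        unsigned i, idx = 0;
        for (i = 0; i < n; i++)
            idx += (v[i] <= key);
        return idx;             /* same index as above when v is sorted */
    }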

But the question of how this changes for data that needs to be brought into a cache line from memory is probably more complicated than just the memory latency. I think the issue is more that the memory latency swamps the actual comparisons, and it isn't (or shouldn't be) that linear searches can take advantage of prefetching but binary searches cannot; it just might take some 'trickery'. For the small sizes he's dealing with, you should be able to just load it all without waiting to see what you need. And I don't think there is currently any prefetch advantage for forward versus reverse scans, although one or the other might allow the loop to be structured differently.
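
For those sizes the 'trickery' might be as simple as touching every cache line of the table up front, so the loads are already in flight before the search decides where to look (a sketch, assuming a 64-entry table of unsigned and a helper name of my own):

    /* Warm all four cache lines of a 64-entry table (16 unsigned per
       64-byte line) before searching, regardless of which way the
       search will go. */
    static void warm_table(const unsigned *table)
    {
        unsigned i;
        for (i = 0; i < 64; i += 16)
            __builtin_prefetch(&table[i], 0, 3);   /* GCC builtin: read, keep in all cache levels */
    }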


Cool! I wonder if you can improve things even more by running several binary searches at once, interleaving them and then combining the results. That way there is work for the CPU to do during the data-load latencies (or the write-back, if it's not fully bypassed).


Well, I tried it on an Ivy Bridge Xeon E5-2680 v2 with gcc 4.4.3 at -O3 and a table size of 32. A single table at a time ends up being faster:

    155 M searches / sec for one table at a time.

    108 M searches / sec for two tables at a time.

    88 M searches / sec for four tables at a time.
The interleaved version of the code looks like this (for two at a time):

    #define LOGSIZE 5

    /* Search the two 16-entry halves of one 32-entry table at the
       same time; gcc compiles the ifs into conditional moves. */
    unsigned dsearch (unsigned key, unsigned * vector)
    {
        unsigned size = (1 << LOGSIZE) / 2;
        unsigned * low1 = vector;                           /* lower half */
        unsigned * low2 = vector + ((1 << LOGSIZE) / 2);    /* upper half */
        unsigned i;

        for (i = LOGSIZE - 1; i != 0; i--) {
                size /= 2;
                unsigned mid1 = low1[size];
                unsigned mid2 = low2[size];
                if (mid1 <= key)
                        low1 += size;
                if (mid2 <= key)
                        low2 += size;
        }

        /* prefer the upper half only on an exact match there */
        if (*low2 == key)
                low1 = low2;

        return low1 - vector;
    }
The compiled code ends up branch-free (gcc does a good job), but it is 30 instructions instead of 18.

Oh well, it was worth a shot...


I think the reason it's slower for you is that the CPU pipeline is starved for data... even when it is in the same cache line.

Since we are geeking out on binary search on a Saturday night, here's an approach I developed for traversing multiple binary trees in parallel with SIMD instructions, no branches, and data prefetching. It's used at Bing to calculate the document score that ranks Web search results, traversing thousands of machine-learning-derived decision trees. It resulted in a 4x speedup over the naive approach.

http://web.cse.ohio-state.edu/~ren/papers/CGO13_BinRen.pdf


This is very cool. Thanks for sharing!


Excellent for doing the actual test!

My interpretation might be the opposite of yours: it's not that the interleaved version fails to overcome the latency; rather, the seemingly one-at-a-time version is already able to issue all of its loads speculatively, and thus isn't being hurt (much?) by latency. If the addresses are known ahead of time, processors are great about issuing loads as soon as possible. They 'retire' them in order, but if the situation is right they can 'execute' them dozens of cycles ahead of time, even across loop iterations.

My guess would be that the slowdown from the interleaved searches is because spreading the loads across multiple cache lines actually works against you, since stalls become more rather than less frequent. Ivy Bridge can have (about?) 10 outstanding L1 requests, which it can issue at two per cycle. With a 5-cycle (I think?) L1 latency, that is enough to cover the latency. Sustaining two loads per cycle is hard, though, since only 4 ops per cycle can be retired. But there are not many other operations here, so the unrolled loop may come close.


You could perhaps implement std::equal_range() [0] that way. It behaves somewhat similarly to doing two slightly different binary searches (to find both the lower and upper bounds on where a key could be legally inserted into a sorted range).

[0] http://en.cppreference.com/w/cpp/algorithm/equal_range
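
In C terms, the interleaved version of that might look something like this (my sketch, assuming the same 32-entry table as above; std::equal_range itself is generic, this is just the shape of the idea):

    /* Interleave a lower-bound and an upper-bound branchless search
       over one sorted 32-entry table, so the two load streams can
       overlap. */
    void equal_range32(const unsigned *v, unsigned key,
                       unsigned *first, unsigned *last)
    {
        const unsigned *lo = v;     /* ends at the last element < key  */
        const unsigned *hi = v;     /* ends at the last element <= key */
        unsigned step;

        for (step = 32 / 2; step > 0; step /= 2) {
            if (lo[step] < key)
                lo += step;
            if (hi[step] <= key)
                hi += step;
        }

        *first = (unsigned)(lo - v) + (*lo < key);    /* first index with v[i] >= key */
        *last  = (unsigned)(hi - v) + (*hi <= key);   /* first index with v[i] > key  */
    }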


I'm not an expert, but I think the CPU can't do anything during a stall.


You can schedule another thread, if you have SMT. You can execute out-of-order to try avoiding the stall.

But in general you are right in that once a stall happens, you aren't going anywhere.


>You can execute out-of-order to try avoiding the stall.

Exactly. By running multiple interleaved branch-free binary searches at once, you offer independent work to the out-of-order engine.


That does sound like a pretty powerful optimization!


Why? Do you have a task which requires that?

What you want is called fork & join, pretty much standard in Java 8 and Scala.


I doubt it's worth forking to parallelize a small search.


The fork/join framework takes care of that.


It would be so very awesome if Intel/AMD decided to make REPNZ/REPZ SCAS(B/W/D/Q) search an entire cache line at once, in parallel, with a priority encoder selecting the first of multiple matches; they're already doing this with associative memory for the cache tags. It would be much like the vectorised SSE version, but smaller (you can't get better than a single instruction), and it could even be put inside a binary search over cache-line-sized blocks for more speed: basically a vectorised binary search. It would certainly be interesting to compare with the other approaches here.
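
In the meantime, the software version of that with SSE2 looks roughly like this (my sketch, not from the article; the function name and the 16-entry line are just for illustration):

    #include <emmintrin.h>   /* SSE2 */

    /* Scan one 64-byte cache line (16 x 32-bit) at once: compare all
       lanes for equality with the key, collect a bitmask, and let the
       bit scan play the role of the priority encoder. */
    static int scan_line(const unsigned *line, unsigned key)
    {
        __m128i k = _mm_set1_epi32((int)key);
        int mask = 0, i;

        for (i = 0; i < 4; i++) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(line + 4 * i));
            __m128i eq = _mm_cmpeq_epi32(v, k);
            mask |= _mm_movemask_ps(_mm_castsi128_ps(eq)) << (4 * i);
        }

        return mask ? __builtin_ctz(mask) : -1;   /* index of the first match, or -1 */
    }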


In my experiments, conditional moves have actually been slower than correctly predicted branches.


That's not a terrible rule of thumb, but one can often do better. If you (the programmer) know that consistently correct prediction is impossible (as it often is for binary search), then it's often beneficial to convince the compiler to generate conditional moves rather than branches. That is what the article's test results show.

Conditional moves (CMP/CMOVcc) will always be slower than correctly predicted branches (CMP/JMPcc). If the prediction accuracy is 50%, the conditional move will always be faster. The break-even prediction accuracy differs between processors, but 85% is a reasonable approximation, and it is moving higher with modern processors (CMOV is getting relatively faster).



