...This is news? You've got a tight loop with a branch, and depending on which way the branch goes, you're going to touch completely different locations in memory.
The solution (at least for some problems) is just to lay your data out in memory better - optimal is a binary search tree in an array, like the way heaps are implemented.
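For concreteness, here's a minimal sketch of that layout in C (names are mine; it assumes a 1-based destination array with n + 1 slots):

```c
#include <stddef.h>

/* Copy a sorted array into level-order ("heap") positions by an
 * in-order walk of the implicit tree: children of node k live at
 * 2*k and 2*k + 1.  tree[] is 1-based, so it needs n + 1 slots,
 * and *next starts at 0. */
static void build_level_order(const int *sorted, int *tree, size_t n,
                              size_t *next, size_t k)
{
    if (k <= n) {
        build_level_order(sorted, tree, n, next, 2 * k);      /* left subtree  */
        tree[k] = sorted[(*next)++];                          /* this node     */
        build_level_order(sorted, tree, n, next, 2 * k + 1);  /* right subtree */
    }
}

/* usage: size_t next = 0; build_level_order(sorted, tree, n, &next, 1); */
```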
I have some really ugly things to say to a new poster who cannot be bothered to read an article before trying to show off, but I'll refrain. In the future, read first, think, then comment. You'll be doing everyone else a favor.
I did read it. And none of that stuff about cache behavior _matters_ when your data isn't in cache and you're spending 70 ns on every loop iteration waiting for the next key to be fetched from DRAM.
One of the first things to learn about caches is how dependent modern processors are on prefetching, and a binary search just completely breaks prefetching.
Anyways, I didn't come here to show off, but I did implement a B+ tree that can do > 1 million random lookups/second on a single core, on a > 100 MB tree. So believe it or not, I might know what I'm talking about.
TL;DR - If your element sizes are powers of two, and your cache is set-associative, you run the risk of set aliasing. Caches use the lower bits of the address to select the set, probably because that's really fast.
If you have more aliases than there are ways in each set, items get evicted from the cache. If you do it just right, you end up with the worst case: 100% cache misses. So even something that on the face of it looks perfectly cacheable (well... if you don't know quite how the cache works), with a small working set (note that the test involves repeating the same search over and over again), can actually end up suffering 100% cache misses.
The L2 and L3 caches use physical addresses rather than virtual ones, so you can't fully control this effect from user mode.
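To make the aliasing concrete, here's a toy sketch in C (parameters are my own assumption: 64-byte lines and 64 sets, i.e. a typical 32 KiB 8-way L1; it works on array offsets, which is virtual-address math, as a simplification):

```c
/* For a binary search over a power-of-two-sized array of ints, print
 * which cache set each probe on one search path maps to.  The first
 * dozen-plus probes all land in set 0, so a handful of sets absorb
 * every search and repeated lookups evict each other even though only
 * a few dozen words are ever touched. */
#include <stdio.h>
#include <stddef.h>

#define LINE_BYTES 64
#define NUM_SETS   64

int main(void)
{
    size_t n = (size_t)1 << 24;      /* 16M 4-byte elements, power of two */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t byte_off = mid * sizeof(int);
        printf("probe index %8zu -> set %2zu\n",
               mid, (byte_off / LINE_BYTES) % NUM_SETS);
        hi = mid;                    /* always descend left: one fixed path */
    }
    return 0;
}
```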
Yeah, I get that. What I was getting at is that binary searches in general are _terrible_ as soon as you're not hitting cache anymore - and for any problem worth optimizing that's what happens, if nothing else because your data set doesn't fit in L2 anymore.
Worrying about stuff not being cached because of associativity is pointless when the last 5-10 levels aren't going to be in cache anyways because they don't fit; those last 5-10 cache misses are going to _utterly dominate_ your search time.
IOW, even if associativity isn't a problem at all, binary searches suck w.r.t. caches - I'm not at all saying they're wrong, just that it's not terribly relevant - if performance matters you need to be doing something other than a binary search.
So, a "binary search tree in an array, like the way heaps are implemented" actually would not experience the problem described in this article. The typical layout of the tree for a heap is "breadth-first, level-order" as opposed to the "depth-first, in-order" tree that is implicit to a sorted array. This means that your accesses to the first couple levels are in the same cache line, and your subsequence accesses as you move down the tree are less likely to line up. This is definitely not to say this is "optimal", however.
If you regularly search for the same stuff with binary search in a large array, past some array size you run out of associativity due to aliasing. At that point your data is never in the cache, even though you might only be accessing 30 or 40 different words of memory.
Using the heap layout, you can sometimes even eliminate branch mispredictions and the subsequent (expensive) pipeline flushes.
The basic idea is to allow your compiler to use predicated instructions. To do this, you have to transform control dependencies (e.g., `if (a > b)`) into data dependencies on index computations (e.g., `j = 2*j + (a > b)`).
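Roughly like this - a minimal sketch in C of a branchless lower_bound over the level-order layout, assuming a 1-based array and GCC/Clang builtins (the recovery trick at the end is the standard one, not anything from the article):

```c
#include <stddef.h>

/* Returns the 1-based index of the smallest element >= key, or 0 if
 * every element is < key.  tree[1..n] holds the keys in level order. */
static size_t level_order_lower_bound(const int *tree, size_t n, int key)
{
    size_t j = 1;
    while (j <= n) {
        /* data dependency instead of a control dependency: the comparison
         * result feeds the index, so the compiler can avoid a branch */
        j = 2 * j + (tree[j] < key);
    }
    /* strip the trailing "went right" turns plus the final "went left"
     * turn to recover the last node whose key compared >= key */
    j >>= __builtin_ctzll(~j) + 1;
    return j;
}
```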
> Optimal is a binary search tree in an array, like the way heaps are implemented.
Not quite. The heap layout has bad spatial locality. Think about the last row of leaves - it occupies the last 50% of the array. The next-to-last row (all their parents) occupies the 25%-50% range. All those leaves are far away from their parents, and thus the heap layout is not optimal.
Better is to cluster parents and children into small groups. With clusters of three, two thirds of the time a child will be adjacent to its parent.
There's a tradeoff between latency and amount of memory touched.
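Something like this, maybe - a rough sketch in C of one possible clusters-of-three layout (the block numbering and names here are my own invention, and it assumes a full tree padded to n = 4^h - 1 keys):

```c
#include <stddef.h>

/* t[3*b], t[3*b+1], t[3*b+2] are the parent, left child, and right
 * child of block b, stored contiguously; the blocks themselves form a
 * 4-ary heap, so the child blocks of b are 4*b+1 .. 4*b+4.
 * Returns 1 if key is present, 0 otherwise. */
static int clustered_contains(const int *t, size_t n, int key)
{
    size_t b = 0;
    while (3 * b + 2 < n) {
        if (t[3*b] == key || t[3*b + 1] == key || t[3*b + 2] == key)
            return 1;
        size_t right  = t[3*b] < key;              /* which child of the parent */
        size_t right2 = t[3*b + 1 + right] < key;  /* which grandchild subtree  */
        b = 4*b + 1 + 2*right + right2;
    }
    return 0;
}
```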
The beautiful thing about the heap layout is that you can prefetch more than one level ahead. If your keys are 4 bytes, 16 of them fit on a cacheline, and you can prefetch 4 levels ahead with a single cacheline.
That means that if nothing is in cache, worst case each loop iteration will take < 20 ns.
With any kind of clustering (really, it's just a variation on B-trees) you touch less memory but you can't prefetch as far ahead.
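Concretely, the descent with the prefetch folded in looks something like this (a sketch in C, assuming 4-byte keys, 64-byte cache lines, a 64-byte-aligned 1-based array, and GCC/Clang builtins; prefetching past the end of the array is harmless since it's only a hint):

```c
#include <stddef.h>

static size_t heap_search_prefetch(const int *tree, size_t n, int key)
{
    size_t j = 1;
    while (j <= n) {
        /* the 16 descendants of node j four levels down start at index
         * 16*j and fit in one 64-byte line, so one prefetch covers the
         * next four iterations' worth of nodes */
        __builtin_prefetch(tree + 16 * j);
        j = 2 * j + (tree[j] < key);
    }
    j >>= __builtin_ctzll(~j) + 1;   /* recover the last "went left" node */
    return j;                        /* 0 if every key was < key          */
}
```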
In practice I suspect a 4-ary heap would perform better than a binary heap for my application, but there's a bunch of math I'd have to work out again that was hard enough for the binary tree version.
I'm not quite sure how what you're describing would work, but I can't see offhand how you'd implement it without losing the prefetching that the heap layout gives you.
A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation
http://www.cs.amherst.edu/~ccm/cs34/papers/ladnerbst.pdf