The authors claim their algorithm is novel, but it has actually been a well-known folklore technique for quite a while. For example, it was used in Folly as early as Jan 2016:
Yeah, I definitely saw this technique in Intel sample code demonstrating use of BMI before HSW was released. Probably 2011, maybe 2012.
That said, we worry too much about who the first person to have an idea was. Any idea worth having is come upon independently by many people, and putting the idea to work and communicating it to others are more valuable contributions anyway.
I didn't get that impression at all. This is "get me the rank (index) of the first occurrence of x" in a succinct data structure. Nothing to do with relational algebra AFAICT except to the extent I suppose that some relational algebra implementations may use such data structures under the hood.
Yes, but 'select' is the name of that operation in the relational algebra. Which confuses some people whose first exposure to that model is through SQL, where the keyword implies a combination of 'select' and 'project'. I think the commenter's intent was to clarify that.
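For readers new to the terminology, here's a minimal C sketch pinning down the two operations. These are naive O(n) definitions over a byte-per-bit array, purely to fix the meaning of rank/select -- nothing like the constant-time succinct versions the paper is about:

```c
#include <stddef.h>
#include <stdint.h>

/* rank1(B, i)    = number of 1 bits in B[0..i)            */
/* select1(B, j)  = position of the j-th 1 bit (0-based)   */

static size_t rank1(const uint8_t *bits, size_t i) {
    size_t r = 0;
    for (size_t k = 0; k < i; k++)
        r += bits[k];
    return r;
}

static size_t select1(const uint8_t *bits, size_t n, size_t j) {
    for (size_t k = 0; k < n; k++)
        if (bits[k] && j-- == 0)
            return k;
    return n; /* fewer than j+1 set bits */
}
```

Note the two are inverses in the sense that rank1(B, select1(B, n, j)) == j whenever the j-th set bit exists.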
Yeah, succinct data structures aren't something taught in most intro data structures courses, and the concept may be unfamiliar even to old hands, since SDSs are relatively new on the scene -- the idea only entered wider circulation in 1988, when Guy Jacobson introduced them in his CMU thesis. I have yet to find a digital copy of it online, so those who want to read the original thesis may have to order a paper copy or go to the CMU library to read it.
But there's been a lot of development on succinct data structures since then, and this post on the fast select algorithm was in part a continuation of a conversation from yesterday's post on succinct data structures in general and the open-source library SDSL 2.0. So if anyone wants to learn more about SDSs in general, there's a bunch of links to videos and reference material in the comments of yesterday's thread...
I wondered why they used TZCNT instead of BSF. If their algorithm has correctly located the word containing the jth set bit, the output of PDEP can't be zero, so the two instructions are equivalent here (and BSF has the advantage of not requiring BMI1 support, though if you already require BMI2 for PDEP, that hardly matters).
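The in-word select trick being discussed is a one-liner with BMI2: _tzcnt_u64(_pdep_u64(1ULL << j, x)). Here's a sketch that models both pdep and tzcnt in plain C, so it runs without BMI hardware; the function names are mine, not the paper's:

```c
#include <stdint.h>

/* Software model of PDEP: deposit the low bits of src into the
   positions of the set bits of mask, lowest first. */
static uint64_t pdep64(uint64_t src, uint64_t mask) {
    uint64_t dst = 0;
    for (uint64_t bit = 1; mask; bit <<= 1) {
        uint64_t lowest = mask & -mask;  /* lowest set bit of mask */
        if (src & bit)
            dst |= lowest;
        mask ^= lowest;                  /* clear that bit */
    }
    return dst;
}

/* Software model of TZCNT: count trailing zeros, defined as 64 for
   input 0 (BSF leaves its output undefined in that case -- the
   difference discussed in the comments). */
static unsigned tzcnt64(uint64_t x) {
    if (x == 0) return 64;
    unsigned n = 0;
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
}

/* select_in_word(x, j): position of the j-th (0-based) set bit of x.
   pdep64(1 << j, x) lands a single 1 at exactly that position, and
   tzcnt64 reads the position back. */
static unsigned select_in_word(uint64_t x, unsigned j) {
    return tzcnt64(pdep64(1ULL << j, x));
}
```

With real BMI2 the two instructions together take a handful of cycles, which is what makes this the fast path of in-word select.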
Aside from that bug, it's like you said: might as well use tzcnt from BMI1 if you are already using BMI2 since there are no CPUs implementing the latter but not the former (and IIRC the reverse is even true).
Using tzcnt should in principle be faster than bsf in many cases, since bsf always carries a dependency on its output register (because of the zero-input case), while tzcnt doesn't. In practice tzcnt had a false dependency anyway, at least until Skylake, where that was fixed (again IIRC -- I recall that one of these bit-manipulation instructions didn't have its false dependency fixed, but the rest did).
What did you think of their explanation of "memory parallelism" in section 4.3? I read it quickly, but thought it was pretty accurate. I liked that they made explicit the link between the number of instructions and the degree of parallelism. I was a little disappointed by the final statement that they "suspect" that the CPU is capable of only 8 parallel requests.
The analysis looks exactly right to me (they don't explain how they calculated MLP, but the numbers look about as expected for x86 ROB sizes/FB sizes).
One thing they don't mention is that the low MLP isn't inherent in the other algorithms! With the default/obvious implementation of a benchmark loop using each method you hit this issue, but in principle I think you could rewrite almost any algorithm that is "instruction count MLP limited" like this into one that isn't. A simple sketch: do a batch of N loads up front, store them into a temporary buffer, then do the computation part.
The loading part will have maximum MLP (after all, it's pure loads) and then the computation part will get all L1 hits unless you picked N wrong.
It loses a bit because the loads aren't as effectively overlapped with computation, but if the load time "dominates" as presented here, it would work well.
A more sophisticated approach would still try to overlap computation and loads at max MLP. You could do this by inserting additional prefetches or actual loads into the computation stream, enough to reach the required MLP. That's more complicated and more dependent on the actual hardware parameters, which is why I mentioned the other approach first.
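The batching idea above can be sketched in C. The table, index array, and popcount "computation" here are made-up stand-ins for whatever the real algorithm does per element; the point is only the load-phase/compute-phase split:

```c
#include <stddef.h>
#include <stdint.h>

enum { BATCH = 8 };  /* roughly the ~8 parallel requests the article suspects */

/* Naive loop: loads and computation are interleaved, so the number of
   misses in flight is limited by how many iterations fit in the
   out-of-order window. */
static uint64_t sum_naive(const uint64_t *table, const size_t *idx, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (uint64_t)__builtin_popcountll(table[idx[i]]);
    return acc;
}

/* Batched rewrite: a burst of independent loads fills a small buffer
   (maximum MLP, pure loads), then the compute phase runs over data
   that is now L1-resident. */
static uint64_t sum_batched(const uint64_t *table, const size_t *idx, size_t n) {
    uint64_t acc = 0, buf[BATCH];
    size_t i = 0;
    for (; i + BATCH <= n; i += BATCH) {
        for (size_t k = 0; k < BATCH; k++)   /* load phase */
            buf[k] = table[idx[i + k]];
        for (size_t k = 0; k < BATCH; k++)   /* compute phase */
            acc += (uint64_t)__builtin_popcountll(buf[k]);
    }
    for (; i < n; i++)                       /* tail */
        acc += (uint64_t)__builtin_popcountll(table[idx[i]]);
    return acc;
}
```

Both functions compute the same result; the batched one just reorders the memory traffic so the loads don't wait behind computation. (__builtin_popcountll is a GCC/Clang builtin; on MSVC you'd use __popcnt64 instead.)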
Just double checked and the false dependency problem is fixed for tzcnt and lzcnt, but not popcnt, in Skylake. So that's one (slightly) good reason to use tzcnt over bsf if possible, as of Skylake.
Stony Brook University is one of those universities that, while not especially famous from a general academic perspective, is top-notch in systems software.
I was confused by the conference this is submitted to: "42nd Conference on Very Important Topics (CVIT 2016)." I couldn't find any conference by that odd name.
Eventually I worked it out. I think they are planning to submit this as a conference paper, but at the time of creating the preprint weren't sure which conference. "Conference on Very Important Topics (CVIT 2016)" is just the placeholder that ships with the LIPIcs LaTeX template, standing in for where the actual conference/proceedings info would go on the article's first page.
https://github.com/facebook/folly/commit/b28186247104f8b90cf...