The authors claim their algorithm is novel, but it has actually been a well-known folklore technique for quite a while. For example, it was used in Folly as early as Jan 2016:
Yeah, I definitely saw this technique in Intel sample code demonstrating use of BMI before HSW was released. Probably 2011, maybe 2012.
That said, we worry too much about who the first person to have an idea was. Any idea worth having is come upon independently by many people, and putting the idea to work and communicating it to others are more valuable contributions anyway.
I didn't get that impression at all. This is "get me the rank (index) of the first occurrence of x" in a succinct data structure. Nothing to do with relational algebra AFAICT except to the extent I suppose that some relational algebra implementations may use such data structures under the hood.
Yes, but 'select' is the name of that operation in the relational algebra. Which confuses some people whose first exposure to that model is through SQL, where the keyword implies a combination of 'select' and 'project'. I think the commenter's intent was to clarify that.
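For readers new to the terminology, here's a minimal C sketch pinning down the two operations. These are naive O(n) definitions over a byte-per-bit array, purely to fix the meaning of rank/select -- nothing like the constant-time succinct versions the paper is about:

```c
#include <stddef.h>
#include <stdint.h>

/* rank1(B, i)    = number of 1 bits in B[0..i)            */
/* select1(B, j)  = position of the j-th 1 bit (0-based)   */

static size_t rank1(const uint8_t *bits, size_t i) {
    size_t r = 0;
    for (size_t k = 0; k < i; k++)
        r += bits[k];
    return r;
}

static size_t select1(const uint8_t *bits, size_t n, size_t j) {
    for (size_t k = 0; k < n; k++)
        if (bits[k] && j-- == 0)
            return k;
    return n; /* fewer than j+1 set bits */
}
```

Note the two are inverses in the sense that rank1(B, select1(B, n, j)) == j whenever the j-th set bit exists.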
Yeah, succinct data structures aren't something taught in most intro data structures courses, and the concept may be unfamiliar even to old hands, since SDSs are relatively new on the scene -- the idea only entered wider circulation in 1988, when Guy Jacobson introduced them in his CMU thesis. I have yet to find a digital copy of it online, so those who want to read the original thesis may have to order a paper copy or go to the CMU library to read it.
But there's been a lot of development on succinct data structures since then, and this post on the fast select algorithm was in part a continuation of a conversation from yesterday's post on succinct data structures in general and the open-source library SDSL 2.0. So if anyone wants to learn more about SDSs in general, there's a bunch of links to videos and reference material in the comments of yesterday's thread...
I wondered why they used TZCNT instead of BSF. If their algorithm has correctly located the word containing the jth set bit, the output of PDEP can't be zero, so the two instructions are equivalent here (and BSF has the advantage of not requiring BMI1 support, though if you already require BMI2 for PDEP, that hardly matters).
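The in-word select trick being discussed is a one-liner with BMI2: _tzcnt_u64(_pdep_u64(1ULL << j, x)). Here's a sketch that models both pdep and tzcnt in plain C, so it runs without BMI hardware; the function names are mine, not the paper's:

```c
#include <stdint.h>

/* Software model of PDEP: deposit the low bits of src into the
   positions of the set bits of mask, lowest first. */
static uint64_t pdep64(uint64_t src, uint64_t mask) {
    uint64_t dst = 0;
    for (uint64_t bit = 1; mask; bit <<= 1) {
        uint64_t lowest = mask & -mask;  /* lowest set bit of mask */
        if (src & bit)
            dst |= lowest;
        mask ^= lowest;                  /* clear that bit */
    }
    return dst;
}

/* Software model of TZCNT: count trailing zeros, defined as 64 for
   input 0 (BSF leaves its output undefined in that case -- the
   difference discussed in the comments). */
static unsigned tzcnt64(uint64_t x) {
    if (x == 0) return 64;
    unsigned n = 0;
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
}

/* select_in_word(x, j): position of the j-th (0-based) set bit of x.
   pdep64(1 << j, x) lands a single 1 at exactly that position, and
   tzcnt64 reads the position back. */
static unsigned select_in_word(uint64_t x, unsigned j) {
    return tzcnt64(pdep64(1ULL << j, x));
}
```

With real BMI2 the two instructions together take a handful of cycles, which is what makes this the fast path of in-word select.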
Aside from that bug, it's like you said: might as well use tzcnt from BMI1 if you are already using BMI2 since there are no CPUs implementing the latter but not the former (and IIRC the reverse is even true).
Using tzcnt should in principle be faster than bsf in many cases, since bsf always carries a dependency on its output register (because of the zero-input case), while tzcnt doesn't. In practice tzcnt had a false dependency anyway, at least until Skylake, where that was fixed (again IIRC -- I recall that one of these bit-manipulation instructions didn't have its false dependency fixed, but the rest did).
What did you think of their explanation of "memory parallelism" in section 4.3? I read it quickly, but thought it was pretty accurate. I liked that they made explicit the link between the number of instructions and the degree of parallelism. I was a little disappointed by the final statement that they "suspect" that the CPU is capable of only 8 parallel requests.
The analysis looks exactly right to me (they don't explain how they calculated MLP, but the numbers look about as expected for x86 ROB sizes/FB sizes).
One thing they don't mention is that the low MLP isn't inherent in the other algorithms! With the default/obvious implementation of a benchmark loop using each method you hit this issue, but in principle I think you could rewrite almost any algorithm that is "instruction count MLP limited" like this into one that isn't. A simple sketch: do a batch of N loads up front, store them into a temporary buffer, then do the computation part.
The loading part will have maximum MLP (after all, it's pure loads) and then the computation part will get all L1 hits unless you picked N wrong.
It loses a bit because the loads aren't as effectively overlapped with computation, but if the load time "dominates" as presented here, it would work well.
A more sophisticated approach would still try to overlap computation and loads at max MLP. You could do this by inserting additional prefetches or actual loads into the computation stream, enough to reach the required MLP. That's more complicated and more dependent on the actual hardware parameters, which is why I mentioned the other approach first.
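The batching idea above can be sketched in C. The table, index array, and popcount "computation" here are made-up stand-ins for whatever the real algorithm does per element; the point is only the load-phase/compute-phase split:

```c
#include <stddef.h>
#include <stdint.h>

enum { BATCH = 8 };  /* roughly the ~8 parallel requests the article suspects */

/* Naive loop: loads and computation are interleaved, so the number of
   misses in flight is limited by how many iterations fit in the
   out-of-order window. */
static uint64_t sum_naive(const uint64_t *table, const size_t *idx, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (uint64_t)__builtin_popcountll(table[idx[i]]);
    return acc;
}

/* Batched rewrite: a burst of independent loads fills a small buffer
   (maximum MLP, pure loads), then the compute phase runs over data
   that is now L1-resident. */
static uint64_t sum_batched(const uint64_t *table, const size_t *idx, size_t n) {
    uint64_t acc = 0, buf[BATCH];
    size_t i = 0;
    for (; i + BATCH <= n; i += BATCH) {
        for (size_t k = 0; k < BATCH; k++)   /* load phase */
            buf[k] = table[idx[i + k]];
        for (size_t k = 0; k < BATCH; k++)   /* compute phase */
            acc += (uint64_t)__builtin_popcountll(buf[k]);
    }
    for (; i < n; i++)                       /* tail */
        acc += (uint64_t)__builtin_popcountll(table[idx[i]]);
    return acc;
}
```

Both functions compute the same result; the batched one just reorders the memory traffic so the loads don't wait behind computation. (__builtin_popcountll is a GCC/Clang builtin; on MSVC you'd use __popcnt64 instead.)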
Just double checked and the false dependency problem is fixed for tzcnt and lzcnt, but not popcnt, in Skylake. So that's one (slightly) good reason to use tzcnt over bsf if possible, as of Skylake.
Stony Brook University is one of those universities that, while not especially famous from a general academic perspective, is top-notch in systems software.
I was confused by the conference this is submitted to: "42nd Conference on Very Important Topics (CVIT 2016)." I couldn't find any conference by that odd name.
Eventually I worked it out. I think they are planning to submit this as a conference paper, but at the time of creating the preprint weren't sure which conference. "Conference on Very Important Topics (CVIT 2016)" is just the placeholder that ships with the LIPIcs LaTeX template, standing in for where the actual conference/proceedings info would go on the article's first page.
https://github.com/facebook/folly/commit/b28186247104f8b90cf...