The deficiency with AND queries due to term-based partitioning reminds me of a hack I did a few years ago trying to build a basic IR system on SimpleDB:
http://codingthriller.blogspot.com/2008/04/simpledb-full-tex...
Of course, I was backed into this corner by the various constraints at the time, and this was just a fun hack. It sounds a bit precarious to consider using Riak for text search, at least compared to something like Solr, if it truly has issues with the basic bread and butter of search like sorting by recency and conjunctions. (of course I am only going on what is in this post, I have no experience with Riak)
It really isn't all that bad. When you compare what we had to do against the alternative of keeping two big data systems in sync (Riak, by contrast, is a breeze), I think we got off easy.
The baseline was a substantial test corpus that we scaled several orders of magnitude over a series of runs, all meant to simulate typical clip sizes with typical word frequencies. The 10-100x gains had two contributions, each in the 3-10x range. We tested against the standard search, which performed pagination incorrectly because it applied the slice before the sort. We also tested against two types of map/reduce jobs that implemented the sort and slice correctly (and had been in production).
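To illustrate what that slice-before-sort misorder looks like, here is a toy Python sketch (my illustration only, not the actual code on either side):

```python
# Toy illustration of the pagination bug: slicing before sorting gives
# locally sorted pages but globally wrong results. Not the actual code.

def paginate_broken(results, page, per_page):
    # Wrong order: slice the raw results, then sort just that window.
    window = results[page * per_page:(page + 1) * per_page]
    return sorted(window, key=lambda r: r["timestamp"], reverse=True)

def paginate_correct(results, page, per_page):
    # Correct order: sort the full result set by recency, then slice.
    ordered = sorted(results, key=lambda r: r["timestamp"], reverse=True)
    return ordered[page * per_page:(page + 1) * per_page]
```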
Ideally, we would have kept the data around to give a fuller report. But the truth is we did this over 9 months ago and didn't save the data. After informally sharing the impact with a lot of people, we heard a lot of encouragement to share the techniques.
And you're right, this is not hard-core science or engineering. But it is a good tip, which you can take or leave.
I'm still not seeing you report something that says "an operation that took X ms now takes Y ms." I mean absolute numbers with real units, milliseconds or seconds. That's what I'd like to see for a baseline.
And I won't even comment on the "we did this 9 months ago and did not save the data" part. That kind of stuff would not fly in medicine or science or most traditional engineering fields. Why should you be exempt?
To be clear: operations on our test corpus that now take X ms on average used to take 100X ms. Is that the statement you are looking for?
One thing that I think you're missing is that while we experienced a 100x gain for our application, our findings aren't strictly empirical in the sense that the gain is always 100x. In many applications it will be more. The insight in the blog post is analytical, not empirical. Specifically, most people (I think) would assume a text index to be document-partitioned. In Riak it's not. On a fundamental level this means that AND operations take time O(max(|A|,|B|)) instead of O(min(|A|,|B|)), where |A| and |B| are the sizes of the result sets that match query terms A and B, respectively. If you pick words at random from a power-law distribution (which is typical of all natural languages), you will more often than not see many orders of magnitude of difference between |A| and |B|. If you sample queries from real query streams, you will see the same. For some intuition, just think about a query that combines a common tag (like "funny" in the example) with something restricted (like a particular user). The first will typically be a roughly fixed fraction of the size of your corpus, while the second will have a size that is more like a constant.
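To make that concrete, here is a toy sketch (my illustration, not code from the post or from Riak) of why driving the intersection from the smaller posting list wins:

```python
# Illustration only: posting lists modeled as in-memory collections of
# doc ids. How Riak actually lays these out is an assumption on my part.

def and_scan_both(postings_a, postings_b):
    # What a term-partitioned index ends up doing: walk the full posting
    # list for each term, so the cost is O(|A| + |B|), i.e. dominated by
    # the larger list -- O(max(|A|, |B|)).
    return set(postings_a) & set(postings_b)

def and_probe_smaller(smaller_postings, contains_other):
    # The cheaper strategy: iterate only the smaller result set and probe
    # the other constraint with a direct per-document lookup, which costs
    # O(min(|A|, |B|)) lookups.
    return [doc for doc in smaller_postings if contains_other(doc)]

# Example: "tag:funny" matches millions of clips, "user:alice" matches a
# handful; probing from the user side touches only a handful of keys.
funny = set(range(1_000_000))          # stand-in for a huge posting list
alice = [3, 42, 999_999, 2_000_000]    # stand-in for a tiny one
print(and_probe_smaller(alice, funny.__contains__))  # -> [3, 42, 999999]
```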
Putting this all together, the savings from moving the intersection where we did will only grow on a relative basis as we get more users and clips. This hack costs 2x the storage but always gives O(min(|A|,|B|)) results instead of O(max(|A|,|B|)). The 100x win is just shorthand for saying that those two set sizes typically varied by two orders of magnitude because of the power-law distributions. But in academia, we refer to that as "really fucking big".
The "presort" option that we added will scale differently, but have equally dramatic impact because pagination is effectively broken in Riak search if you do not use relevance sort. In this case, the O(.) argument is murkier because one version is done entirely within the index (which is usually in RAM) and the other has to hit the disk. In formal engineering circles, we refer to this type of boost as "unfucking real".
So, we could have measured things with more precision, but this was such an obvious win that it was almost pointless to calibrate it further.
Hi TPSReport - You realize that "gwf" is Gary William Flake, who wrote an award-winning book on complex systems that is used in colleges worldwide, and who has run R&D at companies like Microsoft, Yahoo, and Overture, right? I'm sorry to say this, but personally I'm finding your comments on this post, as well as other posts, negative and lacking value. It would be great and helpful if you have any constructive/positive ideas, tips, or experience to share to add value to the community. I'm sharing this in a positive, constructive spirit. Thanks.
Let me reiterate my constructive suggestion that the OP provide actual data. I can see how the author wrote a pop book, as he is quite verbose, but any engineer can tell you that he is also quite evasive on technical questions. "What's the baseline?" should not pose a tough question for anyone writing about performance improvements.