It's not really related to search; most of the cause is a pretty bizarre optimization method that reddit decided to implement fairly early on. The database structure is unusual and not very conducive to indexing (it's similar to an EAV model), so at some point they decided to basically write their own "secondary indexing"-like system that stores the "listing indexes" in memcached/Cassandra (with the data itself in PostgreSQL).
Whenever something happens that affects any listings (new post created, voting, etc.), the site figures out all the listings it needs to update and where in each listing the affected post now belongs, then updates them all. So for example, if you make a new submission to /r/pics, it will go through and add the post's ID in the right spot to the "new posts in /r/pics" listing, the "hot posts in /r/pics" listing, the "new posts by yourusername" listing, and so on. As it's going through and updating all these lists, it also trims each one down to 1000 items.
It's conceptually pretty similar to a normal database indexing system, but basically maintains all the indexes "manually" and restricts them all to the top 1000 items.
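To make the idea concrete, here's a minimal sketch of that "manually maintained listing" pattern. This is not reddit's actual code: the cache (a plain dict standing in for memcached/Cassandra), the listing names, and the scoring are all hypothetical.

```python
# A minimal sketch of manually maintained, capped listings.
# The cache, listing names, and scoring are made-up stand-ins.
import bisect

MAX_LISTING_SIZE = 1000
cache = {}  # listing name -> list of (negated sort key, post_id), best first

def update_listing(name, post_id, sort_key):
    """Insert a post into one cached listing, keeping it sorted and capped."""
    listing = cache.setdefault(name, [])
    # Store negated keys so bisect keeps the list ordered best-first.
    bisect.insort(listing, (-sort_key, post_id))
    # Trim to the top 1000 items, which is why older posts fall out of listings.
    del listing[MAX_LISTING_SIZE:]

def on_new_submission(post_id, subreddit, author, created_at, hot_score):
    """Fan the new post out to every listing it belongs in."""
    update_listing(f"new:{subreddit}", post_id, created_at)
    update_listing(f"hot:{subreddit}", post_id, hot_score)
    update_listing(f"new_by_user:{author}", post_id, created_at)

on_new_submission(post_id=12345, subreddit="pics", author="yourusername",
                  created_at=1_700_000_000, hot_score=42.0)
print(cache["new:pics"])  # [(-1700000000, 12345)]
```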
Thanks for the explanation. The APIs make a lot more sense in light of that. It seems odd they didn't just use Postgres indexing though. Do you know if they benchmarked it at any point?
For a lot of queries, EAV-type schemas are really hard to index efficiently. E.g. when searching for something like a = ? and b = ? where a and b are dynamic attributes, you can't just use a multi-column index with EAV. So you instead end up with two intermediate query results that then have to be intersected. If you need even semi-efficient querying, EAV usually isn't the answer.
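Here's a toy illustration of that (not reddit's schema; the table and attribute names are made up): with everything stored as attribute/value rows, a query for two attributes at once turns into a self-join, i.e. one index lookup per condition and then an intersection on the entity id.

```python
# Toy EAV table: querying "subreddit = 'pics' AND author = 'alice'" can't use
# a single composite index; each condition hits (attr, value) separately and
# the results are intersected via a self-join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thing_attr (thing_id INTEGER, attr TEXT, value TEXT);
    CREATE INDEX idx_attr_value ON thing_attr (attr, value);
    INSERT INTO thing_attr VALUES
        (1, 'subreddit', 'pics'), (1, 'author', 'alice'),
        (2, 'subreddit', 'pics'), (2, 'author', 'bob');
""")

rows = conn.execute("""
    SELECT a.thing_id
    FROM thing_attr a
    JOIN thing_attr b ON b.thing_id = a.thing_id
    WHERE a.attr = 'subreddit' AND a.value = 'pics'
      AND b.attr = 'author'    AND b.value = 'alice'
""").fetchall()
print(rows)  # [(1,)]
```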
But it is possible to use a hybrid data store: extract the important attributes you want to search on efficiently and store them as properly structured, indexed columns in the relational database. The other attributes can still be stored EAV-style in a different schema/database.
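A rough sketch of that hybrid layout, under the same caveat that all names here are made up: the frequently queried attributes become real columns with a normal composite index, and the long tail of attributes stays in an EAV side table.

```python
# Hybrid layout: hot attributes as indexed columns, everything else EAV-style.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hot attributes as proper columns, so a composite index applies directly.
    CREATE TABLE post (id INTEGER PRIMARY KEY, subreddit TEXT, author TEXT);
    CREATE INDEX idx_post_sub_author ON post (subreddit, author);

    -- The long tail of attributes stays flexible, EAV-style.
    CREATE TABLE post_extra (post_id INTEGER, attr TEXT, value TEXT);

    INSERT INTO post VALUES (1, 'pics', 'alice'), (2, 'pics', 'bob');
    INSERT INTO post_extra VALUES (1, 'flair', 'OC'), (2, 'flair', 'meme');
""")

# Answered straight from the composite index, no intersection step needed.
rows = conn.execute(
    "SELECT id FROM post WHERE subreddit = ? AND author = ?", ("pics", "alice")
).fetchall()
print(rows)  # [(1,)]
```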
If you're curious enough to dig around in the code, this is probably the main relevant file: https://github.com/reddit-archive/reddit/blob/master/r2/r2/l...