Postgresql's new recursive queries at Disqus: 500% speedup

shimon · on May 30, 2010

I once heard of a system for representing threaded comments in a relational database, wherein the ID of each comment was a slash-separated list of its ancestors, e.g.:

1 Initial comment

1/1 Re: Initial comment

1/2 Re: Initial comment

1/2/1 Re: Re: Initial comment

1/2/2 ...

These sorts of IDs have the wonderful property that you can efficiently select any subtree using a LIKE query, e.g. "SELECT * FROM comments WHERE path LIKE '1/%'". These queries are efficient because they use string btree indexes. And the results may even be sorted the right way, provided you have monotonically increasing IDs within each level.

I think I read about this technique being used on Slashdot a few years ago, but I can't remember the name or find any reference. Still, it seems like a way to achieve the same benefits that the OP cites for comments with PostgreSQL's ltree support on any database with reasonable string indexing.

Anyone know the name of that technique?

larrykubin · on May 30, 2010

I believe you are referring to Materialized Path:

http://www.mjcblog.net/2009/03/full-tree-selects-via-materia...

shimon · on May 30, 2010

Right on. I don't recall hearing that particular name, but it's certainly the same idea.

tszming · on May 30, 2010

It should be called "Materialized Path".

See: http://troels.arvin.dk/db/rdbms/links/#hierarchical

papersmith · on May 30, 2010

I think for that you'll have to use the ltree index, rather than regular string index, else it'll sort like this:

1/2/3

11/2/3

2/2/3

I wonder if there're ltrees for other DBs.

fhars · on May 31, 2010

That, or restrict the maximum number of comments at a given level to a large, but fixed number that you can express as a fixed size number, say two digits in base 62 (digits are 0-9a-zA-Z). Or, if you use postgres, make the index column a list of integers.

seldo · on May 31, 2010

A few DBs within Yahoo which require threading use this pattern (the one I worked on used hyphens as the separator, though).

zeeg · on May 31, 2010

The ltree contrib module also solves this problem in PostgreSQL.

hendler · on May 30, 2010

Personally have used this technique with SOLR/Lucene as well.

megaman821 · on May 30, 2010

If you are using Django, http://bitbucket.org/tabo/django-treebeard/overview, is a great app to use. It implements several different tree storage methods.

bacarter · on May 30, 2010

I used these for a commenting system as well, they're really quite amazing. There's basically three options when storing a tree in a relational database - adjacency lists which are slow to read, nested sets which are slow for writes, and materialized path, which is slow for moving around nodes. It's good to have a fourth option which is straightforward to implement and has excellent performance.

zeeg · on May 31, 2010

Just wanted to throw an update over here. The examples in my post are specifically geared towards a basic threaded comment structure, which Disqus is not. Our usage is much more complex than this example, but we are able to handle sorting by dynamic data (which we use in our "Most Popular" sort algorithm).

In other words, recursion still works, and in our tests has proven faster than doing it at the application layer (no matter how you sugar coat it).

csytan · on May 30, 2010

I tackled this problem on appengine a while ago.

Turns out that the average discussion thread has few enough comments that they can be fetched from the db in one call (e.g. by the thread_id). This makes even deeply nested comments fairly simple to query for, and then it is quite easy to rebuild the tree in memory.

I think reddit uses this method too.

seldo · on May 31, 2010

Ah, the dreaded phrase "it turns out", offering assertions with no evidence ( http://jsomers.net/blog/it-turns-out ) :-)

I believe the author's point is that using these recursive SQL queries is more efficient, as the DB engine is inherently faster than your (presumably) scripting language. So this solution is for when the in-memory method is failing to scale.

jbm · on May 31, 2010

Is the DB engine really inherently faster than the scripting language?

http://highscalability.com/blog/2010/3/23/digg-4000-performa...

Not saying necessarily that the DB engine is less efficient, but rather that it is serves as a limiting factor and anything you can offload onto other servers is a good solution (for highly-visited websites).

trezor · on May 31, 2010

Is the DB engine really inherently faster than the scripting language?

As for anything, the answer is "it depends". It depends on what you want to do. In this case that is grab all rows in a specific dataset.

If you in any case want to grab the dataset X, consisting of a given set of rows, asking the database to figure how to fetch all rows in one operation as efficiently as possible will (or at least should on any respectable database engine) be more efficient than having the scripting language sending multiple requests which can only be optimized one by one.

For reference, I've only tried recursive queries in Microsoft SQL Server, but they are pretty fucking neat from a geeky perspective.

matwood · on June 2, 2010

Yeah, I was about to mention that MSSQL has had this 'feature' for years. I've used it a lot for retrieving organizational structures from the typical parent-child table structure.

The best part of it, is that it is MUCH simpler than either trying to roll your own through cursors or dealing with it in whatever language outside the database.

csytan · on June 1, 2010

Here's the evidence:

http://www.reddit.com/r/programming/comments/8q2pz/ask_reddi...

In my case, Appengine does not support recursive SQL queries, so another method had to be used.

seldo · on May 31, 2010

The Oracle equivalent of this functionality if called CONNECT BY PRIOR:

http://ss64.com/ora/connectby.html

I used it back in 2002 to build some messageboards and it was pretty effing magical. Glad to see equivalent functionality finally make it to OSS.

petervandijck · on May 30, 2010

Cool. I thought I was the only one that couldn't figure out a nice way to do threaded comments, turns out even reddit orders stuff in memory.

iamelgringo · on May 30, 2010

Threaded comments are surprisingly difficult to model in a relational database. I spent about a month on the problem when I rolled my own threaded comment system for newsley. Relational Databases really don't lend themselves to easily modeling tree structures.

If you're interseted in doing something like this, I'd highly recommend this book: http://www.amazon.com/Hierarchies-Smarties-Kaufmann-Manageme...

He has two really good chapters that specifically building a comment system. It takes a bit of work, and you end up pre-computing the comment system on INSERT's instead of SELECT's. But, you can build a comment system that pulls your entire comment tree in order using a single SELECT ... FROM COMMENTS ... ORDER BY comment.node_left;

papersmith · on May 30, 2010

Personally I just sort everything in memory like reddit, since I need to traverse all branches to calculate weighed scores, which is a balance of vote counts and age. I think reddit/HN do something similar? I wonder how Disqus handles this.

I'm not really sure how to do that in SQL.