Help Reddit build a recommender

zach · on March 18, 2012

Some may remember that Reddit used to have an item recommender a long time ago, back in its first year or so. It was a Bayesian classifier that, since it needed a bunch of input, only worked for the most hardcore members — who had already seen almost all of the recommendations!

This was originally the "hard problem" at the center of Reddit.

Let me explain what I mean by that. There used to be a quaint notion that to be a respectable tech startup, you had to have a "hard problem" (technologically speaking) at your core, which you had an innovative "secret sauce" solution for, preferably one you were patenting. After all, if not, then someone can just copy you and squash you like a bug, right?

Since then, YC's insistent focus on making something people want, Eric Ries' lean startup gospel and many entrepreneurs' own experiences have thankfully gone a long way to convince people (most importantly SV investors) that focusing on a "hard problem" is not only unnecessary, but may end up being a fatal distraction.

This is a pretty good example of how the "hard problem" can turn out to be completely irrelevant. Once it was clear that the recommendation engine wasn't a growth vector, the Reddit team seemed to drop it out of sheer pragmatism. They just needed to keep the site running.

I can't recall many who cared or even noticed that the "recommended" tab was gone. But from that point on, Reddit was more free to become not just a quirky "personalized news" startup, but what it has aspired to since: the front page of the internet. And only now, just now, do a good chunk of the millions of users think a recommender might be nice.

It's the startup version of "you aren't gonna need it" — if it doesn't drive growth, push it aside.

philwelch · on March 18, 2012

Thank god tech startups no longer have to actually develop technology.

I suspect the "hard problem" idea might not actually be that antiquated. Most of the newer batch of startups are either not profitable at all or not profitable enough to justify VC investment. There's almost no technology risk, but the tradeoff in market risk is such that even if you succeed, you're not as profitable. This is the trap Reddit is in.

DanielRibeiro · on March 18, 2012

Earlier today, on Clojure West, the founder of GetPrismatic[1] made a very interesting presentation. He is trying to solve this problem. Real time machine learning is a hard problem.

Hopefully it will be something people want, because I want it as well. But if it is not, I'd rather he pivoted into something where he can succeed than taking his startup to the ground because of me. A startup that fails helps nobody.

[1] http://getprismatic.com/

achompas · on March 18, 2012

Is this presentation online somewhere? Sounds v. interesting, especially if it's Bradford Cross.

DanielRibeiro · on March 18, 2012

Not yet: https://github.com/strangeloop/clojurewest2012-slides

InfoQ taped it, but they may not release it for up to 6 months.

And yes, it was Bradford Cross.

citricsquid · on March 18, 2012

I think you went too far.

You're correct that reddit didn't/doesn't need any form of recommendation engine for individual items -- that's what voting + subreddits are -- it most definitely did (and still does) need something for recommending subreddits. This is a problem they should have been addressing from day 1, it isn't make or break but it takes the site way beyond its current usefulness to hardcore users.

defrost · on March 18, 2012

It was something like three years before reddit had subreddits.

Hardly a day 1 problem.

rayhano · on March 18, 2012

Just on a slight tangent - have we gone too far and forgotten to solve hard problems; leaving them to the likes of Apple, Samsung and sequestered Universities?

jstepien · on March 18, 2012

During the previous semester I spent some time building a recommender using this data as a project for a data mining class. It turned out to be far more challenging than I had initially anticipated.

I've used methods known as collaborative filtration, whose goal was to estimate how a given user would rate a given item basing on knowledge of preferences of other users of similar interests. The initial scope included a naïve Bayesian classifier and a technique called Slope One [1]. The latter one is particularly interesting as according to claims of its authors allows to make a very good estimation in a very short time using solely a very simple linear model. The preprocessing is both time- and space-wise expensive though as it requires you to build a matrix of deviations between rated items.

After reducing the data set to a single subreddit and filtering it from users who weren't avid voters I ran the algorithms and after some tuning I was very content to see promising ROC curves and decent AUC values. Models built around NBC and S1 achieved comparable results when it came to such metrics as precision, recall and F-measure.

When I went to discuss the results with the professor teaching the class I've heard "That's indeed promising, but how about comparing those results with a really naïve model which would just take an average of existing votes by a given user?". Guess what: the model built solely using a single call to the avg function was nearly as good as the NBC and S1 models.

Now I understand why the guys from Reddit are looking for external help with the recommender. It's a way less obvious task than it might seem to be.

[1] http://lemire.me/fr/documents/publications/lemiremaclachlan_...

Edit: s/machine learning/data mining/

_dps · on March 18, 2012

Out of curiosity, did you compare to any other baselines? I suspect you did a lot better than you think you did, because that particular baseline is actually very misleading for ranking/recommendation tasks (this is a common source of confusion for newcomers). Here's why, in two parts:

1) Say you estimate (as you propose) that a user will always give their average rating. This might get you good-ish error and ROC as a prediction task, but will give zero recommendation value because the prediction for a given user will be constant for all possible recommendations.

2) Say you estimate that a user will give the average score that the item has received across all users. Again, possibly good-ish in terms of prediction ROC and RMS error, but this offers no personalization (all users get the same predictions, i.e. you're basically just showing the default Reddit ranking).

Both of these baselines are vastly inferior to even really stupid models like "how many times have I upvoted stories from this submitter" in terms of recommendation value, but the latter is (if I recall from my own experiments) much worse when evaluated on the basis of overall ROC.

I would strongly suspect that a correctly implemented NB or S1 would vastly outperform either of the two baselines in terms of actual recommendation utility (even though when you look at the baseline's ability to predict actual numbers, they might be comparably good in an RMS sense).

The moral of the story: one must be very careful when trying to quantify the performance of learning systems; actual utility is often difficult to evaluate merely by looking at standard statistical measures of accuracy.

jstepien · on March 19, 2012

No, I didn't make any comparisons to other baselines. Thanks a lot for sharing your thoughts; I'll have to reconsider the results I got in the light of your comment.

greendestiny · on March 18, 2012

I implemented Slope One for the netflix prize and found its results pretty unimpressive. So I decided to extend it to build a SVD predictor of Slope One values figuring it might do better than SVD by itself. It didn't.

Turns out increasing the dimensionality of the input 17 thousand times just reduces the amount training data for each attribute. Duh :)

espeed · on March 18, 2012

The Neo4j User Group wants to help with this (https://groups.google.com/d/topic/neo4j/rkhjlQx-bfo/discussi...).

Gremlin (https://github.com/tinkerpop/gremlin/wiki) works great for real-time recommendations.

See "A Graph-Based Movie Recommender Engine" by Gremlin's creator, Marko Rodriguez (http://markorodriguez.com/2011/09/22/a-graph-based-movie-rec...)

mikeklaas · on March 18, 2012

For anyone who's trying this, I recommend basing your effort on factor models (i.e., the thing that won the netflix prize). It works very well for us at Zite.

(Content models are the other, probably less interesting, 50% of the solution.)

krelian · on March 18, 2012

This is a bit old, no? Anyway, I don't need to a recommender, I need a better way to let me affect the weight different subs have on the homepage. I need a way to group different low traffic subreddits together so that I won't miss their content among the high traffic ones.

Reddit's old interface doesn't work anymore now that there are so many subs. The fact that there has been so little interface improvements in the last couple of years is pretty sad. I can't imagine browsing the site without RES.

The way things work now only helps to magnify the lower quality trend because the homepage gives undue weight to content from popular subs.

ebf · on March 18, 2012

I've noticed this problem lately. Last night, I saw that a large percentage of my front-page was from 1 subreddit. I'm also subscribed to over 200 subreddits. Some of these subreddits never hit my frontpage, and I tend to visit only 10-20 subreddits. This results in the majority of my subreddit subscriptions being useless.

I think there could be some interesting UI solutions to this problem. If more people treated the Reddit API like the Twitter API, there could be applications that aren't necessarily supposed to replace the traditional Reddit browsing experience, but to make whole new experiences (e.g. Flipboard).

stevengg · on March 18, 2012

You can only see 50 on the front page as a regular user and 100 if you pay for reddit gold they update every 30 minutes. This is the main reason I removed reddits like r/thewire, r/archlinux and /r/bookclub because they so infrequently get posts its not worth having them clogging up one of my 50 spots.

wrath · on March 18, 2012

This is a pretty open ended problem!

How to you measure success? After I create my algorithm how do I know that I'm close to what reddit wants? Without answers to these questions, IMO, this is an exercise in futility. I'm not close enough to the project but written my fair share of classifiers and clustering engines any machine learning problem there needs to be a way to measure success. My point of view on a great result is different from reddits for sure.

thedark · on March 18, 2012

This is exactly the sort of thing a properly implemented tagging system would have solved. Along with their notorious search problems. Along with the difficulty in finding subreddits. Along with discovering old content. 6 years later I maintain this as a mistake.

naner · on March 18, 2012

What would a "properly implemented tagging system" look like on a site like reddit? I know they have been rejecting the idea for years and intentionally went with subreddits to handle the growth, encourage small disparate communities, etc.

citricsquid · on March 18, 2012

A story about Startups can belong in multiple subreddits, eg: r/startups, r/entrepeneurs, r/business.

If a story had tags and there was a system where the frequency of tags appearing in a subreddit mattered it would allow me to look at r/startups and then find the other subreddits relevant to my interests.

reddit made the mistake of treating every subreddit as its own individual isolated community without considering crossovers in interests. If tagging existed then this would not have been a problem. Today 6 years on it's still impossible to find good subreddits relevant to specific interests, tags would have been one of the solutions for that.

raldi · on March 18, 2012

The reddit founders talked and thought about this tremendously, and ultimately decided that it was more important to have distinct communities, so that the same story can be on /r/aww and /r/photography without one group overrunning the other. Or /r/TwoXChromosomes and /r/MensRights. Or /r/politics and /r/economics.

I think that this was one of the most important strategic decisions in reddit's history, and that they got it right.

I'm not saying tags can never work, just that any proposed tags system needs to supplement, not destroy, the siloing of subreddit communities. And be simple to use, even for the 99% of redditors who never even vote or subscribe to anything.

underwater · on March 18, 2012

I was strongly opposed to subreddits when they were proposed. I thought tagging, like Delicious did, would be a better solution.

I'll happily admit that tagging could not have grown Reddit to anywhere near its current size without the site collapsing on itself. The different feel to each community (compare F7U12 are AskScience) is much more appealing than a single homogeneous group. However, I think that it was the first step towards breaking the promise to create an personalized news aggregator. I for one was disappointed when the recommended posts feature was dropped.

The initial missteps with whitelabel sites like the Wired-branded reddit and lipstick.com are amusing in hindsight. I'm not sure how reddit with a pink background with Courier as the primary font was supposed to attract a female audience.

citricsquid · on March 18, 2012

My idea of how tags work would be on top of subreddits. Subreddits are a fantastic idea and make reddit reddit but tags would work along side that and a way to associate stories with multiple subreddits.

For example, if a post could be tagged "startups" and was posted to r/business, when I tried to find other subreddits besides r/startups about startups I could search "subreddits with x or more stories tagged "startups"" and I'd be presented with r/business.

They wouldn't exist as a replacement for subreddits, they'd exist along side and serve as a way to connect subreddits by topic. Subreddits currently exist as their own entities with no crossover which doesn't work well for expanding a users subscriptions to other subreddits relevant to their interest.

tomjen3 · on March 18, 2012

They would be useful for something like what stackoverflow does by allowing people to block tags or highlight others (e.g. Block Ron Paul posts in /r/politics).

That said don't listen to me. I have quit using reddit, except for /r/gonewild.

raldi · on March 18, 2012

Negative filtering would be a disaster. The power users do most of the voting and almost all of the reporting. If they all could block Ron Paul, those stories wouldn't get downvoted and, when offtopic, reported. This would cause the Ron Paul stories to take over the site for the 99% of users who wouldn't be using the filter.

y3di · on March 18, 2012

I completely agree with this. It's actually part of the philosophy that inspired the structure of my new communications platform app.

Check it out: http://ec2-50-16-106-77.compute-1.amazonaws.com/ - (it's still in its infancy)

You can link the same post as a reply to multiple items. This allows for complete flexibility. Posts that are relevant to more than one section can live in each of those places

bostonvaulter2 · on March 18, 2012

Oh, so you're saying tags in addition to the subreddits, but tags don't cause posts to show up in subreddits. I think that could work, but I'd be most wary of burdening down the poster with selecting tags or distracting people by having them "vote" or suggest tags.

tomjen3 · on March 18, 2012

Make it so that tags are suggested only from the page where you can see the comments and that no tag show up unless $NUMBER users have suggested the tag (so arbitary tags aren't developed). Allow users to vote on public tags. If you suggested a tag that gets downvoted, you lose karma.

raldi · on March 18, 2012

That's close to the system I would advocate. But it's hard to get something like that right, and until this past year, reddit never had the manpower for such a huge project.

bane · on March 18, 2012

In effect, the subreddits are the tags. There's been a number of times where I followed a cross post to another subreddit I hadn't known about before.

rhizome · on March 18, 2012

Tagging is still highly underleveraged.

yoshyosh · on March 18, 2012

Could you expand on that? If you were to build or theorize a tagging system for them, how would you have implemented it? From a UX perspective tagging can feel tedious and sometimes leaves the user unsure how to go about tagging the content they put up. It is something I debate a lot when designing the upload flow for user generated content.

swalsh · on March 18, 2012

Fortunately Reddit is the central bank for Karma. They can crowd source it and give karma for tagging articles. If a tag is flagged, you lose the karma. Karma for some unknown reason motivates people.

nostromo · on March 18, 2012

Given Reddit's massive size and tiny payroll -- I'm surprised they've managed to keep the site up at all!

qxcv · on March 18, 2012

Tagging is good for categorisation (which reddit has already implemented using "subreddits"), but tends to result in a lot of noise when users can apply arbitrary tags[1]. Additionally it's a lot of extra work on the part of the user to specify meaningful tags - this is especially undesirable considering that reddit's aim is reducing the amount of work required to submit pages. What would be more useful is an subreddit discovery feature allowing users to subscribe to subreddits which interest them (this would also reduce a lot of the cross-posting because similar subreddits like /r/python and /r/programming would share a large segment of their audience).

[1]: http://youtube.com/

PaulHoule · on March 18, 2012

Collaborative filtering is a boring problem and doesn't get to the heart of what's wrong with Reddit, Hacker News, and such.

For one thing, many good stories languish on the "new" page and never get enough votes to get a fair shake. Collaborative filtering doesn't help with this, if anything it makes it worse.

Last night I made a crude boomerang by glueing two rulers together, this morning it had set and my son pressured me to try throwing it before I'd even finished my breakfast. Right when it started to curve, it hit a telephone pole and broke at the glue joint.

When I see many of the things people want to do on reddit, my first impression is it will wind up like that. For instance, LSI is one of those things that does not work so well in real life... They still seem to be teaching kids about it, but not that you get results almost good doing dimensional reduction with a random basis set.

If you've got some semantic analysis and predictive models, you can make an automated system that picks quality relevant content out of the "new" queue and because you can use smart feature selection you don't need to wrangle as much data -- training is orders of magnitude faster and you don't need to futz around with hadoop.

lars · on March 18, 2012

I don't think you need voting data. Rather, answer the question: "what subreddit is similar to this particular subreddit". Then you use the degree of overlap in subscribers as a distance measure between subreddits. Use a tf*idf like approach, so popular subreddits are weighted less.

Then the similarity of r/programming to r/coding would be based on two numbers:

    b = number of people subscribed to both r/coding and r/programming
    n = number of people subscribed to r/coding
    similarity = b/n

markkat · on March 18, 2012

I'm not sure if it's a recommendation engine they need, but they do need a better way to find subreddits. IMO some sort of a map might work better than a rec engine. Or even just a quick way to see what subreddits another user subscribes to.

mumrah · on March 18, 2012

I think that, given the volume of users on reddit and the volume of content they interact with, any of the various collaborative filtering techniques would work well at this point.

You could take it a step further and incorporate more than explicit up/down vote features, such as "clicked", "commented", "saved", etc.

Then incorporate some business rules that filter recommendations by subreddits, boost results by time, and now you have a decent recommender.

Easier said than done of course.

cop359 · on March 18, 2012

So this is a StumbleUpon for Reddit?