
Really cool! Couple of comments:

1. I'm assuming you downloaded comment threads from the front page of each of the subreddits you looked at and then looked at the subreddits each of the posters had commented in (a rough sketch of the crawl I'm imagining is at the end of this comment). How many requests did you end up making?

2. Did you hand select the subreddits you analysed? If so, what criteria were you looking for?

3. Have you thought about doing any more research into this area? I made http://redditgraphs.com/ and was looking into ways of guessing a user's age & gender based on their commenting history. I found some papers about similar sites:

twitter: http://www.aclweb.org/anthology-new/D/D11/D11-1120.pdf

blogspot: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136...

youtube: http://static.googleusercontent.com/external_content/untrust... (This one looks the most promising; using their methods, you could treat subreddits like youtube videos to build more accurate profiles of communities and users. They also examine the propagation of speech patterns, which captures the spread of some memes.)

Unfortunately, reddit doesn't have user profiles or name-like usernames (so there isn't an easily available training set), and I was having difficulties organizing and analyzing the large amount of data I was downloading, so I put the project aside. There has been basically no research done specific to reddit (http://scholar.google.com/scholar?as_ylo=2008&q=reddit+d...), which surprises me given its size and unique subreddit system.

4. If you want to examine the spread of memes, you need access to old threads. http://stattit.com/ is the best way of getting around the reddit API's limit of only returning the 1000 most recent posts.

5. Last month, a similar data set (which only looked at reddit) was collected - I think you're trying to do something different and your presentation is much better, but you might be interested in the discussion: http://www.reddit.com/r/TheoryOfReddit/comments/126pth/scrap...
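
For concreteness, here's a rough sketch of the crawl I'm imagining in point 1, against reddit's public JSON listings. The endpoint, pagination, page count, and sleep interval are my guesses at how you'd do it, not your actual code:

    # Count the subreddits a user has recently commented in, via the
    # public /user/<name>/comments.json listing. The page count and
    # sleep interval here are illustrative, not the author's values.
    import time
    from collections import Counter

    import requests

    HEADERS = {"User-Agent": "subreddit-overlap-sketch/0.1 (research)"}

    def subreddits_for_user(username, pages=4):
        counts = Counter()
        after = None
        for _ in range(pages):
            params = {"limit": 100}
            if after:
                params["after"] = after
            resp = requests.get(
                f"https://www.reddit.com/user/{username}/comments.json",
                headers=HEADERS, params=params, timeout=10,
            )
            resp.raise_for_status()
            listing = resp.json()["data"]
            for child in listing["children"]:
                counts[child["data"]["subreddit"]] += 1
            after = listing["after"]
            if after is None:
                break
            time.sleep(2)  # stay well inside the rate limit
        return counts

Running something like that once per front-page commenter and merging the Counters would give the subreddit-overlap data, which is why I'm curious about the total request count.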



I looked at about 60,000 distinct users. But you're right about my overall strategy. I chose all of the subreddits with over some number of subscribers (I forget what the number was now) and ended up with 433 subs. I filtered the current default subreddits out of this visualization.
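
Roughly, the selection step was along these lines - the listing endpoint here is just one way to enumerate subreddits, and the subscriber threshold and defaults set are stand-ins since I forget the real values:

    # Page through the "popular" subreddit listing, keep anything over a
    # subscriber threshold, and drop the current defaults.
    # MIN_SUBSCRIBERS and DEFAULTS are placeholders, not the real values.
    import time

    import requests

    HEADERS = {"User-Agent": "subreddit-overlap-sketch/0.1 (research)"}
    MIN_SUBSCRIBERS = 50_000                    # placeholder cutoff
    DEFAULTS = {"pics", "funny", "AskReddit"}   # ...plus the rest of the defaults

    def popular_subreddits(max_pages=50):
        keep, after = [], None
        for _ in range(max_pages):
            params = {"limit": 100}
            if after:
                params["after"] = after
            data = requests.get(
                "https://www.reddit.com/subreddits/popular.json",
                headers=HEADERS, params=params, timeout=10,
            ).json()["data"]
            for child in data["children"]:
                sub = child["data"]
                if (sub.get("subscribers") or 0) >= MIN_SUBSCRIBERS and \
                        sub["display_name"] not in DEFAULTS:
                    keep.append(sub["display_name"])
            after = data["after"]
            if after is None:
                break
            time.sleep(2)
        return keep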

One thing I've been wondering about in terms of reddit research - have you looked into this at all? - is that reddit has users check a specific box if they're OK with their voting data being used for research, even though it's already public. My question, then, is this: is it somehow wrong to use already-public data for research? Anyway, I talk about my original aims for the project in some other comments.

Thanks for the link to stattit. My strategy for getting enough threads for my other project was just to keep a slow scraper running for a month and then go back to it - stattit will be incredibly helpful.


> One thing I've been wondering about in terms of reddit research - have you looked into this at all? - is that reddit has users check a specific box if they're OK with their voting data being used for research, even though it's already public. My question, then, is this: is it somehow wrong to use already-public data for research?

Based on the dozens (at least) of papers published each year that use twitter data, I'm pretty sure it's kosher to use public posts. You might want to double-check with your IRB, though. Depending on how you present the information, some users might be concerned about their privacy - I wrote a bot that replied to people posting variations of 'your comment history' with a link to the referenced person's redditgraph, and several people said they were creeped out by it (a little more here, if you're interested: http://www.roadtolarissa.com/redditgraphs-retrospective/).

Depending on what you're looking for, the rate limit might slow you down a lot; you might want to contact the site admins:

> tl;dr If you need old data, we'd much rather work out a way to get you a data dump than to have you scrape.

https://groups.google.com/forum/?fromgroups=#!topic/reddit-d...
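
If you do end up scraping anyway, a tiny throttle along these lines keeps a long-running job at a steady pace (the two-second interval is the commonly cited guideline for unauthenticated clients - double-check the current API rules before relying on it):

    import time

    class Throttle:
        """Keep successive requests at least min_interval seconds apart."""

        def __init__(self, min_interval=2.0):
            self.min_interval = min_interval
            self._last = 0.0

        def wait(self):
            elapsed = time.monotonic() - self._last
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self._last = time.monotonic()

    # usage in a month-long crawl:
    #   throttle = Throttle()
    #   for url in work_queue:
    #       throttle.wait()
    #       fetch(url)  # whatever your request function is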


Does stattit have an archive that one can download, or does it just provide the high-level summary stats shown on the site?


I don't think so; the creator, /u/Deimorz, is pretty cool - you could try asking him.



