Full Reddit Submission Corpus now available for 2006 thru August 2015 (reddit.com)
250 points by voltagex_ on Sept 28, 2015 | 92 comments



This is not as cool as the comment dataset, but it's still pretty cool. I think in the future AIs will make extensive use of these kinds of datasets. It's amazing how much useful information is in there.

Last year I made a simple IRC question answering bot that just searched reddit for your question and returned the top comment of the top result. It worked surprisingly well. It's amazing how many questions someone else has asked on reddit before. And the comment quality is usually pretty good. Sample conversation with it: https://i.imgur.com/LDD9isL.jpg

I improved on it a lot with a whitelist of subreddits and some machine learning to select the best thread. But I was only touching on what is possible with that data.

You can play with it here: https://kiwiirc.com/client/irc.snoonet.org/mybots

EDIT: It only works well with certain kinds of questions: things that would have been asked before, without too many unique keywords. I see some people ask it really unique questions, or talk to it like a regular chatbot without even asking questions, and then get frustrated when it returns nonsense. There are a lot of improvements that could be made to this with natural language processing and so on. But right now it's pretty simple.
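
For anyone curious what "top comment of the top result, restricted to a subreddit whitelist" could look like, here is a minimal sketch. It is not the bot's actual code: the whitelist contents, the keyword-overlap ranking, and the in-memory `threads` list (standing in for a real reddit search) are all assumptions made for illustration.

```python
# Hypothetical sketch, not the bot's real implementation.
# WHITELIST and the ranking rule are assumptions for illustration.
WHITELIST = {"askscience", "explainlikeimfive"}


def keyword_overlap(question, title):
    """Count the lowercase words shared by the question and a thread title."""
    return len(set(question.lower().split()) & set(title.lower().split()))


def answer(question, threads):
    """Return the highest-scored comment of the best-matching whitelisted thread."""
    candidates = [
        t for t in threads
        if t["subreddit"].lower() in WHITELIST and t["comments"]
    ]
    if not candidates:
        return None
    # Rank threads by keyword overlap first, then by thread score.
    best = max(
        candidates,
        key=lambda t: (keyword_overlap(question, t["title"]), t["score"]),
    )
    return max(best["comments"], key=lambda c: c["score"])["body"]


# Made-up sample data standing in for search results.
threads = [
    {"subreddit": "askscience", "title": "Why is the sky blue?", "score": 120,
     "comments": [{"body": "Rayleigh scattering.", "score": 80},
                  {"body": "Because.", "score": 3}]},
    {"subreddit": "funny", "title": "why is the sky blue lol", "score": 900,
     "comments": [{"body": "magic", "score": 500}]},
]
print(answer("why is the sky blue", threads))
```

A real version would replace `threads` with live search results and would probably need a smarter ranking than word overlap, which is where the machine learning mentioned above would come in.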


That chat bot is beyond amazing. Easily the best I've ever seen.


Me: What happens when an irresistible force meets an immovable object?

AMAbot: Too much explanation... Get out of the shower, you're a raisin. Also, makes me think of like babies, wrestling. Kind of like the opposite of atlas. why are you gay?


Do you have any of the source on GitHub? I'd be very interested in taking a look and potentially using it as an excuse to get playing with a natural language toolkit (in all my spare time haha)



Could you share "C:\\Users\\Daniel\\Documents\\Programming\\AMAbotData.txt" as well, or is it not necessary to run the bot?


Presumably that's the large dataset he was referring to.


OK, I added it and removed some of the stuff that was configured specifically for me. It should work. Please report any and all issues. I'm not sure if JavaScript supports local file paths, so you might need to edit it.


Thanks man!


Found a pastebin [1] of the code the author made in July, when I believe it was updated due to interest from another HN discussion [2].

[1] http://pastebin.com/CM9u17jq

[2] https://news.ycombinator.com/item?id=9871603


This ties directly into the comment dataset, made by the same person and via the same method[1]. Working on pulling down the submission data now, will have to spin up another server if I want to do the comment/submission correlation.

> "This dataset will go nicely with the full Reddit Comment Corpus that I released a couple months ago. The link_id from each comment corresponds to the id key in each of the submission objects in this dataset."
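
As a rough illustration of the comment/submission correlation, here is a small sketch that groups comments under their submission via `link_id`. The sample records are made up, and one detail is an assumption worth checking against the real data: in Reddit's API a comment's `link_id` is a "fullname" with a `t3_` prefix, so the sketch strips that prefix before matching it to the submission `id`.

```python
from collections import defaultdict


def group_comments_by_submission(submissions, comments):
    """Map each known submission id to the list of comments that point at it."""
    known_ids = {s["id"] for s in submissions}
    grouped = defaultdict(list)
    for comment in comments:
        link_id = comment["link_id"]
        # Assumption: link_id looks like "t3_<id>"; drop the prefix if present.
        sub_id = link_id[3:] if link_id.startswith("t3_") else link_id
        if sub_id in known_ids:
            grouped[sub_id].append(comment)
    return dict(grouped)


# Made-up sample records for illustration only.
submissions = [{"id": "abc", "title": "First post"},
               {"id": "xyz", "title": "Second post"}]
comments = [{"id": "c1", "link_id": "t3_abc", "body": "hello"},
            {"id": "c2", "link_id": "t3_abc", "body": "world"},
            {"id": "c3", "link_id": "t3_nope", "body": "orphan"}]

grouped = group_comments_by_submission(submissions, comments)
print({sub_id: len(items) for sub_id, items in grouped.items()})
```

At the full 1 TB scale this in-memory dictionary would not fit on one machine, which is presumably why a separate server or a Hadoop cluster comes up in this thread.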


Is the Comment Corpus still available? I have looked but couldn't see anything obvious (my apologies if I'm being thick here)


Sure, here you go [1] - It unpacks to over 1 TB though, hence me considering spinning up another server just for the comments. Might be a good use for my test Hadoop cluster.

[1] https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_eve...


Thank you. I'll be unpacking this on a FreeBSD server, so hopefully ZFS's lz4 file system compression will take some of the bite out of that 1TB file.


This is really cool, is the source code available anywhere?



Thanks for sharing this - that was a lot of fun!


This is very cool! Glad to see that people are making good use of this. Are you pulling the answer with the highest score?


This is super impressive.

My favorite was: "Who is your daddy and what does he do?" "It's not a tumor!"


I wonder if an analysis of this and the comment corpus could quantifiably show shifts in the opinions of the Reddit "hivemind" since the start. Has the Reddit userbase turned more conservative over the years, or is that just my impression?


You will most likely find a shift in the hive mind, but presumably only because Reddit grew from a smaller community with relatively dominant progressive and atheist/agnostic values into something much more mainstream. Reddit 2006 would not reflect mainstream US values and opinions, Reddit 2015 would be much closer. So I'm not sure we can draw any conclusions from the hive mind shift.


> Reddit 2006 would not reflect mainstream US values and opinions, Reddit 2015 would be much closer.

While I agree Reddit 2015 does reflect more of the US mainstream, it is still far from being an actual reflection of popular US opinion. Reddit is far more atheist/libertarian than the US mainstream. And I would argue it is right now far more reactionary/right wing as well. It has become very popular among young white males who have unhealthy attitudes towards women due to the abundance of porn and generally retrograde attitudes towards women of Reddit in general (both the users and the site itself.)

Same goes for its attitudes towards race.


> generally retrograde attitudes towards women of Reddit in general

That's a pretty awful accusation to level at ~20 million people. I wonder if you can substantiate that?


I think anyone who reads the default subs can corroborate the rampant sexism and racism.

When I'm on women-centric subs like /r/askwomen, I frequently see comments that would absolutely get torn apart in most other parts of the site. Not offensive or provocative ones, either. Just normal conversation that would get a slew of "shut up, SJW" anywhere else.


That last comment is not necessarily anything-ist. Someone who reads a community centralized around the finer points of, say, raising kittens, probably doesn't desire social politics and its associated toxicity injected. (Imagine this conversation we're having here, right now, in the middle of your favorite place to relax, and talk to people, except a lot less civil, and that's before the meta communities find it and invade, and even if you avoid it entirely, everyone involved gets hacked off at each other because the very nature of the topic is divisive)

It's off-topic meta noise at best and harmful to the community at worst.


Well, the fact of the matter is, most Reddit communities are dominated by male voices. So the "social politics" often aren't artificial incursions into the discussion, but underrepresented groups just trying to be part of the conversation. For example, there was recently some controversy about the portrayal of a female character in Metal Gear Solid V. It's not good when a woman says "I don't like how this character was portrayed" and everyone else pipes in with "lol SJW feminist". It's just a shitty, shitty atmosphere. In contrast, I've found that in communities with a more equal gender split, benefit of the doubt is far more freely given, and some great discussions about potentially touchy topics happen as a result.

Anyway, I've found that the people who use shorthand insults like "SJW" are typically part of the entrenched majority group of their subculture — and want nothing more than to keep those pesky "outsiders" away from their clubhouse. (Despite the fact that those outsiders had, in fact, been there all along.)


Personally, I'd love to know what communities those are. I've yet to find any one that would take that comment you just used as an example (or its converse) and allow anyone else to disagree with it freely. (And when I say freely, I mean without namecalling, without mass downvotes, without banning dissent, without any of the other community-forcing-its-will-on-the-minority-opinion garbage that follows those discussions 99% of the time in my experience)

"I don't like how this character was portrayed" is a fine criticism on its own, but behind that is the shadow of "..so what should be done about it?". Absolutely nothing of value lies down that path.

Put another way: Talk of frame rates leads to discussions about optimization. Talks of nonsensical writing leads to discussions of something else they could have done. Talks about portrayal of a "minority" character go straight into "the developers are *ist".

Why does "being part of the conversation" always seem to wind up in "I don't like this, the devs are assholes, it should be changed?"


Considering not all of the user base comments and that of those who do, not all of them are misogynistic, imputing the opinions of a vocal minority to the whole community isn't really fair.


Sure, but every community has a set of norms. You get a sense of what's not OK to say if you've been around for a while. On Reddit, I can immediately tell when going from a default sub to a nicer, more inclusive sub, because I see comments that would get absolutely eviscerated in the defaults used in regular conversation with no apparent controversy, pile-ons, or swearing. Sure, most people on Reddit probably aren't racist or misogynistic, but the community can still feel that way.


I think that the best way to describe Reddit with regard to misogyny, racism, and other forms of bigotry is that Reddit is not a safe place for everybody. If you don't curate your subreddits well, you'll probably end up trudging through a lot of awful comments and/or submissions.

I view this to be more of a reflection of society as a whole rather than the Reddit community specifically. Subreddits show a marked decrease in quality once they become large. Becoming a default subreddit is, in some ways, a death knell for the community.


By definition the people who don't comment don't contribute to the opinions on the site, so your argument that they are nice people really doesn't matter, even if it were true.


If you go to the comment section of any major site you'll find racism and misogyny. Saying Redditors support misogyny because there are misogynistic comments there is like saying New York Times readers do the same, because they also have misogynistic comments on their site.


Are you kidding me? Have you ever read any comments or half the submissions on the site? Or seen tons of popular subs like Mens Rights and shit?


Popular is a pretty relative term. MensRights has ~120K subscribers, while TwoXChromosomes has ~3.4 million. The latter is a lot more indicative of what the site "likes" than the former.

Anecdotes are not data.


> young white males who have unhealthy attitudes towards women due to the abundance of porn

You sound just like the fundamentalist Evangelicals I grew up with.


> You sound just like the fundamentalist Evangelicals I grew up with.

That reads as if you intended it as a personal attack. Such comments are not welcome on Hacker News. Please don't post them here.


[flagged]


I think your post would have gained more without the personal attack. I understand that his comment triggered your response, but the first part of your comment does a good job of supporting your argument.


Don't be too hard on him. As Ben Franklin said, the price of abundant, free porn is eternal vigilance.


It would also be interesting to see the development of astroturfing campaigns from a marketing and also a tinfoil hat perspective.


"More diverse" is, I believe, the term you're looking for.

On reddit circa 2006, everybody had the same views on everything from politics to movies to cats to spiders.

Eventually it acquired a more diverse cross-section of the population. Of course, they've been doing their best over the past few months to crack down on the expression of views they don't like, so I'm not sure what the demographic looks like nowadays and how many people have fled to less censorship-happy pastures.

It's also possible that young people are becoming more conservative overall. Conservatism is looking a lot better ever since the idiocy of the Bush administration got replaced by the idiocy of the Obama administration.


Perhaps this is not the right thread for it, but what exactly about the Obama administration is idiotic? I'm ashamed to say I don't really follow politics.


Just one example is the "Cash for Clunkers" program, where the government bought old cars just to destroy them. It was supposed to be both an environmental and economic stimulus, but in reality was bad for the environment and is just a form of the Broken Window Fallacy.


It sold a ton of Kia Souls, and by my figgering probably got most people 10-20 extra miles per gallon.

Not that it makes up for the tragic deaths of many big engined classics....


There sure as hell is a lot more fear-mongering since the refugee "crisis" started.


I find that /r/news and /r/worldnews are just utter cesspits of terrible, racist comments. Some people say they're brigaded by Stormfront... dunno if that's true or not, but I wouldn't be surprised.


I think American redditors just don't care much, and there has been a lot of racism in European parts of Reddit for some time.


Why quotes on crisis? I'm not challenging it, because I also find it weird that overnight the phrase started popping up in the news daily. And what is the fear-mongering going on at Reddit?


> And what is the fear-mongering going on at Reddit?

The usual:

• The Arabs will outbreed us! (Before or after the Turks will outbreed us? Or was it the Somalis? I lost track.)

• The refugees will cripple our economy! (153 million EU citizens are on tax-funded pensions or unemployment benefits. Please tell me how a million refugees is going to make a dent in that.)


They don't need to outbreed us - they are just sending us their displaced population, costing €700K per head in social services for "integration". And of those refugees "integrated" at vast expense, how many are already ISIS operatives? How many will be radicalized and return with full bellies and full bank accounts to fight for jihad in Syria and Iraq?

Europe is doing the wrong thing.


3.7 million people that now are the largest refugee population in the world. You can leave off the quotes.


That is quite accurately reflective of what's going on in Europe right now, in fairness.


You could probably just take the same methodology as the Poole-Rosenthal index for partisan politics in the United States. It's always the most widely cited for the current state of divisiveness in politics and I think could be applied here.


Can you take a sentiment of certain trends in what people are talking about regarding [subject] and turn it into either better marketing or profiting off sentiments in the market?


Last year, I did a statistical analysis with the same data and can confirm that the data is robust: http://minimaxir.com/2014/12/reddit-statistics/

The Comment data set is already on BigQuery, allowing for quick analysis without having to download that corpus (example: https://www.reddit.com/r/bigquery/comments/3kfnmq/reddit_sub... )

Once the Submission dataset is also on BigQuery, I'll write a blog post with more info on how to use it.


Submissions Corpus Magnet Link:

    magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e&dn=RS%5Ffull%5Fcorpus.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969

Comments Corpus Magnet Link:

    magnet:?xt=urn:btih:32916ad30ce4c90ee4c47a95bd0075e44ac15dd2&dn=RC%5F2015-01.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969

Note: The comment corpus was built a few months ago (2015-01), so the two datasets don't overlap completely.


I'm not quite sure how I feel about it. Yes, that data was publicly available, but it 'feels' different that there is now an unalterable snapshot available to the public. No more comment deletion.


That's already true due to services such as uneddit: http://uneddit.com/ and https://www.unedditreddit.com/

There's also the web archive and search engine indexing.


I believe this is just the submissions, not comments (though it appears that's already been released), based on the post author's comment...

> This dataset will go nicely with the full Reddit Comment Corpus that I released a couple months ago. The link_id from each comment corresponds to the id key in each of the submission objects in this dataset.


A comment dataset was actually already released a while back.


> No more comment deletion

Isn't that how Hacker News is? You can't delete comments here either.


Yeah but I'm sure a lot of people post on here knowing that they won't delete a comment. It's a different story with Reddit, where people have a lot more expectations and it's bigger.


If somebody believes that he can post something on the internet and then just remove it from existence, then he's probably a politician in EU.


Agreed; this feels antithetical to Reddit's principles on some level.

When the next snapshot is released, it will be possible to just perform a diff to discover all deletions and edits since.
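
To make the "just perform a diff" point concrete, here is a minimal sketch. It assumes each snapshot has already been reduced to a dict mapping an item id to its text; that reduction, and the sample data, are assumptions for illustration.

```python
def diff_snapshots(old, new):
    """Compare two {id: text} snapshots.

    Returns (removed, edited): ids missing from the newer snapshot, and ids
    present in both whose text changed.
    """
    removed = sorted(i for i in old if i not in new)
    edited = sorted(i for i in old if i in new and old[i] != new[i])
    return removed, edited


# Made-up sample snapshots for illustration only.
old_snapshot = {"c1": "original text", "c2": "something regrettable", "c3": "same"}
new_snapshot = {"c1": "edited text", "c3": "same", "c4": "brand new"}

removed, edited = diff_snapshots(old_snapshot, new_snapshot)
print("removed:", removed)
print("edited:", edited)
```

If a deleted item stays in the newer snapshot with its text replaced by a placeholder, it would show up under `edited` rather than `removed`, so the old text is recoverable either way.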


There's also a backup copy in Utah with all your comments from every site tied to common selectors like your usernames, emails, passwords, ip, credit cards etc. No doubt that'll leak at some point too.

So comment deletion was never going to save you.


But we cannot download that one.

This is actually what has been bothering me for a while now. In the past where search engines (well at least the big one) weren't that thorough, you didn't mind if something stayed on the Internet, since it wasn't easily accessible.

But it's a major difference if it's indexed/searchable or downloadable.


This is why I think we need to make ICT (Information and Communication Technology) a mandatory subject in our schools.

Teach kids about the internet. Many people are still under the assumption that just deleting something makes it disappear forever.

Maybe a lot of people from our generation are doomed to sharing too many things online, but we can at least save the next generation from themselves.

If you made a mistake 10 years ago, maybe a few close friends in your town knew about it. If you make a mistake now, the whole world has access to that information.

Government regulations and bans are not going to do anything to stop the spread of information; we need to educate people to protect themselves from their own selves.


Judging by how much of a hash lots of schools are doing with literacy and numeracy (and we've been teaching these things for ages), I don't have high hopes for ICT.


> But we cannot download that one.

Not yet :)

Given their lax security and general cluelessness of the people in charge, I'm quite sure it will leak at some point, at least partially, perhaps to corporations, perhaps to the public, just as internal NSA docs have leaked. It's very hard to keep things like that airtight forever - all it takes is one slip up and all the info stored could be accessible at some point in the future. It's already indexed and searchable, just not by you.

The important point here is that looking at present day tech (as in your comment on search engine prowess) is not the way to look at it - one day all this information will be accessible to much more intelligent future algorithms, able to link it together in myriad ways and form an almost perfect picture of your life in retrospect. The data is there, and will be stored forever.


I wonder whether they transformed top secret networks into intentionally slow networks now for large scale leaks to take years or tens of years :)


Limiting outgoing traffic would be enough.


Snowden used "a memory stick and other removable media, including a CD-ROM that he labeled as a Lady Gaga music CD" [1], so this kind of limiting wouldn't work. But he still did loads of internal network data queries, that's what I'm talking about.

[1] http://www.wired.co.uk/news/archive/2013-06/14/snowden-memor...


You're mixing up Snowden with Manning.


What is this referring to?


Lots of state agencies collect info like Reddit comments as well as emails, and store it indefinitely. They're already quite good at linking online identities.

https://theintercept.com/2015/09/25/gchq-radio-porn-spies-tr...

https://en.m.wikipedia.org/wiki/Utah_Data_Center


NSA's Utah Data Center[0], where they store all the data siphoned off their various programs on the Internet.

[0] https://en.wikipedia.org/wiki/Utah_Data_Center


I've spent some time turning this dataset into a Torrent - magnet link is here. If the Reddit OP doesn't want it to be shared in this way for whatever reason, I will of course kill it. Just trying to save him some bandwidth.

magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e&dn=RS%5Ffull%5Fcorpus.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80


magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e&dn=RS%5Ffull%5Fcorpus.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&ws=http://reddit-data.s3.amazonaws.com/RS_full_corpus.bz2 adds the S3 link as a web seed.
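
To see what each part of that magnet link carries, here is a small sketch that pulls it apart with Python's standard `urllib.parse`. The `parse_magnet` helper is a hypothetical name, and it only handles the parameters that appear in this thread (`xt`, `dn`, `tr`, `ws`).

```python
from urllib.parse import urlparse, parse_qs

# The magnet link from this thread, split across lines for readability.
MAGNET = (
    "magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e"
    "&dn=RS%5Ffull%5Fcorpus.bz2"
    "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80"
    "&ws=http://reddit-data.s3.amazonaws.com/RS_full_corpus.bz2"
)


def parse_magnet(uri):
    """Split a magnet URI into info hash, display name, trackers and web seeds."""
    parsed = urlparse(uri)
    if parsed.scheme != "magnet":
        raise ValueError("not a magnet URI")
    params = parse_qs(parsed.query)  # also percent-decodes the values
    return {
        "info_hash": params["xt"][0].rsplit(":", 1)[-1],
        "name": params.get("dn", [None])[0],
        "trackers": params.get("tr", []),
        "web_seeds": params.get("ws", []),
    }


info = parse_magnet(MAGNET)
for key, value in info.items():
    print(key, "=", value)
```

The `ws` entry is the web seed: a plain HTTP location that a supporting client can fetch pieces from alongside regular peers.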


That's pretty cool. So would it pull the data from the web or the seedbox? Sorry - I'm fairly newbish with p2p technology.


Both, if the client is new enough. I'm not sure what weight is given to each. You have no contact details and the site on your keybase profile is dead - can you get in touch? I'd like to experiment with this.


Shooting you an email to the email listed on your profile. I really should fix my keybase.


The same author just posted about a realtime stream of comment data: https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_d...


Now we wait for someone to train an RNN and produce a fake but believable reddit parody with generated titles :)



Some of them are amazing - "TIL Robin Williams died a year ago yesterday, Donald Trump and Bernie Sanders Will Consider Legalizing Dank Maymays if Jet Fueled"


Yes, but this works with Markov chains, not RNNs
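
For comparison with an RNN, a word-level Markov chain title generator fits in a few lines. This is a generic first-order sketch, not the code behind the generator being discussed; the training titles are made up, and the `<s>` / `</s>` boundary markers are an implementation choice.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # boundary markers, an implementation choice


def build_chain(titles):
    """Record, for every word, the list of words that followed it in any title."""
    chain = defaultdict(list)
    for title in titles:
        words = [START] + title.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=20):
    """Walk the chain from START until END is drawn or max_words is reached."""
    word, out = START, []
    for _ in range(max_words):
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)


# Made-up training titles for illustration only; the list must be non-empty.
titles = [
    "TIL the sky is blue",
    "TIL cats sleep most of the day",
    "The sky is falling says local cat",
]
chain = build_chain(titles)
print(generate(chain, random.Random(0)))
```

Each next word depends only on the current word, which is why the output tends to splice the start of one title onto the end of another.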


Correlate the comment corpus against LinkedIn DB and other public data sources, and one could create an auto-dox system.


Crossref did some analysis on when DOIs are submitted to reddit. DOIs are identifiers / persistent links, mostly for scholarly content.

http://crosstech.crossref.org/2015/09/dois-in-reddit.html

HN thread here: https://news.ycombinator.com/item?id=10303295


Is it possible to use it directly with Spark? (I.e. without downloading it from AWS.) I mean, if the file is public, does it mean that there is an s3n:// address and it is accessible?


I couldn't see from the source what form the data are in? It mentions what properties are available, but is it a DB snapshot? Are 'submission objects' only text?


Hi there. I'm the one that released this. The data is just a bunch of JSON objects separated by new lines (\n). It's basically the exact same info you would get in JSON format from Reddit's API.
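
Given that format, the file can be streamed one object at a time without unpacking it first. Here is a minimal sketch using only the standard library; it writes a tiny made-up sample file so it can run anywhere, and the two sample fields are an assumption rather than the real schema.

```python
import bz2
import json
import os
import tempfile


def iter_objects(path):
    """Yield one JSON object per non-empty line of a bz2-compressed file."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny sample file (made-up records) standing in for RS_full_corpus.bz2.
sample = [{"id": "abc", "title": "Hello"}, {"id": "def", "title": "World"}]
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(path, "wt", encoding="utf-8") as fh:
    for obj in sample:
        fh.write(json.dumps(obj) + "\n")

loaded = list(iter_objects(path))
print(len(loaded), "objects read")
```

For the real corpus, iterate over `iter_objects(...)` instead of calling `list()` on it, so only one line is held in memory at a time.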


Does anyone know what license this would be under?


Damn shame this excludes all my censored content.



