
Since it was not mentioned in the post, here's a direct link to the Reddit comment corpus likely being used: http://files.pushshift.io/reddit/comments/

The full table (up to end of 2015) is available on BigQuery, with separate tables for each month thereafter: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_p... (there is a similar table for comments)

And here's a year-old post I wrote on how to use that Reddit dataset with BigQuery: http://minimaxir.com/2015/10/reddit-bigquery/




If you want to download the set as a torrent (to save pushshift.io some bandwidth cost), you can do so via

  magnet:?xt=urn:btih:UGFLA4QNEXGEFKYYY5ZU37JIHWEEYY5R&dn=reddit_data&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80&tr=udp%3a%2f%2fopen.demonii.com%3a1337&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969
which I have put together. It contains the data up to April 2016.

If you want to work with this dataset on your workstation, there are some code examples in https://github.com/dewarim/reddit-data-tools
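
For anyone who just wants to poke at a single monthly file without BigQuery, here is a minimal Python sketch. It assumes the dumps are bz2-compressed text files with one JSON object per line, and that each comment carries a `body` field. The function names are made up for illustration and are not taken from reddit-data-tools.

```python
import bz2
import json
from typing import Iterator


def iter_comments(path: str) -> Iterator[dict]:
    """Yield one comment dict per non-empty line of a bz2-compressed dump."""
    # bz2.open in text mode decompresses lazily, so the whole file
    # never has to fit in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def search(path: str, phrase: str) -> Iterator[dict]:
    """Yield comments whose body contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    for comment in iter_comments(path):
        # "body" is an assumed field name; missing or null bodies are skipped.
        body = comment.get("body") or ""
        if needle in body.lower():
            yield comment
```

Usage would look roughly like `for c in search("RC_2015-01.bz2", "don't quote me"): print(c["body"][:80])`, where the file name is only a placeholder. This is a plain linear scan, so it is slow but needs no index.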


So we have all this data, but there still does not seem to be a reasonable way to search comments within Reddit...


Cost. We could have implemented comment search 8 years ago. Actually, we did implement it. But it just cost way too much to maintain the index.


Why do you not allow Google to index discussion threads?

Sometimes I can remember the comments on an article, but not the article itself. Unfortunately I can't search Google using comment text, because Reddit doesn't allow Google to index its comments.


What are you talking about? I search Reddit using Google all the time!?


https://www.reddit.com/robots.txt

    Disallow: /*/comments/*?*sort=
    Disallow: /r/*/comments/*/*/c*
    Disallow: /comments/*/*/c*


Those only block the various sorts and individual comment links. The main comments pages are still searchable.

Google was getting overzealous and indexing every page hundreds of times because it would follow every link, which included every "context" link and every sort.
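
To make the effect of those three rules concrete, here is a small Python sketch that checks paths against them. It assumes the wildcard semantics Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end, and rules match from the start of the path). The example paths are invented, and this is not how any crawler is actually implemented.

```python
import re

# Rules quoted from the robots.txt excerpt in this thread.
DISALLOW_RULES = [
    "/*/comments/*?*sort=",
    "/r/*/comments/*/*/c*",
    "/comments/*/*/c*",
]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard rule into a regex that matches from the path start."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path: str) -> bool:
    """Return True if any rule matches the beginning of the path plus query."""
    # re.match anchors at the start of the string, which mirrors prefix matching.
    return any(rule_to_regex(rule).match(path) for rule in DISALLOW_RULES)
```

Under these assumptions, a main thread page such as `/r/python/comments/abc123/some_title/` is not blocked, while the same path with `?sort=top` appended, or with a trailing comment id like `c0ffee1`, is.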


It works fine. I don't know if there's a way to get only comments, but if you scroll down to the link to /r/madlads it's clearly indexing them.

"site:reddit.com don't quote me"

I use this all the time to search specific subreddits for phrases in comments I remember, but I don't want to give examples or subreddits I visit.


He actually used the site: keyword! The absolute madman!


> /r/madlads

Look at this maverick!


But it does, it works fine.


A lot of times, I see an interesting quote or word on reddit, and I google to find out more, and the first (sometimes only) result is the comment itself. That even happens with comments that are less than an hour old.

Try adding site:reddit.com to your search.


Fellow QA here. I got started with BigQuery because of your post that I noticed on Reddit. Thanks!


That’s 260 gigabytes, which is actually a tiny dataset; you can query and index that on a normal workstation in seconds.

Even training models on it is possible in realistic times on normal systems.
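
As a rough illustration of what "indexing" means here, below is a toy inverted index in Python. It is only a sketch: the tokenizer, the function names, and the `(comment_id, body)` input shape are assumptions, and it keeps everything in memory, so a real index over the full dump would need to live on disk.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Very crude tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase tokens."""
    return TOKEN.findall(text.lower())


def build_index(comments: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each token to the set of comment ids whose body contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for comment_id, body in comments:
        for token in tokenize(body):
            index[token].add(comment_id)
    return index


def lookup(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of comments containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Intersect the posting sets; unknown tokens contribute an empty set.
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)
```

Note that this matches on the presence of all query words, not on exact phrases; phrase search would additionally need token positions.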


[deleted due to wrongness]


Which one is which? Thanks in advance if you know, because I'd rather not download a huge torrent just to find out I should have been using that bandwidth to download a different one.


Never mind, misread, the comment data is indeed what I linked to.

However, uncompressed, it comes to over 1 TB, if the BigQuery sizes are any indication.


> However, uncompressed, it comes to over 1 TB, if the BigQuery sizes are any indication.

Even then, that’s easily doable on a consumer system.

I’ll download it in the night between Friday and Saturday, after I install my new HDD, and just run queries over it for fun. That will be far slower than BigQuery, but also far cheaper, even at German electricity prices.



