Since it was not mentioned in the post, here's a direct link to the Reddit comme...

chokma · on Oct 11, 2016

If you want to download the set as a torrent (to save pushshift.io some bandwidth cost), you can do so via

  magnet:?xt=urn:btih:UGFLA4QNEXGEFKYYY5ZU37JIHWEEYY5R&dn=reddit_data&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80&tr=udp%3a%2f%2fopen.demonii.com%3a1337&tr=udp%3a%2f%2ftracker.coppersurfer.tk%3a6969&tr=udp%3a%2f%2ftracker.leechers-paradise.org%3a6969

which I have put together. It contains the data up to April 2016.

If you want to work with this dataset on your workstation, there are some code examples in https://github.com/dewarim/reddit-data-tools
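As a rough idea of what working with the data locally can look like, here is a minimal sketch. It assumes each dump file is bz2-compressed with one JSON object per line, and that each object has a `subreddit` field; neither is stated above, so check the actual files first. The function names are made up for illustration and are not taken from reddit-data-tools.

```python
import bz2
import json
from collections import Counter
from typing import Iterator


def iter_comments(path: str) -> Iterator[dict]:
    """Yield one comment dict per non-empty line of a bz2-compressed file.

    Assumption: the file holds newline-delimited JSON, one comment per line.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_subreddit(path: str) -> Counter:
    """Count comments per subreddit while streaming, so memory use stays small."""
    counts: Counter = Counter()
    for comment in iter_comments(path):
        # "subreddit" is an assumed field name; missing values are grouped together.
        counts[comment.get("subreddit", "<unknown>")] += 1
    return counts
```

Calling `count_by_subreddit(path).most_common(10)` on one monthly file would then list the ten busiest subreddits in it, without ever decompressing the file to disk.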

digi_owl · on Oct 11, 2016

So we have all this data, but there still does not seem to be a reasonable way to search comments within Reddit...

jedberg · on Oct 11, 2016

Cost. We could have implemented comment search 8 years ago. Actually, we did implement it. But it just cost way too much to maintain the index.

chatmasta · on Oct 11, 2016

Why do you not allow google to index discussion threads?

Sometimes I can remember the comments on an article, but not the article itself. Unfortunately I can't search google using comment text, because Reddit doesn't allow google to index its comments.

tachyonbeam · on Oct 11, 2016

What are you talking about? I search reddit using google all the time !?

chatmasta · on Oct 11, 2016

https://www.reddit.com/robots.txt

    Disallow: /*/comments/*?*sort=
    Disallow: /r/*/comments/*/*/c*
    Disallow: /comments/*/*/c*

jedberg · on Oct 11, 2016

Those only block the various sorts and individual comment links. The main comments pages are still searchable.

Google was getting overzealous and indexing every page hundreds of times because it would follow every link, which included every "context" link and every sort.
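To see which URLs the three rules quoted above actually catch, here is a small sketch. It assumes the common wildcard interpretation of robots.txt paths (`*` matches any run of characters, a trailing `$` anchors the end, and a rule matches as a prefix of the path plus query string). The example paths are hypothetical, and this is an illustration, not how any particular crawler is implemented.

```python
import re

# The three Disallow patterns quoted from reddit.com/robots.txt above.
DISALLOW_RULES = [
    "/*/comments/*?*sort=",
    "/r/*/comments/*/*/c*",
    "/comments/*/*/c*",
]


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored_end = rule.endswith("$")
    if anchored_end:
        rule = rule[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def is_disallowed(path: str) -> bool:
    """Return True if any of the quoted rules matches the path (with query string)."""
    return any(rule_to_regex(rule).match(path) for rule in DISALLOW_RULES)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for path in [
        "/r/programming/comments/abc123/some_title/",
        "/r/programming/comments/abc123/some_title/?sort=top",
        "/r/programming/comments/abc123/some_title/c0ffee",
    ]:
        print(path, "->", "blocked" if is_disallowed(path) else "allowed")
```

Under these assumptions the main comments page stays crawlable, while the sorted variant and the link to a single comment are blocked.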

NoodleIncident · on Oct 11, 2016

It works fine. I don't know if there's a way to get only comments, but if you scroll down to the link to /r/madlads it's clearly indexing them.

"site:reddit.com don't quote me"

I use this all the time to search specific subreddits for phrases in comments I remember, but I don't want to give examples or subreddits I visit.

chatmasta · on Oct 11, 2016

He actually used the site: keyword! The absolute madman!

monksy · on Oct 11, 2016

> /r/madlads

Look at this maverick!

ClassyJacket · on Oct 11, 2016

But it does, it works fine.

ytjohn · on Oct 12, 2016

A lot of times, I see an interesting quote or word on reddit, and I google to find out more, and the first (sometimes only) result is the comment itself. That even happens with comments that are less than an hour old.

Try adding site:reddit.com to your search.

qxf2 · on Oct 11, 2016

Fellow QA here. I got started with BigQuery because of your post that I noticed on Reddit. Thanks!

kuschku · on Oct 11, 2016

That’s 260 gigabytes, which is actually a tiny dataset; you can query and index that on a normal workstation in seconds.

Even training models on it is possible in realistic times on normal systems.

minimaxir · on Oct 11, 2016

[deleted due to wrongness]

natch · on Oct 11, 2016

Which one is which? Thanks in advance if you know, because I'd rather not download a huge torrent just to find out I should have been using that bandwidth to download a different one.

minimaxir · on Oct 11, 2016

Never mind, misread, the comment data is indeed what I linked to.

However, uncompressed, it comes to over 1 TB, if the BigQuery sizes are any indication.

kuschku · on Oct 11, 2016

> However, uncompressed, it comes to over 1 TB, if the BigQuery sizes are any indication.

Even then, that’s easily doable on a consumer system.

I’ll download it in the night between Friday and Saturday, after I install my new HDD, and just run queries over it for fun (far slower, but also far cheaper than BigQuery, even at German electricity prices).
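One way such local queries could look is sketched below using SQLite from the Python standard library. The table layout (`subreddit`, `author`, `body`) and the function names are assumptions for illustration, not something taken from the dataset or from BigQuery.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (subreddit, author, body), an assumed layout


def load_comments(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Create the table and index if needed, then bulk-insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments (subreddit TEXT, author TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_comments_subreddit ON comments (subreddit)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    conn.commit()


def top_subreddits(conn: sqlite3.Connection, limit: int = 10) -> List[Tuple[str, int]]:
    """Return (subreddit, comment count) pairs, busiest first."""
    cursor = conn.execute(
        "SELECT subreddit, COUNT(*) AS n FROM comments "
        "GROUP BY subreddit ORDER BY n DESC, subreddit ASC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


def search_body(conn: sqlite3.Connection, phrase: str, limit: int = 10) -> List[Row]:
    """Substring search over comment bodies.

    This is a full table scan, and "%" or "_" inside the phrase act as
    LIKE wildcards, so it is only a starting point.
    """
    cursor = conn.execute(
        "SELECT subreddit, author, body FROM comments "
        "WHERE body LIKE '%' || ? || '%' LIMIT ?",
        (phrase, limit),
    )
    return cursor.fetchall()
```

Passing a file path to `sqlite3.connect` instead of `":memory:"` keeps the table on disk, which is what you would want for anything close to the full dataset.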