Why don't you allow Google to index discussion threads?
Sometimes I can remember the comments on an article but not the article itself. Unfortunately, I can't search Google using comment text, because Reddit doesn't allow Google to index its comments.
Those only block the various sorts and individual comment links. The main comments pages are still searchable.
Google was getting overzealous and indexing every page hundreds of times because it would follow every link, which included every "context" link and every sort.
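To make the distinction concrete, here is a minimal sketch of the kind of rule described above: treat the main comments page as indexable and skip sort variants, "context" links, and individual comment links. The URL layout and the parameter names (`sort`, `context`) are assumptions for illustration, and `is_main_comments_page` is a hypothetical helper, not Reddit's actual crawler rules.

```python
from urllib.parse import urlsplit, parse_qs

# Query parameters assumed to produce alternate views of the same thread.
VARIANT_PARAMS = {"sort", "context"}


def is_main_comments_page(url: str) -> bool:
    """Return True only for the canonical comments page of a thread."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # Any sort or context parameter marks a duplicate view of the thread.
    if VARIANT_PARAMS.intersection(query):
        return False
    segments = [s for s in parts.path.split("/") if s]
    # Assumed layout: /r/<subreddit>/comments/<thread_id>/<slug>/[<comment_id>]
    if "comments" not in segments:
        return False
    after = segments[segments.index("comments") + 1:]
    # A third segment after "comments" is taken to be an individual comment link.
    return 1 <= len(after) <= 2
```

Under these assumptions, a crawler that only followed URLs passing this check would see each thread once instead of once per sort and once per comment.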
A lot of the time, I see an interesting quote or word on Reddit, google it to find out more, and the first (sometimes only) result is the comment itself. That even happens with comments that are less than an hour old.
Which one is which? Thanks in advance if you know, because I'd rather not download a huge torrent just to find out I should have been using that bandwidth to download a different one.
> However, uncompressed, it hits over 1 TB, if the BigQuery sizes are any indication.
Even then, that’s easily doable on a consumer system.
I’ll download it overnight from Friday to Saturday, after I install my new HDD, and just run queries over it for fun. (Far slower, but also far cheaper than BigQuery, even at German electricity prices.)
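For anyone wanting to try the local route, here is a rough sketch of a comment-text search over one dump file. It assumes the file is bz2-compressed, newline-delimited JSON with the comment text in a `body` field; check the actual dump before relying on that. `search_comments` is a hypothetical helper name.

```python
import bz2
import json
from typing import Iterator


def search_comments(path: str, needle: str) -> Iterator[dict]:
    """Yield every comment record whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    # Stream the file line by line so memory use stays flat
    # even when the uncompressed data is far larger than RAM.
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            # "body" is the assumed name of the comment-text field.
            if needle in (record.get("body") or "").lower():
                yield record
```

Usage would look like `for c in search_comments("comments.bz2", "some phrase"): print(c)`, where the file name is a placeholder. This is a full scan per query, so it is slow compared to an indexed store, but it needs nothing beyond the standard library.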
The full table (up to end of 2015) is available on BigQuery, with separate tables for each month thereafter: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_p... (there is a similar table for comments)
And here's a year-old post I wrote on how to use that Reddit dataset with BigQuery: http://minimaxir.com/2015/10/reddit-bigquery/