How F5Bot Slurps All of Reddit (intoli.com)
255 points by foob on July 30, 2018 | 87 comments


> The other 95% of it is just wasted bandwidth.

You can save a lot of bandwidth by requesting compressed responses:

  $ curl -s --user-agent moo/1 -H 'Accept-Encoding: gzip' "$pretty_long_url" > test.gz
  $ wc -c < test.gz 
  63507
  $ gzip -d < test.gz | wc -c
  426941
(OK, that's 85% saved, not 95%, but hey.)


Trading CPU for bandwidth, so optimize for what you want.


Decompression is usually cheaper (CPU-wise) than compression so technically you’re trading their CPU for less of your bandwidth and CPU.

If you do it right you can even keep the content stored compressed without re-compressing by saving the compressed byte stream directly.
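
For instance, a minimal Python sketch (URL and filename are placeholders; urllib doesn't decompress transparently, so what you read is the raw gzip stream the server sent):

    import gzip
    import urllib.request

    req = urllib.request.Request(
        "https://api.reddit.com/new.json?limit=100",   # placeholder URL
        headers={"Accept-Encoding": "gzip", "User-Agent": "moo/1"},
    )
    with urllib.request.urlopen(req) as resp:
        # urllib does not decode Content-Encoding for you.
        is_gzip = resp.headers.get("Content-Encoding") == "gzip"
        body = resp.read()

    # Store the compressed bytes exactly as received -- no re-compression.
    with open("page.json.gz" if is_gzip else "page.json", "wb") as f:
        f.write(body)

    # Decompress only when the JSON is actually needed.
    text = (gzip.decompress(body) if is_gzip else body).decode("utf-8")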


Also use pigz instead of gzip so it can use multiple threads when decoding (although you're probably limited by sending bandwidth)


pigz doesn't help much with decompression:

https://github.com/madler/pigz/issues/36#issuecomment-249041...

Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.


> I mean do you really want subreddit name and subreddit_name_prefixed? They’re the same, one just has an “r/” in front of it.

This is (unfortunately) not quite true. Since Reddit introduced "profile posts," there can be a post where the subreddit name is something like "u_Shitty_Watercolour" but the subreddit_name_prefixed is actually "u/Shitty_Watercolour", rather than "r/u_Shitty_Watercolour".

Example: https://www.reddit.com/user/Shitty_Watercolour/comments/84nh...


I'm not sure that's true. I think they both work, see: https://www.reddit.com/r/u_Shitty_Watercolour/

Maybe one is just an alias though? I wonder if you can make a r/u_$unused_username and then later register $unused_username

edit: nope, you can't make a sub that starts with "u_"


They do both work, you're correct.

However the point of subreddit_name_prefixed (I assume) is to display something in a user-facing way. For this purpose, r/u_something is correct but not proper.


Users can post to their own profile subreddit, or any of the other subreddits that they have permission to post to. That's relatively new, and came with the new profiles as part of the redesign. Both fields are needed; otherwise information could be missed.


It is difficult for me to describe just how angry it makes me that reddit doesn't provide a way for users to even do basic things like "see all of my own comments" or "see all of the posts made to the subreddit I moderate". They keep nerfing the search APIs and claim it is so they could make the indexes more efficient, but while that might make sense for a full-text search interface, that is entirely unreasonable for basic functionality like "I'm scrolling back through time on my own user page" (where the efficient index is pretty obvious). Both of "see all of the content I posted" and "see all of the content I'm supposedly responsible for" seems like it should be basic, if not required, functionality for any website.

https://www.reddit.com/r/changelog/comments/7tus5f/update_to...

https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_re...

https://www.reddit.com/r/help/comments/1u0scj/get_full_post_...


(I worked at reddit)

It's not really related to search, most of the cause is a pretty bizarre optimization method that reddit decided to implement fairly early on. The database structure is unusual and not very conducive to indexing (it's similar to an EAV model), so at some point they decided to basically write their own "secondary indexing"-like system that stores the "listing indexes" in memcached/Cassandra (with the data itself in PostgreSQL).

Whenever something happens that affects any listings (new post created, voting, etc), the site figures out all the listings it needs to update, and where in each listing the affected post now belongs and updates them all. So for example, if you make a new submission to /r/pics, it will go through and add the post's ID in the right spot to the "new posts in /r/pics" listing, the "hot posts in /r/pics" listing, the "new posts by yourusername" listing, and so on. As it's going through and updating all these lists, it also trims each one down to 1000 items.

It's conceptually pretty similar to a normal database indexing system, but basically maintains all the indexes "manually" and restricts them all to the top 1000 items.
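
A toy sketch of the idea, with plain Python dicts standing in for memcached/Cassandra (this is not reddit's actual code, just the shape of it):

    from collections import defaultdict

    MAX_LISTING = 1000
    listings = defaultdict(list)   # e.g. ("new", "r/pics") -> [(score, post_id), ...]

    def update_listings(post_id, affected):
        """Insert the post into every listing it belongs to, then trim each
        listing back down to its top 1000 entries."""
        for key, score in affected:
            listing = listings[key]
            # drop any stale entry for this post, re-insert at its new position
            listing[:] = [(s, pid) for s, pid in listing if pid != post_id]
            listing.append((score, post_id))
            listing.sort(reverse=True)
            del listing[MAX_LISTING:]

    # A new submission to /r/pics touches several listings at once:
    update_listings("t3_93abcd", [
        (("new", "r/pics"), 1532959173),            # sorted by creation time
        (("hot", "r/pics"), 4321.5),                # sorted by hot score
        (("new", "u/yourusername"), 1532959173),    # the author's own listing
    ])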

If you're curious enough to dig around in the code, this is probably the main relevant file: https://github.com/reddit-archive/reddit/blob/master/r2/r2/l...


Thanks for the explanation. The APIs make a lot more sense in light of it. It seems odd they didn't just use Postgres indexing though. Do you know if they benchmarked it at any point?


For a lot of queries EAV type schemas are really hard to index efficiently. E.g. searching for something like a = ? and b = ? where a and b are dynamic attributes you can't just have a multi-column index when using EAV. So you instead end up with two intermediate query results that then are intersected. If you need even semi-efficient querying EAV usually isn't the answer.
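
Roughly (a Python sketch with made-up rows, dicts standing in for per-attribute indexes):

    from collections import defaultdict

    # EAV rows: (entity_id, attribute, value)
    rows = [
        (1, "a", "x"), (1, "b", "y"),
        (2, "a", "x"), (2, "b", "z"),
        (3, "b", "y"),
    ]

    # The best you can index is per attribute/value pair...
    index = defaultdict(set)
    for entity, attr, value in rows:
        index[(attr, value)].add(entity)

    # ...so "WHERE a = 'x' AND b = 'y'" becomes two intermediate result
    # sets that have to be intersected, instead of one composite-index lookup.
    print(index[("a", "x")] & index[("b", "y")])   # {1}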


But it is possible to use a hybrid data store. Extract the important attributes you want to search on efficiently and store them in a properly structured and indexed relational database. Other attributes can still be stored EAV-style in a different schema/database.


I have no idea, this system was created years before I started working there.


Reddit doesn't even allow users to save more than 1000 posts, and worse does not visibly document this or provide any kind of warning that the limit has been exceeded. Anecdotally, I've read users say that revisiting the saved pages will still show an "unsave" button so the information is recorded somewhere. But once a user exceeds 1000 entries on their "saved" page, adding new ones will silently vaporize old ones.

https://www.reddit.com/r/help/comments/6nxqjm/maximum_of_100...


It's a bit weirder than that. It actually does save all the posts, but the "saved" page (like almost every other page on the site) will only show you 1000 items, so there's just no way to access all the older items once they've been "pushed off the end".

I posted some more information about it a while ago here: https://www.reddit.com/r/help/comments/7en0uu/my_saved_posts...


If you unsave a newer one, wouldn't the older one pop up again because it updates the index with limit 1000 again? If the data is there, it should find all posts and then truncate, in that order, if I'm understanding this correctly.


Nope, because the old one's already been pushed off the end of the list and it doesn't re-generate the list when you unsave, just removes the item from the list.

Imagine I have a list of max length 5, when I initially fill it up it looks like [5, 4, 3, 2, 1]. If I save one more thing, it adds 6 at the front, then truncates the list and removes the 1 from the end, so now you have [6, 5, 4, 3, 2]. At that point, if I unsave #3, it just removes it from the list so you'd have [6, 5, 4, 2]. #1 is still saved, but nothing happens to pull it back into the list.


What if you unsave 2 items and save 1? Is the list then remade?


No, the list is never remade. The new item just gets inserted into the list, exactly the same as if it had never reached the limit in the first place.


why would they make the limit so low?


Not sure, the 1000 limit is in the file from its very first version 10 years ago: https://github.com/reddit-archive/reddit/commit/33fd4e9684ca...

1000 probably seemed like a lot at the time or had reasonable performance, and it's just never been changed.


Saving posts is also notoriously unreliable, in my experience. I've had posts where I've clicked the "save" button and it's flipped to "unsave" but upon refreshing the page the button resets as if it was never pressed (and it never gets added.)


A good workaround is to set up an https://ifttt.com/ trigger on saves to offload them somewhere else. I save all my reddit saves to Evernote, personally. It can help with this known issue!


Their API is so fucked that it is literally impossible to get more than a couple hundred posts from a subreddit.

Thankfully, services like pushshift[1] exist, which has a sane API and the option to use plain elasticsearch.

[1] https://github.com/pushshift/api


To answer GP's questions with Pushshift links:

"see all of my own comments": http://api.pushshift.io/reddit/comment/search?author=saurik&... (use &before=[epoch] for pagination)

"see all of the posts made to the subreddit I moderate": https://api.pushshift.io/reddit/search/submission/?subreddit...


Wow, it's so weird recognizing so many people in this discussion. Thanks for giving examples.

I have a subreddit (/r/pushshift) that gives examples on how to use the API. I'm always happy to answer questions about the API and take suggestions to improve it.

I'm very very close to launching the new API which will have a lot more features and a better design than the current one.

Edit: I just realized my username on this site got truncated to uhhh ... wow.


The big concern I have is that the endpoints used in the article are pretty old, and given that Reddit is now making their site more modern like Facebook and Twitter, they'll just close these endpoints like they've done before.


Yeah, Reddit's search is unusable. Some of the important features used to be corralled into CloudSearch but I think they killed that too.

I pulled the dumps from Bigquery and am going to load them into PG at some point here, so I can do arbitrary queries without hitting BigQuery every time. Haven't looked at how to do that in realtime though, retrospective queries are mostly what I'm interested in.

If you don't want to do that, there's also https://redditsearch.io/ - if you want to search back farther, be sure to set the toggle to "all" instead of "day"!


I built redditsearch.io -- it uses the pushshift API for the back-end. It was thrown together and barely works, but hey, I'm just one guy maintaining this as a labor of love. :)


"You may think PHP is slow"

Why would we think PHP is slow? PHP is blazing fast; certain applications (looking at you, SugarCRM) make a mockery of this by rewriting queries and loading unnecessary data into each page request.

Nice to see a php related show and tell.


PHP used to be very slow, it got better with v7.0.

It's still quite slow compared to C/C++/Rust/Go, more than 10x slower:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


An interpreted language is slower than a compiled binary? Color me shocked.


> An interpreted language is slower than a compiled binary? Color me shocked.

That same benchmark shows JS code that’s 6x faster than PHP.


Yes, but JS is faster than Ruby and Python, and indeed most of your go-to scripting options.


JS has the advantage of being the thing that makes your website run faster on the screen of the user whose clicks are earning you ad dollars. JS is in a good spot.




PHP is JIT-compiled, which helps.


Just curious - why did you choose fasta as your comparison?


Because for a long time (critically, when PHP was very popular) it was quite slow. It has gotten better, but a lot of people haven't used it since it was slow.


Quite slow compared to what?

PHP was popular when it had MySQL bindings that were always up to date with MySQL changes and the Apache/PHP setup was simple. Thus the LAMP stack, right? It has always been rather fast in the larger ecosystem of web scripting languages (especially if you avoided particular BIFs), but it got a LOT faster in 7 as the slowest bits were optimized alongside everything else.


Related: Jason Baumgartner has maintained a Reddit scraping pipeline for a few years now, and wrote up some notes about making it robust: https://pushshift.io


You can also use the Pushshift real-time feed in BigQuery to query for keywords in submissions in real time (unfortunately the comments feed broke last month)

Example query which searches for 'f5bot' in the past day and correctly finds the corresponding posts on Reddit:

   #standardSQL
   SELECT title, subreddit, permalink
   FROM `pushshift.rt_reddit.submissions`
   WHERE created_utc > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
   AND REGEXP_CONTAINS(LOWER(title), r'f5bot')


There has been a lot of interest expressed in getting this working and dependable. It's part of my plan when releasing the new API. There is A LOT of internal code managing everything. I've got terabytes of indexes alone just to handle the 5 million API requests I'm currently getting each month to the Pushshift API (I have around 20 terabytes of SSD / NVMe space and around 512 GB of ram behind this project).


Aho-Corasick is really great. It’s a bit complicated to set up, but once you have the modified trie set up it’s really fast. By the way,

> Basically I use the selftext, subreddit, permalink, url and title. The other 95% of it is just wasted bandwidth.

It’d probably be better for Reddit if they allowed for specifying the fields we care about rather than just returning the whole thing…
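
(Back on the Aho-Corasick point: for a feel of how little code the matcher needs once the automaton is a library, here's a minimal Python sketch using the third-party pyahocorasick C extension. The article's matcher is hand-written PHP; the keywords below are made up.)

    import ahocorasick   # pip install pyahocorasick

    keywords = ["f5bot", "intoli", "aho-corasick"]

    automaton = ahocorasick.Automaton()
    for idx, keyword in enumerate(keywords):
        automaton.add_word(keyword, (idx, keyword))
    automaton.make_automaton()   # builds the trie plus failure links

    text = "how f5bot slurps all of reddit (intoli.com)"
    # One pass over the text finds every keyword occurrence, overlaps included.
    for end_pos, (idx, keyword) in automaton.iter(text):
        print(keyword, "ends at index", end_pos)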


Also, imagine the joules saved worldwide if known key names didn't have to be sent each time, or, better yet, data was packed in a format that optimized for size and processing speed rather than readability.

I mean, I enjoy the idea of human readability as much as Jon Postel, but at certain scales you have to wonder about the hidden cost of petabytes of human-readable data flying over the wire, never to be seen by anything but computers.
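
(For a rough sense of the sizes: a Python sketch with a made-up post dict. MessagePack only tackles the encoding overhead; the key names still get sent each time, so a schema-based format like Protocol Buffers would go further.)

    import json
    import msgpack   # pip install msgpack

    post = {"subreddit": "programming", "title": "How F5Bot Slurps All of Reddit",
            "url": "https://example.com/f5bot", "score": 255, "over_18": False}

    as_json = json.dumps(post).encode("utf-8")
    as_msgpack = msgpack.packb(post)

    # msgpack drops the quotes/braces/colons and packs numbers compactly,
    # so it comes out noticeably smaller before any gzip is applied.
    print(len(as_json), len(as_msgpack))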


Except that the data stream is, I should hope, compressed so that the data is actually packed into a format optimized for size.


Not necessarily, and even so not for free.

(Client must specify compression support)


Most clients support it implicitly; you probably have to go out of your way to get an uncompressed stream. Now, compressing a verbose text string is not optimal, but given the past attempts I'd hesitate to use a pre-packed format. Historically that has not worked out well. Compressing the text format is ultimately the worse-is-better solution.


They are using libcurl, for which you need to request compression explicitly:

https://curl.haxx.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html
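
For example, with the Python pycurl bindings (just to illustrate the libcurl option; the actual code in the article is PHP):

    from io import BytesIO
    import pycurl

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "https://api.reddit.com/new.json?limit=100")  # placeholder
    c.setopt(pycurl.USERAGENT, "moo/1")
    # CURLOPT_ACCEPT_ENCODING (older bindings spell it pycurl.ENCODING):
    # request gzip, and let libcurl decompress the response automatically.
    # An empty string "" would mean "whatever encodings libcurl supports".
    c.setopt(pycurl.ENCODING, "gzip")
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()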


> It’s a bit complicated to set up […] it’s really fast

Sigh, just use Perl. Writing code with the general regex engine took me only one minute of effort, but it already runs nearly 500× faster than codeplea's optimised special-purpose code.

Why 100000 loops and not 10 like in the original code? Otherwise Benchmark.pm will show "(warning: too few iterations for a reliable count)".

----

benchmark.php 100000 loops:

    Loaded 3000 keywords to search on a text of 19377 characters.

    Searching with aho corasick...
    time: 329.3522541523
----

benchmark.pl 100000 loops:

    Benchmark: timing 100000 iterations of regex...
    regex: 0.691561 wallclock secs ( 0.69 usr +  0.00 sys =  0.69 CPU) @ 144927.54/s (n=100000)
----

benchmark.pl (fill in the abbreviated ... parts from benchmark_setup.php):

    #!/usr/bin/env perl
    use Benchmark qw(timethese :hireswallclock);
    require Time::HiRes;
    my @needles = qw(
    abandonment abashed abashments abduction ...
    );
    my $haystack = 'unscathed grampus ...
    heroically';
    my $n = join '|', @needles;
    timethese 100000, {
        regex => sub {
            my @found;
            while ($haystack =~ /($n)/cg) {
                push @found, [$1, pos $haystack];
            }
            return @found;
        },
        index => sub {
            my @found;
            for (@needles) {
                my $pos = index $haystack, $_;
                push @found, [$_, $pos] if -1 < $pos;
            }
            return @found;
        }
    };


Is your solution broken for the cases where keywords are prefixes or suffixes of each other? This situation is very common in my use-case. Also, does your solution work if a keyword appears multiple times?

I get what you're saying, but it's not quite as easy as you imply.

Pulling in an entire programming language is a much bigger dependency and maintenance cost than spending a couple hours writing an algorithm. It would make more sense to just use a C extension.

I did try PHP's regex. It was much, much slower.


You are right, the solution is broken. I can't make it work, so I take back what I said.

I learnt something valuable, thank you for that.


I only took a quick skim, but it doesn't look like they do the optimization to Aho-Corasick where you store connections directly to "leaf" nodes (i.e. nodes where you can have a finished match).

If I'm right, that would probably speed things up significantly.


Sorry, what optimization are you talking about? If you just mean path compression, then I don't think this works because you miss out on generating failure links as you go along.


It basically is path compression. You still need to maintain failure links, but at every step, when you step up through your parents to find all matches, you jump directly to leaves.


This is just scraping JSON; I'm surprised it made it to the front page. The only thing worth noting is that Reddit is able to serve that much JSON.


Yeah, was thinking the same.


I made something just like this that worked on forums. Basically, you could subscribe to any forum that was using the Tapatalk plugin (pretty much any busy forum uses it these days). It doesn't look like this will handle misspellings of words, or anything like that. I was handling that, but it took a LOT of processing power and I quickly realized that the more people used it, the worse it was going to scale. Good luck with your project.


I run a forum and would be very interested in seeing your code. Can you share?


> So here’s the approach I ended up using, which worked much better: request each post by its ID. That’s right, instead of asking for posts in batches of 100, we’re going to need to ask for each post individually by its post ID. We’ll do the same for comments.

Seems a bit over the top imho. Maybe a better approach is to ask for 1,000 and look for any missing — which you can grab individually.

I’d be a little annoyed at people not using batch mode and making so many requests, but that’s just me.


Each request still returns 100 posts. It's just that you have to specify the 100 post IDs individually.

Their default listing mode works very poorly. It would certainly be more requests to use a hybrid system like you're talking about.


I thought the same at first but a bit further on it turns out to still batch them, you can apparently specify multiple IDs by comma-separating them.


There's a reddit database dump covering 2005 through May 2018 at:

https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...


Which API do most Reddit bots use? Do they use the Reddit APIs directly, or do they use one of the third-party services (F5Bot, pushshift)? And are there any other options for getting a firehose of new Reddit posts/comments?



Do the social share buttons literally cover the first few paragraphs of content for anyone else?


For the service itself, I've been using it for a long time and it works really well.


Thanks! I'm glad it's working well for you.


"Turns out that Reddit [API] has a limit. It'll only show you 100 posts at a time."

100 sounds like a typical "max-requests" pipelining limit.

He does not mention CURLMOPT_PIPELINING.

Does this mean he makes 100 TCP connections in order to make 100 HTTP requests?


The 100 has nothing to do with HTTP pipelining, it's just a standard REST style "&limit=100" hard limit


You might be right.

With the "&limit" parameter he can change how many items he receives per HTTP request. This has nothing to do with a limit on how many HTTP requests he can make per TCP connection (pipelining). Maybe that is the "100" he is complaining about, i.e., 100 items per HTTP request.

However you failed to answer my question: Is he making 100 TCP connections to make 100 HTTP requests?

Does the Reddit server set a limit on how many HTTP requests he can make per connection? (100 is a common limit for web servers)

Sometimes the server admins may set a limit of 1 HTTP request per TCP connection. This prevents users from pipelining outside the browser, e.g., with libcurl or some other method.


I didn't feel the need to answer your question because it was abundantly clear in the code that it's not using pipelining. You posted the exact curl option that he's not using.


I apologise if I confused you. I was simply wondering why he is not using pipelining, which IME can be ideal for the sort of text retrieval he is performing.


edit: cool


Please don't remove your old text when it turns out you're wrong or have an unpopular opinion. You're not the only person in this thread who missed the batch part of individual ID requests and I initially was confused as well, but removing your post breaks the comment thread, and after 3h you can't edit anymore so I'll have to downvote.


> Don't be that guy

She/He's not. One request can supply 100 IDs.

> So here are 2,000 posts, spread out over 20 batches of 100 that we download simultaneously. It assumes you’ve already got the last post ID loaded into $id base-36.


He's able to ask for multiple posts in one request so the actual request rate is fairly low:

> Here’s how we do it. We find the starting post ID, and then we get posts individually from https://api.reddit.com/api/info.json?id=. We’ll need to add a big list of post IDs to the end of that API URL.
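
Roughly like this, as a Python sketch (the article does it in PHP with parallel curl handles; the starting ID here is made up):

    import json
    import urllib.request

    DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

    def fullname(n):
        """Turn an integer post ID into a reddit 'fullname' like t3_93qvmb."""
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = DIGITS[r] + out
        return "t3_" + (out or "0")

    def fetch_batch(first_id, count=100):
        """Request `count` consecutive post IDs in a single API call."""
        ids = ",".join(fullname(first_id + i) for i in range(count))
        req = urllib.request.Request(
            "https://api.reddit.com/api/info.json?id=" + ids,
            headers={"User-Agent": "moo/1"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        # IDs that don't exist (deleted, never used) just don't come back,
        # so the result can contain fewer than `count` posts.
        return [child["data"] for child in data["data"]["children"]]

    start = int("93qvmb", 36)          # a made-up base-36 starting post ID
    posts = fetch_batch(start)
    print(len(posts), "posts")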


This is mostly why I left Reddit. The API allows far too much control and I started questioning what was even real. Being able to quickly find keywords and then have a network of bots that creates replies/upvotes/downvotes is a very disturbing thought to me. I can't even imagine something like that being used on a large scale to change opinions.


The API is a reason why I _love_ Reddit. Being able to moderate my communities with tools I can write has been a savior more times than I can count.


> I started questioning what was even real.

The votes aren't real and they don't matter. This is true for FB likes or whatever else you can imagine. Reddit goes to a LOT of trouble to counter bad actors, but if you're in some subreddit that you think is full of abuse, then it's like any other part of the internet. Go somewhere else. This has little to do with Reddit as a whole, so the proclamation seems unnecessarily volatile.


Out of interest in this I made https://linksforreddit.com. You can view who links to certain articles. I backed up and truncated the 2017 data because the VPS ran out of space, but there were and are interesting patterns. Nothing proof-like, but the same PDFs sometimes get linked close to a hundred times by similarly themed and structured, yet textually different, comments.


How do you change opinions using that? Would it be an illegitimate way to change someone's opinion?


I think what they're describing are propaganda bots. If so, what you're asking is how does propaganda work, and is it not legitimate?


Yes. Voting patterns, for example, how do those change people's opinions? It would seem that the most they could do is give an impression of what comments are and are not well received. Do people base much of their opinions on that? In which direction?



