Decompression can’t be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.
> I mean do you really want subreddit name and subreddit_name_prefixed? They’re the same, one just has an “r/” in front of it.
This is (unfortunately) not quite true. Since Reddit introduced "profile posts," there can be a post where the subreddit name is something like "u_Shitty_Watercolour" but the subreddit_name_prefixed is actually "u/Shitty_Watercolour", rather than "r/u_Shitty_Watercolour".
However, the point of subreddit_name_prefixed (I assume) is to have something to display in a user-facing way. For that purpose, r/u_something would be technically correct but not the proper thing to show.
Users can post to their own profile subreddit, or to any of the other subreddits that they have permission to post to. That's relatively new, and came with the new profiles as part of the redesign. Both fields are needed; otherwise information could be missed.
It is difficult for me to describe just how angry it makes me that reddit doesn't provide a way for users to do even basic things like "see all of my own comments" or "see all of the posts made to the subreddit I moderate". They keep nerfing the search APIs and claim it's so they can make the indexes more efficient, but while that might make sense for a full-text search interface, it is entirely unreasonable for basic functionality like "I'm scrolling back through time on my own user page" (where the efficient index is pretty obvious). Both "see all of the content I posted" and "see all of the content I'm supposedly responsible for" seem like they should be basic, if not required, functionality for any website.
It's not really related to search, most of the cause is a pretty bizarre optimization method that reddit decided to implement fairly early on. The database structure is unusual and not very conducive to indexing (it's similar to an EAV model), so at some point they decided to basically write their own "secondary indexing"-like system that stores the "listing indexes" in memcached/Cassandra (with the data itself in PostgreSQL).
Whenever something happens that affects any listings (a new post is created, a vote comes in, etc.), the site figures out all the listings it needs to update and where in each listing the affected post now belongs, and updates them all. So for example, if you make a new submission to /r/pics, it will go through and add the post's ID in the right spot to the "new posts in /r/pics" listing, the "hot posts in /r/pics" listing, the "new posts by yourusername" listing, and so on. As it's going through and updating all these lists, it also trims each one down to 1000 items.
It's conceptually pretty similar to a normal database indexing system, but basically maintains all the indexes "manually" and restricts them all to the top 1000 items.
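To make the mechanics concrete, here is a toy sketch of that kind of manually maintained, capped listing (hypothetical names and sort keys; the real thing lives in memcached/Cassandra, not a Python dict):

import bisect

LISTING_LIMIT = 1000
listings = {}  # listing key -> sorted list of (-sort_score, post_id)

def update_listings(post_id, sort_score, affected_keys):
    # Insert the post into every affected listing and trim each one to the cap.
    for key in affected_keys:
        listing = listings.setdefault(key, [])
        # Drop any old entry for this post, then insert it at its new position.
        listing[:] = [item for item in listing if item[1] != post_id]
        bisect.insort(listing, (-sort_score, post_id))
        del listing[LISTING_LIMIT:]  # anything past item 1000 is simply gone

# e.g. a new submission to /r/pics by "yourusername":
update_listings(
    post_id="t3_abc123",
    sort_score=1529000000,  # sort key for the toy example (a timestamp here)
    affected_keys=["new:/r/pics", "hot:/r/pics", "new:user/yourusername"],
)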
Thanks for the explanation. The APIs make a lot more sense in light of it. It seems odd they didn't just use Postgres indexing though. Do you know if they benchmarked it at any point?
For a lot of queries, EAV-type schemas are really hard to index efficiently. E.g. for something like a = ? AND b = ?, where a and b are dynamic attributes, you can't just use a multi-column index with EAV. So you instead end up with two intermediate query results that then have to be intersected. If you need even semi-efficient querying, EAV usually isn't the answer.
But it is possible to use a hybrid data store. Extract the important attributes you want to search on efficiently and store them in a properly structured and indexed relational schema. Other attributes can still be stored EAV-style in a different schema/database.
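To make the intersection problem from the parent comment concrete, here is a small illustration (hypothetical EAV table in SQLite, purely to show the shape of the query):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE thing_data (thing_id INTEGER, attr TEXT, value TEXT);
    CREATE INDEX idx_attr_value ON thing_data (attr, value);
""")
con.executemany(
    "INSERT INTO thing_data VALUES (?, ?, ?)",
    [
        (1, "subreddit", "pics"), (1, "author", "alice"),
        (2, "subreddit", "pics"), (2, "author", "bob"),
        (3, "subreddit", "programming"), (3, "author", "alice"),
    ],
)

# "subreddit = 'pics' AND author = 'alice'" can't hit one multi-column index
# covering both attributes; each predicate is resolved on its own and the two
# sets of ids are intersected.
rows = con.execute("""
    SELECT thing_id FROM thing_data WHERE attr = 'subreddit' AND value = 'pics'
    INTERSECT
    SELECT thing_id FROM thing_data WHERE attr = 'author' AND value = 'alice'
""").fetchall()
print(rows)  # [(1,)]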
Reddit doesn't even allow users to save more than 1000 posts, and worse, does not visibly document this or provide any kind of warning that the limit has been exceeded. Anecdotally, I've read users say that revisiting the saved pages will still show an "unsave" button, so the information is recorded somewhere. But once a user exceeds 1000 entries on their "saved" page, adding new ones will silently vaporize old ones.
It's a bit weirder than that. It actually does save all the posts, but the "saved" page (like almost every other page on the site) will only show you 1000 items, so there's just no way to access all the older items once they've been "pushed off the end".
If you unsave a newer one, wouldn't the older one pop up again because it updates the index with limit 1000 again? If the data is there, it should find all posts and then truncate, in that order, if I'm understanding this correctly.
Nope, because the old one's already been pushed off the end of the list and it doesn't re-generate the list when you unsave, just removes the item from the list.
Imagine I have a list of max length 5, when I initially fill it up it looks like [5, 4, 3, 2, 1]. If I save one more thing, it adds 6 at the front, then truncates the list and removes the 1 from the end, so now you have [6, 5, 4, 3, 2]. At that point, if I unsave #3, it just removes it from the list so you'd have [6, 5, 4, 2]. #1 is still saved, but nothing happens to pull it back into the list.
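In code, the behaviour described above looks roughly like this (a toy model with a cap of 5 to match the example):

MAX_ITEMS = 5
saved_listing = []  # newest first; this is the listing, not the saves themselves

def save(post_id):
    saved_listing.insert(0, post_id)  # add at the front...
    del saved_listing[MAX_ITEMS:]     # ...and silently drop anything past the cap

def unsave(post_id):
    if post_id in saved_listing:
        saved_listing.remove(post_id)  # nothing is pulled back in to fill the gap

for p in [1, 2, 3, 4, 5]:
    save(p)
print(saved_listing)  # [5, 4, 3, 2, 1]
save(6)
print(saved_listing)  # [6, 5, 4, 3, 2]  (1 is still saved, just invisible)
unsave(3)
print(saved_listing)  # [6, 5, 4, 2]     (1 does not reappear)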
Saving posts is also notoriously unreliable, in my experience. I've had posts where I've clicked the "save" button and it's flipped to "unsave" but upon refreshing the page the button resets as if it was never pressed (and it never gets added.)
A good workaround is to set up https://ifttt.com/ to trigger when you save and offload your saves somewhere else. I save all my reddit saves to Evernote, personally. It can help with this known issue!
Wow, it's so weird recognizing so many people in this discussion. Thanks for giving examples.
I have a subreddit (/r/pushshift) that gives examples on how to use the API. I'm always happy to answer questions about the API and take suggestions to improve it.
I'm very very close to launching the new API which will have a lot more features and a better design than the current one.
Edit: I just realized my username on this site got truncated to uhhh ... wow.
The big concern I have is that the endpoints used in the article are pretty old, and given that Reddit is now making their site more modern like Facebook and Twitter, they'll just close these endpoint holes like they did before.
Yeah, Reddit's search is unusable. Some of the important features used to be corralled into CloudSearch but I think they killed that too.
I pulled the dumps from BigQuery and am going to load them into PG at some point here, so I can do arbitrary queries without hitting BigQuery every time. I haven't looked at how to do that in real time, though; retrospective queries are mostly what I'm interested in.
If you don't want to do that, there's also https://redditsearch.io/ - if you want to search back farther, be sure to set the toggle to "all" instead of "day"!
I built redditsearch.io -- It uses the pushshift API for the back-end. It was thrown together and barely works but hey, I'm just one guy maintaining this as a labor of love. :)
Why would we think PHP is slow? PHP is blazing fast; certain applications (looking at you, SugarCRM) make a mockery of this by rewriting queries and loading unnecessary data into each page request.
JS has the advantage of being the thing that makes your website run faster on the screen of the user whose clicks are earning you ad dollars. JS is in a good spot.
Because for a long time (critically, when PHP was very popular) it was quite slow. It has gotten better, but a lot of people haven't used it since it was slow.
PHP was popular when it had MySQL bindings that were always up to date with MySQL changes and the Apache PHP setup was simple. Thus the LAMP stack, right? It has always been rather fast in the larger ecosystem of web scripting languages (especially if you avoided particular BIFs), but it got a LOT faster in PHP 7 as the slowest bits were optimized alongside everything else.
Related: Jason Baumgartner has maintained a Reddit scraping pipeline for a few years now, and wrote up some notes about making it robust: https://pushshift.io
You can also use the Pushshift real-time feed in BigQuery to query for keywords in submissions in real time (unfortunately the comments feed broke last month)
Example query which searches for 'f5bot' in the past day and correctly finds the corresponding posts on Reddit:
#standardSQL
SELECT title, subreddit, permalink
FROM `pushshift.rt_reddit.submissions`
WHERE created_utc > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND REGEXP_CONTAINS(LOWER(title), r'f5bot')
There has been a lot of interest expressed in getting this working and dependable. It's part of my plan when releasing the new API. There is A LOT of internal code managing everything. I've got terabytes of indexes alone just to handle the 5 million API requests I'm currently getting each month to the Pushshift API (I have around 20 terabytes of SSD / NVMe space and around 512 GB of ram behind this project).
Also, imagine the joules saved worldwide if known key names didn't have to be sent each time, or, better yet, data was packed in a format that optimized for size and processing speed rather than readability.
I mean, I enjoy the idea of human readability as much as Jon Postel, but at certain scales you have to wonder about the hidden cost of petabytes of human-readable data flying over the wire, never to be seen by anything but computers.
Most clients support it implicitly; you probably have to go out of your way to get an uncompressed stream. Now, compressing a verbose text format is not optimal, but given the past attempts I'd hesitate to use a pre-packed format. Historically that has not worked out well. Compressing the text format is ultimately the worse-is-better solution.
> It’s a bit complicated to set up […] it’s really fast
Sigh, just use Perl. Writing code with the general regex engine took me only one minute of effort, but it already runs nearly 500× faster than codeplea's optimised special-purpose code.
Why 100000 loops and not 10 like in the original code? Otherwise Benchmark.pm will show "(warning: too few iterations for a reliable count)".
----
benchmark.php 100000 loops:
Loaded 3000 keywords to search on a text of 19377 characters.
Searching with aho corasick...
time: 329.3522541523
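The Perl code itself isn't shown above, but the general "one big alternation regex" idea looks roughly like this (sketched in Python purely for illustration; not the commenter's code, and no performance claims implied):

import re

keywords = ["f5bot", "pushshift", "aho", "corasick"]  # imagine ~3000 of these
pattern = re.compile(
    "|".join(re.escape(k) for k in keywords),
    re.IGNORECASE,
)

text = "Someone mentioned F5Bot and the Pushshift API in a comment."
print(pattern.findall(text))  # ['F5Bot', 'Pushshift']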
Is your solution broken for the cases where keywords are prefixes or suffixes of each other? This situation is very common in my use-case. Also, does your solution work if a keyword appears multiple times?
I get what you're saying, but it's not quite as easy as you imply.
Pulling in an entire programming language is a much bigger dependency and maintenance cost than spending a couple hours writing an algorithm. It would make more sense to just use a C extension.
I only took a quick skim, but it doesn't look like they do the optimization to Aho-Corasick where you store connections directly to "leaf" nodes (i.e., nodes where a match finishes).
If I'm right, that would probably speed things up significantly.
Sorry, what optimization are you talking about? If you just mean path compression, then I don't think this works, because you miss out on generating failure links as you go along.
It basically is path compression. You still need to maintain failure links, but at every step, instead of stepping up through the parents to find all matches, you jump directly to the leaves.
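For concreteness, a minimal sketch of Aho-Corasick with those direct output links (a.k.a. dictionary suffix links); this is hypothetical illustration code, not the implementation under discussion:

from collections import deque

class Node:
    def __init__(self):
        self.children = {}
        self.fail = None      # standard failure link
        self.out_link = None  # nearest failure ancestor that ends a keyword
        self.word = None      # keyword ending at this node, if any

def build(keywords):
    root = Node()
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.children.setdefault(ch, Node())
        node.word = kw
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:  # BFS to fill failure and output links
        node = queue.popleft()
        for ch, child in node.children.items():
            fail = node.fail
            while fail is not root and ch not in fail.children:
                fail = fail.fail
            child.fail = fail.children.get(ch, root)
            child.out_link = child.fail if child.fail.word else child.fail.out_link
            queue.append(child)
    return root

def search(root, text):
    node, matches = root, []
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail
        node = node.children.get(ch, root)
        # Collect matches by jumping straight to word-ending nodes via the
        # output links, instead of checking every node on the failure chain.
        cur = node if node.word else node.out_link
        while cur is not None:
            matches.append((i - len(cur.word) + 1, cur.word))
            cur = cur.out_link
    return matches

print(search(build(["he", "she", "his", "hers"]), "ushers"))
# [(1, 'she'), (2, 'he'), (2, 'hers')]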
I made something just like this that worked on forums. Basically, you could subscribe to any forum that was using the Tapatalk plugin (pretty much any busy forum uses it these days). It doesn't look like this will handle misspellings of words, or anything like that. I was handling that; however, it took a LOT of processing power, and I quickly realized that the more people used it, the worse it was going to scale. Good luck with your project.
> So here’s the approach I ended up using, which worked much better: request each post by its ID. That’s right, instead of asking for posts in batches of 100, we’re going to need to ask for each post individually by its post ID. We’ll do the same for comments.
Seems a bit over the top imho. Maybe a better approach is to ask for 1,000 and look for any missing, which you can grab individually.
I'd be a little annoyed at people not using batch mode and making so many requests, but that's just me.
Which API do most Reddit bots use? Do they use the Reddit APIs directly, or do they use one of the third-party services (F5Bot, pushshift)? And are there any other options for getting a firehose of new Reddit posts/comments?
With the "&limit" parameter he can change how many items he receives per HTTP request. This has nothing to do with a limit on how many HTTP requests he can make per TCP connection (pipelining). Maybe that is the "100" he is complaining about, i.e., 100 items per HTTP request.
However, you failed to answer my question: is he making 100 TCP connections to make 100 HTTP requests?
Does the Reddit server set a limit on how many HTTP requests he can make per connection? (100 is a common limit for web servers)
Sometimes the server admins may set a limit of 1 HTTP request per TCP connection. This prevents users from pipelining outside the browser, e.g., with libcurl or some other method.
I didn't feel the need to answer your question because it was abundantly clear in the code that it's not using pipelining. You posted the exact curl option that he's not using.
I apologise if I confused you. I was simply wondering why he is not using pipelining, which IME can be ideal for the sort of text retrieval he is performing.
Please don't remove your old text when it turns out you're wrong or have an unpopular opinion. You're not the only person in this thread who missed the batch part of individual ID requests and I initially was confused as well, but removing your post breaks the comment thread, and after 3h you can't edit anymore so I'll have to downvote.
> So here are 2,000 posts, spread out over 20 batches of 100 that we download simultaneously. It assumes you’ve already got the last post ID loaded into $id base-36.
He's able to ask for multiple posts in one request so the actual request rate is fairly low:
> Here’s how we do it. We find the starting post ID, and then we get posts individually from https://api.reddit.com/api/info.json?id=. We’ll need to add a big list of post IDs to the end of that API URL.
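A rough sketch of that batched /api/info.json fetch (assuming Python's requests library; "t3_" is Reddit's fullname prefix for submissions, "t1_" for comments):

import requests

def base36(n):
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, rem = divmod(n, 36)
        out = digits[rem] + out
    return out or "0"

def fetch_posts_after(last_id_base36, count=100):
    # Build the next `count` sequential post IDs after the last one we saw
    # and append them, comma-separated, to the info endpoint URL.
    start = int(last_id_base36, 36) + 1
    ids = ",".join("t3_" + base36(start + i) for i in range(count))
    url = "https://api.reddit.com/api/info.json?id=" + ids
    resp = requests.get(url, headers={"User-Agent": "keyword-monitor-sketch/0.1"})
    resp.raise_for_status()
    return resp.json()["data"]["children"]  # standard Reddit listing shape

One request still covers up to 100 IDs, so the overall request rate stays low even though every post is named explicitly.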
This is mostly why I left Reddit. The API allows far too much control and I started questioning what was even real. Being able to quickly find keywords and then have a network of bots create replies/upvotes/downvotes is a very disturbing thought to me. I can't even imagine something like that being used on a large scale to change opinions.
The votes aren't real and they don't matter. This is true for FB likes or whatever else you can imagine. Reddit goes to a LOT of trouble to counter bad actors, but if you're in some subreddit that you think is full of abuse, then it's like any other part of the internet. Go somewhere else. This has little to do with Reddit as a whole, so the proclamation seems unnecessarily volatile.
Out of interest in this I made https://linksforreddit.com. You can view who links to certain articles. I backed up and truncated the 2017 data because the VPS ran out of space, but there were and are interesting patterns. Nothing proof-like, but the same PDFs sometimes get linked close to a hundred times by comments that are similarly themed and structured but textually different.
Yes. Voting patterns, for example, how do those change people's opinions? It would seem that the most they could do is give an impression of what comments are and are not well received. Do people base much of their opinions on that? In which direction?
You can save a lot of bandwidth by requesting compressed responses:
(OK, that's 85% saved, not 95%, but hey.)