>In recent years, the web has gotten very hostile to the lowly web scraper. It's a result of the natural progression of web technologies away from statically rendered pages to dynamic apps built with frameworks like React and CSS-in-JS.
Dunno, a lot of the time it actually makes scraping easier because the content that's not in the original source tends to be served up as structured data via XHR (JSON, usually) - you just need to take a look at the data you're interested in, and if it's not in 'view-source', it's coming from somewhere else.
Browser-based scraping makes sense when that data is heavily mangled or obfuscated, or laden with captchas and other anti-scraping measures. Or if you're interested in whether text is hidden, what position it occupies on the page, etc.
Agreed! Multiple times I've wasted hours figuring out which selectors to use, then remembered that I can just look at the network tab and get perfectly structured JSON data.
For those curious about how this can work in production, Puppeteer's setRequestInterception and page.on('response') are incredibly powerful. Platforms like Browserless can make this easy to orchestrate as well. Also, many full-stack JS frameworks will preload JSON payloads into the initial HTML for hydration. There are tons of possibilities beyond DOM scraping.
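For a concrete picture, here's a rough sketch of the response-listening side of that (the example.com URL and the '/api/' filter are placeholders, not anything from a real site):

    // JavaScript (Node + Puppeteer): capture JSON responses instead of scraping the DOM
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Log any JSON body that comes back over XHR/fetch from the (hypothetical) API path.
      page.on('response', async (response) => {
        const type = response.request().resourceType();
        if ((type === 'xhr' || type === 'fetch') && response.url().includes('/api/')) {
          const body = await response.json().catch(() => null); // not every response is JSON
          if (body) console.log(response.url(), body);
        }
      });

      await page.goto('https://example.com/product/123', { waitUntil: 'networkidle0' });
      await browser.close();
    })();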
That said, it's surprising how many high-traffic sites still use "send an HTML snippet over an AJAX endpoint" - or worse yet, ASP.NET forms with stateful servers where you have to dance with __VIEWSTATE across multiple network hops. Part of the art of scraping is knowing when it's worthwhile to go down these rabbit holes, and when it's not!
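For the __VIEWSTATE dance specifically, the usual pattern is: GET the page, pull the hidden state fields out of the HTML, then echo them back with your POST. A minimal sketch, assuming a hypothetical WebForms page and a made-up control name:

    // JavaScript (Node 18+, cheerio): replay an ASP.NET WebForms postback
    const cheerio = require('cheerio');

    (async () => {
      const url = 'https://example.com/Products.aspx'; // hypothetical page
      const $ = cheerio.load(await (await fetch(url)).text());

      // Echo the stateful hidden fields back, plus whatever control triggers the postback.
      const form = new URLSearchParams({
        __VIEWSTATE: $('#__VIEWSTATE').attr('value') || '',
        __EVENTVALIDATION: $('#__EVENTVALIDATION').attr('value') || '',
        __EVENTTARGET: 'ctl00$NextPageButton', // made-up control name
        __EVENTARGUMENT: '',
      });

      // A real stateful server usually also wants the session cookie carried across hops.
      const next = await fetch(url, { method: 'POST', body: form });
      console.log(await next.text()); // the next hop of HTML to parse
    })();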
The point was, you don't have to wait for JS to rearrange the DOM; sometimes it's a simple request to example.com/api/endpoint?productid=123 and you have all the data you need. No worrying about HTML markup.
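In the simple case that really is the whole scraper - a sketch, reusing the hypothetical endpoint above:

    // JavaScript (Node 18+ ESM or browser console): skip the DOM entirely and hit the data endpoint
    const res = await fetch('https://example.com/api/endpoint?productid=123', {
      headers: { Accept: 'application/json' },
    });
    const product = await res.json();
    console.log(product); // all the data, no HTML markup involved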
I think btown's point was that sometimes what you're served from that request is not just "the data you need" but a portion of the page that will be inserted as-is, rather than built from raw data and then inserted, so you still need to parse the HTML in the response, since it's an HTML snippet.
It's still generally easier, because you don't have to worry about zeroing in on the right section of the page before you start pulling the data out of the HTML, but not quite as easy as getting a JSON structure.
I think you're still misunderstanding. Sometimes sites haven't adopted a pure data-driven model, and when example.com/api/endpoint?productid=123 is requested it doesn't return JSON for the product with id 123; instead it returns a div or table row of HTML which already has the data for that product in it, which is then inserted directly where it's meant to go in the current page, rather than built into HTML from JSON and then inserted.
What I was saying is that that method is not quite as easy to get data from as pure JSON, but it's still easier to parse and find the specific data for the item you're looking for, since it's a very small amount of markup, all related to the entry in question.
My interpretation of btown's comment is along the same lines, that it's surprising how many sites still serve HTML snippets for dynamic pages.
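When an endpoint does hand back an HTML fragment like that, the parse is still tiny, because the snippet only contains the one entry - a sketch with a made-up fragment shape:

    // JavaScript (Node 18+, cheerio): the endpoint returns a fragment of HTML rather than JSON
    const cheerio = require('cheerio');

    (async () => {
      const fragment = await (await fetch('https://example.com/api/endpoint?productid=123')).text();
      // e.g. '<tr class="product"><td class="name">Widget</td><td class="price">$9.99</td></tr>'
      const $ = cheerio.load(fragment);
      console.log({
        name: $('.name').text().trim(),   // class names here are hypothetical,
        price: $('.price').text().trim(), // but the point stands: it's a tiny, self-contained parse
      });
    })();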
But also, some more modern sites with JSON API endpoints will have extremely bespoke session/auth/state management systems that make it difficult to create a request payload that will work without calculations done deep in the bowels of their client-side JS code. It can be much easier, if slower and more costly, to mimic a browser and listen to the equivalent of the Network tab, than to find out how to create valid payloads directly for the API endpoints.
If you can see the request in the network tab, you can just right-click, "Copy as cURL", and then replay the request from the command line and noodle with the request parameters that way. Works great!
Honestly, from prior experience, any scraping requirements that call for a browser tend to be due to captchas and anti-scraping measures, nothing to do with the data layout.
It's either in the DOM or in one or two other payloads.
Isn't this sort of why people hide themselves behind Cloudflare - to weed out the lowest common denominator of scrapers?
Yes. Sometimes you can see that there's a static (per-session) header they add to each request, and all you have to do is find and record that header value (such as by shimming XMLHttpRequest's setRequestHeader) and append it to your own requests from that context...
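A minimal sketch of that shim, run in the page context (the header name is made up - you'd record whichever one the site actually sends):

    // JavaScript (page context, e.g. a userscript): record a per-session header the app adds to its requests
    const originalSet = XMLHttpRequest.prototype.setRequestHeader;
    XMLHttpRequest.prototype.setRequestHeader = function (name, value) {
      if (name.toLowerCase() === 'x-session-token') { // hypothetical header name
        window.__capturedToken = value;               // stash it to reuse in your own requests
        console.log('captured', name, value);
      }
      return originalSet.call(this, name, value);
    };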
The challenge there is automating it, though - usually the REST endpoints require some complex combination of temporary auth token headers that are (intentionally) difficult to generate outside the context of the app itself and that expire pretty quickly.
Care to provide some examples? The majority of sites submitted to HN do not even require cookies, let alone tokens in special headers. A site like Twitter is the exception, not the general rule.
Not sure about "scraping targets". I'm referring to websites that can be read without using JavaScript. Few websites submitted to HN try to discourage users with JS disabled from reading them by using tokens in special headers. Twitter is an exception, and Twitter's efforts to annoy users into enabling JavaScript are ineffective anyway.
But what if their backend blocks you? I'm trying to develop an Instagram scraper, and I'm finding that I'll have to spend money on rotating proxies.
It doesn't matter if you scrape the DOM or get some JSON.
I just need to scrape some public accounts' posts and, I may be dumb, but I don't know how to do that with the official APIs (developers.facebook is hard for me to understand).
Hah, I literally just fought this for the past month. We run a large esports league that relies on player ranked data. They have the data, and as mentioned above, they send it down to the browser in beautiful JSON objects.
But they're sitting behind Cloudflare and aggressively blocking attempts to fetch data programmatically, which is a huge problem for us with 6000+ players' worth of data to fetch multiple times every 3 months.
So... I built a Chrome Extension to grab the data at a speed that is usually under their detection rate. Basically created a distributed scraper and passed it out to as many people in the league as I could.
For big jobs, when we want to do giant batches, it was a simple matter of doing the pulls and, when we started getting 429 errors (the rate-limiting status code they use), switching to a new IP on the VPN.
The only way they can block us now is if they stop having a website.
Give one of the commercial VPN providers a try. They're usually pretty cheap and have tons of IPs all over the place. Adding a "VPN Disconnect / Reconnect" step to the process only added about 10 seconds per request every so often.
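For what it's worth, the pull loop itself can be something as simple as this - the endpoint and IDs are made up, and the "switch VPN" step is the manual/scripted reconnect described above:

    // JavaScript: throttled pulls that back off on HTTP 429 and retry after the IP changes
    async function pullAll(playerIds) {
      const queue = [...playerIds];
      while (queue.length) {
        const id = queue.shift();
        const res = await fetch(`https://example.com/api/players/${id}/ranked`); // hypothetical endpoint
        if (res.status === 429) {
          queue.push(id); // put it back; retry once we have a new exit IP
          console.warn('429 - reconnect the VPN or wait before resuming');
          await new Promise((r) => setTimeout(r, 60_000));
          continue;
        }
        console.log(id, await res.json());
        await new Promise((r) => setTimeout(r, 2_000)); // keep the rate low enough to stay under detection
      }
    }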
It probably doesn't save you much, since you already built the Chrome extension, but having done both I found that Tampermonkey is much easier to deal with in most cases and also much quicker to develop for (you can literally edit the script in the Tampermonkey extension settings page and reload the page you want it to apply to for immediate testing).
I might be wrong, but some sites can block 'self' origin scripts by leaving it out of the Content Security Policy and only allowing scripts they control served by a CDN or specified subdomain to run on their page. Not sure when I last tried this and on what browser(s).
You'd have to disable CSP manually in your browser config to make it work, but that leaves you with an insecure browser and a lot of friction for casual users. Not sure if you can tie about:config options to a user profile for this use case. Distributing a working extension/script is getting harder all the time.
I don't recall whether I've encountered that specific problem in Tampermonkey (or whether I did and it didn't cause a problem worth remembering), but you can run things in the extension's context as well to bypass certain restrictions, and use special extension-provided functions (GM_* from the Greasemonkey standard) that allow for additional actions.
I do recall intercepting requests to change CSP values when I used a Chrome extension, though, and not needing to when doing something similar later in Tampermonkey; but it may not have been quite the same issue you're describing, so I can't definitively say whether I had a problem with it or not.
You can buy proxies to use, of varying quality, but they are somewhat expensive depending on what you need.
I'll just say that Firefox still runs Tampermonkey, and that includes Firefox mobile, so depending on how often you need a different IP and how much data you're getting, you might be able to do away with the whole idea of proxies and just have a few mobile phones configured as workers that take requests through a Tampermonkey script - or a laptop that tethers to one and does the same, or that runs Puppeteer itself. Whether a real mobile phone works depends on whether a worker needs a new IP every few minutes, hours, or days (some manual interaction is often required to actively change the IP).
> the content that's not in the original source tends to be served up as structured data via XHR (JSON, usually)
Yes, you can overwrite fetch and log everything that comes in or out of the page you're looking at. I do that in Tampermonkey but one can probably inject the same kind of script in Puppeteer.
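A minimal Tampermonkey-style sketch of that override (the @match URL is a placeholder):

    // ==UserScript==
    // @name     Log fetch traffic
    // @match    https://example.com/*
    // @run-at   document-start
    // @grant    none
    // ==/UserScript==

    // JavaScript: wrap window.fetch so every response body gets logged as the page receives it
    const originalFetch = window.fetch;
    window.fetch = async (...args) => {
      const response = await originalFetch(...args);
      response.clone().text()                                   // clone so the page still gets its body
        .then((body) => console.log('fetch:', args[0], body.slice(0, 500)))
        .catch(() => {});
      return response;
    };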
I'm grateful that GraphQL proliferated, because I don't even have to scrape such resources - I just query.
A while ago, when I was looking for an apartment, I noticed that only the mobile app for a certain service allowed drawing the area of interest - the web version only had the option of searching the area currently visible on the screen.
Or did it? Turns out it was the same GraphQL query with the area described as a GeoJSON object.
GeoJSON allows for disjoint areas, which was particularly useful in my case, because I had three of them.
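The request ends up looking roughly like this - the query shape, field names, and coordinates are all made up for illustration; the MultiPolygon is the point:

    // JavaScript: query the (hypothetical) listings GraphQL endpoint directly
    const query = `
      query Listings($area: GeoJSON!) {
        listings(area: $area) { id price url }
      }`;

    // One MultiPolygon covers several disjoint areas in a single request.
    const variables = {
      area: {
        type: 'MultiPolygon',
        coordinates: [
          [[[21.00, 52.23], [21.02, 52.23], [21.02, 52.25], [21.00, 52.23]]],
          // ...two more polygons for the other areas of interest
        ],
      },
    };

    const res = await fetch('https://example.com/graphql', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query, variables }),
    });
    console.log(await res.json());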
I am still using CasperJS with PhantomJS. Old tech, but it works perfectly. Some scripts have been running for 10 years on the same sites without my ever having made a change.
Are there really that many opportunities where you need to scrape it with a browser, as opposed to just fetching from the same JSON endpoint the website is getting it from?
There are some, not many, but when possible I would rather just use a simple request library to fetch it than have to spin up a browser.