If you change your browser's User-Agent string to Googlebot's, many of these sites will treat your client as a first-class citizen. Google always wins, so let's all be Google.
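For what it's worth, you can see the difference without touching the browser at all: the same page fetched with and without Googlebot's user agent often comes back looking quite different. A rough Node sketch (the URL is a placeholder; the UA below is the classic Googlebot token):

```js
// Rough sketch (Node 18+, which ships a global fetch): request the same page
// with and without Googlebot's user agent and compare what comes back.
// The UA string is the classic Googlebot token; the URL is a placeholder.
const GOOGLEBOT_UA =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

(async () => {
  const url = 'https://example.com/some-article';
  const asBot = await (await fetch(url, { headers: { 'User-Agent': GOOGLEBOT_UA } })).text();
  const asMe = await (await fetch(url)).text();
  console.log('as Googlebot:', asBot.length, 'chars; as yourself:', asMe.length, 'chars');
})();
```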
It's extremely rare to be IP-blocked by any website just for using Google's user agent from an IP range that isn't Google's. IPs get reused and you can switch to a new one easily, so blocking on that basis is neither common nor good practice.
> IPs get reused and you can switch to a new one easily, so blocking on that basis is neither common nor good practice.
On the flip side, some people can't change their IP address easily, and getting IP-banned (even if rare, for the reasons you stated) is a major hassle for them when it does happen. :/
Is that really a thing? That must be such a hazard for their developers. For sites I work on, I usually have a test that scrapes a few URLs as Googlebot to verify they get an optimized view (no JS, structural-only CSS).
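Roughly along these lines, if anyone's curious; a Node sketch where the URLs, the "no script tags" check, and the size threshold are stand-ins for whatever "optimized" means for your site:

```js
// Hypothetical smoke test: fetch a few URLs as Googlebot and assert that the
// crawler gets the lean view (no <script> tags, modest payload). The URLs,
// the checks, and the size threshold are placeholders for a real test suite.
import assert from 'node:assert/strict';

const GOOGLEBOT_UA =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const URLS = ['https://example.com/', 'https://example.com/pricing'];

for (const url of URLS) {
  const res = await fetch(url, { headers: { 'User-Agent': GOOGLEBOT_UA } });
  const html = await res.text();
  assert.ok(!/<script\b/i.test(html), `${url} still serves JS to Googlebot`);
  assert.ok(html.length < 200_000, `${url} crawler view looks suspiciously heavy`);
}
console.log('Googlebot view looks optimized');
```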
Yes. Googlebot only crawls from legit Google addresses (even when their developers are trying new things), so it's an easy scraper/scammer signal to key off of.
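That check is cheap because Googlebot's crawl IPs reverse-resolve to googlebot.com or google.com hostnames, which you can then forward-confirm. A rough Node sketch of the idea (IPv4 only, minimal error handling):

```js
// Sketch of the standard Googlebot verification: reverse-DNS the client IP,
// check the hostname ends in googlebot.com or google.com, then confirm that
// hostname resolves back to the same IP. IPv4 only, minimal error handling.
import { promises as dns } from 'node:dns';

async function isRealGooglebot(ip) {
  try {
    const [host] = await dns.reverse(ip);
    if (!/\.(googlebot|google)\.com$/.test(host)) return false;
    const addrs = await dns.resolve4(host); // forward-confirm the PTR record
    return addrs.includes(ip);
  } catch {
    return false; // no PTR record, lookup failure, malformed input, etc.
  }
}

// e.g. if (!(await isRealGooglebot(req.socket.remoteAddress))) { /* treat as scraper */ }
```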
No. It lets Googlebot see full articles, but shows only the first paragraph or so to non-subscribers, even if they're coming from Google search results.
However, I don't see cache links on Google :(
Edit: Oops, I'm wrong. The article does say that Googlebot only sees the first paragraph or so.
"The reason: Google search results are based on an algorithm that scans the internet for free content. After the Journal’s free articles went behind a paywall, Google’s bot only saw the first few paragraphs and started ranking them lower, limiting the Journal’s viewership."
I call that the nuclear option. It's almost guaranteed to win the war with ad-tech! It should be deployed against sites with runaway ad engines that spin up your CPU fans and make scrolling laggy.
Of course, the problem with nuclear is collateral damage. Drop the bomb and ads don't work, but neither does a lot of other stuff. E.g., the site shows a blank screen, images are invisible or blurry, drop-down menus don't drop. And, of course, the deal-breaker: videos don't play.
The remedy for the fallout from killing JavaScript is more JavaScript (and CSS), supplied inside a Chrome extension targeted at the offending site. An injected stylesheet makes `<body>` visible again, hides assorted useless junk, and styles the injected UI elements. Your content scripts load the missing images, drop the menus down, and play the unplayable videos in button-activated pop-over windows at superior resolution.
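Concretely, each per-site fix-up boils down to a content script plus injected CSS along these lines. This is only a sketch: the class names are invented, and the script would be registered in the extension's manifest.json under `content_scripts` with a `matches` pattern for the offending site.

```js
// Hypothetical content script for one offending site (all selectors invented).
// Registered in manifest.json under "content_scripts" with a "matches" pattern
// limiting it to that site.

// Inject CSS: make <body> visible again, hide the junk, style our own UI.
const style = document.createElement('style');
style.textContent = `
  body { visibility: visible !important; }
  .ad-slot, .paywall-overlay, .newsletter-nag { display: none !important; }
  .my-ext-video-button { position: absolute; top: 8px; right: 8px; }
`;
document.documentElement.appendChild(style);

// Load the images the site's own (now disabled) JS would have lazy-loaded.
// Many sites stash the real URL in a data-src attribute -- an assumption here.
for (const img of document.querySelectorAll('img[data-src]')) {
  img.src = img.dataset.src;
}
```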
Of course, the problem is that there are a lot of sites out there, and they change unpredictably, requiring your extension library to change in response. That argues for crowd-sourcing the extension library, but the crowd needs to be proficient in HTML, JavaScript, and CSS, know the ins and outs of browser extensions, and care enough to put in the time.
You can completely change how a site presents. E.g., turn a slide-show stuck in a static slide window (one that barely moves under the background ad-tech load) into a set of `divs` that roll upward as your finger swipes.
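For the slide-show case, the content-script side of that might look something like this (the selectors are invented; every site names these things differently):

```js
// Sketch: flatten a sluggish slide-show into plain stacked divs so native
// scrolling/swiping takes over. Selectors and class names are invented.
for (const slide of document.querySelectorAll('.slideshow .slide')) {
  slide.style.display = 'block';    // show every slide at once
  slide.style.position = 'static';  // undo the carousel's absolute positioning
  slide.style.transform = 'none';   // undo any translate-based animation
}
// The next/previous controls no longer serve a purpose.
document.querySelectorAll('.slideshow .slide-controls').forEach((el) => el.remove());
```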
It's a hobby at best. Disabling ad-tech components by origin is the practical option.
Call me Dr. Strangelove, then. I usually browse with JS off, enabling it on occasion, and keep a few sites whitelisted.
I used to play around with filtering sites to make them less antisocial, but find that slog less entertaining these days. So now when confronted with a site that's useless without JS, eh, there's almost always another site out there that doesn't mind the terms I demand for my attention.