
Possible, but seems unlikely. To set that up, the WSJ website would have to allow Googlebot access while denying others. Any filtering based on the URL or HTTP headers would be discovered and abused by others. An approach based on a security token or IP filter could work, but would be unmanageable on Google's side because of the scale of their spidering operation. It would be much more effective for them to use their position to force the WSJ to be an open website, or to accept that their paywalled content does not get indexed.



Just allow access from all of Google's IP space, as long as the user-agent contains "googlebot". It's pretty trivial to do...

Sure, some people could set up GCP instances and proxy it, but that's a very tiny percentage of people.
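
Roughly this, as a Python sketch (the CIDR blocks below are illustrative placeholders, not an authoritative list of Google's address space):

    import ipaddress

    # Placeholder ranges -- in practice these would have to be sourced and
    # maintained from somewhere authoritative.
    GOOGLE_NETS = [ipaddress.ip_network(n)
                   for n in ("66.249.64.0/19", "64.233.160.0/19")]

    def is_trusted_googlebot(client_ip: str, user_agent: str) -> bool:
        # Skip the paywall only if the request claims to be Googlebot AND the
        # source address falls inside the assumed Google ranges above.
        if "googlebot" not in user_agent.lower():
            return False
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in GOOGLE_NETS)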


It might start out as a tiny percentage, but all it takes is one person setting it up and letting the world know about it. Then pretty soon the WSJ is faced with millions of people getting free content again. They're complaining about people accessing their articles via a Google search and then clearing cookies to reset their counters. That's hardly mainstream; browsers have been burying the clear-cookies functionality deeper and deeper over the years because it's seen as an advanced-user-only kind of thing. And yet the WSJ has millions of people doing it, enough to make an impact on their bottom line.


I'd bet it's browsing in incognito mode rather than actually clearing cookies.


You can verify whether an IP belongs to Googlebot; no need to whitelist GCP.

https://support.google.com/webmasters/answer/80553?hl=en
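
A quick Python sketch of the check that page describes: reverse DNS on the connecting IP, make sure the name is under googlebot.com or google.com, then a forward lookup to confirm it maps back to the same IP.

    import socket

    def verify_googlebot(ip: str) -> bool:
        try:
            host = socket.gethostbyaddr(ip)[0]              # reverse lookup
        except OSError:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
        except OSError:
            return False
        return ip in forward_ips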


> Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard-coded them, so you must run a DNS lookup as described next.

DNS lookups are far more expensive than an IP filter and couldn't be done in real time. So the WSJ would have to set up a system where they regularly find all rejected requests in their logs that claimed a Googlebot user-agent, do DNS lookups on those IPs, and add any that were valid to a whitelist so they won't get rejected again. This would cause new Googlebot IPs to get rejected until the whitelist is updated, hurting indexing and ranking. The WSJ would also have to go through their whitelist regularly and do DNS lookups to verify that all of those IPs still belong to Googlebot, removing any that don't. That opens a window for invalid IPs to keep getting access, which may or may not be a problem depending on how often IPs change and where they get reassigned.
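
Something like this is the rough shape of that batch process (hypothetical file paths and log format, reusing the same verification trick as above):

    import re
    import socket

    def verify_googlebot(ip: str) -> bool:
        # Reverse lookup, domain check, forward lookup -- as in the Google doc.
        try:
            host = socket.gethostbyaddr(ip)[0]
            return (host.endswith((".googlebot.com", ".google.com"))
                    and ip in socket.gethostbyname_ex(host)[2])
        except OSError:
            return False

    def update_whitelist(log_path: str, whitelist_path: str) -> None:
        with open(whitelist_path) as f:
            whitelist = set(f.read().split())

        # Re-verify existing entries, dropping IPs Google no longer uses.
        whitelist = {ip for ip in whitelist if verify_googlebot(ip)}

        # Pick up rejected (403) requests that claimed a Googlebot user-agent.
        rejected = re.compile(r'^(\S+) .* 403 .*[Gg]ooglebot')
        with open(log_path) as f:
            for line in f:
                m = rejected.match(line)
                if m and verify_googlebot(m.group(1)):
                    whitelist.add(m.group(1))

        with open(whitelist_path, "w") as f:
            f.write("\n".join(sorted(whitelist)))

Both passes still leave the windows described above: a new Googlebot IP keeps getting rejected until the next run, and a reassigned IP keeps its access until the next re-verification.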

The IP whitelist would need to be distributed to the WSJ's webserver farms and used to update firewall rules, in an automated way that may or may not integrate with how that stuff is currently managed. (Generally, those rules are tightly controlled in a big org like the WSJ.) Gathering the HTTP access logs from the farms and analyzing them would also need to be automated, which again might be a management issue if the logs contain anything sensitive. (Like, I don't know, records of particular individuals reading particular stories, which certain government agencies might be interested in acquiring without the hassle of a warrant.)

So yeah, there's a way to find out if an IP belongs to Googlebot. That's a long way from a manageable filtering solution at the WSJ's scale, even if Google wouldn't penalize them for doing it, which they would.


There are commercial solutions [1][2] to this which are widely used for cloaking.

[1] https://my.bseolized.com/products/ipgrabber

[2] http://wpcloaker.com/


Googlebot's IP range and GCP's IP ranges are disjoint.





