Google doesn't post a public list of IP addresses for webmasters to whitelist, because those ranges can change and break anything that hard-codes them. Instead, you're supposed to verify Googlebot with a DNS lookup, as described in the support article linked at the bottom.
DNS lookups are far more expensive than an IP filter and can't be done in real time on every request. So WSJ would have to set up a system that regularly scans their logs for rejected requests claiming to be Googlebot (by user-agent), runs the DNS lookups on those IPs, and adds the ones that verify to a whitelist so they won't get rejected again. New Googlebot IPs would still be rejected until the next whitelist update, hurting indexing and ranking. WSJ would also have to go through the whitelist regularly, repeat the DNS lookups, and remove any IPs that are no longer valid Googlebot addresses. That opens a window for stale IPs to keep getting access, which may or may not be a problem depending on how often IPs change and where they get reassigned.
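For what it's worth, the lookup itself (the double check described in the support article linked below) is easy to sketch. Something like this in Python, assuming the reverse-then-forward verification from that doc; the function name is mine:

    import socket

    def is_googlebot(ip: str) -> bool:
        """Reverse-then-forward DNS check for an IP claiming to be Googlebot.

        Per the support doc: the PTR record has to end in googlebot.com or
        google.com, and that hostname has to resolve back to the same IP,
        otherwise the PTR record could be spoofed.
        """
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            forward_ips = {a[4][0] for a in socket.getaddrinfo(host, None)}
            return ip in forward_ips                     # forward lookup must match
        except OSError:                                  # no PTR record, or name won't resolve
            return False

A real crawler IP reverses to something like crawl-66-249-66-1.googlebot.com and passes the forward check; a faker's IP either has no such PTR record or resolves to something else. This is the check the batch job above would run on every rejected IP, and again periodically on everything already in the whitelist.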
The IP whitelist would need to be distributed to WSJ's webserver farms and used to update firewall rules, in an automated way that may or may not integrate with how that stuff is currently managed. (Generally, those rules would be tightly controlled in a big org like the WSJ.) Gathering the HTTP access logs from the farms and analyzing them would also need to be automated, which again might be a management issue if the logs contain anything sensitive. (Like, I don't know, records of particular individuals reading particular stories, which certain government agencies might be interested in acquiring without the hassle of a warrant.)
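Just to illustrate the distribution half: once the batch job has a verified list, rendering it as per-server config is the trivial part; something like an nginx-style allow list, say (the file names and the nginx approach are my assumptions, not anything the WSJ actually runs):

    from pathlib import Path

    def write_allow_rules(whitelist_file: str = "googlebot_ips.txt",
                          out_file: str = "googlebot_allow.conf") -> None:
        """Render the verified IP list as an nginx include full of allow rules."""
        ips = sorted(set(Path(whitelist_file).read_text().split()))
        Path(out_file).write_text("".join(f"allow {ip};\n" for ip in ips))

Getting every farm to pick that file up, reload, and keep the firewall rules in sync with it is exactly the change-management headache above.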
So yeah, there's a way to find out if an IP belongs to Googlebot. That's a long way from a manageable filtering solution at the WSJ's scale, even if Google wouldn't penalize them for doing it, which they would.
https://support.google.com/webmasters/answer/80553?hl=en