How to make fun of Google Bot (PicoLisp Wiki) (picolisp.com)
85 points by markokocic on July 15, 2011 | 20 comments



I once made a similar (but less real-language-like) site to fool spambots on my now-defunct web consultancy's page (http://www.resolution.nl/food if you care). The idea was that a crawler harvesting the internet for email addresses to spam would fill its DB with bogus ones, after which, hopefully, the spammer would simply dump that day's results in annoyance, our real email address included. Never figured out whether that really worked, but it was fun to make.
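
(Not the original code, which is long gone, but a minimal Python sketch of the idea, using made-up names and example.* domains: every trap page is a pile of bogus addresses plus a link leading deeper into the trap.)

   # Minimal sketch of a spambot trap page (hypothetical, not the
   # original resolution.nl implementation).
   import random

   WORDS = ["piet", "truus", "henk", "annie", "kees", "joop", "ria", "wim"]
   DOMAINS = ["example.org", "example.net", "example.com"]

   def bogus_email():
       return "%s.%s%d@%s" % (random.choice(WORDS), random.choice(WORDS),
                              random.randint(1, 999), random.choice(DOMAINS))

   def trap_page(depth=0):
       # 50 plausible-looking but fake addresses for the harvester's DB ...
       emails = "<br>".join('<a href="mailto:%s">%s</a>' % (e, e)
                            for e in (bogus_email() for _ in range(50)))
       # ... plus a link to yet another trap page, so it never runs out.
       link = '<a href="/food/%d">more contacts</a>' % (depth + 1)
       return "<html><body>%s<p>%s</p></body></html>" % (emails, link)

   print(trap_page())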

What did work, however, was fooling a searchbot: the whole thing got me a very angry mail from a Dutch search engine team (ilse.nl) whose bot had been stuck on it for an entire day. I had no robots.txt (didn't even know what it was), which the search engine team decided was a really nasty breach of netiquette.


The whole thing got me a very angry mail from a Dutch search engine team (ilse.nl) whose bot had been stuck on it for an entire day.

So, somehow their ill-coded bot crashing was your fault? It's not like you forced them to crawl your site.


Correct, which is why I laughed.


see here for a spec of how robots.txt is parsed:

http://code.google.com/web/controlcrawlindex/docs/robots_txt...

the robots.txt of http://picolisp.com (found at http://picolisp.com/robots.txt) allows the indexing of http://picolisp.com/21000 and all follow-up pages.

why? see the spec:

The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.

see https://github.com/franzenzenhofer/robotstxt for a CoffeeScript implementation of a robots.txt parser
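
to see that rule in action, here's a quick check with Python's stdlib robots.txt parser (its matching is much simpler than google's, but it agrees on the empty-disallow case):

   # an empty "Disallow:" line blocks nothing, so the whole site stays crawlable
   from urllib.robotparser import RobotFileParser

   rp = RobotFileParser()
   rp.parse(["User-agent: *", "Disallow:"])   # empty path => directive ignored

   print(rp.can_fetch("Googlebot", "http://picolisp.com/21000"))          # True
   print(rp.can_fetch("Googlebot", "http://picolisp.com/anything/else"))  # True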


so what should his robots.txt look like? at the moment it is:

   User-Agent: *
   Disallow: /21000/


It's mostly sufficient. /21000/ will not match http://picolisp.com/21000, which is the first URL in the sequence, but the remaining URLs look like http://picolisp.com/21000/!start?*Page=+2, so Googlebot will likely continue to download only a single page once it has re-read the robots.txt.

Which is what you deserve for using non-standard URL formats.


Hold on, a slash at the end is not standard?


No, I'm saying /21000/ will match a path with a directory named /21000 but not a file named /21000.

When I say "non-standard", I am saying that if the website's URLs looked like "/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a "Disallow" rule that successfully blocked all of the desired pages.
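
The difference is easy to check with Python's stdlib robots.txt parser (a rough stand-in for Googlebot's matcher, which additionally supports wildcards, but the prefix rule is the same):

   # "/21000/" only matches paths under the /21000/ "directory", not the
   # bare /21000 entry page; dropping the trailing slash catches both.
   from urllib.robotparser import RobotFileParser

   with_slash = RobotFileParser()
   with_slash.parse(["User-agent: *", "Disallow: /21000/"])

   without_slash = RobotFileParser()
   without_slash.parse(["User-agent: *", "Disallow: /21000"])

   entry = "http://picolisp.com/21000"
   deeper = "http://picolisp.com/21000/!start?*Page=+2"

   print(with_slash.can_fetch("Googlebot", entry))      # True  -- entry page still crawlable
   print(with_slash.can_fetch("Googlebot", deeper))     # False
   print(without_slash.can_fetch("Googlebot", entry))   # False
   print(without_slash.can_fetch("Googlebot", deeper))  # False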


   User-Agent: *
   Disallow: /21000
or

   User-Agent: *
   Disallow: /


The vast majority of the time when I see a complaint like this about robots.txt, it's because the site has missed a character here or there, or because it's not putting the robots.txt file on the right hostname.

Google has a free robots.txt checker that lets you test your robots.txt files. Given a robots.txt file, you can enter specific URLs and check whether each URL would be blocked or not. Here's a link for more info on that free tool: http://www.google.com/support/webmasters/bin/answer.py?hl=en...


Another incorrect assumption in the article is that every bot with the Google UA originates from Google. There are plenty of other (often malicious) bots that simply copy Google's signature to make themselves less obvious. He needs to check the IP block to make sure the bot with the observed behavior was really from Google.
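
Google's documented way to verify this is a reverse DNS lookup on the requesting IP, a check that the name is under googlebot.com or google.com, and a forward lookup to confirm it resolves back to the same IP. A rough sketch (not what the article did):

   # Confirm a "Googlebot" IP really belongs to Google via reverse + forward DNS.
   import socket

   def is_real_googlebot(ip):
       try:
           host, _, _ = socket.gethostbyaddr(ip)   # e.g. crawl-66-249-71-203.googlebot.com
       except socket.herror:
           return False
       if not host.endswith((".googlebot.com", ".google.com")):
           return False
       try:
           # Forward-confirm: the name must resolve back to the same IP.
           return ip in socket.gethostbyname_ex(host)[2]
       except socket.gaierror:
           return False

   print(is_real_googlebot("66.249.71.203"))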


I just took a look at the last offending Googlebot IP and it seems to originate from Google.

http://www.ip-adress.com/ip_tracer/66.249.71.203


That reminds me of this page in which the author created a large binary tree of pages and watched how various crawlers walked the tree.

http://www.drunkmenworkhere.org/219
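
Building that kind of trap is simple; a hypothetical sketch (not the code behind that page) where page /n links to its children /2n and /2n+1, so a crawler that follows every link walks an effectively unbounded binary tree:

   # Serve an "infinite" binary tree of pages: node n links to 2n and 2n+1.
   from http.server import BaseHTTPRequestHandler, HTTPServer

   class TreeHandler(BaseHTTPRequestHandler):
       def do_GET(self):
           try:
               n = int(self.path.strip("/") or 1)
           except ValueError:
               n = 1
           body = ('<html><body><h1>Node %d</h1>'
                   '<a href="/%d">left</a> <a href="/%d">right</a>'
                   '</body></html>' % (n, 2 * n, 2 * n + 1))
           self.send_response(200)
           self.send_header("Content-Type", "text/html")
           self.end_headers()
           self.wfile.write(body.encode())

   if __name__ == "__main__":
       HTTPServer(("", 8000), TreeHandler).serve_forever()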


Very interesting experiment, and surprising that the Google bot appears (in this instance at least) to be ignoring robots.txt.


The problem is not Googlebot.

robots.txt on ticker.picolisp.com says "Disallow: /", but ticker.picolisp.com redirects to picolisp.com/21000, and the robots.txt on picolisp.com says "Disallow:". If he wants Googlebot to stop crawling those URLs, he needs to add "Disallow: /21000" to picolisp.com.
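
In other words, each hostname needs its own rules; assuming the setup described above, something like:

   # ticker.picolisp.com/robots.txt (already in place)
   User-Agent: *
   Disallow: /

   # picolisp.com/robots.txt (the missing piece)
   User-Agent: *
   Disallow: /21000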


Hmmmm, maybe he just did: http://picolisp.com/robots.txt


I don't think it's ignoring robots.txt; the old version is probably just cached on another thread/machine/datacenter. It will pick up the new robots.txt eventually. This behavior is well known.


But what's the point of it downloading the robots.txt file if it isn't going to honor it?


Presumably, one cluster (or whatever Google calls them) is honoring it, while a different cluster isn't.


Dumb bots are dumb



