How to make fun of Google Bot (PicoLisp Wiki) (picolisp.com)
85 points by markokocic on July 15, 2011 | 20 comments



I once made a similar (but less real-language-like) site to fool spambots on my now-defunct web consultancy's page (http://www.resolution.nl/food if you care). The idea was that a crawler harvesting the internet for email addresses to spam would fill its DB with bogus ones, after which, hopefully, the spammer would simply dump that day's results in annoyance, our real email address included. Never figured out whether that really worked, but it was fun to make.
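
(Not the original code, which is long gone, but a minimal Python sketch of the idea, using made-up names and example.* domains: every trap page is a pile of bogus addresses plus a link leading deeper into the trap.)

   # Minimal sketch of a spambot trap page (hypothetical, not the
   # original resolution.nl implementation).
   import random

   WORDS = ["piet", "truus", "henk", "annie", "kees", "joop", "ria", "wim"]
   DOMAINS = ["example.org", "example.net", "example.com"]

   def bogus_email():
       return "%s.%s%d@%s" % (random.choice(WORDS), random.choice(WORDS),
                              random.randint(1, 999), random.choice(DOMAINS))

   def trap_page(depth=0):
       # 50 plausible-looking but fake addresses for the harvester's DB ...
       emails = "<br>".join('<a href="mailto:%s">%s</a>' % (e, e)
                            for e in (bogus_email() for _ in range(50)))
       # ... plus a link to yet another trap page, so it never runs out.
       link = '<a href="/food/%d">more contacts</a>' % (depth + 1)
       return "<html><body>%s<p>%s</p></body></html>" % (emails, link)

   print(trap_page())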

What did work, however, was fooling a searchbot: the whole thing got me a very angry mail from a Dutch search engine team (ilse.nl) whose bot had been stuck on it for an entire day. I had no robots.txt (didn't even know what it was), which the search engine team decided was a really nasty breach of netiquette.


The whole thing got me a very angry mail from a Dutch search engine team (ilse.nl) whose bot had been stuck on it for an entire day.

So, somehow their ill-coded bot crashing was your fault? It's not like you forced them to crawl your site.


Correct, which is why I laughed.


see here for a spec of how robots.txt is parsed:

http://code.google.com/web/controlcrawlindex/docs/robots_txt...

the robots.txt of http://picolisp.com (found at http://picolisp.com/robots.txt) allows the indexing of http://picolisp.com/21000 and all follow-up pages.

why? see the spec:

The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.

see https://github.com/franzenzenhofer/robotstxt for a CoffeeScript implementation of a robots.txt parser
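
to see that rule in action, here's a quick check with Python's stdlib robots.txt parser (its matching is much simpler than google's, but it agrees on the empty-disallow case):

   # an empty "Disallow:" line blocks nothing, so the whole site stays crawlable
   from urllib.robotparser import RobotFileParser

   rp = RobotFileParser()
   rp.parse(["User-agent: *", "Disallow:"])   # empty path => directive ignored

   print(rp.can_fetch("Googlebot", "http://picolisp.com/21000"))          # True
   print(rp.can_fetch("Googlebot", "http://picolisp.com/anything/else"))  # True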


so what should his robots.txt look like? at the moment it is:

   User-Agent: *
   Disallow: /21000/


It's mostly sufficient. /21000/ will not match http://picolisp.com/21000, which is the first URL in the sequence, but the remaining URLs look like http://picolisp.com/21000/!start?*Page=+2, so Googlebot will likely continue to download only a single page once it has re-read the robots.txt.

Which is what you deserve for using non-standard URL formats.


Hold on, a slash at the end is not standard?


No, I'm saying /21000/ will match a path with a directory named /21000 but not a file named /21000.

When I say "non-standard", I am saying that if the website's URLs looked like "/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a "Disallow" rule that successfully blocked all of the desired pages.
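
The difference is easy to check with Python's stdlib robots.txt parser (a rough stand-in for Googlebot's matcher, which additionally supports wildcards, but the prefix rule is the same):

   # "/21000/" only matches paths under the /21000/ "directory", not the
   # bare /21000 entry page; dropping the trailing slash catches both.
   from urllib.robotparser import RobotFileParser

   with_slash = RobotFileParser()
   with_slash.parse(["User-agent: *", "Disallow: /21000/"])

   without_slash = RobotFileParser()
   without_slash.parse(["User-agent: *", "Disallow: /21000"])

   entry = "http://picolisp.com/21000"
   deeper = "http://picolisp.com/21000/!start?*Page=+2"

   print(with_slash.can_fetch("Googlebot", entry))      # True  -- entry page still crawlable
   print(with_slash.can_fetch("Googlebot", deeper))     # False
   print(without_slash.can_fetch("Googlebot", entry))   # False
   print(without_slash.can_fetch("Googlebot", deeper))  # False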


   User-Agent: *
   Disallow: /21000
or

   User-Agent: *
   Disallow: /


The vast majority of the time when I see a complaint like this about robots.txt, it's because the site has missed a character here or there, or because it's not putting the robots.txt file on the right hostname.

Google has a free robots.txt checker that lets you test your robots.txt files. Given a robots.txt file, you can enter specific URLs and check whether each URL would be blocked or not. Here's a link for more info on that free tool: http://www.google.com/support/webmasters/bin/answer.py?hl=en...


Another incorrect assumption in the article is that every bot with the Google UA originates from Google. There are plenty of other (often malicious) bots that simply copy Google's signature to make themselves less obvious. He needs to check the IP block to make sure the bot with the observed behavior was really from Google.
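
Google's documented way to verify this is a reverse DNS lookup on the requesting IP, a check that the name is under googlebot.com or google.com, and a forward lookup to confirm it resolves back to the same IP. A rough sketch (not what the article did):

   # Confirm a "Googlebot" IP really belongs to Google via reverse + forward DNS.
   import socket

   def is_real_googlebot(ip):
       try:
           host, _, _ = socket.gethostbyaddr(ip)   # e.g. crawl-66-249-71-203.googlebot.com
       except socket.herror:
           return False
       if not host.endswith((".googlebot.com", ".google.com")):
           return False
       try:
           # Forward-confirm: the name must resolve back to the same IP.
           return ip in socket.gethostbyname_ex(host)[2]
       except socket.gaierror:
           return False

   print(is_real_googlebot("66.249.71.203"))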


I just took a look at the last offending Googlebot IP and it seems to originate from Google.

http://www.ip-adress.com/ip_tracer/66.249.71.203


That reminds me of this page in which the author created a large binary tree of pages and watched how various crawlers walked the tree.

http://www.drunkmenworkhere.org/219
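
Building that kind of trap is simple; a hypothetical sketch (not the code behind that page) where page /n links to its children /2n and /2n+1, so a crawler that follows every link walks an effectively unbounded binary tree:

   # Serve an "infinite" binary tree of pages: node n links to 2n and 2n+1.
   from http.server import BaseHTTPRequestHandler, HTTPServer

   class TreeHandler(BaseHTTPRequestHandler):
       def do_GET(self):
           try:
               n = int(self.path.strip("/") or 1)
           except ValueError:
               n = 1
           body = ('<html><body><h1>Node %d</h1>'
                   '<a href="/%d">left</a> <a href="/%d">right</a>'
                   '</body></html>' % (n, 2 * n, 2 * n + 1))
           self.send_response(200)
           self.send_header("Content-Type", "text/html")
           self.end_headers()
           self.wfile.write(body.encode())

   if __name__ == "__main__":
       HTTPServer(("", 8000), TreeHandler).serve_forever()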


Very interesting experiment, and surprising that the Google bot appears (in this instance at least) to be ignoring robots.txt.


The problem is not Googlebot.

robots.txt on ticker.picolisp.com says "Disallow: /", but ticker.picolisp.com redirects to picolisp.com/21000, and the robots.txt on picolisp.com says "Disallow:". If he wants Googlebot to stop crawling those URLs, he needs to add "Disallow: /21000" to picolisp.com.
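
In other words, each hostname needs its own rules; assuming the setup described above, something like:

   # ticker.picolisp.com/robots.txt (already in place)
   User-Agent: *
   Disallow: /

   # picolisp.com/robots.txt (the missing piece)
   User-Agent: *
   Disallow: /21000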


Hmmmm, maybe he just did: http://picolisp.com/robots.txt


I don't think it's ignoring robots.txt; the old version is probably just cached on another thread/machine/datacenter. It will pick up the new robots.txt eventually. This behavior is well known.


But what's the point of it downloading the robots.txt file if it isn't going to honor it?


Presumably, one cluster (or whatever Google calls them) is honoring it, while a different cluster isn't.


Dumb bots are dumb



