I once made a similar (but less real-language-like) site to fool spambots on my now-defunct web consultancy's page (http://www.resolution.nl/food if you care). The idea was that a crawler scouring the internet for email addresses to spam would fill its DB with bogus addresses, after which, hopefully, the spammer would dump that day's results in annoyance, including our real email address. I never figured out whether that really worked, but it was fun to make.
What did work, however, was fooling a searchbot: the whole thing got me a very angry mail from a Dutch search engine team (ilse.nl) whose bot had been stuck on it for an entire day. I had no robots.txt (didn't even know what it was), which the team decided was a particularly nasty breach of netiquette.
It's mostly sufficient. /21000/ will not match "http://picolisp.com/21000", which is the first URL in the sequence, but the remaining URLs look like "http://picolisp.com/21000/!start?*Page=+2", so Googlebot will likely only continue to download a single page once it has re-read the robots.txt.
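You can sanity-check that with Python's stdlib urllib.robotparser. It only does plain prefix matching (Googlebot's own matcher additionally supports things like * wildcards and Allow precedence), but for this case it should behave the same way:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /21000/",   # note the trailing slash
    ])
    # The bare page is NOT blocked...
    print(rp.can_fetch("Googlebot", "http://picolisp.com/21000"))
    # => True
    # ...but the paginated URLs are:
    print(rp.can_fetch("Googlebot", "http://picolisp.com/21000/!start?*Page=+2"))
    # => False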
Which is what you deserve for using non-standard URL formats.
No, I'm saying /21000/ will match a path with a directory named /21000 but not a file named /21000.
When I say "non-standard", I mean that if the website's URLs looked like "/21000/foo" and "/21000/foo?page=2", it would have been easier to craft a "Disallow" rule that successfully blocked all of the desired pages.
The vast majority of the time when I see a complaint like this about robots.txt, it's because the site has missed a character here or there, or because it isn't putting the robots.txt file on the right hostname.
Google has a free robots.txt checker that lets you test your robots.txt files: given a robots.txt file, you can enter specific URLs and check whether each one would be blocked. Here's a link with more info on that free tool: http://www.google.com/support/webmasters/bin/answer.py?hl=en...
Another incorrect assumption in the article is that every bot with the Google UA actually originates from Google. There are plenty of other (often malicious) bots that simply copy Google's signature to make themselves less obvious. He needs to check the IP block to make sure the bot with the observed behavior was really from Google.
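Google's documented way to do that is a reverse DNS lookup on the requesting IP (it should resolve under googlebot.com or google.com), followed by a forward lookup to confirm the name maps back to the same IP. A rough sketch of that check, assuming a quick per-request test is all you need:

    import socket

    def is_real_googlebot(ip):
        # Reverse DNS: genuine Googlebot IPs resolve under googlebot.com / google.com
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise anyone could spoof the reverse record
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    # e.g. is_real_googlebot("66.249.66.1") -- an address from Google's crawl range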
robots.txt on ticker.picolisp.com says "Disallow: /", but ticker.picolisp.com redirects to picolisp.com/21000, and the robots.txt on picolisp.com says "Disallow:". If he wants Googlebot to stop crawling those URLs, he needs to add "Disallow: /21000" to picolisp.com.
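In other words, something along these lines in http://picolisp.com/robots.txt (a sketch only; he may want to scope it to Googlebot or merge it with whatever else is already in the file):

    User-agent: *
    Disallow: /21000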
I don't think it's ignoring robots.txt; the old file is probably just cached at another thread/machine/datacenter. It will pick up the new robots.txt eventually. This behavior is well known.