
> Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page.



The reason that bit is relevant is that a robots.txt file only applies to the host it is served from. Because each "page" lives on its own subdomain, the crawler has to fetch a separate robots.txt for every single page it requests.
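Concretely, before touching any page on a given host, a compliant crawler first has to request that host's robots.txt, so the fetch pattern across the farm looks roughly like this (subdomain names made up for illustration):

    https://a.web.sp.am/robots.txt   then https://a.web.sp.am/ (if allowed)
    https://b.web.sp.am/robots.txt   then https://b.web.sp.am/ (if allowed)
    ...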

What the poster was suggesting is blocking them at a higher level, e.g. a user-agent block in a .htaccess file or an IP block in iptables or similar. That would be a one-stop fix. It would also defeat the purpose of the website, however, which is to waste crawlers' time.
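As a sketch of the user-agent approach (assuming Apache with mod_rewrite enabled; adjust for whatever server is actually in front of the farm), something like this in .htaccess returns a 403 to anything identifying itself as GPTBot:

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule .* - [F]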


The real question is: how is GPTBot finding all the other subdomains? The sites currently disallow GPTBot: https://www.web.sp.am/robots.txt

If GPTBot is compliant with the robots.txt specification, then it can't fetch the page HTML and so can't find the links to the other subdomains.
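For reference, the standard way to disallow GPTBot across a whole site is the two-line stanza below; I haven't checked it against the file at that URL, but this is the form the crawler should be honoring:

    User-agent: GPTBot
    Disallow: /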

Either:

  1. GPTBot treats a disallow as a noindex but still requests the page itself. Note that Google doesn't treat a disallow as a noindex: they will still show your page in search results if they discover the link from other pages, but with a "No information is available for this page." disclaimer.
  2. The site didn't have a GPTBot disallow until the owner noticed the traffic spike, and the bot had already discovered a couple million links that need to be crawled.
  3. There is some other page out there on the internet that GPTBot discovered which links to millions of these subdomains. This seems possible, and the subdomains really don't have any way to prevent a bot from requesting millions of robots.txt files. The only prevention here is to firewall the bot's IP range (a sketch follows below) or to work with the bot owners to implement better subdomain handling.
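If firewalling is the route taken, the rule itself is trivial; the real work is keeping it in sync with whatever CIDR ranges the bot operator publishes. A sketch with a placeholder documentation range standing in for the real ones:

    # drop traffic from one of the crawler's published ranges (placeholder CIDR shown)
    iptables -A INPUT -s 192.0.2.0/24 -j DROP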
