A "honeypot" is a system designed to trap unsuspecting entrants. In this case, the website is designed to be found by web crawlers and to then trap them in never-ending linked sites that are all pointless. Other honeypots include things like servers with default passwords designed to be found by hackers so as to find the hackers.
What does trap mean here? I presumed crawlers had many (thousands or more) instances. One being 'trapped' on this web farm won't have any impact.
I would presume the crawlers have a queue-based architecture with thousands of workers. It’s an amplification attack.
When a worker gets a honeypot page, it crawls it, scrapes it, and finds X links on the page, where X is greater than 1. Those links get put on the crawl queue. Because there’s more than one link per page, each worker on the honeypot adds more links to the queue than it removed.
Other sites will eventually leave the queue, because they have a finite number of pages so the crawlers eventually have nothing new to queue.
Not on the honeypot. It has a virtually infinite number of pages. Scraping a page will almost deterministically increase the size of the queue (1 page removed, a dozen added per scrape). Because other sites eventually leave the queue, the queue eventually becomes just the honeypot.
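A toy simulation of that queue dynamic makes the point. The numbers are assumptions chosen for illustration: a finite 10,000-page ordinary site with two links per page, a trap with twelve links per page (the "dozen" above), and 20,000 dequeue steps:

```python
# Toy model of a crawl queue: one finite site plus one infinite trap.
# Assumed numbers for illustration only: 10,000 site pages, 12 links
# per trap page, 20,000 dequeue steps.
from collections import deque

SITE_PAGES = 10_000
TRAP_FANOUT = 12

queue = deque([("site", 0), ("trap", "r")])
seen = set(queue)

for _ in range(20_000):
    kind, page = queue.popleft()
    if kind == "site":
        # Finite site laid out as a binary tree: new pages run out.
        children = [("site", c) for c in (2 * page + 1, 2 * page + 2)
                    if c < SITE_PAGES]
    else:
        # Trap: every page links to TRAP_FANOUT never-seen pages, so
        # each dequeue is a net +11 to the queue.
        children = [("trap", f"{page}/{i}") for i in range(TRAP_FANOUT)]
    for child in children:
        if child not in seen:
            seen.add(child)
            queue.append(child)

trap_share = sum(k == "trap" for k, _ in queue) / len(queue)
print(f"queue length: {len(queue):,}, trap share: {trap_share:.1%}")
```

After 20,000 steps the trap's share of the queue comes out well above 99%, which is exactly the "queue eventually becomes just the honeypot" behaviour described above.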
OpenAI is big enough that this probably wasn’t their entire queue, but I wouldn’t be surprised if it was a whole-digit percentage of it. The author said 1.8M requests; I don’t know the duration, but spread over a single day that’s equivalent to roughly 20 QPS (1,800,000 requests / 86,400 seconds ≈ 21). Not a crazy amount, but not insignificant. It’s within the QPS Googlebot would send to a fairly large site like LinkedIn.
While the other comments are correct, I was alluding to a more subtle attack where you might try to indirectly influence the training of an LLM. Effectively, if OpenAI is crawling the open web for data to use for training, and they don't handle sites like this properly, their training dataset could be biased towards whatever content the site contains. Now, in this instance the website was clearly not set up to target an LLM, but model poisoning (e.g. to insert backdoors) is an active area of research at the intersection of ML and security. Consider, as a very simple example, the tokenizer of previous GPTs that was biased by Reddit data (as mentioned in other comments).
In this case there are >6bn pages with roughly zero value each. That could eat a substantial amount of crawl time. It's unlikely to entirely trap a crawler, but a dumb crawler (as is implied here) will crawl more and more of those pages, becoming very apparent to the operator of the honeypot (which is how it identifies new crawlers), and may take up a growing share of the crawl set.