The IETF AI Preferences working group is currently discussing whether to include an example of bypassing AI preferences to support assistive technologies. Oddly enough, many publishers oppose that.
Probably ignoring things like robots.txt, I'm guessing? But I'd be curious what exactly the list of things is, and if it's growing. Would it go as far as ChatGPT filling in CAPTCHAs?
autocomplete="off" is an instance of something that user agents willfully ignore based on their own heuristics, and I'm assuming accessibility tools have always ignored a lot of similar things.
That's not how search engines work. They have a good idea of which pages might be frequently updated. That's how "news search" works, and even small startup search engines like blekko had news search.
Indeed. My understanding is that crawl is a real expense at scale, so they optimize for "just enough" to catch most site update rhythms and then use other signals (like blog pings, or someone searching for a URL that's not yet crawled, etc.) to selectively chase fresher content.
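To make that concrete, here's a toy sketch of that kind of "just enough" scheduling. This is purely illustrative and not how any particular engine does it; real schedulers fold in far richer signals (sitemaps, pings, link popularity, query demand):

    # Toy adaptive recrawl interval: back off when a page hasn't changed,
    # tighten up when it has. Numbers and factors are arbitrary.
    from dataclasses import dataclass

    @dataclass
    class CrawlState:
        interval_hours: float = 24.0      # current recrawl interval
        min_hours: float = 1.0
        max_hours: float = 24.0 * 30

    def update_interval(state: CrawlState, content_changed: bool) -> CrawlState:
        if content_changed:
            # Page is livelier than we thought: recrawl sooner next time.
            state.interval_hours = max(state.min_hours, state.interval_hours / 2)
        else:
            # Page looks static: spend the crawl budget elsewhere.
            state.interval_hours = min(state.max_hours, state.interval_hours * 1.5)
        return state

    state = CrawlState()
    for changed in [False, False, True, False]:
        state = update_interval(state, changed)
        print(round(state.interval_hours, 1))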
My experience is that a news crawl is not a big expense at scale, but so far I've only built one and inherited one. BTW, no one uses blog pings; the latest hotness is IndexNow.
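For anyone curious, an IndexNow ping is just a small JSON POST. Something like the sketch below, where the host, key, and URLs are placeholders (the key is a token you also host at the keyLocation URL, and api.indexnow.org is the shared endpoint that relays to participating engines):

    import json
    from urllib.request import Request, urlopen

    # Placeholders: your site, your key, and the URLs that changed.
    payload = {
        "host": "www.example.com",
        "key": "your-indexnow-key",
        "keyLocation": "https://www.example.com/your-indexnow-key.txt",
        "urlList": [
            "https://www.example.com/posts/new-article",
            "https://www.example.com/posts/updated-article",
        ],
    }

    req = Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    # urlopen raises HTTPError on non-2xx responses; a 200/202 means accepted.
    with urlopen(req) as resp:
        print(resp.status)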
The Fastly report[1] has a couple of great quotes that mention Common Crawl's CCBot:
> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.
...
> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.
And there's one final number that isn't in the Fastly report but is in the El Reg article[2]:
> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.
One way that Cloudflare is gatekeeping is by declaring which bots are AI Bots. Common Crawl's CCBot is used for a lot of things -- it's an archive, and there are more than 10,000 research papers citing Common Crawl, mostly not about AI -- but Cloudflare deems CCBot to be an "AI Bot", and I suspect most website owners have no idea what the list of AI Bots is or how they were chosen.
It's a similar loophole to public libraries. When I was a kid, I read thousands of books from the library without paying anyone anything.
But as for the crawl loophole: CCBot obeys robots.txt, and it also preserves all robots.txt and REP (Robots Exclusion Protocol) signals, so that downstream users can find out whether a website intended to block them at crawl time.
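If you want to see what a given site tells CCBot, the stdlib robotparser is enough. In this sketch, example.com and the article path are placeholders; "CCBot" is the user-agent token Common Crawl advertises:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt, then ask whether CCBot may fetch a URL.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    print(rp.can_fetch("CCBot", "https://example.com/some/article"))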
Conventional crawlers already have a way to identify themselves: a published JSON file containing a list of their IP addresses. Cloudflare is fully aware of this de facto standard.
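Googlebot is the canonical example: Google publishes its ranges as JSON, and verifying a hit against them is a few lines. The URL and the ipv4Prefix/ipv6Prefix field names below are what Google publishes; other crawlers expose similar files, but assume the exact shape differs:

    # Verify a client IP against a crawler's published IP ranges.
    import json
    import ipaddress
    from urllib.request import urlopen

    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_networks(url):
        with urlopen(url) as resp:
            data = json.load(resp)
        nets = []
        for entry in data.get("prefixes", []):
            # Each entry carries either an ipv4Prefix or an ipv6Prefix (Google's format).
            prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if prefix:
                nets.append(ipaddress.ip_network(prefix, strict=False))
        return nets

    def is_official_crawler(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    networks = load_networks(RANGES_URL)
    print(is_official_crawler("66.249.66.1", networks))  # an address in Googlebot's usual range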
We can do much more with direct laser strikes: a laser would be able to blast through the dust and gas corona, especially if radar were aimed at it at the same time for precise timing and targeting.
We could get its size, velocity, shape and finer details of that shape, and its composition in many areas, reaching below the immediate surface.
We can't catch it, but we can certainly get a much better look.