The IETF AI Preferences working group is currently discussing whether to include an example of bypassing AI preferences to support assistive technologies. Oddly enough, many publishers oppose that.
Probably ignoring things like robots.txt, I'm guessing? But I'd be curious what exactly the list of things is, and if it's growing. Would it go as far as ChatGPT filling in CAPTCHAs?
autocomplete="off" is an instance of something that user agents willfully ignore based on their own heuristics, and I'm assuming accessibility tools have always ignored a lot of similar things.
That's not how search engines work. They have a good idea of which pages might be frequently updated. That's how "news search" works, and even small startup search engines like blekko had news search.
Indeed. My understanding is that crawl is a real expense at scale, so they optimize for "just enough" to catch most site update rhythms and then use other signals (like blog pings, or someone searching for a URL that's not yet crawled, etc.) to selectively chase fresher content.
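To make that concrete, here's a toy sketch of that kind of "just enough" scheduling. This is purely illustrative and not how any particular engine does it; real schedulers fold in far richer signals (sitemaps, pings, link popularity, query demand):

    # Toy adaptive recrawl interval: back off when a page hasn't changed,
    # tighten up when it has. Numbers and factors are arbitrary.
    from dataclasses import dataclass

    @dataclass
    class CrawlState:
        interval_hours: float = 24.0      # current recrawl interval
        min_hours: float = 1.0
        max_hours: float = 24.0 * 30

    def update_interval(state: CrawlState, content_changed: bool) -> CrawlState:
        if content_changed:
            # Page is livelier than we thought: recrawl sooner next time.
            state.interval_hours = max(state.min_hours, state.interval_hours / 2)
        else:
            # Page looks static: spend the crawl budget elsewhere.
            state.interval_hours = min(state.max_hours, state.interval_hours * 1.5)
        return state

    state = CrawlState()
    for changed in [False, False, True, False]:
        state = update_interval(state, changed)
        print(round(state.interval_hours, 1))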
My experience is that a news crawl is not a big expense at scale, but so far I've only built one and inherited one. BTW, no one uses blog pings; the latest hotness is IndexNow.
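For anyone curious, an IndexNow ping is just a small JSON POST. Something like the sketch below, where the host, key, and URLs are placeholders (the key is a token you also host at the keyLocation URL, and api.indexnow.org is the shared endpoint that relays to participating engines):

    import json
    from urllib.request import Request, urlopen

    # Placeholders: your site, your key, and the URLs that changed.
    payload = {
        "host": "www.example.com",
        "key": "your-indexnow-key",
        "keyLocation": "https://www.example.com/your-indexnow-key.txt",
        "urlList": [
            "https://www.example.com/posts/new-article",
            "https://www.example.com/posts/updated-article",
        ],
    }

    req = Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    # urlopen raises HTTPError on non-2xx responses; a 200/202 means accepted.
    with urlopen(req) as resp:
        print(resp.status)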
The Fastly report[1] has a couple of great quotes that mention Common Crawl's CCBot:
> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.
...
> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.
And there's one final number that isn't in the Fastly report but is in the El Reg article[2]:
> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.
One way that Cloudflare is gatekeeping is by declaring which bots are AI Bots. Common Crawl's CCBot is used for a lot of things -- it's an archive, and there are more than 10,000 research papers citing Common Crawl, mostly not about AI -- but Cloudflare deems CCBot to be an "AI Bot", and I suspect most website owners have no idea what the list of AI Bots is or how they were chosen.
It's a similar loophole to public libraries. When I was a kid, I read thousands of books from the library without paying anyone anything.
But as for the crawl loophole: CCBot obeys robots.txt, and it also preserves all robots.txt and REP (Robots Exclusion Protocol) signals, so that downstream users can find out whether a website intended to block them at crawl time.
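If you want to see what a given site tells CCBot, the stdlib robotparser is enough. In this sketch, example.com and the article path are placeholders; "CCBot" is the user-agent token Common Crawl advertises:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt, then ask whether CCBot may fetch a URL.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    print(rp.can_fetch("CCBot", "https://example.com/some/article"))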
Conventional crawlers already have a way to identify themselves: a published JSON file containing a list of their IP addresses. Cloudflare is fully aware of this de facto standard.
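Googlebot is the canonical example: Google publishes its ranges as JSON, and verifying a hit against them is a few lines. The URL and the ipv4Prefix/ipv6Prefix field names below are what Google publishes; other crawlers expose similar files, but assume the exact shape differs:

    # Verify a client IP against a crawler's published IP ranges.
    import json
    import ipaddress
    from urllib.request import urlopen

    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_networks(url):
        with urlopen(url) as resp:
            data = json.load(resp)
        nets = []
        for entry in data.get("prefixes", []):
            # Each entry carries either an ipv4Prefix or an ipv6Prefix (Google's format).
            prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if prefix:
                nets.append(ipaddress.ip_network(prefix, strict=False))
        return nets

    def is_official_crawler(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    networks = load_networks(RANGES_URL)
    print(is_official_crawler("66.249.66.1", networks))  # an address in Googlebot's usual range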
We can do much more with direct laser strikes: a laser would be able to blast through the dust and gas corona, especially if radar were aimed at it at the same time for precise timing and targeting.
We could get its size, velocity, shape and finer details of that shape, and its composition in many areas, reaching below the immediate surface.
We can't catch it, but we can certainly get a much better look.