Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link and revisiting all the links they've seen before, so their traffic scales differently.
I think you’re right. At first I thought “crawlers are actually generating huge numbers of spam requests”, but this is just the way a searchable web functions. The crawlers are just building the index of the internet.
Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?
We very much wanted this! We had people who were ex-Google and ex-Bing reach out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.
The best explanation I can come up with is that a site failing to notify them of a change makes the search engine look bad when its results are out of date. Especially if the failures are malicious, which fits the general theme of the article.
In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a shared standard, IndexNow, where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].
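For anyone curious, the protocol is simple enough that a ping fits in a few lines. A minimal Python sketch against the shared IndexNow endpoint; the host, key, and URL below are placeholders, and per the spec the key also has to be served as a text file on your site so the engine can verify ownership:

    import json
    import urllib.request

    # Placeholder host/key/URLs; the key file must also live on your site.
    payload = {
        "host": "www.example.com",
        "key": "0123456789abcdef",
        "keyLocation": "https://www.example.com/0123456789abcdef.txt",
        "urlList": ["https://www.example.com/some-updated-page"],
    }

    req = urllib.request.Request(
        "https://api.indexnow.org/indexnow",  # fans out to participating engines
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)  # 200/202 = notification accepted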
>The best explanation I can come up with is that a site failing to notify them of a change makes the search engine look bad when its results are out of date. Especially if the failures are malicious, which fits the general theme of the article.
Should be easy to cross-check the reliability of update notifications by doing a little bit of polling too.
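Something like: keep a content hash from the last crawl and occasionally re-fetch a random sample, flagging pages that changed without a ping. A rough Python sketch, where the URLs, hashes, and sample size are all made up:

    import hashlib
    import random
    import urllib.request

    # last_seen maps URL -> content hash from the previous crawl (made-up data)
    last_seen = {
        "https://www.example.com/a": "d41d8cd98f00b204e9800998ecf8427e",
        "https://www.example.com/b": "0cc175b9c0f1b6a831c399e269772661",
    }

    for url in random.sample(sorted(last_seen), min(2, len(last_seen))):
        with urllib.request.urlopen(url) as resp:
            digest = hashlib.md5(resp.read()).hexdigest()
        if digest != last_seen[url]:
            # Changed without a ping: this site's notifications are
            # unreliable, so keep crawling it on the normal schedule.
            print(f"{url} changed without a notification")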
You can have millions of static pages and serve them very inexpensively. Showing dynamic ads, on the other hand, fundamentally exposes an expensive computational resource without any rate limiting. If that were any other API or service it would be gated, but the assumption here is that this particular service will make more money than it loses, and that assumption obviously breaks down in this instance. I really don’t think you can say it’s about scale when the thing you’re scaling (serving ads to bots) doesn’t make any business sense.
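To make the "it would be gated" point concrete, this is the sort of per-client gate any other expensive endpoint would get. A token-bucket sketch in Python; the rate, the per-IP keying, and the function names are all illustrative, not anyone's real setup:

    import time

    class TokenBucket:
        def __init__(self, rate, capacity):
            self.rate = rate          # tokens refilled per second
            self.capacity = capacity  # burst ceiling
            self.tokens = capacity
            self.updated = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets = {}  # one bucket per client IP (illustrative keying)

    def serve_ad(client_ip):
        bucket = buckets.setdefault(client_ip, TokenBucket(rate=1.0, capacity=5))
        if not bucket.allow():
            return "cheap static fallback"  # skip the auction for over-limit callers
        return "expensive dynamic ad auction"

(Of course, as the reply below points out, degrading the page for suspected bots was exactly what they couldn't risk doing.)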
Leaving the ads in was a business necessity: it eliminated the documented risk of being delisted by Google for customizing content for its crawlers (what Google's guidelines treat as cloaking). The company would have gone out of business if that happened permanently, and even a few days of delisting would have meant millions in lost revenue.
I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.
I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it becomes identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability.