90% of traffic being crawlers seems (on the face of it) just absolutely batshit insane.



Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link and revisiting all the links they've seen before; their traffic scales differently.
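
Rough back-of-envelope, with made-up numbers, just to show how differently it scales:

    # All numbers are hypothetical; the point is that crawler traffic
    # scales with page count while human traffic scales with audience size.
    pages = 2_000_000      # pages on the site
    crawlers = 10          # distinct crawlers
    revisit_days = 7       # each crawler re-fetches every page weekly
    crawler_per_day = pages * crawlers / revisit_days   # ~2.9M requests/day

    visitors = 50_000      # daily human visitors
    pages_per_visit = 3    # humans only read what they came for
    human_per_day = visitors * pages_per_visit          # 150K requests/day

    print(crawler_per_day / (crawler_per_day + human_per_day))  # ~0.95

Numbers like those put you right around the ~90% mark from the article.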


I think you’re right. At first I thought “crawlers are actually creating large numbers of spam requests”, but this is just the way a searchable web functions. The crawlers are building the index of the internet.


Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?


We very much wanted this! We had ex-Google and ex-Bing people who reached out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.

The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.


In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a standard where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].

[1]: https://www.indexnow.org/

[2]: https://www.bing.com/indexnow

[3]: https://blogs.bing.com/webmaster/october-2021/IndexNow-Insta...
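
The protocol itself is tiny. A minimal sketch of a single-URL submission, going by the docs at indexnow.org (you also have to host your key at https://example.com/<key>.txt so the engines can verify you control the host; the key and URL below are placeholders):

    import urllib.parse
    import urllib.request

    KEY = "your-indexnow-key"                       # placeholder
    page = "https://example.com/page-that-changed"  # placeholder

    ping = "https://api.indexnow.org/indexnow?" + urllib.parse.urlencode(
        {"url": page, "key": KEY}
    )
    with urllib.request.urlopen(ping) as resp:
        print(resp.status)  # 200/202 means the submission was accepted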


>The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.

Should be easy to cross-check the reliability of update notifications by doing a little bit of polling too.
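
Something like this sketch, using conditional requests (illustrative only; ETag support varies from site to site):

    import urllib.error
    import urllib.request

    def changed(url: str, old_etag: str) -> bool:
        """True if the server no longer reports the old ETag."""
        req = urllib.request.Request(url, method="HEAD")
        req.add_header("If-None-Match", old_etag)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status != 304  # 304 Not Modified
        except urllib.error.HTTPError as e:
            return e.code != 304

Poll a random sample of URLs you were notified about, plus a control sample you weren't, and compare how often each set actually changed.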



Or you could just delete it, if your content isn't valuable enough that you'll pay to have it served once a week without ad-dollars to subsidize it.


You can have millions of static pages and serve them very inexpensively. Showing dynamic ads, by contrast, fundamentally exposes an expensive computational resource without any rate limiting. Any other API or service like that would be gated, but the assumption here is that this particular service will make more money than it loses, and that assumption obviously breaks down in this instance. I really don’t think you can say it’s about scale when what you’re scaling (serving ads to bots) doesn’t make any business sense.
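
Even something as simple as a per-client token bucket in front of the expensive path would do; a minimal sketch, with made-up limits:

    import time

    class TokenBucket:
        def __init__(self, rate: float, burst: float):
            self.rate, self.burst = rate, burst   # tokens/sec, max tokens
            self.tokens, self.last = burst, time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # refill proportionally to elapsed time, capped at burst
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # over budget: serve a cached page instead

    buckets: dict[str, TokenBucket] = {}  # keyed by client IP, say
    # buckets.setdefault(ip, TokenBucket(rate=1.0, burst=20)).allow()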


Leaving the ads in was a business necessity because it eliminated the documented risk of being delisted by Google for customizing content for their crawlers. The company would have gone out of business if that happened permanently. Even if it only happened for a few days, it would have meant millions in lost revenue.


seems like it'd make more sense to just send your HTML to dedicated ports.


Adds a lot of weight to the dead internet theory.


tbh a lot of this "theory" is common sense for internet natives.

"most of everything is shit" comes to mind, but "most email is spam" and "most web traffic is porn" are well known too.


can you elaborate?


I googled this: https://en.wikipedia.org/wiki/Dead_Internet_theory

Actually, everything we discussed here is the result of genuine human activity.


I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.

I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it is identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability.


We have some very long-tail content and experienced this in 2023, after all the VC-funded LLM start-ups tried to scrape every page ever.


The number I came up with, last time I looked into this, was that about 60% of page requests on any normal website are made by bots.
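
For anyone who wants to reproduce a number like that from their own logs, a crude first pass is user-agent matching (the log path and substrings here are assumptions, and real bot detection takes more than UA strings):

    import re

    BOT_HINTS = ("bot", "crawl", "spider", "slurp")

    total = bots = 0
    with open("access.log") as f:  # combined log format assumed
        for line in f:
            fields = re.findall(r'"([^"]*)"', line)
            ua = fields[-1].lower() if fields else ""  # UA is the last quoted field
            total += 1
            bots += any(h in ua for h in BOT_HINTS)

    if total:
        print(f"{bots / total:.0%} of {total} requests look like bots")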



