About 10 years ago, I was working on a site that served several hundred million non-crawler hits a month. Many of our millions of pages had their content change multiple times a day. Because of the popularity and frequent changes, the crawlers hit us constantly... crawlers accounted for ~90% of our traffic - billions of hits per month. Bing was ~70% of the crawler traffic and Google was ~25% of it. We noticed it because Bing quickly became very aggressive about crawling, exposing some of our scaling limits as they doubled our already significant traffic in a few short months.
I was working on the system that picked ads to show on our pages (we had our own internal ad system, doing targeting based on our own data). This was the most computationally intensive part of serving our pages and the ads were embedded directly in the HTML of the page. When we realized that 90% of our ad pick infrastructure was dedicated to feeding the crawlers, we immediately thought of turning ads off for them (we never billed advertisers for them anyway). But hiding the ads seemed to go directly against the spirit of Google's policy of showing their crawlers the same content.
Among other things, we ended up disabling almost all targeting and showing crawlers random ads that roughly fit the page. This dropped our ad pick infra costs by nearly 80%, saving six figures a month. It also let us take a step back to decide where we could make long-term investments in our infra rather than being overwhelmed with quick fixes to keep the crawlers fed.
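For illustration only, here is a minimal sketch of what "random ads that roughly fit the page" for crawlers could look like. None of this comes from the actual system: the names `CRAWLER_MARKERS`, `is_crawler`, `pick_ad`, `ads_by_category`, and `targeted_pick` are all hypothetical, and matching on the user-agent string is an assumption (a real deployment would likely verify crawlers more carefully).

```python
import random

# Hypothetical user-agent substrings used to recognize crawlers.
CRAWLER_MARKERS = ("googlebot", "bingbot")


def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent looks like a known crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def pick_ad(user_agent, page_category, ads_by_category, targeted_pick):
    """Pick an ad id; crawlers skip the expensive targeted path.

    ads_by_category: dict mapping a page category to a list of ad ids.
    targeted_pick:   callable standing in for the costly targeting system.
    """
    if is_crawler(user_agent):
        # Cheap path: any ad that roughly fits the page's category.
        return random.choice(ads_by_category[page_category])
    # Expensive path: full targeting, reserved for real users.
    return targeted_pick(page_category)
```

The page still contains an ad in the same slot either way, so crawlers and users get structurally the same content; only the cost of choosing the ad differs.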
This kind of thing is what people are missing when they wonder why a company needs more than a few engineers - after all, someone could duplicate the core functionality of the product in 100 lines of code. At sufficient scale, it takes real engineering just to handle the traffic from the crawlers so they can send you more users. There are an untold number of other things like this that have to be handled at scale, but that are hard to imagine if you haven't worked at similar scale.
Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link and revisiting all the links they've seen before, so their traffic scales differently.
I think you’re right. At first I thought “crawlers are actually creating large amounts of spam requests” but this is just the way a searchable web functions. The crawlers are just building the index of the internet.
Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?
We very much wanted this! We had people that were ex-Google and ex-Bing who reached out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.
The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.
In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a standard where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].
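The standard described here is, as far as I know, IndexNow. A rough sketch of building a bulk notification request is below; it only constructs the request and sends nothing. The endpoint URL and the JSON field names (`host`, `key`, `keyLocation`, `urlList`) are my recollection of the protocol, and the key-file location is an assumed default, so check the protocol documentation before relying on any of it. The function name `build_indexnow_request` is made up for this example.

```python
import json
import urllib.request

# Shared endpoint as I recall it; participating engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls):
    """Build (but do not send) a POST announcing changed URLs."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is served from the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Sending it would be a call to `urllib.request.urlopen(request)`; the key file is how the receiving engine checks that the notifier actually controls the host.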
>The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.
Should be easy to crosscheck the reliability of update notifications by doing a little bit of polling too.
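A minimal sketch of that cross-check, assuming the crawler keeps a content hash per URL and re-fetches a small random sample of pages that were not announced as changed. Everything here (`audit_notifications`, `known_hashes`, `notified`, `fetch`, the 1% default sample rate) is hypothetical.

```python
import hashlib
import random


def content_hash(body: bytes) -> str:
    """Fingerprint page content so changes can be detected cheaply."""
    return hashlib.sha256(body).hexdigest()


def audit_notifications(known_hashes, notified, fetch, sample_rate=0.01, rng=random):
    """Return sampled URLs whose content changed without a notification.

    known_hashes: dict of url -> hash recorded at the last crawl.
    notified:     set of urls the site reported as changed.
    fetch:        callable url -> bytes (the polling step).
    """
    missed = []
    for url, old_hash in known_hashes.items():
        # Skip announced pages, and poll only a random sample of the rest.
        if url in notified or rng.random() >= sample_rate:
            continue
        if content_hash(fetch(url)) != old_hash:
            missed.append(url)
    return missed
```

A site that keeps showing up in `missed` could be moved back to ordinary full crawling, which covers both accidental and malicious failures to notify.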
You can have millions of static pages and serve them very inexpensively. Showing dynamic ads is fundamentally exposing an expensive computational resource without any rate limiting. If that were any other API or service it would be gated, but the assumption here is that this particular service will make more money than it loses, and that obviously breaks down in this instance. I really don’t think you can say it’s about scale when what you’re scaling (serving ads to bots) doesn’t make any business sense.
Leaving the ads in was a business necessity because it eliminated the documented risk of being delisted by Google for customizing content for their crawlers. The company would have gone out of business if that happened permanently. Even if it only happened for a few days, it would have meant millions in lost revenue.
I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.
I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it is identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability.
I have never used Bing. I use duckduckgo though and they buy their results from Bing. At least they did in the past, I don't follow them closely enough to necessarily notice every possible change.
This seems very cannibalistic of their own business. That means somebody running Google or Microsoft (or really any web ads) only has a 10% chance to start with of getting served to an actual human (if they're not trying to block each other constantly).
And on the other side, that means every customer or ad placer has to try and filter all the bots so people with actual credit cards and money will see the Google, TEMU, or FB ads (or others).
In some ways, it almost feels like Microsoft is griefing online search by burying it under massive robot crawls. Like an ad DDOS.
They're serving first party targeted ads based on only their own data. If you're going to complain about that, it's close to saying that websites shouldn't be able to make money from advertising at all.
Very much this. It's a site/app that has probably been used by 80-90% of adults living in America over the last decade. It would not exist if these ads weren't targeted. I know because we knew (past tense because I'm no longer there) exactly how much targeting increased click-through-rate and how that affected revenue.
On top of that, they were ads for doing more of what the user was doing right then, tailored to tastes we'd seen them exhibit over time. Our goal was that the ads should be relevant enough that they served as an exploration mechanism within the site/app. We didn't always do as well as we hoped there, but it was a lot better than what you see on most of the internet. And far less intrusive because they weren't random (i.e., un-targeted). I have run ad blockers plus used whole house DNS ad blocking as long as I've been aware of them, but I was fine working on these ads because it felt to me like ads done right.
If we can't even allow for ads done right, then vast swaths of the internet have to be pay-walled or disappear. One consequence of that... only the rich get to use most of the internet. That's already too true as it is, I don't want to see it go further.
I have no problems with this (first party, targeted) as far as I can read English and understand.
In fact, one of my bigger problems has been that Google has served me generic ads that are so misplaced they go far into attempted insult territory (shady dating sites, pay-to-win "strategy games" etc).
> websites shouldn't be able to make money from advertising at all.
This is the case. Advertising is a scourge, psychological warfare waged by corporations against our minds and wallets. Advertisers have no moral qualms; they will exploit any psychological weakness to shill products, no matter how harmful. Find a "market" of teenagers with social issues? Show them ads of happy young people frolicking with friends to make them buy your carbonated sugar water; never mind that your product will rot their teeth and make them fat. Advertisers don't care about whether products are actually good for people, all they care about is successful shilling.
Advertising is warfare waged by corporations against people and pretending otherwise makes you vulnerable to it. To fight back effectively we must use adblockers and advocate for advertising bans. If your website cannot exist without targeted advertising, then it is better for it to not exist.
Think about what it would mean to not have any advertising whatsoever. Most current large brands would essentially be entrenched forever. No matter how good a new product or service is, it's going to be almost impossible to reach a sustainable scale through purely organic growth starting from zero. Advertising in some form is necessary for an economy to function.
The problem is, as was mentioned above by someone, all content has to be paid for. If there were no ads we wouldn’t have had TV and radio for the past few decades. 90% of the internet would disappear, and the only stuff left would be paywalled - i.e. only the rich could use the web.
I’m sure you try to avoid ads - I do too, they suck. But don’t pretend you don’t use a lot of websites that are paid for with ads.
The internet began in 1969 and by 1992 was by far the largest network of computers and had exactly zero ads and zero paywalls. (The US government imposed a rule against commercial use of the internet to appease private businesses that didn't want competition from the internet. The rule remained in force till 1992.)
Also, you're currently using a very large non-paywalled site with no ads.
So, no, ads are not needed to have a nice internet available to all.
I don’t think you’re being intellectually honest. I didn’t have access to the Internet in my home in 1992, and the rest of the world didn’t either. I did pay for and have access to Compuserve forums. There was very little content back then. Certainly no huge video sites where you can learn practically anything, and hardly any of the benefits we enjoy from being online today. If you loved the 1992 internet I can probably find an AOL disk to send you. And the fact that there is one ad-free site we are both using hardly means the rest of the sites wouldn’t disappear. YC is paid for by some rich folks who have made plenty of money that ultimately (though not exclusively) came from ads. Like it or not, ads are an economic necessity. If you have a better solution, start a company that gives away free, valuable content and prove it.
>I don’t think you’re being intellectually honest.
Do you think I'm outright telling falsehoods? Which part do you think is false: that the internet had many millions of users in 1992? That the internet pre-1993 was completely non-commercial with absolutely zero ads and no paywalls?
The 1992 internet had email, mailing lists, newsgroups, Internet Relay Chat, massively-multiplayer online games (called MUDs) and places (mostly using the "anonymous FTP" protocol) where you could download free software like Linux and GNU utilities.
>There was very little content back then.
The newsgroups were absolutely huge in 1992: if you spent all day every day reading newsgroups, you could keep up with less than 1% of it. The same could be said of Internet Relay Chat and probably also of mailing lists (though I didn't subscribe to enough mailing lists to say that with 100% confidence).
Just because you never had access to it in 1992 does not mean that it is irrelevant to the topic of our conversation. AOL users had limited access to the Internet in 1992. They could, for example, send email (I think) to non-AOL users over the Internet, and 1992 is, I think, the year that they gained access to the newsgroups (including, famously, the ability to post to the newsgroups). But if in 1992 all you knew was Compuserve and AOL, you didn't know the Internet.
And again, one of the few rules of the internet (imposed again by the US government, which was footing the bill) was no commercial use. So for example there was a newsgroup called ba.jobs (the "ba" stood for "bay area") where employers could advertise job openings and employees could make posts announcing their availability for a job. But contractors (i.e., 1099 workers as opposed to W2 workers) were prohibited from making such a post because that was considered too commercial (in that an individual contractor is a lot like a small business and for such a contractor to use the internet to announce his availability was too much like a small business posting an ad).
>I didn’t have access to the Internet in my home in 1992, and the rest of the world didn’t either.
In 1992, most users of the internet got their access from their employer or their school of higher education. You could've bought access for $20 a month in 1992; it's just that the Internet was not being advertised, so you didn't know about it. (Also, if you were living in a rural area, you might've had to pay your telephone company long-distance charges for every minute you were connected.)
Actually, it is not just that the internet was not being advertised: the people running it actively discouraged journalists from writing about it because there was a senator named William Proxmire who was good at getting the press to repeat his accusations of wasteful government spending, and the internet was an easy target for Proxmire. There were, for example, academics of every department using the newsgroups to discuss ideas, and Proxmire could say (truthfully, but misleadingly) that the US government was spending taxpayer money so that professors could discuss <pick the most ridiculous things academics might discuss>. (Here's an example of a journalist losing his access to the internet in 1984 in part because he wouldn't stop writing about the internet (then called ARPANET): https://www.stormtiger.org/bob/humor/pournell/story.html)
So you see there was an availability bias at play in which advertising is loud and designed to get attention (of course) and it tends to drown out information that is not part of the advertising-dependent information-ecosystem. (And again, the people in charge of the infrastructure of the internet pre-1993 were even actively striving to avoid any publicity.) Particularly, hardly anyone knows nowadays that many millions of users were using the completely-noncommercial internet of 1969 - 1992. People tend to think that the internet was created in 1993 or that advertising-dependent companies were essential to its creation.
I don’t think you’re taking scale into account. Millions of internet users then vs billions now makes a difference. Generous hobbyists and some universities paid for those services back then. The “massive” in MUD was a few thousand simultaneous players, with mostly text and maybe limited graphics. I very much doubt any of them could/would have paid if their usage went up by 10,000 times, with the higher quality and expectations that we have today. Again, I challenge you to come up with a service for a hundred million people that is open to everyone and doesn’t require ads. I hate ads too - I’ll join your service if you can make it work.
Just for reference, I was there too. I started with a shiny 300 baud modem. To compare the old days to today and say they’re even comparable in terms of information, media, knowledge, access, gaming, entertainment … it’s not even close.
Earlier you wrote that "I did pay for and have access to Compuserve forums", and that "if you loved the 1992 internet I can probably find an AOL disk to send you".
Could you clarify whether you had direct access to the internet (the newsgroups, email, ftp sites, web sites, not mediated by AOL or Compuserve) before mid-1993? Also, if yes, how many hours did you spend on it? I ask because I would be surprised to learn that it is possible for someone with your opinions to have had extensive experience with the internet pre-1993 (and I go looking for surprises).
I remember seeing Spyglass and using NCSA Mosaic at work and school, and Compuserve from home. There was definitely stuff out there: I downloaded images, a song or two and some programs. I saw a very early version of (I think?) Windows 95 (or 3.1?) that could play different videos in different windows and was amazed (these were from disk, not the web). I used to be a sysadmin for a NetWare network.
It was a really fun time. But the breadth of what we have now more than dwarfs what existed then. It’s not surprising - that was 30 years ago. I don’t see any way to get from there to here without a ton of money being spent. Some of it was spent by governments and individuals, but I’m guessing the bulk was by companies. Economic realities require those companies to get something for their investments - they’re not charities. Advertising is the major vehicle for that investment. I’ll bet we’d find radio and TV followed a similar historical trajectory.
I use uBlock and avoid ads because they’re irritating (and I feel like a hypocrite for doing it). I hate going to recipe sites for all the garbage you have to wade through to get to the recipe. So I get it. The web, at current scale, doesn’t and can’t exist outside of economic realities. Micro transactions might have been the solution but they weren’t. Kagi has a great model (happy customer here), but not everyone can afford to subscribe to everything.
> “if you all dropped dead”, “you smarmy parasitic prick”
Dude. I hope you’re just having a bad day. If this is your normal mode of discourse you should get some counseling. I say this as advice offered in good will.
What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)? If paid search is the only option left, is it okay that poor people can't use the web? Is it okay if poor people don't get access to news?
Oh, and they don't get to vote because voting day and locations can't be advertised by the government, especially in targeted mailings that are personalized with your party affiliation and location. The US Postal Service will also collapse, so those mailings can't go out, even if allowed. At least the rich can still search for their polling location on the web [<- sarcasm].
None of that is okay with me. More/better regulation? Yes! But our world doesn't know how to function without ads. Being absolute about banning ads is unrealistic and takes focus away from achieving better regulation, thereby playing into the hands of the worst advertisers.
> What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)?
Not my problem. Those companies, and any other with business models reliant on advertising, don't have a right to exist. If your business can't be profitable without child labor, your business has no right to exist. This is no different.