> The spam site was checking for Googlebot IP addresses. If the visitor’s IP address matched as belonging to Google then the spam page displayed content to Googlebot.
>
> All other visitors got a redirect to other domains that displayed sketchy content.
Years ago, Google had an explicit policy that sites that showed different content to Googlebot than they showed to regular unauthenticated users were not allowed, and they got heavily penalized. This policy is long gone, but it would help here (assuming the automated tooling to enforce it was any good, and I assume it was).
More recently, Google seems totally okay with sites that show content to Googlebot but go out of their way not to show that content to regular users.
About 10 years ago, I was working on a site that served several hundred million non-crawler hits a month. Many of our millions of pages had their content change multiple times a day. Because of the popularity and frequent changes, the crawlers hit us constantly... crawlers accounted for ~90% of our traffic - billions of hits per month. Bing was ~70% of the crawler traffic and Google was ~25% of it. We noticed it because Bing quickly became very aggressive about crawling, exposing some of our scaling limits as they doubled our already significant traffic in a few short months.
I was working on the system that picked ads to show on our pages (we had our own internal ad system, doing targeting based on our own data). This was the most computationally intensive part of serving our pages and the ads were embedded directly in the HTML of the page. When we realized that 90% of our ad pick infrastructure was dedicated to feeding the crawlers, we immediately thought of turning ads off for them (we never billed advertisers for them anyway). But hiding the ads seemed to go directly against the spirit of Google's policy of showing their crawlers the same content.
Among other things, we ended up disabling almost all targeting and showing crawlers random ads that roughly fit the page. This dropped our ad pick infra costs by nearly 80%, saving 6-figures a month. It also let us take a step back to decide where we could make long term investments in our infra rather than being overwhelmed with quick fixes to keep the crawlers fed.
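For anyone curious what that looks like in practice, the cheap path is essentially just a branch on a crawler check before the expensive targeting call. A minimal sketch in Python (all names are hypothetical, not the actual system described above):

    import random

    # Hypothetical sketch: known crawlers skip the expensive targeting engine
    # and get a cheap random pick that roughly fits the page topic.
    CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp", "duckduckbot")

    def is_crawler(user_agent):
        ua = (user_agent or "").lower()
        return any(token in ua for token in CRAWLER_TOKENS)

    def pick_ads(request, page, ads_by_topic, targeting_engine, n=3):
        if is_crawler(request.headers.get("User-Agent", "")):
            # Cheap path: random ads that roughly fit the page, no per-user targeting.
            candidates = ads_by_topic.get(page.topic, [])
            return random.sample(candidates, min(n, len(candidates)))
        # Expensive path: full targeting for real users.
        return targeting_engine.pick(user=request.user, page=page, n=n)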
This kind of thing is what people are missing when they wonder why a company needs more than a few engineers - after all, someone could duplicate the core functionality of the product in 100 lines of code. At sufficient scale, it takes real engineering just to handle the traffic from the crawlers so they can send you more users. There are an untold number of other things like this that have to be handled at scale, but that are hard to imagine if you haven't worked at similar scale.
Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link, revisiting all the links they've seen before, their traffic scales differently.
I think you’re right. At first I thought “crawlers are actually creating large amounts of spam requests” but this is just the way a searchable web functions. The crawlers are just building the index of the internet.
Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?
We very much wanted this! We had people that were ex-Google and ex-Bing who reached out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.
The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.
In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a standard where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].
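For reference, the standard being described is IndexNow. A rough sketch of a change notification against the shared endpoint (assuming you already host the matching key file at your site root; the host, key, and URLs here are placeholders):

    import requests

    # One POST tells the participating engines which URLs on this host changed.
    payload = {
        "host": "www.example.com",
        "key": "your-indexnow-key",  # must match the key file served from the site
        "keyLocation": "https://www.example.com/your-indexnow-key.txt",
        "urlList": ["https://www.example.com/some-page-that-changed"],
    }
    resp = requests.post("https://api.indexnow.org/indexnow", json=payload, timeout=10)
    print(resp.status_code)  # 200/202 means the notification was accepted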
>The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.
Should be easy to crosscheck the reliability of update notifications by doing a little bit of polling too.
You can have millions of static pages and serve them very inexpensively. Showing dynamic ads is fundamentally exposing an expensive computational resource without any rate limiting. If that was any other API or service it would be gated but the assumption here is that this particular service will make more money than lose and it obviously breaks down in this instance. I really don’t think you can say it’s about scale when what you’re scaling (serving ads to bots) doesn’t make any business sense.
Leaving the ads in was a business necessity because it eliminated the documented risk of being delisted by Google for customizing content for their crawlers. The company would have gone out of business if that happened permanently. Even if it only happened for a few days, it would have meant millions in lost revenue.
I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.
I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it is identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability
I have never used Bing. I use duckduckgo though and they buy their results from Bing. At least they did in the past, I don't follow them closely enough to necessarily notice every possible change.
This seems very cannibalistic of their own business. That means somebody running Google or Microsoft (or really any web ads) only has a 10% chance to start with of getting served to an actual human (if they're not trying to block each other constantly).
And on the other side, that means every customer or ad placer, has to try and filter all the bots so people with actual credit cards and money will see the Google, TEMU, or FB ads (or others).
In some ways, almost feels like Microsoft is griefing online search by burying it under massive robot crawls. Like an ad DDOS.
They're serving first party targeted ads based on only their own data. If you're going to complain about that, it's close to saying that websites shouldn't be able to make money from advertising at all.
Very much this. It's a site/app that has probably been used by 80-90% of adults living in America over the last decade. It would not exist if these ads weren't targeted. I know because we knew (past tense because I'm no longer there) exactly how much targeting increased click-through-rate and how that affected revenue.
On top of that, they were ads for doing more of what the user was doing right then, tailored to tastes we'd seen them exhibit over time. Our goal was that the ads should be relevant enough that they served as an exploration mechanism within the site/app. We didn't always do as well as we hoped there, but it was a lot better than what you see on most of the internet. And far less intrusive because they weren't random (i.e., un-targeted). I have run ad blockers plus used whole house DNS ad blocking as long as I've been aware of them, but I was fine working on these ads because it felt to me like ads done right.
If we can't even allow for ads done right, then vast swaths of the internet have to be pay-walled or disappear. One consequence of that... only the rich get to use most of the internet. That's already too true as it is, I don't want to see it go further.
I have no problems with this (first party, targeted), as far as I can read and understand it.
In fact one of my bigger problems has been that Google has served me generic ads that are so misplaced they go far into attempted-insult territory (shady dating sites, pay-to-win "strategy games", etc).
> websites shouldn't be able to make money from advertising at all.
This is the case. Advertising is a scourge, psychological warfare waged by corporations against our minds and wallets. Advertisers have no moral qualms; they will exploit any psychological weakness to shill products, no matter how harmful. Find a "market" of teenagers with social issues? Show them ads of happy young people frolicking with friends to make them buy your carbonated sugar water; never mind that your product will rot their teeth and make them fat. Advertisers don't care about whether products are actually good for people, all they care about is successful shilling.
Advertising is warfare waged by corporations against people and pretending otherwise makes you vulnerable to it. To fight back effectively we must use adblockers and advocate for advertising bans. If your website cannot exist without targeted advertising, then it is better for it to not exist.
Think about what it would mean to not have any advertising whatsoever. Most current large brands would essentially be entrenched forever. No matter how good a new product or service is, it's going to be almost impossible to reach a sustainable scale through purely organic growth starting from zero. Advertising in some form is necessary for an economy to function.
The problem is, as was mentioned above by someone, all content has to be paid for. If there were no ads we wouldn’t have had TV and radio for the past few decades. 90% of the internet would disappear, and the only stuff left would be paywalled - i.e. only the rich could use the web.
I’m sure you try to avoid ads - I do too, they suck. But don’t pretend you don’t use a lot of websites that are not paid for with ads.
The internet began in 1969 and by 1992 was by far the largest network of computers and had exactly zero ads and zero paywalls. (The US government imposed a rule against commercial use of the internet to appease private businesses that didn't want competition from the internet. The rule remained in force till 1992.)
Also, you're currently using a very large non-paywalled site with no ads.
So, no, ads are not needed to have a nice internet available to all.
I don’t think you’re being intellectually honest. I didn’t have access to the Internet in my home in 1992, and the rest of the world didn’t either. I did pay for and have access to Compuserve forums. There was very little content back then. Certainly no huge video sites where you can learn practically anything, or hardly any of the good benefits we enjoy from being online today. If you loved the 1992 internet I can probably find an AOL disk to send you. And just because there is one ad free site we are both using hardly means the rest of the sites wouldn’t somehow disappear. YC is paid for by some rich folks who have made plenty of money that ultimately (though not exclusively) came from ads. Like it or not, ads are an economic necessity. If you have a better solution start a company that gives away free, valuable content and prove it.
>I don’t think you’re being intellectually honest.
Do you think I'm outright telling falsehoods? Which part do you think is false: that the internet had many millions of users in 1992? That the internet pre-1993 was completely non-commercial with absolutely zero ads and no paywalls?
The 1992 internet had email, mailing lists, newsgroups, Internet Relay Chat, massively-multiplayer online games (called MUDs) and places (mostly using the "anonymous FTP" protocol) where you could download free software like Linux and GNU utilities.
>There was very little content back then.
The newsgroups were absolutely huge in 1992: if you spent all day every day reading newsgroups, you could keep up with less than 1% of them. The same could be said of Internet Relay Chat and probably also of mailing lists (though I didn't subscribe to enough mailing lists to say that with 100% confidence).
Just because you never had access to it in 1992 does not mean that it is irrelevant to the topic of our conversation. AOL users had limited access to the Internet in 1992. They could, I think, send email to non-AOL users over the Internet, and 1992 I think is the year they gained access to the newsgroups (including, famously, the ability to post to them). But if in 1992 all you knew was Compuserve and AOL, you didn't know the Internet.
And again, one of the few rules of the internet (imposed again by the US government, which was footing the bill) was no commercial use. So for example there was a newsgroup called ba.jobs (the "ba" stood for "bay area") where employers could advertise job openings and employees could make posts announcing their availability for a job. But contractors (i.e., 1099 workers as opposed to W2 workers) were prohibited from making such a post because that was considered too commercial (in that an individual contractor is a lot like a small business and for such a contractor to use the internet to announce his availability was too much like a small business posting an ad).
>I didn’t have access to the Internet in my home in 1992, and the rest of the world didn’t either.
In 1992, most users of the internet got their access from their employer or their school of higher education. You could've bought access for $20 a month in 1992; it's just that the Internet was not being advertised, so you didn't know about it. (Also, if you were living in a rural area, you might've had to pay your telephone company long-distance charges for every minute you were connected.)
Actually, it is not just that the internet was not being advertised: the people running it actively discouraged journalists from writing about it. There was a senator named William Proxmire who was good at getting the press to repeat his accusations of wasteful government spending, and the internet was an easy target for him: there were, for example, academics of every department using the newsgroups to discuss ideas, and Proxmire could say (truthfully, but misleadingly) that the US government was spending taxpayer money so that professors could discuss <pick the most ridiculous things academics might discuss>. (Here's an example of a journalist losing his access to the internet in 1984 in part because he wouldn't stop writing about the internet (then called ARPANET): https://www.stormtiger.org/bob/humor/pournell/story.html)
So you see there was an availability bias at play in which advertising is loud and designed to get attention (of course) and it tends to drown out information that is not part of the advertising-dependent information-ecosystem. (And again, the people in charge of the infrastructure of the internet pre-1993 were even actively striving to avoid any publicity.) Particularly, hardly anyone knows nowadays that many millions of users were using the completely-noncommercial internet of 1969 - 1992. People tend to think that the internet was created in 1993 or that advertising-dependent companies were essential to its creation.
I don’t think you’re taking scale into account. Millions of internet users then vs billions now makes a difference. Generous hobbyists and some universities paid for those services back then. The “massive” in MUD was a few thousand simultaneous players, with mostly text and maybe limited graphics. I very much doubt any of them could/would have paid if their usage went up by 10,000 times, with the higher quality and expectations that we have today. Again, I challenge you to come up with a service for a hundred million people that is open to everyone and doesn’t require ads. I hate ads too - I’ll join your service if you can make it work.
Just for reference, I was there too. I started with a shiny 300 baud modem. To compare the old days to today and say they’re even comparable in terms of information, media, knowledge, access, gaming, entertainment … it’s not even close.
Earlier you wrote that "I did pay for and have access to Compuserve forums", and that "if you loved the 1992 internet I can probably find an AOL disk to send you".
Could you clarify whether you had direct access to the internet (the newsgroups, email, ftp sites, web sites, not mediated by AOL or Compuserv) before mid-1993? Also, if yes, how many hours did you spend on it? I ask because I would be surprised to learn that it is possible for someone with your opinions to have had extensive experience with the internet pre-1993 (and I go looking for surprises).
I remember seeing spyglass and using NCSA mosaic at work and school, and Compuserve from home. There was definitely stuff out there, I downloaded images, a song or two and some programs. I saw a very early version of (I think?) Windows 95 (or 3.1?) that could play different videos in different windows and was amazed (these were from disk, not the web). Used a sysadmin for a Netware network.
It was a really fun time. But the breadth of what we have now more than dwarfs what existed then. It’s not surprising - that was 30 years ago. I don’t see any way to get from there to here without a ton of money being spent. Some of it was spent by governments and individuals, but I’m guessing the bulk was by companies. Economic realities require those companies to get something for their investments - they’re not charities. Advertising is the major vehicle for that investment. I’ll bet we’d find radio and TV followed a similar historical trajectory.
I use uBlock and avoid ads because they’re irritating (and I feel like a hypocrite for doing it). I hate going to recipe sites for all the garbage you have to wade through to get to the recipe. So I get it. The web, at current scale, doesn’t and can’t exist outside of economic realities. Micro transactions might have been the solution, but they weren’t. Kagi has a great model (happy customer here), but everyone can’t afford to subscribe to everything.
> “if you all dropped dead”, “you smarmy parasitic prick”
Dude. I hope you’re just having a bad day. If this is your normal mode of discourse you should get some counseling. I say this from a place of goodwill.
What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)? If paid search is the only option left, is it okay that poor people can't use the web? Is it okay if poor people don't get access to news?
Oh, and they don't get to vote because voting day and locations can't be advertised by the government, especially in targeted mailings that are personalized with your party affiliation and location. The US Postal Service will also collapse, so those mailings can't go out, even if allowed. At least the rich can still search for their polling location on the web [<- sarcasm].
None of that is okay with me. More/better regulation? Yes! But our world doesn't know how to function without ads. Being absolute about banning ads is unrealistic and takes focus away from achieving better regulation, thereby playing into the hands of the worst advertisers.
> What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)?
Not my problem. Those companies, and any other with business models reliant on advertising, don't have a right to exist. If your business can't be profitable without child labor, your business has no right to exist. This is no different.
That 'policy' is still actually in effect, I believe, in Google's webmaster guidelines. They just don't enforce it.
Years ago (early 2000s) Google used to mostly crawl using Google-owned IPs, but they'd occasionally use Comcast or some other ISPs (partners) to crawl. If you were IP cloaking, you'd have to look out for those pesky non-Google IPs. I know, as I used to play that IP cloaking game back in the early 2000s, mostly using scripts from a service called "IP Delivery".
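For what it's worth, Google's long-documented way for site owners to verify a genuine Googlebot (as opposed to something just spoofing the user agent) is a reverse DNS lookup on the IP followed by a forward lookup to confirm. A rough sketch:

    import socket

    def is_verified_googlebot(ip):
        """Reverse-DNS the IP, check the hostname, then forward-resolve to confirm."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)  # e.g. crawl-66-249-66-1.googlebot.com
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(hostname)[2]  # must map back to the same IP
        except (socket.herror, socket.gaierror):
            return False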
Is it even well defined? On the one hand, there’s “cloaking,” which is forbidden. On the other hand, there’s “gating,” which is allowed, and seems to frequently consist of showing all manner of spammy stuff and requests for personal information in lieu of the indexed content. Are these really clearly different?
And then there’s whatever Pinterest does, which seems awfully like cloaking or bait-and-switch or something: you get a high ranked image search result, you click it, and the page you see is in no way relevant to the search or related to the image thumbnail you clicked.
For context, my team wrote scripts to automate catching spam at scale.
Long story short, there are non spam-related reasons why one would want to have their website show different content to their users and to a bot. Say, adult content in countries where adult content is illegal. Or political views, in a similar context.
For this reason, most automated actions aren't built upon a single potential spam signal. I don't want to give too much detail, but here's a totally fictitious example for you:
* Having a website associated with keywords like "cheap" or "flash sale" isn't bad per se. But that might be seen as a first red flag
* Now having those aforementioned keywords, plus "Cartier" or "Vuitton" would be another red flag
* Add to this the fact that we see that this website changed owners recently, and used to SERP for different keywords, and that's another flag
=> 3 red flags, that's enough for some automation rule to me.
Again, this is a totally fictitious example, and in reality things are much more complex than this (plus I don't even think I understood or was exposed to all the ins and outs of spam detection while working there).
But cloaking on its own is kind of a risky space, as you'd get way too many false positives.
Do you have any example searches for the Pinterest results you're describing? I feel like I know what you're talking about but wondering what searches return this.
As the founder of SEO4Ajax, I can assure you that this is far from the case. Googlebot, for example, still has great difficulty indexing dynamically generated JavaScript content on the client side.
I think they did this because lots of publishers show paywalls to people but still want their content indexed by Google. In other words, they want to have their cake and eat it too!
You'd think they could make fine money as neutral brokers, since everyone served their ads, and for a long period they did make money as semi-neutral brokers. But since, IDK, 2019 they have become more and more garbage. This is broadly part of the concentration of wealth and power you see everywhere else; I don't know the specifics, but you can see the result.
Sure I have my viewpoint. But I'm also genuinely interested in your viewpoint.
My viewpoint is that I don't buy the idea that there is a group (or groups) of people who have both the means (money) and ideas they made up themselves, and who use the money to push those ideas onto the passive masses, who are then brainwashed by these rich people.
I think the masses produce the ideas. Those ideas are then selected and amplified by all sorts of people leveraging all sorts of means driven by all sorts of motives.
In fact there are plenty of examples of populist leaders who are not rich. The fact that the US has the cult of the millionaire sometimes obfuscates that; for some reason, for populist leaders in the US to rise they have to be millionaires (or pretend to be) to begin with.
My point is that, sure, the moneyed class does play a role, but reality is much more complex than that and I don't really buy the idea that the world is "controlled" by a bunch of "supermen" who are both incredibly wealthy and also incredibly intelligent and play 4d chess.
I'm not sure you believe that, that's why I wanted to ask a question instead of implying anything for your position. But since you asked.
> I think the masses produce the ideas. Those ideas are then selected and amplified by all sorts of people leveraging all sorts of means driven by all sorts of motives.
> My point is that, sure, the moneyed class does play a role, but reality is much more complex than that and I don't really buy the idea that the world is "controlled" by a bunch of "supermen" who are both incredibly wealthy and also incredibly intelligent and play 4d chess
These don't contradict what I said at all. You are arguing with a straw man.
I'm willing to answer your questions, but I just didn't understand that last one. Anyway it sounds like we are probably in agreement. I recognize the world to be complex and that there are many parties with different interests. My point only was that Google is willing to support narrow and even inaccurate narratives at the behest of those willing to pay them lots of money.
That's not what I'm saying. My intent is not to defend rich people. Yes obviously most of them don't spend their time controlling the media but instead spend time showing off on their yachts.
My point is something else: I don't but the idea that there are two factions, the rich and the poor, and that all rich people have the same interests and thus are allies and that all poor people have the same interests and are allied (or so they should).
Sure, this view is partially grounded in realty and that's why Marx did come up with it and it's why it stuck to this day as sensible to so many people.
But I don't think it's true. I think it oversimplifies reality to the point that a spherical cow in comparison is anatomically accurate.
But it's worse than just being wrong. It actively stifles conversation. Any attempt to have a nuanced conversation about these topics ultimately resolves in an accusation of "you're defending the rich, just admit it". That's what turns a idea into an ideology. Ideologies are ideas with built-in self-defense mechanisms.
I wonder if Google trains its AI on paywalled data, that other scrapers don’t have access to but which those paywalled sites give full access for the Google bot to.
The thing that annoys me most is that sites are allowed to use the http referrer from Google to see what you're searching for.
That + spam sites spamming as many keywords as they can just mean whatever you search for 95% of the sites are spam after the first page.
Idk why we've let the Internet get like this. There's gotta be a way to sign off on real/trusted content. That's certainly not ssl certs. Could probably crowd source the legitimacy rating of a site or something.
That's another reason why people flock to the big names, reddit, youtube, etc. It's like McDonald's, people know that what they get this time will be exactly what they got before.
> More recently, Google seems totally okay with sites that show content to Googlebot but go out of their way not to show that content to regular users.
See also, pages behind Red Hat and Oracle tech support paywalls.
I’ve switched to Kagi a couple of months ago. Every once in a while I struggle to get good search results, but then I check Google and it’s not any better there. It’s not always the greatest at promoting the sites I like, but I’ve already started boosting and pinning various domains to tailor the results to my own preference.
Still using a lot of other google stuff including gmail and maps. Just not search anymore.
But also, for the past few months, I’ve completely stopped searching the internet. ChatGPT-4 does the job way more effectively and I don’t see why I would go back to searching the internet (assuming the chatgpt experience doesn’t get nerfed in some way).
I've been using ChatGPT and Perplexity with GPT4 as my replacement for DDG and Google. I never thought the day would come when I thought Google Search was going to be superseded. It's crazy times to me.
If some company makes a less sanitized but equally capable version of GPT4 then I could see it replacing Google.
But for now you must keep your searches to a very limited, sanitized, corporate, non-copyright infringing, non-adult set of knowledge. And it's impossible to know what that will be beforehand, which makes using the tools very frustrating.
For example, try searching for anything medical related. Even if you're clear that you not looking for medical advice, you're just looking for info, it won't give it (sorry, as an AI I can't give medical advice). I imagine this is very frustrating for medical students.
And yes, I'm sure that I could coerce it into responding. Pretend your my grandmother telling me about her old medical recipes or some such. But that's still too annoying to do as anything except for testing the boundaries of the tool.
Simply asking it, “What would you tell a medical student…” works around this. I can’t imagine how or why they’re bothering with this when a simple disclaimer would probably protect them legally. WebMD seems to do just fine.
Yeah but my point is that it's not a complete search tool while it makes you jump through hoops of indirectly asking "hypothetical" questions.
It needs to directly answer all the questions that are given to it. If it must finish out the post with a disclaimer like "I am not a doctor/lawyer, this is not medical/legal advice" or whatever, that's fine.
ChatGPT is like living in an information tunnel. It’s amazing but it doesn’t replace search for me at all. Irony is when it does search, because it’s obviously just doing a quick crawl it actually makes things worse as it treats whatever shit it finds as authoritative - which, as anyone working on RAG knows, is a whole world of problems on its own.
This is absolutely not to say that Google can be considered ‘good’ these days.
Yeah they merged browsing into ChatGPT and it worsened the experience. I dread seeing the browsing icon show up to a question as the ai will get dumber.
Exactly the same for me, I've found answers derived from chat GPT training to be way more useful than the browser search answers. Half the time it doesn't work, it ads a lot of waiting time and it provides answers way less comprehensive. I have used the custom "classic" GPT bot when I have wanted to avoid Bing search answers.
I've addressed this by adding a custom instruction to only search when necessary, when it can't give a good answer otherwise. Pretty much cuts the searching down to when I explicitly ask it to do so.
Same for me. GPT-3.5 is good enough and pretty fast for most basic queries. GPT-4 is great at more detailed queries.
Any time I do end up going to Google, I’m so disappointed by the search results that I just leave. The only thing its good for now is searching site:reddit.com
I've been using kagi for almost a year, I think. Before that it was startpage or DDG.
I also have the same experience where kagi doesn't find something I think it should, so I go try google. Holy hell is google bad. Shockingly bad. I genuinely can't believe how bad it is now compared to 10 years ago.
I have essentially given up on Google as a search engine; I use it as bookmark search for bookmarks I consistently fail to add to my list. 99% of the time I google something for a specific URL from a domain I know (wikipedia, arxiv. etc). The other 1% was Google searches, but now I just append :reddit.com from the get go, bringing down my genuine "Google search"es to approximately zero.
even a couple years ago. There is soooooo much AI generate trash nowadays like https://www.oggyboggydoop.com on the first page. How can google's SEO filtering be so bad?
Kagi is designed to show you what you ask for, and not for showing you the ads you're most likely to fall for. It simply takes your query and returns results matching it. That's really it. It's sad that "does what you ask" is a defining feature, but that's what kagi is.
They have advanced features like "lenses" that bias your results toward a specific topic like programming, research, forums. It also lets you add weight to certain domains. For example, I have Pinterest and Facebook totally blacklisted, and some small sites boosted in my results.
They also support advanced query syntax with double quotes, +/-, and other operators.
Which is to say, kagi is the standard for "search engine that works". Nobody else sells a search engine that just works and does what you tell it to, which is literally all I want out of a search engine. It's mundane and unsexy, but it works, it doesn't advertise at me, and it lets me get my work done faster.
I've trialed Kagi but it's not given me the best results, especially not in my native language or related to my local area. I still prefer to use Google or even DDG.
I’ve noticed an increasing number of sites that provide an insane amount of text to answer a simple question.
They almost always follow some sort of structure where the actual answer to your question is all the way at the bottom of the page.
Most of the content appears relevant on the surface, but when you actually read it, it’s completely generic junk. Stuff that a high schooler would use to fluff an essay to hit a word limit.
Blame Google. A few years back, they decided that “topical authority” was important, and a page that targeted as many keywords as possible was “better”. A bunch of SEOs published studies showing how pages with 2,000+ keywords ranked higher, and then the floodgates opened with every company fluffing up their pages with 2000 words of BS just to appeal to Google.
Many don't even have an answer. They simply conclude that they don't know, after having spent pointless paragraphs on filler. "Well there you have it" is a common expression I see.
Lately a lot of these seem to be actually generated with LLMs. You can usually tell from the high school essay structure, often ending with a paragraph along the lines of "in conclusion there are many benefits but also many drawbacks".
After 20 years of being taken for granted that search engines as we know them are equipped to solve the typical problems we throw at them, I wonder if the whole concept of an unsupervised web crawl as the input to a single purpose search engine will just die out.
When I think about my typical web queries across the past year or two, it seems more and more likely that I'd be better off replacing Google with several purpose-built systems, none of which search the "entire web" (whatever that even means anymore). Technical queries? Just search StackOverflow and Github directly. Searching for a local venue of any kind? Search against a dedicated places database where new entries have to pass at least a cursory scrutiny. (Arguably Google Maps or Yelp already serve this purpose today, but I'm not sure if they have enough vetting today). Medical question? Search across a few sites known to be trustworthy.
We have become accustomed to go to Google because it's more convenient to type in a movie title, "chinese restaurant philadelphia", "flights to miami 4/12/24" or "Error code 127 python" into the same single place, but something tells me we'd be better off if that one place made some LLM-assisted guesses of what kind of search it is, and then went to a specialized search that is curated. If we go back toward the DMOZ/Yahoo model of directories that humans curate, I wonder if we could even reverse the trend toward spam and clickbait that has been so lamented in recent years.
For me search would be greatly improved if I could selectively exclude entire domains when I come across them. I want to be able to, with one click, remove GeeksForGeeks from all my search results — forever. And then I want to be able to continue to add to that (once called) "black list".
Never, ever show me Pinterest when I do an image search.
I imagine my search results would improve quickly in short order.
Better still, aggregate those lists from all users and you can improve search for users that have not yet built up a black list.
On the surface this is a good idea, however this would turn wildly anticompetitive. Whether or not your site would have business on the web would be entirely dictated by whether or not you woul be correctly classified (or indeed classified at all) in this engine. if you wanted to start your own stackoverflow competitor for whatever reason, you would have a very hard time getting any traction. this is also true of current general purpose engines, but you still do stand a chance to be referenced well and hit high enough to still get traffic.
the yahoo model collapsed for this very reason. back when you went to more than 5 websites to look at screenshots of the other 4, the directories would not necessarly show you the latest thing, because it wasn't on the list of sites manually added to each directory.
i think the current problem with google isn't to do with spam. i think google has become complacent because their ads are on all the sites anyway, so the function of "maximize revenue per search" doesn't actually care if you find what you're looking for, because you will get shown google ads anyway, and will be coming back to google anyway. in fact, they probably get to show more ads by feeding you bad results, because then you're loading more pages. this didn't used to be the case when google search was on top of spam sites, but it doesn't feel like they're doing anymore algo updates to curb the current trend, and spam sites have caught on to what ranks higher in the results.
> if you wanted to start your own stackoverflow competitor for whatever reason, you would have a very hard time getting any traction. this is also true of current general purpose engines, but you still do stand a chance to be referenced well and hit high enough to still get traffic
Hmm... You started to backpedal but then persisted. In the today world, your SO competitor would have that (slim) chance to rank if you started getting links from sites like HN or from people on Twitter who matter and know about tech. This would give you some PageRank and then you'd start possibly ranking in Google (in theory. In reality, no you probably wouldn't rank for anything since you're competing with 1,000,000 spam sites including whole verbatim clones of every page on SO that Google can't even get under control)
If any directory would be worth using, it would be run by humans who would HAVE to look at each submission. They could also look at who's linking to it, and evaluate "Is this a backlink from like, a gibberish page on `prawns-01-blork.info` or from like, Joel Spolsky's Twitter account?" Yes, it would take a lot of work, but like, it would be creating a truly useful product that people might pay for. And we have examples of other professions where "just rubber stamp everyone who pays" is frowned upon, like building inspectors and journalists. It's a hard problem, but it's far from hopeless.
By limiting web search results to "a few known sites", you'd be expediting the death of parts of it.
The beauty of search engines (in theory) is that you can find something NEW. Keeping the "open web" out would just entrench and ossify the current players.
A directory wouldn't be there to exclude anyone actually producing content of worth. It would serve as a gatekeeper to keep out plagiarism, spam, and utter trash. And people could create networks of sites based on their own real-world webs of trust, which vouch for one another.
Personally I'd rather see a standard which allowed you to add as many directories as you wanted to what your search engine would metasearch across. This also avoids the political problem of "who decides what's trash" -- if you want to add a directory whose main deal is they'll add literally any site, you could. If you want to only add directories which don't allow any <insert hated party> leaning content, you could do that.
For me it started when they began to monkey with the search entries and "second guess" what I was searching for. My guess is that they lost track of unvarnished human interaction and things snowball from there. That's just my hunch. People gave up trying to actually rely on it. We've all learned that Google doesn't care.
Basically it used to be optimized to be a sharp knife but now it's optimized to be a safety knife.
IMO that particular aspect is what made Google stand out to the general audience. Most people don’t search by keywords, or optimize their queries to hint the engine. They just type a question and assume Google will make sense of what they meant. Google usually does that very well, and that high level understanding is not compatible with our keyword query expectations. To us that feels broken, but to the majority of people it’s working as designed.
Instead I think search results have been getting worse for everyone because of SEO. Companies want to optimize for number of ads viewed, not quality of content, thus quality goes down in favor of clickbait and keyword stuffing.
I don’t think there’s much Google can do here to resolve this issue. It’ll always be a game of cat and mouse between Google and companies using SEO to push more ads for less money.
Can we attribute the decline in quality to the decline in informative content proportional to garbage content?
Garbage content seems to be making massive gains year on year, while informative or high quality content has stagnated or even declined from data decay
I think part of the problem is that Google has a recency bias. Newer, more spammy, sources get priority over older ones - even if the older ones are of higher quality.
I would have thought it well known by Search Engine Journal that at intervals Google implements changes that in one way or another influence what kinds of sites are rated in which way... And that, at times of change, this may lead to very substantial changes in ranking, sometimes letting a lot of "irrelevant"/"lower quality" (...all this is subjective to some extent) results flow to the top for certain queries, even for a prolonged period of time... Back in the day these algo updates were quite the thing to monitor and discuss on certain SEO-related sites...
That said I only comment out of casual interest as I stopped using Google more than a decade ago.
This has nothing to do with changes in the algorithm. I've been in search for 20+ years, so I'm quite familiar with how Google works. ;) My article explains why it's likely happening.
TL/DR is that spammers are likely exploiting two loopholes.
1. Longtail keywords are low competition and may trigger different algorithms.
2. Some/many of the search queries the spam ranks for trigger the more permissive Local Search algorithm
Plus there are other reasons why those sites are getting through, which are discussed in detail in the article.
Yes. If you roll over the author at the top of the article, there is a link to various other social media profiles where the author has used this username.
What do you use now for all the various things? I cut google out for most things but still use their search from time to time (and of course have to use google docs but that's because the people I'm collaborating with are on google)
Startpage is a good wrapper for Google if you care about privacy. (Or it was, I haven't checked in many years)
DDG is worse than google, IMO. Bing works, I guess, but I trust Microsoft almost as much as I trust google.
At this point I've given up on Google. If I can't find it in kagi after a bit of effort, I'll either work around it or ask a person I know in the given field.
God, the internet sucks so much now. I miss the early 2000s :(
DDG has been much more aggressive about replacing my query with whatever their algorithm decides I meant. They support no advanced query operators, not even double quotes.
A search engine that ignores my query and shows me something else is not super valuable to me
Google has become unusable in many circumstances, and it rewards spam. It claims it doesn't, but it does, a lot of SEO strategies now revolve around spamming the search engine with articles, pages, etc. Not for useful content, but for linkbacks, internal linking, etc. It is especially bad for geo-specific SEO strategies, where you're trying to have different page sets for different regions. Basically, how it was when Google first started and was easily gamed. Now people are spinning up 100s of pages and articles using AI and just spamming it. It has gotten bad, but the worst part is that you have to do it now in order to compete for keywords.
Yeah. Ten years ago Google was fighting those strategies involving content farms and abusive SEO with things like the Panda update, etc. Now it seems they don't care anymore. Since low-quality SEO ranks over legitimate sites, it forces those sites to pay for advertisement. This is very sad.
Google has given up on organic search. I am pretty good at that SEO stuff, and I can't use Google anymore because I can fathom really fast why a page ranks the way it does. And none of it has to do with accurate, valuable information.
Google is a marketplace, and they let most "engaging" results that adhere to a certain content structure win.
By now most paid results offer more value for the users than the organic results. Because that's what Google wants. Click the paid, ignore the crap.
Yeah that's not how I feel today when I searched for a local buffet by name and accidentally tapped the first, sponsored result which was Golden Corral, yuck. Definitely not "more value" to me
I stopped using google and switched to Bing about a year ago after they started doing more with ChatGPT. For the most part I’m much happier with how it presents what I’m looking for. It’s not perfect, but when I compare to google the few times I’ve been frustrated, it’s not any better and has to do with the topic, not search engine.
I’ve been a DDG user for years now, so I guess a bunch of my results come from Bing.
I don’t generally compare to Google, so I can’t say for sure that the results are ‘as good’, but my experience sure as shit is better.
I search for a thing and I get a page with links. Usually the thing I want is in the first page.
Sometimes it isn’t, or I’m searching for something that I know is recent that Google probably has a later version of, so I just add the !g to the search and there I am at Google.
It’s great. It works. It’s not stressful or horrible or annoying. I recommend it.
You are playing up the noise and playing down the signal. Search, Gmail, and YouTube are far from spam. There are obviously many scenarios/URLs that contain spam, but all 3 of those products are overwhelmingly useful.
When "SEO" first became a trendy buzzword decades ago, I immediately thought "these are fraudulent efforts to artificially appear more relevant than you actually are to the search algo" - and then it became a multi billion dollar industry.
Wow, that's pretty impressive. I didn't experience a "Code Red" when I was at Google but this would certainly qualify. Bing is not affected so it is definitely something to do with rank injection. I am really really interested in the post mortem if it sees the light of day, although if it reveals exploits for ranking it will probably not be made public.
Well, the text is HTML entities escaping ASCII text. "People being end with a person. Like everything in OUNASS ProMo CODE Onas OuNaS oNass cOuPOn DiscOUnt NoON SiVvl non toyou NaMshi"
Here Sivvi, toyou, Namshi, noon, and OUNASS are all brands of shopping websites and you can see their logos in the image.
Clearly this is some sort of keyword spam, though it's hard to tell more than that from your screenshot. It's also not clear why they'd bother to use HTML entities... a bug in the spam code? Or perhaps exploiting some parser differential between different twitter systems? Who can say.
So this is pure speculation, but more people should be aware of parser differentials (same thing as that email thing the other day) so let me say what I mean...
Hypothetically say a website has an internal service to index posts for keywords for search, that just so happens to unescape HTML entities during keyword normalization due to a seemingly harmless bug.
Plus a second internal service to identify keyword spam that _doesn't_ do any HTML entity unescaping (because why would you?)
Then you could end up in a situation where a spammer uses HTML entities to avoid spam detection while still showing up in search results. They hope that the user ignores the nonsense text and just clicks their link based on the image (a list of big shopping brands in the middle east) instead.
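A toy illustration of that kind of differential, purely hypothetical, just to show the shape of the bug:

    import html

    BANNED_PHRASES = {"promo code", "coupon", "discount"}

    def spam_score(text):
        """Hypothetical spam checker: looks at the raw post text, never unescapes."""
        lowered = text.lower()
        return sum(phrase in lowered for phrase in BANNED_PHRASES)

    def index_keywords(text):
        """Hypothetical indexer: 'harmlessly' unescapes HTML entities while normalizing."""
        return set(html.unescape(text).lower().split())

    post = "OUNASS &#80;&#114;&#111;&#77;&#111; &#67;&#79;&#68;&#69;"  # "ProMo CODE" as entities

    print(spam_score(post))      # 0 -> sails right past the spam check
    print(index_keywords(post))  # {'ounass', 'promo', 'code'} -> still ends up searchable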
And right now I just tried a query from the article above, "Alabama Casinos", on Google Maps, and sure enough I see "Bovada casino" (an offshore, illegal-in-the-US casino) affiliate links in the third spot in the "locations" list.
Recently almost every YouTube ad I get is a poorly made deepfake of elon musk telling me about an amazing secret investment opportunity. Google is losing the plot
Most ads I see on YouTube are very dubious and poorly made. A reasonable amount are plain fraud. It is very frustrating going to "ad center" and reading about how the advertiser did not verify his identity.
All I get is the same damn advertisement for the Internet provider I already have. You’d think they’d be able to say “Google don’t send this ad to our customers.”
I wouldn’t be surprised if there was some kind of “annoyance” metric for the ad selection algorithm. The idea being: show a certain demographic the same annoying ad again and again, and they might end up paying just to get rid of ads. I’m pretty sure Spotify does this.
A couple of years ago I created a new gmail with a very long address. Didn't tell anybody, didn't sign up for ANYTHING. Just parked it and let it sit dormant. Within a DAY I had my first spam.
Google is an ad company and it's entirely reasonable to assume they are simply selling lists of their own email addresses. Same with Yahoo! Mail! Which! I! Ran! The! Same! Experiment! From! With! The! Same! Results!
Ample, but because it was one of the gmails I could create with an invitation generated by my own account (this was before you needed ID or anything but a link, BTW), it would reveal my real-life name, which is my "mother" gmail address. I am one of those assholes who got firstname.lastname@gmail.com as their address. I had a guy offer me $15,000 for my gmail.
Google should be prioritizing sites which have fewer backlinks as it's proof that they didn't cheat and are likely higher quality. Or just random ranking; only require a certain low threshold of backlinks to establish baseline relevance.
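Stated as a ranking rule, the "low threshold, then shuffle" version would look something like this (obviously a toy, not how any real engine ranks):

    import random

    MIN_BACKLINKS = 5  # just enough to establish baseline relevance

    def rank(results):
        """Keep pages over a low backlink threshold, then order them randomly
        instead of rewarding whoever accumulated the most links."""
        eligible = [r for r in results if r["backlinks"] >= MIN_BACKLINKS]
        random.shuffle(eligible)
        return eligible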
I mean, when we get public posts like this one where someone boasts how they’re spamming to get hits through SEO — https://x.com/jakezward/status/1728032634037567509, I can only imagine it’s happening in larger quantities behind the scenes.
Did they really close the 'loophole' though? Jake said on Twitter they were actually hit with a manual penalty. So doesn't seem like it's patched. Seems like if he didn't boast about it they'd still be doing well.
I’m actually curious, how did Google do that? The guy who did it did it in a very obvious way, but I’m assuming you could just schedule a lot of posts that would drop once a day, make the AI use different language structures and change the underlying AI model in general (e.g. switch between OpenAI, Mistral and whatever), and slow-drip submit the posts. How would Google know they’re “mass generated”?
The original poster of that Tweet (Jake) admitted they got a manual penalty. Also, clearly Google didn't fix it because if not Google wouldn't be 'overwhelmed' with this current spam attack going on. If you look into the attack it's mass generated absolute spam garbage pages on hundreds if not thousands of separate domains. So it is definitely not fixed.
The fact that Google puts so much focus on content makes it sound like the Google algorithm has been corrupted. It's almost like a secret society where you need to do a secret handshake in order to get access. In this case you need to know the right word patterns to use in your content to get access to high traffic.
YouTube search is problematic. I very often get videos at the top that are auto-generated spam with AI narration and random clips (often completely irrelevant to the subject). There are videos I know of that aren't shown even when I search with the title and channel name.
Google wants to be too relevant, to the point it's unusable.
No? Decreased search quality will reduce the number of searches people are doing, which reduces the amount of ads Google can show. Decreased searches can come not only from people switching to another search engine, but also from people using the web less. Why look up something on the web when you can watch TikTok for a few hours?
> No? Decreased search quality will reduce the number of searches people are doing, which reduces the amount of ads Google can show.
I’m willing to bet that >90% of Google Search users aren’t even aware that alternatives exist.
They are not suddenly going to stop using Google Search. There might even be a significant short-term increase in usage, for example, if they really need to go to page 5 to find the first relevant result.
That is very far from my experience with non-tech people. My friends and students always google something, and almost inevitably fail to find it. Then they do it again, and again, and again.
For over two years now I've watched two specific people deterministically fail to find stuff through G search, and yet they still start with it in an almost Pavlovian ritual. Sometimes they will desperately scroll down, click on a blatantly irrelevant result, and switch to facebook or some other bookmark aggregator of theirs to continue their brutally inefficient search process, thumb flick after thumb flick after thumb flick.
You're right, but you're thinking long term sustainability. That's not Google's current goal, they'd rather squeeze out as much profit as possible in a short term, then current shareholders can sell their stock to idiots who become bag holders.
I realized the other day that I haven't Googled anything in over a month. And I guess even then, and before Google search was thoroughly enshittified, it was mostly just a convenient way to get to StackOverflow via my search bar, but now it's just ChatGPT for most quick queries I have.
For responsible scientists and researchers who've disclosed to Google how this is being used to execute large-scale phishing attacks, only to see Google opt not to fix it, this news warms our hearts.
For a bunch of super smart people they sure run a shite service. I've received the same spam multiple times in the last few days. Marked as spam every single time. And each time they failed to identify the exact same message as spam.
It's incredible how brain dead Google have become.
I opened the article from a computer in the school library that didn’t have ad-block. It nearly crashed the system. I seem to remember times when SEJ was reputable; I cannot even describe what this is.
When I googled redactle the first site was some fan site. The second was one of your sites (anybrowser). The third was Reddit which linked directly to your sites.
Then came a pair of unlimited sites. Finally your .net site, so not terribly deep, and it also came in third in a way. All in all it's not ideal, but not 'all spammy advertising sites'.
There is (browser?) malware out there that hijacks the google search results to show junk instead of what you are actually looking for. At one point I got bitten by this, it only sometimes replaced the results so it was a bit subtle and took me a while to pin down.
I wonder how many people on HN are infected by such malware and don't realize it? A lot of the complaints about search results are clearly not this, but when someone complains of outright spam for reasonable queries, I do wonder...
Weird, last week when I googled it, all I got was listicles about "the best order to read WoT" and other blogspam. Just googled it now and it seems fine.
I took a series of screenshots but they all seem fine to me?
Wikipedia, Goodreads, Fandom.com and MacMillan Publishing; these all seem to be reasonable results. I could share the whole page if I could find a place to upload my screenshots (RIP imgur)
> Google's been 98% spam for well over a year now.
I find this hard to believe. How do you even measure for this?
I'd love to see a few more examples of searches you are making that show spam, because the example you gave provided me with the appropriate results. I almost suspect you are either being disingenuous or just have some malware on your computer.
"This search engine I've gone out of my way to not track my search, viewing or other habits and usage is showing me irrelevant ads! Fucking trash!"
They'll complain at the thought of paying for YT premium ("the internet should be free bro! Except my new SaaS calendar app, of course"), pirate Factorio, pay for kagi. A real eclectic bunch.
I see this all the time these days and always wonder how many people making these comments have their own monetized apps, or have done any work on monetization, etc.