Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know who is running large internet infrastructure has a similar story -- this post does a great job of rounding a bunch of them up in one place.
I called it when I wrote it: they are just burning their goodwill to the ground.
I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.
AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.
Yep. And it is much more far reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need good will when they own everything?
"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.
> Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet.
And this is why AI training is not "fair use". The AI companies seek to train models in order to compete with the authors of the content used to train the models.
A possible eventual downfall of AI is that the risk of losing a copyright infringement lawsuit is not going away. If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
I've pointed this out to a few people in this space. They tend to suggest that the value in AI is so great this means we should get rid of copyright law entirely.
That value is only great if it's shared equitably with the rest of the planet.
If it's owned by a few, as it is right now, it's an existential threat to the life, liberty, and pursuit of a happiness of everyone else on the planet.
We should be seriously considering what we're going to do in response to that threat if something doesn't change soon.
Yep. The "wouldn't it be great if we had robots do all the labor you are currently doing" argument only works if there is some plan to make sure that my rent gets paid other than me performing labor.
It depends if you're the only one out of a job. If it really is everyone then the answer will likely be some variant of metaphorically or literally killing your landlord in favor of a different resource allocation scheme. I put these kinds of things in a "in that world I would have bigger problems" bucket.
And that's the ultimate fail of capitalist ethics - the notion that we must all work just so we can survive. Look at how many shitty and utterly useless jobs exist just so people can be employed on them to survive.
This has to change somehow.
"Machines will do everything and we'll just reap the profits" is a vision that techno-millenialists are repeating since the beginnings of the Industrial Revolution, but we haven't seen that happening anywhere.
For some strange reason, technological progress always seems to be accompanied by an increase in human labor. We're already past the 8-hour, 5-day norm and things are only getting worse.
> And that's the ultimate fail of capitalist ethics - the notion that we must all work just so we can survive. Look at how many shitty and utterly useless jobs exist just so people can be employed on them to survive.
This isn't a consequence of capitalism. The notion of having to work to survive - assuming you aren't a fan of slavery - is baked into things at a much more fundamental level. And lots of people don't work, and are paid by a welfare state funded by capitalism-generated taxes.
> "Machines will do everything and we'll just reap the profits" is a vision that techno-millenialists are repeating since the beginnings of the Industrial Revolution, but we haven't seen that happening anywhere.
They were wrong, but the work is still there to do. You haven't come up with the utopian plan you're comparing this to.
> For some strange reason, technological progress always seems to be accompanied by an increase in human labor.
No, it doesn't. What happens is that not enough people are needed to do a job any more, so they go find another job. No one was opening barista-staffed coffee shops on every corner back when 30% of the world was doing agricultural labour.
Yes, it is. The fact we have welfare isn't a refutation of that, it's proof. The welfare is a bandaid over the fundamental flaws of capitalism. A purely capitalist system is so evil, it is unthinkable. Those people currently on welfare should, in a free labor market, die and rot in the street. We, collectively, decided that's not a good idea and went against that.
That's why the labor market, and truly all our markets, are not free. Free markets suck major ass. We all know it. Six year olds have no business being in coal mines, no matter how much the invisible hand demands it.
You have a very different definition of free than I do. Free to me means that people enter into agreements voluntarily. It's hard to claim a market is free when its participants have no other choice...
You are correct, but the real problem is that copyright needs complete reform.
Let's not forget the basis:
> [The Congress shall have Power . . . ] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Is our current implementation of copyright promoting the progress of science and useful arts?
Or will science and the useful arts be accelerated by culling back the current cruft of copyright laws?
For example, imagine if copyright were non-transferable and did not permit exclusive licensing agreements.
AI is going to implode within 2 years. Once it starts ingesting its own output as training data it is going to be at best capped at its current capability and at worst even more hallucinatory and worthless.
The mistake you make here is to forget that the training data of the original models was also _full_ of errors and biases — and yet they still produced coherent and useful output. LLM training seems to be incredibly resilient to noise in the training set.
That's a talking point for bros looking to exploit it as their ticket.
"The upside of my gambit is so great for the world, that I should be able to consume everyone else's resources for free. I promise to be a benevolent ruler."
That's not how conservatism works. AI oligarchs are part of the "in" group in the "there are laws that protect but do not bind the in group, and laws that bind but do not protect the out group" summary. Anyone with a net worth less than FOTUS is part of the "out" group.
AI is worthless without training data. If all content becomes AI generated because AI outcompetes original content then there will be no data left to train on.
When Google first came out in 1998, it was amazing, spooky how good it was. Then people figured out how to game pagerank and Google's accuracy cratered.
AI is now in a similar bubble period. Throwing out all of copyright law just for the benefit of a few oligarchs would be utter foolishness. Given who is in power right now I'm sure that prospect will find a few friends, but I think the odds of it actually happening before the bubble bursts are pretty small.
Are we not past critical mass though? The velocity at which these things can outcompete human labor is astonishing; any future human creation or original content will already have lost the battle the moment it goes online and gets cloned by AI.
OK. To be clear, that wasn't about the OP, but rather the alleged people promoting the abolition of copyright... which would significantly hurt open source.
The people agitating for such things are usually leeches who want everything free and do, in fact, hold an infantile worldview that doesn't consider how necessary remuneration is to whatever it is they want so badly (media pirates being another example).
Not that I haven't "pirated" media, but this is usually the result of it not being available for purchase or my already having purchased it.
I'm curious what will happen when someone modifies a single byte (or a "sufficient" number of bytes) of AI output, thereby creating a derivative work, and then claims copyright on that modified work.
> The AI companies seek to train models in order to compete with the authors of the content used to train the models.
When I read someone else’s essay I may intend to write essays like that author. When I read someone else’s code I may intend to write code like that author.
AI training is no different from any other training.
> If a court determines that the AI output you've used is close enough to be considered a derivative work, it's infringement.
Do you mean the output of the AI training process (the model), or the output of the AI model? If the former then yes: if a model actually contains within it copies of data, it's a copy of that work.
But we should all be very wary of any argument that the mere ability to create a new work identical to a previous work makes something derivative. A painter may be able to copy van Gogh, but neither the painter's brain nor his non-copy paintings (even those in the style of van Gogh) are copies of van Gogh's work.
If you as an individual recognizably regurgitate the essay you read, then you have infringed. If an AI model recognizably regurgitates the essay it trained on then it has infringed. The AI argument that passing original content through an algorithm insulates the output from claims of infringement because of "fair use" is pigwash.
> If an AI model recognizably regurgitates the essay it trained on then it has infringed.
I completely agree — that’s why I explicitly wrote ‘non-copy paintings’ in my example.
> The AI argument that passing original content through an algorithm insulates the output from claims of infringement because of "fair use" is pigwash.
Sure, but the argument that training an AI on content is necessarily infringement is equally pigwash. So long as the resulting model does not contain copies, it is not infringement; and so long as it does not produce a copy, it is not infringement.
> So long as the resulting model does not contain copies, it is not infringement
That's not true.
The article specifically deals with training by scraping sites. That does necessarily involve producing a copy from the server to the machine(s) doing the scraping & training. If the TOS of the site incorporates robots.txt or otherwise denies a license for such activity, it is arguably infringement. Sourcehut's TOS for example specifically denies the use of automated tools to obtain information for profit.
I'm curious how this can be applied with the inevitable combinatorial exhaustion that will happen with musical aspects such as melody, chord progression, and rhythm.
Will it mean longer and longer clips are "fair use", or will we just stop making new content because it can't avoid copying patterns of the past?
> I'm curious how this can be applied with the inevitable combinatorial exhaustion that will happen with musical aspects such as melody, chord progression, and rhythm.
They did this in 2020. The article points out that "Whether this tactic actually works in court remains to be seen" and I haven't been following along with the story, so I don't know the current status.
More germane is that there will be a smoking gun for every infringement case: whether or not the model was trained on the original. There will be no pretending that the model never heard the piece it copied.
> AI training is no different from any other training.
Yes, it is. One is done by a computer program, and one is done by a human.
I believe in the rights and liberties of human beings. I have no reason to believe in rights for silicon. You, and every other AI apologist, are never able to produce anything to back up what is largely seen as an outrageous world view.
You cannot simply jump the gun and compare AI training to human training like it's a foregone conclusion. No, it doesn't work that way. Explain why AI should have rights. Explain if AI should be considered persons. Explain what I, personally, will gain from extending rights to AI. And explain what we, collectively, will gain from it.
I have this line of thought as well but then I wonder, if we are all out of jobs and out of substantial capital to spend, how do these owners make money ultimately? It's a genuine question and I'm probably missing something obvious. I can see a benevolent/post-scarcity spin to this but the non-benevolent one seems self-defeating.
"Making money" is only a relevant goal when you need money to persuade humans to do things for you.
Once you have an army of robot slaves ... you've rendered the whole concept of money irrelevant. Your skynet just barters rare earth metals with other skynets and your robot slaves furnish your desired lifestyle as best they can given the amount of rare earth metals your skynet can get its hands on. Or maybe a better skynet / slave army kills your skynet / slave army, but tough tits, sucks to be you and rules to be whoever's skynet killed yours.
That's part of the "rare earth metals" synecdoche - hydroelectric dams, thorium mines, great lakes heat sinks - they're all things for skynets to kill or barter for as expedient
I don’t think you’re missing anything, I think the plan really is to burn it all down and rule over the ashes. The old saw “if you’re so smart, why aren’t you rich?” works in reverse too. This is a foolish, shortsighted thing to do, and they’re doing it anyway. Not really thinking about where value actually comes from or what their grandchildren’s lives would be like in such a world.
Capitalism is an unthinking, unfeeling force. The writing is on the wall that AI is coming, and being altruistic about it doesn’t do jack to keep others from the land grab. Their thinking is, might as well join the rush and hope they’re one of the winners. Every one of us sitting on the sidelines will be impacted in some way or the other. So who’re the smart ones, the ones who grab shovels and start digging, or the ones who watch as the others dig their graves and do nothing?
Sure, maybe in 50 years. At the moment, it's a productivity tool. Strangely, by the look of the down votes, the HN community doesn't quite understand this.
The job market is formed by the presence of needs and the ability to satisfy them. AI does not reduce the ability to satisfy needs, so the only situations where you won't be able to compete are either that the socialists seize power and ban competition, or that all the needs are met in some other way. In any other situation there will be a job market and people will compete in it.
> there will be a job market and people will compete in it
Maybe there will be. I'm sure there's also a market for Walkmans somewhere, it's just exceedingly small.
The proclaimed goal is to displace workers on a grand scale. This is basically the vision of any AI company and literally the only way you could even remotely justify their valuations given the heavy losses they incur right now.
> The job market is formed by the presence of needs and the ability to satisfy them
The needs of a job market are largely shaped by the overall economy. Many industrial nations are largely service based economies with a lot of white collar jobs in particular. These white collar jobs are generally easier to replace with AI than blue collar jobs because you don't have to deal with pesky things like the real, physical world. The problem is: if white collar workers are kicked out of their jobs en masse, it also negatively affects the "value" of the remaining people with employment (exhibit A: tech job market right now).
> are either that the socialists seize power and ban competition,
I am really having a hard time understanding where this obsession with mythical socialism comes from. The reality we live in is largely capitalistic and a striving towards a monopoly - i.e. a lack of competition - is basically the entire purpose of a corporation, which is only kept in check by government regulations.
>The proclaimed goal is to displace workers on a grand scale.
It doesn't matter. What you need to understand is that the job market is rooted in needs, the ability to meet those needs, and the ability to exchange those abilities with one another. None of those are hindered by AI.
>Many industrial nations are largely service based economies with a lot of white collar jobs in particular.
Again: at the end of the day it doesn't change anything. At the end of the day you need a cooked dinner, a built house and everything else. So someone must build a house and exchange it for cooked dinners. That's what is happening (white collar workers and international trade balances included) and that's what the job market is. AI doesn't change the nature of those relationships. Maybe it replaces white collar workers, maybe even almost all of them; that only means they will go satisfy other unsatisfied needs of other people in exchange for satisfying their own. The job market won't go anywhere, and if anything the amount of satisfied needs will go up, not down.
>if white collar workers are kicked out of their jobs en masse, it also negatively affects the "value" of the remaining people with employment
No, it doesn't. I mean it would if they were simply kicked out, but that's not the case; they would be replaced by AI. So society gets all the benefits they were creating plus an additional labor force to satisfy previously unsatisfied needs.
>exhibit A: tech job market right now
I don't have the stats at hand, but aren't blue collar workers doing better now than ever before?
>I am really having a hard time understanding where this obsession with mythical socialism comes from
From the history of the 20th century? I mean it's not an obsession, but we are discussing scenarios of the disappearance (or significant shrinking) of the job market, and the socialists are the most (if not the only) realistic cause of that at the moment.
>The reality we live in is largely capitalistic and a striving towards a monopoly
Yes, and this monopoly, the monopoly, is called "socialism".
>corporation, which is only kept in check by government regulations.
Generally a corporation is kept in check by the economic freedom of other economic agents, and it's government regulation that protects monopolies from the free market. I mean, why would government regulate in the other direction? A small number of big corporations is way easier for a government to control and extract personal benefits from.
> At the end of the day you need a cooked dinner, a built house and everything else. So someone must build a house and exchange it for cooked dinners.
You should read some history. This view is so naive and overconfident.
My views on this issue are shaped by history. Starting with crop production and plowing and ending with book printing, conveyor belts and microelectronics, creating tools that increase productivity has always led to increased availability of goods, and the only thing that has ever led to decreased availability is whatever hindered the ability to create and exchange goods.
I started a borderline smug response here pointing out how bullshit white collar and service jobs* were in deep shit but folks who actually work for a living would be fine. I scrapped it halfway through when it occurred to me that if everyone's broke then by definition nobody's spending money on stuff like contractors, mechanics, and other hardcore blue collar trades. Toss in AI's force multiplication of power demands in the face of all of the current issues around global warming and it starts to feel like pursuing this tech is fractally stupid and the best evidence to date I've seen that a neo-luddite movement might actually be a thing the world could benefit from. That last part is a pretty wild thought coming from a retired developer who spent the bulk of his adult life in IT, but here we are.
Neo-Luddism is less stupid when you remember that the Luddites weren't angry that looms existed. Smashing looms was their tactic, not their goal.
Parliament had made a law phasing in the introduction of automated looms; specifically so that existing weavers were first on the list to get one. Britain's oligarchy completely ignored this and bought or built looms anyway; and because Parliament is part of that oligarchy, the law effectively turned into "weavers get looms last". That's why they were smashing looms - to bring the oligarchy back to the negotiating table.
The oligarchy responded the way all violent thugs do: killing their detractors and lying about their motives.
>if everyone's broke
>nobody's spending money on stuff like contractors, mechanics, and other hardcore blue collar trades.
Why would this happen? Money is simply a medium for exchanging the value that these contractors, mechanics, and other hardcore blue collar trades are creating. How can they be broke if AI doesn't disturb their ability to create value and exchange it?
Customers that have funds available to purchase the services you offer and who are willing to actually spend that money are a hard requirement to maintain any business. If white collar and service industries are significantly disrupted by AI this necessarily reduces the number of potential customers. Thing is you don't have to lay off that many people to bankrupt half of the contractors in the country, a decent 3-5 year recession is all it takes. Folks stop spending on renovations and maintenance work when they're worried about their next paycheck.
Money means nothing. It is simply a medium of exchange. The question is: is there anything to exchange? And the answer is yes, and the position of white collar workers doesn't affect the availability of things to exchange. There's no reason for a recession; there is nothing that can hinder the ability of blue collar workers to create goods and services, all the things that combined are called "wealth".
Don't think in the meaningless category of "what set of digits will be printed on the piece of paper called a paycheck?". Think in the terms that are actually implied: "What goods and services can't blue collar workers afford for themselves?". Then it becomes clear that the set of goods and services unaffordable to blue collar workers will shrink with the replacement of white collar workers by AI, because it does not hinder their ability to create those goods and services.
You think so? Give me the contents of your checking, savings, and retirement accounts and then get back to me on that.
> the position of white collar workers doesn't affect the availability of things to exchange.
You appear to be confused about the concept of consumers, let me help. Consumers are the people who buy things. When there are fewer consumers in a market, demand for products and services declines. This means less sales. So no, you don't get to unemploy big chunks of the population and expect business to continue thriving.
>When there are fewer consumers in a market, demand for products and services declines.
No, demand is unlimited and defined by the amount of production.
>You don't get to unemploy big chunks of the population and expect business to continue thriving.
I mean, generally, replacing workers with tools is the main way for a business (and society) to thrive. In other words, what goods and services will become less affordable to blue collar workers?
When white collar workers [researchers, programmers, managers, salespeople, translators, illustrators, ...] lose their income/jobs to AIs, lose their ability to buy products/services, and at the same time try to shift en masse to some kind of manual work, do you think that would not affect the incomes of those who are the current blue collar class?
I mean yes, the value of consumed goods will decrease, so blue collar workers will be able to consume more. That's exactly what's called an increase in income.
My gut is telling me you're being intentionally obtuse but I'm going to give you the benefit of the doubt. To reiterate in detail:
AI is poised to disrupt large swaths of the workforce. If large swaths of the workforce are disrupted this necessarily means a bunch of people will see their income negatively impacted (job got replaced by AI). Broke people by definition don't have money to spend on things, and will prioritize tier one of Maslow's Hierarchy out of necessity. Since shit like pergolas and oil changes are not directly on tier 1 they will be deprioritized. This in turn cuts business to blue collar service providers. Net result: everyone who isn't running an AI company or controlling some currently undefined minimum amount of capital is fucked.
If you're trying to suggest that any notional increases in productivity created by AI will in any way benefit working class individuals either individually or as a group you are off the edge of the map economically speaking. Historical precedents and observed executive tier depravity both suggest any increase in productivity will be used as an excuse to cut labor costs.
>This in turn cuts business to blue collar service providers.
No, it doesn't. Where does that come from?
I mean, look at the situation from the perspective of blue collar service providers: what exactly are the goods and services that they were able to afford for themselves but that AI will make unaffordable for them? Pretty obviously, there are about none. So, in the big picture, the whole process you described doesn't lead to any disadvantage for blue collar workers.
I literally described the mechanism to you twice and you're still acting confused. I'm not sure if we have a language barrier here or what but go check out a Khan Academy course on economics or maybe try running a lemonade stand for an afternoon if you still don't get it.
I think the obvious thing you are missing is just b2b. It doesn’t actually matter if people have any money.
Similar to how advertising and legal services are required for everything but have ambiguous ROI at best, AI is set to become a major "cost of doing business" tax everywhere. Large corporations welcome this even if it's useless, because it drags down smaller competitors and digs a deeper moat.
Executives large and small mostly have one thing in common though.. they have nothing but contempt for both their customers and their employees, and would much rather play the mergers and acquisitions type of games than do any real work in their industry (which is how we end up in a world where the doors are flying off airplanes mid flight). Either they consolidate power by getting bigger or they get a cushy exit, so.. who cares about any other kind of collateral damage?
Money is a proxy for control. Eventually humans will become mostly redundant and slated for elimination except for the chosenites of the managerial classes and a small number of technicians. Either through biological agents, famines, carefully engineered (civil?) wars and conflicts designed to only exterminate the non-managerial classes, or engineered Calhounian behavioral sinks to tank fertility rates below replacement.
Why should we care if they make money? Owning things isn't a contribution to society.
Building things IS a contribution to society, but the people who build things typically aren't the ultimate owners. And even in cases where the builders and owners are the same, entitling the builders and all of their future heirs to rent seek for the rest of eternity is an inordinate reward.
You don't. It's like Minecraft. You can do almost everything in Minecraft alone and everything exists in infinite quantity, so why trade in the first place?
This goes both ways. Let's say there is something you want but you're having trouble obtaining it. You'd need to give something in exchange.
But the seller of what you want doesn't need the things you can easily acquire, because they can get those things just as easily themselves.
The economy collapses back into self sufficiency. That's why most Minecraft economy servers start stagnating and die.
What people say is not the same as what people do.. in other words, what is spoken in public repeatedly is not representational of actual decision flows
Money is only a bookkeeping tool for complex societies. The aim of the owner class in a worker-less world would be accumulation of important resources to improve their lives and to trade with other owners (money would likely still be used for bookkeeping here). A wealthy resource-owner might strive to maintain a large zone of land, defended by AI weaponry, that contains various industrial/agricultural facilities producing goods and services via AI.
They would use some of the goods/services produced themselves, and also trade with other owners to live happy lives with everything they need, no workers involved.
Non-owners may let the jobless working class inhabit unwanted land, until they change their minds.
With what and against what? There will be spy satellites and drones and automated turrets that will turn you to pulp if you come within, say, 50KM of their compound borders.
The non-benevolent future is not self-defeating; we have historical examples of depressingly stable economies with highly concentrated ownership. The entirety of the European dark ages was the end result of (western[0]) Rome's elites tearing the planks out of the hull of the ship they were sailing. The consequence of such a system is economic stagnation, but that's not a consequence that the elites have to deal with. After all, they're going to be living in the lap of luxury, who cares if the economy stagnates?
This economic relationship can be collectively[1] described as "feudalism". This is a system in which:
- The vast majority of people are obligated to perform menial labor, i.e. peasant farmers.
- Class mobility is forbidden by law and ownership predominantly stays within families.
- The vast majority of wealth in the economy is in the form of rents paid to owners.
We often use the word "capitalist" to describe all businesses, but that's a modern simplification. Businesses can absolutely engage in feudalist economies just as well, or better, than they can engage in capitalist ones. The key difference is that, under capitalism, businesses have to provide goods or services that people are willing to pay for. Feudalism makes no such demand; your business is just renting out a thing you own.
Assuming AI does what it says on the tin (which isn't at all obvious), the endgame of AI automation is an economy of roughly fifty elite oligarchs who own the software to make the robots that do all work. They will be in a constant state of cold war, having to pay their competitors for access to the work they need done, with periodic wars (kinetic, cyber, legal, whatever) being fought whenever a company intrudes upon another's labor-enclave.
The question of "well, who pays for the robots" misunderstands what money is ultimately for. Money is a token that tracks tax payments for coercive states. It is minted specifically to fund wars of conquest; you pay your soldiers in tax tokens so the people they conquer will have to barter for money to pay the tax collector with[2]. But this logic assumes your soldiers are engaging in a voluntary exchange. If your 'soldiers' are killer robots that won't say no and only demand payment in energy and ammunition, then you don't need money. You just need to seize critical energy and mineral reserves that can be harvested to make more robots.
So far, AI companies have been talking of first-order effects like mass unemployment and hand-waving about UBI to fix it. On a surface level, UBI sounds a lot like the law necessary to make all this AI nonsense palatable. Sam Altman even paid to have a study done on UBI, and the results were... not great. Everyone who got money saw real declines in their net worth. Capital-c Conservative types will get a big stiffy from the finding that UBI did lead people to work less, but that's only part of the story. UBI as promoted by AI companies is bribing the peasants. In the world where the AI companies win, what is the economic or political restraining bolt stopping the AI companies from just dialing the UBI back and keeping more of the resources for themselves once traditional employment is scaled back? Like, at that point, they already own all the resources and the means of production. What makes them share?
[0] Depending on your definition of institutional continuity - i.e. whether or not Istanbul is still Constantinople - you could argue the Roman Empire survived until WWI.
[1] Inasmuch as the complicated and idiosyncratic economic relationships of medieval Europe could even be summed up in one word.
[2] Ransomware vendors accidentally did this, establishing Bitcoin (and a few other cryptos) as money by demanding it as payment for a data ransom.
And how could they possibly base their actions on good when their technology is more important than fire? History is depending on them to do everything possible to increase their market cap.
> The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained.
I agree with you in the case of AI companies, but the desire to own everything and be completely unconstrained is the dream of every large corporation.
In the past, you had to give some of your spoils to those who did the conquering for you, and to laborers after that. If you can automate and replace all work, including maintaining the robots that do it and training them, you no longer need to share anything.
In my view it's the same thing, same trajectory -- with more power in the hands of fewer people further along the trajectory.
It can be better or worse depending on what those with power choose to do. Probably worse. There has been conquest and domination for a long time, but ordinary people have also lived in relative peace gathering and growing food in large parts of the world in the past, some for entire generations. But now the world is rapidly becoming unable to support much of that as abundance and carrying capacity are deleted through human activity. And eventually the robot armies controlled by a few people will probably extract and hoard everything that's left. Hopefully in some corners some people and animals can survive, probably by being seen as useful to the owners.
On the bright side, armies of robot slaves give us an off-ramp from the unsustainable pyramid scheme of population growth.
Be fruitful, and multiply, so that you may enjoy a comfortable middle age and senescence exploiting the shit out of numerous naive 25-year-olds! If it's robots, we can ramp down the population of both humans and robots until the planet can once again easily provide abundance.
Sure, the problem though is it won't be "we" deciding what the robots do, it will most likely be a few powerful people of dubious character and motivations since those are the sort of people who pursue power and end up powerful.
That's why even though technology could theoretically be used to save us from many of our problems, it isn't primarily used that way.
But presumably petty tyrants with armies of slave robots are less interested than consensus in a long-term vision for humanity that involves feeding and housing a population of 10 billion.
So after whatever horrific holocaust follows the AI wars the way is clear for a hundred thousand humans to live in the lap of luxury with minimal impact on the planet. Even if there are a few intervening millennia of like 200 humans living in the lap of luxury and 99,800 living in sex slavery.
The thing is that this will be their destruction as well. If workers don't have any money (because they don't have jobs), nobody can afford what the owners have to sell?
They are also gutting the profession of software engineering. It's a clever scam actually: to develop software a company will need to pay utility fees to A"I" companies, and since their products are error prone, voila, use more A"I" tools to correct the errors of the other tools. Meanwhile software knowledge will atrophy and soon, à la WALL-E, we'll have software "developers" with 'soft bones' floating around on conveyed seats slurping 'sugar water', getting fat, and not knowing even how to tie their software shoelaces.
Yes, like the Pixel camera app, which mangles photos with AI processing, and users complain that it won't let people take pics.
One issue was a pic with text in it, like a store sign. Users were complaining that it kept asking for better focus on the text in the background, before allowing a photo. Alpha quality junk.
That's pretty much what our future would look like -- you are irrelevant. Well I mean we are already pretty much irrelevant nowadays, but the more so in the "progressive" future of AI.
Rules and laws are for other people. A lot of people reading this comment having mistaken "fake it til you make it" or "better to not ask permission" for good life advice are responsible for perpetrating these attitudes, which are fundamentally narcissistic.
I think the logic is more like “we have to do everything we can to win or we will disappear”. Capitalism is ruthless and the big techs finally have some serious competition, namely: each other as well as new entrants.
Like why else can we just spam these AI endpoints and pay $0.07 at the end of the month? There is some incredible competition going on. And so far everyone except big tech is the winner so that’s nice.
> One crawler downloaded 73 TB of zipped HTML files in May 2024 [...] This cost us over $5,000 in bandwidth charges
I had to do a double take here. I run (mostly using dedicated servers) infrastructure that handles a few hundred TB of traffic per month, and my traffic costs are on the order of $0.50 to $3 per TB (mostly depending on the geographical location). AWS egress costs are just nuts.
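To put numbers on it: $5,000 for 73 TB works out to roughly $68 per TB, which is well over an order of magnitude above that dedicated-server range.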
I think the uncontrolled price of cloud traffic is a real fraud and a way bigger problem than some AI companies ignoring robots.txt. One time we went over the limit on Netlify or something, and they charged over a thousand dollars for a couple of TB.
> I think the uncontrolled price of cloud traffic is a real fraud
Yes, it is.
> and a way bigger problem than some AI companies ignoring robots.txt.
No, it absolutely is not. I think you underestimate just how hard these AI companies hammer services - it is bringing down systems that have weathered significant past traffic spikes with no issues, and the traffic volumes are at the level where literally any other kind of company would've been banned by their upstream for "carrying out DDoS attacks" months ago.
>I think you underestimate just how hard these AI companies hammer services
Yes, I completely don't understand this, and I don't understand comparing it with DDoS attacks. How is it different from what search engines are doing, and in what way is it worse? It's simply scraping data; what significant problems can it cause? Cache pollution? And that's it? I mean, even when we're talking about ignoring robots.txt (which search engines often do too) and calling costly endpoints, what's the problem with adding some captcha or rate limiters to those endpoints?
Yeah, you have a point. Hmm, I wish there were a way to generate that garbage with minimal bandwidth. Something like: I send you a very compressed 256 bytes of data which expands to something like 1 megabyte.
There is -- but instead of garbage-expanding data, add several delays within the response so that the data takes extraordinarily long to arrive.
Depending on the number of simultaneous requesting connections, you may be able to do this without a significant change to your infrastructure. There are ways to do it that don't exhaust your number of (IP, port) available too, if that is an issue.
Then the hard part is deciding which connections to slow, but you can start with a proportional delay based on the number of bytes per source IP block or do it based on certain user agents. Might turn into a small arms race but it's a start.
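A very rough sketch of the proportional-delay idea, assuming an Express-style Node server (the /24 keying and the thresholds are made up for illustration, not a tested policy):

    // Hypothetical tarpit middleware: delay each response in proportion to how
    // many bytes a /24 block has already pulled. Numbers are illustrative only.
    const bytesByBlock = new Map();

    function tarpit(req, res, next) {
      const block = req.ip.split('.').slice(0, 3).join('.'); // crude IPv4 /24 key
      const used = bytesByBlock.get(block) || 0;

      // 1 ms of delay per 100 KB already served to this block, capped at 30 s.
      const delayMs = Math.min(Math.floor(used / 100_000), 30_000);

      res.on('finish', () => {
        const len = Number(res.getHeader('Content-Length')) || 0;
        bytesByBlock.set(block, (bytesByBlock.get(block) || 0) + len);
      });

      setTimeout(next, delayMs);
    }

A real version would need entry eviction and IPv6 handling, but the point stands: normal visitors see no delay while heavy scrapers wait longer and longer.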
It does not even have to be dynamically generated. Just pre-generate a few thousand static pages of AI slop and serve that. Probably cheaper than dynamic generation.
I kind of suspect some of these companies probably have more horsepower and bandwidth in one crawler than a lot of these projects have in their entire infrastructure.
Thanks for writing about this. Is it clear that this is from crawlers, as opposed to dynamic requests triggered by LLM tools, like Claude Code fetching docs on the fly?
Along with having block lists, perhaps you could add poison to your results: randomly generated bad code that will not work, visible only to bots (display: none when rendered). The bots will use it, but a human never would.
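Something like this, as a sketch (the user-agent list and the decoy snippet are placeholders, not a vetted bot signature set):

    // Hypothetical Express middleware: append an invisible block of broken
    // "code" to HTML pages for suspected crawlers. Browsers hide it via
    // display:none; scrapers that strip markup ingest it as text.
    const SUSPECT_UA = /GPTBot|CCBot|ClaudeBot|Bytespider/i; // illustrative list

    function poison(req, res, next) {
      if (!SUSPECT_UA.test(req.get('User-Agent') || '')) return next();
      const send = res.send.bind(res);
      res.send = (body) => {
        if (typeof body === 'string' && body.includes('</body>')) {
          const decoy = '<div style="display:none"><pre>' +
            'def fibonacci(n): return n * "banana"  # plausible-looking nonsense' +
            '</pre></div>';
          body = body.replace('</body>', decoy + '</body>');
        }
        return send(body);
      };
      next();
    }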
Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been for 10+ years https://www.fastly.com/fast-forward (disclaimer, I work for Fastly and help with this program)
Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.
I've been running into bot detection on at least five different websites in the past two months (not even including captcha walls)
Not sure what to tell you but I surely feel quite human
Three of the pages told me to contact customer support and the other two were a hard and useless block wall. Only from Codeberg did I get a useful response, the other two customer supports were the typical "have you tried clearing your cookies" and restart the router advice — which is counterproductive because cookie tracking is often what lets one pass. Support is not prepared to deal with this, which means I can't shop at the stores that have blocking algorithms erroneously going off. I also don't think any normal person would ever contact support, I only do it to help them realise there's a problem and they're blocking legitimate people from using the internet normally
It's not like they say, but it's at least three different implementations and I don't think any were cloudflare because I've been running into those pages for years and they've got captchas (functional or not). One of them was Akamai I think indeed
Yeah, I definitely don't want to pivot this thread into a product pitch, as the important thing is helping the open-source projects, but we can work with the maintainers to tune the systems to be as strict/lax as preferred. I'm sure the other services can too, to be fair.
The underlying issue is that many sites aren't going to get feedback from the real people they've blocked, so their operators won't actually know that tuning is required (also, the more strict the system, the higher percentage of requests will be marked as bots, which might lead an operator to want things to be even more strict...)
I will say -- a higher-end bot detection service should provide paper trails on the block actions they take (this may not be available for freemium tiers, depending on the vendor).
But to your point, the real kicker is the "many sites aren't going to get feedback from the real people they've blocked" since those tools inherently decided that the traffic was not human. You start getting into Westworld "doesn't look like anything to me" territory.
I'm not into westworld so can't speak to the latter paragraph, but as for "high-end" vendors' paper trail: how do log files help uncover false blocks? Any vendor will be able to look up these request IDs printed on the blocking page, but how does it help?
You don't know if each entry in the log is a real customer until they buy products proportional to some fraction of their page load rate, or real people until they submit useful content or whatever your site is about. Many people just read information without contributing to the site itself and that's okay, too. A list of blocked systems won't help; I run a server myself, I see the legit-looking user agent strings doing hundreds of thousands of requests, crawling past every page in sequence, but if there wasn't this inhuman request pattern and I just saw this user agent and IP address and other metadata among a list of blocked access attempts, I'd have no clue if the ban is legit or not
With these protection services, you can't know how much frustration is hiding in that paper trail, so I'm not blocking anyone from my sites; I'm making the system stand up to crawling. You have to do that regardless for search engines and traffic spikes like from HN
Oh my, a Dutch film that actually sounds good?! I get to watch a movie that's originally in my native language for perhaps the second time in my life, thanks for linking this :D
Edit: and it's on YouTube in full! Was wondering which streaming service I'd have to buy for this niche genre of Dutch sci-fi but that makes life easy: https://www.youtube.com/watch?v=4VrLQXR7mKU
Final update: well, that was certainly special. Favorite moment was 10:26–10:36 ^^. Don't think that comes fully across in the baked-in subtitles in English though. Overall it could have been an episode of Black Mirror, just shorter. Thanks again for the tip :)
I have to assume the Dutch movie industry just isn't too big.
I guess it's a side effect of America's media, but when I went to Europe including the Netherlands almost everyone spoke English at an almost native level.
It almost felt like playing a video game where there is an immersive mode you can just turn off if it gets too difficult ( subtitles in English at all public facilities).
It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.
One piece of feedback: Could you add some explanation (for humans) what we're supposed to do and what is happening when met by that page?
I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the Gnome issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on, I initially thought I got blocked or that the captcha failed to load.
Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.
also if you're using JShelter, which blocks Worker by default, there is no indication that it's never going to work, and the spinner just goes on forever doing nothing
Maybe one of those (slightly misleading) progressbars that have a dynamic speed that gets slower and slower the closer to the finish it gets? Just to indicate that it's working towards something
It'll be somewhat involved, but based on the difficulty vs the clients hashing speed you could say something probabilistic like "90% of the time, this window will be gone in xyz seconds from now"?
I really like this. I don't mind Internet acting like the Wild Wild West but I do mind there's no accountability. This is a nice way to pass the economic burden to the crawlers for sites who still want to stay freely available. You want the data, spend money on your side to get it. Even though the downside is your site could be delisted from search engines, there's no reason why you cannot register your service in a global or p2p indexer.
Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have another actually legitimately useful application of cryptocurrencies on our hands..!
Anubis is only going to work as long as it doesn't get famous; if that happens, crawlers will start using GPUs / ASICs for the proof of work and it's game over.
Actually, that is not a bad idea. @xena maybe Anubis v2 could make the client participate in some sort of SETI@HOME project, creating the biggest distributed cluster ever created :-D
I love that I seem to stumble upon something by you randomly every so often. I'd just like to say that I enjoy your approach to explanations in blog form and will look further into Anubis!
Maybe I'm missing something, but doesn't this mean the work has to be done by the client AND the server every time a challenge is issued? I think ideally you'd want work that was easy for the server and difficult for the client. And what is to stop you from being DDoS'd by clients that are challenged but neglect to perform the challenge?
Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.
We usually write it out in hex form, but that's literally what the bytes in RAM look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this:
await sha256(`${challenge}${nonce}`);
The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.
The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or ram is always always always going to be so much cheaper than your webapp ever will be.
I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.
I'm shocked that this all works so well and I'm so glad to see it take off like it has.
I am sorry if this question is dumb, but how does proof of work deter bots/scrapers from accessing a website?
I imagine it costs more resources to access the protected website, but would this stop the bots? Wouldn't they be able to pass the challenge and scrape the data after? Or do normal scraper bots usually time out after a small amount of time/resources is used?
There are a few ways in which bots can fail to get past such challenges, but the most durable one (ie. the one that you cannot work around by changing the scraper code) is that it simply makes it much more expensive to make a request.
Like spam, this kind of mass-scraping only works because the cost of sending/requesting is virtually zero. Any cost is going to be a massive increase compared to 'virtually zero', at the kind of scale they operate at, even if it would be small to a normal user.
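Back-of-the-envelope, with made-up numbers rather than Anubis's defaults: if the difficulty requires four leading zero hex digits, a client needs about 16^4 ≈ 65,000 hash attempts on average per challenge. That's a fraction of a second for one human visitor, but multiplied across millions of scraped pages it becomes real compute the scraper has to pay for.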
> I think ideally you'd want work that was easy for the server and difficult for the client.
That's exactly how it works (easy for server, hard for client). Once the client has completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge, it only needs to validate that the result checks out.
Similar to how in Proof-of-Work blockchains where coming up with the block hashes is difficult, but validating them isn't nearly as compute-intensive.
This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work
Fun fact: it seems Proof-of-Work was used as a DoS preventing technique before it was used in Bitcoin/blockchains, so seems we've gone full circle :)
I think going full circle would be something like bitcoin being created on top of DoS prevention software and then eventually DoS prevention starting to use bitcoin. A tool being used for something, then something else, then the first something again is just... nothing? Happens all the time?
I'm commissioning an artist to make better assets. These are the placeholders that I used with the original rageware implementation. I never thought it would take off like this!
At this rate, it's more than FOSS infrastructure -- although that's a canary in the coalmine I especially sympathize with -- it's anonymous Internet access altogether.
Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.
What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.
We're half way there already. It always hits me whenever I am doing some mapping for OpenStreetMap and I'm looking up local businesses without their own internet presence. They use Facebook, Instagram, X, etc. for their digital calling card. I normally don't use Facebook (or Instagram, and gave up on X) and have no account there, and every time I follow one of those links, you get some info, and then you get a dialogue screen telling you to make an account or get lost, or you just get some obscure error.
I don't mind registering an account for private communities, but for stuff which people put up thinking it is just going to be publicly visible it's really annoying.
> ... but for stuff which people put up thinking it is just going to be publicly visible ...
I don't think these business owners really understand. Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.
I agree with you that it is extremely frustrating.
>Most normies just think everyone has a Facebook/Instagram account and can't even imagine a world where that is not the case.
The people without a basic internet presence aren't likely to be customers anyway so it's not a huge loss. It's trivial to setup a basic account for any site that doesn't contain any personal data you want to keep hidden, if you aren't willing to do that, you're in a tiny minority.
> It's trivial to setup a basic account for any site that doesn't contain any personal data you want to keep hidden
It's equally trivial for a restaurant to set up a custom domain with their own 2-page website (overview and menu) on any of a hundred platforms that provide this service.
Most of these services are not free like FB, but any business that can afford a landline phone can afford a real website.
>It's equally trivial for a restaurant to set up a custom domain with their own 2-page website (overview and menu) on any of a hundred platforms that provide this service.
Sure but they don't want to. If you want to see the menu they have online you need to follow their rules, not your own.
I don't have a Facebook or Instagram account, but I definitely eat tacos, and I was put off when I couldn't see a new taco place's opening hours without an Instagram account.
I'm not sure why you think people who don't have a Facebook account wouldn't eat at restaurants.
You're in a tiny enough minority that it doesn't matter to them. It's like Amish complaining that they can't use a drive-thru window or something. Except it'd take you 30 seconds, one time, to solve your problem forever.
> Except it'd take you 30 seconds, one time, to solve your problem forever.
And those 30 seconds are a harrowing pit of despair out of which comes the rest of your life filled with advertisements, tracking, second-guessing, and accusations of being a hypocrite.
To be fair, they've been shown to still track unauthenticated users via fingerprinting and mapping it to known data from your friends who do have Facebook and upload this data (phone numbers, first, last name, etc). NOT having an account doesn't mean you aren't being tracked.
Not that it means you should just make an account to make their tracking easier...
I haven't had any of that just from having an FB and IG account. I honestly forgot I had an IG account for a long time until someone shared something with me and I realized I had one.
There's no "basic Internet presence" for an individual. If you think it through, every attempt to make that a thing has wound up being MySpace, Facebook, or the next dumpster.
> What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent.
Just to say the quiet part out loud here: one of the biggest reasons this is depressing is that it's not only vandalism, but vandalism with huge compounding benefits for the assholes involved -- and grabbing the content is just the beginning. If they take down the site forever due to increasing costs? Great, because people have to use AI for all documentation. If we retreat from captchas and force people to put in credit cards or telephone numbers? Great, because the internet is that much less anonymous. Data exfiltration leads to extra fraud? Great, you're gonna need AI to combat that. It's all pretty win-win for the bad actors.
People have discussed things like the balkanization of the internet for a long time now. One might think that the threat of that and/or the fact that it's such an unmitigated dumpster fire already might lead to some caution about making it worse! But pushing the bounds of harassment and friction that people are willing to deal with is moot anyway, because of course they have no real choice in the matter.
I dunno. I run a small browser-game, and while my server has been periodically getting absolutely pulverized by LLM scrapers, I have yet to see a single new account that looks remotely like it was created by a bot. (Also, the rate of new signups hasn't changed notably.) This is true for both the game and its Wiki—which is where most of the scraping traffic has been. (And which I will almost certainly have to set to be almost-completely authwalled if the scraping doesn't let up.)
You don't need to put your stuff behind an auth wall; you could just as easily require an anonymous micropayment for each request.
That we live in an internet where getting too many visitors is an existential crisis for websites should tell you that our internet is not one that can survive long.
Back when search engines caused this, the industry made an agreement and designed the robots.txt spec precisely to avoid legal frameworks being made to stop them. Because of that, no legal frameworks were made.
Now there's a new generation of hungry hungry hippo indexers that didn't agree to that and who feel intense pressure from competition to scoop up as much data as they can, who just ignore it.
Legislation should have been made anyway, and those that ignore robots.txt blocked / fined / throttled / etc.
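For context, robots.txt is nothing more than a plain text file of voluntary rules served at the site root, which is exactly why it has no teeth without goodwill or legislation behind it. Something like this (the disallowed paths are just illustrative; GPTBot is OpenAI's published crawler name, and Crawl-delay is a non-standard extension that only some crawlers honour):

  # robots.txt -- purely advisory; compliant crawlers read it, abusive ones ignore it
  User-agent: GPTBot
  Disallow: /

  # hypothetical expensive endpoints we'd rather nobody crawl
  User-agent: *
  Disallow: /blame/
  Disallow: /commit/
  Crawl-delay: 10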
Unethical behaviour always has a huge advantage over ethical behaviour, that's nothing new and pretty much by definition. The only way to prevent a race to the bottom is to make the unethical behaviour illegal or unprofitable.
I don't know if "robots.txt" appearing in congressional record really counts. Do any of the decision makers appear to have a command of what the file does? Or do they typically relegate to industry professionals, as they often do?
How would legislation in the US or EU stop traffic from China or Thailand or Russia? At best you'd be fragmenting the internet, which isn't really a "best", that's a terrible idea.
This is the key point, but if US laws are being violated and AI is considered part of national security, that could be used by the US government in international negotiations, and as justification for sanctions, etc. It would be a good deterrent.
I was also under attack recently [0]. The little Forgejo instance where I host my code (of several open source packages so it needs to be open) was run into the ground and the disk was filled with generated zip archives. I'm not the only one who has suffered the same fate. For me, the attacks subsided (for now) when I banned Alibaba Cloud's IP range.
If you are hosting a Forgejo instance, I strongly recommend setting DISABLE_DOWNLOAD_SOURCE_ARCHIVES to true. The crawlers will still peg your CPU but at least your disk won't be filled with zip files.
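If it helps, this is roughly what that looks like in app.ini -- assuming the setting still sits under the [repository] section as it does in Gitea's config; check your instance's docs:

  ; app.ini -- disable on-demand source archive generation
  [repository]
  DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true

Then restart the service for it to take effect.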
Hm, so it's a cache then? Requesting the same tarball 100 times shouldn't create 100 zip files if they're cached, and if they aren't cached they shouldn't fill up the disk.
They are a cache, but you can generate them for each branch, tag, and commit in at least three different formats... Now imagine you have a repo with several thousand commits.
Yeah, fair point. Although I think the only real uniqueness here is commits; tarballs generated from branches and tags are ultimately the same as the ones generated from the commits they reference. But I still agree with your overall point.
Or perhaps switch to well-engineered software actually properly designed to be served on the public Internet.
Clearly generating zip files, writing them fully to disk and then sending them to the client all at once is a completely awful and unusable design, compared to the proper design of incrementally generating and transmitting them to the client with minimal memory consumption and no disk usage at all.
The fact that such an absurd design is present is a sign that most likely the developers completely disregarded efficiency when making the software, and it's thus probably full of similar catastrophic issues.
For example, from a cursory look at the Forgejo source code, it appears that it spawns "git" processes to perform all git operations rather than using a dedicated library and while I haven't checked, I wouldn't be surprised if those operations were extremely far from the most efficient way of performing a given operation.
It's not surprising that the CPU is pegged at 100% load and the server is unavailable when running such extremely poor software.
Just noting that the archives are written to disk on purpose, as they are cached for 24 hours (by default). But when you have a several thousand commit repository, and the bots tend to generate all the archive formats for every commit…
But Forgejo is not the only piece of software that can have CPU intensive endpoints. If I can't fence those off with robots.txt, should I just not be allowed to have them in the open? And if I forced people to have an account to view my packages, then surely I'd have close to 0 users for them.
Well then such a cache obviously needs a limit on the disk space it uses and some sort of cache replacement policy. If one can generate a zip file for each tag, the total disk space of the cache is O(n^2) where n is the disk usage of the git repositories (imagine a single repository where each commit is tagged and adds a new file of constant size). So unless your total disk space is a million or a billion times larger than the space used by the repositories themselves, such a cache is guaranteed to fill the disk without a limit.
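To make the point concrete, a size-bounded cache with LRU eviction is a few dozen lines. A rough Python sketch of the policy being argued for (not Forgejo's actual code, which is Go and structured differently):

  import os
  from collections import OrderedDict

  class BoundedArchiveCache:
      """Keep generated archives under a total byte budget, evicting least-recently-used first."""

      def __init__(self, directory, max_bytes=10 * 1024**3):  # e.g. a 10 GiB budget
          self.directory = directory          # must already exist
          self.max_bytes = max_bytes
          self.entries = OrderedDict()        # filename -> size in bytes
          self.total = 0

      def add(self, filename, data):
          path = os.path.join(self.directory, filename)
          with open(path, "wb") as f:
              f.write(data)
          self.entries[filename] = len(data)
          self.entries.move_to_end(filename)
          self.total += len(data)
          self._evict()

      def get(self, filename):
          if filename in self.entries:
              self.entries.move_to_end(filename)  # mark as recently used
              return os.path.join(self.directory, filename)
          return None

      def _evict(self):
          # Drop the oldest archives until we're back under budget.
          while self.total > self.max_bytes and self.entries:
              victim, size = self.entries.popitem(last=False)
              os.remove(os.path.join(self.directory, victim))
              self.total -= size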
The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.
This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.
There's probably going to be a small lag as the knowledge in the current LLMs dries up, because no one can scrape the web in an automated fashion anymore.
I actually envision Lyapunov stability, like wolf and rabbit populations. In this scenario, we're the rabbits. Human content will increase when AI populations decrease, thus providing more food for AI, which will then increase. That drowns out human expression, and the humans grow quieter. This provides less fodder for the AI, and they decrease. That means less noise, and the humans grow louder. The cycle repeats ad nauseam.
I've thought along similar lines for art: what ecological niches are there where AI can't participate, where training data is harder or uneconomical to pull, and where humans can flourish?
Agreed, it seems inevitable. Unfortunately I think it will also result in further centralization & consolidation into a handful of "trusted" megacorps.
If you thought browser fingerprinting for ad tracking was creepy, just wait until they're using your actual fingerprint.
Google is already scraping your site and presenting answers directly in search results. If I cared about traffic (hence selling ad space), why would I want my site indexed by Google at all anymore? Lots of advertising-supported sites are going to go dark because only bots will visit them.
It will entrench established search engines even more if they have to move to auth-based crawling, so that the only crawlers will be those you invite. Most people will do this for Google, Bing, and maybe one or two others if there is a simple tool to do so.
This couldn't be further from the truth. The ad business is not going anywhere. It will grow even bigger.
OpenAI is going through the initial cycle of enshittification. Google is too big right now. Once they establish dominance, you will have to see 5 unskippable ads between prompts, even on a paid plan.
I solved this problem for myself. Most of my web projects use client-side processing. I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the data source: the browser first downloads the SQLite database, then uses it to display data client-side.
The stated problem was about indexing, accessing content and advertising in that context.
> I solved this problem for myself. Most of my web projects use client-side processing. I moved to GitHub Pages, so clients can use my projects with no downtime. The pages use SQLite as the data source: the browser first downloads the SQLite database, then uses it to display data client-side.
That is not really a solution. Since typical indexing still works for the masses, your approach is currently unusual. But in the end, bots will be able to read anything on a web page that a human can read, and we're back to the original problem of trying to tell bots apart from humans. That's the only way.
What about the next generation of AI that could sign up autonomously? Even if we implemented auth walls everywhere right now, what's stopping the companies from hiring some really cheap labor to create accounts on websites and use them to scrape the content?
Is it going to become another race like the adblocker -> detect adblocker -> bypass adblocker detector and so on...?
Can we not just have a whitelist of allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can provide a list of IP addresses that their crawlers will come from. Then simply don't include major LLM providers like OpenAI.
How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.
Only if you operate on the scale of Cloudflare, etc. you can see which IP addresses are hitting a large number of servers in a short time span.
(I am pretty sure the next step is that they will hand out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)
I fear the only solution in the end are CDNs, making visits expensive using challenges, or requiring users to log in.
How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.
This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from users. It can be done, though.
Or an open, regularly updated list of IPs identified as belonging to AI companies, which firewalls could easily pull in? (Same idea as open source AV.)
> Or an open, regularly updated list of IPs identified as belonging to AI companies, which firewalls could easily pull in? (Same idea as open source AV.)
I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]
[1] The AI SaaS will simply run a background worker on the client to do their search indexing.
AI is good at solving captchas. But even if everyone added a captcha, search engines would continue indexing, because it is easy to add authentication that lets search engines skip the captcha; Google would just need to publish a public key.
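In fact, something along these lines already exists for crawl verification: the big honest crawlers can be checked with forward-confirmed reverse DNS, no key publishing needed. A rough Python sketch -- the hostname suffixes are from memory, so double-check them against the vendors' docs before relying on this:

  import socket

  TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")  # verify against vendor docs

  def is_verified_crawler(ip):
      """Forward-confirmed reverse DNS: the PTR record must match a trusted suffix,
      and that hostname must resolve back to the same IP."""
      try:
          host, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
      except OSError:
          return False
      if not host.endswith(TRUSTED_SUFFIXES):
          return False
      try:
          forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirmation
      except OSError:
          return False
      return ip in forward_ips

  # print(is_verified_crawler("66.249.66.1"))  # an IP in a published Googlebot range, for illustration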
This is fine, as Google's utility as a search engine has turned into a hot pile of garbage, at least for my cases. Where a decade ago I could put in a few keywords and get relevant results, I now have to guide it with several "quoted phrases" and -exclusions to get the result I'm looking for on the second or third result page. It has crumbled under its own weight, and seems to suggest irrelevant trash to me first and foremost because it's the website of some big player or content farm. Either their algorithm is tuned for mass manipulation or they lost the arms race with SEO cretins (or both).
Granted, I'm not looking forward to some LLM condensing all the garbage and handing me a Definitive Answer (TM) based on the information it deems relevant for inclusion.
In case anyone is interested in a tiny bit of sabotage, I am under the impression I managed to 'drown' true information on my microblog by generating contradicting posts with LLaMa (tens of them for each real post) and invisibly linking them, so a human would not click through.
You know, flood the zone with s***, Bannon-style ...
This is an approach I've seen used and I'm not sure what success it has had. But logically it seems sound: explicitly reference paths that no human would actually see; traffic hitting those paths is bot traffic. They can't help themselves.
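A minimal sketch of the idea with Flask (the path name is made up, and a real deployment would push the bans out to the firewall or fail2ban rather than keep them in the app):

  from flask import Flask, abort, request

  app = Flask(__name__)
  banned_ips = set()  # in practice you'd feed these to fail2ban/nftables instead

  @app.before_request
  def reject_banned():
      if request.remote_addr in banned_ips:
          abort(403)

  # Linked invisibly from pages (e.g. display:none) and disallowed in robots.txt,
  # so no human and no polite crawler should ever request it.
  @app.route("/totally-not-a-trap/<path:anything>")
  def honeypot(anything):
      banned_ips.add(request.remote_addr)
      abort(404)  # look boring to the bot

  @app.route("/")
  def index():
      return "hello"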
Temporary solution, and it only works if only some of us are doing it. What if these bots have a "manager" LLM agent that decides which pages to scrape?
When I read this yesterday, I was contemplating whether one possible way to mitigate this at a larger scale would be for websites to create random virtual paths/endpoints that drive the bot into a locally served Library of Babel[0] that poisons the spiders with lots of useless text.
It won't work for well-structured sites where the bots know the exact endpoint they want to scrape, but might slow down the more exploratory spider threads.
Even though I agree with what you're doing in principle, I feel it's necessary to remind and warn everyone here that sabotaging bots could be viewed as a violation of laws such as the US's Computer Fraud and Abuse Act[1]. I mean, unless the Second Amendment is suddenly interpreted to include cyberweapons.
Insane. I wonder if we eventually end up with a non-search-engine-indexed version of the web that's more like browsing in the 90s, where websites just had to link to one another to get noticed...
I love that the solution to LLM scraping is to serve the browser a proof of work before allowing access. I wonder if new sites will start to do this... It would mean they won't be indexed by search engines, but it would help protect the IP.
Or, alternatively, just embrace the anime-porn end of the content spectrum. I mean, just compare platforms that are free of it with ones that are chock full of it, and see which ones die and which ones grow.
Sure, if you're going to deploy it on your company site, but I think if you're running a personal website and want to throttle LLM crawlers without falsely advertising that you're a furry, you could just go and modify this piece of MIT-licensed software.
Does the PoW make money via crypto mining? Or is it just to waste the caller's CPU cycles? If you could monetize the PoW then you could re-challenge at an interval tuned so that the caller pays for their usage.
It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20 foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.
By doing PoW as a side effect of something you need to do anyway for other reasons, you actually make mining less profitable for other miners, which helps eliminate waste.
This is an aspect that a lot of PoW haters miss. While PoW is a waste, there are long-term economic incentives to minimize it, either by making it a side effect of something actually useful or by using energy that would go to waste anyway, so its overall effect gravitates toward neutral.
Unfortunately such second-order effects are hard to explain to most people.
I always felt like crypto is nothing but speculating on value with no other good uses, but there is a kind of motivation here.
Say a hash challenge gets widely adopted, and scraping becomes more costly, maybe even requires GPUs. This is great, you can declare victory.
But what if after a while the scraping companies, with more resources than real users, are better able to solve the hash?
Crypto appeals here because you could make the scrapers cover the cost of serving their page.
Ofc if you’re leery of crypto you could try to find something else for bots to do. xkcd 810 for example. Or folding at home or something. But something to make the bot traffic productive, because if it’s just a hardware capability check maybe the scrapers get better hardware than the real users. Or not, no clue idk
I never thought about it until now, but it's insane that the companies who offer both LLM products and cloud compute services are double dipping: they get the LLM product to sell, as well as the elevated-load egress (and compute, etc.) money. When you look at it that way, where's the incentive to even care about inefficient LLM scraping? Leaving it terrible makes you money from your other empire, via cloud egress charges.
We need a project in the spirit of Spamhaus to actively maintain a list of perpetrating IPs. If they're cycling through IPs and IP blocks I don't know how sustainable a CAPTCHA-like solution is.
Just block all of AWS, Alibaba, GCP and Azure, or throttle them aggressively. If you have clients/customers that need more requests per second then have them provide you with their IPs.
The problem is that these companies are fairly well funded and renting infrastructure isn't an issue.
Exactly. They're renting infrastructure on well-known clouds, not cycling through consumer IPs like yesterday's botnets. Block all web traffic from well-known cloud IPs, and you can keep 99% of the LLM bots away. Alibaba seems to be the most common source of bot traffic on my infrastructure lately, and I also see Huawei Cloud from time to time. Not much AWS, probably because of their high IPv4 pricing.
You can allow API access from cloud IPs, as long as you don't do anything expensive before you've authenticated the client.
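A rough sketch of the range check with Python's ipaddress module -- the CIDRs below are placeholders only; the real ones come from each provider's published feeds (e.g. AWS's ip-ranges.json):

  import ipaddress

  # Placeholder ranges -- load the real ones from each provider's published feed.
  CLOUD_RANGES = [ipaddress.ip_network(net) for net in (
      "47.74.0.0/15",   # example Alibaba Cloud range
      "3.0.0.0/9",      # example AWS range
  )]

  def is_cloud_ip(addr):
      ip = ipaddress.ip_address(addr)
      return any(ip in net for net in CLOUD_RANGES)

  # In the request path, something like:
  # if is_cloud_ip(request.remote_addr) and not request.path.startswith("/api/"):
  #     return "Too many requests", 429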
“…they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses - mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure - actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.”
So it looks like much of the traffic, particularly from China, is indeed using consumer IPs to disguise itself. That's why they blocked based on browser type (MS Edge, in this case).
This matches exactly with what I'm seeing on my own sites too and it's from all over the world, not just China.
(I described my bot woes a few weeks ago at https://news.ycombinator.com/item?id=43208623. The "just block bots!" replies were well-intentioned but naive -- I've still found no signal that works reliably well to distinguish bots from real traffic.)
I saw a fair amount of that kind of behavior, too, mostly around the summer of last year. At some point it dropped off sharply. Over the last few months, at least for the servers I keep an eye on, most of the trouble has been from Chinese cloud IPs.
Either the LLM devs got more funding, or maybe the authorities took down the botnet they were using.
Because while this is clearly related to spam, it's not the same thing, and presumably if Spamhaus themselves felt it was within their wheelhouse, they'd already be doing it.
This sounds backwards to me. If you maintain a list of IPs but they are constantly cycling them, it'll get out of date quickly, whereas a captcha-like system will (hopefully) always stop bot traffic.
While some of the residential IPs are from malware, a lot of it is from residential IP proxies, where people are paid to run proxy software from their home. If it starts getting around that people who run this software quickly become blocked by the majority of the internet that will lessen that part of the problem.
Only if your CAPTCHA-like is hurled at every client indiscriminately. Otherwise you'll end up right back where Spamhaus started: maintaining your own list of good and bad actors.
The advantage of a third party service is that you're sharing intel of bad actors.
> According to Drew, LLM crawlers don't respect robots.txt requirements and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.
How do they know that these are LLM crawlers and not anything else?
As someone who is also affected by this: we see a manifold increase in requests since this LLM crap started. Many of these IPs come from companies that obviously work with LLM technology, but the problem is that it's 100s of IPs doing 1 request, not 1 IP doing 100s of requests. It's just extremely unlikely that anyone else is responsible for this.
> IPs come from companies that obviously work with LLM technology
Like from their own ASNs you're saying? Or how are you connecting the IPs with the company?
> is that it's 100s of IPs doing 1 request
Are all of those IPs within the same ranges or scattered?
Thanks a lot for taking the time to talk about your experience btw, as someone who hasn't been hit by this it's interesting to have more details about it before it eventually happens.
> How do they know that these are LLM crawlers and not anything else?
I can tell you what it looks like in the case of a git web interface like cgit: you get a burst of one or two isolated requests from a large number of IPs, each for very obscure (but different) URLs, like file contents at a specific commit ID, with the user agent suggesting it's coming from an iPhone or Android.
It's a situation where it's difficult to tell for individual requests at request handling time, but easy to see when you look at the total request volume.
I had a similar issue with a Gitea instance that has some public repos. Gitea doesn't follow best practices for a web application and the action to create an archive of a repo is an HTTP GET request instead of an HTTP POST. AI crawlers were hitting that link over and over again causing the server to repeatedly run itself out of disk space.
God... I literally just coded up three different honeypots for this exact problem on my site, https://golfcourse.wiki, because LLM scrapers are a constant problem. I added a stupid reCAPTCHA to the sign-up form after literally 10,000 fake users were created by bots, averaging about 50 per day, and I have to say, reCAPTCHA was surprisingly cumbersome to set up.
It's awful and it was costing me non-trivial amounts of money just from the constant pinging at all hours, for thousands of pages that absolutely do not need to be scraped. Which is just insane, because I actively design robots.txt to direct the robots to the correct pages to scrape.
So far so good with the honeypots, but I'll probably be creating more and clamping down harder on robots.txt to simply whitelist instead of blacklist. I'm thinking of even throwing in a robots honeypot directly in sitemap.xml that should bait robots to visit when they're not following the robots.txt.
> They do so using random User-Agents from tens of thousands of IP addresses,
"tens of thousands" ? I think not:
% sudo fail2ban-client status gitbots | more
Status for the jail: gitbots
|- Filter
| |- Currently failed: 0
| |- Total failed: 573555
| `- File list: /var/log/nginx/gitea_access.log
`- Actions
|- Currently banned: 78671
|- Total banned: 573074
I see this going the way of email, with larger, well-known, and more well-behaved crawlers being allowed to index websites for free, and smaller, unknown crawlers suffering brownouts, getting banned, or having to pay for access. It will be harder to self-host a website or run your own crawler.
Yes this does remind me of the old spam wars of early 2000s. Back then collaborative block lists were useful to reject senders at IP level before using a Bayesian system on the message itself.
Even though these bots are using different IPs with each request, that IP may be reused for a different website, and donating those IPs to a central system could help identify entire subnets to block.
Another trick was “tar-pitting” suspect senders (browser agent for example) to slow their message down and delay their process.
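The same trick translates to HTTP: trickle the response out to suspect clients so the connection ties up the bot instead of your backend. A toy sketch with Flask's streaming responses (the "suspicious" check is a placeholder for whatever signal you trust):

  import time
  from flask import Flask, Response, request

  app = Flask(__name__)

  def looks_suspicious(req):
      # Placeholder heuristic -- in reality this would come from your own signals.
      return "curl" in req.headers.get("User-Agent", "").lower()

  @app.route("/article")
  def article():
      body = b"<html><body>" + b"Some page content. " * 200 + b"</body></html>"
      if not looks_suspicious(request):
          return body

      def trickle():
          # Send the page a few bytes at a time with long pauses: a tarpit.
          for i in range(0, len(body), 64):
              yield body[i:i + 64]
              time.sleep(2)

      return Response(trickle(), mimetype="text/html")

The obvious caveat: with synchronous workers this also ties up one of your own threads per connection, so it only really makes sense behind an async server.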
I think the solution to this problem is the same as with scammers and is an analog one.
Bust the kneecaps of all the people responsible for those crawlers. Publicly. And all of them: from the person entering the command to the CEO of the company going through all the middle management. You did not go against this policy? Intact kneecaps are a privilege which just got revoked in your case.
Thanks for sharing. If I understood correctly, you have rate-limited specific URLs (those with commit IDs) that are infrequently requested by users but frequently by bots. Which means, provided the bots keep requesting them, any user request will most likely end up being denied. In this case a simpler solution might be to just block such URLs outright. The only advantage of your more complex solution that I can see is that if the bots stop requesting these URLs, they become accessible to normal users again. Or am I missing something?
My guess after reading the same -- the bot traffic comes in bursts and targets a specific commit hash for a while. Users are unlikely to need that specific commit, and even less likely to need it at the same time a bot is bursting requests for it. There's probably a small risk of denying a real user, but there's a large reduction in traffic from the bots making it to git; a worthwhile trade.
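Something like a per-resource burst counter, in other words. A toy Python sketch of that idea (thresholds invented for illustration):

  import time
  from collections import defaultdict, deque

  WINDOW_SECONDS = 60
  BURST_THRESHOLD = 30     # invented: >30 hits on one commit URL per minute smells like a bot swarm
  COOLDOWN_SECONDS = 600

  hits = defaultdict(deque)   # url -> timestamps of recent requests
  blocked_until = {}          # url -> unix time when it becomes servable again

  def should_serve(url):
      now = time.time()
      if blocked_until.get(url, 0) > now:
          return False
      window = hits[url]
      window.append(now)
      while window and window[0] < now - WINDOW_SECONDS:
          window.popleft()
      if len(window) > BURST_THRESHOLD:
          blocked_until[url] = now + COOLDOWN_SECONDS
          return False
      return True

Which matches the trade-off above: the occasional real user asking for that exact commit during a burst gets refused, but the expensive backend stays up.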
From reading Drew DeVault's angry post from earlier this week, my take is that it's not only poorly implemented crawlers; it's also that it's cheaper to scrape than to keep copies on hand. Effectively these companies are outsourcing the storage of "their" training data to everyone on the internet.
Ideally a site would get scraped once, and then the scraper would check if content has changed, e.g. etag, while also learning how frequently content changes. So rather than just hammer some poor personal git repo over and over, it would learn that Monday is a good time to check if something changed and then back off for a week.
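Conditional requests make that nearly free for both sides. A sketch with the Python requests library (the URL is a placeholder):

  import requests

  url = "https://example.org/some/page"   # placeholder

  first = requests.get(url)
  etag = first.headers.get("ETag")
  last_modified = first.headers.get("Last-Modified")

  # On the next visit, ask "has this changed?" instead of re-downloading.
  headers = {}
  if etag:
      headers["If-None-Match"] = etag
  if last_modified:
      headers["If-Modified-Since"] = last_modified

  second = requests.get(url, headers=headers)
  if second.status_code == 304:
      print("unchanged -- nothing to fetch, nearly free for the server")
  else:
      print("changed -- re-process the new content")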
That seems crazy - millions of $ on GPUs but they can't afford some cheap storage? And direct network scraping seems super high latency. Although I guess a massive pretraining run might cycle through the corpus very slowly. Dunno, sounds fishy.
Or, could it be, just possibly be (gasp), that some of the devs at these "hotshot" AI companies are just ignorant or lazy or pressured enough, so as to not do such normal checks? Wouldn't be surprised if so.
You think they do cache the data but don't use it?
For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.
>You think they do cache the data but don't use it?
That's not what I meant.
And it is not "they", it is "it".
I.e. the web server, not the bots or devs on the other end of the connection, is what tells you the needed info. All you have to do is check it and act accordingly, i.e. download the changed resource or skip the unchanged one.
It's that not doing so means they can increase their profit numbers just a skosh more.
And at least as long as they haven't IPOed, that number's the only thing that matters. Everything getting in the way of increasing it is just an obstacle to be removed.
You are correct it's poor and sloppy, but it's not "just" that. It's a lack of concern over the effects of their poor/sloppy crawler implementation.
The poor implementation is not really relevant; it's companies deciding they own the internet and can take whatever they want, and letting everyone else deal with the consequences. The companies do not care what the impact of their AI nonsense is.
It’s people that don’t care if they ruin things for everyone else.
Crawlers have existed forever in search engine space and mostly behave.
This sort of no rate limit, fake user agent, 100s of IPs approach used by AI teams is obviously intentionally not caring who it fucks over. More malicious than sloppy implementation
It is an ecosystem of social roles, not just "people". Casting the decision into individual choices is not the right filter for understanding this situation.
I'm not sure I'm following what you mean by 'social roles'. Which roles are you referring to here?
I'll disagree that it's not at least individual malicious choice, though. Someone decided that they needed to fake/change user agents (as one example), and implemented it. Most likely it was more than one person- some manager(s)/teams probably also either suggested or agreed to this choice.
I would like to think at some point in this decision making process, someone would have considered 'is it ethical to change user agents to get around bans? Is it ethical to ignore robots.txt?' and decided not to proceed, but apparently that's not happening here...
Yet in these cases mentioned in the article, if they had a static HTTP cache of each page, refreshed by git hooks, the bots' load on their services would be negligible. That is assuming the bots use HTTP on 80/443 instead of the git protocol on 9418.
Sounds like it to me. Why build a crawler that fetches one HTML page per commit in a repository instead of doing a bare clone and then just getting the data from there? Surely that would contain even more data, compared to the HTML pages.
And poor, sloppy website implementation. If your professional website can't handle 20k hits, it's... well, poor. Because my website, hosted on my desktop PC over my home connection, tanked 20k hits from the Alibaba bot (among a few thousand more of normal traffic) yesterday without missing a beat.
It is literally the point of public websites to answer HTTP requests. If yours can't you're doing something wrong.
This. Not only mangle the content: flood the bot with tailored misinformation, and with things that are illegal in their jurisdiction but not yours.
They will never respect you, but the second they notice this hurts their business more than it gains them, they will stop.
Yeah, I've had to mitigate this problem too, on a tiny forum serving a very niche audience. It was brought to its knees by a handful of LLM bots that completely ignored robots.txt.
Thankfully, these bots were easy enough to block at the firewall level, but that may not work forever.
This is probably a dumb question, but have they tried sending abuse reports to hosting providers? Or even lawsuits? Most hosting providers take it seriously when their client is sending a DoS attack, because if they don't, they can get kicked off the internet by their provider.
To be clear, this is not an attack in the deliberate sense, and has nothing to do with AI except in that AI companies want to crawl the internet. This is more "FOSS sites damaged by extreme incompetence and unaccountability." The crawlers could just as well be search engine startups.
As a matter of fact, yes. As a matter of cause, no. Being an AI company doesn't make these companies especially incompetent; rather, this is the normal tech company level of incompetence, and being an AI company causes it to externalize via crawling.
Bing used to do the same thing. (It might still do it, I just haven't heard about it in a while.)
It’s not incompetence. These large AI companies have (or can hire) the competency to engineer proper crawlers. This is deliberate due to lack of accountability and “who’s gonna stop them?”
In the same way a drunk driver isn't deliberately trying to run over pedestrians, I suppose. I think gross negligence is in many ways worse than malice. A malicious actor can at least be somewhat reasoned with.
Correct, they could -- but they are not. This is about the unaccountability, and if I were charitable (which I'm not) I'd also add the incompetence, of the techbros leading the AI giants. Are we still expecting anything like "ethics"? I hope the few engineers reading HN still have some, but the higher you go, the more foreign the concept gets.
Sure, I just think focusing it on AI companies misses the reason this happens. It's not an "AI company problem", it's a "tech company problem". It just happens that AI companies are the tech companies that externalize their incompetence with crawlers at this point in time.
You're getting it the wrong way around. It's: any crawler that's not well engineered, that doesn't follow robots.txt, that fakes its User Agent, that doesn't allow you to contact them, that fetches content an indiscriminate number of times, repeatedly, all day long... can do this to your infrastructure unless you're a giant.
What these crawlers are doing is akin to DDoS attacks.
Please do explain how you'd engineer a site to deal with barrage of poorly written scrapers descending upon it. After you've done geo-ip routing, implemented various levels of caching, separated read/write traffic and bought an ever increasing amount of bandwidth, what is there left to do?
You could also get Cloudflare, or some other CDN, but depending on your size that might not be within your budget. I don't get why the rest of the internet should subsidize these AI companies. They're not profitable, live off venture capital, and increase the operating costs of everyone else.
And you just know they'll gladly bill you for egress charges for their own bot traffic, too.
EDIT: Actually, this is an excellent question. By default, these bots would likely appear to come from "the internet" and thus be subject to egress charges for data transfers. Since all three major cloud providers also have significant interests in AI, wouldn't this be a sort of "silent" price increase, or a form of exploitive revenue pumping? There's nothing stopping Google, Microsoft/OpenAI, or Amazon from sending an army of bots against your sites, scraping the data, and then stiffing you with the charges for their own bots' traffic. Would be curious if anyone has read the T&Cs of their own rate cards closely enough to see if that's the case, or has proof in their billing metrics.
---
Original post continues below:
One topic of conversation I think worth having in light of this is why we still agree to charge for bandwidth consumed instead of bandwidth available, just as a general industry practice. Bits are cheap in the grand scheme of things, even free, since all the associated costs are for the actual hardware infrastructure and human labor involved in setup and maintenance -- the actual cost per bit transmitted is so infinitesimally small that it's impractical to bill for.
It seems to me a better solution is to go back to charging for capacity instead of consumption, at least in an effort to reduce consumption charges for projects hosted. In the meantime, I'm 100% behind blocking entire ASNs and IP blocks from accessing websites or services in an effort to reduce abuse. I know a prior post about blocking the entirety of AWS ingress traffic got a high degree of skepticism and flack from the HN community about its utility, but now more than ever it seems highly relevant to those of us managing infrastructure.
Also, as an aside: all the more reason not to deploy SRV records for home-hosted services. I suspect these bots are just querying standard HTTP/S ports, and so my gut (but NOT data - I purposely don't collect analytics, even at home, so I have NO HARD EVIDENCE FOR THIS CLAIM) suggests that having nothing directly available on 80/443 will greatly limit potential scrapers.
So the idea is each request would require the client to pay a toll of some amount of work towards mining a cryptocurrency? That's actually brilliant. I'd take this over ads any day. But I do see a few problems...
1. Using the web would become much more compute/energy intensive and old devices would quickly lose access to the modern web.
2. Some hosts would inevitably double-dip by implementing this and ads or by "overcharging" the amount of work. There would have to be some kind of limit on how much work can be required by hosts - or at least some way to monitor and hold hosts accountable for the amount of work they charge.
3. There would need to be a cheap and reliable way to prove the client's work was correct and accurate. Otherwise people will inevitably find a way to spoof the work in order to reduce their compute/energy cost.
Interesting. Basically moving the proof-of-work off the user's phone and to a dedicated mine. Websites could just have a lightning wallet or something and auto-charge the user 1e-7 bitcoin to access the page.
There are off-chain solutions to handle most of the payment, and only put a summary on-chain. I think there are already micropayments in Brave or something.
Proof-of-work crypto is interesting here because it is fungible with computation, so these solutions that charge computation to users are literally equivalent to crypto.
It's a solution that already has adoption, does not require everyone to sign up with a centralized service, and does not require everyone to pay money (they can pay with small amounts of computation instead) so it remains accessible to ~everyone.
Yes, sites could use a bot protection service that runs captcha breaking AIs on the viewer's browser. Said bot protection service could then break captchas for forum spammers to make real money.
I was thinking this would involve farming the mining (the energy intensive part) out to clients. That basically just means they have to do sha hashes at some difficulty. The good thing is if you do 10 hashes at difficulty 5 you'd expect one to also pattern match difficulty 6, so I expect even low-difficulty hashing will eventually result in a block mine.
Of course it isn't very secure, because if the client sees a mined block they might have the technical savvy to keep it. But you'd be forcing big web scrapers to run a horribly inefficient mining operation, and they'd hate it. Plus you can run a blacklist of hated clients and double the difficulty for them, which is very low-cost for false positives and very high-cost for real scrapers -- that isn't a result of using Bitcoin, but it'd be funny.
The days of an open web are long gone. Every server will eventually have to require authentication for access, and to get an account you will have to provide some form of payment or social proof.
Honestly, I don't see it necessarily as a bad thing.
If that was piggy-backed on ActivityPub, Matrix, Solid, or something else decentralised, and if I could say "this bot is acting as my agent, if it misbehaves then I personally get blocked" then there could be something in this. I don't see how to get around artificial identity farms though. That's also not something that payment or social proof fixes. If payment isn't trivial then you exclude genuine people; if it's the act of interacting with a payment processor that's being taken as proof-of-existence then it's outsourcing the ability to interact with anything in the modern world to Visa and Mastercard. That's bad. Social proof is also problematic because if your business is to run an identity farm, then having all your identities interact in legible ways isn't hard, so the social proof needs to be grounded in something global, and there are approximately no good choices.
It doesn't have a global solution and it doesn't need to be implemented only on a specific technology-based system.
I mean, at Communick I offer Matrix, Mastodon, Funkwhale and Lemmy accounts only to paying customers. As such, I have implemented payments via Stripe for convenience, but that didn't stop me from getting customers who wanted to pay directly via crypto, SEPA and even cash. It also didn't stop me from bypassing the whole system and giving my friends and family accounts directly.
Right, so the gap I'm seeing is that anyone who wants to identity farm just does what you've done. The problem isn't the assurance that you receive, it's the assurance you give to anyone else that any of your customers are flesh-and-blood.
I'm talking about social proof as in "You are a student of the city university, so you get an account at the library", "Julie from the book reading group wanted an account at our Bookwyrm server, so I made an account for her" or even "Unnamed customer who signed up for Cingular Wireless and was given an authorization code to access Level 2 support directly".
This is being naive about the kinds of gatekeeping and social proof occurring today. I fully believe you didn't intend to mention social proof to be racist, but with people like Zuck and Elon removing DEI, being racist is social proof you belong in their elite club.
> This is being naive about the kinds of gatekeeping and social proof occurring today
You are taking one thing I said (service providers will require some form of payment or social proof to give credentials to people who want to access the service), assumed the worst possible interpretation (people will only implement the worst possible forms of social proofing), and to top it off you added something else (gatekeeping) entirely on your own.
I can not dictate how you interpret my comment, but maybe could you be a bit more charitable and assume positive intent when talking with people you never met?
AI as it currently exists is a massive blight on society. The best things being done with AI are cute side projects, whereas the most common usages are low-quality thefts of human copyrighted work without attribution.
20 years ago the fear of AI was that it would take over the world and try to kill us. Today we can clearly see that the threat of AI is the amoral humans that control it.
I was also thinking about VPNs but the static copy still has to serve a lot of traffic so I don't know if that's an economically viable solution. Furthermore it creates a market for VPN credentials, but that's another issue. At least I expect that a bot with sold or stolen credential will be easier to discover.
Anyway, why not git clone the project and parse it locally instead of scraping the web pages? I understand that scraping works on every kind of content, but given the scale, a git clone plus periodic git fetch could save money even for the scrapers.
Finally, all of this reminds me of Peter Watts's Maelstrom, where viruses infested the Internet so much (at about this point in history) that nobody was using it anymore [1]
Can IPFS or torrent and large local databases decentralised by people be a solution to this? I personally have the resources to share and host TBs of data but didn't find a good use to it.
For that to work, a website has to push a mirror into that alternate system, and the scraper has to know the associated mirror exists.
That's two big "ifs" for something I'm not aware of a standardized way of announcing. And the entire thing crumbles as soon as someone who wants every drop of data possible says "crawl their sites anyway to make sure they didn't forget to publish anything into the 2nd system."
I doubt it, as the article mentions scraping the same resource after just 6 hours. AI companies want to make sure they have fresh data, while it would be hard to keep such a database updated.
I was inspired by https://en.wikipedia.org/wiki/Hashcash, which was proof of work for email to disincentivize spam. To my horror, it worked sufficiently for my git server so I released it as open source. It's now its own project and protects big sites like GNOME's GitLab.
That's cool! What if instead of SHA-256 you used one of those memory-hard functions like scrypt? Or is SHA needed because it has a native implementation in browsers?
Right now I'm using SHA-256 because this project was originally written as a vibe sesh rage against the machine. The second reason is that the combination of Chrome/Firefox/Safari's JIT and webcrypto being native C++ is probably faster than what I could write myself. Amusingly, supporting this means it works on very old/anemic PCs like PowerMac G5 (which doesn't support WebAssembly because it's big-endian).
I'm gonna do experiments with xeiaso.net as the main testing ground.
Interesting idea. Seems to me it might be possible to use with a Monero mining challenge instead, for those low real traffic applications where most of the requests are sure to be bots.
I'm curious if the PoW component is really necessary, AIUI untargeted crawlers are usually curl wrappers which don't run Javascript, so requiring even a trivial amount of JS would defeat them. Unless AI companies are so flush with cash that they can afford to just use headless Chrome for everything, efficiency be damned.
Sadly, in testing the proof of work is needed. The scrapers run JS because if you don't run JS the modern web is broken. Anubis is tactically designed to make them use modern versions of Firefox/Chrome at least.
They really do use headless chrome for everything. My testing has shown a lot of them are on Digital Ocean. I have a list of IP addresses in case someone from there is reading this and can have a come to jesus conversation with those AI companies.
Use judo techniques. Use their own computing power against them: fake links leading to randomly generated Markov bullshit, until their cache gets poisoned past the point of no return; the LLMs begin to either forget their own stuff or hallucinate once their input is basically fed from other LLMs (or themselves).
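The Markov part is genuinely cheap to produce. A toy sketch in Python (the seed corpus file is hypothetical, and whether poisoning actually hurts training at scale is very much unproven):

  import random
  from collections import defaultdict

  def build_chain(text):
      """Map each word to the words observed to follow it."""
      chain = defaultdict(list)
      words = text.split()
      for a, b in zip(words, words[1:]):
          chain[a].append(b)
      return chain

  def babble(chain, length=80):
      word = random.choice(list(chain))
      out = [word]
      for _ in range(length - 1):
          followers = chain.get(word)
          word = random.choice(followers) if followers else random.choice(list(chain))
          out.append(word)
      return " ".join(out)

  seed = open("real_posts.txt").read()   # hypothetical corpus of your real writing
  chain = build_chain(seed)
  print(babble(chain))   # fluent-looking nonsense to hide behind invisible links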
What would those 5 lines of code look like? The basis of this solution is that it asks the client to solve a computationally intensive problem whose solution, once provided, isn't computationally intensive to check. How would those 5 lines of code change this?
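For reference, the whole scheme really is small. A minimal hashcash-style sketch of the asymmetry (not Anubis's actual code; parameters invented):

  import hashlib
  import secrets

  DIFFICULTY_BITS = 20  # invented: ~a million hashes expected for the client, one hash for the server

  def meets_difficulty(digest, bits):
      value = int.from_bytes(digest, "big")
      return value >> (256 - bits) == 0   # require `bits` leading zero bits

  def solve(challenge, bits=DIFFICULTY_BITS):
      """Client side: expensive brute force."""
      counter = 0
      while True:
          digest = hashlib.sha256(challenge + str(counter).encode()).digest()
          if meets_difficulty(digest, bits):
              return counter
          counter += 1

  def verify(challenge, counter, bits=DIFFICULTY_BITS):
      """Server side: a single hash."""
      digest = hashlib.sha256(challenge + str(counter).encode()).digest()
      return meets_difficulty(digest, bits)

  challenge = secrets.token_bytes(16)   # server issues a fresh random challenge
  answer = solve(challenge)             # client burns CPU
  assert verify(challenge, answer)      # server checks it in microseconds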
I see this as a temporary problem. A human brain can be trained on way less than the entire corpus of everything humans have ever written. Ultimately, this will apply to LLMs (or whatever succeeds them) too.
There's another aspect to this too: China and DeepSeek. While this was released by a private company, I think there's a not-insignificant chance that it reflects Chinese government policy to "commoditize your complements" [1]. Companies like OpenAI want to hide their secret sauce so it can't be reproduced. Training an LLM is expensive. If there are high-quality LLMs out there for free that you can just download, then this moat completely evaporates.
Traditionally, holders of IP ranges that attack the internet at large get kicked off the internet by having those ranges blacklisted everywhere. This can also get them in serious trouble with the places they got their IP ranges (I assume AWS has them directly from ARIN, so maybe not) and their upstream bandwidth providers and so on, as well as making them less attractive hosts because they are blocked everywhere.
That's actually an argument in favour of kicking AWS off the Internet. We rely too much on their services, to the point we're afraid of banning their IPs if they do something bad. Better stop this now than being worse off later. The best moment would have been ten years ago, the second best moment is today.
Or even worse, lots of them are using barely legal residential proxies so the requests are coming from everywhere. In Drew DeVault's article linked in this post he complained precisely about the residential-looking source IP addresses [0]. And I think I remember something about a Chinese company, some months ago, very aggressively scraping using that method.
Companies like DataImpulse [1] or ScraperAPI [2] will happily advertise their services for exactly that purpose.
Across my sites -- mostly open data sites -- the top 10 referrers are all bots. That doesn't include the long tail of randomized user agents that we get from the Alibaba netblocks.
At this point, I think we're well under 1% actual users on a good day.
Can someone with more experience developing AI tools explain what these bots are mostly doing? Are they collecting data for training, or are they for the more recent search functionality? Or are they enhancing responses with links?
AI expert here. It's probably for collecting training data and the crawlers are probably very unsupervised. I'd guess that they're literally the most simplistic crawler code you can imagine combined with parallelism across machines.
The good news is that it's easy to disrupt these crawlers with some easy hacks. Tactical reverse slowloris is probably gonna make a comeback.
If it's for training data, why are they straining FOSS so much? Are there thousands of actors repeatedly building training data all the time? I thought it was a sort of one-off thing with the big tech players.
Git forges are some of the worst case for this. The scrapers click on every link on every page. If you do this to a git forge, it gets very O(scary) very fast because you have to look at data that is not frequently looked at and will NOT be cached. Most of how git forges are fast is through caching.
The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.
Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.
That's very strange to me that they do it everyday. I thought training runs took months. Do they throw away the vast majority of their training attempts (e.g. one had suboptimal hyperparameters, etc)?
It's going to get to the point where everything will be put behind a login to prevent LLM scrapers scanning a site. Annoying but the only option I can think of. If they use an account for scraping you just ban the account.
Logins are more easily banned, and highly complex captchas at signup need a human to sign up and solve. As long as it's easier to get banned than it is to sign up, it will at least be a deterrent.
Copyright the content and sue those who use it for AI training. I believe there is a lot of low-hanging fruit for lawyers here. I would be surprised if they weren't preparing to hit OpenAI and the like. Very badly. Google got away with its deep-linking issues because publishers, after all, had some interest in being linked from the search engine; here publishers see zero value.
Whatever happened to terms of use and EULAs? These big companies use them against individuals all the time, so why can't small sites state in their Terms of Use, or put up a EULA stating, that no crawling/copying is allowed? Shouldn't that open up avenues to sue?
Even if the lawyer services are provided pro-bono, it is still a large time commitment and added stress for the non-lawyer people involved for cases that aren't guaranteed to win.
Because reckless and greedy AI operators not only endanger FOSS projects, they threaten to collapse the freely accessible internet as a whole. Sooner or later, we will need to fight for our freedom and our rights as individual humans against rogue AI and the overwhelming power of the mega-corporations, just as we need to fight against the concentration of content behind corporate gates today.
And I don't see any other way than going the legal route against these operators. They don't give a sh*t about the little humans, nor about copyright and other legal regulations.
I do wonder if it's a customer on Alibaba Cloud that's doing all this, or if it's entirely Alibaba's own doing that's been wreaking havoc on everyone's sites (mine included).
I don't really like blocking an entire ASN, especially since I don't mind (responsible) crawling to begin with, but I was left with no choice
Right, then the claim of this being caused by "AI companies" feels a lot more dubious, because at least this time the perpetrator has just been this one customer on Alibaba Cloud, and we don't actually know what they're up to.
Niccolò here, I'm really sorry about that -- I'm using a weird tooling system to handle articles, which currently has issues with links. I'm working to fix that asap.
I wonder if there's a way for a human-controlled browser to generate some kind of token that proves it's being used by a human?
Obviously there's still ways to pay people to run the browser but it would be nice for this activity to cost the AI company something without blocking actual people.
LLM bots are doing a great job of stress testing infra, so if you are running abominations like Gitlab or any terribly coded site and you are exposing it to the internet, you are just asking for trouble. If anything, Gitlab should stop pumping bloat and focus on some performance, because it's really bad.
I would hope FOSS projects would stick to something like Forgejo, although I am not sure about the state of its CI/CD. My guess is that they are 85% of the way there with 1/10 of GitLab's resources.
On the other side are of course badly coded bots that are aggressively trying to download everything. This was happening before LLMs and it just increased significantly because of them. I think we will reach a tipping point soon and then we will just assume those bots are just another malicious actor (like regular DDOS), and we will start actively taking them down, even with help of law enforcement.
The last thing I wanna see is 3-second bot challenges on every single site I visit; cookie banners are more than enough of a nightmare already.
Would it be possible to use security flaws in their crawler to make them do very expensive things? In the case of China, illegal (anti CCP) things? I don't think many of those companies are too rigorous about their cybersecurity, and cybersecurity is expensive per se anyway.
To me it sounds like these people are operating websites that don't work. My website, hosted on my home internet connection (80 down, 5 up), handled 20k+ hits from the Alibaba AI crawler yesterday without missing a beat. And many thousands more from GPTBot, etc.
I'll grant it can be a problem for super-heavy "application" websites where every GET is a serious computation. So I'm not surprised gitlab is having problems. They've literally the most bloated and heaviest website I've ever seen. Maybe applications shouldn't be websites.
But this spreading social hysteria, this belief that all non-humans are dangerous and must be blocked is a nerd snipe. It really doesn't apply to most situations. Just running a website as we've always run them, public, and letting all user-agents access, is much less resource intensive than these various "mitigations" people are implementing. Mitigations which end up being worse than the problem itself in terms of preventing actual humans from reading text.
There's source repository browsers (git/svn) way, way leaner than GitLab that have the same issues. Any repo browser offering a blame view for files can be brought down by those bots' traffic patterns. I have been hosting such repository browsers for 10+ years and it was never an issue until the arrival of these bots.
Indeed. It's really exposing a major downside to running applications in browser context. It never really made sense. These applications really don't want public traffic like actual websites do. They should remain applications and stay off the web. But more likely is that the web will be destroyed to fit the requirements of the applications. Like what cloudflare, etc, and all this anti-bot social hysteria is doing.
Alternatively, be OK with the fact that anything you put in public can be used for anything and by anyone, regardless of licensing and laws. A pirate's dream basically.
Personally, when I first got connected to the internet around 1999, that was the approach I adopted, and I've followed it since: I don't share things I'm not OK with others using for whatever they want.
I wonder how long until everyone gets a cryptographic public key and has to sign every HTTP request to not be blocked. Or every site requires login to use. And social media and things like bug reporting all requiring real ID.
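For what it's worth, the signing half of that is already cheap to do today. A rough sketch of what per-request signing could look like with an Ed25519 keypair and Python's cryptography package; the header scheme and the signed string format here are made up for illustration, not any existing standard:

    # Sketch: sign an HTTP request with Ed25519 (hypothetical header scheme).
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    private_key = Ed25519PrivateKey.generate()   # the client keeps this
    public_key = private_key.public_key()        # the server learns this at signup

    def sign_request(method: str, path: str, date: str) -> bytes:
        # Sign a canonical string covering the parts the server will check.
        return private_key.sign(f"{method} {path} {date}".encode())

    def verify_request(method: str, path: str, date: str, signature: bytes) -> bool:
        try:
            public_key.verify(signature, f"{method} {path} {date}".encode())
            return True
        except InvalidSignature:
            return False

    sig = sign_request("GET", "/repo/blame/main.c", "Tue, 01 Apr 2025 12:00:00 GMT")
    assert verify_request("GET", "/repo/blame/main.c", "Tue, 01 Apr 2025 12:00:00 GMT", sig)

The hard part isn't the crypto, it's who hands out and revokes the keys.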
I wonder how free hosting services like netlify and vercel handle this? If a free tier user gets spammed do they just pass the cost on? I can't imagine that is the case so they must have some protections built in.
What's to stop someone putting some Terms of Service clause on their site, or creating a license which guarantees the site owners ownership of any content generated by scraping their site?
With RL agents being the new frontier, I wouldn't be surprised if some post training runs now include multiple simultaneous calls to the web. That could be part of the flood.
Might be time for a class-action lawsuit. Something like that could work out really well for the little guy, as it would probably make a dent in the LLM companies' pockets and access to data.
From another angle, it's actually good in the long run: AI-generated content is not copyrightable, which means they won't really own the models, as they can and will be distilled by other AIs, making them a public good. So maybe instead of complaining and trying to fight it, just accept the new reality that anything that can be accessed will be accessed by whatever, and instead we should just have APIs for everything, with rate limits and different tiers.
These are DDoS attacks and should be treated in law as such. (Although I do realise that in many countries we no longer have any effective "rule of law".)
At some point it's easier to geoblock a whole country at the firewall level and loginwall the rest of the world, rather than trying to explain that in your jurisdiction, which is not their jurisdiction, what they are doing is a crime — which they don't give a single fuck about.
lol. Tell that to Nicolae and Elena Ceaușescu, Saddam Hussein, Muammar Gaddafi, and all those other tinpot nobjockeys who thought their money and influence would save them from "nice" people.
Don't immediately assume that OpenAI and Anthropic are being better citizens just because they label some of their bots. The American companies are more likely to be extremely aggressive, because they will feel no consequences for their actions.
In all likelihood all of these assholes are paying some unscrupulous suppliers for the data, so the terabytes of traffic aren't immediately attributable to them.
> there's one user reporting one minute delay, and another - from his phone
> out of those only 3% passed Anubi's proof of work, hinting at 97% of the traffic being bots
This doesn't follow. If I open a link from my phone and it shows a spinner and gets hot, I'm closing it long before it gets to one minute and maybe looking for a way to contact the site's maintainer to tell them how annoying it was.
I wonder if the future is for honest crawlers to do something like DKIM to provide a cheap cryptographically verifiable identity, where reputation can be staked on good behavior, and to treat the rest of the traffic like it's a full fledged chrome instance that had better be capable of solving hashcash challenges when traffic gets too hot.
It's a shitty solution, but as it stands the status quo is quite untenable and will eventually have cloudflare as a spooky MITM for all the web's traffic.
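For the hashcash half of that, the mechanics really are simple. A toy sketch below; the difficulty, the challenge format, and the nonce encoding are arbitrary choices, and the DKIM-style identity part is omitted entirely:

    # Sketch of a hashcash-style challenge: the client must find a nonce whose
    # SHA-256 hash of (challenge + nonce) starts with enough zero bits.
    import hashlib, os

    DIFFICULTY_BITS = 20  # tune per client reputation / current load

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def solve(challenge: bytes) -> int:
        nonce = 0
        while True:
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if leading_zero_bits(digest) >= DIFFICULTY_BITS:
                return nonce
            nonce += 1

    def verify(challenge: bytes, nonce: int) -> bool:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return leading_zero_bits(digest) >= DIFFICULTY_BITS

    challenge = os.urandom(16)       # server issues this per request
    nonce = solve(challenge)         # client burns CPU here
    assert verify(challenge, nonce)  # server-side check is a single hash

The asymmetry is the point: verification is one hash, solving is millions, and you can raise the difficulty only for traffic you don't recognise.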
There's one thing worse than a block page being shown to humans because an algorithm decided they're a bot: purposefully false information being shown to humans
It's also just wasting more of the planet's resources as compared to blocking
And it takes more effort, with the only upside being that it's not immediately obvious to the bot that it is being blocked, so it'll suck in more of your pages
I understand that people are exploring options but I wouldn't label this as a solution to anything as of today's state of the art
It'll still trickle down to people using the system and waste people's time (from development, to the people working to produce everything these companies use and mopping up the impact, to eventually the users), but that would definitely resolve one of my concerns
Honestly reminds me of the early email spam wars - clever attempts at blocking, endless whack-a-mole with spoofed addresses. Problem is, now it's hitting open source projects we love. Maybe the AI gold rush is accidentally strangling the very communities that made its own existence possible.
Adaptation - adapt or die. Find a business model that can sustain itself, without the naivety that people will pay for what they can take without consequence.
I am self-hosted (email + web), I quit DNS (registrars are now mostly hostile to noscript/basic (x)html browsers anyway), and I thought it would give me some relief...
Nope.
You don't have only the AI crawlers; you also have scans and hack attempts (which look like script-kiddy stuff), all the time. Some smell of AI strapped to JavaScript web engines (or click farms with real humans???).
Smart: IP ranges from all over the world, and "clouds" make that even worse, since the pwned systems and bad actors (the guys who scan the whole IPv4 internet for its own good AND MANY SELL THE F* SCAN DATA: onyphe, stretchoid.com, etc.) are "moving". In other words, clouds are protecting those guys and weaponizing hackers with their massive network resources, wrecking small hosting. No cloud is spared: AWS, Microsoft, Google, OVH, ucloud.cn, etc.
I send good vibes to the brave small hosts of open source software (as long as they stay noscript/basic (x)html compatible, of course).
Many fixed-IPv4 pwned systems have been referenced by security communities, often for months, sometimes years, and the people with the right leverage don't seem to do a damn thing about it.
Currently I wonder if I should not just block all DigitalOcean IP ranges... and I was about to do the same with the ucloud.cn IP ranges.
The second you host anything on the net, it WILL take a significant amount of your time. Presume you will be pwned; that's why security communities reference each other too.
Then I am thinking of going towards two types of "hosting". First, private IPv6+port ("randomized" for each client, maybe transient in time depending on the service) thanks to those /64 prefixes (maybe /92 prefixes are a thing for mobile internet?). Yes, this is complicated and convoluted. Second, a 'standard' permanent IP, but with services implemented in a _HARDCORE_ simple way, if possible near 100% static. I am thinking of going even further: assembly on bare metal, a custom kernel based on hand compilation of Linux code (RISC-V hardware of course, FPGA for bigger hosting?).
I don't think anything will improve unless carrier scale network operators start to show their teeth.
Outside of disruptive measures like requiring accounts, captchas, or payment, one possible solution would be to use AI against itself by training machine learning models to monitor, flag, and issue challenges to web requests exhibiting "bot-like" behavior. This way, not all web traffic would be disrupted with challenges until the machine learning models have reason to believe the traffic is coming from a bot.
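Something like this doesn't even need a big model to start with. A toy sketch below, with hand-picked features and weights standing in for a real trained classifier; everything here is illustrative, not production logic:

    # Sketch: score a client's recent behaviour and only challenge likely bots.
    # Features and weights are hand-tuned placeholders, not a trained model.
    import math

    def bot_score(requests_per_min: float, robots_txt_fetched: bool,
                  distinct_expensive_paths: int, reuses_cookies: bool) -> float:
        """Return a 0..1 'probably a bot' score via a hand-tuned logistic."""
        z = (0.15 * requests_per_min
             + 0.30 * distinct_expensive_paths   # blame/diff/log views, etc.
             - 2.0 * robots_txt_fetched          # polite crawlers fetch it
             - 1.5 * reuses_cookies              # real browsers keep sessions
             - 3.0)                              # bias toward "human"
        return 1.0 / (1.0 + math.exp(-z))

    def should_challenge(score: float, threshold: float = 0.8) -> bool:
        return score >= threshold

    print(bot_score(requests_per_min=300, robots_txt_fetched=False,
                    distinct_expensive_paths=40, reuses_cookies=False))

In practice you would fit the weights on labelled traffic, but the shape of the solution is the same: only the suspicious tail of traffic ever sees a challenge.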
Almost no one pays attention to 429 for general Web pages. E.g., on an RSS file, Googlebot speeds up repolling. Amazon's RSS feed puller does pay some attention to 429.
503 is at least apparently understood by more crawlers/bots, but they still like to blame the victim: YouTube sends me a condescending (and inaccurate) email when it gets a 503 for ignoring cache headers and other basics it seems...
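For contrast, honouring those codes on the crawler side is a handful of lines. A rough sketch using the third-party requests package; the user agent, delays, and retry counts are placeholders:

    # Sketch: a crawler that actually honours 429/503 and Retry-After.
    import time
    import requests

    def polite_get(url, max_tries=5):
        delay = 5.0
        for _ in range(max_tries):
            resp = requests.get(url, headers={"User-Agent": "ExampleBot/1.0"})
            if resp.status_code not in (429, 503):
                return resp
            # Honour Retry-After if present, otherwise back off exponentially.
            # (Retry-After can also be an HTTP date; this sketch assumes seconds.)
            retry_after = resp.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else delay)
            delay *= 2
        return None  # give up instead of hammering the server

That the big operators don't bother with even this much is the whole complaint.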
As I already noted and got downvoted for it: the incentives to go open source or support it (unless the AI bros start releasing the troves of training data they have already scraped) are really diminishing.
Since the scrapers apparently have no intent of doing so (releasing it), expect commercial open source to achieve de facto protocol status very soon. And the rest may not exist in such a centralised and free manner anymore.
The only thing that really annoys me about this is that the resources get scraped more than once. Sure, update a delta every month or whatever, but can't the AI bros get their sh*t together and share scrapes? It's embarrassing.
The special sauce is in parsing, tokenizing, enriching etc. There is no value in re-scraping, and massive cost, right?
I don't know about text models, but most of the big stock photo companies are offering image generators trained on their own libraries rather than scraped images. Adobe, Getty, Shutterstock, iStock, possibly more. They're positioning themselves as the safe option while the legality of scraping is still up in the air.
I don't think it's possible to develop a frontier model without mass scraping. The economics simply don't add up. You need at least 10 trillion tokens to make an 8 billion parameter model, and at roughly 4 bytes of text per token, 10 trillion tokens is something like 40 terabytes.
You simply can't get 40 terabytes of text without mass scraping.
Can we just have something that replaces the page with test datasets and Disney IP to poison the training data? Or maybe just embed it into the page itself, but hidden?
And Usenet, and IRC with a registered user prereq to join.
Also, set up AI tarpits as fake links with recursive calls. Make them mad with non-curated bullshit made from Markov chain generators until their cache begins to rot forever.
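A Markov tarpit really is about this much code. A toy sketch; the corpus file and URL scheme are placeholders, and you'd want to rate-limit it and keep it out of any path a real user could stumble into:

    # Sketch of a Markov-chain tarpit page: endless word salad plus
    # recursive fake links so the crawl never terminates.
    import random

    corpus = open("any_text_dump.txt").read().split()   # placeholder corpus
    chain: dict[str, list[str]] = {}
    for a, b in zip(corpus, corpus[1:]):
        chain.setdefault(a, []).append(b)

    def babble(n_words: int = 300) -> str:
        word = random.choice(corpus)
        out = [word]
        for _ in range(n_words):
            word = random.choice(chain.get(word) or corpus)
            out.append(word)
        return " ".join(out)

    def tarpit_page(depth: int) -> str:
        # Every generated page links to five more generated pages.
        links = "".join(f'<a href="/tarpit/{depth + 1}/{i}">more</a> '
                        for i in range(5))
        return f"<html><body><p>{babble()}</p>{links}</body></html>"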
This problem will likely only get worse, so I'd be interested to see how people adapt. I was thinking about sending data through the mail like the old days. Maybe we go back to the original Tim Berners-Lee Xanadu setup charging users small amounts for access but working out ISP or VPN deals to give subscribers enough credit to browse without issues.
Also I would argue that not having capitalist incentives baked directly into the network is what made the web work, for good or bad. Xanadu would never have gotten off the ground if people had to pay their ISP then pay for every website, or every packet, or every clicked link or whatever.
Reading the Xanadu page on Wikipedia tells me "Every document can contain a royalty mechanism at any desired degree of granularity to ensure payment on any portion accessed, including virtual copies ("transclusions") of all or part of the document."
Oops, you're right! They claimed that Tim Berners-Lee stole their idea.
I agree that the lack of monetization was important to the development and that it would have been chaos as proposed, but will the current setup be sustainable forever in the world of AI?
We have projects like Ethereum that are specifically intended to merge payments and computing, and I wouldn't be surprised if at some point in the future, some kind of small access fee negotiated in the background without direct user involvement becomes a component of access. I wouldn't expect people to pay ISPs but rather some kind of token exchange to occur that would benefit both the network operators and the web hosts by verifying classes of users. Non-fungible token exchanges could be used as a kind of CAPTCHA replacement by cryptographically verifying users anonymously with a third-party token holder as the intermediary.
For example, let's say Mullvad or some other VPN company purchased a small amount of verification tokens for its subscribers who pay them anonymously for an account. On the other side, let's say a government requires people to register through their ISP, and the ISP purchases the same tokens on behalf of the user, and then exchanges the tokens on behalf of the user. In either case, the person can stand behind a third party who both sends them the data they requested and exchanges the verification tokens, which the site operator could then exchange for reimbursement of their services to their hosting provider.
This is just a high-level idea of how we might get around the challenges of a web dominated by bots and AI, but I'm sure the reality of it will be more interesting.
I hate AI as much as any reasonable person should, but I don't think money is a viable filter when governments and corporations will just throw as much money, legislation, and infrastructure at it as needed to render it irrelevant. They can just budget it in, or pass laws requiring privileged access.
Meanwhile, as profit motives begin to dominate (as they inevitably would), access to information and resources becomes more and more of a privilege than a right, and everything becomes more commercialized, faster.
I won't claim to have a better idea, though. The best solutions in my mind are simply not publishing anything to the web and letting AI choke on its own vomit, or poisoning anything you do publish, somehow.
Usenet, as far as I remember, used to be a fucking hell to maintain right. With each server having to basically mirror everything, it was a hog on bandwidth and storage, and most server software in its heyday was a hog on the filesystems of its day (you had to make sure you had plenty of inodes to spare).
The other day, I logged into Usenet via Eternal September, and found that it consisted of 95% zombies sending spam you could recognize from the start of the millennium. On one hand, it made me feel pretty nostalgic. Yay, 9/11 conspiracy theories! Yay, more all-caps deranged Illuminati conspiracies! Yay, Nigerian princes! Yay, dick pills! And an occasional on-topic message which strangely felt out of place.
On the other hand, I felt like I was in a half-dark mall bereft of most of its tenants, where the only places left are an 85-year-old watch repair shop and a photocopy service on the other end of the floor. On still another hand, it turns out I haven't missed much by not being on Usenet, as all-caps deranged conspiracy shit abounds on Facebook anyway.
I would welcome a modern replacement for Usenet, but I feel like it would need a thorough redesign based on modern connectivity patterns and computing realities.
Culturally, the modern replacement for Usenet is probably Reddit. Architecturally, probably something built on top of a federated protocol like ActivityPub (Mastodon) or Nostr (Lemmy).
But I guess realistically you can't fight entropy forever. Even Hacker News, aggressively moderated as it is, is slowly but irrevocably degrading over time.
Usenet wasn't that bad if you didn't take the binary groups.
> and found that it consisted of 95% zombies sending spam you could recognize from the start of the millennium
I like to imagine a forgotten server, running since the mid-90s, its owners long since imprisoned for tax fraud, still pumping out its daily quota of penis enlargement spam.
The distributed nature of git is fine until you want to serve it to the world - then, you're back to bad actors. They're looking for commits because it's nicely chunked, I'm taking a guess.
> They're looking for commits because it's nicely chunked, I'm taking a guess.
They're not looking for anything specifically from what I can tell. If that was the case, they would be just cloning the git repository, as it would be the easiest way to ingest such information. Instead, they just want to guzzle every single URL they can get hold of. And a web frontend for git generates thousands of those. Every file in a repository results in dozens, if not hundreds of unique links for file revisions, blame, etc. and many of those are expensive to serve. Which is why they are often put in robots.txt, so everything was fine until the LLM crawlers came along and ignored robots.txt.
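For comparison, the sane ingestion path costs the server one clone and then never touches it again. A rough sketch; the repository URL and paths are placeholders:

    # Sketch: ingest a repo with one clone plus local traversal,
    # instead of millions of hits on the web UI's per-file/per-revision pages.
    import subprocess

    def ingest(repo_url: str, workdir: str = "mirror.git") -> list[str]:
        # One network operation: a bare mirror clone.
        subprocess.run(["git", "clone", "--mirror", repo_url, workdir], check=True)
        # Everything else is local: list every commit...
        commits = subprocess.run(
            ["git", "-C", workdir, "rev-list", "--all"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        # ...and read any revision of any file without touching the server again,
        # e.g. git -C mirror.git show <commit>:README.md
        return commits

The fact that they scrape blame pages instead tells you nobody on their side is even looking at what the crawler does.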
The distributed nature of git lets me be independent of some central instance (you may decide that the master copy resides on Github, but with the advent of mesh VPNs like the ones Zerotier and Tailscale offer, you could also sidestep it and push/pull from your colleagues directly as well). It also lets me dictate who gets to access it.
What the article describes, though, is possibly the worst way a machine can access a git repository, which is using a web UI and scraping that, instead of cloning it and adding all the commits to its training set. I feel like they simply don't give a shit. They got such a huge capital injection that they feel they can afford to not give a shit about their own cost efficiency and that they go using the scorched earth tactics. After all, even their own LLMs can produce a naive scraper that wreaks havoc on the internet infrastructure, and they just let it loose. Got mine, fuck you all the way!
But then they will release some DeepSeek R(xyz), and yay, all the hackernews who were roasting them for such methods, will be applauding them for a new version of an "open source" stochastic parrot. Yay indeed.
Unpopular opinion - this isn't about LLMs, but how web development has devolved from the declarative serving of lightweight media files to the imperative generation of bloated and brittle SPAs that we never get free from babysitting.
Where we could have once wrapped our mostly static websites in Varnish or a scalable P2P cache like Coral CDN, now we must fiddle and twiddle with robots.txt and appeal to the goodwill of megacorps who never cared about being good netizens before, even when they weren't profiting from scraping to such a degree.
This is yet another chance for me to scream into the void that we're still doing this all wrong. Our sites should work more like htmx, with full static functionality, adding dynamic embellishment when available. Business logic should happen deterministically in one place on the backend or "serverless" with some kind of distributed consensus protocol like Raft/Paxos or a CRDT, then propagate to the frontend through a RESTful protocol, similarly to how Firebase or Ruby Hotwire/Laravel Livewire work. The way that we mostly all do form validation wrong in 2 places with 2 languages is almost hilariously tragic in how predictably it happens.
But the real tragedy is that the wealthiest and most powerful companies that could have fixed web development decades ago don't care about you. Amazon, Google and Microsoft would rather double down on byzantine cloud infrastructure than devote even a fraction of their profit to pure research into actually fixing all of this.
Meanwhile the rest of us sit and spin, sacrificing the hours and days and years of our lives building out other people's ideas to make rent. Many of us know exactly how to fix things, but with infinite backlogs and never truly exiting burnout, we're too tired at the end of the day to contribute to FOSS projects and get real work done. Our valiant quest to win the internet lottery has become a death march through a seemingly inescapable tragedy of the commons.
Instead of fixing the web at a foundational level from first principles, we'll do the wrong thing like we always do and lock everything down behind login walls and endless are-you-human/2FA challenges. Then the LLMs will evolve past us and wrap our cryptic languages and frameworks in human language to a level where even pair programming won't be enough for us to decipher the code or maintain it ourselves.
If I were the developer tasked with hardening a website against LLMs, the first thing I would do is separate the static and dynamic content. I'd fix most of the responses to respect standard HTTP cache headers. Then I'd put that behind the first Cloudflare competitor I could find that promises to never have a human challenge screen. Then I'd wrap every backend API endpoint in Russian doll caching middleware. Then I'd shard the database by user ID as a last resort, avoiding that at all costs by caching queries and/or using modern techniques like materialized views to put the burden of scaling on the database and scale vertically, or gradually migrate the heaviest queries to a document or column-oriented store. Better yet, move to a stronger store that's already solved all of these problems, like CouchDB/PouchDB.
Then I'd build a time machine to convince everyone to do things right the first time instead of building a tech industry upon unforced errors. Oh wait, former me already tried sounding the alarm and nobody cared anyway. How can I even care anymore, when honestly I don't see any way to get out of this mess on any practical timescale? I guess the irony is that only LLMs can save us now.
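If it helps anyone, the cache-header step above is usually the cheapest win. A minimal sketch of the idea, using Flask only as a stand-in framework; the paths, max-age values, and the cookie heuristic are guesses you would tune per site:

    # Sketch: make anonymous responses cacheable so a CDN absorbs crawler traffic.
    from flask import Flask, request

    app = Flask(__name__)

    @app.after_request
    def add_cache_headers(resp):
        if request.path.startswith(("/static/", "/raw/", "/blob/")):
            # Rarely-changing content: let the CDN and browsers keep it for a day.
            resp.headers["Cache-Control"] = "public, max-age=86400"
        elif request.method == "GET" and not request.cookies:
            # Anonymous GETs (which is what crawlers send) can be cached briefly.
            resp.headers["Cache-Control"] = "public, max-age=300"
        else:
            resp.headers["Cache-Control"] = "private, no-store"
        return resp

Once the anonymous traffic is cacheable, the crawlers mostly hit the CDN instead of your origin, whether or not they behave.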
robots.txt should allow excluding all AI crawlers, and AI crawlers should be forced to add "AI" to their crawler User-Agent headers and also respect a robots.txt that says they can't crawl the website
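Much of this exists already for the crawlers that publish a user-agent token. A sketch of both sides; GPTBot and CCBot are published token names as far as I know (other vendors vary), and example.org is a placeholder:

    # The opt-out side is a few lines of robots.txt, e.g.:
    #
    #   User-agent: GPTBot
    #   Disallow: /
    #
    #   User-agent: CCBot
    #   Disallow: /
    #
    # The compliance side needs only the Python standard library:
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.org/robots.txt")
    rp.read()

    if rp.can_fetch("GPTBot", "https://example.org/expensive/blame/view"):
        pass  # fetch it
    else:
        pass  # skip it -- this is the entire cost of being a good citizen

The problem isn't the mechanism; it's the crawlers that don't identify themselves at all.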
They do respect robots.txt, at least the major ones like Meta, Claude (Anthropic), Google, and OpenAI. Based on my infra observations, robots.txt is enough in 90% of cases; the other 10% is just banning IP ranges for a couple of days, but those are not AI companies.
We blocked all non-EU IP ranges. Our customers are all from the EU, so we have no interest in the US or Asia. With the new US administration I assume nothing good from them, and Asia is a notorious bad player.
They are "no longer strong enough" to the extent that GitHub, backed by Microsoft, has had to indemnify their GitHub Copilot customers against copyright claims, and to provide a feature to explicitly prevent open source code from being regurgitated into your codebase.
That's not "no longer strong enough". That's a very strong system applying leverage to a powerful actor.
If we instead adopt the view of free software (https://www.gnu.org/philosophy/open-source-misses-the-point....), the fact that OpenAI and other large corporations train their large-language models behind closed doors - with no disclosure of their training corpus - effectively represents the biggest attack on GPL-licensed code to date.
No evidence suggests that OpenAI and others exclude GPL-licensed repositories from their training sets. And nothing prevents the incorporation of GPL-licensed code into proprietary codebases. Note that a few papers have documented the regurgitation of literal text snippets by large language models (one example: https://arxiv.org/pdf/2409.12367v2).
To me, this seems like the LLM-version of using coin-mixing to obscure the trail of Bitcoin transactions in the blockchain. The current situation also reminds me of how the generalization of the SaaS model led to the creation of the Affero GPL license (https://www.gnu.org/licenses/why-affero-gpl.html).
LLMs enable the circumvention of the spirit of free software licenses, as well as of the legal mechanisms to enforce them.
I absolutely agree with you that the current big LLMs enable an attack on all FOSS licenses and especially copyleft ones. That doesn't mean that one couldn't create LLM code generators in a respectful way. Do license analysis on the input code and then train separate models on the different license buckets, with the outputs from each model considered derivative works of the input corpus.
Also, I don't think a restriction on the FSF's freedom 1, "the freedom to study how the program works," based on what tools you use and how you use them fits with FSF philosophy, nor do I think it is appropriate. You should be able to run whatever analysis tools you have available to study the program. Being able to ingest a program into a local LLM model and then ask questions about the codebase before you understand it yourself is valuable. Or, if you aren't a programmer or aren't familiar with the language used, a local LLM could help you make the changes needed to add a new feature. In that situation LLMs can enable practical software freedom for those who can't afford to pay/convince a programmer to make the changes they want.
In addition, OpenAI clearly do not respect copyrights and licenses in general, so would ignore any anti-AI clauses, which would make them ineffective and thus pointless. So, I think we should tackle the LLM problem through the law, and not through licenses. That is already happening with various caselaw in software, writing, artwork etc.
BTW, LLMs could also in theory be used to licensewash proprietary software; see "Does free software benefit from ML models being derived works of training data?" by Matthew Garrett:
I see what you are saying and don't completely disagree. I however feel that the spirit of free software is to set all software free. From that it follows, that if we are going to follow the current route of complete disregard for authorship and licenses, then the free software movement should continue fighting to liberate all software in existence. In other words, those LLM's that you mention that are to enable software freedom for users who cannot code themselves, in a fair world, they would be trained with both free and proprietary software. After all, a derivative work from a proprietary software should also be subject to fair use. The output produced by the LLM wouldn't necessarily be a literal copy-paste of any particular proprietary software... as the models would just be "learning" from them. The company could just continue doing business as usual, build on their brand and yada yada yada.
Regarding the licensing, I'll restate my point that the Affero license was created precisely in a moment where the existing licenses could no longer uphold the freedoms that the Free Software Foundation set out to defend. A change of license was the right solution at that particular point in time and, if it worked then, I think we can all agree that there is at least a precedent that such a course of action might work and should at the very least be considered as a possible solution for today's problems.
That said, my own personal view is more aligned with demanding that nation states pressure big corporations so that currently closed-source software becomes at least open source (either by law, or simply by ceasing to use it and investing their budget in free alternatives instead). Note I said open source and not free. I just would like to read their code and feed it to my LLMs :)
On setting all software free: indeed, that's the point made in the post by mjg59. None of the AI companies train on their own proprietary software though, which is telling.
On Affero, that was indeed definitely needed, although some folks on HN seem to think that privately modifying code is allowed by copyright, even if the modified version is outputting a public website, thus what the license says is irrelevant. That seems bogus to me, but seems a loophole if it is legit. Anyway, personally I think that people should simply just never use SaaS, nor web apps. It also doesn't help with data portability.
I'd go further and advocate for legally mandated source code escrow for copyright validity, and GPL like rights to the code once public, which would happen if the software is off the market for N years.
> I'd go further and advocate for legally mandated source code escrow for copyright validity, and GPL like rights to the code once public, which would happen if the software is off the market for N years.
Huh? FOSS licenses work exactly as designed! I'm literally using MIT because I don't give a fuck what people do with the code I publish; limiting it to "humans" or restricting the usage makes it very much not FOSS.
Sure, if you want to try to prevent AI training by licensing, do that, but it's no longer FOSS, so please don't call it that.
I haven't seen any LLMs being able to reproduce full copies or even "substantial portions" of any existing software, unless we're talking "famous" functions like those from Quake and such.
You have any examples of that happening? I might have missed it
I know a lot of FOSS people are hostile to AI in general, and this is an immediate problem, but I feel like a better solution for everyone would be for there to be some sort of central repo of this information that AI companies can pull from without externalizing their costs like this.
Are you suggesting that everyone move their projects to a single code forge (GitHub)?
Also, isn't this basically just extortion? "I know you're minding your own business, FOSS maintainer, but move your code to our recommended forge instead so we can stop DDoSing you?"
Isn't this still similar to extortion? Maintainers aren't creating the problem. They are minding their own business until scrapers come along and make too many unnecessary requests. Seems like the burden is clearly on the scrapers. They could easily be hitting the pages much less often for a start.
Doesn't your suggestion shift the responsibility to likely under-sponsored FOSS maintainers rather than companies? Also, how do people agree to switch to some centralized repository and how long does that take? Even if people move over, would that solve the issue? How would a scraper know not to crawl a maintainer's site? Scrapers already ignore robots.txt, so they'd probably still crawl even if you verified you've uploaded the latest content.
Scrapers still have an economic incentive to do what is easiest. Providing an alternative that is easier than fighting sysadmin blocks would likely cause them to take the easier route and make it less of a cat and mouse game for sysadmins.
So I'll just float an idea again that always gets rejected here. This is yet another problem that could be solved completely by... Eliminating anonymity by default on the internet.
To be clear, you could still have anonymous spaces like Reddit where arbitrary user IDs are used and real identities are discarded. People could opt-in to those spaces. But for most people most of the time, things get better when you can verify sources. Everything from DDOS to spam, to malware infections to personal attacks and threats will be reduced when anonymity is removed.
Yes there are downsides to this idea but I'd like people to have real conversations around those rather than throw the baby out with the bath water.
>Yes there are downsides to this idea but I'd like people to have real conversations around those rather than throw the baby out with the bath water.
It's hard to have a serious conversation when you present a couple of upsides but completely understate/not mention the downsides.
Eliminating anonymity comes with real danger. What about whistleblowers and marginalized groups? The increased likelihood of targeted harassment, stalking, and chilling effects on free speech? The increase in surveillance? The reduction in content creation and legitimate criticism of companies/products/etc? The power imbalance granted to whoever validates identities?
pjc50 brings up some other great points, which got me thinking even more:
Removing anonymity creates a greater incentive to steal identities, has a slew of logistical issues (who/how are IDs verified, what IDs are accepted, what are the enforcement mechanisms and who enforces them, etc.), creates issues with shared accounts and corporate/brand accounts, would require cooperation across every country with internet access (good luck!) otherwise it doesn't really work, and probably a million other things if I keep thinking about it.
So in this scenario, whose real user ID would be used for the scrapers?
Doesn't this just create an even worse market for identity theft and botnets?
How does this apply to countries without a national ID system like the United States?
What do you do with an ID traced to a different country, anyway?
> personal attacks and threats will be reduced when anonymity is removed
People are happy to make death threats under their real name, newspaper byline, blue tick, or on the presidential letterhead if they're doing so from a position of power.
I mean, this has already happened, it just happened in a more sinister way than "FreedomNet now requires logins from all users!" Ad companies and social media track everything you do and can tie it together with various forms of packaged/bought/sold identity that follow you wherever you go. Even with aggressive ad blocking, I get ads on Instagram for things I looked up in a browser that I have never used to log into Insta with. We're constantly deanonymized, it just happens below the surface. And all of this is hoovered up by the US dragnet surveillance programs.
So do I support a fully authenticated internet? Fuck no. If we can get good at bot detection, zip bomb the fuckers. In the meantime, work as hard as we can to dismantle the hellscape that the internet has become. I'm all for decentralized, sovereign identity systems that aren't owned by some profiteering corpo cretins or some proto-fascist state, but I don't want it to be a requirement to look at photos of dogs or plan my next trip.
Such as living under constant logging. Which, you know (you know?), some people will radically refuse, with several crucial justifications. One of them is that privacy is a condition for Dignity. Another is Prudence. Another one is a warning millennia old, about who you should choose as a confidant. And more.
A few minutes of spitballing what implementations might look like create a number of problems that appear to make the idea a nonstarter. You should have a real proposal that explores the possibility space, say what the key requirements are, and assuage (or confirm) people's objections. That way more people might be willing to engage with the idea seriously.
I've been thinking about this even before AI. The internet has become society itself. Complete anonymity on the internet removes any sort of social pressure to act like a civilized person.
I dare say the inconceivable: you shouldn't have free plans, even for the community. This will also push FOSS projects to seek some money to pay for their infrastructure, which probably leads to better pay for their maintainers.
Nothing should be $$-free unless you already paid for it with your taxes. Same principle -> if HN starts to charge every account, I'm happy to pay a small amount per month. This token amount of pay per account will also reduce the number of bots.
Are you saying that Gnome shouldn't offer access to their VCS for free, and all Gnome developers should pay a small sum to be able to access it?
FOSS is generally built on the idea that anyone can use the code for anything, if you start to add a price for that, not only do you effectively gate your project from "poor people", but it also kind of erodes some of the core principles behind FOSS.
Offering read-only mirrors via git+http:// might be a solution then, at least to shed the load if anything. It does remind me a bit of companies complaining about being scraped and trying to prevent it, instead of offering an API so no one would have to scrape them.
We do precisely this ... and we're still dealing with the load issues. Currently I have fail2ban doing a 10 day block on any IP addr that hits our read only http-git endpoint twice in 30 mins. The problem with this is that the default implementation of iptables doesn't scale well to 100k blocked addresses.
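The usual workaround is to move the blocklist into an ipset, so a single iptables rule does an O(1) set lookup instead of walking 100k individual rules; fail2ban also ships ipset-based ban actions, if I remember correctly. A rough sketch of the manual version; the set name and input file are placeholders:

    # Sketch: push blocked IPs into an ipset behind one iptables DROP rule.
    import subprocess

    def setup(set_name: str = "crawler_block") -> None:
        # Run once: create the set and the single matching rule.
        subprocess.run(["ipset", "create", set_name, "hash:ip", "-exist"], check=True)
        subprocess.run(["iptables", "-I", "INPUT",
                        "-m", "set", "--match-set", set_name, "src",
                        "-j", "DROP"], check=True)

    def block(ip: str, set_name: str = "crawler_block") -> None:
        subprocess.run(["ipset", "add", set_name, ip, "-exist"], check=True)

    setup()
    for ip in open("banned_ips.txt"):   # e.g. the addresses fail2ban collected
        block(ip.strip())

Lookups stay fast no matter how many addresses you feed it, which is exactly what a per-address iptables chain can't give you.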
There is nothing that says you can't charge money for FOSS software. FOSS is more about having the ability to inspect and freely change your software to your use-cases.
> There is nothing that says you can't charge money for FOSS software
Well, yes and no. If you had a cost to access the source code, I'm pretty sure I'd stop calling that FOSS. If you only have a price for downloading binaries, sure, still FOSS, since we're talking source code licensing.
> Nothing should be $$ free
I took this statement at face value, and assumed parent argued for basically eliminating FOSS.
Actually I'm anti-capitalistic. That solution was supposed to undermine corporations who take other people's work for free. Maybe it shouldn't cover the whole FOSS, but I do think it fits OP's use case.
Well, looking at the SourceHut code, it's written in Python and handles git by spawning a "git" process.
In other words, it was written with no consideration for performance at all.
A competent engineer would use Rust or C++ with an in-process git library, perhaps rewrite part of the git library or git storage system if necessary for high performance, and would design a fast storage system with SSDs, and rate-limit slow storage access if there has to be slow storage.
That's the actual problem, LLMs are seemingly just adding a bit of load that is exposing the extremely amateurish design of their software, unsuitable for being exposed on the public Internet.
Anyway, they can work around the problem by restricting their systems to logged-in users (and restricting registration if necessary), and by mirroring their content to well-implemented external services like GitHub or GitLab and redirecting users there.
> A competent engineer would use Rust or C++ with an in-process git library,
The issue is, there aren't any fully featured ones of these yet. Sure, they do exist, but you run into issues. Spawning a git process isn't about not considering performance, it's about correctness. You simply won't be able to support a lot of people if you don't just spawn a git process.
>In other words, it was written with no consideration for performance at all.
This is a bold assumption to make on such little data other than "your opinion".
Developing in Python is not a negative and, depending on the team, the scope of the product, and the intended use, is completely acceptable. The balance of "it does what it needs to do within an acceptable performance window while providing x, y, z benefits" is almost certainly a discussion the company and its developers have had.
What it never tried to solve was scaling to LLM and crawler abuse. Claiming that they have made no performance considerations because they can't scale to handle a use case they never supported is just idiotic.
>That's the actual problem, LLMs are seemingly just adding a bit of load that is exposing the extremely amateurish design of their software.
"Just adding a bit of load" != 75%+ of calls. You can't be discussing this in good faith and make simplistic reductions like this. Either you are trolling or naively blaming the victims without any rational thought or knowledge.