This reminds me of how GPT-2/3/J came across https://reddit.com/r/counting, wherein redditors repeatedly post incremental numbers to count to infinity. It considered their usernames, like SolidGoldMagikarp, such common strings on the Internet that, during tokenization, it treated them as top-level tokens of their own.
Vocabulary isn't infinite, and GPT-3 reportedly had only 50,257 distinct tokens in its vocabulary. It does make me wonder: it's certainly not a linear relationship, but given the number of inferences run every day on GPT-3 while it was the flagship model, the incremental electricity cost of these Redditors' niche hobby might have been measurable, compared with allocating those vocabulary slots to genuinely common substrings in real-world text and thereby reducing average input token counts.
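A quick way to see this is with the open tiktoken library, which ships the GPT-2/GPT-3 style encodings: the vocabulary really is 50,257 entries, and, if I remember right, " SolidGoldMagikarp" with its leading space comes back as a single token. A rough sketch:

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")          # the r50k-style encoding discussed above
    print(enc.n_vocab)                           # 50257
    print(enc.encode(" SolidGoldMagikarp"))      # one token id, if I remember right
    print(len(enc.encode(" an ordinary English phrase")))  # several tokens, as expected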
It would be hilarious if the subtitle on OP's site, "IECC ChurnWare 0.3," became a token in GPT-5 :)
I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs. I mean if someone posts a question on an internet forum that I don't know the answer to, I'm certainly not going to post "I don't know" since that wouldn't be useful.
In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
Three logicians walk into a bar. The bartender says "what'll it be, three beers?" The first logician says "I don't know". The second logician says "I don't know". The third logician says "Yes".
Both of the first two logicians wanted a beer; otherwise they would know the answer was "no". The third logician recognizes this, and therefore knows the answer.
He didn’t know perfectly, but he knew with great enough probability to place an order. In the very small chance that someone wanted two beers, someone would speak up.
This way of working is logically the most efficient and involves the least communication.
I recently heard this explained (*) in the following way: three is the smallest number where you can set up an expectation (with the first two) and then break it. This is why three is such a common number, not just in jokes but in all sorts of story-telling.
(*) In a lecture by the mathematician & author Sarah Hart.
I do love a good joke, but this one falls a bit flat.
Logically speaking, the second logician could have thought to himself "no, I don't want any beer, but one of these two other guys may want to double fist" and so there is really no way for the third logician to answer in the affirmative.
I'm really cross that the word "hallucination" has taken off to describe this, as it's clearly an incorrect word. The correct word to describe it is "confabulation", which is clinically more accurate and a much clearer descriptor of what's actually going on.
Fully agreed that "hallucination" is a bonkers word for it — sensational and melodramatic. But few people know what a confabulation is, and moreover it's an overly complex way to describe the phenomenon.
The LLM is making something up. It's a fabrication.
It's not fanciful; it's not spooky; it's mundane, as it should be.
Fabrication implies making things up for the sake of it. Confabulation is similar but is defined by making things up due to some limitation in capacity.
My view is that hallucination is something related to the interpretation of reality; it doesn't really map directly to memory at all. The mechanisms of confabulation entirely surround the gluing together of memories, and what are these models other than some sort of representation of memory?
I believe that you can also cause something a bit like a transient dysphasia by giving them bad inputs as well, so there is that on the language production side. However, there's still nothing that pertains to the experience aspects central to what hallucinations actually are.
I'm not sure that's correct. Hallucination definitely implies some sort of connection to reality; confabulation does not. It implies some kind of hard-to-detect error in stitching memories together coherently.
Making things up based on memories of past things is entirely what confabulation is. Bullshitting in the large, as it were. I've met quite a few clinical confabulators (people with Korsakoff syndrome and the like) and I find the parallels remarkable.
That’s a good observation. If LLMs had taken off 15 years ago, maybe they would answer every question with “this has already been asked before. Please use the search function”
It's not. Kids overhear what parents watch; their ears are like little recorders. Meanwhile, kids' videos near-universally end with either a like&subscribe admonition, or some crap like "ask parents to download our tablet app". Even the quality videos, they all do that.
Even if you don't show children videos, but want to play some music, YouTube is still the least-hassle, least-bullshit streaming music player (arguably still its main use for adults, too). Ain't anyone got time to deal with Spotify's ever more broken app. And this is the limit of technical skill of almost all parents. They can't exactly run SponsorBlock in YouTube's mobile app (and paid YouTube doesn't help here either, surprise surprise).
Not making excuses (though I'm not really blaming parents for this) - just saying how things actually are.
Which is crazy, because there's plenty of good content for kids on YouTube (if you really need a break!). Blippi, Meekah, Sesame Street, even that mind-numbing drivel Cocomelon (which at least got my girls talking/singing really early).
I get the sentiment, but when reality hits unrealistic parental expectations, things get messy.
If you have to put a show on TV to give some songs to sing along to or to distract them while you're making lunch, I'm not judging you, and I think it's best to put this content on a gradient rather than black and white.
Neither did they have vaccines, bikes, gymnastics classes, dozens of books, a constant supply of fruit and veggies, family vacations, tractor rides, swing sets, family movie nights, planetarium projectors for a few bucks, zoos, kids' museums, and in general a conflict-free, peaceful, and disease-free existence.
Focusing on such a tiny thing and blowing it up into a huge negative out of context of their rich, busy, and safe lives is really out of hand.
Sure, and most of it starts with a jingle and ends with a begging block.
I used to cut all those things to shape with youtube-dl and Audacity; we have a library of a good hundred-plus sanitized songs to play, but with the modern world hating files and anything offline, it turned out to be quite a hassle to keep the practice up.
Which is why I hate the current breed of assistants. They won't be Star Trek-level nice until vendors give up on the whole "Hey [brand 1], [brand 2] on [brand 3] with [brand 4]" interface.
They can say they don't know, and have been trained to in at least some cases; I think the deeper problem — which we don't know how to fix in humans, the closest we have is the scientific method — is they can be confidently wrong.
Contrast with Q&A on products on Amazon where people routinely answer that way. I have flagged responses saying "I don't know" but nothing ever comes of it.
I’d place in the same category the responses that I give to those chat popups so many sites have. They show a person saying to me “Can I help you with anything today?” so I always send back “No”.
A lot of LLM hallucination is because of the internal conflict between alignment for helpfulness and lack of a clear answer. It's much like when someone gets out of their depth in a conversation and dissembles their way through to try and maintain their illusion of competence. In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations.
A lot more of LLM hallucination is it getting the context confused. I was able to get GPT4 to hallucinate easily with questions related to the distance from one planet to another, since most distances on the internet are from the sun to individual planets, and the distances between planets vary significantly based on where they are in their orbits. These are probably slightly harder to fix.
"In these cases, if you give the LLM explicit permission to tell you that it doesn't know in cases where it's not sure, that will significantly reduce hallucinations."
I've noticed that while this can help to prevent hallucinations, it can also cause it to go way too far in the other direction and start telling you it doesn't know for all kinds of questions it really can answer.
It also has a problem with quantities: it got confused by things like the cube root of 750 l, which it maintained for a long time was around 9 m. It even suggested that 1 l is equal to 1 m³.
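For reference, the arithmetic it should have done is tiny; a litre is a thousandth of a cubic metre, so the cube root of 750 l is well under a metre:

    # 1 l = 0.001 m³, so 750 l = 0.75 m³
    volume_m3 = 750 / 1000
    side_m = volume_m3 ** (1 / 3)
    print(f"{side_m:.2f} m")   # ≈ 0.91 m, nowhere near 9 m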
>In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
This isn't true. There are many contexts where it is true, but it doesn't actually generalize the way you say it does.
There are plenty of cases where experts in a non-one-on-one context will express a lack of knowledge. Sometimes this will be part of making a point about the broader epistemic state of the group, sometimes it will be simply to clarify the epistemic state of the speaker.
I personally will almost always say I don't know while talking thru to a solution. Admittedly this is informal speech that doesn't make it to written form.
I've wondered if one could train a LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.
Because expecting a behaviour, like knowing you don't know, that isn't represented in the training set is silly.
Kids make stuff up at first, then we correct them - so they have a way to learn not to.
> I've wondered if one could train a LLM on a closed set of curated knowledge. Then include training data that models the behaviour of not knowing. To the point that it could generalize to being able to represent its own not knowing.
The problem is that curating data is slow and expensive and downloading the entire web is fast and cheap.
Agreed. Using LLMs to generate or curate training sets for other generations seems like a cool approach.
Maybe if you trained a small base model to know it doesn't know in general and THEN trained it on the entire web with embedded not-knowing preserving training examples, it would work?
With Wittgenstein I think we see that "hallucinations" are a part of language in general, albeit one I could see being particularly vexing if you're trying to build a perfectly controllable chatbot.
I'm referring to his two works, the "Tractatus Logico-Philosophicus" and "Philosophical Investigations". There's a lot explored here, but Wittgenstein basically makes the argument that the natural logic of language—how we deduce meaning from terms in a context and naturally disambiguate the semantics of ambiguous phrases—is different from the sort of formal propositional logic that forms the basis of western philosophy. However, this is also the sort of logic that allows us to apply metaphors and conceive of (possibly incoherent, possibly novel, certainly not deductively-derived) terms—counterfactuals, conditionals, subjunctive phrases, metaphors, analogies, poetic imagery, etc. LLMs have shown some affinity for the former (linguistic) type of logic, with greatly reduced affinity for the latter (formal/propositional) sort of logical processing. Hallucinations as people describe them seem to be problems with not spotting "obvious" propositional incoherence.
What I'm pushing at is not that this linguistic ability naturally leads to the LLM behavior we're seeing and calling "hallucinating", just that LLMs may capture some of how humans process language, differentiate semantics, recall terms, etc, but without the mechanisms that enable rationally grappling with the resulting semantics and propositional (in)coherency that are fetched or generated.
I can't say this is very surprising—most of us seem to have thought processes that involve generating and rejecting thoughts when we, e.g., "brainstorm" or engage in careful articulation, and we haven't even figured out how to formally model that with a chatbot capable of generating a single "thought". But I'm guessing that if we want chatbots to keep their ability to generate things creatively, there will always be tension with potentially generating factual claims, erm, creatively. Further evidence is anecdotal observations that some people seem to have wildly different thresholds for the propositional coherence they can spot—perhaps one might be inclined to correlate the complexity with which one can engage in spotting (in)coherence with "intelligence", if one considers that a meaningful term.
Wait, are you saying this is something you read in both the Tractatus and the PI? They are quite opposed as texts! That's kinda why he wrote the PI at all.
I don't think Wittgenstein would agree, first of all, that there is a "natural logic" to language. At least in the PI, that kind of entity--"the natural logic of language"--is precisely the kind of weird and imprecise use of language he is trying to expose. Even more, to say that such a logic "allows" for anything (like metaphors) feels like a very very strange thing for Wittgenstein to assert. He would ask "what do you mean by 'allows'"?
All we know, according to him (in the PI), is that we find ourselves speaking in situations. Sometimes I say something, and my partner picks up the right brick, other times they do nothing, or hit me. In the PI, all the rest is doing away with things, like our idea of private language, the irreality of things like pain, etc. To conclude that he would make such assertions about the "nature" of language, of poetry, whatever, seems like maybe too quick a reading of the text. It is at best, a weirdly mystical reading of him, that he probably would not be too happy about (but don't worry about that, he was an asshole).
The argument you are making sounds much more French. Derrida or Lyotard have said similar things (in their earlier, more linguistic years). They might be better friends to you here.
The texts are quite different, this is true, but I don't find them contradictory. Whereas Tractatus was almost a facetious or flippant rejection of the millennia-long project to agree on a philosophical subset of language suitable for rigorous philosophy (although it continues today in the form of analytical philosophy), PI basically says "well we don't need to throw the baby out with the bath water", which I think is a fantastically mature response to a flawed tool that's still the best we have to reason about the universe. So: not contradictory in evaluation of fundamental compatibility of non-formal language for the formal needs of propositional philosophy, but perhaps contradictory in implied reaction to this realization.
I would assume GP is talking about the fallibility of human memory, or perhaps about the meanings of words/phrases/aphorisms that drift with time. C.S. Lewis talks about the meaning of the word "gentleman" in one of his books; at first the word just meant "land owner" and that was it. Then it gained social significance and began to be associated with certain kinds of behavior. And now, in the modern register, its meaning is so dilute that it can be anything from "my grandson was well behaved today" or "what an asshole" depending on its use context.
> In fact, in general, in any non one-on-one conversation the answer "I don't know" is not useful because if you don't know in a group, your silence indicates that.
Only a response makes it clear that one has read and acknowledged the question, and sometimes there are people who are expected to know; if they don't, they should say so.
> I wonder how much the source content is the cause of hallucinations rather than anything inherent to LLMs
I mean, it's inherent to LLMs to be unable to answer "I don't know" as a result of not knowing the answer. An LLM never "doesn't know" the answer. But they'll gladly answer "I don't know" if that's statistically the most likely response, right? (Although current public offerings are probably trained against ever saying that.)
Knowing to say "I don't know" instead of extrapolating is an explicitly learned skill in humans, not something innate or inherent in the structure of language, so we shouldn't expect LLMs to pick it up ex nihilo either.
I suspect this is going to be a disagreement on the meaning of "to know".
On the same lines as why people argue over whether a tree falling in a wood where nobody can hear it makes a sound: some people implicitly regard sound as the qualia, while others regard it as the vibrations in the air.
What does it mean for a human to "know" something?
I have some representation in my mind; as someone who doesn't have aphantasia, this representation comes with a mental image. Tower? Tall, linear, and in my case a skyscraper by default. Eiffel Tower? Paying attention to the extra context, the first word transforms the second into the eponymous structure. Model Eiffel Tower? Now the context makes it a tchotchke, probably 10cm tall. Lego model Eiffel Tower? The 1-ish meter tall one on display in the Lego shop.
Is my "knowledge" the abstract representation that is in my case connected to a mental image? The attention process can reasonably be considered as developing a vector in a very high dimensional concept space, and the next token comes from what would best suit the current location in that high dimensional space. It's entirely possible that the concept of "ignorance" is linearly separable within that space (much as gender is, see the word2vec trick with "king" - "queen" ~= "man" - "woman"), and the corresponding "ignorance" vector can be associated with the sequence of words "I don't know".
I think it would take actual research into the internal vector space to answer that, and while I'd like to do that research, I have some higher priorities right now.
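For anyone who wants to poke at that word2vec trick without doing the research themselves, a minimal sketch with gensim's pretrained GloVe vectors (the download is a couple of hundred MB) looks roughly like this; whether an analogous "ignorance" direction exists inside an LLM is exactly the open question:

    import gensim.downloader as api

    # Pretrained GloVe word vectors; any reasonably large embedding model would do.
    vectors = api.load("glove-wiki-gigaword-100")

    # king - man + woman ≈ queen: the "gender" direction is close to linearly separable.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))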
An LLM should have no problem replying "I don't know" if that's the most statistically likely answer to a given question, and if it's not trained against such a response.
What it fundamentally can't do is introspect and determine it doesn't have enough information to answer the question. It always has an answer. (disclaimer: I don't know jack about the actual mechanics. It's possible something could be constructed which does have that ability and still be considered an "LLM". But the ones we have now can't do that.)
No, that's the same misunderstanding previously stated.
Answering "I don't know" because it a likely response to a particular string is completely different from being aware that one does not know the answer and saying so.
Both motivations lead to the same outcome, but they're unrelated processes. The response "I don't know" can represent either:
1. The most likely answer to a particular question, based on statistical data; or
2. An expression of an agent's internal state.
Figuring out that distinction is perhaps one of the most important questions ever raised.
During tokenization, the usernames became tokens... but before training the actual model, they removed stuff like that from the training data, so it was never trained on text which contains those tokens. As such, it ended up with tokens which weren't associated with anything; glitch tokens.
It's interesting: perhaps the stability (from a change management perspective) of the tokenization algorithm, being able to hold that constant, between old and new training runs was deemed more important than trying to clean up the data at an earlier phase of the pipeline. And the eventuality of glitch tokens was deemed an acceptable consequence.
I'm more interested in what that content farm is for. It looks pointless, but I suspect there's a bizarre economic incentive. There are affiliate links, but how much could that possibly bring in?
This is a honeypot. The author, https://en.wikipedia.org/wiki/John_R._Levine, keeps it just to notice any new (significant) scraping operation: whatever gets launched will invariably hit his little farm and show up in the logs. He's a well-known anti-spam operative, with various efforts now dating back multiple decades.
Notice how he casually drops a link to the landing page in the NANOG message. That's how the bots will take the bait.
I recognize the name John Levine at iecc.com, "Invincible Electric Calculator Company," from the web 1.0 era. He was the moderator of the Usenet comp.compilers newsgroup and wrote the first C compiler for the IBM PC RT.
Except the first thing openai does is read robots.txt.
However, robots.txt doesn't cover multiple domains, and every link that's being crawled is to a new domain, which requires a new read of the robots.txt on the new domain.
> Except the first thing openai does is read robots.txt.
Then they should see the "Disallow: /" line, which means they shouldn't crawl any links on the page (because even the homepage is disallowed). Which means they wouldn't follow any of the links to other subdomains.
All the lines related to GPTBot are commented out. That robots.txt isn't trying to block it. Either it has been changed recently or most of this comment thread is mistaken.
Accessing a directly referenced page is common in order to receive the noindex header and/or meta tag, whose semantics are not implied by “Disallow: /”
And then all the links are to external domains, which aren't subject to the first site's robots.txt
More directly, e.g. Tesla boasts of training their FSD on data captured from their customers' unassisted driving. So it's hardly surprising that it imitates a lot of humans' bad habits, e.g. rolling past stop lines.
Jesus, that's one of those ideas that looks good to an engineer but is why you really need to hire someone with a social sciences background (sociology, anthropology, psychology, literally anyone whose work includes humans), and probably should hire two, so the second one can tell you why the first died of an aneurysm after you explained your idea.
You mean human users? That is and always will be the dominant group of clients that ignore robots.txt.
What you’re talking about is an arms race wherein bots try to mimic human users and sites try to ban the bots without also banning all their human users.
That’s not a fight you want to pick when one of the bot authors also owns the browser that 63% of your users use, and the dominant site analytics platform. They have terabytes of data to use to train a crawler to act like a human, and they can change Chrome to make normal users act like their crawler (or their crawler act more like a Chrome user).
Shit, if Google wanted, they could probably get their scrapes directly from Chrome and get rid of the scraper entirely. It wouldn’t be without consequence, but they could.
The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don't hand it over for free.
People don't mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.
> The point here is to poison the well for freeloaders like OpenAI, not to actually prevent web crawlers. OpenAI will actually pay for access to good training data; don't hand it over for free.
Sure, and they’ll pay the scrapers you haven’t banned for your content, because it costs those scrapers $0 to get a copy of your stuff so they can sell it for far less than you.
> People don't mindlessly click on things like terms of service; crawlers are quite dumb. Little need for an arms race, as the people running these crawlers rarely put much effort into any one source.
The bots are currently dumb _because_ we don’t try to stop them. There’s no need for smarter scrapers.
Watch how quickly that changes if people start blocking bots enough that scraped content has millions of dollars of value.
At the scale of a company, it would be trivial to buy request log dumps from one of the adtech vendors and replay them so you are legitimately mimicking a real user.
Even if you are catching them, you also have to be doing it fast enough that they’re not getting data. If you catch them on the 1,000th request, they’re getting enough data that it’s worthwhile for them to just rotate AWS IPs when you catch them.
Worst case, they just offer to pay users directly. “Install this addon. It will give you a list of URLs you can click to send their contents to us. We’ll pay you $5 for every thousand you click on.” There’s a virtually unlimited supply of college students willing to do dumb tasks for beer money.
You can’t price segment a product that you give away to one segment. The segment you’re trying to upcharge will just get it for cheap from someone you gave it to for free. You will always be the most expensive supplier of your own content, because everyone else has a marginal cost of $0.
Google doesn't care what you do to other crawlers that ignore your TOS. This isn't a theoretical situation; it's already going on. Crawling is easy enough to "block" that there are court cases on this stuff, because this is very much the case where the defense wins once they devote fairly trivial resources to the effort.
And again, blocking should never be the goal; poisoning the well is. Training AI on poisoned data is both harder to detect and vastly more harmful. A price comparison tool is only as good as the actual prices it can compare, etc.
Scrapers of the future won't be ifElse logic, they will be LLM agents themselves. The slow-loris robots.txt has to provide an interface to its own LLM, which engages the scraper LLM in conversation, aiming to extend it as long as possible.
"OK I will tell you whether or not I can be scraped. BUT FIRST, listen to this offer. I can give you TWO SCRAPES instead of one, if you can solve this riddle."
You just set limits on everything (time, buffers, ...), which is easier said than done. You need to really understand your libraries and all the layers down to the OS, because it's enough to have one abstraction that doesn't support setting limits and it's an invitation for (counter-)abuse.
Doesn't seem like it should be all that complex to me assuming the crawler is written in a common programming language. It's a pretty common coding pattern for functions that make HTTP requests to set a timeout for requests made by your HTTP client. I believe the stdlib HTTP library in the language I usually write in actually sets a default timeout if I forget to set one.
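A rough sketch of what those limits look like with the common requests library; the URL and the size cap are just placeholders:

    import requests

    URL = "https://example.com/some-page"   # placeholder
    MAX_BYTES = 2 * 1024 * 1024             # arbitrary per-page cap

    # Connect/read timeouts plus a byte cap bound how long a single slow or
    # never-ending response can tie up a crawler worker.
    with requests.get(URL, timeout=(5, 15), stream=True) as resp:
        body = b""
        for chunk in resp.iter_content(chunk_size=8192):
            body += chunk
            if len(body) > MAX_BYTES:
                break   # bail out of pathologically large or never-ending responses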
He says 3 million, and 1.8 million are for robots.txt
So 1.2 million non-robots.txt requests, when his robots.txt file is configured as follows:
# buzz off
User-agent: GPTBot
Disallow: /
Theoretically if they were actually respecting robots.txt they wouldn't crawl any pages on the site. Which would also mean they wouldn't be following any links... aka not finding the N subdomains.
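For what it's worth, Python's standard robots.txt parser reads that file the same way; a compliant GPTBot should get a "no" for every path on that host, and each of the generated subdomains would need its own robots.txt fetched, since the file only speaks for one host:

    from urllib.robotparser import RobotFileParser

    # The rules quoted above, fed straight to the stdlib parser.
    rp = RobotFileParser()
    rp.parse([
        "# buzz off",
        "User-agent: GPTBot",
        "Disallow: /",
    ])

    print(rp.can_fetch("GPTBot", "https://www.web.sp.am/"))          # False
    print(rp.can_fetch("GPTBot", "https://www.web.sp.am/anything"))  # False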
A lot of crawlers, if not all, have a policy like "if you disallow our robot, it might take a day or two before it notices". They surely follow the path "check if we have robots.txt that allows us to scan this site, if we don't get and store robots.txt, scan at least the root of the site and its links". There won't be a second scan, and they consider that they are respecting robots.txt. Kind of "better ask for forgiveness than for permission".
That is indistinguishable from not respecting robots.txt. There is a robots.txt on the root the first time they ask for it, and they read the page and follow its links regardless.
I agree with you. I only stated how the crawlers seem to work; if you read their pages or try to block or slow them down, it seems clear that they scan-first-respect-after. But somehow people understood that I approve of that behaviour.
For those bad crawlers, which I very much disapprove of, "not respecting robots.txt" equals "don't even read robots.txt, or if I read it, ignore it completely". For them, "respecting robots.txt" means "scan the page for potential links, and after that parse and respect robots.txt". Which I disapprove of and don't condone.
There are fewer than 10 links on each domain, how did GPTBot find out about the 1.8M unique sites? By crawling the sites it's not supposed to crawl, ignoring robots.txt. "disallow: /" doesn't mean "you may peek at the homepage to find outbound links that may have a different robots.txt"
I'm not sure any publisher means for their robots.txt to be read as:
"You're disallowed, but go head and slurp the content anyway so you can look for external links or any indication that maybe you are allowed to digest this material anyway, and then interpret that how you'd like. I trust you to know what's best and I'm sure you kind of get the gist of what I mean here."
The convention is that crawlers first read /robots.txt to see what they're encouraged to scrape and what they're not meant to, and then hopefully honor those directions.
In this case, as in many, the disallow rules are intentionally meant to protect the signal quality and efficiency of the crawler.
Linkers & Loaders is their own book (I haven't checked the others).
They have a page at https://www.iecc.com/linker/ where they used to publish a draft of the book contents, but changed the page to say "Chapters were available in an excessive variety of formats, but are not any longer due to chronic piracy", when it got posted to HN at https://news.ycombinator.com/item?id=18424233 and I bundled the files for offline reading. I notified them via email about that asking if they are OK with it but got an unfriendly response that I pirated the files and that wasn't OK, so I took the link down again and they changed that text. (Shrug. I'm not a/the book author, they are. I'll say that I also suggested to them they ask on the page not to do what I did since then I wouldn't have, but they chose their more radical approach.)
It's for shits-and-giggles and it's doing its job really well right now. Not everything needs to have an economic purpose, 100 trackers, ads and backed by a company.
Am I the only one who was hoping—even though I knew it wouldn’t be the case—that OpenAI’s server farm was infested with actual spiders and they were getting into other people’s racks?
It's not just a problem for training, but the end user, too. There are so many times that I've tried to ask a question or request a summary for a long article only to be told it can't read it itself, so you have to copy-paste the text into the chat. Given the non-binding nature of robots.txt and the way they seem comfortable with vacuuming up public data in other contexts, I'm surprised they allow it to be such an obstacle for the user experience.
I would say robots.txt is meant to filter access for interactions initiated by an automated process (i.e. automatic crawling). Since the interaction of requesting a site with a language model is manual (a human request), it doesn't make sense to me that it is used to block that request.
If you want to block information you provide from going through ClosedAI servers, block their IPs instead of using robots.txt.
If my web browser's extension "visits" the site and dumps it into ChatGPT for me to read its summarization of the site, what has been gained by the website operator?
This is why these things - search engines, AI crawlers, even adblock and video downloaders - exist in a slightly adversarial/parasitic relationship with the sites that provide their content to which they provide nothing back (or negative, if you cost them a page load without incurring an ad view).
I use adblock all the time but I'm very aware that it can only succeed as long as it doesn't win.
I’d let them do their thing, why not?! They want the internet? This is the real internet. It looks like he doesn’t really care that much that they’re retrieving millions of pages, so let them do their thing…
Just that the site owner, most likely, did this kind of on purpose. It's fairly unlikely that he's concerned about his "data" being used because it's junk data.
In the network security world, this is known as a tarpit. You can delay an attack, scan or any other type of automation by sending data either too slowly or in such a way as to cause infinite recursion. The result is wasted time and energy for the attacker and potentially a chance for us to ramp up the defences.
From the content of the email, I get the impression that it's just a honeypot. Also I'm not seeing any delays in the content being returned.
A tarpit is different because it's designed to slow down scanning/scraping and deliberately waste an adversary's resources. There are several techniques but most involve throttling the response (or rate of responses) exponentially.
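A toy tarpit along those lines, using only the standard library (the delays and page body are picked arbitrarily):

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            delay = 0.5
            for _ in range(200):
                try:
                    self.wfile.write(b"<p>loading...</p>\n")
                    self.wfile.flush()
                except (BrokenPipeError, ConnectionResetError):
                    return          # the client gave up; stop wasting our own cycles
                time.sleep(delay)
                delay *= 1.5        # exponential slowdown: each chunk arrives later than the last

    if __name__ == "__main__":
        HTTPServer(("", 8080), TarpitHandler).serve_forever()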
Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content, which is more often than not slightly incorrect when it comes to Q&A, and the quality of AI responses trained on that content will quickly deteriorate. Right now, most internet content is written by humans. But in 5 years? Not so much. I think this is one of the big problems that the AI space needs to solve quickly. Garbage in, garbage out, as the old saying goes.
The end state of training on web text has always been an ouroboros - primarily because of adtech incentives to produce low quality content at scale to capture micro pennies.
Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise.
Common crawl alone is only a few hundred TB, I have more content than that on a NAS sitting in my office that I built for a few grand (Granted I’m a bit of a data hoarder). The fears that we have “used all the data” are incredibly unfounded.
"Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
> The fears that we have “used all the data” are incredibly unfounded.
The problem isn't whether we used all the real data or not, the problem is that it becomes increasingly difficult to distinguish real data from previous LLM outputs.
> "Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
I don't know about that. If you scraped the same data and ran a search engine I think people would generally say you're fine. The copyright issue isn't the scraping step.
Gonna say you’re way off there. Once you decompress common crawl and index it for FTS and put it on fast storage you’re in for some serious pain, and that’s before you even put it in your ML pipeline.
Even refined web runs about 2TB once loaded into Postgres with TS vector columns, and that’s a substantially smaller dataset than common crawl.
It's not just dumping a ton of zip files on your NAS, it's making the data responsive and usable.
My guess is Mistral is so good for its size because they are doing some really precise pre-ingestion selection and sorting. You need FTS to do that work.
> Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise
YMMV depending on the value of "you" and your budget.
If you're Google, Amazon or even lower-tier companies like Comcast, Yahoo or OpenAI, you can scrape a massive amount of data (ignoring the "allowed" here, because TFA is about OpenAI disregarding robots.txt).
Not Llama, they’ve been really clear about that. Especially with DMA cross-joining provisions and various privacy requirements it’s really hard for them, same for Google.
However, Microsoft has been flying under the radar. If they gave all Hotmail and O365 data to OpenAI I’d not be surprised in the slightest.
I bet they are training their internal models on the data. Bet the real reason they are not training open source models on that data is because of fears of knowledge distillation, somebody else could distill LLaMa into other models. Once the data is in one AI, it can be in any AIs. This problem is of course exacerbated by open source models, but even closed models are not immune, as the Alpaca paper showed.
> The end state of training on web text has always been an ouroboros
And when other mediums have been saturated with AI? Books, music, radio, podcasts, movies -- what then? Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?
I mean, you're not wrong. I've been building some unrelated web search tech and have considered just indexing all the sites I care about and making my own "non-shit" search engine. Which really isn't too hard if you want to do, say, 10-50 sites. You can fit that on one 4TB NVMe drive on a local workstation.
I’m trying to work on monetization for my product now. The “personal Google” idea is really just an accidental byproduct of solving a much harder task. Not sure if people would pay for that alone.
Once we've got the first, making a billion is easy.
That said… are content creators collectively (all media, film and books as well as web) a thin tail or a fat tail?
I could easily believe most of the actual culture comes from 10k-100k people today, even if there's, IDK, ten million YouTubers or something (I have a YouTube channel, something like 14 k views over 14 years, this isn't "culturally relevant" scale, and even if it had been most of those views are for algorithmically generated music from 2010 that's a literal Markov chain).
It's true that there will no longer be any virgin forest to scrape but it's also true that content humans want will still be most popular and promoted and curated and edited etc etc. Even if it's impossible to train on organic content it'll still be possible to get good content
It is already solved. Look at how Microsoft trained Phi - they used existing models to generate synthetic data from textbooks. That allowed them to create a new dataset grounded in “fact” at a far higher quality than common crawl or others.
It looks less like an ouroboros and more like a bootstrapping problem.
AI training on AI-generated content is a future problem. Using textbooks is a good idea, until our textbooks are being written by AI.
This problem can't really be avoided once we begin using AI to write, understand, explain, and disseminate information for us. It'll be writing more than blogs and SEO pages.
How long before we start readily using AI to write academic journals and scientific papers? It's really only a matter of time, if it's not already happening.
You need to separate “content” and “knowledge.” GenAI can create massive amounts of content, but the knowledge you give it to create that content is what matters and why RAG is the most important pattern right now.
From “known good” sources of knowledge, we can generate an infinite amount of content. We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
I agree there will be many issues keeping up with what “known good” is, but that’s always been an issue.
> We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
That's my entire point -- AI only generates content right now, but it will also be the source of content for training purposes soon. We need a "known good" human knowledge-base, otherwise generative AI will degenerate as AI generated content proliferates.
Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
> Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
That is training on content.
The future will have models pre-trained on content and tuned on corpuses of knowledge. The knowledge it is trained on will be a selling point for the model.
Think of it this way - if you want to update the model so it knows the latest news, does it matter if the news was AI generated if it was generated from details of actual events?
Is this like, the AI equivalent of “another layer will fix it” that crypto fans used?
“It’s ok bro, another model will fix, just please, one more ~layer~ ~agent~ model”
It’s all fun and games until you can’t reliably generate your base models anymore, because all your _base_ data is too polluted.
Let's not forget MS has a $10bn stake in the current crop of LLMs turning out to be as magic as they claim, so I'm sure they will do anything to ensure that happens.
“Struggle” at what? Struggle to have enough data to get smarter? Struggle to perform RAG and find legitimate sources?
I don’t think that we are going to get big improvements in LLMs without architecture improvements that need less data, and the current generation of models appears to be good enough at creating content from data/knowledge to train any future architectures we have with better synthetic datasets. Fortunately we have already seen examples of both of these “in the lab” and will probably see commercially sized models using some of the techniques in the coming months.
Well it will be multimodal, training and inferring on feeds of distributed sensing networks; radio, optical, acoustic, accelerometer, vibration, anything that's in your phone and much besides. I think the time of the text-only transformer has already passed.
What do you think the NSA is storing in that datacenter in Utah? Power point presentations? All that data is going to be trained into large models. Every phone call you ever had and every email you ever wrote. They are likely pumping enormous money into it as we speak, probably with the help of OpenAI, Microsoft and friends.
> What do you think the NSA is storing in that datacenter in Utah?
A buffer with several-days-worth of the entire internet's traffic for post-hoc decryption/analysis/filtering on interesting bits. All that tapped backbone/undersea cable traffic has to be stored somewhere.
As I understand it, they don't have the capability to essentially PCAP all that data, and the data wouldn't be that useful since most interesting traffic is encrypted as well. Instead they store the metadata around the traffic. Phone number X made an outgoing call to Y @ timestamp A, call ended at timestamp B, approximate location is Z, etc. Repeat that for internet IP addresses, do some analysis, and then you can build a pretty interesting web of connections and how they interact.
It's encrypted with an algorithm currently considered to be un-brute-forcible, sure. But if you presume we'll be able to decrypt today's encrypted transmissions in, say, 50-100 years, I'd record the encrypted transmissions if I were the NSA.
Of everyone's? No. But enough to store the signal messages of the President, down a couple of levels? I hope so. After I'm dead, I hope the messages between the President, his cabinet, and their immediate contacts that weren't previously accessible get released to historians.
Though it seems like something that could exist, who is doing the technical work/programming? It seems impossible to be in the industry and not have associates and colleagues either from or going to an operation like that. This is what I've always pondered about when it comes to any idea like this. The number of engineers at the pointy end of the tech spear is pretty small.
I am not sure why this would even be a conspiracy.
They would almost be failing in their purpose if they were not doing this.
On the other hand, this is an incredibly tough signal to noise problem. I am not sure we really understand what kind of scaling properties this would have as far as finding signals.
> Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content
What makes you think this is true? Yes, it's likely that the internet will have more AI generated content than real content eventually (if it hasn't happened already), but why do you think AI companies won't realize this and adjust their training methods?
Many AI content detectors have been retired because they are unreliable - AI can’t consistently identify AI-generated content. How would they adjust then?
I really really hope that five years from now we are not still using AI systems that behave the way today's do, based on probabilistic amalgamations of the whole of the internet. I hope we have designed systems that can reason about what they are learning and build reasonable mental models about what information is valuable and what can be discarded.
The only way out of this is robots that can go out in the world and collect data. Write in natural language what they observed which can then be used to train better LLMs.
I, for one, welcome the junk-data-ouroboros-meta-model-collapse. I think it'll force us out of this local maximum of a "moar data moar good" mindset, and give us, collectively, a chance to evaluate the effect these things have on our society. Some proverbial breathing room.
They've obviously been thinking about this for a while and are well aware of the pitfalls of training on AI based content. This is why they're making such aggressive moves into video, audio, other better and more robust ground forms of truth. Do you really think that they aren't aware of this issue?
It's funny: whenever people bring this up, they think AI companies are some mindless juggernauts who will simply train without caring about data quality at all and end up with worse models that they'll still for some reason release. Don't people realize that attention to data quality is the core differentiating feature that led companies like OpenAI to their market dominance in the first place?
Is it (I am not a worker in this space, so genuine question)?
My thoughts - I teach myself all the time. Self reflection with a loss function can lead to better results. Why can't the LLMs do the same (I grasp that they may not be programmed that way currently)? Top engines already do it with chess, go, etc. They exceed human abilities without human gameplay. To me that seems like the obvious and perhaps only route to general intelligence.
We as humans can recognize botnets. Why wouldn't the LLM? Sort of in a hierarchical boost - learn the language, learn about bots and botnets (by reading things like this discussion), learn to identify them, learn that their content doesn't help the loss function much, etc. I mean sure, if the main input is "as a language model I cannot..." and that is treated as 'gospel', that would lead to a poor LLM, but I don't think that is the future. LLMs are interacting with humans - how many times do they have to re-ask a question - that should be part of the learning/loss function. How often do they copy the text into their clipboard (weak evidence that the reply was good)? Do you see that text in the wild, showing it was used? If so, in what context? "Witness this horrible output of chatGPT: <blah>" should result in lower scores and suppression of that kind of thing.
I dream of the day when I have a local LLM (i.e. individualized, I don't care where the hardware is) as a filter on my internet. Never see a botnet again, or a stack overflow q/a that is just "this has already been answered" (just show me where it was answered), rewrite things to fix grammar, etc. We already have that with automatic translation of languages in your browser, but now we have the tools for something more intelligent than that. That sort of thing. Of course there will be an arms race, but in one sense who cares. If a bot is entirely indistinguishable from a person, is that a difference that matters? I can think of scenarios where the answer is an emphatic YES, but overall it seems like a net improvement.
A "honeypot" is a system designed to trap unsuspecting entrants. In this case, the website is designed to be found by web crawlers and to then trap them in never-ending linked sites that are all pointless. Other honeypots include things like servers with default passwords designed to be found by hackers so as to find the hackers.
What does trap mean here? I presumed crawlers had multiple (thousands or more) instances. One being 'trapped' on this web farm won't have any impact.
I would presume the crawlers have a queue-based architecture with thousands of workers. It's an amplification attack.
When a worker gets a webpage for the honeypot, it crawls it, scrapes it, and finds X links on the page where X is greater than 1. Those links get put on the crawler queue. Because there’s more than 1 link per page, each worker on the honeypot will add more links to the queue than it removed.
Other sites will eventually leave the queue, because they have a finite number of pages so the crawlers eventually have nothing new to queue.
Not on the honeypot. It has a virtually infinite number of pages. Scraping a page will almost deterministically increase the size of the queue (1 page removed, a dozen added per scrape). Because other sites eventually leave the queue, the queue eventually becomes just the honeypot.
OpenAI is big enough this probably wasn’t their entire queue, but I wouldn’t be surprised if it was a whole digit percentage. The author said 1.8M requests; I don’t know the duration, but that’s equivalent to 20 QPS for an entire day. Not a crazy amount, but not insignificant. It’s within the QPS Googlebot would send to a fairly large site like LinkedIn.
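A toy simulation of that amplification (branching factor and sizes picked arbitrarily) shows how the honeypot comes to dominate the frontier:

    from collections import deque

    # Ordinary sites contribute a finite number of pages; each honeypot page
    # links to ~12 fresh honeypot subdomains, so its branching factor is > 1.
    frontier = deque(["ordinary"] * 10_000 + ["honeypot"])
    fetched = 0

    while frontier and fetched < 100_000:
        page = frontier.popleft()
        fetched += 1
        if page == "honeypot":
            frontier.extend(["honeypot"] * 12)   # the queue grows faster than it drains

    share = sum(p == "honeypot" for p in frontier) / len(frontier)
    print(f"after {fetched:,} fetches, {share:.0%} of the frontier is honeypot pages")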
While the other comments are correct, I was alluding to a more subtle attack where you might try to indirectly influence the training of an LLM. Effectively, if OpenAI is crawling the open web for data to use for training, then if they don't handle sites like this properly their training dataset could be biased towards whatever content the site contains. Now in this instance this website was clearly not set up to target an LLM, but model poisoning (e.g. to insert backdoors) is an active area of research at the intersection of ML and security. Consider as a very simple example the tokenizer of previous GPTs that was biased by reddit data (as mentioned by other comments).
In this case there are >6bn pages with roughly zero value each. That could eat a substantial amount of time. It's unlikely to entirely trap a crawler, but a dumb crawler (as is implied here) will start crawling more and more pages, becoming very apparent to the operator of this honeypot (and therefore identifying new crawlers), and may take up more and more share of the crawl set.
What has it got to do with deduplication? I'm talking about crafting some kind of alternative (not necessarily duplicate) data. I agree some kind of post data collection cleaning/filtering of the data before training could potentially catch it. But maybe not!
The funny way to do this would be to use an LLM to generate the content you respond with. Have 2 smallish LLMs talk to each other about topics chosen at random and generate infinite nonsense pages that have a few hundred words.
Isn't it funny that all the "worthless" content out there on the internet is actually changing the world? Like how 4chan was mocked as being a cesspit for losers, but now everyone knows memes like Pepe the Frog and Wojak from there. And now this very comment and the billions of other comments on here, Reddit, Twitter, etc. that are regarded as a "waste of time" are being used by multi-billion-dollar companies to train the most powerful AI the world has ever seen. For free.
The moral of the story here is if you know something valuable, don’t share it online, because then everyone knows it.
Sharing it is often the only reason why it has value in the first place. If the authors of Gangnam Style or The Fox or any other silly viral thing hadn’t shared them they would have had zero value.
Anyone care to explain the purpose of Levine's https://www.web.sp.am site? Are the names randomly generated? Pardon my ignorance.
This is the type of stuff the news organisations should be publishing about "AI". Instead I keep reading or hearing people referring to training data with phrases like, "The sum of all human knowledge..." Quite shocking anyone would believe that.
I'm so tired of bots. A certain bot from Singapore started pulling all of the product images across multiple domains. Ok, whatever... Then we realized it never stopped. It made enough requests to download them 4-5x over and was still going. The AWS bill was not nice.
We added them to our robots.txt, but traffic didn't stop. I complained to their provider, who happened to also be AWS. Oh, the shock - they didn't care and recommended using robots.txt! AWS was making good money on both ends with this bot that apparently had more money to burn than we do.
Dude, you have caught the spider. Now use it. Start inserting whatever random junk you can until "astronaut riding a horse" looks more like Ronald McDonald driving a Ferrari.
I feel like inserting "free user-tagged vacation images" into my robots.txt then pointing the spider at an endless series of fabric swatches.
I don't think this message is about "protecting the site's data" quite so much as "hey guys, you're wasting a ton of time and network connections to make your model worse. Might wanna do something 'bout that".
No. I've run 'bot motels myself. I've got better things to do than curating a block list when they can just switch or renumber their infrastructure. Most notably I ran a 'bot motel on a compute-intensive web app; it was cheaper to burn bandwidth (and I slow-rolled that) than CPU cycles. Poisoning the datasets was just lulz.
I block ping from virtually all of Amazon; there are a few providers out there for which I block every naked SYN coming to my environment except port 25, and a smattering I block entirely. I can't prove that the pings even come from Amazon, even if the pongs are supposed to go there (although I have my suspicions that even if the pings don't come from the host receiving the pongs the pongs are monitored by the generator of the pings).
The point I'm making is that e.g. Amazon doesn't have the right to sell access to my compute and tragedy of the commons applies, folks. I offered them a live feed of the worst offenders, but all they want is pcaps.
(I've got a list of 50 prefixes, small enough to be a separate specialty firewall table. It misses a few things and picks up some dust bunnies. But contrast that to the 8,000 prefixes they publish in that JSON file. Spoiler alert: they won't admit in that JSON file that they own the entirety of 3.0.0.0/8. I'm willing to share the list TLP:RED/YELLOW, hunt me down and introduce yourself.)
> A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
If you're not walling off your content behind a login that contains terms that you agree to not scraping, then, scraping that site is 100% legal. Robots.txt isn't a legal document.
ROBOTS.TXT is an implied license, just like LICENSE.MD or LICENSE.TXT in any GitHub repo. There are decades of precedent that the ROBOTS.TXT file communicates what is and isn't allowed when scraping web content, and that you should check that file before scraping the rest of the site.
Willfully violating a written license provided in a predictable format absolutely is a civil legal violation. If your license says "You cannot use this to train AI", and an AI company scrapes it up and trains an AI on it anyway, even though you did your due diligence to communicate your terms, then you have a legal right to seek damages if you can prove that they are violating your license.
You're basically arguing that no reasonable web scraper would know about ROBOTS.TXT. That's bullshit, this method of web robot control has existed since 1996. It would be like violating the license terms of a GitHub project, and claiming that you didn't know that the LICENSE.MD / LICENSE.TXT file was a license you were expected to follow...
So someone could hypothetically perform a Microsoft-Tay-style attack against OpenAI models using infinite Potemkin subdomains generated on the fly on a $20 VPS? One could hypothetically use GenAI to create the biased pages, with repeated calls on how it'd be great to JOIN THE NAVY, on 27,000 different "subdomains".
Isn’t the legality of web scraping still..disputed?
There’s been a few projects I’ve wanted to work on involving scraping, but the idea that the entire thing could be shut down with legal threats seems to make some of the ideas infeasible.
It’s strange that OpenAI has created a ~$80B company (or whatever it is) using data gathered via scraping and as far as I’m aware there haven’t been any legal threats.
Was there some law that was passed that makes all web scraping legal or something?
Web scraping the public Internet is legal, at least in the U.S.
hiQ's public scraping of LinkedIn was ruled to be within their rights and not a violation of the CFAA. I imagine that's why LinkedIn has almost everything behind an auth wall now.
Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.
Some sites have terms at the bottom that prohibit scraping—but my understanding is that those aren't generally enforceable if the user doesn't have to take any action to accept or acknowledge them.
Most of these SaaSes have a "firehose" that, if you are big enough (i.e., can handle the firehose), you can subscribe to. These are like RSS feeds on crack for their entire SaaS.
> Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.
They're legally enforceable in the sense that the scraped services generally reserve the right to terminate the authorizing account at will, or legally enforceable in that allowing someone to scrape you with your credentials (or scraping using someone else's) qualifies as violating the CFAA?
There’s currently only one situation where scraping is almost definitely “not legal”:
If the information you’re scraping requires a login, and if in order to get a login you have to agree to a terms of service, and that terms of service forbids you from scraping — then you could have a bad day in civil court if the website you’re scraping decides to sue you.
If the data is publicly accessible without a login then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0] but you’d probably prevail in the end (…probably).
The “purpose” of robots.txt is to let crawlers know what they can do without getting IP-banned by the website operator they’re scraping. Generally, crawlers that ignore robots.txt and also act more like robots than humans will get an IP ban.
Also worth noting there's a long history of companies with deep pockets getting away with murder (sometimes literally) because litigation in a system that costs money to engage with inherently favors the wealthier party.
Also OpenAI's entire business model is relying on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.
> Isn’t the legality of web scraping still..disputed?
Are you suggesting it might be illegal to... write a program that connects to a web server and asks for a specific page, and then parses that page to see which resources it wants and which other pages it links to, and treats those links in some special fashion, differently from the text content of the page?
Especially given that a web server can be configured to respond to any request with a "403 Forbidden" response, if the server determines for any reason whatsoever that it does not want to give the client the page it requested?
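For concreteness, the program being described is roughly this much code; a stdlib-only sketch with a placeholder URL:

    # A minimal version of the program described above: fetch one page, then treat
    # the links in it differently from the rest of the text.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    page = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(page)
    print(parser.links)  # the links, handled "in some special fashion"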
The issue often isn't the scraping; it is how you use the scraped information afterwards. A lot of scraping is done with no reference to any licensing information the sites being read might publish, hence image-generating AI models regurgitating chunks of scraped stock images complete with watermarks. Though the scraping itself can count as a DoS if done aggressively enough.
Scraping publicly available data from websites is no different from web browsing, period. Companies stating otherwise in their T&Cs are a joke. Copyright infringement is a different game.
Scraping is legal. Always has been, always will be. Mainly because there's some fuzz around the edges of the definition. Is a web browser a scraper? It does a lot of the same things.
IIRC LinkedIn/Microsoft was trying to sue a company based on Computer Fraud and Abuse Act violations, claiming it was accessing information it was not allowed to. Courts ruled that that was bullshit. You can't put up a website and say "you can only look at this with your eyes". Recently-ish, though, the scraping company was found to be in violation of the User Agreement.
So as long as you don't have a user account with the site in question or the site does not have a User Agreement prohibiting scraping, you're golden.
The problem isn't the scraping anyway, it's the reproduction of the work. In that case, it really does matter how you acquired the material and what rights you have with regards use of that material.
The 9th Circuit Court of Appeals found that scraping publicly accessible content on the internet is legal.
If you publish something on a publicly served internet page, you're essentially broadcasting it to the world. You're putting something on a server which specifically communicates the bits and bytes of your media to the person requesting it without question.
You have every right to put whatever sort of barrier you'd like on the server, such as a sign in, a captcha, a puzzle, a cryptographic software key exchange mechanism, and so on. You could limit the access rights to people named Sam, requiring them to visit a particular real world address to provide notarized documentation confirming their identity in exchange for a unique 2fa fob and credentials for secure access (call it The Sams Club, maybe?)
If you don't put up a barrier, and you configure the server to deliver the content without restriction, or put your content on a server configured as such, then you are implicitly authorizing access to your content.
Little popups saying "by visiting this site, you agree to blah blah blah" are not valid. Courts made the analogy to a "gate-up/gate-down" mechanism. If you have a gate down, you can dictate the terms of engagement with your server and content. If you don't have a gate down, you're giving your content to whoever requests it.
You have control over the information you put online. You can choose which services and servers you upload to and interact with. Site operators and content producers can't decide that their intent or consent be withdrawn after the fact, as once something is published and served, the only restrictions on the scraper are how they use the information in turn.
Someone who's archived or scraped publicly served data can do whatever they want with the content within established legal boundaries. They can rewrite all the AP news articles with their own name as author, insert their name as the hero in all fanfic stories they download, and swap out every third word for "bubblegum" if they want. They just can't publish or serve that content, in turn, unless it meets the legal standards for fair use. Other exceptions to copyright apply in educational, archival, performance, and accessibility contexts, and under certain legal conditions such as the First Sale doctrine. Personal use of such media is effectively unlimited.
The legality of web scraping is not disputed in the US. Other countries have some silly ideas about post-hoc "well that's not what I meant" legal mumbo jumbo designed to assist politicians and rich people in whitewashing their reputations and pulling information offline using legal threats.
Aside from right to be forgotten inanity, content on the internet falls under the same copyright rules as books, magazines, or movies published on physical media. If Disney set up a stall at San Francisco city hall with copies of the Avengers movies on a thumb drive in a giant box saying "free, take one!", this would be roughly the same as publishing those movie files to a public Disney web page. The gate would be up. (The way they have it set up in real life, with their streaming services and licensed media access, the gate is down.)
So - leaving aside the legality of redistributing content - there's no restriction on web scraping public content, because the content was served intentionally to the software or entity that visited the site. It's up to the server operator to put barriers in place and to make content private. It's not rocket surgery, but platforms want to have their cake and eat it too, claiming a kind of control over publicly accessible content that is neither legal nor practical.
Twitter/X is a good example of impractical control, since the site has effectively become useless spam without signing in. Platforms have to play by the same rules as everyone else. If the gate is up, the content is fair game for scraping. The Supreme Court remanded the case to a lower court, which affirmed the gate-up/gate-down test for legality of access to content.
Since Google and other major corporations have a vested interest in the internet remaining open and free, and their search engines and other tech are completely dependent on the gate up/gate down status quo, it's unlikely that the law will change any time soon.
Tl;dr: Anything publicly served is legal to scrape. Microsoft attempted to sue someone for scraping LinkedIn, but the 9th Circuit court ruled in favor of access. If Microsoft's lawyers and money can't impede scraping, it's likely nobody will ever mount an effective challenge, and the gating doctrine is effectively the law of the land.
It seems like that, but they're also concerned about the crawlers that they catch in this web. So it seems like they're trying to help make crawlers better, or they're just generally curious about what systems are crawling around.
I have to say I don't really get the website either. If the author is against scrapers, why not serve massive dummy content that bloats their storage? Why all this linking? Maybe it's used to build (fake) PageRank credibility, and sometimes a link to one of the content-farm pages is referenced on other pages, so those get boosted?
You didn't really do the joke right. The second person is supposed to be doing the thing requested, but in a way the first person doesn't like. In this case, that would be AI companies contributing to a "free and open internet", but doing it """wrong""". But they're not contributing at all. The problem isn't that "free and open internet" is getting monkeys-pawed, it's all this closed proprietary stuff.
It summed up many comments and attitudes I see here on LLM's in under 30 words. In fact it's so clever it's one of the few posts here I'm certain wasn't produced by an LLM.
The factual elements work as a summary. But the entire aspect of just deserts, of getting what you asked for and being unhappy about it, of hypocrisy, isn't there. So by implying that kind of thing, the joke doesn't work.
The point is that free and open also equals free to be a predator. The joke is that people always think free and open just means what is good for them, but truly free is often free for a predator to eat it all and make it closed.
While you guys lecture and circle jerk about the joke you miss the real point.
Like fuck dudes are you this dense? Your comments are convincing me LLMs are smarter than people.
> truly free is often free for a predator to eat it all and make it closed
That doesn't apply when the word "open" is helping clarify.
A predator making open things closed isn't people asking for the wrong thing and getting egg on their face, it's people not getting what they asked for at all through no fault of their own.
Especially because this same kind of crawling works on a non-open internet.
> circle jerk
Do you say that any time someone disagrees with you, or is there something here in specific that you're reacting to?
An open thing can have protections against becoming non-open, and still be open.
But protections or not, someone saying "not like that" about the open web only properly applies when people or companies do something as part of their open web actions. If they do other bad things then "not like that" doesn't apply there.
Publishing a virus to the Web would be "not like that". Not publishing at all is in a different category.
Based on the page footer ("IECC ChurnWare") I believe this is a site designed to waste the time of web crawlers and tools that try to get root access on every domain. The robots.txt looks like this: https://ulysses-antoine-kurtis.web.sp.am/robots.txt
I don't see how this does much to keep bad actors away from other domains, but I can see why they don't want to give up the game for OpenAI to stop crawling.
Yup, that's my impression as well. He's just being nice by letting OpenAI know they have a problem. Usually this should be rewarded with a nice "hey, u guys have a bug" bounty, because not long ago some VP from OpenAI was lamenting that training their AI has, and this is his direct quote, an "eye watering" cost (on the order of millions of $$ per second).
> Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page.
The reason that bit is relevant is that robots.txt is only applicable to the current domain. Because each "page" is a different subdomain, the crawler needs to fetch the robots.txt for every single page request.
What the poster was suggesting is blocking them at a higher level - e.g. a user-agent block in an .htaccess or an IP block in iptables or similar. That would be a one-stop fix. It would also defeat the purpose of the website, however, which is to waste the time of crawlers.
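A hedged sketch of what such a one-stop block could look like, assuming the crawler keeps identifying itself as GPTBot in its User-Agent (the IP prefix below is a documentation placeholder, not a real OpenAI range):

    # .htaccess sketch (Apache mod_rewrite): refuse any request whose User-Agent
    # mentions GPTBot.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule .* - [F,L]

    # Or drop the bot's address space at the firewall instead, e.g.:
    #   iptables -A INPUT -s 203.0.113.0/24 -j DROP
    # (203.0.113.0/24 is a documentation prefix standing in for the bot's real ranges)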
The real question is how GPTBot is finding all the other subdomains. Currently the sites have GPTBot disallowed: https://www.web.sp.am/robots.txt
If GPTBot is compliant with the robots.txt specification then it can't read the URL containing the HTML to find the other subdomains.
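For reference, a GPTBot disallow of that sort is just two lines of standard robots.txt (not a verbatim quote of that site's file):

    User-agent: GPTBot
    Disallow: /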
Either:
1. GPTBot treats a disallow as a noindex but still requests the page itself. Note that Google doesn't treat a disallow as a noindex. They will still show your page in search results if they discover the link from other pages but they show it with a "No information is available for this page." disclaimer.
2. The site didn't have a GPTBot disallow until they noticed the traffic spike, and the bot has already discovered a couple million links that need to be crawled.
3. There is some other page out there on the internet that GPTBot discovered that links to millions of these subdomains. This seems possible and the subdomains really don't have any way to prevent a bot from requesting millions of robots.txt files. The only prevention here is to firewall the bot's IP range or work with the bot owners to implement better subdomain handling.
https://www.alignmentforum.org/posts/8viQEp8KBg2QSW4Yc/solid...
https://www.lesswrong.com/posts/LAxAmooK4uDfWmbep/anomalous-...