Eventually, OpenAI (and friends) are going to be training their models almost exclusively on AI-generated content, which is more often than not slightly incorrect when it comes to Q&A, and the quality of models trained on that content will quickly deteriorate. Right now, most internet content is written by humans. But in 5 years? Not so much. I think this is one of the big problems the AI space needs to solve quickly. Garbage in, garbage out, as the old saying goes.
The end state of training on web text has always been an ouroboros - primarily because of adtech incentives to produce low quality content at scale to capture micro pennies.
Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise.
Common Crawl alone is only a few hundred TB; I have more content than that on a NAS sitting in my office that I built for a few grand (granted, I'm a bit of a data hoarder). The fears that we have “used all the data” are incredibly unfounded.
"Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
> The fears that we have “used all the data” are incredibly unfounded.
The problem isn't whether we used all the real data or not, the problem is that it becomes increasingly difficult to distinguish real data from previous LLM outputs.
> "Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
I don't know about that. If you scraped the same data and ran a search engine I think people would generally say you're fine. The copyright issue isn't the scraping step.
Gonna say you’re way off there. Once you decompress Common Crawl, index it for FTS, and put it on fast storage, you’re in for some serious pain, and that’s before you even put it in your ML pipeline.
Even RefinedWeb runs about 2TB once loaded into Postgres with tsvector columns, and that’s a substantially smaller dataset than Common Crawl.
It’s not just dumping a ton of zip files on your NAS; it’s making the data responsive and usable.
My guess is Mistral is so good for its size because they are doing some really precise pre-ingestion selection and sorting. You need FTS to do that work.
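For concreteness, here’s a minimal sketch of what “Postgres with tsvector columns” looks like in practice. The table layout, column names, and the query are made-up placeholders, not the actual RefinedWeb setup:

```python
# Hypothetical sketch: loading scraped documents into Postgres for full-text search.
import psycopg2

conn = psycopg2.connect("dbname=webcorpus")  # placeholder DSN
cur = conn.cursor()

# One row per document; the generated tsvector column is what makes FTS queries fast,
# and it is also where much of the extra on-disk size comes from.
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id      BIGSERIAL PRIMARY KEY,
        url     TEXT,
        body    TEXT,
        body_ts TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
    );
    CREATE INDEX IF NOT EXISTS documents_body_ts_idx ON documents USING GIN (body_ts);
""")
conn.commit()

# Pre-ingestion selection then becomes a ranked FTS query instead of a full scan.
query = "high quality tutorial"
cur.execute("""
    SELECT url
    FROM documents
    WHERE body_ts @@ plainto_tsquery('english', %s)
    ORDER BY ts_rank(body_ts, plainto_tsquery('english', %s)) DESC
    LIMIT 100;
""", (query, query))
print(cur.fetchall())
```

The generated tsvector column plus the GIN index is what makes the corpus queryable for selection work, and it’s also why the loaded dataset ends up so much bigger than the compressed dump.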
> Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise
YMMV depending on the value of "you" and your budget.
If you're Google, Amazon, or even lower-tier companies like Comcast, Yahoo, or OpenAI, you can scrape a massive amount of data (ignoring the "allowed" here, because TFA is about OpenAI disregarding robots.txt).
Not Llama, they’ve been really clear about that. Especially with DMA cross-joining provisions and various privacy requirements it’s really hard for them, same for Google.
However, Microsoft has been flying under the radar. If they gave all Hotmail and O365 data to OpenAI I’d not be surprised in the slightest.
I bet they are training their internal models on the data. I bet the real reason they are not training open source models on that data is fear of knowledge distillation: somebody else could distill LLaMa into other models. Once the data is in one AI, it can end up in any AI. This problem is of course exacerbated by open source models, but even closed models are not immune, as the Alpaca paper showed.
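For anyone unfamiliar, Alpaca-style distillation is roughly the following (a simplified sketch; the model name and seed instructions are placeholders, not any vendor's actual pipeline):

```python
# Hypothetical sketch of distillation: harvest a teacher model's answers to seed
# instructions, then fine-tune a smaller "student" model on those pairs.
import json
from openai import OpenAI

client = OpenAI()  # teacher model behind an OpenAI-compatible API

seed_instructions = [
    "Explain what a tsvector column is in Postgres.",
    "Summarize the trade-offs of training on synthetic data.",
]

with open("distilled_pairs.jsonl", "w") as f:
    for instruction in seed_instructions:
        resp = client.chat.completions.create(
            model="teacher-model",  # placeholder name
            messages=[{"role": "user", "content": instruction}],
        )
        answer = resp.choices[0].message.content
        # Each line becomes one supervised fine-tuning example for the student model.
        f.write(json.dumps({"instruction": instruction, "output": answer}) + "\n")
```

Once you can collect pairs like that at scale, the teacher's knowledge effectively leaks into whatever model you fine-tune on them, which is exactly the worry.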
> The end state of training on web text has always been an ouroboros
And when other mediums have been saturated with AI? Books, music, radio, podcasts, movies -- what then? Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?
I mean, you’re not wrong. I’ve been building some unrelated web search tech and have considered just indexing all the sites I care about and making my own “non-shit” search engine. Which really isn’t too hard if you only want to do, say, 10-50 sites. You can fit that on one 4TB NVMe drive on a local workstation.
I’m trying to work on monetization for my product now. The “personal Google” idea is really just an accidental byproduct of solving a much harder task. Not sure if people would pay for that alone.
Once we've got the first, making a billion is easy.
That said… are content creators collectively (all media, film and books as well as web) a thin tail or a fat tail?
I could easily believe most of the actual culture comes from 10k-100k people today, even if there are, IDK, ten million YouTubers or something (I have a YouTube channel with something like 14k views over 14 years; this isn't "culturally relevant" scale, and even if it had been, most of those views are for algorithmically generated music from 2010 that's a literal Markov chain).
It's true that there will no longer be any virgin forest to scrape, but it's also true that the content humans want will still be the most popular, promoted, curated, edited, etc. Even if it's impossible to train on organic content, it'll still be possible to get good content.
It is already solved. Look at how Microsoft trained Phi - they used existing models to generate synthetic data from textbooks. That allowed them to create a new dataset grounded in “fact” at a far higher quality than Common Crawl or others.
It looks less like an ouroboros and more like a bootstrapping problem.
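As a rough illustration of the synthetic-textbook idea (not Microsoft's actual pipeline; the model name and prompt are placeholders), the key move is grounding generation in a trusted source passage:

```python
# Hypothetical sketch: expand a trusted passage into textbook-style training text,
# so the synthetic data stays anchored to known-good facts.
from openai import OpenAI

client = OpenAI()

passage = "A tsvector is PostgreSQL's preprocessed document type for full-text search."

prompt = (
    "Using only the facts in the passage below, write a short textbook-style section "
    "with one worked example and one exercise.\n\nPassage:\n" + passage
)

resp = client.chat.completions.create(
    model="generator-model",  # placeholder
    messages=[{"role": "user", "content": prompt}],
)
synthetic_section = resp.choices[0].message.content  # goes into the training corpus
```

The generated text is new content, but the knowledge in it comes from the passage, which is the difference between this and blindly training on whatever the crawler finds.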
AI training on AI-generated content is a future problem. Using textbooks is a good idea, until our textbooks are being written by AI.
This problem can't really be avoided once we begin using AI to write, understand, explain, and disseminate information for us. It'll be writing more than blogs and SEO pages.
How long before we start readily using AI to write academic journals and scientific papers? It's really only a matter of time, if it's not already happening.
You need to separate “content” and “knowledge.” GenAI can create massive amounts of content, but the knowledge you give it to create that content is what matters and why RAG is the most important pattern right now.
From “known good” sources of knowledge, we can generate an infinite amount of content. We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
I agree there will be many issues keeping up with what “known good” is, but that’s always been an issue.
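A minimal sketch of the RAG pattern, assuming a tiny in-memory "known good" store and a model behind an OpenAI-compatible chat API; real systems swap the naive keyword retrieval for embeddings, but the shape is the same:

```python
# Toy RAG: retrieve "known good" knowledge first, then ground the generated content in it.
from openai import OpenAI

client = OpenAI()

knowledge_base = {
    "postgres-fts": "PostgreSQL full-text search uses tsvector columns and GIN indexes.",
    "common-crawl": "Common Crawl publishes periodic snapshots of crawled web pages.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    # Naive keyword-overlap retrieval; stands in for an embedding search.
    q_tokens = set(question.lower().split())
    scored = sorted(
        knowledge_base.values(),
        key=lambda doc: len(q_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="chat-model",  # placeholder
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return resp.choices[0].message.content

print(answer("How does Postgres full-text search work?"))
```

The point is that the model supplies the wording, while the curated store supplies the knowledge; keeping that store "known good" is the hard, ongoing part.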
> We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
That's my entire point -- AI only generates content right now, but it will also be the source of content for training purposes soon. We need a "known good" human knowledge-base, otherwise generative AI will degenerate as AI generated content proliferates.
Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
> Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
That is training on content.
The future will have models pre-trained on content and tuned on corpuses of knowledge. The knowledge it is trained on will be a selling point for the model.
Think of it this way - if you want to update the model so it knows the latest news, does it matter if the news was AI generated if it was generated from details of actual events?
Is this like, the AI equivalent of “another layer will fix it” that crypto fans used?
“It’s ok bro, another model will fix it, just please, one more ~layer~ ~agent~ model”
It’s all fun and games until you can’t reliably generate your base models anymore, because all your _base_ data is too polluted.
Let’s not forget MS has a $10bn stake in the current crop of LLMs turning out to be as magic as they claim, so I’m sure they will do anything to ensure that happens.
“Struggle” at what? Struggle to have enough data to get smarter? Struggle to perform RAG and find legitimate sources?
I don’t think that we are going to get big improvements in LLMs without architecture improvements that need less data, and the current generation of models appears to be good enough at creating content from data/knowledge to train any future architectures we have with better synthetic datasets. Fortunately we have already seen examples of both of these “in the lab” and will probably see commercially sized models using some of the techniques in the coming months.
Well, it will be multimodal, training and inferring on feeds from distributed sensing networks: radio, optical, acoustic, accelerometer, vibration, anything that's in your phone and much besides. I think the time of the text-only transformer has already passed.
What do you think the NSA is storing in that datacenter in Utah? PowerPoint presentations? All that data is going to be trained into large models. Every phone call you ever had and every email you ever wrote. They are likely pumping enormous money into it as we speak, probably with the help of OpenAI, Microsoft, and friends.
> What do you think the NSA is storing in that datacenter in Utah?
A buffer with several days' worth of the entire internet's traffic for post-hoc decryption/analysis/filtering of the interesting bits. All that tapped backbone/undersea cable traffic has to be stored somewhere.
As I understand it, they don't have the capability to essentially PCAP all that data, and the data wouldn't be that useful anyway since most interesting traffic is encrypted. Instead they store the metadata around the traffic: phone number X made an outgoing call to Y at timestamp A, the call ended at timestamp B, the approximate location was Z, etc. Repeat that for internet IP addresses, do some analysis, and you can build a pretty interesting web of connections and how they interact.
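To illustrate what metadata alone buys you (a toy sketch; the record format and fields are made up), the call records become edges in a contact graph:

```python
# Toy "web of connections": call metadata as a graph, no call content needed.
import networkx as nx

call_records = [
    # (caller, callee, start, end, approx_location)
    ("X", "Y", "2024-01-01T12:00", "2024-01-01T12:05", "Z"),
    ("Y", "W", "2024-01-01T13:00", "2024-01-01T13:02", "Q"),
]

g = nx.MultiDiGraph()
for caller, callee, start, end, location in call_records:
    g.add_edge(caller, callee, start=start, end=end, location=location)

# Simple analysis: who talks to whom, and who sits at the center of the graph.
print(g.degree())               # contact counts per number
print(nx.degree_centrality(g))  # rough "importance" of each node
```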
Encrypted, yes, but with an algorithm currently considered to be un-brute-forcible. If you presume we'll be able to decrypt today's encrypted transmissions in, say, 50-100 years, I'd record the encrypted transmission if I were the NSA.
Of everyone's? No. But enough to store the Signal messages of the President, down a couple of levels? I hope so. After I'm dead, I hope the messages between the President, his cabinet, and their immediate contacts that weren't previously accessible get released to historians.
Though it seems like something that could exist, who is doing the technical work/programming? It seems impossible to be in the industry and not have associates and colleagues either coming from or going to an operation like that. This is what I've always pondered when it comes to any idea like this. The number of engineers at the pointy end of the tech spear is pretty small.
I am not sure why this would even be a conspiracy.
They would almost be failing in their purpose if they were not doing this.
On the other hand, this is an incredibly tough signal to noise problem. I am not sure we really understand what kind of scaling properties this would have as far as finding signals.
> Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content
What makes you think this is true? Yes, it's likely that the internet will have more AI generated content than real content eventually (if it hasn't happened already), but why do you think AI companies won't realize this and adjust their training methods?
Many AI content detectors have been retired because they are unreliable - AI can’t consistently identify AI-generated content. How would they adjust then?
I really really hope that five years from now we are not still using AI systems that behave the way today's do, based on probabilistic amalgamations of the whole of the internet. I hope we have designed systems that can reason about what they are learning and build reasonable mental models about what information is valuable and what can be discarded.
The only way out of this is robots that can go out in the world, collect data, and write down in natural language what they observed, which can then be used to train better LLMs.
I, for one, welcome the junk-data-ouroboros-meta-model-collapse. I think it'll force us out of this local maximum of the "moar data moar good" mindset and give us, collectively, a chance to evaluate the effect these things have on our society. Some proverbial breathing room.
They've obviously been thinking about this for a while and are well aware of the pitfalls of training on AI-generated content. This is why they're making such aggressive moves into video, audio, and other better, more robust forms of ground truth. Do you really think they aren't aware of this issue?
It's funny whenever people bring this up: they think AI companies are some mindless juggernauts who will simply train without caring about data quality at all and end up with worse models that they'll still, for some reason, release. Don't people realize that attention to data quality is the core differentiating feature that led companies like OpenAI to their market dominance in the first place?
Is it (I am not a worker in this space, so genuine question)?
My thoughts - I teach myself all the time. Self-reflection with a loss function can lead to better results. Why can't LLMs do the same (I grasp that they may not be programmed that way currently)? Top engines already do it with chess, Go, etc. They exceed human abilities without human gameplay. To me that seems like the obvious, and perhaps only, route to general intelligence.
We as humans can recognize botnets. Why wouldn't the LLM? Sort of a hierarchical boost: learn the language, learn about bots and botnets (by reading things like this discussion), learn to identify them, learn that their content doesn't help the loss function much, etc. I mean, sure, if the main input is "as a language model I cannot..." and that is treated as 'gospel', that would lead to a poor LLM, but I don't think that is the future. LLMs are interacting with humans - how many times do users have to re-ask a question? That should be part of the learning/loss function. How often do they copy the text to their clipboard (weak evidence that the reply was good)? Do you see that text in the wild, showing it was used? If so, in what context? "Witness this horrible output of ChatGPT: <blah>" should result in lower scores and suppression of that kind of thing.
I dream of the day when I have a local LLM (i.e., individualized; I don't care where the hardware is) as a filter on my internet. Never see a botnet again, or a Stack Overflow Q&A that is just "this has already been answered" (just show me where it was answered); rewrite things to fix grammar, etc. We already have that with automatic translation of languages in your browser, but now we have the tools for something more intelligent than that. That sort of thing. Of course there will be an arms race, but in one sense, who cares? If a bot is entirely indistinguishable from a person, is that a difference that matters? I can think of scenarios where the answer is an emphatic YES, but overall it seems like a net improvement.
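A toy sketch of that "personal LLM filter" idea, assuming a locally hosted model behind an OpenAI-compatible endpoint (the URL, model name, and prompt are all placeholders):

```python
# Hypothetical page filter: score content with a local model before showing it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def looks_like_bot_spam(page_text: str) -> bool:
    resp = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{
            "role": "user",
            "content": "Answer YES or NO: is this page low-effort bot/SEO spam?\n\n"
                       + page_text[:4000],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Hook this into a browsing proxy or reader app: skip pages that come back YES,
# and optionally rewrite or summarize the ones that pass.
```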