Eventually, OpenAI (and friends) are going to be training their models almost exclusively on AI-generated content, which is more often than not slightly incorrect when it comes to Q&A, and the quality of models trained on that content will quickly deteriorate. Right now, most internet content is written by humans. But in 5 years? Not so much. I think this is one of the big problems the AI space needs to solve quickly. Garbage in, garbage out, as the old saying goes.
The end state of training on web text has always been an ouroboros - primarily because of adtech incentives to produce low quality content at scale to capture micro pennies.
Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise.
Common Crawl alone is only a few hundred TB; I have more content than that on a NAS sitting in my office that I built for a few grand (granted, I'm a bit of a data hoarder). The fears that we have “used all the data” are incredibly unfounded.
"Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
> The fears that we have “used all the data” are incredibly unfounded.
The problem isn't whether we used all the real data or not, the problem is that it becomes increasingly difficult to distinguish real data from previous LLM outputs.
> "Content you're allowed to scrape from the internet" is MUCH smaller than what LLMs have actually scraped, but they don't care about copyright.
I don't know about that. If you scraped the same data and ran a search engine I think people would generally say you're fine. The copyright issue isn't the scraping step.
Gonna say you’re way off there. Once you decompress Common Crawl, index it for FTS, and put it on fast storage, you’re in for some serious pain, and that’s before you even put it in your ML pipeline.
Even RefinedWeb runs about 2TB once loaded into Postgres with tsvector columns, and that’s a substantially smaller dataset than Common Crawl.
It’s not just dumping a ton of zip files on your NAS; it’s making the data responsive and usable.
My guess is Mistral is so good for its size because they are doing some really precise pre-ingestion selection and sorting. You need FTS to do that work.
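For concreteness, here’s a minimal sketch of what “Postgres with tsvector columns” looks like in practice. The table layout, column names, and the query are made-up placeholders, not the actual RefinedWeb setup:

```python
# Hypothetical sketch: loading scraped documents into Postgres for full-text search.
import psycopg2

conn = psycopg2.connect("dbname=webcorpus")  # placeholder DSN
cur = conn.cursor()

# One row per document; the generated tsvector column is what makes FTS queries fast,
# and it is also where much of the extra on-disk size comes from.
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id      BIGSERIAL PRIMARY KEY,
        url     TEXT,
        body    TEXT,
        body_ts TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
    );
    CREATE INDEX IF NOT EXISTS documents_body_ts_idx ON documents USING GIN (body_ts);
""")
conn.commit()

# Pre-ingestion selection then becomes a ranked FTS query instead of a full scan.
query = "high quality tutorial"
cur.execute("""
    SELECT url
    FROM documents
    WHERE body_ts @@ plainto_tsquery('english', %s)
    ORDER BY ts_rank(body_ts, plainto_tsquery('english', %s)) DESC
    LIMIT 100;
""", (query, query))
print(cur.fetchall())
```

The generated tsvector column plus the GIN index is what makes the corpus queryable for selection work, and it’s also why the loaded dataset ends up so much bigger than the compressed dump.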
> Content you’re allowed and capable of scraping on the Internet is such a small amount of data, not sure why people are acting otherwise
YMMV depending on the value of "you" and your budget.
If you're Google, Amazon, or even lower-tier companies like Comcast, Yahoo, or OpenAI, you can scrape a massive amount of data (ignoring the "allowed" here, because TFA is about OpenAI disregarding robots.txt).
Not Llama, they’ve been really clear about that. Especially with DMA cross-joining provisions and various privacy requirements it’s really hard for them, same for Google.
However, Microsoft has been flying under the radar. If they gave all Hotmail and O365 data to OpenAI I’d not be surprised in the slightest.
I bet they are training their internal models on the data. I bet the real reason they are not training open source models on that data is fear of knowledge distillation: somebody else could distill LLaMa into other models. Once the data is in one AI, it can end up in any AI. This problem is of course exacerbated by open source models, but even closed models are not immune, as the Alpaca paper showed.
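For anyone unfamiliar, Alpaca-style distillation is roughly the following (a simplified sketch; the model name and seed instructions are placeholders, not any vendor's actual pipeline):

```python
# Hypothetical sketch of distillation: harvest a teacher model's answers to seed
# instructions, then fine-tune a smaller "student" model on those pairs.
import json
from openai import OpenAI

client = OpenAI()  # teacher model behind an OpenAI-compatible API

seed_instructions = [
    "Explain what a tsvector column is in Postgres.",
    "Summarize the trade-offs of training on synthetic data.",
]

with open("distilled_pairs.jsonl", "w") as f:
    for instruction in seed_instructions:
        resp = client.chat.completions.create(
            model="teacher-model",  # placeholder name
            messages=[{"role": "user", "content": instruction}],
        )
        answer = resp.choices[0].message.content
        # Each line becomes one supervised fine-tuning example for the student model.
        f.write(json.dumps({"instruction": instruction, "output": answer}) + "\n")
```

Once you can collect pairs like that at scale, the teacher's knowledge effectively leaks into whatever model you fine-tune on them, which is exactly the worry.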
> The end state of training on web text has always been an ouroboros
And when other mediums have been saturated with AI? Books, music, radio, podcasts, movies -- what then? Do we need a (curated?) unadulterated stockpile of human content to avoid the enshittification of everything?
I mean, you’re not wrong. I’ve been building some unrelated web search tech and have considered just indexing all the sites I care about and making my own “non-shit” search engine. Which really isn’t too hard if you only want to do, say, 10-50 sites. You can fit that on one 4TB NVMe drive on a local workstation.
I’m trying to work on monetization for my product now. The “personal Google” idea is really just an accidental byproduct of solving a much harder task. Not sure if people would pay for that alone.
Once we've got the first, making a billion is easy.
That said… are content creators collectively (all media, film and books as well as web) a thin tail or a fat tail?
I could easily believe most of the actual culture comes from 10k-100k people today, even if there are, IDK, ten million YouTubers or something (I have a YouTube channel with something like 14k views over 14 years; this isn't "culturally relevant" scale, and even if it had been, most of those views are for algorithmically generated music from 2010 that's a literal Markov chain).
It's true that there will no longer be any virgin forest to scrape, but it's also true that the content humans want will still be the most popular, promoted, curated, edited, etc. Even if it's impossible to train on organic content, it'll still be possible to get good content.
It is already solved. Look at how Microsoft trained Phi - they used existing models to generate synthetic data from textbooks. That allowed them to create a new dataset grounded in “fact” at a far higher quality than Common Crawl or others.
It looks less like an ouroboros and more like a bootstrapping problem.
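As a rough illustration of the synthetic-textbook idea (not Microsoft's actual pipeline; the model name and prompt are placeholders), the key move is grounding generation in a trusted source passage:

```python
# Hypothetical sketch: expand a trusted passage into textbook-style training text,
# so the synthetic data stays anchored to known-good facts.
from openai import OpenAI

client = OpenAI()

passage = "A tsvector is PostgreSQL's preprocessed document type for full-text search."

prompt = (
    "Using only the facts in the passage below, write a short textbook-style section "
    "with one worked example and one exercise.\n\nPassage:\n" + passage
)

resp = client.chat.completions.create(
    model="generator-model",  # placeholder
    messages=[{"role": "user", "content": prompt}],
)
synthetic_section = resp.choices[0].message.content  # goes into the training corpus
```

The generated text is new content, but the knowledge in it comes from the passage, which is the difference between this and blindly training on whatever the crawler finds.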
AI training on AI-generated content is a future problem. Using textbooks is a good idea, until our textbooks are being written by AI.
This problem can't really be avoided once we begin using AI to write, understand, explain, and disseminate information for us. It'll be writing more than blogs and SEO pages.
How long before we start readily using AI to write academic journals and scientific papers? It's really only a matter of time, if it's not already happening.
You need to separate “content” and “knowledge.” GenAI can create massive amounts of content, but the knowledge you give it to create that content is what matters and why RAG is the most important pattern right now.
From “known good” sources of knowledge, we can generate an infinite amount of content. We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
I agree there will be many issues keeping up with what “known good” is, but that’s always been an issue.
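A minimal sketch of the RAG pattern, assuming a tiny in-memory "known good" store and a model behind an OpenAI-compatible chat API; real systems swap the naive keyword retrieval for embeddings, but the shape is the same:

```python
# Toy RAG: retrieve "known good" knowledge first, then ground the generated content in it.
from openai import OpenAI

client = OpenAI()

knowledge_base = {
    "postgres-fts": "PostgreSQL full-text search uses tsvector columns and GIN indexes.",
    "common-crawl": "Common Crawl publishes periodic snapshots of crawled web pages.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    # Naive keyword-overlap retrieval; stands in for an embedding search.
    q_tokens = set(question.lower().split())
    scored = sorted(
        knowledge_base.values(),
        key=lambda doc: len(q_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="chat-model",  # placeholder
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return resp.choices[0].message.content

print(answer("How does Postgres full-text search work?"))
```

The point is that the model supplies the wording, while the curated store supplies the knowledge; keeping that store "known good" is the hard, ongoing part.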
> We can add more “known good” knowledge to the model by generating content about that knowledge and training on it.
That's my entire point -- AI only generates content right now, but it will also be the source of content for training purposes soon. We need a "known good" human knowledge-base, otherwise generative AI will degenerate as AI generated content proliferates.
Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
> Crawling the web, like in the case of the OP, isn't going to work for much longer. And books, video, and music are next.
That is training on content.
The future will have models pre-trained on content and tuned on corpuses of knowledge. The knowledge it is trained on will be a selling point for the model.
Think of it this way - if you want to update the model so it knows the latest news, does it matter if the news was AI generated if it was generated from details of actual events?
Is this like, the AI equivalent of “another layer will fix it” that crypto fans used?
“It’s ok bro, another model will fix it, just please, one more ~layer~ ~agent~ model”
It’s all fun and games until you can’t reliably generate your base models anymore, because all your _base_ data is too polluted.
Let’s not forget MS has a $10bn stake in the current crop of LLMs turning out to be as magic as they claim, so I’m sure they will do anything to ensure that happens.
“Struggle” at what? Struggle to have enough data to get smarter? Struggle to perform RAG and find legitimate sources?
I don’t think that we are going to get big improvements in LLMs without architecture improvements that need less data, and the current generation of models appears to be good enough at creating content from data/knowledge to train any future architectures we have with better synthetic datasets. Fortunately we have already seen examples of both of these “in the lab” and will probably see commercially sized models using some of the techniques in the coming months.
Well, it will be multimodal, training and inferring on feeds from distributed sensing networks: radio, optical, acoustic, accelerometer, vibration, anything that's in your phone and much besides. I think the time of the text-only transformer has already passed.
What do you think the NSA is storing in that datacenter in Utah? PowerPoint presentations? All that data is going to be trained into large models. Every phone call you ever had and every email you ever wrote. They are likely pumping enormous money into it as we speak, probably with the help of OpenAI, Microsoft, and friends.
> What do you think the NSA is storing in that datacenter in Utah?
A buffer with several days' worth of the entire internet's traffic for post-hoc decryption/analysis/filtering of the interesting bits. All that tapped backbone/undersea cable traffic has to be stored somewhere.
As I understand it, they don't have the capability to essentially PCAP all that data, and the data wouldn't be that useful anyway since most interesting traffic is encrypted. Instead they store the metadata around the traffic: phone number X made an outgoing call to Y at timestamp A, the call ended at timestamp B, the approximate location was Z, etc. Repeat that for internet IP addresses, do some analysis, and you can build a pretty interesting web of connections and how they interact.
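To illustrate what metadata alone buys you (a toy sketch; the record format and fields are made up), the call records become edges in a contact graph:

```python
# Toy "web of connections": call metadata as a graph, no call content needed.
import networkx as nx

call_records = [
    # (caller, callee, start, end, approx_location)
    ("X", "Y", "2024-01-01T12:00", "2024-01-01T12:05", "Z"),
    ("Y", "W", "2024-01-01T13:00", "2024-01-01T13:02", "Q"),
]

g = nx.MultiDiGraph()
for caller, callee, start, end, location in call_records:
    g.add_edge(caller, callee, start=start, end=end, location=location)

# Simple analysis: who talks to whom, and who sits at the center of the graph.
print(g.degree())               # contact counts per number
print(nx.degree_centrality(g))  # rough "importance" of each node
```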
Encrypted, yes, but with an algorithm currently considered to be un-brute-forcible. If you presume we'll be able to decrypt today's encrypted transmissions in, say, 50-100 years, I'd record the encrypted transmission if I were the NSA.
Of everyone's? No. But enough to store the Signal messages of the President, down a couple of levels? I hope so. After I'm dead, I hope the messages between the President, his cabinet, and their immediate contacts that weren't previously accessible get released to historians.
Though it seems like something that could exist, who is doing the technical work/programming? It seems impossible to be in the industry and not have associates and colleagues either coming from or going to an operation like that. This is what I've always pondered when it comes to any idea like this. The number of engineers at the pointy end of the tech spear is pretty small.
I am not sure why this would even be a conspiracy.
They would almost be failing in their purpose if they were not doing this.
On the other hand, this is an incredibly tough signal to noise problem. I am not sure we really understand what kind of scaling properties this would have as far as finding signals.
> Eventually, OpenAI (and friends) are going to be training their models on almost exclusively AI generated content
What makes you think this is true? Yes, it's likely that the internet will have more AI generated content than real content eventually (if it hasn't happened already), but why do you think AI companies won't realize this and adjust their training methods?
Many AI content detectors have been retired because they are unreliable - AI can’t consistently identify AI-generated content. How would they adjust then?
I really really hope that five years from now we are not still using AI systems that behave the way today's do, based on probabilistic amalgamations of the whole of the internet. I hope we have designed systems that can reason about what they are learning and build reasonable mental models about what information is valuable and what can be discarded.
The only way out of this is robots that can go out in the world, collect data, and write down in natural language what they observed, which can then be used to train better LLMs.
I, for one, welcome the junk-data-ouroboros-meta-model-collapse. I think it'll force us out of this local maximum of the "moar data moar good" mindset and give us, collectively, a chance to evaluate the effect these things have on our society. Some proverbial breathing room.
They've obviously been thinking about this for a while and are well aware of the pitfalls of training on AI-generated content. This is why they're making such aggressive moves into video, audio, and other better, more robust forms of ground truth. Do you really think they aren't aware of this issue?
It's funny whenever people bring this up: they think AI companies are some mindless juggernauts who will simply train without caring about data quality at all and end up with worse models that they'll still, for some reason, release. Don't people realize that attention to data quality is the core differentiating feature that led companies like OpenAI to their market dominance in the first place?
Is it (I am not a worker in this space, so genuine question)?
My thoughts - I teach myself all the time. Self-reflection with a loss function can lead to better results. Why can't LLMs do the same (I grasp that they may not be programmed that way currently)? Top engines already do it with chess, Go, etc. They exceed human abilities without human gameplay. To me that seems like the obvious, and perhaps only, route to general intelligence.
We as humans can recognize botnets. Why wouldn't the LLM? Sort of a hierarchical boost: learn the language, learn about bots and botnets (by reading things like this discussion), learn to identify them, learn that their content doesn't help the loss function much, etc. I mean, sure, if the main input is "as a language model I cannot..." and that is treated as 'gospel', that would lead to a poor LLM, but I don't think that is the future. LLMs are interacting with humans - how many times do users have to re-ask a question? That should be part of the learning/loss function. How often do they copy the text to their clipboard (weak evidence that the reply was good)? Do you see that text in the wild, showing it was used? If so, in what context? "Witness this horrible output of ChatGPT: <blah>" should result in lower scores and suppression of that kind of thing.
I dream of the day when I have a local LLM (i.e., individualized; I don't care where the hardware is) as a filter on my internet. Never see a botnet again, or a Stack Overflow Q&A that is just "this has already been answered" (just show me where it was answered); rewrite things to fix grammar, etc. We already have that with automatic translation of languages in your browser, but now we have the tools for something more intelligent than that. That sort of thing. Of course there will be an arms race, but in one sense, who cares? If a bot is entirely indistinguishable from a person, is that a difference that matters? I can think of scenarios where the answer is an emphatic YES, but overall it seems like a net improvement.
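A toy sketch of that "personal LLM filter" idea, assuming a locally hosted model behind an OpenAI-compatible endpoint (the URL, model name, and prompt are all placeholders):

```python
# Hypothetical page filter: score content with a local model before showing it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def looks_like_bot_spam(page_text: str) -> bool:
    resp = client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[{
            "role": "user",
            "content": "Answer YES or NO: is this page low-effort bot/SEO spam?\n\n"
                       + page_text[:4000],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Hook this into a browsing proxy or reader app: skip pages that come back YES,
# and optionally rewrite or summarize the ones that pass.
```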