
What happens when AI-generated content organically becomes a large percentage of the user-generated corpus that is used for training, with no way to differentiate it from human-generated content?

We may look back on 2022/23 as the last time we had a clean training set. The fact that it was already polluted with SEO garbage will seem like a quaint problem by comparison.




I've been thinking about this a lot lately. I guess the 4D chess move is to think about what will be the next market that will benefit from whatever this effect causes.

I have a completely unworked thought that the music industry might offer some clues. As music becomes more and more entrenched in a codified process, i.e. pop-culture music, it all starts to become very generic.

I'm starting to notice that music is returning to the old way of discovery, where you hear music a third party is playing and ask them who it is. There is no longer a good bubble-to-the-top source like radio, because the corps have cannibalized themselves with production-line pop music that has to play on their vertical stack of companies or be licensed. The stuff I find now is usually some artist with fewer than 100k followers, much like (I'm guessing) back in the pre-internet days.


The current thinking is that it becomes a runaway feedback loop, a kind of lossy "compression" of our cultural output. Combined with normal bitrot, we'll eventually start losing components of our shared culture until civilization ends.
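
To make that concrete, here's a toy simulation of the loop (my own sketch, not from any paper; pure-stdlib Python, and the corpus size and generation count are arbitrary). Each "generation" of models is crudely modelled as just resampling what the previous generation produced:

    import random

    random.seed(0)

    # Start with 10,000 distinct "human-written" documents.
    corpus = list(range(10_000))

    for gen in range(1, 31):
        # Each generation trains only on the previous generation's
        # output, modelled here as resampling with replacement.
        corpus = random.choices(corpus, k=len(corpus))
        if gen % 5 == 0:
            survivors = len(set(corpus))
            print(f"gen {gen:2d}: {survivors:5d} distinct originals survive")

The set of surviving originals can only ever shrink from one generation to the next, and whatever becomes rare in one generation is the most likely to vanish in the next: the tails go first, which is exactly the lossy-compression effect.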


Well, hopefully we’ll still have toast and jam, so there’s that.

This suggests a sort of cultural event horizon beyond which AI cannot see. Five years from now, the way actual people talk and think about a new topic will be drowned out by AI content, creating a gulf between reality as people actually see it and the AI universe. Unfortunately, for many people, their reality is already slated to become AI-mediated.


It doesn't really seem to be much of a problem. Data quality is important, but there is plenty of incorrect information in the training data whether or not it's AI-generated. And training on synthetic data works fine.

I see a lot of uninformed people claiming this is going to doom AI, though.


I'm no AI expert, but I'm not uninformed. Can you explain what "informed" means in this context? I'm aware of the use of synthetic data for training in the context of a curated training effort, with human-in-the-loop (HITL) checking and other controls.

What we're talking about here is a world where 1) the corpus is polluted with an unknown quantity of unlabeled AI-generated content, and 2) reputational indicators (link counts, social-media shares/likes) may amplify AI-generated content and lead to similar content being generated intentionally.

At that point, can the incorrect info in the training set really be filtered out with noise reduction or other controls?
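
For concreteness, one standard "other control" in corpus cleaning is near-duplicate removal, which at least catches mass-produced content that repeats itself. A minimal MinHash sketch (my own toy, pure stdlib; the 64 hashes, 5-character shingles, and example sentences are all arbitrary choices):

    import hashlib

    def minhash_signature(text, num_hashes=64, shingle_len=5):
        # MinHash over character shingles: texts whose shingle sets
        # overlap heavily end up with matching signature slots.
        shingles = {text[i:i + shingle_len]
                    for i in range(len(text) - shingle_len + 1)}
        sig = []
        for seed in range(num_hashes):
            sig.append(min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{s}".encode(),
                                    digest_size=8).digest(), "big")
                for s in shingles))
        return sig

    def similarity(a, b):
        # Fraction of matching slots estimates Jaccard similarity.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    doc1 = "The quick brown fox jumps over the lazy dog by the river."
    doc2 = "The quick brown fox jumped over the lazy dog by the river."
    doc3 = "An entirely different sentence about language model training."

    s1, s2, s3 = map(minhash_signature, (doc1, doc2, doc3))
    print(similarity(s1, s2))   # high -> near-duplicates, drop one
    print(similarity(s1, s3))   # low  -> keep both

But that's exactly the limit of such controls: dedup and noise filters catch repetition and junk, not fluent, novel AI-generated text, which is the case here.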


I like to think of it as mad cow disease: that’s what you get when you feed animals to animals. Same with AI.


Not trying to make a point, just throwing another layer into the game.

There is a tribe (the Fore of Papua New Guinea) with a long history of ritual cannibalism that has developed a genetic mutation protecting them from kuru, their version of this prion disease.



