"The biggest models want to train on literally every piece of human-written text ever written"

They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.



Even restricted to that narrower definition, the major commercial model companies wouldn't be able to afford to license all their high-quality human text.

OpenAI is Uber with a slightly less ethically despicable CEO.

It knows it's flouting the spirit of copyright law -- it's just hoping it can bootstrap quickly enough to make the question irrelevant.

If every commercial AI company that couldn't prove training-data provenance were bankrupted tomorrow, I wouldn't shed an ethical tear. Live by the sword, die by the sword.


Bold idea, requiring startups to proactively prove they have not broken the law. Should we apply it to all tech startups? Let’s see silicon startups prove they have not stolen trade secrets!



