Hacker News

Quite right, but…

> That argument also makes little sense when you consider that the model is a couple gigabytes itself, it can't memorize 240TB of data, so it "learned".

The matter is really very nuanced and trivialising it that way is unhelpful.

If I recompress 240TB of images as super-low-quality JPEGs and manage to zip them up as a single file that is significantly smaller than 240TB (because you can), does the fact that they are not pixel-perfect matches for the original images mean I’m not violating copyright?

If an AI model can generate images statistically significantly similar to the training data from a trivially guessable prompt (“a picture by xxx” or whatever), then it’s entirely arguable that the model is similarly infringing.

The exact compression algorithm, be it a model, JPEG, or ZIP, is irrelevant to that point.
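To make “statistically significantly similar” concrete: one common way to detect near-duplicate images is a perceptual hash, which survives lossy re-encoding. This is my own illustration, not something the commenter specified; a minimal average-hash sketch:

```python
# Sketch: average-hash (aHash) over an 8x8 grayscale image.
# Downsample-and-threshold hashing survives mild lossy re-encoding,
# so near-duplicates land at small Hamming distance.

def average_hash(pixels):
    """pixels: 8x8 grid of grayscale values (0-255). Returns a 64-bit int."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# A synthetic gradient "image", a mildly degraded copy, and an unrelated one.
original = [[(r * 8 + c) * 4 % 256 for c in range(8)] for r in range(8)]
recompressed = [[min(255, p + 3) for p in row] for row in original]  # brightness shift
unrelated = [[255 - p for p in row] for row in original]             # inverted

assert hamming(average_hash(original), average_hash(recompressed)) <= 5
assert hamming(average_hash(original), average_hash(unrelated)) > 20
```

The point of the sketch: “not pixel-perfect” is easy to achieve while still being measurably the same picture, which is exactly the parent’s objection to the compression defence.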

It’s entirely reasonable to say: if this is so good at learning, why don’t you train it without the ArtStation dataset?

…because if it’s just learning techniques, generic public domain art should be fine, right? Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski” images without being trained on actual images by Greg?

If not, then it’s not just learning technique, it’s copying.

So, tl;dr: there’s plenty of scope for trying to train a model on an ethically sourced dataset, and for investigating technique-learning versus copying in generative models.

It is 100% not something we can just brush off.



> If I recompress 240TB as super low quality jpgs and manage to zip them up as single file that is significantly smaller than 240TB (because you can), does the fact they are not pixel perfect matches for the original images mean you’re not violating copyright?

If you compress them down to two or three bytes each, which is what the process effectively does, then yes, I would argue that we stand to lose a LOT as a technological society by enforcing existing copyright laws on IP that has undergone such an extreme transformation.
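As a back-of-the-envelope check on that “two or three bytes each” figure (every number below is an illustrative assumption: a ~2 GiB model, the thread’s ~240 TiB dataset, ~100 KiB per image on average):

```python
# Rough arithmetic behind the "few bytes of model per image" claim.
# All figures are illustrative assumptions, not measured values.

model_bytes = 2 * 1024**3        # assume a ~2 GiB model
dataset_bytes = 240 * 1024**4    # the ~240 TiB training set cited upthread
avg_image_bytes = 100 * 1024     # assume ~100 KiB per training image

n_images = dataset_bytes / avg_image_bytes
bytes_per_image = model_bytes / n_images

print(f"~{n_images:.2e} images, ~{bytes_per_image:.2f} bytes of model per image")
```

Under those assumptions it works out to roughly one byte of model weight per training image, which is why literal, lossless memorisation of the whole dataset is hard to defend.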


Maybe?

Does that mean it’s worthless to try to train an ethical art model?

Is it not helpful to show that you can train a model that can generate art without training it on copyrighted material?

Maybe it’s good. Maybe not. Who cares if people waste their money doing it? Why do you care?

It certainly feels awfully convenient that there are no ethically trained models, because it means no one can say “you should be using these; you have a choice to do the right thing, if you want to”.

I’m not judging; but what I will say is that there’s only one party that benefits from trying to avoid and discourage the training of ethical models:

…and that is the people currently making and using unethically trained models.


We don't agree on what "ethical" means here, so I don't see a lot of room for discussion until that happens. Why do you care if people waste computing time programming their hardware to study art and create new art based on what it learns? Who is being harmed? More art in the world is a good thing.


> Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?

You couldn't teach a human to do that without them having seen Greg's art. There are elements of stroke, palette, lighting, and composition that can't be fully captured by natural language (short of encoding an ML model, which defeats the point).


Copyright says you cannot reproduce, distribute, etc. a work without consent from the author, whatever the means. The copy doesn't need to be exact, only sufficiently close.

However, copyright doesn't prevent someone from looking at the work and studying it, even learning it by heart. Infringement comes only if that person makes a reproduction of the work. There are also provisions for fair use, etc.


> …because if it’s just learning techniques, generic public domain art should be fine right? Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?

Is it fair to hold it to a higher standard than humans, though? To some degree it's the whole "xxx..... on a computer!" thing all over again, if we go that way.


> Can’t you just engineer the prompting better so that it generates “by Greg Rutkowski“ images without being trained on actual images by Greg?

Can you please rewrite this in the writing style of Socrates?


> The matter is really very nuanced and trivialising it that way is unhelpful.

Harping on copyright in the Age of Diffusion Models is as unhelpful (for artists) as protesting against a tsunami. It's time to move up the ladder.

ML engineers face a similar predicament: GPT-3-like models can solve, on the first try and without specialised training, tasks that took a whole team a few years of work. Who dares still use LSTMs now like it's 2017? Moving up the ladder, learning to prompt and fine-tune ready-made models, is the only way forward for ML engineers.

The reckoning is coming for programmers and for writers as well. Even scientific papers can be generated by LLMs now; see the Galactica scandal, where some detractors said it would empower people to write fake papers. It also has the best ability to generate appropriate citations.

The conclusion is that we need to give up some of the human-only tasks and hop on the new train.




