
> Specifically, Wikimedia Commons images in the PD-Art-100 category, because the images will be public domain in the US and the labels CC-BY-SA.

Doesn't the "BY" part of the license mean you have to provide attribution along with your model's output[0]? I suspect you'll end up with the equivalent of the GitHub Copilot problem: it may be prohibitive to correctly attribute each output, and listing the entire dataset in the attribution section won't fly either. And if you don't attribute, your model is no different from Stable Diffusion, Copilot, and the other hot models/tools: it's still a massive copyright violation and a copyright-laundering tool.

----

[0] - https://creativecommons.org/licenses/by-sa/4.0/



I feel quite strongly that there is a large difference between Stable Diffusion and Copilot. Given the size of the training set versus the number of parameters, it should be very difficult, if not impossible, for Stable Diffusion to memorize its inputs and, by extension, copy-paste them to produce its outputs. Copilot is trained on text and outputs text, and coding is inherently harder for an AI model, so I expect it memorizes large portions of its input and is copy-pasting in many cases to produce output. I therefore believe Copilot is doing "copyright laundering" but Stable Diffusion is not. Furthermore, I do not believe, for example, that artists should be able to copyright a "style" - but I would also like to see them not be negatively impacted by this. It's complicated.
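
To make that capacity argument concrete, here's a back-of-envelope calculation (the parameter count and dataset size below are rough public figures I'm assuming, not exact numbers):

    # Back-of-envelope memorization check. Assumed rough figures:
    # Stable Diffusion v1's UNet is ~860M parameters, trained on a
    # LAION subset of roughly 2 billion images.
    params = 860_000_000          # UNet parameters (approximate)
    bytes_per_param = 2           # fp16 storage
    train_images = 2_000_000_000  # training set size (approximate)

    capacity = params * bytes_per_param / train_images
    print(f"{capacity:.2f} bytes of model capacity per training image")
    # -> ~0.86 bytes/image. Even a 64x64 RGB thumbnail is ~12 KB
    # uncompressed, so verbatim memorization of the whole set can't fit;
    # only statistical regularities shared across many images do.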


Let me guess: you write more code than you make visual art?

Isn't it a bit anthropomorphic to compare the two algorithms by "how a human believes they work" instead of "what they're actually doing differently to the inputs to create the outputs"?

These are algorithms and we can look at how they work, so it feels like a cop-out to not do that.


If I were generating image labels, I would absolutely need to worry about that. However, since we're only generating images, we don't need to worry about bits of the labels getting into the output images.

The attribution requirement would absolutely apply to the model weights themselves, and if I ever get this thing to train at all, I plan to have a script that extracts attribution data from the Wikimedia Commons dataset and puts it in the model file. This is cumbersome, but possible. A copyright maximalist might also argue that the prompts you put into the model - or at least ones you've specifically engineered for the particular language the labels use - are derivative works of the original label set and need to be attributed, too. However, that's only a problem for people who want to share text prompts, and the labels themselves probably only have thin copyright[0].
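
For what it's worth, a minimal sketch of that script, assuming the dataset is a list of Commons "File:" titles (the extmetadata fields are real MediaWiki API features; the file list and output path are placeholders):

    # Sketch: pull artist/license metadata for Commons files and dump
    # it to a sidecar file (could instead go into safetensors metadata).
    import json
    import requests

    API = "https://commons.wikimedia.org/w/api.php"

    def fetch_attribution(title: str) -> dict:
        """Fetch attribution for one Commons file via the MediaWiki API."""
        resp = requests.get(API, params={
            "action": "query",
            "titles": title,
            "prop": "imageinfo",
            "iiprop": "extmetadata",
            "format": "json",
        })
        resp.raise_for_status()
        pages = resp.json()["query"]["pages"]
        meta = next(iter(pages.values()))["imageinfo"][0]["extmetadata"]
        return {
            "title": title,
            "artist": meta.get("Artist", {}).get("value", ""),
            "license": meta.get("LicenseShortName", {}).get("value", ""),
        }

    titles = ["File:Example.jpg"]  # placeholder; real list comes from the dataset
    attributions = [fetch_attribution(t) for t in titles]
    with open("attribution.json", "w") as f:
        json.dump(attributions, f, indent=2)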

Also, there's a particular feature of art generators that makes the attribution problem potentially tractable: CLIP itself was originally designed to do image classification. Guiding an image diffuser is just a cool hack. This means that we actually have a content ID system baked into our image generator! If you have a list of what images were fed into the CLIP trainer and their image-side outputs[1], then you can feed a generated image back into CLIP, compare distances in the output space against the original training set, and list the closest examples.
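
A minimal sketch of that lookup, assuming you've already dumped CLIP image embeddings for the training set (the checkpoint is OpenAI's public CLIP; the embedding and title files are hypothetical):

    # "Content ID" sketch: embed a generated image with CLIP and list
    # the nearest training images in embedding space.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Precomputed image-side embeddings, shape (N, 512), plus a parallel
    # list of Commons titles (both hypothetical files).
    train_embs = torch.load("train_clip_embeddings.pt")
    train_titles = open("train_titles.txt").read().splitlines()

    image = Image.open("generated.png")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)  # (1, 512)

    # Cosine similarity against the training set, then top-5 neighbors.
    sims = torch.nn.functional.cosine_similarity(emb, train_embs)
    for score, idx in zip(*sims.topk(5)):
        print(f"{score.item():.3f}  {train_titles[idx.item()]}")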

[0] A US copyright doctrine under which courts have held that collections of uncopyrightable elements can become copyrightable, but the resulting protection is said to be "thin".

[1] CLIP uses a "dual-headed" model architecture, in which an image classifier and a text classifier are co-trained to output into the same embedding space. This is what makes art generators work, and it even enables things like "zero-shot classification", where you ask it to classify things it was never trained on.
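
For a concrete picture of what the dual-head design buys you, here's the standard zero-shot pattern (the checkpoint is OpenAI's public CLIP; the image path and candidate labels are placeholders):

    # Zero-shot classification via the shared embedding space: score an
    # image against text labels it was never explicitly trained on.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["a baroque oil painting", "a woodcut print", "a marble sculpture"]
    inputs = processor(text=labels, images=Image.open("query.jpg"),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-text similarity
    probs = logits.softmax(dim=-1).squeeze()
    for label, p in zip(labels, probs):
        print(f"{p.item():.2f}  {label}")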


> If I were generating image labels, I would absolutely need to worry about that. However, since we're only generating images, we don't need to worry about bits of the labels getting into the output images.

To be accurate: SD does sometimes generate label-like text in its images, so we do need to worry ;)


> This is cumbersome, but possible.

This is not possible, because the model is far smaller than its training data. Just as any new image it generates is something it made up, any attributions it generated would also be made up.

CLIP can provide “similarity” scores but those are based on an arbitrary definition of “similarity”. Diffusion models don’t make collages.


The SA part (ShareAlike) is even more restrictive, as it imposes a license on the derivative work.

"— If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original"


How is that restrictive? Doesn't it just mean that any outputs of the model also fall under the same license so they can be used in public datasets?



