Hacker News

This seems to be an untested legal question, but it's presumptuous of AI researchers to assume it's fine. Maybe using the dataset to train an AI is fine, but I'd say it's pretty clear that distributing the training dataset itself isn't.

Nobody has sued yet though.




>Maybe using the dataset to train an AI is fine

If this is true, how long until people start purposely training AI models to overfit so that they return their inputs verbatim?


I think a reasonable first guess for a legal interpretation is "would it be legal if a human did that?"

A human who reads an encyclopedia and uses that knowledge to answer questions isn't violating copyright, so an AI doing the same is probably fine. Likewise, looking at all of Picasso's paintings and painting something new in his style isn't a violation, so training an AI on real Picassos to make Picasso-like paintings is probably fine too.

However, looking at a painting and perfectly replicating it is a copyright violation, as is reading a text and then writing down the same text. Both are just copies, not new works. So intentionally overfitting an AI so that it returns its inputs with minimal changes probably makes the outputs subject to the copyright of the training data's owners.
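To make the "overfit model returns its inputs" idea concrete, here's a toy sketch (not how Copilot or any real LLM works): a character-level n-gram model trained on a single document is maximally overfit, and greedy generation reproduces the training text verbatim. All names here are made up for illustration.

```python
from collections import defaultdict

def train(text, n=3):
    """Map each n-character context to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - n):
        model[text[i:i + n]].append(text[i + n])
    return model

def generate(model, seed, length, n=3):
    """Greedily extend the seed using the first observed continuation."""
    out = seed
    while len(out) < length:
        continuations = model.get(out[-n:])
        if not continuations:
            break
        out += continuations[0]
    return out

# One training document: every context has exactly one continuation,
# so generation just replays the "training data".
training_text = "We hold these truths to be self-evident."
model = train(training_text)
print(generate(model, training_text[:3], len(training_text)))
# → prints the training text verbatim
```

With a large, varied corpus the same model would blend many continuations; the memorization here comes purely from the degenerate training set, which is the intuition behind the copyright worry about overfit models.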


We're already there. Copilot already does this, as various GPL-violation probes have shown.

Whether or not it's illegal is still being debated. The FSF says it absolutely is, but I suspect it will ultimately be settled in court (though I'm just speculating).



