The problem with this argument is that it's founded on how the AI is used, not how it is made. It's not a compelling reason to ban the tool; it's a compelling reason to regulate its use.
Copilot can produce code verbatim, but it doesn't unless you specifically set up a situation to test it. It requires things like "include the exact text of a comment that exists in training data" or "prefix your C functions the same way as the training data does".
In everyday use, my experience has been that Copilot draws extensively from files I've opened in my codebase. If I give Copilot a function body to fill in within a class I've already written, it will use my internal APIs (which aren't even hosted on GitHub) correctly, as long as there are 1-2 examples in the file and I'm using a consistent naming convention. This isn't copypasta; it really does have a clear understanding of the semantics of my code.
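To make that concrete, here's a rough sketch of the kind of completion I mean. The class and the internal API below are made-up stand-ins, not my actual code; the point is that given one method that already uses the internal client, Copilot will typically fill the second stub by reusing that same client and naming convention rather than pasting anything from its training data.

    # Hypothetical internal API -- a stand-in for code that never left my machine.
    class BillingClient:
        def fetch_invoice(self, invoice_id: str) -> dict:
            return {"id": invoice_id, "total_cents": 1999}

    class InvoiceService:
        def __init__(self, client: BillingClient):
            self.client = client

        def get_invoice_total_cents(self, invoice_id: str) -> int:
            # Existing example in the file: uses the internal client directly.
            invoice = self.client.fetch_invoice(invoice_id)
            return invoice["total_cents"]

        def get_invoice_total_dollars(self, invoice_id: str) -> float:
            # The kind of body Copilot suggests for a stub like this: it follows
            # the local pattern above instead of reaching for training data.
            invoice = self.client.fetch_invoice(invoice_id)
            return invoice["total_cents"] / 100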
This is why I'm not in favor of penalizing Microsoft and GitHub for creating Copilot. I think there needs to be some regulation on how it is used to make sure that people aren't treating it as a repository of copypasta, but the AI itself is pretty clearly capable of producing non-infringing work, and indeed that seems to be the norm.
Please let's not start dictating how people should use a piece of software. It would be like "regulating" Microsoft Word just because people might use it to duplicate copyrighted works.
I'm not saying we should regulate the software, I'm saying we need some rigorous method of ensuring that using the AI tools doesn't put you in jeopardy of accidental copyright infringement.
We most likely don't need new laws, because infringement is infringement and how you made the infringing work is irrelevant. Accidental infringement is already illegal in the US.
I would argue that we _do_ need new laws. AI-generated code is quite different from any other literary work - after all, it was not created by a human.
My own personal opinion is that AI-generated code (or pictures, in the case of the article) should fall under a new category of literary works, such that it does not receive copyright protection but also does not violate existing copyright.
This is meaningless though. The majority of AI-generated art you see out there is either hand-tweaked or post-processed or both. There's human input involved, and drawing a line is going to absolutely backfire.
If you presented both the generated image and the "original" to a jury of peers (or even a panel of experts in the field), they would be able to make a determination as to whether the generated image violated the copyright of the presented "original".
Humans tweaking the image is immaterial to this determination - if the human tweaked it so that it no longer seems to violate copyright, then that same panel would make the same determination.
You are arguing that AI-generated means no copyright protection. So you can't tweak it to "not violate copyright" because there literally isn't any.
Of course, you have no way to prove whether any image was or was not generated by AI, so welcome to a new scam for law firms to aggressively sue artists, claiming they suspect AI was used in their works.
The vast majority of paintings weren't created by a human either, but by a paintbrush. We should really ban those too. Just think of all the poor finger-painters who've been put out of a job!
I think it's worth pointing out that Adobe has been doing this for a long time. You can't open or paste images into Photoshop which resemble any major currency.
> Copilot can produce code verbatim, but it doesn't unless you specifically set up a situation to test it.
It does not matter what a service can or cannot do. We do not regulate based on ability, but on action.
The service has an obligation to the license holders of the training data not to violate the license. The mechanism by which the license is violated is irrelevant. The only thing that matters is that the code ended up somewhere it shouldn't, and the service is the actor in the chain of responsibility that dropped the ball.
The prompting of the service is irrelevant. If I ask you to reproduce a block of GPL code in my codebase and you do it, you violated the license. It does not matter that I primed you or led you to that outcome. What matters is that the legally protected code is somewhere it shouldn't be.
> It does not matter what a service can or cannot do. We do not regulate based on ability, but on action.
Whether we agree with it or not, intellectual property laws have historically regulated ability as well as action. Hence the additional taxes on blank media in some jurisdictions, just in case someone chose to record copyrighted content onto them. And the MPEG royalty that used to be baked into the consumer price of graphics cards, regardless of whether the buyer planned to watch DVDs on their computer.
Not saying I agree with this principle. Just that there is already a long history of precedent in this area.
Like a lot of politics, ultimately it just comes down to who has the bigger lobbying budget.
> If I ask you to reproduce a block of GPL code in my codebase and you do it, you violated the license. It does not matter that I primed you or led you to that outcome. What matters is that the legally protected code is somewhere it shouldn't be.
This isn't accurate. If I reproduce GPL code in your codebase, that's perfectly acceptable as long as you obey the terms of the GPL when you go to distribute your code. In this hypothetical, my act of copying isn't restricted under the GPL license, it's your subsequent act of distribution that triggers the viral terms of the GPL.
The big question that is still untested in court is whether Copilot itself constitutes a derivative work of its training data. If Copilot is derivative then Microsoft is infringing already. If Copilot is transformative then it is the responsibility of downstream consumers to ensure that they comply with the license of any code that may get reproduced verbatim. This question has not been ruled on, and it's not clear which direction a court will go.
> The big question that is still untested in court is whether Copilot itself constitutes a derivative work of its training data.
Microsoft has a license to distribute the code used to train Copilot, and isn't distributing the Copilot model anyway, so it doesn't matter whether the model itself infringes copyright.
Whereas that same question probably does matter for Stable Diffusion.
As in " including improving the Service over time...parse it into a search index or otherwise analyze it on our servers" is the provision that grants them the ability to train CoPilot.
(also, in case you're wondering what happens if you upload someone else's code: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post.")
But you may not have the rights to grant that extra license if CoPilot is determined to violate the GPL. They can yell at you all they want, but they will have to remove it, as nobody can break someone else's license for you.
It'll have to be tested in court, but likely nobody actually gives a shit.
> But you may not have the rights to grant that extra license if CoPilot is determined to violate the GPL
Which is why that second provision is there to shift liability to you. You MUST have the ability to grant GitHub that license to any code you upload. If you don't, and MS is sued for infringing upon the GPL, presumably Microsoft can name you as the fraudster who claimed to be able to grant them a license to code that ended up in Copilot.
How is that different from a consultant who indiscriminately copies from Stack Overflow?
Tangent to that is the "who gets sued and needs to fix it when a code audit is done?"
Ultimately, the question is then "who is responsible for verifying that the code submitted to production isn't copying from sources that have incompatible licensing?"
The consultants would have to knowingly copy from somewhere. One can hope they're educated on licensing, at least if they expect to get paid.
If Microsoft is so confident in Copilot doing sufficient remixing, then why not train it on their own internal code? And why put the burden of IP vetting on clients, who have less information than Copilot does?
> How is that different from a consultant who indiscriminately copies from Stack Overflow?
And how is that different from a student who learns to code off Stack Overflow (or anywhere else, for that matter) and then reproduces some snippets or learnt code structure in their employment?
Or a random employee who copies some artwork that is then published ( https://arstechnica.com/tech-policy/2018/07/post-office-owes... ). You will note all the people who didn't get in trouble there: neither the photographer who created the image, nor Getty for making it available, nor the random employee who used it without checking its provenance.
In all of these cases, it is (or would be) the organization that published the copyrighted work that is on the hook, for not doing the appropriate diligence: checking what the work is, whether it can be used, and how it should be licensed.
> The Post Office says it has new procedures in place to make sure that it doesn't make a mistake like this again.
... which is what companies who make use of AI models for generating content (be it art or code) should be doing to ensure that they're not accidentally infringing on existing copyrighted works.
Copilot is regurgitating snippets of code that are still under copyright and not in the public domain. Some may consider publicly available code fair use, but the fact that they're selling access for commercial use may undercut that argument.
There is a part of Deep Learning research (Differential Privacy) which focuses on making sure an algorithm cannot leak information about its training set. This is a rigorous concept: you can quantify how privacy-preserving a model is, and there are methods for making a model "private" (at the cost of performance, I think, for now).
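For the curious, here is a minimal sketch of the core mechanism behind those methods (per-example gradient clipping plus calibrated Gaussian noise, as in DP-SGD), written against toy NumPy data; the clip norm and noise multiplier are illustrative placeholders, not tuned or formally accounted values.

    import numpy as np

    # Toy linear-regression data; everything here is illustrative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)
    w = np.zeros(10)

    clip_norm = 1.0         # max L2 norm allowed for any single example's gradient
    noise_multiplier = 1.1  # Gaussian noise scale relative to clip_norm
    lr = 0.1

    for step in range(200):
        # Per-example gradients of the squared-error loss.
        residuals = X @ w - y                            # shape (256,)
        per_example_grads = 2 * residuals[:, None] * X   # shape (256, 10)

        # Clip each example's gradient so no single sample can dominate the update.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

        # Sum, add calibrated Gaussian noise, then average and step.
        noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
        w -= lr * noisy_sum / len(X)

The noise is what buys the quantifiable privacy guarantee, and it is also exactly where the performance cost comes from.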
Differential Privacy only bounds how much information can leak about individual samples of the training set. It guarantees that no input is leaked back exactly; any composition of the training samples is still a valid output, although in image generation this usually means a very distorted image.