I’m not a lawyer, but here is why I believe a class action lawsuit is warranted:
“AI” is just fancy speak for “complex math program”. If I make a program that is simply given an arbitrary input and then, through math operations, outputs Microsoft-copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the reverse should hold as well.
I’m sure my own open source code is in that thing. I did not see any attribution, so they break the fundamentals of open source.
In the spirit of Rick Sanchez: it’s just compression with extra steps.
I read most of the complaint. The only examples of supposed copyright infringement are isEven and isPrime functions. Here's what Copilot gives me in a TypeScript file:
function isPrime(n: number): boolean {
  for (let i = 2; i < n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return n > 1;
}

function isEven(n: number): boolean {
  return n % 2 === 0;
}
These are clearly not covered by copyright in the first place. This case is really quite pathetic.
Correct me if I'm wrong. I don't think this document needs to be a comprehensive record of every piece of copyrighted material that Copilot or Codex produce. That's something that will be produced during/for the trial process itself. Right now, this is just establishing the basic premise, and the claims for the type of behavior that is going on.
I think they intentionally picked (literal) textbook examples because they're short and easy for non-experts to grasp and have some understanding of. But I don't think we've seen any of the code from the respective J. Doe's yet, and I would assume we would in the trial (possibly in addition to more cases).
I tested co-pilot initially with Hello World in different languages. In Lisp, it gave me verbatim code from a particular tutorial, which was made obvious because their code had "Hello <tutorialname>" where <tutorialname> was the name of a YouTube tutorial, instead of the word "World." It was surely slurped into the model via someone who had done the tutorial and uploaded their efforts to Github. Mind you, it's pretty much the way everyone would code it, but the inclusion of <tutorialname> is definitely an issue.
I have only skimmed. But lines 23 and 24 on page 23 also reference Copilot's autocompletion of Quake III's `Q_rsqrt`[1] and mention that it is under GPL2.
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
"Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis."
That code is specifically optimized for efficiency and there were similar approaches floating (get it?) around in the 1980s.
The magic constant is not optimal; better alternatives exist. It is not necessary to implement this function that way, so it should be copyrightable. It is also not a trivial part.
On the other hand, Microsoft may only need to show "Hey, we got this code from FooBar under this license and this license and ..."
Why should it be copyrightable? It's just a way to calculate the inverse square root. This falls under the public domain, in my non-lawyer opinion. Such small snippets usually do not qualify for copyright.
It's not just the constant, but it was the easiest part for me to identify in the last post. And due to its popularity, the size of the snippet doesn't matter; it stands on its own as a significant work.
The essence of the algorithm takes four lines: the function declaration, the declaration of 'y', one line for approximating the exponent in log-space, and one line for the root-finding (Newton) step.
The rest is fluff. Every line of the snippet has creative input: the chosen names ('threehalfs' for 1.5F), the order of declarations and instructions, the redundancy. There have been internet wars over indentation and newlines; these are style choices.
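For reference, those few essential lines can be sketched in TypeScript (a port, not the original C; the function name is mine, and the typed-array trick stands in for the C pointer cast):

```typescript
// Sketch of the core of the fast inverse square root (Q_rsqrt-style).
function fastInvSqrt(x: number): number {
  // A shared buffer lets us reinterpret the float's bits as an integer,
  // replacing the pointer cast used in the original C.
  const buf = new ArrayBuffer(4);
  const f32 = new Float32Array(buf);
  const u32 = new Uint32Array(buf);

  f32[0] = x;
  u32[0] = 0x5f3759df - (u32[0] >>> 1); // log-space exponent trick (the magic constant)
  let y = f32[0];
  y = y * (1.5 - 0.5 * x * y * y); // one Newton-Raphson refinement step
  return y;
}
```

With the single Newton step, the result is accurate to well under 1% relative error for typical inputs.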
((And it is public -- GPL more specifically, which is a restrictive license that should be respected. I think this snippet makes a perfect example of the dangers of Copilot. But not one to litigate details with.))
(((Thinking back, I'm not sure anymore how the license laundering argument works if they got the code from a fair-use MIT-licensed hobby project. Can one person claim fair-use and include it under an MIT-license and have somebody else say 'oh this free code I'm going to use it commercially'?)))
You didn't read the relevant part of the complaint. It starts on document page 14 (PDF page 17). There's a clear footnote:
> Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot.
The offending solution from the AI included extra lines that are reasonably understood to come straight from Eloquent JavaScript.
This seems like an incredibly trivial example. If I remembered that example subconsciously, and used it myself somewhere, would that be an infringement of intellectual property? In any large code base how many such infringements are there? Many? Should we sue every software company on this premise?
Sure, those comments might be considered infringement, but that's from an earlier version of Codex. Copilot does not return that code. The complaint even says so.
If a piece of software systematically engages in copyright violation but only haphazardly corrects those violations, those haphazard corrections aren't evidence the problem has vanished.
If Copilot is committing widespread infringements of their copyright, then surely they will be able to find examples of such infringement to submit in their lawsuit.
I assume they want some kind of broad relief, such as an injunction to take down copilot. They are not going to get it, they are not going to get anything at all, if they can’t even provide examples of violating code.
During the Pirate Bay case, the prosecutor only had to show it was likely (as in, convince the judges) that copyright infringement had occurred. They did this by showing the top 100 torrents. They did not have to prove with certainty that the top 100 torrents actually were used by people. The fact that the names of movies and games showed up on the list was enough to convince the judges.
The lawyers defending the founders did try to argue that no infringement had been proven, and that the list itself was not proof of any infringement: it was just a list on a website, and they even presented evidence that the counter on the list was algorithmically faulty. The judges were not convinced and applied a common-sense approach: taken as a whole, it was not believable that no infringement had occurred, given the context of the site (the name, the top list, the overall way the site was designed).
> ...then surely they will be able to find examples of such infringement to submit in their lawsuit
Perhaps that is why they are reaching out to potential class members
> if they can’t even provide examples of violating code.
This is the very beginning of a very long process. I wouldn't rule out a settlement where class members get $10-100, which is a common resolution for class action suits.
There are many public examples of that same effect happening (for example https://twitter.com/mitsuhiko/status/1410886329924194309 ), and the legal team has been soliciting for more examples. Those examples are likely to come out if it does go to trial.
If this legal team were interested in this going to trial, you'd think they would have put together a stronger case instead of risking that it won't be heard.
There’s not even a single mention of established legal doctrines around copyright and software, such as abstraction-filtration-comparison, the idea-expression dichotomy, etc.
> it threatens to disrupt one of the biggest technological progressions of all time.
Chill dude, all they have to do is include the licenses on their generated code.
If anything, this is going to generate even more progress. The Copilot team would have to create some kind of feature that connects the generated output to the relevant training data. That'd be pretty incredible to see in the field of AI/ML in general.
If they can actually link output to specific input, the lawsuit has merit and, what's more, GPT-3 is a lie. A neural network is supposed to learn how things work, not memorize a large number of examples and spit them out verbatim, or keep connections to specific inputs.
Copilot losing the lawsuit is evidence it’s a case of overfitting, not true ML.
It's not just AI that is threatened, but also the use of sites like StackOverflow, because some of those snippets might infringe a license. So we would have to write everything from our own heads, de novo. No more googling for solutions.
I think we should just relax copyright, it's dying anyway. Language models allow people to borrow skills learned from other people, and solve tasks. That's huge. Like Matrix, loading up a new skill at the press of a button. Can we give up such a huge advantage in order to protect copyright?
I think the notion of copyright has been under attack already for 2 decades by the internet, search engine and social networks. They all work against it, and AI does it even more. It just encapsulates the whole culture in a box, mixing everything up, all copyrights melting away, everything one prompt away. This could be a new medium of propagation for ideas. No longer limited to human brains and books, they can now propagate through language models more efficiently.
That isPrime function does not even cut off at sqrt(n). Asking for a state-of-the-art isPrime is too much, but the sqrt trick is the very first optimization and it's free. (IIRC, the faster version uses i * i <= n as the loop condition.)
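A sketch of the sqrt-cutoff version (the function name is my own; comparing `i * i <= n` sidesteps a floating-point sqrt call entirely):

```typescript
// isPrime with the sqrt cutoff: any divisor larger than sqrt(n) has a
// matching cofactor smaller than sqrt(n), so testing up to sqrt(n) suffices.
function isPrimeFast(n: number): boolean {
  if (n < 2) return false;
  for (let i = 2; i * i <= n; i++) { // i * i <= n avoids computing a float sqrt
    if (n % i === 0) return false;
  }
  return true;
}
```

Note the `<=`: with a strict `<`, perfect squares of primes such as 9 or 25 would slip through as false primes.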
When searching for "console.log(isEven(50));" "// → true", which is one of the parts the complaint is about (since it is also reproduced in a programming textbook), cs.github.com returns:
" Showing 1 - 20 of 66 files found (in 76 milliseconds)"
So, if this lawsuit succeeds in some way, shape, or form, does the author have a case against the 66 people who reproduced these lines in their own repositories?
You could argue that if the author had pursued enforcing their license against those 66 people, their code wouldn't have ended up in the training set in the first place. IANAL, but I recall that you can't invoke copyright law selectively; copyright is only protected if the holder pursues every violation. Maybe it works the same for enforcing a license.
They can already sue those people if they don't follow the original license; they just need to file a complaint against each violator individually, I think. Standard OSS license stuff, or else why would people even use licenses?
Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.
Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.
I am more hesitant to release code on GitHub under any license now. Even outside of GPL-esque terms, I've considered open sourcing some of my product's components under a source-available but otherwise proprietary license, but if Microsoft won't adhere to popular licenses like the GPL, why would they adhere to my own licensing terms?
If my licenses mean nothing, why would I release my work in a form that will be ripped off by a trillion dollar company without any attribution, compensation or even a license to do so? The incentives to create and share are diminished by companies that won't respect the terms you've released your creations under.
That's just me as an individual. Thinking in terms of for-profit companies, many of them would choose not to share their source code if they know their competitors can ignore their licenses, slurp it up and regurgitate it at an incomprehensible scale.
> Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.
I strongly disagree. There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
> I've considered open sourcing some of my product's components under a source available but otherwise proprietary license
What's the point of that? This isn't useful to anyone. The fact you even consider it shows you don't understand open source. I'm sure you happily use open source code yourself though.
> There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
I actually agree. However, this is not what's happening. Copilot effectively removes copyright from FLOSS code but doesn't touch proprietary software. FLOSS loses its teeth against the corporations.
I'm the author of about a dozen popular AGPL and GPL projects, but please tell me how I don't understand open source.
The purpose of releasing source-available but proprietary code is so that users can learn from it and integrate with it; making the source available lets anyone see how it works. The only reason I even considered it is the balance between 1) needing to eat and 2) valuing open source enough to risk #1.
I played around with creating a modified MIT license on my GitHub that explicitly forbids Copilot and other such systems, which I thought I might update my projects to, because I strongly dislike the data collection. I'm not a lawyer though.
Is there a GitHub terms-of-service provision that covers Copilot?
They claim it is fair use, therefore they can bypass copyright (and therefore license terms).
It being in GitHub has not been brought up as a factor yet (by GitHub/Microsoft), AFAIK they could use code from other places with that logic, they just don't need to.
I find your comment a bit perplexing, perhaps you can help me understand.
Why do you want to release code on GitHub with an oppressive license? What's the motivation for you, and what's the benefit for anyone else in it being released?
The size of code fragments being generated with these AI tools is, as far as I can tell, extremely small. Do you think you could even notice if your own implementation of sqrt, comments and all, wound up in Excel?
The point of copyleft licenses (which I assume are what you mean by "oppressive") is to subvert copyright in order to incentivize others to share their code, by providing them with something to build on if they return the favor. You cannot possibly call these licenses oppressive, since the default state under copyright is that you are not allowed to do much at all (at least when it comes to copying). In fact, copyleft licenses allow you to do much, much more than your average corp-lawyer-approved proprietary license.
The problem (or A problem) with Copilot is that it tries to sidestep those licenses, purportedly allowing you to build upon the work of others without giving anything back, even if the work you are building on has been published on the explicit condition that what you create with it must also be shared in the same way. While the great AI tumbler complicates the legal copyright-infringement argument by giving you lots of small bits from lots of different sources, it really does not change the moral situation: you are explicitly going against the wishes of the people who are enabling you to do what you are doing.
Beyond copyleft, this kind of disregard for other people's wishes also applies to attribution, even with more liberal licenses. Programming is already a field where proper attribution is woefully lacking; we don't need to make it worse by introducing processes where it becomes much harder, if not impossible, to tell who contributed to a creation.
Now, I am all for maximum code sharing. I'm all for abolishing copyright entirely and letting everyone build what they want without being shackled by so-called intellectual property. But that is not what Microsoft is doing with Copilot. What they have created is a one-way funnel from OSS to proprietary software. If Microsoft had initially trained Copilot on their own proprietary sources, this would have been seen very differently. But they did not. Because the way Microsoft "loves open source" is not the way of a mutually beneficial symbiotic relationship, but that of an abuser who loves taking advantage of whatever they can while giving as little back as they can get away with.
How does one make the leap from "a source available but otherwise proprietary license" to a copyleft license? As I understand the terms, perhaps in too limited a way, a proprietary license is never one in which others are free to build on the code or incorporate any part of it into their own works, and a source available proprietary license is just publishing source that no-one can use.
As for whether Copilot's morally wrong or not - I don't think copyright as a concept makes any sense at the level of the trivial, where Copilot _should_ be acting. If Copilot regularly reproduces sizeable portions of code from a single origin _without_ careful and deliberate guidance, I'd agree that there's a problem here. As I understand it though, that's not happening.
By its very nature of being published, code from OSS is funnelled into proprietary codebases by humans performing a similar task to Copilot - reading available code and using that to evolve an understanding of how to produce software. I like to think we do it at a deeper level than Copilot, but the general effect is the same: the code I write, like the words I write, are heavily influenced by all the code I've read over the years.
If I wind up using a few words from your comment, down the line, because some turn of phrase you used struck me as a good way to say something, do you think I've morally wronged you?
I'm fine with Copilot, but I think all rightsholders should be allowed to decide if they want their code training it or not. And that should be opt-in, not opt-out.
(And refusing to opt in shouldn't have to mean switching to a new hosting platform.)
> Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
That's the case in pretty much any class action. I look at class actions as having two purposes: to require that the defendant stops doing something, and to fine the defendant some amount of money. Sure, individual class members will see very little of that money, but I look at it as a way of hurting a company that has done people wrong. Hopefully they won't do that anymore, and other companies will be on notice that they shouldn't do those bad things either. Of course, sometimes monetary damages end up being a slap on the wrist, just something a company considers a cost of doing business.
>I look at class actions as having two purposes . . . to require that the defendant stops doing something
That's my point. Many of the class members don't want the company to stop doing this.
I have code on GitHub, and Copilot is a useful tool. I don't care if my code was used to train the model. Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things. The bottom line is, if I'm a coder with code on Github and I like Copilot, this suit is a huge net negative.
Even more importantly, I want to see the next version of Copilot that will be created by some other company, and then the next version after that. I want development to continue in this area at a high velocity. This suit does nothing but put giant screeching brakes on that development, and that is just a shame.
If the lawsuit goes through, it's not likely that Copilot would disappear, but there would be a checkbox to opt in your code. You could check it, and your code would be used to train the model.
I have some code on GitHub as well and would not want it to be used in training, not by Microsoft nor by any other company. It is under the GPL to ensure that any derived use remains public and is not stripped of copyright and locked into a proprietary codebase, and Copilot is pretty much the 100% opposite of this.
If this lawsuit is successful, I doubt it will change anything at all. Microsoft will just pay the damages as a cost of doing business and continue what they are doing. Maybe they will add an opt-out.
Seems like a great opportunity for Microsoft to alter Copilot so that having your code scanned is opt-in, and to mandatorily add licensing and attribution to its outputs.
I know you said you're OK with it as is, but many aren't. So if I'm a coder, this suit represents a big net positive for me: a way to reduce the probability of someone laundering my code away without proper attribution or attention to its license.
Hypothetically, if I wanted to learn how to code by studying open source examples on GitHub, should I have to go ask permission of each rightsholder to learn from their code? I agree that, if Copilot is based on a model that overfits to output the exact same code it read, the lawsuit has merit (and Copilot is not really ML), but the idea of ML is that the model doesn’t memorize specific answers, it learns internal rules and higher-level representations that can output a valid result when given some input. Very much like me, the coder, would output valid code when given a use case description, after studying a lot of open source examples of that. Should most programmers just be paying rights to all publishers of code they have studied?
For a long time, Microsoft has used software licenses to reap profits from Windows and Office, the two products that enabled Microsoft to capture near-monopolies in their respective markets.
Now, Microsoft is violating other people's software licenses to repackage the work of numerous free and open source software contributors into a proprietary product. There is nothing moral about flouting the same type of contract that you depend on every day, for the sake of generating more money.
Either the entire Copilot dataset needs to be made available under a license that would be compatible with the code it was derived from (most likely AGPLv3), or Windows and Office need to be brought into the commons. Microsoft cannot have it both ways without legal repercussions.
I don’t think this lawsuit would hinder innovation but it would greatly change it and who owns it.
If an AI model is the joint property of all the people who contributed IP to it, it’s a pretty hugely democratic and decentralizing force. It also will incentivise a huge amount of innovation on better, richer data sources for AI.
If an AI model isn’t joint property of the IP it learned then it’s a great way to build extractive business models because the raw resource is mostly free. This will incentivise larger, more centralised entities.
Much of the most interesting data comes from everyday people. A class action precedent is probably good for society and good for innovation (particularly pushing innovation on the edge/data collection side)
The problem of jointly-owned AI is that the actual value of a particular contribution to the training set is not particularly easy to calculate. We can't tie a particular model weight back to individual training set examples, nor can we take an output that's a statistical mix of two different peoples' work and trace it back to them.
With current technology, the only licensing model we can offer is "give us your training set example, we'll chuck a few pennies at you out of credit sales and nothing more". We can't even comply with CC-BY because the model can't determine who to attribute to.
The resource is not "free": it is provided under a license that lays out the terms an entity must comply with in order to benefit from using it; just because the compensation is non-monetary doesn't mean it is "free".
Authors of code (open source or otherwise) hold a copyright in that code. The purpose of the license agreement is to set out the terms on which the authors will permit others to take actions that would otherwise infringe copyright.
Using code, photographs, documents, or other material to train a model isn't copyright infringement. The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
> The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
How do they not make copies? Do you know how a computer works? Ever heard of RAM? (At least German Urheberrecht recognizes this clearly: you can't do any processing on any data with the help of a computer without at least making temporary local copies, so there are exceptions to some rules. I'm quite sure common-law copyright also recognizes this!)
Also the claim that this is not a derivative work is actually one of the disputed claims here…
> Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
Exactly, it's all copyrighted! That's why you can't use it for whatever you like. That's the whole point of copyright.
As a result, this means that whoever wants to exploit that work in said way needs to buy (or otherwise obtain) a license!
Nobody said that feeding AI with properly licensed work would be problematic. Only that the original creators need to get their fair cut from the outcome of such a process.
You clearly don't understand how machine learning works. If machine learning on copyrighted data becomes illegal, then much of our infrastructure will go down, because most of it uses machine learning. The first effect many people would notice is probably Google Search.
I believe this is the core point of the lawsuit - is Copilot really creating code from what it learned (which happens to, by some weird glitch, mimic the source code) or is it just a big overfitting model that learned to encode and memorize a large number of answers and spit them out verbatim when prompted?
I think that losing this lawsuit has much more serious consequences for Copilot than just having to connect its output to a list of millions of potential copyright owners: it would mean the model behind it is essentially a failure.
Personal opinion: the real situation lies somewhere in the middle. From what I’ve seen, I think Copilot has some ability to actually generate code, or at least adapt and connect unrelated code pieces it remembers to respond to prompts - but I also believe it just “remembers” (i.e., has a close-to-lossless encoding of the input) how to do some operations and spits them out as part of the response to some prompts.
I hardly think the lawsuit will really explore this discussion, but it sounds like a great investigation into what DL models like transformers actually learn. For all I know, it might even give insight into how we learn. I have no reason to believe that humans don’t use the same strategy of memorising some operations and learning how to adjust them “at the edges” to combine them.
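Such an investigation could start with the crudest possible check (a sketch only; the function names and whitespace-normalization approach are my own illustration, not anything from the case): flag generated snippets that appear verbatim in the training corpus.

```typescript
// Toy verbatim-memorization check: collapse whitespace, then see whether a
// generated snippet appears exactly inside any training document. Real
// analyses would need fuzzier matching (identifier renaming, reordering),
// but exact matches are the clearest sign of memorization.
function normalize(code: string): string {
  return code.replace(/\s+/g, " ").trim();
}

function appearsVerbatim(generated: string, corpus: string[]): boolean {
  const needle = normalize(generated);
  return corpus.some((doc) => normalize(doc).includes(needle));
}
```

Applied at scale over model outputs, the hit rate of such a check would distinguish "remembers some snippets" from "mostly generates".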
I don't think anybody will try to answer the philosophical question of whether what this machine does has anything to do with human reasoning.
In the end it's just a machine. It's not a person. So trying to anthropomorphize this case makes no sense from the get go.
Looking at it this way (and I guess this is the right way to look at it from the law standpoint) Copilot is just a fancy database.
It's a database full of copyrighted work…
How this database (and its query system) works from a technical viewpoint isn't relevant. It just makes no difference, as by law machines aren't people. End of story.
But should the court (to my extreme surprise) rule that what MS did was "fair use", then the floodgates of "fair-useify through ML"[1] would be open. Given the history of copyright and other IP laws in the US, this just won't happen! The US won't ever accept that someone would be allowed to grab all Mickey Mouse movies, put them into some AI, and start to create new Mickey Mouse movies. That's unthinkable. Just imagine what it would mean: you could "launder" any copyrighted work just by uploading it and re-querying it from some "ML-based database system". That would be the end of copyright. This just won't happen. MS is going to lose this trial. There is no other option.
The only real question is how severe their loss will be. They surely also used AGPLv3 code for training. Thinking this through to the end, with all consequences, would mean that large chunks of MS's infrastructure and all supporting code (which means more or less all of Azure, which means more or less all of MS's software) would need to be offered in (usable!) source form to all users of Copilot. I think this won't happen. I expect the court to find a way to weasel out of this consequence.
> Morally I think this class action is dead wrong. This is how innovation dies.
This legal challenge is coming one way or another. I think it’s better to get it out of the way early. At least then we will know the rules going forward, as opposed to being in some quasi-legal gray area for years.
I disagree. The more entrenched a practice is, like training AI models on media content, the less willing a court is going to be to take that practice away.
That seems like a Machiavellian way of avoiding The People deciding the issue for themselves via their government representatives, and it'd just make things harder when the court takes the practice away anyway.
Say you read a bunch of code over years of a developer career. What you write is influenced by all of that; it will include similar patterns, similar code, and identical snippets, knowingly or not. How large does a snippet have to be before it's copyrightable? "x"? "x==1"? "if x==1\n print('x is one')"? [Obviously, replace these with actual common code, like "if not found, return 404".]
Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat, or by a large corp wanting to shut down an open source or competing project?
This is a logical fallacy. A human is not an algorithm. We do not have to extend rights regarding novel invention to an algorithm to protect them for people.
Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense. If it were true, people would very easily game it, by using algorithms to automate the most trivial parts of copying someone's work.
Ultimately the algorithm is automating something a human could do. There is a lot of gray area to copyright law, but you can't get around that simply by offloading to an algorithm.
> Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense.
Uh? So if I design a self driving car which kills someone, it's the car that goes to jail?
Legal precedent seems to indicate this is not the case at all. Because humans and machines are different, simply because humans aren't machines and vice versa.
"So if I design a self driving car which kills someone, it's the car that goes to jail?"
No but the manufacturer will typically be held responsible. If the manufacturer intentionally designed it to kill people, someone could certainly be charged with murder. More likely it was a software defect and then it is a matter of financial liability. (in between is a software defect that someone knew about and chose not to fix)
This isn't a new issue. If you design a car and the brakes fail due to a design issue, and that issue is determined to be something that could have been prevented by more competent design... someone might indeed go to jail, but more likely the corporation would pay out a large amount of money.
It could even be a mixture of the manufacturer's fault and the driver. Maybe the brakes failed but the driver was speeding and being reckless and slammed on the brakes with no time to spare. Had it not been a faulty design, no one would have gotten hurt, but also if the driver had been competent and responsible, no one would have gotten hurt.
But with self driving cars, once they no longer need a "safety driver", it certainly won't typically be the fault of the car's human occupant to any degree, since they are simply a passenger.
Last I checked this was very much a gray area. I’d expect at least a long investigation into the amount of work and testing put into validating that the self-driving algorithm operates inside reasonable standards of safe driving. In fact, I expect that, as the industry progresses, the tests for minimal acceptable self-driving safety get more and more standardised.
That doesn’t answer the question of who’s responsible when an accident happens and someone gets hurt or dies - but then, there was a time when animals would be judged and sentenced if they committed a crime under human law. That practice is no longer deemed valid, maybe we need to agree that, if the self-driving car was built with reasonable care, accidents can still happen and it’s no one’s fault.
First of all, that isn't simple. How do you determine what is done by humans? If the human is using a computer and copy-pasting, does that still qualify?
No matter where you draw the line between "done by computers" and "done by a human simply using a computer as a tool," there will always be a lot of gray area.
Also, if I spend a year creating my masterpiece, and some kid releases a copy of it for free and claims that that's ok just because it's "not for profit," there is still a problem.
> Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense.
it makes a lot of sense, for that reason and a lot of others
people can create algorithms that do whatever they want, including copyright infringement and outright criminality, but algorithms can't create people or want anything for themselves
Copyright already worries about this sort of thing a great deal, and it's actually a lot more well thought-out than your average hacker is aware of. There are no hard and fast rules; but generally... the thing being sued over has to be creative enough to be copyrightable in the first place. Small snippets do not qualify for copyright protection alone.
I'm not sure this is true. At least for copyright in the common law meaning.
Oracle got copyright on API signatures…
In civil law there is a bar to protection if the work lacks "substantial" creativity. But even this bar is extremely low. More or less everything besides maybe simple math formulas is protected.
Oracle got a very thin copyright on API signatures. The "programmer convenience" ruling in Google v. Oracle basically precludes almost all copyright action on APIs alone.
No, they got absolute copyright on the API signatures.
The court did not even question any copyright; it just assumed the APIs are copyrighted by Oracle. Then it looked for reasons why copying the APIs could possibly be fair use…
By the skin of their teeth they found some very involved and case specific reasons why Google's use of the copyrighted APIs was, after all, fair use.
The reason why SCOTUS bent over backwards to not talk about copyrightability was not because they assumed it was true for APIs, but because they didn't feel like they had all the facts. They basically said "we don't know if it's copyrightable, but if it is, here's a ruling that makes this case and anything similar to it go away".
Oracle only has copyright over APIs in the Federal Circuit, because they were able to hoodwink the judge into applying patent logic[0] to a copyright case. In other circuits it's still up in the air. And in the Ninth Circuit[1] there's already loads of controlling precedent that would have resulted in Oracle's case being summarily dismissed, API copyright or no.
The term "thin copyright" is a term of art. It refers to the kind of copyright protection you get from combining uncopyrightable elements in a creative way. For example, you can't own a particular chord progression. But, if you combine that with, say, a particular instrument, some audio engineering techniques, the subject matter of the lyrics, and so on... then you start getting something that requires creative effort and thus is copyrightable. Courts still have to take this into account when ruling on copyright claims as they do not want to give people a monopoly over just the chord, or just that instrument, etc.
In the case of APIs, we're talking about a series of names, plus an arrangement of type signatures that go with them. Very much a thin copyright, as the legal profession in the US calls it.
And when you have thin copyright, courts are going to be more liberal with handing out fair use exceptions. The "programmer convenience" argument that SCOTUS adopted means that copying an API to put in a different platform is OK. The Ninth Circuit says that copying an API to reimplement a platform that other people's code relies upon is also OK. There's very little room left to actually make a copyright claim on an API alone.
In the case of Copilot, it's not merely copying APIs and filling them out with novel details. It is either generating wholly novel code, or regurgitating training data, the latter of which is just a regular ol' infringement claim with no difficult legal questions to worry about.
[0] The Court of Appeals for the Federal Circuit is the only court with subject-matter jurisdiction over patent claims. When you're the only person who can make hammers, everything looks like a nail.
[1] The Ninth Circuit court of appeals has jurisdiction over California, which means it takes on the brunt of copyright cases.
I still don't buy the claim that there is not much to worry about.
The thing you call "thin copyright" is still copyright. Being protected or not is in the end a binary judgment: If your stuff is "a little bit" protected it is actually fully protected—with all consequences that follow from that.
Also, the mere "assumption" by the highest US court that APIs are protected is a very strong signal. They could simply have ruled that there is no protection at all; case closed. But they preferred a weasel solution, and that has reasons… They deliberately didn't open the door to API freedom. (Most likely so they can still wield that weapon against foreign competition should they feel like it some day.)
The point is: IP law is completely crazy. The smallest brain-farts are routinely protected.
The exceptions to this rule are actually stronger in civil law, but even in the EU, single words or sub-second audio samples are protected by default. (Regarding APIs the situation is better, though: it's legal to reverse engineer something for e.g. compatibility, and for a few other reasons; but those are explicit exceptions. The default is that almost every expression of even the slightest form of human "creativity" is copyrighted; the bar is extremely low, and it actually gets pushed constantly lower by common law influence.)
So on both sides of the Atlantic the default is that every single line of code is protected. There is nothing like a lower bound on size. Then, from there, you could try to argue that there should be an exception from this protection in some particular case, e.g. that there was no "creativity" at all involved. But you will need to win a fight over that issue that is often very hard, expensive, and ridiculously long, and winning it is nothing like a sure thing; the default is that just about everything is protected to the max. (Just look at all the craziness around news headlines in the EU; Google lost that case back then. To understand this better, as it may be very surprising to US readers: civil law does not recognize anything like "fair use". There are exceptions to copyright protection that have almost the same effect in the end, like grants for libraries or educational purposes, but those exceptions, and their limitations, are listed explicitly in the law; if no exception is listed, there just isn't one, and only the very vague "creativity bar" remains.)
Regarding Copilot: it doesn't make much difference whether this machine spits out verbatim copies of (clearly copyrighted!) snippets or some "remix" thereof. There is no "novel" code if all this machine does, at best, is create "remixes" of the code it has in its database based on the query given. (Its "knowledge base" is nothing but a very funky database; technical details of the actual implementation of that database or its query system should not matter legally.)
Before this comes up again: No, any comparisons to how humans learn are irrelevant in this consideration. That machine is not a human. It's a machine. End of story. So even if you consider also a human brain a kind of "funky database" this makes no difference.
I haven't heard anyone saying that copilot is legal "just because it's AI." That's a pretty bad faith, reductive, and disingenuous representation. The core argument I've seen is that the output is sufficiently transformative and not straight up copying.
I wasn't really trying to address whether the argument is valid, I was just noting the representation of the other side here is reductive to the point of being in bad faith. I find that kind of rhetoric a little frustrating since it's kind of inflammatory, and, I believe, not particularly productive towards having honest/informative disagreements and discussion.
I think if another algorithm were used instead of ML and did the same job as Copilot, people would be making the same arguments. ML is simply the first tech capable of doing what Copilot is doing.
You can't copyright an algorithm, you can copyright a particular expression of one, or you can attempt to patent an algorithm, but two authors can legitimately write the same thing and not infringe on each others copyright unless one copied from the other.
Suppose you own the rights to a JPEG, and I apply a simple algorithm that increments every hex value, so 00 becomes 01 and so on. The gibberish images it spits out would be so different from your original image that you wouldn't have any claim to them at all.
So I may create a tool that is capable of "incrementing every hex value" of an image, and also of "decrementing every hex value", and then distribute any of your images after "incrementing the hex values", together with said tool, right?
Or maybe it would be enough to just zip your image to be allowed to distribute it? In the end, the bytes I would distribute would then "be so different from your original image that you wouldn't have any claim to them at all", right?
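For what it's worth, the transform being proposed is lossless and trivially invertible, which is exactly why it can't launder copyright: the original bytes are fully recoverable. A minimal sketch (function names are mine, for illustration):

```python
def increment_bytes(data: bytes) -> bytes:
    """'Scramble' the work by adding 1 to every byte, modulo 256."""
    return bytes((b + 1) % 256 for b in data)

def decrement_bytes(data: bytes) -> bytes:
    """Invert the transform exactly, recovering the original work."""
    return bytes((b - 1) % 256 for b in data)

original = b"\x00\x01\xffpretend this is a JPEG"
scrambled = increment_bytes(original)
assert scrambled != original                   # looks like gibberish...
assert decrement_bytes(scrambled) == original  # ...but is fully reversible
```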
I encourage you to go get a copy of the latest hollywood blockbuster, apply your transformation, share it on the internet and see if the courts agree with your copyright hack.
Humans are just compression with extra steps by that logic.
There's a fairly simple technical fix for codex/copilot anyway; stick a search engine on the back end and index the training data and don't output things found in the search engine.
If I were to memorize my employer's IP then reproduce it (almost) verbatim and give it to a competitor, then I would be setting myself up for a world of legal hurt.
So yes, it is like how human memory is compression with extra steps.
I don't think that would work very well, because there are not infinite ways to succinctly solve most programming problems. In fact, the majority of solutions will look exactly the same.
The real solution is very, very simple. Only use opt-in training data. Don't acquire codebases from people who didn't agree to it.
If I own a repository on GitHub and I have received contributions from other people, or have included a .h file from mpv (something I have done), do I still have the right to click the opt-in button? I didn't ask the other contributors.
But GitHub is in a position to scan my code, see if there are copy-pasted bits, and disable the opt-in button in that case.
Except they act in bad faith so they wouldn't do that.
> I don't think that would work very well because there are not infinite ways to succinctly solve most programming problems. In fact the majority of solutions will look exactly the same.
Algorithms can't be patented or copyrighted, as they are pure mathematics. If an implementation of an algorithm has no creative content because it is succinct then it likely doesn't deserve copyright.
We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that resembles public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. In addition, we have announced that we are building a feature that will provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code, as well as explore and learn how that code is used in other projects.
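As a rough sketch of how such a whitespace-insensitive match filter could work (the structure and names here are my own illustration, not GitHub's actual implementation):

```python
# Hypothetical sketch of a "matches public code" filter: collapse
# whitespace, then check whether any ~150-character window of the
# suggestion plus its surrounding code appears in an index built from
# public code. Names like build_index/is_blocked are illustrative.
import re

WINDOW = 150  # approximate match length described by GitHub

def normalize(code: str) -> str:
    """Collapse all whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", "", code)

def build_index(public_snippets: list[str]) -> set[str]:
    """Index every WINDOW-length substring of the normalized public code."""
    index = set()
    for snippet in public_snippets:
        s = normalize(snippet)
        for i in range(max(1, len(s) - WINDOW + 1)):
            index.add(s[i:i + WINDOW])
    return index

def is_blocked(suggestion: str, context: str, index: set[str]) -> bool:
    """Suppress the suggestion if any window of it matches indexed code."""
    s = normalize(context + suggestion)
    return any(s[i:i + WINDOW] in index
               for i in range(max(1, len(s) - WINDOW + 1)))
```

A real system would need a far more scalable index (e.g. hashed n-grams) than an in-memory set, but the match-and-suppress logic would be of this general shape.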
Attributions are fundamental to open source? I thought having source openly available was fundamental to open source (and allowed use without liability/warranty) as per apache, mit, and other licenses.
If they just stick to using permissive-licensed source code then i'm not sure what the actual 'harm' is with co-pilot.
If they auto-generate an acknowledgement file for all source repos used in co-pilot, and then asked clients of co-pilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.
After five minutes of googling I'm still not sure if using MIT code requires an attribution, but many people claim it does, see https://opensource.stackexchange.com/a/8163 as one example
You could have read the MIT license in its entirety in less than five minutes. It is very clear that preserving attribution is a required condition. Other permissive licenses even explicitly require attribution in binaries / documentation.
MIT License:
Copyright <YEAR> <COPYRIGHT HOLDER>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
People would likely not share any code if they could not trust that their work would be respected, and attributed. So yes, I believe it to be fundamental to open source.
People share proprietary code publicly. And the fact that you're allowed to read a book doesn't (currently) give you the right to copy it and redistribute the copy.
If I read 10 or 20 books about a topic and then go teach that topic to others, do I have to attribute each thing to all the authors from where I learned it? And what if I come up with my own interpretation of a topic, do I have to trace it back to all interpretations of all the authors that influenced it? Even more, do the previous authors also have to do that and do I have to quote all the chain of references? If not, why an ML model that is supposed to learn how coding works, not memorize pieces of code verbatim, should have to “because of copyright laws”?
It does give you the right to write excerpts from memory though. If it happens to exactly match the text in the book, nobody gets excited about that, even if you could potentially rewrite the whole book.
maybe that is true, but there exist others for whom that is not true, and as long as they number greater than zero, the argument that 'open source means free to use however for whatever' will be invalid
True and valid. But all those clauses, AFAIK, were written with the mindset of "if you want to run this code (particularly, but not limited to, for profit), you have to at least attribute it". Copilot allegedly doesn't run that code - it claims to read it, understand how it works, and then generate its own code that performs an equivalent function if requested. It's up to the lawsuit to decide whether that's what it actually does, but my point is that the licenses simply did not cover this usage pattern, just as no open source license requires any kind of action from someone who's merely reading or studying the code.
> “AI” is just fancy speak for “complex math program”
Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, the same as treating it like a human mind. They're very different under the surface.
IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.
Who decides what constitutes an "AI program" vs just a "program"? What heuristic do we look at? At the end of the day, they have an equivalent of a .exe which runs, and outputs code that has a license attached to it.
I can suggest an idea, considering that the “AI program” is the model, not the training algorithm.
A program gets written by an entity (usually a person) and is executed to generate the desired output according to a deterministic mathematical function it expresses.
A training algorithm is a program that gets written to train a model (the model being the “AI Program”) when presented to some training data inputs, to implement a function that is not the training algorithm function itself, but another one, generalising over a problem domain beyond just the original examples fed to the training algorithm.
The output model is not the training algorithm or the training data (or an encoding of it) and exists as its own artefact, independent of both.
That oughtn't be controversial, in fact I wouldn't even bother with 'on steroids', implying it's a slightly different/morphed thing. The way I learnt it (very slightly, at university, not a particular focus) it was abundantly clear it was just stats.
I bring the steroids thing up because it's only relatively recently that we've had the massive computing power at our finger tips that we do now. We discovered the foundations of our current ML techniques a relatively long time ago, it's only been recently that we've been able to throw data centers full of powerful GPUs and whatnot at them.
The only license that is permissive enough for AI training is CC0.
Art generators can't comply with attribution requirements and code generators don't know if and when they trip the GPL copyleft. I believe most permissive code licenses also have some kind of attribution requirement.
Who should be sued? Microsoft who produces an application known as "Copilot" which itself contains nobody else's code but Microsoft's? OR the person who USES Copilot, to produce code which contains somebody else's copyrighted code?
Using Copilot is a bit like using a shotgun, can be very illegal depending on what you shoot at. Creating and distributing the app Copilot is like creating and selling a shotgun.
Microsoft produces a service known as "Copilot" which does contain other people's code. That the Copilot network contains other people's code is not in question, since it has been demonstrated to output other people's code, and Microsoft even added (very limited) filters to detect whether it outputs other people's code.
Copilot only generates copyrighted code when it has seen that code many, many times. This is called memorization in machine learning, and ML researchers always try to decrease the amount of memorization in their artificial neurons.
Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.
The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.
The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.
It's not something to dismiss but it is something that has already been addressed. Authors Guild v Google. Google Books is built upon scanning millions of books from libraries without first gaining permission from copyright holders, this was found to not be a violation of copyright.
Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.
At the time the suit was launched, Google search would only display snippet views. The very nature presents the attribution to the user, enabling them to separately obtain a license for the content.
This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.
The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.
Attribution is irrelevant in Authors Guild, the books were not released under open source licenses where attribution is sufficient to meeting the licensing terms. Google never sought or obtained licenses from any of the publishers, and the court ruled such a license was not needed as Google's usage of the contents of the books (scanning them to build a product) did not represent a copyright infringement.
Attribution is mentioned in this filing because such attribution would be sufficient to meet the licensing terms for some of the alleged infringements.
It's an irrelevant discussion though, the suit does not make a claim that the training of Copilot was an infringement which is where Authors Guild is a controlling precedent.
> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries
I agree it's relevant precedent, but not exactly the same. Libraries are a public good and more importantly Google books references the original works. In short, I don't think that's the final word in all seemingly related cases.
> More specifically, a computer consuming a copyright work is not a violation of copyright.
I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.
Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.
Your disagreement and feelings about how copyright and the law should work are valid, but they have very little to do with how copyright is addressed judicially in the United States.
In which case Google paid some hundred million dollars to companies and authors, created a registry collecting revenues and distributing them to rightsholders, provided an opt-out for already-scanned books, etc. Hey, it doesn't sound that bad for the same thing to happen with Copilot.
A) No it doesn't, there's nothing in the Copilot model or the plugin that represents or constitutes a reproduction of copyright code being distributed by GH/MS. The allegation is it generates code that constitutes a copyright violation. This distinction is not academic, it's significant, and represents an unexplored area of copyright law.
B) "parts of" copyright works are not themselves sufficient to constitute a copyright violation. The violation must be a substantial reproduction. While it's up to the court to determine if the alleged infringements demonstrated in the suit (I'm sure far more will be submitted if this case moves forward) meet this bar, from what I've seen none of them have.
Historically the bar is pretty high for software, hundreds or thousands of lines depending on use case. A purely mechanical description of an operation is not sufficient for copyright, you cannot copyright an implementation of a matrix transformation in isolation no matter what license you slap on the repo. Recall that the recent Google v Oracle case was litigated over tens of thousands of lines of code and found to be fair use because of the context of those lines.
I've yet to see a demonstrated case of Copilot generating code that is both non-transformative and represents a significant reproduction of the source work.
> The allegation is it generates code that constitutes a copyright violation.
The weights of Copilot very likely contain verbatim parts of the copyrighted code, just like a zip archive does. It chooses semi-randomly which parts to show and sometimes breaks copyright by displaying large enough pieces.
Say you publish a song and copyright it. Then I record it and save it in the .xz format. It's not an MP3; it is not an audio file. Say I split it into N chunks and share them with N different people. Or with the same people, but at N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.
Am I violating your copyright? Are you entitled to do that?
To make it funnier: Say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I illegally violating your copyrights by sharing that?
I take your code and compress it into a tar.gz file. I'll call that file "the model".
Then I ask an algorithm (gzip) to infer some code using "the model".
The algorithm (gzip) just learned how to code by reading your code. It just happened to have it memorized in its model.
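The analogy can be made literal with the Python standard library: the "model" is an opaque compressed blob, and "inference" reproduces the training data byte for byte (illustrative only, of course; the snippet used as the training corpus is made up):

```python
import gzip

# "Training corpus": some source code, repeated the way common
# snippets are repeated across GitHub.
training_code = b"def is_even(n):\n    return n % 2 == 0\n" * 20

# "Training": compress the corpus into an opaque binary artifact, "the model".
model = gzip.compress(training_code)
assert len(model) < len(training_code)  # no verbatim copy is visible inside

# "Inference": query the model and get the training data back, byte for byte.
generated = gzip.decompress(model)
assert generated == training_code
```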
With the exception that there are infinite types of chords in this case, and even though many musicians follow familiar chord structures, the underlying melodies and rhythms are unique enough for any familiar person to differentiate the Red Hot Chili Peppers from the All-American Rejects. And now there is a system where the All-American Rejects hit a few buttons and a song is generated (using audio samples of "Under the Bridge") that sounds like "Under the Bridge pt. 2, All-American Rejects Boogaloo".
That's why it's actionable and why there is meat on the bone in this case. The real issue is going to be whether they can convince a jury that this software is just stealing code, and whether it's wrong if a robot does it.
Google doesn't sell its search feature as a product that you can just plagiarize the results from and they're yours. Microsoft does that with Copilot.
Copilot is as much of a search engine as Stable Diffusion or DALL-e are, which is to say they aren't at all. If you want to compare it to a search engine, despite it being a tortured metaphor, the most apt comparison is not to Google, but to The Pirate Bay if TPB stored all of their copyrighted content and served it up themselves.
With Copilot it's your responsibility not to use it as a search engine to copy-paste code. It's completely obvious when it's being used as a search engine so it's not a problem at all.
Stable Diffusion works on completely different principles, and it can't exactly replicate pixels from its training data.
Ok, cool. Presumably that is because it's smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).
Now, you may be able to get it to reproduce one function, but one file, let alone the whole repository, seems extremely unlikely.
Just to be clear; I cannot prove that they have used my code, but for the sake of argument, lets assume so.
They would have directly used my code when they trained the thing. I see it as the equivalent of creating a zip file. My code is not directly in the zip file either; only by the act of unzipping does it come back, which requires a sequence of math steps.
But there is no equivalent of "unzipping" for Copilot.
This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:
- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.
- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
There is no guarantee that an ML network only reproduces the input data under those two conditions. But even granting the second case:
> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.
But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.
Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.
You can easily see this happen, the regurgitation of training data, in an overfitted neural net.
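As a toy sketch of that regurgitation effect (this is an n-gram lookup table, not a real neural net, and the training snippet is made up for illustration): when one training example dominates, every context has exactly one continuation, so generation is fully deterministic and reproduces the input verbatim.

```typescript
// Toy illustration of memorization/regurgitation. With a single training
// example and a long enough context, there is only one continuation per
// context, so generation reproduces the training data exactly.
const N = 4; // context length in characters

function train(text: string): Map<string, string> {
  const next = new Map<string, string>();
  for (let i = 0; i + N < text.length; i++) {
    next.set(text.slice(i, i + N), text[i + N]);
  }
  return next;
}

function generate(model: Map<string, string>, seed: string, maxLen = 1000): string {
  let out = seed;
  while (out.length < maxLen) {
    const cont = model.get(out.slice(-N));
    if (cont === undefined) break; // context never seen in training
    out += cont;
  }
  return out;
}

const training = "function isEven(n: number) { return n % 2 === 0; }";
const model = train(training);
// Seeding with the first N characters of the training data
// reproduces the whole snippet verbatim.
console.log(generate(model, training.slice(0, N)) === training);
```

A real network is probabilistic and lossy rather than a lookup table, but overtraining pushes it toward exactly this regime for heavily repeated inputs.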
This is not necessarily true; the function space defined by the hidden layers might not contain an exact duplicate of the original training input for all (or even most) of the training inputs. Things that are very well represented in the training data probably have a point in the function space that is "lossy compression" level close to the original training input, though, not so much in terms of fidelity as in changes to minor details.
When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.
Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neural net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.
> which Copilot doesn't have a license to distribute
When you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. The methods used for serving are not specified. This is above and beyond whatever license you choose for your own code.
You also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.
Host that code. Serve that code to other users. It does not grant the right to create derivative works of that code outside the purview of the code's license. That would be a non-starter in practice; see every repository with GPL code not written by the repository creator.
Whether the results of these programs is somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.
Actually, The Pirate Bay was even less of an infringement, since they did not distribute the copyrighted content or derivatives themselves, only indexed where it could be found. With Copilot, all the content you're getting goes through Microsoft.
> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.
This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.
> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?
Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.
The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.
I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.
If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.
It is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you: Copilot does not learn, it glues.
If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.
Copyright laws, if enforced perfectly, would make programming simply impossible. We've been skating by on people not really enforcing them, despite the laws still being on the books, and the existence of tools like this makes that not a viable strategy. Today it's Copilot, which can be shut down, but tomorrow it'll be something developers can run at home. Bits don't have colour; there's no way to distinguish between a copy happening by independent recreation*, and one that's actually a copy. So we'll need proper rulings.
In fact, considering Fauxpilot, that will happen as soon as the models have improved somewhat.
*: Of course I don't think "independent recreation" is really a thing. Humans are excellent at open source laundering. It's called "learning".
The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.
Abstraction
The purpose of the abstraction step is to identify which aspects of the program constitute its expression and which are the ideas. By what is commonly referred to as the idea/expression dichotomy, copyright law protects an author's expression, but not the idea behind that expression. In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program. The abstractions test was first developed by the Second Circuit for use in literary works, but in the AFC test, they outline how it might be applied to computer programs. The court identifies possible levels of abstraction that can be defined. In increasing order of abstraction, these are: individual instructions; groups of instructions organized into a "hierarchy of modules"; the functions of the lowest-level modules; the functions of the higher-level modules; and the "ultimate function" of the code.
Filtration
The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.
The court explains that elements dictated by efficiency are removed from consideration based on the merger doctrine which states that a form of expression that is incidental to the idea cannot be protected by copyright. In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright.
Eliminating elements dictated by external factors is an application of the scènes à faire doctrine to computer programs. The doctrine holds that elements necessary for, or standard to, expression in some particular theme cannot be protected by copyright. Elements dictated by external factors may include hardware specifications, interoperability and compatibility requirements, design standards, demands of the market being served, and standard programming techniques.
Finally, material that exists in the public domain cannot be copyrighted and is also removed from the analysis.
Comparison
The final step of the AFC test is to consider the elements of the program identified in the first step and remaining after the second step, and for each of these compare the defendant's work with the plaintiff's to determine if the one is a copy of the other. In addition, the court will look at the importance of the copied portion with respect to the entire program.
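For readers who think in code, the filtration and comparison steps described above can be sketched as a pipeline. Everything below (the `Element` type, the boolean flags, the plain string comparison) is an invented illustration of the structure of the test, not anything a court actually computes:

```typescript
// Schematic sketch of the Abstraction-Filtration-Comparison test.
// Step 1 (abstraction) is assumed already done: the program has been
// broken into elements at each level of abstraction.
type Level = "instruction" | "module" | "moduleFunction" | "ultimateFunction";

interface Element {
  level: Level;
  text: string;
  dictatedByEfficiency: boolean; // merger doctrine
  dictatedExternally: boolean;   // scenes a faire: hardware, interop, standards
  publicDomain: boolean;
}

// Step 2: filter out material not protectable by copyright.
function filtration(elements: Element[]): Element[] {
  return elements.filter(
    (e) => !e.dictatedByEfficiency && !e.dictatedExternally && !e.publicDomain
  );
}

// Step 3: compare the defendant's work against what survives filtration.
function comparison(plaintiff: Element[], defendant: Element[]): Element[] {
  const protectable = new Set(filtration(plaintiff).map((e) => e.text));
  return defendant.filter((e) => protectable.has(e.text));
}

// A snippet like `n % 2 === 0` gets filtered out (standard technique,
// dictated by the problem), while a distinctive module survives.
const plaintiff: Element[] = [
  { level: "instruction", text: "n % 2 === 0", dictatedByEfficiency: true, dictatedExternally: false, publicDomain: false },
  { level: "module", text: "customRenderPipeline", dictatedByEfficiency: false, dictatedExternally: false, publicDomain: false },
];
const defendant: Element[] = plaintiff.map((e) => ({ ...e }));
console.log(comparison(plaintiff, defendant).map((e) => e.text)); // ["customRenderPipeline"]
```

This is also why the isEven/isPrime examples from the complaint would likely not survive filtration on their own: they fall under standard techniques and efficiency-dictated expression.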
The brain is also just a "complex math program", since math is just the language we use to describe the world. I don't feel this argument has any weight at all.
The legal world tends to be less interested in these kind of logical gotchas than engineering types would like. I don't see a judge caring about that brain framing at all.
Not to mention, if your brain starts outputting Microsoft copyright code, they're going to sue the shit out of you and win, so I'm not sure how that would help even so.
So if I read the Windows Explorer source code, then later produced a line-for-line copy (without referring back to the source), Microsoft couldn't sue me?
Explain yourself. There is no understood natural phenomenon that we could not capture in math. If you argue the behavior of the brain cannot be modeled using a complex math program, you are claiming the brain is qualitatively different from any mechanism known to man since the dawn of time.
The physics that gives rise to the brain is pretty much known. We can model all the protons, electrons and photons incredibly accurately. It's an extraordinary claim to say the brain doesn't function according to these known mechanisms.
You are confusing the nondiscrete math of physics with the discrete math of computation. Even with unlimited computational resources, we can't simulate arbitrary physical systems exactly, or even with limited error bounds (see chaos theory). What a program (mathematical or not) in the Turing-machine sense can do is only a tiny, tiny subset of what physics can do.
Personally I believe it’s likely that the brain can essentially be reduced to a computation, but we have no proof of that.
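A concrete instance of that chaos-theory point: the logistic map at r = 4 is a one-line deterministic system, yet a perturbation of 1e-10 in the initial condition, far below any physical measurement error, roughly doubles every step, so the two trajectories bear no resemblance within a few dozen iterations. The map and parameters below are standard textbook material, not anything from this thread:

```typescript
// Sensitivity to initial conditions in the logistic map x' = r*x*(1-x).
// At r = 4 the map is chaotic: nearby trajectories diverge exponentially,
// which is why "simulate it with limited error bounds" fails in general.
function trajectory(x0: number, steps: number, r = 4): number[] {
  const xs = [x0];
  for (let i = 0; i < steps; i++) {
    xs.push(r * xs[i] * (1 - xs[i]));
  }
  return xs;
}

const a = trajectory(0.2, 60);
const b = trajectory(0.2 + 1e-10, 60); // a 1e-10 perturbation
const gap = a.map((x, i) => Math.abs(x - b[i]));
// After one step the trajectories still agree to ~10 decimal places;
// within a few dozen steps they differ by order one, i.e. they are unrelated.
console.log(gap[1], Math.max(...gap));
```

The same double-precision arithmetic runs both trajectories, so this isn't a rounding artifact of one run; the divergence is a property of the system itself.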
> We can model all the protons, electrons and photons incredibly accurately.
We can't even accurately model a receptor protein on a cell or the binding of its ligands, nor can we accurately simulate a single neuron.
This is one of those hard problems in computing and medicine. It is very much an open question about how or if we can model complex biology accurately like that.
> There is no understood natural phenomenon that we could not capture in math.
This is a belief about our ability to construct models, not a fact. Models are leaky abstractions, by nature. Models using models are exponentially leaky.
> I didn't say we can simulate it.
Mathematics (at large) is descriptive. We describe matter mathematically, as it's convenient to make predictions with a shared modeling of the world, but the quantum of matter is not an equation. A function, at any scale of complexity, does not transmute into matter.
I'm using simulate as a synonym for model. For any biological model at the atomic, molecular and protein levels, accuracy is key for useful models. What I'm saying is that accuracy at that level is a hard problem in computing and biology, and even simple protein interactions are hard problems.
> There is no understood natural phenomenon that we could not capture in math.
You are saying "If we know how something works, we can explain how it works using math."
But we know almost nothing about how the brain works.
> The physics that gives rise to the brain is pretty much known.
...no it is not! No physicist would describe any physical phenomenon as being "pretty much known". Let alone cognition. We don't even have a complete atomic model.
I think you are mostly correct, but most people don't like this explanation and choose to believe in magic or spirits or whatever instead of physical reality. For some reason the brain is "magic" and non-physical unlike other organs (and everything else that exists) to most people. It's almost impossible to convince anyone of this, though, and it's not even worth trying.
> most people don't like this explanation and choose to believe in magic or spirits or whatever instead of physical reality.
You have it reversed. Math is a language tool to describe things, in a limited fashion (our current modeling); matter is physical (even if it's antimatter). Even if you believe that there will be a language that can describe anything, speaking that language or describing something in it still doesn't manifest matter... unless you're into magic or spirits or whatever.
This disconnect has nothing to do with how well we do or do not understand physical phenomena. I think what the OP meant to say (and probably you support) is how the "mind" or how we think, can be described with mathematical models. Maybe one day we will have a full understanding, but we're not there yet and not currently in a way that is legally compelling.
I feel like this is a massive oversimplification...
In this answer, you're completely ignoring the massive fact that we cannot create a human brain. Having mathematical models of particles does not mean we have "solved" the brain. Unless you also believe that these LLMs are actually behaving just like human brains, in that they have consciousness, they have logic, they dream, they have nightmares, they produce emotions such as fear, love, anger, that they grow and change over time, and that they control your body, your lungs, heart, etc...
You see my point, right? Surely you see that the statement 'The brain is also just a "complex math program"' is at best extremely over-simplistic.
There's certainly no model of a brain at the level of protons, electrons and photons. That's way beyond our level of mathematical understanding or computational ability. Biology isn't understood at the level of physics.
Somewhere in the complex math is the origin of whatever it is in intellectual property that we deem worthy of protection. Because we are humans, we take the complex math done by human brains as worthy of protection by fiat. When a painter paints a tree, we assign the property interest in the painting to the human painter, not the tree, notwithstanding that the tree made an essential contribution to the content. The whole point is to protect the interests of humans (to give them an incentive to work). There is no other reason to even entertain the concept of "property".
As long as AIs are incapable of recognizing when they are plagiarizing, as humans are generally capable of, the double standard seems entirely warranted.
Well, that you caught yourself is already something that makes a difference. It would already change the equation if Copilot would send an email saying “Hey, that snippet xyz I suggested yesterday is actually plagiarized from repo abc. I’m truly sorry about that, I’ll do my best to be more careful in the future.”
As far as “citation needed”, humans are being convicted for plagiarism, so it is generally assumed that they are able to tell and hence can be held responsible for it.
Responsibility or liability is really the crux here. As long as AIs can’t be made liable for their actions (output) like humans or legal entities can, instead the AI operators must be held accountable, and it’s arguably their responsibility to take all practical measures to prevent their AIs from plagiarizing, or from otherwise violating license terms.
At this point we are back in the territory that the idea and the expression of the idea are inseparable, therefore the conclusion will be that copyright protection does not apply to code.
Personally I think this has the potential to blow up in everyone's faces.
If it does end up that way, I feel like the trickle away from github will become a stampede. And that would be unfortunate. Having such a good hub for sharing and learning code is useful, but only if licenses are respected. If not, people will just hunker down and treat code like the Coke secret recipe. That benefits no one.
The problem with the class action lawsuit against GitHub is this: if you host your code on GitHub, it doesn't matter what license you use. Microsoft can do whatever they want with it. You agreed to this by agreeing to their terms and conditions.
The end user agreement also says you must have the authority to grant these epic rights to GitHub, i.e. you cannot upload someone else's code. They could probably absolve themselves from responsibility due to your having committed wire fraud in this case. But, alas, IANAL.
If I have access to the source of a BSD, MIT, or GPL project - is there anything in those licenses that would prevent me from mirroring it on GitHub or GitLab?