> Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.
If I, a human, were to:
1. Carefully read and memorize some copyrighted code.
2. Produce new code that is essentially identical to it, but while typing it up, mechanically tweak a few identifiers or the like to produce code that has the exact same semantics but isn't character-for-character identical.
3. Claim that as new original code without the original copyright.
I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.
How is it any different when a machine does the same thing?
You might not get your ass kicked. Copyright doesn't protect function, to the point where the court will assess the degree to which the style of the code can be separated from its function. In the event that they aren't separable, the code is not copyrightable.
Software like Blackduck or Scanoss is designed to identify exactly that type of behaviour. It is used very often to scan closed source software and to check whether it contains snippets that are copied from open source with incompatible licenses (e.g. GPL).
To do so, these tools build a syntax tree of your code snippet and compare its structure with similar trees in open-source software, without being fooled by variable names. To speed up the search, they also compute a signature for each tree so that it can be matched more easily against their database of open-source code.
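The tree-plus-signature idea can be sketched in a few lines of Python with the standard `ast` module. This is not how Blackduck or Scanoss actually implement their matching (their scanners are language-agnostic and far more robust); it's a minimal illustration of the principle: blank out all identifiers, then hash what remains of the tree structure.

```python
import ast
import hashlib

class IdentifierNormalizer(ast.NodeTransformer):
    """Blank out names so two snippets that differ only in
    identifiers produce identical trees."""

    def visit_Name(self, node):
        # Replace every variable reference with a placeholder name.
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_FunctionDef(self, node):
        self.generic_visit(node)          # normalize the body first
        node.name = "_"                   # blank the function name
        node.args.args = [ast.arg(arg="_") for _ in node.args.args]
        return node

def structural_signature(source: str) -> str:
    """Hash the identifier-blind AST dump; equal signatures mean
    the two snippets share the same tree structure."""
    tree = IdentifierNormalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

a = "def add(x, y):\n    return x + y"
b = "def plus(first, second):\n    return first + second"
c = "def sub(x, y):\n    return x - y"

print(structural_signature(a) == structural_signature(b))  # True: same structure
print(structural_signature(a) == structural_signature(c))  # False: Add vs Sub
```

Renaming every identifier leaves the signature unchanged, which is exactly why the mechanical-tweaking scheme described upthread would not fool such a scanner.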
And that's all well and good, but code that purports to be protected by the GPL still has to survive the abstraction-filtration-comparison test.
The plain fact is that you can claim copyright on plenty of stuff that isn't copyrightable.
Consider AI model weights: they're the result of an automatic process and contain no human expression. Almost by definition, model weights shouldn't be copyrightable, yet people are still releasing "open source" models with supposed licenses.
But there has to be a threshold. If a GPL project contains a function which takes two variables and returns x+y, and I have functionally identical code in a project I made with an incompatible license, it is obviously absurd to sue me.
You are right, but there is no legally defined threshold, so it's subjective.
As a matter of fact, the Eclipse Foundation requires every contributor to declare that every piece of code is their own original creation and not a copy/paste from other projects, with the possible exception of other Eclipse Foundation or Apache Foundation projects, because their respective licenses allow it. Even code snippets from StackOverflow are formally forbidden.
If I am not mistaken, in the Oracle v. Google trial over Java on Android, Google's re-implementation of the Java API on Android was ultimately considered fair use, because Google kept the original "signatures" of the Java SDK API and rewrote most of the implementation, copying only about 0.4% of the total Java source code, which was deemed minimal [1].
However the trial came to this conclusion after several iterations in court.
You're right, there is. The threshold is whatever a court decides is "substantial similarity" in that particular case. But there's no way to know that ahead of time as the interpretation/decision is subjective.
The simple version is that code is copyrightable as an expression, while the underlying algorithm is patentable.
The legal term you're looking for here is the "Abstraction-Filtration-Comparison" test: what remains if you subtract all the non-copyrightable elements from a given piece of code.
Algorithms have become patentable only very recently in the history of patents, without a rationale being ever provided for this change, and in some countries they have never become patentable.
Even in the countries other than USA where algorithms have become patentable, that happened only due to USA blackmailing those countries into changing their laws "to protect (American) IP".
It is true however that there exist some quite old patents which in fact have patented algorithms, but those were disguised as patents for some machines executing those algorithms, in order to satisfy the existing laws.
US copyright does protect for "substantial similarity" [0]. And at the other end of the spectrum, this has been abused in absurd ways to argue that substantially different code has infringed.
In ZeniMax v. Oculus they essentially argued that a bunch of really abstract yet entirely generic parts of the code were shared (we're talking nested for loops and certain combinations of if statements), and due to the courtroom's lack of a qualitative understanding of code, syntax, common patterns, and what might actually qualify as substantively novel code, this was accepted as infringing. [1]
Point is, the legal system is highly selective when it comes to corporate interests.
> US copyright does protect for "substantial similarity"
Substantial similarity refers to three different legal analyses for comparing works. In each case what the analysis is attempting to achieve is different, but in no case does it operate to prohibit similarity, per se.
The Wikipedia page points out two meanings. The first is a rule for establishing provenance. Copyright protects originality, not novelty. The difference is that if two people coincidentally create identical works, one after another, the second-in-time creator has not violated any right of the first. (Contrast with patents, which do protect novelty.) In this context, substantial similarity is a way to help establish a rebuttable presumption that the latter work is not original, but inspired by the former; it's a form of circumstantial evidence. Normally a defendant wouldn't admit outright they were knowingly inspired by another work, though they might admit this if their defense focuses on the second meaning, below. The plaintiff would also need to provide evidence of access or exposure to the earlier work to establish provenance; similarity alone isn't sufficient.
The second meaning relates to the fact that a work is composed of multiple forms and layers of expression. Not all are copyrightable, and the aggregate of copyrightable elements needs to surpass a minimum threshold of content. Substantial similarity here means a plaintiff needs to establish that there are enough copyrightable elements in common. Two works might be near identical, but not be substantially similar if they look identical merely because they're primarily composed of the same non-copyrightable expressions, regardless of provenance.
There's a third meaning, IIRC, referring to a standard for showing similarity at the pleadings stage. This often involves a superficial analysis of apparent similarity between works, but it's just a procedural rule for shutting down spurious claims as quickly as possible.
> Point is, the legal system is highly selective when it comes to corporate interests.
I don't even think it's that. In recent cases like Oracle v. Google and Corellium v. Apple, fair use prevailed with all sorts of conflicting corporate interests at play. The ZeniMax v. Oculus case very much revolved around NDAs that Carmack had signed, not the propagation of trade secrets. Where IP is strictly the only thing at issue, the literal interpretation of fair use does still seem to hold.
Or for a plainer example, Authors Guild v. Google, where Google defended its indexing of thousands of copyrighted books as fair use.
In fact, I'd go so far as to argue that your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way. It's a pretty parallel case to a number of the arguments. Indexing required ingesting whole copyrighted works verbatim. It used that ingested data to produce a new commercial work consisting of output derived from that data. If I remember the case correctly, Google even displayed snippets when matching a search so the searcher could see the match in context, reproducing the works verbatim for those snippets, and one could presume (though I don't recall whether it was coded against) that with sufficiently clever search prompts, someone could get the index search to reproduce a substantial portion of a work.
Arguably, the AI platforms have an even stronger case as their nominal goal is not to have their systems reproduce any part of the works verbatim.
> In fact, I'd go so far as to argue that your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way.
The more recent Warhol decision argues quite strongly in the opposite direction. It fronts market impact as the central factor in fair use analysis, explicitly saying that whether or not a use is transformative depends in good part on the degree to which it replaces the original. So if you're writing a generative AI tool that will generate stock photos, trained by scraping stock photo databases... I mean, the fair use analysis need consist of nothing more than that sentence to conclude that the use is totally not fair; none of the factors weigh in favor of it.
I think that decision is much narrower than "market impact". It's specifically about substitution, and to that end, I don't see a good argument that Co-Pilot substitutes for any of the works it was trained on. No one is buying a license to co-pilot to replace buying a license to Photoshop, or GIMP, or Linux, or Tux Racer. Nor is Github selling co-pilot for that use.
To the extent that a user of co-pilot could induce it to produce enough of a copyrighted work to both infringe on the content (remember that algorithms are not protected by copyright) and substitute for licensing the original, I would expect the courts to examine that the way they currently view a Xerox machine being used to create copies of a book. While the machine might have enabled the infringement, it is the person using the machine to produce and then distribute copies who is doing the infringing, not the Xerox machine itself nor Xerox the company.
Specifically in the opinion the court says:
> If an original work and a secondary use share the same or highly similar purposes, and the secondary use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.
I find it difficult to come up with a good case that any given work used to train co-pilot and co-pilot itself share "the same or highly similar purposes". Even in the case of say someone having a code generator that was used in training of co-pilot, I think the courts would also be looking at the degree to which co-pilot is dependent on that program. I don't know off hand if there are any court cases challenging the use of copyright works in a large collage of work (like say a portrait of a person made from Time Magazine covers of portraits), but again my expectation here is that the court would find that while the entire work (that is the magazine cover) was used and reproduced, that reproduction is a tiny fraction of the secondary work and not substantial to its purpose.
Similarly we have this line:
> Whether the purpose and character of a use weighs in favor of fair use is, instead, an objective inquiry into what use was made, i.e., what the user does with the original work.
Which I think supports my comparison to the Xerox machine. If the plaintiffs against Co-Pilot could have shown that a substantial majority of users and uses of Co-Pilot were producing infringing works, or works that substitute for the training material, they might prevail in an argument that Co-Pilot is infringing regardless of the intent of GitHub. But I suspect even that hurdle would be pretty hard to clear.
Of the various recent uses of generative AI, Copilot is probably the one most likely to be found fair use and image generation the least likely.
But in any case, Authors Guild is not the final word on the subject, and anyone trying to argue for (or against) fair use for generative AI who ignores Warhol is going to have a bad day in court. The way I see it, Authors Guild says that if you are thoughtful about how you design your product, and talk to your lawyers early and continuously about how to ensure your use is fair and will be seen as fair in the courts, you can indeed do a lot of copying and still be fair use.
I agree. Nothing is going to be the final word until more of these cases are heard. But I still don't think Warhol is as strong even against other uses of generative AI, and in fact I think in some ways argues in their favor. The court in Warhol specifically rejects the idea that the AWF usage is sufficiently transformed by the nature of the secondary work being recognizably a Warhol. I think that would work the other way around too, that a work being significantly in a given style is not sufficient for infringement. While certainly someone might buy a license to say, Stable Diffusion and attempt to generate a Warhol style image, someone might also buy some paints and a book of Warhol images to study and produce the same thing. Provided the produced images are not actually infringements or transformations of identifiably original Warhol works, even if they are in his style, I think there's a good argument to be made that the use and the tool are non-infringing.
Or put differently, if the Warhol image had used Goldsmith's image as a reference for a silk-screen portrait of Steven Tyler, I'm not sure the case would have gone the same way. Warhol's image is obviously and directly derived from Goldsmith's and was found infringing when licensed to magazines; yet if Warhol had instead gone out and taken black-and-white portraits of Prince, even in Goldsmith's style after having seen it, would that have been infringing? I think the closest case we have to that would be the suit between Huey Lewis and Ray Parker Jr. over "I Want a New Drug"/"Ghostbusters", but that was settled without a judgment.
I do agree that Warhol is a stronger argument against artistic AI models, but it would very much depend on the specifics of the case. The AWF usage here was found to be infringing with no judgment made of the creation and usage of the work in general, but specifically with regard to licensing the work to the magazine. They point out the opposite case: his Campbell's Soup paintings are well established as non-infringing in general, but the use of them licensed as logos for soup makers might well be. So, as is the issue with most lawsuits (and why I think AI models in general will win the day), the devil is in the details.
A key finding by the judge in the Authors Guild v. Google case was that the authors benefited from the tool Google created. A search tool is not a replacement for a book, and it is much more likely to generate awareness of the book, which in turn should increase sales for the author.
AI platforms that replace and directly compete with authors cannot use the same argument. If anything, those suing AI platforms are more likely to bring up Authors Guild v. Google as a guiding case for determining when to apply fair use.
Yep. Now it is a legal cudgel wielded most effectively by corporate giants. It has mutated to become completely philosophically opposed to what it was expressly created to protect.
if that is the case why do people ever license covers?
to clarify - I thought you just had to negotiate with the cover artist about rights and pay a nominal fee for usage of the song for cover purposes; that is to say, you do not negotiate with the original artist, you negotiate with the cover artist, and the whole process is cheaper?
You're maybe thinking about this in a way that's not helping you to understand the system and why it works the way it does. It's very clear when you think of a specific case.
Say you want to make a recording of "Valerie" by the Zutons. You need permission (a license) from the songwriters (the Zutons presumably) to do this. You usually get this permission by paying a fee. Having done that, you can do your recording. Whenever that recording is played (or used) you will get a performance royalty and they will get a songwriting royalty.
Say you want to use a cover of "Valerie" by the Zutons in your film or whatever. Say the Mark Ronson version featuring Amy Winehouse. You need permission (a license) from the person who produced that version (Mark Ronson or his company) and will need to pay them a fee, some of which goes to the songwriter as part of their deal with Mark Ronson which gave him the license to produce his cover in the first place.
The Zutons don't have the right to sell you a license to Mark Ronson's version so if that's the version you want you have to negotiate with him. Likewise he doesn't have the right to sell you a license like the license he has (ie a license to do a recording/performance) so if you want that you have to negotiate with them.
OK it seems exactly what I thought and described, and the opposite of what the parent poster described. The parent poster said that if you want to use the cover of the song you need to negotiate with both the people who did the cover and the original rights owner.
The closest I could get to a situation like that would be if I told Band B to do a cover of Song A for my movie and paid the licensing costs as part of my deal with Band B, but that's still not the same as the parent poster's description.
While correct, the example given is that they COPY the code, then make adjustments to hide the fact. I suspect this is still a copyright violation. It’s interesting that a judge sees it differently when it’s just run through a programme. I’m not a legal expert so I’m guessing it’s a bit more complex than the headline?
Ok I read the article and it looks like the issue is the DMCA specifically, which require the code to be more identical than is presented. I’m guessing separate claims could still come from other copyright laws?
No copy-paste was explicitly used. They compressed it into a latent space and recreated from memory, perhaps with a dash of "creativity" for flavor. Hypothetically, of course.
The distinction is pedantic but important, IMHO. AI doesn't explicitly copy either.
But isn’t that the same as memorising it and rewriting the implementation from memory? I’m sure “it wasn’t an exact reproduction” is not much of a defence.
I would have thought so but I’m not a lawyer. The article suggests DMCA is intended for direct copies so that’s why it failed here. Maybe more general copyright laws would apply for lossy copies.
You have a much smaller lobbying budget than the AI industry, and you didn't flagrantly rush to copy billions of copyrighted works as quickly as possible and then push a narrative acting like that's the immutable status quo that must continue to be permitted lest the now-massive industry built atop copyright violation be destroyed.
Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.
> Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.
Sounds like the same concept as commonly said of "murderer vs conqueror".
Could probably be applied to many other fields for disruption too. Not the murderer bit (!), more the "break one or two laws -> scaled up massively to a potential new paradigm".
There's a strong geopolitical angle as well. If you force American companies to license all training data for LLMs, that is such a gargantuan undertaking it would effectively set US companies back by years relative to Chinese competitors, who are under no such restrictions.
Bottom line, if you're doing something considered relevant to the national interest then that buys you a lot of leeway.
You want to look at the Supreme Court case "Eldred v. Ashcroft." Eldred challenged Congress's retroactive extension of existing copyrights, arguing that extending protection on existing works could not possibly further the arts and sciences. They also argued that if Congress had the power to continually extend existing copyrights by N years every N years, the Constitutional phrase "for limited times" had no meaning.
The Supreme Court's decision was a bunch of bullshit around "well, y'know, people live longer these days, and some creators are still alive who expected these to last their whole lives, and golly, coincidentally this really helps giant corporations."
Copyright has utterly failed to serve that purpose for a long time, and has been actively counterproductive.
But if you want to argue that copyright is counterproductive, I completely agree. That's an argument for reducing or eliminating it across the board, fairly, for everyone; it's not an argument for giving a free pass to AI training while still enforcing it on everyone else.
Could these "free passes" for AI training serve as a legal wedge to increase the scope of fair use in other cases? Pro-business selective enforcement sucks, but so long as model weights are being released and the public is benefiting then stubbornly insisting that overzealous copyright laws be enforced seems self-defeating.
Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.
Just because their lobbies tend to push the boundaries of copyright into the absurd doesn't mean these industries aren't worth saving. There should be lawmakers who actually seek a balance between public and commercial interests.
> Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.
Citation needed. There are many ways to make money from producing content other than restricting how copies of it can be distributed. The owner should be able to choose copyright as a means of control, but that doesn't mean nobody would create any content at all without copyright as a means of control.
There's nothing preventing people from producing works and releasing them without copyright restriction. If that were a more sustainable model, it would be happening far more often.
As it is now, especially in the creative fields (which I am most knowledgeable about), the current system has allowed for an incredible flourishing of creation, which you'd have to be pretty daft to deny.
> If that were a more sustainable model, it would be happening far more often.
that's not the argument. The problem is that there currently are restrictions on producing derivative works. You cannot produce a Star Wars story without consent from Disney. You cannot write a Harry Potter story without consent from Rowling.
That's not actually true. There's nothing stopping you from producing derivative works. Publishing and/or profiting from other people's work does have some restrictions though.
There's actually a huge and thriving community of people publishing derivative works, in a not-for-profit basis, on Archive of Our Own. (Among other places.)
> There's actually a huge and thriving community of people publishing derivative works, in a not-for-profit basis, on Archive of Our Own. (Among other places.)
Yes, and none of those people are making a living at creating things. That's why they are allowed by the copyright owners to do what they're doing--because it's not commercial. Try to actually sell a derivative work of something you don't own the copyright for and see how fast the big media companies come after you. You acknowledge that when you say there are "restrictions" (an understatement if I ever saw one) on profiting from other people's work (where "other people" here means the media companies, not the people who actually created the work).
It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.
> Yes, and none of those people are making a living at creating things.
Again, not true. One of the most famous examples is likely Naomi Novik, who is a bestselling author, in addition to a prolific producer of derivative works published on AO3. Many other commercially successful authors publish derivative works on this platform as well.
> It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.
Speculate all you want about an alternative system, but you really don't know what would have happened, or what would happen moving forward.
Sorry, I meant they're not making a living at creating derivative works of copyrighted content. They can't, for the reasons you give. Nor can other people make a living creating derivative works of their commercially published work. That is an obvious barrier to creation.
> the current system has allowed for a incredible flourishing of creation
No, the current system has allowed for an incredible flourishing of middlemen who don't create anything themselves but coerce creative people into agreements that give the middlemen virtually all the profits.
People do not put out their stuff. People get lured into contracts selling their IP to a shitty company that then publishes the work, of course WITH copyright, so the company can make money while the artist doesn't.
Yes, they can't, because there is no legally reliable way to do it (briefly, because the law really doesn't like the idea of property that doesn't have an owner, so if you try to place a work of yours in the public domain, what you're actually doing is making it abandoned property so anyone who wants to can claim they own it and restrict everyone else, including you, from using it). The best an author can do is to give a license that basically lets anyone do what they want with the work. Creative Commons has licenses that do that.
Copyright laws prevent piracy. It is interesting to live in a country with no enforced copyrights and EVERYTHING is pirated. I think it is easy to not know about that context and just see the stick side of copyright vis-a-vis big money corporations
Technically speaking, copyright laws create piracy: without them we would still have our free-speech rights to share whatever we want without approval from third parties, and so-called piracy, aka copyright infringement, would not be a thing. Laws also hardly prevent the sharing of copyrighted content; they only make it illegal.
> we would still have our free speech rights to share whatever we want
This is a false dichotomy. It's not "free speech" to copy someone else's video game and then sell it for your own profit. By "copy", in the old days that was literally copying the distribution CDs and providing a cracked keycode (it was not even a question of trademarks being close or what not. It's literally people taking the stuff, duplicating it, and selling it for their own profit. Eastern European mafia were greatly financed by this and ran this type of operation at industrial scale).
> Laws also hardly prevent sharing of copyrighted content, they only make it illegal.
Yeah, that's the point. Without that, everything is bootlegged. Imagine video games - they get bootlegged. DVDs, all bootlegged. Clothing bootlegged. Whatever your business is - bootlegged. Zero copyright is not a utopia of free speech, it is people ripping everyone else off. Per lived experience, I'm just saying the other extreme is not a utopia.
So true! Copyrights that last 20 years would be completely reasonable. Maybe with exponentially increasing fees for successive renewals, for super valuable properties like Disney movies.
Nobody cares anymore. We're sick of their rent seeking, of their perpetual monopolies on culture. Balance? Compromise? We don't want to hear it.
Nearly two hundred years ago one man warned everyone this would happen. Nobody listened. These are the consequences.
"At present the holder of copyright has the public feeling on his side. Those who invade copyright are regarded as knaves who take the bread out of the mouths of deserving men. Everybody is well pleased to see them restrained by the law, and compelled to refund their ill-gotten gains. No tradesman of good repute will have anything to do with such disgraceful transactions. Pass this law: and that feeling is at an end. Men very different from the present race of piratical booksellers will soon infringe this intolerable monopoly. Great masses of capital will be constantly employed in the violation of the law. Every art will be employed to evade legal pursuit; and the whole nation will be in the plot. On which side indeed should the public sympathy be when the question is whether some book as popular as “Robinson Crusoe” or the “Pilgrim’s Progress” shall be in every cottage, or whether it shall be confined to the libraries of the rich for the advantage of the great-grandson of a bookseller who, a hundred years before, drove a hard bargain for the copyright with the author when in great distress? Remember too that, when once it ceases to be considered as wrong and discreditable to invade literary property, no person can say where the invasion will stop. The public seldom makes nice distinctions. The wholesome copyright which now exists will share in the disgrace and danger of the new copyright which you are about to create. And you will find that, in attempting to impose unreasonable restraints on the reprinting of the works of the dead, you have, to a great extent, annulled those restraints which now prevent men from pillaging and defrauding the living."
Have you looked at who created these things by and large? For the most part, you have:
- aristocrats that were wealthy that didn't need to "work" to survive and put food on the table
- crafts people supported through the patronage of a rich person (or religious order) who deign to support your art
- (in the more modern world) national governments who want to support their national art, often out of fear that larger nations' cultural influences will dwarf their own
Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefit from that network effect?
A modern equivalent would be famous YouTubers whose entire output consists of "watching" other people's hard-earned videos. The super lazy ones don't direct people to the original, don't provide meaningful commentary, just consume the video as 'content' to feed their own audience, and provide no value to the original creator. Killing copyright entirely would amplify this "just bypass the original source" dynamic until the value to the original creator drops to zero.
> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
Do you think the vast "amount of content we produce" is actually propped up by copyright? Have you ever heard of someone who started their career on YouTube due to copyright? On the contrary, how often have you heard of people stopping their YouTube career due to copyright, or explicitly limiting the content they create? I have only heard of cases of the latter. In fact, the latter partially happened to me.
> How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?
You are making an assumption that people should reap (monetary) benefits for creating things. What you are ignoring is that the world where digital copies are effectively free is also the world where original works are insanely cheap as well. In this world, people create regardless of monetary gain.
To make this point: how much money did you make from this comment that you posted? It's covered by copyright, so surely you would not have created it if not for your own benefit.
Spending six minutes of my life engaging in political discourse is a far swing from hundreds of individuals producing a movie that took millions of dollars to make. Both are just as easily digitally reproducible, but the expensive content is likely far more beneficial to society as a whole. I am choosing to engage in this hobby because I have the means to provide this content recreationally. I fail to see this scaling to anything of real quality outside of some isolated instances. For instance, some video game enthusiasts are using the work of Bethesda to make a new game called Fallout: London. It's a knock-off Fallout game using the engine that Bethesda built for its commercial games. The game is exceptional in that it could actually reach a mostly comparable level to a commercial product, as long as you ignore that it leverages the engine and story which were developed by commercial interests. At the same time, tens to hundreds of thousands of people are employed every year to produce video games for commercial reasons. Would they all stop making games if copyright were dead? No, but the vast majority would.
> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
Yes, and better quality content too as it doesn't need to be compromised as much to allow for commercial exploitation in the current model.
But these are also not the only ways to fund content. Patronage in particular does not need to be restricted to singular rich patrons, but can be extended to any group of people who decide to come together to make something exist. This does already happen to some extent (e.g. Kickstarter), but it is actually hobbled by copyright, where the norm is that the creator retains all rights while the individual contributors to the funding are restricted in how they are allowed to share the creation they helped realize.
> How in the world where digital copies are effectively free to copy and infinitum would a creator reap any benefits from that network effect?
By having fans willing to pay him to create new content.
If everyone could do it, it wouldn't be as big a deal - small western businesses would be on a more level playing field, since they would be almost as immune from being sued by big businesses as Chinese businesses are. As it is, small businesses aren't protected by patents (because a patent is a $10k+ ticket to a $100k+ lawsuit against a competitor with a $1M+ budget for lawyers) while still being bound by the restrictions of big business's patents. It's lose/lose.
Video games would actually be better off if the profit incentive were removed. Modern high-budget video games have become indistinguishable from slot machines, optimized by literal psychologists to get you to waste as much of your money (and time) as possible without providing any meaningful experience. I'd rather see far fewer games created if what remains are games focused on having artistic and/or educational value rather than being investment opportunities for Wall Street.
This is just your own sanctimony. Go to a GameStop and ask people if they think we should have an IP regime where there are no GTA or football games. What a ridiculous response.
This is a specious argument. It is impossible for us to gesture at the works of art that do not exist because of draconian copyright. Humans have been remixing each others' works for millions of years, and the artificial restriction on derivative work is actively destroying our collective culture. There should be thousands of professional works (books, movies, etc.) based on Lord Of The Rings by now, many of which would surpass the originals in quality given enough time, and we have been robbed of them. And Lord Of The Rings is an outlier in that it still remains culturally relevant despite its age; most works will remain copyrighted for far longer than their original audience was even alive, meaning that those millions of flowers never get their chance to bloom.
> It is impossible for us to gesture at the works of art that do not exist because of draconian copyright.
We can gesture at the tiniest tip of the iceberg by observing things that are regularly created in violation of copyright but not typically attacked and taken down until they get popular:
- Game modding, romhacks, fangames, remakes, and similar.
- Memes (often based on copyrighted content)
- Stage play adaptations of movies (without authorization)
- Unofficial translations
- Machinima
- Speedruns, Let's Play videos, and streams (very often taken down)
- Music remixes and sampling
- Video mashups
- Fan edits/cuts, "Abridged" series
- Archiving and preservation of content that would otherwise be lost
There are several other publishers who regularly go after gameplay footage of people playing their games. It's not as visible, because it's hard to notice the absence of a thing.
This is all true, and in a vacuum I agree with it. There's a pretty core problem with these kinds of assertions, though: people have to make rent. Never have I seen a substantive, pass-the-sniff-test argument for how to make this system practical when your authors and your artists need to eat in a system of modern capital.
So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?
> What's the A to B if you could pass a law tomorrow?
Top priority: UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.
Apart from that: Conventions/concerts/festivals (tickets to a unique live event with a crowd of other fans), merchandise (pay for a physical object), patronage (pay for the ongoing creation of a thing), crowdfunding/Kickstarter (pay for a thing to come into existence that doesn't exist yet), brand/quality preference (many people prefer to support the original even if copies can be made), commissions (pay for unique work to be created for you), something akin to "venture funding", and the general premise that if a work spawns ten thousand spinoffs and a couple of them are incredible hits they're likely to direct some portion of their success back towards the work they build upon if that's generally looked upon favorably.
People have an incredible desire both to create and to enjoy the creations of others, and that's not going to stop. It is very likely that the concept of the $1B movie would disappear, and in trade we'd get the creation of far far more works.
> UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.
The poster didn't frame it as "how does this make huge amounts of money"; they asked how authors are supposed to pay their rent in your scenario. Your solution, of course, has nothing to do with copyright policy.
Yeah, this is what I was expecting. I have no love for Disney et al but I think that this is dire (aside from UBI, which would be great but is fictional without a large-scale shift in American culture).
"Everybody else gets paid for the work they do; you get paid for things around the work you do, if you're lucky" is a way to expect creatives to live that, to put a point on it, always ends up being "for thee, but not for me". It's bad enough today--I think you described something worse.
The current model is "most people get paid for the work they do, but you get paid for people copying work you've already done", which already seems asymmetric. This would change the model to "people get paid for the work they do, and not paid again for copying work they've already done".
We converged on a system that protects the commercialization of copies because, in practice, "the first copy costs $X0,000" is not a viable way to pay your rent.
If we want art to be the province of the willfully destitute or the idle rich (and I do mean rich, the destruction of a functional middle class has compacted the available free time of huge swaths of society!), this is a good way to do it. I would rather other voices be included.
We converged on a system that makes copying illegal because that system was invented in an era when the only people who could copy were those with specialized equipment (e.g. printing presses). In that world, those who might do the copying were often larger than those whose works were being copied, and copyright had more potential to be "protective".
That system hasn't been updated for a world in which everyone can make perfect-fidelity copies or modifications at the touch of a key; on the contrary, it's been made stricter. And worse, per the story we're commenting on here, the much larger players who are mass-copying works largely by individuals or smaller entities have become effectively exempt from copyright, while copyright continues to restrict individuals and smaller entities, and the systems designed by those large players and trained on all those copied works are crowding individuals out of art and other creative endeavors.
I don't think the current system deserves valorizing, nor can it be credited as being intentionally designed to bring about most of the effects it currently serves.
I'm not suggesting that deleting copyright overnight will produce a perfect system, nor am I suggesting that it has zero positive effects. I'm suggesting that it's doing substantial harm and needs a massive overhaul, not minor tweaks.
Many of the funding models Josh listed are direct payment for creative work being done. If anything, in the current model creative work is often not paid directly (unless done as work for hire, where the creative doesn't get to own their creation) but is instead a gamble that you can later profit from the "intellectual property".
>So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?
Patreon (or liberapay etc). Take a look at youtube: so many creators are actively saying "youtube doesn't pay the bills, if you like us then please support us on Patreon". Patreon works. Some of the time, at least - just like copyright. Also crowdfunding (e.g. Kickstarter), which worked out well for games like FTL and Kingdom Come: Deliverance.
Although, I personally don't believe copyright should be abolished - it just needs some amendments. It needs a duration amendment - not a flat duration (fast fashion doesn't need even 5 years of copyright, but aerospace software regularly needs several decades just to become profitable), but either some duration mechanism or a simple discrimination by industry.
Also, I think any sort of functional copyright (e.g. software copyright) ought to have an incentive or requirement to publish the functional bits - for instance, router firmware ought to require the source code in escrow (to be published once copyright duration expires) for any legal protections against reverse-engineering to be mounted. Unpublished source code is a trade secret, and should be treated as such.
Also, these discussions don't seem to mention fanfiction, which demonstrates plenty of people write good works without being professionally paid and without the protection of copyright.
How many subscribers on Patreon are there because the creator provides pay-walled extra content? How many would remain if that pay-walled content were mirrored directly on YouTube?
Crowdfunding might work better, but how many would donate to a game when, instead of getting it cheaper as a Kickstarter supporter, they could get it for free after it is released?
Copyright is not optimized for making sure artists and authors get enough to eat. It's optimized for people with a lot of money to make even more money by exploiting artists and authors.
I doubt there's a simple answer (I certainly don't have one), but the current system is not exactly a creators' utopia.
My own business model is to create Things That Don't Exist Yet. This (typically bespoke work) is actually the majority of work in any era I think. For me, copyright doesn't do much, it mostly gets in the way.
If you pass the law tomorrow -all else being equal- my profits would stay equal or go up somewhat.
Fashion is traditionally not copyrightable [1], and the fashion industry is doing rather well.
Similarly our IT infrastructure is now built mostly on [a set of patches to the copyright system][2] called F/L/OSS that provided more freedom to authors and users, and lead to more innovation and proliferation of solutions.
So even just in the modern west, we can see thriving ecosystems where copyright is absent or adjusted; and where the outcomes are immediately visible on the street.
[1] Though a quick search shows that lawyers are making inroads.
That ship sailed long ago. While copyright can and is used at times to protect the "little guy", the law is written as it is in order to protect and further corporate interests.
The current manifestation of copyright is about rent-seeking, not promoting innovation and creativity. That it may also do so is entirely coincidental.
Also, if it wasn't about rent-seeking and preventing access to works, copyright wouldn't have to last for decades, many multiples of a work's useful commercial life. The fact that it does last this long shows that it's not about promoting innovation and creativity.
Copyright was invented by a cartel of noblemen, the British Stationers' Company, who, due to liberal reform, were about to lose their publishing monopoly. The copyright law they helped pen allowed them to mostly keep their position while portraying it as "protecting the little guy".
Funny how both the rhetoric and intentions are the same after three hundred years.
Copyright’s purpose is a cudgel to be wielded to enrich the holder for, ideally, eternity. If “eternity” is threatened, you use proceeds from copyright to change copyright law to protect future proceeds.
What are you going to do about it? Confiscate everyone's home gamer PCs?
Even in the most extreme hypothetical where lawsuits shut down OpenAI, that doesn't delete the stable diffusion models that I have on my external hard drives.
Somehow this argument does not seem to hold for copyright enforcement of works that have been shared over BitTorrent and its predecessors for decades.
That's a significant oversimplification of how it works, though, to the point of almost not being a useful analogy.
If your analogy was that you were a human who memorized every variation of a problem (and every other known problem), and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized, but then added an after-the-fact filter so you don't directly reproduce it...
It's more like musicians who copy a bunch of music patterns or chord progressions, then notice their final output sounds too similar to another song (which happens often IRL), and change it to be more original before releasing it to the public.
> If your analogy was that you were a human who memorized every variation of a problem (and every other known problem)
This is mere assumption. AI is supposed to work like that, but that's a goal, not the result of current implementations. Research shows that they do memorize solutions as well, and quite regularly so. (This is an unavoidable flaw in current LLMs; they must be capable of memorizing input verbatim in order to learn specific facts.)
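A toy way to see why memorization happens (nothing here is specific to any real model; the corpus and context size are made up): even a trivial character-level Markov chain trained on a tiny corpus regurgitates its training text verbatim, because sparse training data leaves each context only one possible continuation.

```python
import random
from collections import defaultdict

def train(text, order=8):
    """Build a character-level Markov model: context -> possible next chars."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length=60):
    """Sample continuations; with sparse data this just replays the training text."""
    out = seed
    order = len(seed)
    while len(out) < length:
        choices = model.get(out[-order:])
        if not choices:
            break
        out += random.choice(choices)
    return out

corpus = "def add(a, b):\n    return a + b\n"
model = train(corpus)
completion = generate(model, corpus[:8], length=len(corpus))
# With a single training example, "generation" is the training data verbatim.
print(completion == corpus)
```

Large models trained on huge corpora mostly avoid this because each context has many continuations, but rare or heavily duplicated snippets recreate exactly this sparse-data situation.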
> and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized
This is copyright infringement. Actionable copyright infringement. The big music publishers go after this kind of accidental partial reproduction.
> but then added an after the fact filter so you don't directly reproduce it...
"Legally distinct" is a gimmick that only works where the copyright is on specific identifiable parts of a work.
Changing a variable name does not make a code snippet "legally distinct", it's still copyright infringement.
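To illustrate why renaming identifiers doesn't create a distinct work: here is a rough sketch (not how any particular scanner works; the two snippets are invented) of comparing code structurally, with identifier names normalized away, using Python's standard `ast` module.

```python
import ast

class Normalize(ast.NodeTransformer):
    """Replace every variable/argument/function name with a canonical
    placeholder, so snippets differing only in identifiers compare equal."""
    def __init__(self):
        self.names = {}

    def canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self.canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.canon(node.arg)
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = self.canon(node.name)
        return node

def signature(src):
    """Structural signature: the AST dump after identifier normalization."""
    return ast.dump(Normalize().visit(ast.parse(src)))

original = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
renamed  = "def sum_up(items):\n    s = 0\n    for i in items:\n        s += i\n    return s\n"

print(signature(original) == signature(renamed))  # identical structure
```

This is the same basic idea as comparing syntax trees rather than text: the rename changes every token but leaves the structural signature untouched.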
Meh, I still see that as a big oversimplification. Context matters, even if the copyright courts often ignore that for wealthy entities. Someone reproducing a song using AI and publishing it as their own is copyright infringement; a person specifically querying an AI engine that sucked up billions of lines of information, which generates what you ask it to with a small probability of reproducing a small subset of a larger commercial project and sends it to someone in a chat box, is not exactly the same, IMO.
This is GitHub Copilot, after all. I use it daily, and it autocompletes lines of code or generates functions you can find on Stack Overflow. It's not giving you the source code to Twitter in full and letting you put it on the internet as a business under another name.
We are currently seeing the music industry react to AI learning a bunch of music patterns and chord progressions and outputting works that sound very similar to existing music and artists. They do not like it.
To see just how much they dislike it: YouTube's copyright-strike system is basically an AI trained to detect music patterns, so it can identify audio that is a slight variation of a copyrighted song and take the video down. Generating slight variations was one of the early methods videos used to bypass the takedown system.
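The internals of YouTube's matching system are not public, but detectors of this kind can be sketched as comparing sets of overlapping local fingerprints, so that a small tweak only disturbs the fingerprints near the change while the rest still match. A minimal text-based illustration (the inputs are made up):

```python
def shingles(text, n=6):
    """Overlapping character n-grams: cheap local fingerprints of the content."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Fraction of fingerprints the two sets share (1.0 = identical)."""
    return len(a & b) / len(a | b)

original  = "while the music plays we dance along to the beat all night"
tweaked   = "while the music plays we move along to the beat all night"
unrelated = "completely different words that share nothing with the rest"

# The one-word tweak leaves most fingerprints intact; unrelated text shares few.
print(jaccard(shingles(original), shingles(tweaked)))
print(jaccard(shingles(original), shingles(unrelated)))
```

Real audio fingerprinting works on spectral features rather than characters, but the robustness to "slight variations" comes from the same set-overlap principle.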
> The most recently dismissed claims were fairly important, with one pertaining to infringement under the Digital Millennium Copyright Act (DMCA), section 1202(b), which basically says you shouldn't remove without permission crucial "copyright management" information, such as in this context who wrote the code and the terms of use, as licenses tend to dictate.
> It was argued in the class-action suit that Copilot was stripping that info out when offering code snippets from people's projects, which in their view would break 1202(b).
> The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply. Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.
So (not a lawyer!) this reads like the point about GitHub tuning their model is not a generic defense against any and all claims of copyright infringement, but a response to a specific claim that this violates a provision of the DMCA.
I don't know whether this is a reasonable defense or not, but your intuitions or mine about whether there is a general copyright violation or what's fair are not necessarily relevant to how the judge construes that very specific bit of legal code.
What I got from this is, you can copy someone's copyrighted work provided you tweak a few things here and there. I wonder how this holds up in court if you don't have billions at your disposal.
Just to set the stage and not entirely specific to this complaint... It really depends on what is and isn't subject to copyright for software.
Broadly, there is the distinction between expressive and functional code. [1]
And then there are the specific tests that have been developed by the courts to separate the expressive and functional aspects of software. [2] [3]
In practice it is very expensive for a plaintiff to do such analysis. For the most part the damages related to copyright are not worth the time and money. Plaintiffs tend to go for trade secret related damages as they are not restricted by the above tests.
There are also arguments to be made of de minimis infringements that are not worth the time of the court.
Most importantly the plaintiff fundamentally has the burden of proof and cannot just say that copying must have taken place. They need concrete evidence.
The guy who owns the machine is really rich, while you are more or less (all due respect of course) not worth suing.
That’s why I think the opposite of what you claim is true: if you were to do this, absolutely nothing would happen. When they do it, they will get sued over and over until the law changes and they can’t be sued, or they enter some mutually-beneficial relationship with the parties who keep suing.
Those emulators are very popular, though, to the point of potentially impacting another business's bottom line, whereas an individual putting out a small block of code isn't exactly going to attract expensive lawyers.
I'm skeptical Github Copilot reproducing a couple functions potentially used by some random Github project is going to be a threat to another party's livelihood.
When AI gets good enough to make full duplicates of apps I'd be more concerned about the source. Thousands of smaller pieces drawn from a million sources and being combined in novel ways is less worrying though.
In that case, could you clarify what instances of this you're referring to?
The death of Citra wasn't really a deliberate action on the part of Nintendo, it was collateral damage. Citra was started by Yuzu developers and as part of the settlement they were not able to continue working on it. Citra's development had long been for the most part taken over by different developers, but the Yuzu people were still hosting the online infrastructure and had ownership of the GitHub repository, so they took all of it down. Some of the people who were maintaining Citra before the lawsuit opened up a new repository, but development has slowed down considerably because the taking down of the original repository has caused an unfortunate splintering of the community into many different forks.
There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed. If indeed they did go after UltraHLE, then, just like with Yuzu, it would be a case of them taking down an emulator for a console they were still profiting from, as UltraHLE was released in 1999.
The most famous example of companies going after emulators is Sony, which went after Connectix Virtual Game Station and Bleem!. Both were PS1 emulators released in 1999, a period during which Sony was still very much profiting from PS1 sales. Sony lost both lawsuits and hasn't gone after emulators since.
In 2017, Atlus tried to take down the Patreon page for RPCS3, a PS3 emulator. However, Atlus only went after the Patreon page, not the emulator itself, which they did because of their use of Persona 5 screenshots on said page. The screenshots were simply taken down and the Patreon page was otherwise left alone. Of note is that Atlus is a game developer, so they were never profiting from PS3 sales. However, they were certainly still profiting from Persona 5 sales, which had only released in 2016.
These are the only examples I can remember. Did I miss anything?
Emulators for many Nintendo consoles have been developed and released while the console was still being sold, and have been left alone as long as they had no direct links to piracy; recent events are a bit of a change.
> There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed.
IIRC it got a C&D, but a case was never filed in court; the source code turned up eventually anyway.
The bnetd server emulator, which let Diablo and StarCraft players play online without going through Blizzard's Battle.net, though that's a bit different.
Yes there is. If I can emulate Super Mario Odyssey on my PC, I don't need to buy a Nintendo Switch. If it wasn't available there, I'd have to buy a Nintendo Switch to play it. That's a lost sale for Nintendo. You could argue that I wasn't going to buy a switch anyway, but then we're getting too into hypotheticals.
This is the same reasoning the music and movie industries use when they go after people downloading music. And contrary to popular opinion, I think it is wrong: if people want to pay, they will pay. Same for movies: if people really wanted to pay for a movie, they would go to a cinema, or stream it after a week or two. But there are also people who would rather jump through hoops than pay for music or movies. And that is not a lost sale, because there was never an intention to buy anything in the first place.
I enjoy how you removed the “I think” qualifier which suggested that it’s very possible that you’re right.
I’m quite well read on the DMCA but admit you probably know far more about how Nintendo wields it.
Still, I suggest that it’s a lot more likely that GitHub is going to get sued than you or GP.
Finally, I believe using the legal system to bully independent software developers is, in legal terms, super lame. We are probably on the same side here.
DMCA (at least the takedown-request part) is not really suing someone, and not really about making money. It's about getting certain works off the internet.
You are probably more likely to be on the wrong end of a DMCA takedown request as a poor person, since you don't have the resources to fight it, and it's not about recovering damages, just censorship.
We are really losing the plot of what this thread is about here, but: DMCA takedown requests that are ignored, or where the site does not comply with the process, are subject to private civil action. Obviously, a takedown request is distinct from suing someone. And the way the rights holder forces the site to remove the content is under threat of monetary penalties.
> How is it any different when a machine does the same thing?
I think the argument is that the machine is not doing that, or at least there isn't evidence that it is doing that.
Specifically, no evidence that GitHub is doing both 1 and 2 at the same time. There might be cases where it makes trivial changes to code (point 2), but for code that does not meet the threshold of originality. Similarly, there might be cases with copyrighted code where the idea of it is taken but expressed in such a different way that it is not a straightforward derivative of the expression (keeping in mind you cannot copyright an idea, only its expression; using a similar approach or algorithm is not copyright infringement).
And finally, someone has to demonstrate it is actually happening, not just that it could happen in theory. Generally, courts don't punish people for future crimes they haven't committed yet (sometimes you can get in trouble for being reckless even if nothing bad happens, but I don't think that applies to copyright infringement).
But what if the generative AI were used to create music instead of code? Would the court have ruled differently?
CONSIDER:
In 2015, a federal jury ordered Thicke and Pharrell to pay 50% of proceeds to the Marvin Gaye estate for "Blurred Lines" being "too similar" to the song "Got to Give It Up".
Regardless of the details here, it's become quite clear that the judicial system is for corporations. It doesn't matter whether they win, lose, or settle, as they win regardless, since the monetary benefits of what got them in court in the first place far outweigh any punishment or settlement cost.
You probably do this all the time. Forget memorizing but undoubtedly you've read code, learned from it, and then likely reproduced similar code. Probably nothing terribly important, just a function here or there. Maybe even reproduced something you did for a previous employer.
The machine alone doesn't do anything. The user and machine together constitute a larger system, and with autocomplete, the user is in charge. What's the user's intent?
I suspect that a lot of copyright violations are enabled by cut-and-paste and screenshot-taking functionality, and maybe we need to be careful with autocomplete, too? It's the user's responsibility to avoid this. We should be careful using our tools. Do users take enough care in this case? Is it possible to take enough care while still using CoPilot?
I've switched from CoPilot to Cody, but I use them the same way, to write my code. There's no particular reason to use CoPilot's output verbatim and lots of good reasons not to. By the time I've adapted it to my code base and code style and refactored it to hell and back, it's an expression of how I want to solve a problem, and I'm pretty confident claiming ownership.
Is that confidence misplaced? Are other people more careless?
By the same token, the machine alone can't download pirated movies. Yet the sites hosting those movies are targeted as the infringers.
There's a point at which foisting this responsibility on the users is simply socializing losses. Ultimately Copilot is the one serving the code up - regardless of the user's request. If the user then goes on to republish that work as their own it becomes two mistakes. It'll be interesting to see if any lawyers are capable of articulating that well enough in any of these lawsuits.
> Is that confidence misplaced? Are other people more careless?
I would say yes, for two reasons. One is that using code of unknown provenance means you're opening yourself to unknown legal risks. The second is if you're rewriting it fully (so as not to run afoul of easily spotted copyright) that's not actually "clean room" and you're still open to problems. I'd also wonder what the point of using a code writing LLM is anyways if you're doing all the authorship yourself. It seems like doing double the work.
>I assume that I would get my ass kicked legally speaking.
Why? This is no different than copy pasting and modifying a bit of code from some documentation/other project/tutorial/SO. Surely if that were a basis for copyright infringement most semi-large software projects would be infringing on copyright.
I don't think anyone here should be willing to open the can of worms that is copy-pasting small snippets of code and modifying them.
The judge seems to argue that the non-identical copies are at issue here and that they only happen under contrived circumstances. My moral opinion is that this is irrelevant and that even the defendant is the wrong person. Even verbatim copies of code snippets shouldn't be copyright infringement, and suing the company providing the AI is wrong to begin with, as the AI or its provider cannot possibly be the one to infringe.
I don't think it works that way. During the course of your professional career as a developer, you change jobs. Let's say that at every job you create APIs. Besides the particular functions those APIs provide, the API code itself (how you interact with clients, databases, etc.) will be pretty much the same as whatever you did at previous jobs. Does this constitute copyright infringement, or is it just experience?
My analogy is that if Copilot doesn't reproduce 100% of the code from another repository, it is OK for it to be trained on code available on GitHub and used by other people.
It would. And this is where some legislation "in the spirit of" would have helped. So Microsoft's huge legal arm can't just wiggle their way out on technicalities. Clearly, the law is not prepared to face the challenge of copyright violations on the scale created by the LLMs.
I also think it's not just copyright. It's simply not right to create a product on top of the collective work of all open source developers, monetize it at the absurd scale Microsoft operates at, and never credit the original creators.
Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.
As long as we're not required to register copyright there's no reason to think the above will play out. International copyright agreements are not limited to verbatim copies only.
> Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.
This has already been done[1] in music, though in their case they released them to the public domain. Admittedly I think that was more of a protest than anything.
You are taking the plaintiff's statement at face value, which is wrong. You can blame the media, which didn't make it clear that it was a statement from the plaintiff.
> I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.
It looks like wilful obfuscation because the obfuscation is so simplistic. But as the obfuscation gets increasingly sophisticated, it becomes ever harder to distinguish wilful obfuscation from genuine originality.
> But sufficiently complex obfuscation of infringement is very hard to distinguish from genuine originality.
For the purposes of copyright, originality is not required, just different expressions. It's ideas (the realm of patents) that require originality.
The "sufficiently complex obfuscation" is exactly what people's brains go through when they learn, and then reproduce what they learned in a different context.
I argue that AI-training can be considered to be doing the same.
(1) You leave your employer, don’t take any code with you, start your own company, reimplement your ex-employer’s product from scratch, but you do it in a very different way (different language, different design choices, different tech stack, different architecture)
(2) You leave your employer, take their code with you, start your own company, make some superficial changes to their code to obscure your theft but the copying is obvious to anyone who scratches the surface
(3) You leave your employer, take their code with you, start your own company, start very heavily manually refactoring their code, within a few months it looks completely different, very difficult to distinguish from (1) unless you have evidence of the process of its creation
(4) You leave your employer, take their code with you, start your own company, download some “infringement obfuscation AI agent” from the Internet and give it your employer’s codebase, within a few hours it has transformed it into something difficult to distinguish from (1) if you didn’t know the history
(1) is unlikely to be held to be infringing. (2) is rather obviously going to be held to be infringing. But what about (3)? IANAL, but I suspect if you admitted that is how you did it, a judge would be unlikely to be very sympathetic. Your best hope would be to insist you actually did (1) instead. And then the outcome of the case might come down to whether the judge/jury believes your claim you actually did (1), or the plaintiff/prosecution’s claim you did (3).
And (4) is basically just (3) with AI to make it a lot faster. Such an agent likely doesn't exist yet, but it could happen.
Timing is obviously a factor. If you leave your employer and launch a clone of their app the next week, everyone is going to think either you stole their code, or you were moonlighting on writing it (in which case they may legally own it anyway). If it takes you 12 months, it becomes more believable you wrote it from scratch. But if someone uses AI to launder code theft, maybe they can build the “clone” in a few days or weeks, and then spend a few months relaxing and recharging before going public with it
Numbers 2, 3, & 4 are all illegal because they start with an illegal action.
If I find a dollar on the sidewalk and put it in my wallet, is that stealing? If I punch a man getting change at a hotdog stand and a dollar falls on the sidewalk and then I put that in my wallet, is that stealing?
It doesn't matter what the scenario is: once you stole code from your former employer, all actions after that are poisoned.
Although the question is - obviously the ex-employee is likely to be found guilty of copyright infringement (civilly or criminally or both). But what is the copyright status of the resulting work? Does its infringing origins condemn it to always be infringing? Or at some point if it is refactored/rewritten enough it ceases to so be?
Imagine the ex-employee open sources it, and I’m an innocent third party using that code base, ignorant of its unlawful origins. Am I infringing their ex-employers copyright (even if unintentionally)? For (2), obviously “yes”. But what about (3) or (4)?
That’s the entire reason “clean room reverse engineering” is done.
Using nothing but the binary itself, work out how things are done. Making sure that the reverse engineers don’t even have access to any material that could look like it came from the other organization in question. And that it is provable.
How is it anything different? You have no money. And Microsoft has. The problem on this is that it will give a huge leverage to rich companies over poor, because those rich can steal (memorize with AI) anything including music
It seems the total disregard that the tech community showed toward copyright when it was artists losing out has come back to bite. Face-eating leopards, etc.
The actual answer here, regardless of a court ruling, is that you'd go broke if anyone big enough tried to go after you for it.
Legal protections for source code are still pretty fuzzy, understandably so given how comparatively new the industry is. That doesn't stop lawyers from racking up huge fees though, it actually helps because they need so much more prep time to debate a case that is so unclear and/or lacking precedent.
> How is it any different when a machine does the same thing?
Because intent matters in the law. If you intended to reproduce copyrighted code verbatim but tried to hide your activity with a few tweaks, that's a very different thing from using a tool which occasionally reproduces copyrighted code by accident but clearly was not designed for that purpose, and much more often than not outputs transformative works.
I'm not aware of evidence that support that claim. If I ask ChatGPT "Give me a recipe for squirrel lemon stew" and it so happens that one person did write a recipe for that exact thing on the Internet, then I would expect that the most accurate, truthful response would be that exact recipe. Anything else would essentially be hallucination.
As I understand it, LLMs are intended to answer questions as "truthfully" as they can. Their understanding of truth comes from the corpus they are trained on. If you ask a question where the corpus happens to have something very close to that question and its answer, I would expect the LLM to burp up that answer. Anything less would be hallucination.
Of course, if I ask a question that isn't as well served by the corpus, it has to do its best to interpolate an answer from what it knows.
But ultimately its job is to extract information from a corpus and serve it up with as much semantic fidelity to the original corpus as possible. If I ask how many moons Earth has, it should say "one". If I ask it what the third line of Poe's "The Raven" is, it should say "While I nodded, nearly napping, suddenly there came a tapping,". Anything else is wrong.
If you ask it a specific enough question where only a tiny corner of its corpus is relevant, I would expect it to end up either reproducing the possibly copyrighted piece of that corpus or, perhaps worse, coughing up some bullshit because it's trying to avoid overfitting.
(I'm ignoring for the moment LLM use cases like image synthesis where you want it to hallucinate to be "creative".)
I get that's what you and a lot of people want it to be, but it isn't what they are. They are quite literally probabilistic text generation engines. Let's emphasise that: the output is produced randomly by sampling from distributions, or in simple terms, like rolling a die. In a concrete sense it is non-deterministic. Even if an exact answer is in the corpus, its output is not going to be that answer, but the most probable answer given all the text in the corpus. If that one exactly matching answer contradicts the weight of other, less exact answers, you won't see it.
And you probably wouldn't want to - if I ask whether donuts are radioactive and one person explicitly said they are on the internet, you probably aren't going to tell me you want it to spit out that answer just because it exactly matches what I asked. You want it to learn from the overwhelming corpus of related knowledge that says donuts are food, people routinely eat them, etc., and tell you they aren't radioactive.
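The sampling step described above can be sketched in a few lines; the logits below are invented numbers for illustration, not taken from any real model:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Draw one token index from a softmax distribution over logits.

    This is the step described above: the output is sampled randomly,
    so even a dominant candidate is not chosen every single time.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Invented logits for four candidate tokens; token 2 dominates (~78%),
# but the others still get sampled now and then.
logits = [1.0, 0.5, 3.0, 0.2]
counts = [0, 0, 0, 0]
for _ in range(10_000):
    counts[sample_next_token(logits)] += 1
print(counts)  # token 2 wins most draws, but not all of them
```

Lowering the temperature sharpens the distribution toward the most probable token; raising it flattens it, which is one reason the same prompt can yield different completions.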
It's equally plausible to say you don't intend to reproduce copyrighted code verbatim but occasionally do so given either a sufficiently specific prompt or because the reproduced code is so generic that it probably gets rewritten a hundred times a day because that's how people learned to do basic things from books or documentation or their education.
Um, the entire intent of these "AI" systems is explicitly to reproduce copyrighted work with mechanical changes to make it not appear to be a verbatim copy.
That is the whole purpose and mechanism by which they operate.
Also, intent does not matter under law: not intending to break the law is not a defense if you break the law. Not intending to take someone's property doesn't mean it becomes your property. You might get lesser penalties and/or charges due to intent (the obvious example being murder vs. manslaughter).
But here we have an entire ecosystem where the model is "scan copyrighted material" followed by "regurgitate that material with mechanical changes to fit the surrounding context and to appear to be 'new' content".
Moreover, given that this 'new' code is just a regurgitation of existing code, with mutations to make it appear to fit the context and not be directly identical to the existing code, that 'new' code cannot itself be subject to copyright: you can't claim copyright on something you did not create, copyright does not protect the output of mechanical or automatic transformations of other copyrighted content, and copyright does not protect the result of "natural processes" (e.g. 'I asked a statistical model to give me a statistically plausible sequence of tokens and it did'). So in the best-case scenario - the one where the copyright-laundering-as-a-service tool is not treated as just that - any code it produces is not protectable by copyright, and anyone can just copy "your work" without the license. And (since you've said it's OK if you weren't intending to violate copyright) they can say they could not distinguish the non-copyright-protected work from the protected work and assumed that therefore none of it was subject to copyright. To be super sure they weren't violating any of your copyrights, they then ran an "AI tool" to make the names better suit your style.
I am so sick of these arguments where people spout nonsense about "AI" systems magically "understanding" or "knowing" anything. They are very expensive statistical models: they produce statistically plausible strings of text, by a combination of copying the text of others wholesale and filling the remaining space with bullshit that for basic tasks is often correct enough, and for anything else is wrong - because, again, they're just producing plausible sequences of tokens and have no understanding of anything beyond that.
To be very, very clear: if an AI system "understood" anything it was doing, it would not need to ingest essentially all the text that anyone has ever written just to produce content that is at best only locally coherent, and that is frequently incorrect in more or less every domain to which it is applied. Take code completion (as in this case): developers can write basic code without first reading essentially all the code that has ever existed, because developers understand code. Developers don't intermingle random unrelated and non-present variables or functions as they write, because they understand what variables are and therefore can't use nonexistent ones. "AI", on the other hand, required more power than many countries consume to "learn" by reading as much of all code ever written as possible, and it still produces nonsense output for anything complex, because it is just generating a string of tokens that is plausible according to its statistical model. The result of these AIs is essentially binary: either it has been asked to produce code that does something that was in its training corpus and can be copied essentially verbatim, with a transformation pass to make it fit, or the answer is not in the training corpus and you get random and generally incorrect code - hopefully wrong enough that it fails to build, because these models are also good at generating code that looks plausible but only fails at runtime, since "plausible sequence of tokens" often overlaps with "things a compiler will accept".
I actually once tracked this claim down in the case of stable diffusion.
I concluded that it was just completely impossible for a properly trained stable diffusion model to reproduce the works it was trained on.
The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.
The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.
No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.
Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.
Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.
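The size mismatch described above can be put into rough numbers. All figures below are ballpark assumptions (checkpoint size, image count, average image size), not exact measurements:

```python
# Back-of-envelope check: can a Stable Diffusion checkpoint "contain"
# its training images? All inputs here are rough assumed figures.
model_bytes = 4e9        # ~4 GB checkpoint (assumed)
training_images = 2e9    # ~2 billion training images (assumed)
avg_image_bytes = 100e3  # ~100 KB per compressed image (assumed)

corpus_bytes = training_images * avg_image_bytes
bytes_per_image = model_bytes / training_images

print(f"corpus: ~{corpus_bytes / 1e12:.0f} TB")       # ~200 TB
print(f"budget: ~{bytes_per_image:.0f} bytes/image")  # ~2 bytes per image
```

A couple of bytes per image is nowhere near enough to store even a thumbnail, which is the "several orders of magnitude" gap referred to above.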
I think this argument starts to break down for the (gigantic) GPTs where the model size is a lot closer to the size of the training corpus.
Thinking in terms of compression, the compression in generative AI models is lossy. The mathematical bounds on compression only apply to lossless compression. Keeping in mind that a small fraction of the training corpus is presented to the training algorithm multiple times, it's not absurd to suggest that these works exist inside the algorithm in a recallable form. Hence the NYT's lawyers being able to write prompts that recall large chunks of NYT articles verbatim.
And I seem to recall there are some theoretical lower bounds on even lossy compression. Some quick back of the envelope fermi estimation gets me a hard lower bound of 5TB for "all the images on the internet"; but I'm not quite confident enough in my math to quite back that up right here and now.
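For what it's worth, here is one way a figure of that order could fall out of such a Fermi estimate; both inputs are assumed round numbers, not measured data:

```python
# Hypothetical lower bound on lossily "storing" every image online.
num_images = 1e10     # suppose ~10 billion distinct images (assumed)
min_bytes_each = 500  # suppose even a heavily degraded but still
                      # recognizable copy needs ~500 bytes (assumed)

lower_bound_bytes = num_images * min_bytes_each
print(f"~{lower_bound_bytes / 1e12:.0f} TB")  # ~5 TB
```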
> And I seem to recall there are some theoretical lower bounds on even lossy compression.
I'm not sure where your math is coming from, and it seems trivially wrong. A single black pixel is a very lossy compression of every image on the internet. A picture of the Facebook logo is a slightly less lossy compression of every picture on the internet (the Facebook logo shows up on a lot of websites). I would believe that you can get a bound on lossy compression of a given quality (whatever "quality" means) only if you assume there is some balance of the images in the compressed representation. There are a lot of assumptions there, and we know for a fact that the text fed to the GPTs to train them was presented in an unbalanced way.
In fact, if you look at the paper "textbooks are all you need" (https://arxiv.org/pdf/2306.11644) you can see that presenting a very limited set of information to an LLM gets a decent result. The remaining 6 trillion tokens in the training set are sort of icing on the cake.
I think you'll agree that it would be a bit absurd to threaten legal action against someone for storing a single black pixel.
OTOH Someone might be tempted to start a lawsuit if they believe their image is somehow actually stored in a particular data file.
For this to be a viable class action lawsuit to pursue, I think you'd have to subscribe to the belief that it's a form of compression where if you store n images, you're also able to get n images back. Else very few people would have actual standing to sue.
I think that when you speak in terms of images, for a viable lawsuit, you need to have a form of compression that can recall n (n >= 1) images from compressing m (m >= n) images. Presumably n is very large for LLMs or image models, even though m is orders of magnitude larger. I do not think that your form of compression needs to be able to get all m images back. By forcing m = n in your argument, you are forcing some idea of uniformity of treatment in the compression, which we know is not the case.
The black pixel won't get you sued, but the Facebook logo example I used could get you sued. Specifically by Facebook. There is an image (n = 1) that is substantially similar to the output of your compression algorithm.
That is sort of what Getty's lawsuit alleges. Not that every picture is recallable from an LLM, but that several images that are substantially similar to Getty's images are recallable. The same goes with the NYT's lawsuit and OpenAI.
I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044 .
I suppose you're right that you only need a few substantively similar outputs to potentially get sued already. (depending on who's scrutinizing you).
While talking with you, it occurred to me that so far we've ignored the output set o, which is the set of all images output by, say, stable diffusion. n can then be defined as n = m ∩ o.
And we know m is much larger than n, and o is practically infinite [1] (you can generate as many unique images as you like), so o >> m >> n. [2]
Really already at this point I think calling SD a compression algorithm might be just a little odd. It doesn't look like the goal is compression at all. Especially when the authors seem to treat n like a bug ('overfit'), and keep trying to shrink it.
That's before looking back at the "compression ratio" and "loss ratio" of this algorithm, so maybe in future I can save myself some maths. It's an interesting approach to the argument I might try more in future. (Thank you for helping me to think in this direction)
* I think in the case of the Getty lawsuit they might have a bit of a point, if the model might have been overfitted on some of their images. Though I wonder if in some cases the model merely added Getty watermarks to novel images. I'm pretty sure that will have had something to do with setting Getty off.
* I am deeply suspicious of the NYT case. There's a large chunk of examples where they used ChatGPT to browse their own website. This makes me wonder if the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV, we can revisit if you're really interested)
[1] However, in practice there appear to be limits to floating point precision.
If you tell a programmer to implement a function foo(a, b) then there are actually only a tiny number of ways to do that, semantically speaking, for any given foo. The number of options narrows quickly as the programmer implementing it gets more competent.
Choosing function signatures is an art form but after that "copying" is hard to judge.
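To illustrate the point, here are two textually different but semantically identical implementations of a contrived function; only the identifiers differ, which is exactly the kind of "copying" that is hard to judge:

```python
# Contrived example: the same checksum written twice with different names.
def checksum_v1(values):
    total = 0
    for v in values:
        total = (total + v) % 255
    return total

def checksum_v2(nums):
    acc = 0
    for n in nums:
        acc = (acc + n) % 255
    return acc

data = [10, 200, 33, 97]
print(checksum_v1(data), checksum_v2(data))  # identical results
```

A diff tool sees two different files; an AST-based scanner of the kind mentioned elsewhere in the thread would see the same tree.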
It depends on how much tax you are paying, really. If you pay billions in taxes annually, they might see past it. If the company you copied from pays billions in taxes annually, you will go to jail. If this isn't painfully obvious by now...
First: every human is per se doing that already. We have – to handwave – a "reasonable person" bar to separate violations versus results of learning and new innovation.
Second: you can be a holder of copyright, and your creations result in copyrightable artifacts. Anything generated by the program has been held to be uncopyrightable.
Days like this, I wonder what Borges would have made of such questions.
"Pierre Menard, author of redis"
I know from experience that parents are aggressively pushing their children into STEM to maximize their chances of being economically secure, but, I really feel that we need a generation of philosophers and humanists to sift through the issues that our technology is raising. What does it mean to know something? What does authorship mean? Is a translated work the same as the original? Borges, Steiner, and the rest have as much to contribute as Ellison, Zuckerberg, and Altman.
> I assume that I would get my ass kicked legally speaking.
Maybe, maybe not. It's not as simple as you made it out to be. If you write a book with lots of stuff and you got inspiration from other books, and even put in phrases wholesale, but modified to use your own character names instead, I'm not convinced you would lose.
The court would look at the work as a whole, not single pieces of it.
They would also check if you are just copying things verbatim, or if you memorize a pattern and emit the same pattern - for example look at lawsuits about copying music, where they'll claim this part of the music is the same as that part.
It's really not as cut and dry as you make it out to be.