Microsoft would need to pay me to use Copilot. It seems like a major scam for them to learn from our code and then tell us we can't use it to make our own competing AI systems:

> "Limits on use of data from the AI Services. You may not use the AI services, or data from the AI services, to create, train, or improve (directly or indirectly) any other AI service."

From https://www.microsoft.com/en/servicesagreement#13r_AIService... as of today.

The ridiculous focus on AI over the core business at GitHub is the number one thing likely to kill the platform. They are coasting on network effects.
I've been paying for GitHub for years and years, long before they introduced any kind of AI service. I don't use any of their AI products, even though they're now included with my plan. Hopefully that small datapoint shows up as a blip in their metrics somewhere.
"You may not use the AI services, or data from the AI services, to create, train, or improve (directly or indirectly) any other AI service" - It's not like anyone chose to indirectly train these existing systems - this means that anything published online that could be scraped and used to train something is not allowed right?
It wouldn't be an online service's T&C document if it didn't include at least one vague, threatening, unworkable, and unenforceable condition.
The useless but true answer is that nobody knows what's allowed and what isn't until it's tested in court. Practically speaking (though I'm not a lawyer), I suspect the clause will never be pursued on its own, because it's bullshit and everyone involved knows it.
In your scenario, though, assuming you publish in a way that's not overtly and primarily meant for AI training, the "use" of the data isn't yours, and I think it would be hard to argue that you violated the terms of the agreement.
Of course, we might take this line of reasoning to its absurd end and demand that any code base Copilot was involved in carry a license term forbidding the training of any other AI on it. Then we wind up in a place where every AI is trained on source material it's explicitly licensed not to be trained on, or trained only on a mostly static set of "pre-AI" publications.
I'm having a hard time determining whether my private repo code is used to train their models. The GitHub Copilot VS Code extension states:
> Your code is yours. We follow responsible practices in accordance with our Privacy Statement to ensure that your code snippets will not be used as suggested code for other users of GitHub Copilot.
IIRC, this statement gave me the initial reassurance I needed to start using Copilot many months ago. Now, however, I feel it could be deceptively reassuring. Does it mean they can use my code for training and for suggestions to other users after changing the variable names?
I tried to dig deeper. The section on "Private repositories" in their Privacy Policy [1] says: "GitHub personnel does not access private repository information without your consent", with exceptions for security, customer support, and legal obligations. Again, this feels deceptively reassuring, since GitHub personnel and GitHub's AI services are separate entities.
In their Privacy Policy, "Code" falls under the definition of "Personal Data" (User Content and Files) [2], and they go on to list lots of broad ways the data can be used and shared.
Unless I've missed something, and as other commenters have said much more succinctly, I have to assume there's a real possibility that my private repo code is used to train their models.
It's a good example of how ridiculous the AI training situation is.
They claim it's fair use for them to steal all the data they want, but you're not allowed to use the AI's output, despite that output literally not being subject to copyright protection, on account of lacking a human author.
And this goes especially for GitHub. They already have an enormous corpus licensed under MIT and equivalent licenses, explicitly permitting this AI nonsense. All they had to do was use only the code they were allowed to use, and maybe put up an attribution page listing all the repos involved, and nobody would have minded, because of the explicit opt-in given ahead of time.
But no. They couldn't bother with even that little respect.
I wonder how GPLv3 and CC BY-SA licenses should be treated when training AIs like this. The model is software, and unless it's sufficiently different from the source, it's a derivative work, isn't it?
Use tree-sitter, rename some identifiers here and there, some function names too, and then you can use the generated data for anything you like. Pass each identifier through the cheapest LLM possible and change each name ever so slightly.
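The mechanical half of that really is about twenty lines. A minimal sketch, assuming the py-tree-sitter and tree-sitter-python packages (0.22+ API); the rename() stub here is a placeholder for the hypothetical "cheapest LLM" call:

    # Sketch of the tree-sitter identifier-renaming pass described above.
    # Assumes: pip install tree-sitter tree-sitter-python (0.22+ API).
    import tree_sitter_python as tspython
    from tree_sitter import Language, Parser

    parser = Parser(Language(tspython.language()))

    def rename(name: str) -> str:
        # Stand-in for the "cheapest LLM possible" call that nudges a name.
        return name + "_x"

    def identifiers(node):
        # Depth-first walk yielding every identifier node in the parse tree.
        if node.type == "identifier":
            yield node
        for child in node.children:
            yield from identifiers(child)

    def launder(source: str) -> str:
        src = source.encode("utf8")
        nodes = list(identifiers(parser.parse(src).root_node))
        # Splice replacements in reverse byte order so earlier offsets stay valid.
        for n in sorted(nodes, key=lambda n: n.start_byte, reverse=True):
            new = rename(src[n.start_byte:n.end_byte].decode("utf8")).encode("utf8")
            src = src[:n.start_byte] + new + src[n.end_byte:]
        return src.decode("utf8")

    print(launder("def add(a, b):\n    return a + b\n"))
    # -> def add_x(a_x, b_x):
    #        return a_x + b_x

A real pass would have to leave builtins and imported names alone to keep the output runnable, which is presumably where the LLM step earns its keep. Whether any of this actually launders the license is exactly the derivative-work question above.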