The amount of extremely sensitive corporate secrets in GitHub issues makes me assume that their security and privacy are pretty rock solid.
A lot of companies pay GitHub a lot of money to look after their source code and related artifacts. That's GitHub's business model. I don't think they would jeopardize that trust for the sake of training a model on private data.
Where I work we don’t use GitHub, but we do use Copilot. It took a long time before we were allowed to use it, as deals had to be struck and accounts and auth set up against our corporate logins, which have different data-privacy rules than public use of Copilot. We are explicitly forbidden from using the public version of Copilot, or any other AI for that matter.
I can only assume that companies paying for GitHub also pay for enhanced levels of privacy. Just because a company can pay GitHub not to train on its data doesn’t mean GitHub won’t train on your data that’s being hosted for free. They are almost certainly crawling all free repos.
Assumptions are not sufficient when it comes to basic human rights. And my assumption is that they do or will use your data in their own interest, since the likelihood that someone within Microsoft would benefit from doing so tends to 1 as time tends to infinity.
privacy needs to be verifiable (Apple has shown this is possible with Private Cloud Compute)
I dunno. For one thing, those companies are paying GitHub a lot of money for the enterprise version, separately hosted (right?). The data isn't actually available to Microsoft employees or LLMs, absent some security flaw or backdoor. For another, companies that pay for this also (sample size is small, though) have automation that scans GitHub repos, issues, etc. for any secrets and requires them to be removed and scrubbed from history, implying that they don't trust even the self-hosted GitHub Enterprise as much as you do.
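For illustration, a minimal sketch of what that kind of scanning automation might look like (the patterns and names here are assumptions for the example, not any company's actual tooling; real scanners like gitleaks or trufflehog ship far more rules and also walk git history):

    import re
    import sys
    from pathlib import Path

    # Hypothetical rule set; real tools ship hundreds of vetted patterns.
    SECRET_PATTERNS = {
        "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
        "GitHub token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
        "private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    }

    def scan(root: str) -> int:
        hits = 0
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            for name, pattern in SECRET_PATTERNS.items():
                for match in pattern.finditer(text):
                    print(f"{path}: possible {name}: {match.group()[:12]}...")
                    hits += 1
        return hits

    if __name__ == "__main__":
        # Exit non-zero on any hit so a CI job can fail the build.
        sys.exit(1 if scan(sys.argv[1] if len(sys.argv) > 1 else ".") else 0)

Wired into CI or a pre-receive hook, something like this is what drives the "remove it and scrub it from history" step.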
I see secrets as a different issue. Putting those in an issue or repo exposes them to potentially hundreds of people within your own company, which is bad practice.
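The usual alternative, as a sketch (SERVICE_API_KEY is a made-up name; in practice the value would come from a secret manager or the deploy environment, never from the repo):

    import os

    # Read the credential from the runtime environment instead of
    # committing it to a repo or pasting it into an issue.
    api_key = os.environ.get("SERVICE_API_KEY")
    if api_key is None:
        raise RuntimeError("SERVICE_API_KEY is not set; refusing to start")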
I remember reading a while back that they were, and do, train on repositories, to the point that I never wanted to use GitHub for anything other than submitting bug reports to projects.
Maybe the non-training only applies if you pay protection money? But then you run into the whole issue that if a repo is public, there's nothing stopping some other AI company that isn't MS from accessing it and training on it.
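That part takes no special access at all; here's a minimal sketch against GitHub's public REST API (unauthenticated, hence heavily rate-limited; /repositories is the real "list public repositories" endpoint, the rest is illustrative):

    import requests

    # One page of public repositories; anyone on the internet can fetch this.
    resp = requests.get(
        "https://api.github.com/repositories",
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    for repo in resp.json():
        print(repo["full_name"], repo["clone_url"])  # free to clone and ingest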
There's been a huge amount of speculative information floating around that GitHub are training on private repos, but I've never seen anything credible.
Yeah, I generally expect big tech to be vacuuming, storing, and analyzing as much data as they can, but for GitHub, doing something like training on private repos would be one of the riskiest things I can imagine. No way they are going to jeopardize their entire business to maybe get a little bit more data to train on.
That story appears to be about how, if a repo has accidentally been made public, various tools can access cached information about that repo even after it has been made private again. That doesn't say anything about whether or not that data will then be used for training models.