> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
>
> Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.
That's very concerning wording... I guess even private repositories are being fed into AI; at least, that's what the policy explicitly allows.
This is huge, and unfortunately not surprising at all in the age of massive, ever-growing, out-of-control tech monopolies that do whatever the fuck they want. Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.
Every service and utility gets enshittified sooner or later; it's a given at the moment. I deleted all my private repos; GitHub and all other MS services should be avoided in the future.
If you like the idea of a community-driven fork of Gitea (which still upstreams changes to the Gitea project), you should check out https://forgejo.org
The fork was established when Gitea got entrepreneurial and founded Gitea Ltd. with plans for an enterprise version. https://codeberg.org used to run on Gitea but switched to Forgejo, and the Forgejo project is hosted on Codeberg at https://codeberg.org/forgejo
Forking is always possible in OSS development, but spreading contributors and patches over three projects instead of one with a functioning community is hardly the ideal OSS development model.
Yet, when you lose value alignment with the project, the best thing to do is to abandon ship as soon as possible. Insisting on continued collaboration is bad for every party.
That's OK, but it gets uncomfortable when managing a number of git repositories on an SSH server. I'm using gitolite [1] for that.
The features are basic and managed by editing text files and git-pushing them to a control repository: create repositories, add users and their keys, grant read-only or read-write access. There is no GUI, but once you have a copy of the repo on your machine you can use one of the several git GUIs available for any OS.
With gitolite I don't have to manually set up every single repo and configure access with maybe one user name per project. That would be too much. And what about read-only vs. read-write access?
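For anyone curious, here's a sketch of what that control file looks like (repo and user names are illustrative):

```conf
# conf/gitolite.conf in the gitolite-admin control repository
@devs      = alice bob         # a user group

repo projectA
    RW+    = alice             # read/write, including history rewrites
    R      = bob               # read-only

repo projectB
    RW     = @devs             # read/write, but no force-pushes
```

Adding a user means dropping their public key into keydir/ and pushing; gitolite then creates the repositories and enforces the access rules on the server side.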
A large portion of people don't want to memorise all the commands related to merging, branching, etc., so catering towards the lowest common denominator is important.
Bare repo on a server is exposed to people exactly like Github: a remote URL you put in once and forget about it.
How people use their local git repository is their business, command-line, Sourcetree, GitKraken, what have you, but any of those work with any remotes.
(Sure, git by itself does not provide the other features from the hosting services like issue tracking and pull requests, but not every workflow requires those to be linked directly to the SCM)
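To illustrate how little a bare-repo "remote" involves, here's a minimal sketch; a local path stands in for the server, but over SSH the remote URL would just be something like `user@host:myproject.git`:

```shell
# The bare repository is the whole "server" (a local path here;
# over SSH it would live at e.g. user@host:myproject.git)
mkdir -p /tmp/bare-demo
git init --bare /tmp/bare-demo/myproject.git

# Local working copy: add the remote URL once and forget about it
git init /tmp/bare-demo/work
cd /tmp/bare-demo/work
git checkout -B main
git config user.email you@example.com
git config user.name you
echo hello > README
git add README
git commit -m "initial commit"
git remote add origin /tmp/bare-demo/myproject.git
git push -u origin main
```

From then on, any client (command line, Sourcetree, GitKraken) works against that remote exactly as it would against GitHub.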
I don’t care for issue tracking, but I do like the usability of diffs and merging in web ui apps - my primary job is to look at code not write it (meaning I’d fail basic git merge questions) but I’ve also found out the hard way that just because I know my way around a shell doesn’t mean I can force my views on the people my company hires, and I own the company.
I do genuinely appreciate that, but that's the point: the graphical clients that do visual diffing and merging like the ones I listed all work with a bare repo as the remote.
Heck I think even the Github Desktop application also works with non-Github repositories, and they would be the only ones that would have any interest in locking people in.
Unless you mean specifically the UX of having a URL you can copy to a specific line of a specific commit in a repository, which indeed is not possible without a standard URI scheme (which does not exist) or a web client.
I get your point. At the same time I find it funny how Linus was checking patches via email, deciding what got merged into the Linux kernel. Now every service needs all the replicated enterprisey features.
It is not a personal criticism of you. I find it interesting that git gave us all this efficiency, and the enterprise removes it by adding complexity back because employees supposedly cannot be bothered to learn their tools (or cannot be mandated to), or plainly prefer a nicer UI. Not a crime, but I can see how big corporations become inefficient with this type of thinking, when applied to hundreds of tools and processes.
I use gitolite as well, it's great. Currently working on integrating it into a CI/CD pipeline, which admittedly proves to be a slight challenge, but I'm sure I'll get there eventually.
I, as a Linux user, built a similar system myself by getting an FTP account, mounting it locally with curlftpfs and then using git on the mounted filesystem.
It's a sad situation that if you desire exposure and community building you must maintain a fork on Github, but that's how it is for smaller projects. I am in a similar situation, with some of my projects with main repos hosted on sourcehut, but most of external engagement comes from clones on github. It is what it is, and we do what we must. :)
It looks like they are a pragmatic project that prefers having contributors to being ideologically pure. It's not like there isn't an official repository hosted on Gitea: https://gitea.com/gitea
Every time someone parrots this, I have to wonder if they did more than 5 minutes of reading - it's one of the top issues on the issue tracker and they've outright stated they will move once Gitea is at a spot where they are not losing functionality and history.
I did not parrot anything.
This is the first time I have heard of Gitea. I googled it, and the first thing I noticed was that it was hosted on GitHub. It was an original thought.
I did not care enough to open their issue tracker. I still don't. It is ironic, not a bit, a lot. That statement was a bit sarcastic.
>Whatever reads in the TOS now, they can and will just reword it when they need it. There's no trust.
This is what is crazy to me. You can agree to terms, build infrastructure around terms you agreed to, then those terms can completely change. Don't like it? Click disagree and we'll close your account, no problem!
And, thanks to the politics around social media censorship, we have way too many people willing to say, "Don't like the terms? Don't use the platform!" to the point of normalization. Sad.
The other solution is political. There's a reason that governments regulate and define economic rules of the road. This is a good example of where governments need to step in. The link between generative AI and the data it is trained on needs to be carefully thought through and properly handled especially given the capitalist nature of our economy.
The emergence of machine intelligence* and its control by Capital was not foreseen by Karl Marx, and the intervening period between the heat death of the Capitalist system and the Workers' Utopia has been indefinitely extended.
There's an awful lot of very smart people who have studied economics for the majority of their lives who disagree with this. There are also alternatives to capitalism that don't entirely involve govt control.
Do you have experience with self-hosting Gitea? I am on the fence about going with Gitea because of the recent fork of the project (Forgejo). It seems that many contributors are now contributing mainly to Forgejo.
The reason for the fork was that Gitea was going for-profit and the folks that forked to Forgejo felt they went about that transition in a way that eroded trust. Here's their explanation: https://blog.codeberg.org/codeberg-launches-forgejo.html
It is functioning like Open Source should: there was a disagreement over how the project was run, so it got forked.
This used to be more commonplace when projects were run by people, not companies. I wish the practice would come back; we need more forks in Free Software.
It feels bad to "waste" the work that could have otherwise gone into highly-paid billable hours, or at least charity work on other repos that get more use.
I self-host Gitea. Very reliable. Painless setup. I wish it had some sort of CI like GitHub Actions or Bitbucket Pipelines, but otherwise I'm totally happy with it.
Just self-host the community edition of GitLab. It's miles better than Gitea. It's got CI pipelines, a pretty robust issue tracker, wiki pages, LDAP/AD integration for authentication, a package repository for self-hosting libraries, releases, a service desk to make email -> ticket pipelines, etc.
GitLab CE is far too heavy and requires a minimum of 4GB to run. It bundles lots of components, including PostgreSQL and Redis, and startup takes a long time. With Gitea I can run with just 1GB, or on a Raspberry Pi. It includes wikis, package repositories, and releases as well. LDAP and service desk are enterprise features that I don't need.
GitLab is a crazy setup full of services, with elaborate interdependence, absurd hardware requirements, iffy performance, and all the lack of confidence in security that comes from this (and it only ever runs if you use their Docker images and don't touch anything).
I've got Gitea running on a $5 Vultr instance and it's great.
Upgrades have been painless. Doesn't tax the server.
I was using Gitea when that fork happened and didn't see a reason to migrate. It looked very much like poor communication on Gitea's part causing a misunderstanding.
I self-host Gitea both on my home NAS and a DO droplet. I set up repo syncing between the instances, and it works flawlessly. I've moved most of my projects off GitHub/GitLab, and overall I'm very happy with it.
I self-host gitea as a github backup just in case. It's pretty easy and well documented (it's a single executable and you can use sqlite for the database).
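As a rough sketch of how little configuration that takes, an app.ini along these lines is enough for the SQLite setup (section and key names follow Gitea's config conventions; the domain and paths here are illustrative, not canonical defaults):

```ini
; app.ini: minimal self-hosted Gitea using the embedded SQLite driver
[server]
DOMAIN    = git.example.com
HTTP_PORT = 3000

[database]
DB_TYPE = sqlite3
PATH    = /var/lib/gitea/data/gitea.db
```

With SQLite there's no separate database server to run, which is a big part of why the single-binary setup is so painless.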
> The only solution is to self-host. Gitea is good.
I don’t understand your thinking and gitea’s marketing. They say in the same breath that it’s “self-hosting” and that they do “Git hosting… similar to GitHub, BitBucket, and GitLab”. — https://docs.gitea.com/
It's a "run your own GitHub" application, akin to GitHub Enterprise Server or GitLab CE/EE, except that unlike GitHub Enterprise Server and GitLab EE, it's open source.
As far as I am aware, they do not offer a hosting service. I believe that statement was meant to convey that the Gitea software, once installed, is a git host similar to the others. I think they were trying to differentiate between a typical remote git repo and all the web components that come with Gitea. They do offer paid support, but that's still for self-hosting.
Playing devil's advocate: all kinds of linting, vetting, or security scanning with any degree of smartness beyond a regex would probably fall into my definition of non-human eyes too.
Slippery slope, yes, but in reality there is a legal framework and a ToS in place as well.
These tools are enabled by the owner of the repository, and I think consent has precedence over license terms. But this seems like an escape hatch for using your code as training material.
If it doesn't explicitly say "we do not use private repo data to train our AI models", I wouldn't assume anything other than that's exactly what they are doing. They know this is a question everyone wants answered. Why would they leave any ambiguity? Let me help you answer that: because that's exactly what they are doing.
Everything referenced in this webpage refers only to public data, or private data where a user has enabled the dependency graph
>data from public repositories, and also... data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph
So if you don't enable the Dep Graph feature, your private code is treated according to the standard terms of service.
Well, they could silently do anything. No need to toggle the user-visible checkbox.
It's like giving a party some of your possessions to store, and signing a form saying you don't consent to them snooping through them, then worrying "what if they change and forge the contract later?". They wouldn't need to. Either you trust them to keep their word or you don't trust them with your possessions at all
And they can also repurpose your data for ML, or otherwise illicitly pilfer your data, without you turning a random dependency visualization on. There seems to be an idea here that turning it off will change something, like how there are people adamant that Windows 10 has secret tracking features, yet believe they can defeat them using regedit. I'm afraid this is the technical equivalent of "swiper no swiping!"
And to be clear I'm not saying Github or Windows is doing this, I'm just saying either you trust them to keep their word or you don't
I'd like them to limit the scope of the data use to the dependency graph ONLY, and for it not to be stored. I swear this goes without saying... but these blanket self-granted permissions on the part of GitHub are a serious breach of trust. They don't need this to provide the dependency graph service.
My understanding of the way dependabot works is that they read your repository's manifest files to provide alerts about vulnerable dependencies. Then after you patch the repository they re-read it to identify the new package version you've moved to, and if there were any build errors introduced from the update. That info is anonymised & fed into an aggregated dataset, creating suggestions for other users like "95% of repositories impacted by this CVE upgraded from v1.0.4 to 1.0.5". Dependabot doesn't read through your code (unless your code is in `.manifest` or `.package` files).
That might still be too much for you, everyone's line is at a different place. Dependabot works better the more data it has, so personally I'm fine with that level of detail being returned to an aggregated service, for that very specific purpose.
There are other private providers of dependency scanning out there. I'm fairly sure they all work the same way Github does, just with smaller datasets.
I am not disagreeing with you, but by your own admission it does not read your code, only the project's manifest. Then why on Earth do they ask for so much more, when all they need to do is bump some dependency version in the manifest?
As for the aggregate stats: why not limit the scope of access to the build status instead? They don't need anything more to provide this service...
As far as I can see from all of their documentation, they exclusively open a predefined list of common manifest files. They explicitly mention that they won't be able to do dependabot analysis if your deps are listed in a `setup.py` file for example. You'd need readonly access to the entire repository to actually find those manifest files though, as most of the time they're not going to be in the top-level directory.
For companies & people not willing or able to surrender that level of access, it's still possible to use the dependabot API, dissociated from any github repository (for example if you have an Atlassian git repository). You just send it a list of dependencies and an enum indicating its current build status, and they'll reply with a list of suggestions.
I doubt there's any legal way to say "we'll only read your manifests". Programmers may know what that means (and still debate it, honestly), but I'm skeptical the law has the ability to distinguish between the designated utilities of specific files.
If their AI uses my code as the basis to generate similar code, how is that not a derivative work?
If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.
How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?
It's their problem to build lineage in their system.
Breach of licence and copyright laws is still illegal even if it's very well concealed.
Lots of their customers' code requires attribution, or specifies acceptable uses (notably, what kind of license the product built with your code is allowed to have). Beyond simple copyright, which anyone can claim, they're also concealing this.
> It's their problem to build lineage in their system
From a practical standpoint, I don’t see how that is true.
In order to establish (civil) liability in the United States, someone would have to file suit, survive the inevitable standing challenges, and demonstrate on the preponderance of the evidence that they had been harmed by this conduct.
If you can’t prove your code was reproduced by this model, I don’t see how you can successfully do that.
It’s like saying you own an image after JPEG compressing it. There are going to be lawsuits, but MS is so big they might just consider settling them part of the cost of making sure they are on the cutting edge of generative AI.
Besides, the "except" part qualifying the "human eyes" is substantial.
GitHub is a Californian company; they are beholden to US regulations that require them to hand over data upon legal request, and they can be prohibited from telling you they did. These requests happen frequently.
> In 2022, GitHub received and processed 432 requests to disclose user information, as compared to 335 in 2021. Of those 432 requests, 274 were subpoenas (with 265 of those subpoenas being criminal or from government agencies and 9 being civil), 97 were court orders, and 22 were search warrants.
Concerning national security letters:
> We’re very limited in what we can legally disclose about national security letters and Foreign Intelligence Surveillance Act (FISA) orders. We report information about these types of requests in ranges of 250, starting with zero. As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.
Now everyone knows why Codeberg was invented. This type of thing was entirely foreseeable. Good luck not having all your IP be copy-pasteable by anyone using an LLM.
sigh I moved all my projects (from Bitbucket and Github) to Gitlab.
Now it appears Gitlab is feeding its data to Google [1].
I do have a codeberg account so will consider moving my repositories again, although it's probably too late and has all been sucked up by Google already.
GitHub/MSFT/OpenAI’s stance is that training AI models is fair use and doesn’t require a license. It doesn’t matter what kind of license you slap on the code. If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use, not a new software license.
> If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use
True, assuming that GitHub/MSFT/OpenAI's stance is correct. Just because they have that stance doesn't mean it is[1].
In the end, these gray areas will be decided in court.
[1] I personally think it probably is, which is why I removed all of my stuff from the open web and these sorts of services -- I see no other way to prevent this use of my data.
The policy says human eyes will never see the contents of your private repositories. Suppose, as you say, the repos are one day fed into an AI. How does that policy then constrain the AI later?
AIUI the AI can't be used to answer any questions based on knowledge from training that included private repositories, since (still AIUI) there's no guarantee that contents from the training won't appear in the output.
But there seems to be a loophole for yes/no questions. An AI that answers only yes/no could be trained on private repositories, or run on them. So they could, say, train an AI to recognise repos that require working hours from their legal staff, or from their support staff, and then run that AI on all repos to locate repos that cause expense in the future. Things like that seem possible.
And an AI might answer questions like this: "Does any repo contain the string 845fjkef5urwejf in a Kotlin file?"
GitHub owns the Copilot model; it is based on OpenAI's Codex but they own and operate it themselves. So technically feeding it into their AI would be OK by those terms. That's why Copilot still exists even after Codex was killed off by OpenAI.
We’re still in the nascent stages of code models. There’s ample opportunity to transition to private models before the technology reaches maturity. As for the destiny of current private GitHub models, they may contribute to training models (through what’s referred to as “aggregate data”), but the potential backlash could be considerable, particularly when considering the bans on using ChatGPT that various firms have implemented.
Indeed. I wonder how many people don't see this. Or don't want to see this. Nobody should assume that a private Github repo is really private.
My employer (one of the largest in the world) has fully migrated to Office 365 and a lot of the new source code is on private Github repositories. But no worries, the humans working at Microsoft will not analyse the data. This is only done by AI.
Seems like one has to assume corporate repos have them. If they don't already, they could decide to at any time. If that's a problem, self-hosting puts the power back into the hands of the users (and can be done very cheaply - I'm paying ~$5/month for a server that's doing a bunch of stuff including very basic personal git hosting)
How do you think they provide all the services they provide? Secrets scanning, contextual source information, everything. I'm not sure how anyone could have gone this long and been surprised by this.
Next people will be surprised that GitHub has the right to show your source code on a website....
There is no current reason to believe that Copilot or ChatGPT are trained on private repositories; however, this agreement does permit them to do so.
The concern is: why not? You have an enormous corpus of professional-grade software just sitting there, one bit flip away from access, and you are even permitted (to the letter of the law) to flip that bit. Why wouldn't you, eventually?
>however this agreement does permit them to do so.
How so? Assuming you read the whole agreement and not just take a single line out of context?
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
> GitHub considers the contents of private repositories to be confidential to you. GitHub will protect the contents of private repositories from unauthorized use, access, or disclosure in the same manner that we would use to protect our own confidential information of a similar nature and in no event with less than a reasonable degree of care.
When talking about confidentiality, GitHub is careful to mention unauthorised access; the implication here is that GitHub has the authority to authorise themselves.
The next section talks about how they will grant access to third-parties, but it's clear that as a first party they exclude themselves from the moniker of "unauthorised".
I agree; however, taking random pieces of a long agreement in isolation is not how you read terms of service. The section you highlight is NOT, for example, the one that talks about what legal rights you give GitHub regarding your code. Construing it as such and then extrapolating from that is taking a piece of text way out of context.
The section cited is the most prominent section in the terms of service regarding private repositories[0]
Everything else in that section relates to personnel (which is not the concern of the commenters here), what happens if you upload copyrighted material, and codes of conduct.
Also, taking segments of the ToS is exactly what will happen in a court of law; unless there is another section that invalidates this one, this fragment will be argued as consent.
Additionally, the context of that section is otherwise very clear. Sorry.
The way the ToS will be argued is: there are many references to processing information, and nothing that explicitly prohibits, or can reasonably be inferred to inhibit, automatic processing; in fact there are many places where they make reference to "automatic processing for services". You can take that to mean it's just their SAST scanners; they can take it to mean training an AI model. There's no distinction legally here.
You mean like the earlier section that explicitly says "License Grant to Us" which the court would somehow amazingly ignore reading according to you? Because clearly the exact legal rights you give Github to use your work is not relevant to a discussion on how Github is legally allowed to use your work.
I can tell you're a bit defensive about this, so let me be clear as possible: talk to a lawyer.
I am not a lawyer myself, though I consult with Swiss/German lawyers quite often regarding licensing because it is a large part of my role.
The fragment I mentioned is the beginning of section "E"; this document uses letters as the top-level section separator, so a different letter from the prior one is taken as an intentional separation for the purposes of a legal document. What you are referring to is from section "D"; i.e., they are in entirely different contexts from each other.
Even then, you do not need to have a license to automatically process source code, as per the text of this ToS.
There are more than a few points where they talk about what they have the rights to read, and there is only one point (which is legally required to be there, fwiw) which states they can't release your source code as if it's theirs.
If you read the Copilot terms of service, they are not granting you a license to use the suggestions.
A very pessimistic take is that they are quite well protected legally in using source code derived from these repositories, even without the muddy discussion about training data and derivation.
It's a little different. Microsoft doesn't care about consumers; they stopped being relevant to Microsoft's financials in the Ballmer era. Even consumer Office spending brings in less than LinkedIn, to put into perspective how little they care.
However, enterprise/commercial? The same group you're trying to sandwich between a 365 subscription and Azure/GitHub? These are the people who, when they do have evidence that they are being negatively affected, will cause a massive dent in Microsoft's bottom line.
In that case they needn't put this in their ToS. It's strange how people will distrust a company with one hand, then claim their word as reliable evidence (and think they can be stopped by unticking a checkbox) with the other. If they wished to deceive you, they would do it without telling on themselves.
Any sane business decisionmaker would want legal cover for something like this in case it draws a lawsuit. Regardless of whether they themselves think it's okay, doing it entirely on the sly makes it more likely an opposing lawyer can convince a judge/jury that it's not okay. If they have an agreement that the client "accepted" that strengthens their position quite a lot.
Do people want Github to display files in your private repos on github.com? Do you want syntax highlighting and intellisense? Do you want them tokenized for search? Should they be available for cloning when you need it? Should they run security scans and identify vulnerabilities? Should you be able to see git blame, diffs, commits, other metadata? Set up build pipelines? Modify contents via an API?
Tell me – how can Github do all of this without programmatically accessing the contents of your private repos from their servers?
As I see it people are basically taking a single line of an agreement that explicitly references other agreements out of context:
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
No, I don't want these features. I want reliable remote git storage, which lets me clone a copy of my source code anywhere so I can do these things locally, on my own machine. People forget that this was the original design and intent of Git as a decentralized protocol. As an industry I think we have become too reliant on Github as some kind of way-far-beyond-git magic tool. This might be fine until it isn't--like when Microsoft decides it's time to start training Copilot on your private code.
Use another place to store your git projects then, it's not like Github is the only one. Github is a PRODUCT built around git, not just a place to store your git repositories. I'm pretty sure you will be able to find other services which only exist to host git projects and do nothing else.
You say this as if you think most people have a choice...I've pushed hard in the past for my employer to move off Github but have been met with nothing but blind resistance.
Being concerned for your own personal reasons about Microsoft scanning your code is one thing. How does it make sense for your employer to switch off GitHub?
But that's not your code, that's your employer's code. You might have written it but they fully own it. This doesn't seem like something to be concerned about on their behalf.
Also, for what it's worth, I was extremely surprised and concerned when my private repos appeared in code search results. I had to triple-check that it wasn't showing up when logged out or logged in to a dummy account without access to the repo. This behavior is definitely not intuitive and if there's a way to disable search indexing on private repos I'd do so.
Or: how could any provider, even a self-hosted one, provide any of that without automated processing of the contents of private repositories?
2. Build small- to mid-size businesses, that can prioritize healthy growth – like focusing on users, being kind to employees, building brand equity over time.
The pump-and-dump grift cycle of enshittification has got to stop.
As soon as finance capital gets involved, the core objective of business shifts to primarily providing higher returns for the investors. Small to medium businesses can have diversity but once they "go public" they all are almost alike.
A few companies in tech have plenty of goodwill, even if not all are privately owned: RedHat (currently burning theirs), Mozilla, 37signals, Khan Academy, Mapbox, Komoot, DuckDuckGo, Brave, etc.
Not all these have a viable business model, but they’re not obsessed with growth and brokering user data.
Yeah, as soon as MBAs and financial investors get involved (which, for "inefficient" small companies, is just a matter of time), the companies change to become an exploitative cancer on society like the rest.
There's a whole industry looking for companies that are too nice to their employees and customers, to buy them, exploit them, grab the profits, and leave a husk behind.
I suppose that it's impossible to do a hostile takeover of a private company. All it takes is to stay financially healthy and reject offers to sell!
Much easier said than done. Sometimes the owners just get tired or bored, and want out. They sell the company to the highest bidder, within reason; often there's no crowd clamoring to buy a small, non-growing company.
Receiving files, copying them within a file system, and transmitting the contents? Clearly not looking.
Computing a hash for each file? Doesn’t seem like looking either.
Parsing the contents, or tokenizing them for a ML model? That’s looking IMO.
I think the defining factor is the size and nature of the derived state update on the server. Copying and transmitting retains only metadata about the file. Computing a hash produces only a handful of bytes of state about the file, and there’s nothing that can be inferred about the file’s contents from the hash. But if you start storing something like word frequencies, that’s not neutral anymore, it’s a “look” into the contents.
Yes, that’s clearly GitHub looking at the contents. At that point you just have to trust they don’t do anything else with the internal representation they’ve built.
When Gmail was introduced two decades ago, there was some short-lived furore about the fact that Google admitted they were reading your emails for ad targeting purposes. To a lot of people like myself, this seemed a bridge too far when you’re used to servers that don’t “look” at your private stuff. But it certainly didn’t hurt Gmail’s adoption. (I still don’t use Gmail because of this original lingering distaste.)
They don't "read" emails for ad targeting anymore (according to them). But they certainly do for search and spam filtering, which are features that users expect.
They do code highlighting, search indexing, and symbol resolution. I don't know if those can be disabled, but those features require looking at the code and updating state (indexes and such, maybe caching of highlighting results).
Postman doesn’t provide tools to diff your postcards, search them, scan their content for security issues, or automatically take actions based on the contents of your postcards.
A postcard is not a letter; its contents are written in the open, right next to the postal address, and are fully visible to the postman. I imagine that's why the OP chose specifically a postcard for his example.
It really sucks when you want to keep a secret but keep getting asked the right questions. It would have been much easier for Amazon if people just took the first statement at face value.
Not related to private repositories, but GitHub also keeps logs about commits in the "Activity Graph" even if those commits no longer exist.
For example, clearing the full reflog of a repository and then either force-pushing over it, or deleting the remote repository and pushing it again under a different name, will keep the activity attribution. In the second case it might even attribute the commits to the new remote (404ing when clicking "Created x commits in y repositories").
Good for team workflows and pull requests, but annoying for public repositories with intentionally rewritten history.
Reflog is data local to a git node, not part of the repository. In particular, reflogs are never transmitted between remotes, so “clearing the reflog” (presumably on your local machine) has nothing to do with removing any data at GitHub. You need to contact support for that, which is what the docs advise you to do if you’ve force pushed over something you want fully deleted.
No problem. One easy way to gain confidence that data has actually been wiped instead of just obscured is to use one of the gitrevisions [1] specifiers that sources data from the reflog. For example,
master@{yesterday}
resolves to whatever the reflog thinks the master branch was pointing to 24 hours ago. There are a bunch of places in the GitHub API and UI that look like they only accept a branch or commit OID but actually accept a revision specifier, so you can probe those with a reflog-based specifier to check whether supposedly wiped data still resolves.
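A local sketch of how these specifiers behave (assumes git is installed; run in a throwaway directory; all names are illustrative). Reflog-based specifiers resolve against the reflog, so they can surface commits that `git log` no longer shows after a reset:

```shell
# Reflog-based revision specifiers can resolve "deleted" history.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
old=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"
git reset -q --hard HEAD~1      # "second" vanishes from `git log`...
git rev-parse 'HEAD@{1}'        # ...but the reflog entry still resolves it
```

Time-based forms like `master@{yesterday}` work the same way, just keyed on timestamps instead of entry position.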
Apart from the misconception, I worded my initial post in an imprecise way.
I did not know whether or not the reflog relates to what's pushed to the remote. I just had the expectation that an empty git log as well as git reflog output would mean that no objects exist except for what represents the tree of the current HEAD.
Still haven't tried the revision specifiers on a repository that I expected to be wiped, but will do that. I need to read more docs about git plumbing and the .git folder.
> I just had the expectation that an empty git log as well as git reflog output would mean that no objects exist except for what represents the tree of the current HEAD.
It’s a pretty common expectation, but the nature of git’s object model means that generally it’s not true for any git provider. You should assume that data pushed to any git remote will live forever unless you both force-push away all references to the data and GC all remotes (which for GitHub means contacting support, IIRC).
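A minimal local demonstration of that object-model point (assumes git is installed; run in a throwaway directory; the file name is made up): objects orphaned by a history rewrite stay in the object store, addressable by hash, until a GC actually prunes them:

```shell
# Orphaned objects survive a history rewrite until GC'd.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
printf 'hunter2\n' > secret.txt
git add secret.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "oops, committed a secret"
blob=$(git rev-parse HEAD:secret.txt)
git reset -q --hard HEAD~1      # "remove" the secret from history
git cat-file -p "$blob"         # prints: hunter2
```

On a remote you can't run `git cat-file` directly, but the same retention applies server-side until the provider GCs, which is why the advice above is to contact support.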
Yes, but that is the standard Terms of Service, which applies to public repositories as well. The terms appear to be the same as for public repositories; there doesn't appear to be a carve-out.
I posted about this a week or so ago. Someone could come steal all of our "secret source" and still have absolutely no advantage over us. If the value of your business is entirely in the source code, you need to seriously re-investigate what exactly it is that you are doing. AAA gaming and Windows codebases have been leaked. In some cases partially, in others comprehensively. At no point do I recall Valve or Microsoft seeing adverse competition or other meaningful impact from this event. Legal teams were certainly up in arms, but in terms that investors would care about nothing really happens.
Turns out, you need more than source code to run a business. You need customers, markets & relationships. You need trust. You need a sales team. That's the real thing you should be worried about being stolen. Your goddamn sales team. Obviously the developers would be worried about the code base (and they should be) but at scale it's not the end of the world if some code gets out. Losing your best sales person & all their prospects would be much more catastrophic for your business.
Yeah, it totally makes sense to compare gigantic companies to all other ones. Let's also extrapolate what should happen to a subcompact in a collision by looking at what happens to a tank in those same circumstances.
Coca Cola’s empire surely isn’t dependent on Coke’s recipe, yet they’re still keeping it a secret. Same goes for the vast majority of for-profit organizations who do not want their core product’s recipe exposed, for a variety of reasons.
> If the value of your business is entirely in the source code, you need to seriously re-investigate what exactly it is that you are doing.
Most software companies' "value" (liquidatable assets) depends almost exclusively on their IP, as embodied in their patent portfolio and their source code. Their people may be more valuable, but people can't be traded or used as collateral.
Nothing is impossible to hack - I don't think GitHub themselves would agree that it's impossible to hack. But not using Github, and having your own git hosting also doesn't make it impossible to hack.
If there is proof that they are training their commercial AI from private professional source code it could feed a COLOSSAL class action lawsuit. I understand that 1-3 percent of the American GDP goes to lawyers who work on contingency for these kinds of class actions. These guys eat what they kill, and they feed on corporations. That doesn't mean that Microsoft won't try it.
This is the kind of privacy that I (along with a large majority of unsophisticated tech users) care about most. I don't care if machines access my data and serve me better ads; I do care whether customer data can easily be viewed by an HR person / angry ex / private investigator / controlling ultra-religious parents / crooked cop in a third-world country with a bullshit "this person wants to commit suicide" excuse.
I have actually talked with non-tech-savvy people about this, most of them have heard of tech companies tracking their users, and most of them genuinely believe that there are tech company employees out there going through their private lives. This is why I really dislike the use of the word "tracking" in a tech privacy context.
I've heard plenty of non-technical users complain about ads that they think were shown based on tracking; most often the complaint involves suspicion that an idle phone's mic is listening.
I've also talked with non-tech people about how the tracking and data mining world actually works, I've never had someone say "so what?". It may simply be that non-tech people aren't as aware of what the tracking does based on ignorance of the problem, not a genuine acceptance of the practice itself.
More like a happy marriage where your partner drugs your bedtime cocoa and then grates your feet with a cheese grater for some reason that you really don't want to know.
That's hyperbole. I think it's closer to knowing that your spouse may leave you and expose your nudes to the world. It's still not a comforting thought, but there's no risk to your (or your project's) existence.
You’ve taken this out of context. This was specifically referring to “scanning”, i.e. vulnerability scanning. I doubt it would hold up in court if it turns out their “private” data was used to train AI.
GitHub: We are owned by Microsoft. And Microsoft is known to secretly forward much or all of its users' data and communications to the NSA (National Security Agency), as part of the PRISM program.
That's one of the nicest features of SourceHut, IMHO. There's a real person to talk with, and some hard principles behind the service. Not on the same page? Nobody forces you to use it.
The related update, in short, says this:
You have a legit blockchain project, tell us, and we'll allow you. Crypto? We most probably don't allow it [0].
The wording is this (verbatim):
We will exercise discretion when applying this rule. If you believe that your use-case for cryptocurrency or blockchain is not plagued by these social problems, you may ask for permission to host it on SourceHut, or appeal its removal, by contacting support.
I'd rather have my project removed by a person I can talk with, rather than a corporate chat bot pretending to be human, and have my source code not be harvested without my consent for a service that is then sold to me at $10/mo.
SourceHut planned to ban the Go module mirror due to the bandwidth it consumed; then they talked with Russ Cox, and the Go team agreed to implement caching in Go (the -reuse flag, IOW) [0].
I think this is a major win, because assuming everyone has unlimited 10 Gbps symmetric dark fiber at home, at the office, in the car, and on their mobile phone is wrong on so many levels.
Git doesn't need a hub. But a hub is convenient enough to use compared to self-hosting.
Just like many of us here know how to freely download pirated torrent mp3s safely, yet pay for Spotify/Apple Music as it's more convenient (also legal, but I doubt many care about that, especially here).
Right. And all the GPL, MIT, and CC Sharealike-with-attribution material I've been stashing on Github is going into the vast code anonymizer called Copilot.
The good news for me the open-source developer: it looks like the closed-source code people write with the help of Copilot and all the open-source code in its learning set will also feed into that learning set.
I guess code just wants to be free -- free as in speech, expensive as in yachts and fine wine for VCs and private equity pirates. Plus ça change, plus ça reste la même chose. Sigh.
...but they will see some kind of mash-up, remixed, synthesized version of it that might be 99.9% similar so practically this means absolutely nothing.
That explains their move to make private repos free a few years ago.
What frustrates me is that it feels like they didn't announce their intentions of what they'd do with your code. They use the same kind of verbiage in the T&C as they would use to say they collect some usage telemetry and might occasionally share it with partners. And that is definitely not the same.
I have been thinking about self-hosting my private repositories for some time now. I also want a simple web interface for convenience. I don't need a "GitHub clone" like Gogs or Gitea.
I feel like git services like github would be well-suited to join the fediverse. It's clear to a lot of us that our code is becoming too centralized, but it's also hard to deny the usefulness of being able to search across so much code in a single place.
I'd also expect that users of github would have an easier time than the average person figuring that change out.
It's an interesting thought but GitHub specifically has absolutely no reason to want this; doubly so after the Microsoft acquisition. Sourcehut, Gitea.io, and Codeberg would be the places to start.
How usable is Pijul and its Nest forge these days? Maybe something low-stakes but (for me) highly used, like my dotfiles, would be a good fit for learning my way around over there.
I'm getting a 500 Internal Server Error on their pricing page, which I assume may have something to do with this post on HN. Anyone know what it costs to have private repos there?
As the author, I do. Pijul is usable, we use it for itself, no problem.
The Nest forge is still a bit experimental, since we've recently rewritten it using tech that isn't as ready as it promises (looking at you, Cloudflare Workers).
A private repo is 5€/month, plus 0.01€ per GB per day for storage above 100MB.
And we don't do the kind of stuff discussed here, we're just offering storage and collaboration tools, and intend to keep it this way (as well as open sourcing everything).
It's starting to look like GitHub is slowly arriving at its own funeral. Lots of pointless site changes that hide or relocate important functionality, a new Twitter-like feed that tries to recommend all sorts of crap you don't care about, and now this. It's a shame, as GitHub Actions is not half bad.
I was kind of making a joke that Git is much more important than any individual forge software though, because Git is a decentralized protocol. Any forge software you feel like running will work.
I personally don't use gitweb, but I do use cgit. It has all the features I care about (making my git repos available to view on the web). If you find that you want different features, you should use a forge that provides those features.
This really shouldn't be the step it takes for you to think "maybe I shouldn't give Microsoft all the source code that I think is important enough to keep private".
One wonders if there's a shadow field dedicated to coaxing these datasets which "will never show up, don't worry" out of LLMs.
I accidentally got a few hundred fake(?) names out of the codex models because I asked it to LARP as a professor (I was hoping for better code). Probably almost all hallucinations, but I wouldn't be surprised if some had some truth to them.
We'd still need a way to know when someone violates it. When you see a copy of your licensed software being used you can at least try to enforce that, but how do you know it was fed through an AI for training?
At best that would be like a Don't Track Me flag. Some will follow it, though ironically it is also useful for fingerprinting and can signal you may be doing something worth tracking.
"Open source" doesn't really help much without the "Free and" prefix, and unfortunately there's no way to reconcile software freedom and banning specific use cases. Dunno if you remember the "can we remove Nazis from Tor" controversy a while back, but it's the same kind of thing. It's either free or it isn't.
Does anyone have specific and unique enough code in their private repos to test whether Copilot and similar tools will actually recommend their specific private code or methods? (I guess this is the thing people worry about?)
"Rest assured that your code will only ever be viewed by a vast and unfathomable superintelligence whose motives are as yet unclear. Please be cognizant of any code or comments which may inadvertently invoke the shoggoth's displeasure." /s
Given this privacy policy, users should not upload their private data to GitHub, even if the repository is set to private. GitHub should only host public repositories or non-sensitive data.
No chance GitHub is anywhere near using private repo source code to train the public GitHub Copilot.
On the other hand, getting a personalized Copilot within your org is something entirely different and probably will be out in the not too distant future.
They could feed private repos into copilot. It might make copilot better. It's certainly useful to automate extraction of IP with plausible deniability. If someone notices, they can claim "oh that must be in a public dataset as well".
> By default, all public repositories are included in the GitHub Archive Program, a partnership between GitHub and organizations such as Software Heritage Foundation and Internet Archive to ensure the long-term preservation of the world's open source software
You no longer own your projects; they have the final word. Scary times.
Glad I moved away from Microsoft products many years ago, sad for my country