About GitHub’s use of your data (docs.github.com)
404 points by fagnerbrack on June 23, 2023 | 277 comments



> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.


> Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.

That's very concerning wording... I guess even private repositories are being fed into AI, at least that's what that policy explicitly allows.


This is huge, and unfortunately not surprising at all in the age of massive, ever-growing, out-of-control tech monopolies that do whatever the fuck they want. Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.

Every service and utility gets enshittified sooner or later; it's a given at the moment. I deleted all my private repos; GitHub and all other MS services should be avoided in the future.

The only solution is to self-host. Gitea is good.


> The only solution is to self-host. Gitea is good.

The Gitea project hosts its code on GitHub: https://github.com/go-gitea/gitea. You must admit that is a bit ironic.

> age of massive ever-growing out of control tech monopolies that do whatever the fuck they want

GitHub is not the only option for source code hosting. There are alternatives like GitLab, Bitbucket, and numerous smaller ones.


If you like the community-driven fork of Gitea (which still upstreams to the Gitea project), then you should check out https://forgejo.org

The fork was established at the time Gitea got entrepreneurial and founded Gitea Ltd. with plans for an enterprise version. https://codeberg.org used to run on Gitea but switched to Forgejo, and the Forgejo project is hosted on Codeberg at https://codeberg.org/forgejo


That is a real tongue-twister of a name.


Pronounced: for-jay-oh (it is a derivation of the Esperanto name for "forge")


Or maybe like .. "forge ho"


Another fork? gogs -> gitea -> forgejo lol


OSS development model functioning as intended.


It's possible with OSS development but spreading out contributors and patches over three projects instead of one with a functioning community is hardly the ideal OSS development model.


Yet, when you lose value alignment with the project, the best thing to do is to abandon ship as soon as possible. Insisting on total collaboration is bad for all parties.


If they continue to push upstream per the license, then it works. There are a lot of Linux desktop apps that work this way; it's hardly a broken model.


While not an option for everyone, if you have a server with ssh access you can do:

    git init --bare /path/to/repo.git
on the server. Then, locally, you git clone that repo with an SSH URL.

It does not have any visual MR or enterprisey features, but it works.
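To make the whole flow concrete, here is a runnable sketch. The paths are placeholders, and a local path stands in for the SSH URL (a real setup would use something like `ssh://you@yourserver/path/to/repo.git`):

```shell
# On the "server": create an empty bare repository
git init --bare /tmp/demo-bare.git

# Locally: clone it (over SSH in real life; a local path here for the sketch)
git clone /tmp/demo-bare.git /tmp/demo-clone

# Pushes from the clone land in the bare repo like any other remote
cd /tmp/demo-clone
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "first commit"
git push origin HEAD
```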


That's OK, but it gets unwieldy when managing a number of git repositories on an SSH server. I'm using gitolite [1] for that.

The features are basic and managed by editing text files and git-pushing them to a control repository: create repositories, add users and their keys, readonly or readwrite. There is no GUI but once you have a copy of the repo on your machine you can use one of the several git GUIs available for any OS.

[1] https://gitolite.com/gitolite/
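For a flavour of what those text files look like: access is defined in a single `gitolite.conf` inside your clone of the special gitolite-admin repository, which you push back to apply the changes. The repo and user names below are hypothetical; the sketch just writes the file to a temp path:

```shell
# Hypothetical gitolite.conf: repos, users, and access rules are plain text.
# Pushing this back to the gitolite-admin repo creates the repositories
# and applies the permissions.
cat > /tmp/gitolite.conf <<'EOF'
@developers = alice bob

repo projectA
    RW+     =   alice        # read/write, may rewind branches
    R       =   bob          # read-only

repo projectB
    RW      =   @developers  # read/write, no forced pushes
EOF
```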


What aspects of bare git repos over ssh are uncomfortable or subpar?


With gitolite I don't have to manually set up every single repo and configure access with maybe one user name per project. That would be too much. And what about read-only vs. read-write access?


A large portion of people don’t want to memorise all the commands related to merging, branching, etc.; catering to the lowest common denominator is important.


How does the choice of hosting change this?

Bare repo on a server is exposed to people exactly like Github: a remote URL you put in once and forget about it.

How people use their local git repository is their business, command-line, Sourcetree, GitKraken, what have you, but any of those work with any remotes.

(Sure, git by itself does not provide the other features from the hosting services like issue tracking and pull requests, but not every workflow requires those to be linked directly to the SCM)


I don’t care for issue tracking, but I do like the usability of diffs and merging in web UI apps. My primary job is to look at code, not write it (meaning I’d fail basic git merge questions), but I’ve also found out the hard way that just because I know my way around a shell doesn’t mean I can force my views on the people my company hires, and I own the company.


I do genuinely appreciate that, but that's the point: the graphical clients that do visual diffing and merging like the ones I listed all work with a bare repo as the remote.

Heck I think even the Github Desktop application also works with non-Github repositories, and they would be the only ones that would have any interest in locking people in.

Unless you mean specifically the UX of having a URL you can copy to a specific line of a specific commit in a repository, which indeed is not possible without a standard URI scheme (which does not exist) or a web client.


I get your point. At the same time, I find it funny how Linus was checking patches via email, deciding what got merged into the Linux kernel. Now every service needs all the replicated enterprisey features.

It is not a personal criticism of you. I find it interesting that git gave us all this efficiency, and the enterprise removes it by adding complexity back, because employees supposedly cannot be bothered to learn their tools (or cannot be mandated to) or plainly prefer a nicer UI. Not a crime, but I can see how big corporations become inefficient with this type of thinking when applied to hundreds of tools and processes.


I use gitolite as well, it's great. Currently working on integrating it into a CI/CD pipeline, which admittedly proves to be a slight challenge, but I'm sure I'll get there eventually.


I was sold on the features but never got the hang of how to work with it.


We've been doing that for years in our org. This works perfectly. We do push open source repos to GitHub, though.


I, as a Linux user, built a similar system myself by getting an FTP account, mounting it locally with curlftpfs and then using git on the mounted filesystem.


This exactly. Git itself is all you need. You can connect clients/IDEs like VS Code to such a repo easily.


> You must admit that is a bit ironic

It's a sad situation that if you desire exposure and community building you must maintain a fork on Github, but that's how it is for smaller projects. I am in a similar situation, with some of my projects with main repos hosted on sourcehut, but most of external engagement comes from clones on github. It is what it is, and we do what we must. :)


I would agree for any other kind of project, except for a GitHub alternative.

How does it look from the potential users' perspective when the product they market is not the product they choose to use for themselves?


It looks like they are a pragmatic project that prefers having contributors over being ideologically pure. It's not like there isn't an official repository hosted on Gitea: https://gitea.com/gitea


GitLab entered a strategic partnership with Google, likely for the very same reason - feeding Google AI models with enough code.


Could you link to some of the announcements or articles? I only ask because I was totally unaware and would like to learn more.



> GitLab is working with Google Cloud because of its strong commitment to privacy and enterprise readiness, and its leadership in AI.

Google's commitment to privacy? Google's leadership in AI?

Oh, how I love marketing; you can say just about anything.



> You must admit that is a bit ironic.

Looks like they're working on migrating to a Gitea instance: https://github.com/go-gitea/gitea/issues/1029 .


Wow, how pathetic that GitHub is refusing to export their data:

https://github.com/go-gitea/gitea/issues/1029#issuecomment-1...


Failing != Refusing


Failing = Refusing + hiding behind corporate bureaucracy


> You must admit that is a bit ironic.

The people are on github, so it is really enticing.

Maybe the reddit and twitter drama creates a viable enough community for federated logins to become useful.


> The people are on github, so it is really enticing.

For any other project, sure. But when building an alternative to GitHub.. there is value in dogfooding.


or just git init --bare


> You must admit that is a bit ironic.

Every time someone parrots this, I have to wonder if they did more than 5 minutes of reading - it's one of the top issues on the issue tracker and they've outright stated they will move once Gitea is at a spot where they are not losing functionality and history.


I did not parrot anything. This is the first time I have heard of Gitea; I googled it, and the first thing I noticed was that it is hosted on GitHub. It was an original thought.

I did not care enough to open their issue tracker. I still don't. It is ironic; not a bit, a lot. That statement was a bit sarcastic.

I hope that puts an end to your wondering.


>Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.

This is what is crazy to me. You can agree to terms, build infrastructure around terms you agreed to, then those terms can completely change. Don't like it? Click disagree and we'll close your account, no problem!

And, thanks to politics around social media censorship, we have way too many people willing to say, "Don't like the terms, don't use the platform!" to the point of normalization. Sad.


I am agreeing and adding another solution.

The other solution is political. There's a reason that governments regulate and define economic rules of the road. This is a good example of where governments need to step in. The link between generative AI and the data it is trained on needs to be carefully thought through and properly handled especially given the capitalist nature of our economy.


[flagged]


Parroting ideologue-dogmatic bullshit platitudes is not conducive to a good discussion.


If you're going to wade into a 200 year old flame war at least have something interesting to say.


Emergence of machine intelligence* and its control by Capital was not foreseen by Karl Marx, and the intervening period between heat death of Capitalist system and the Workers Utopia has been indefinitely extended.

* pure transformation of energy into labor


There's an awful lot of very smart people who have studied economics for the majority of their lives who disagree with this. There are also alternatives to capitalism that don't entirely involve govt control.


Absolute BULLSHIT!

Greed is what screws up the market. Ask Alan Greenspan re 2008 Banking Crisis.


There is always SourceHut (https://sourcehut.org) if you want.


Do you have experience with self-hosting Gitea? I am on the fence about going with Gitea because of the recent fork of the project (Forgejo). It seems that many contributors are now contributing mainly to Forgejo.


The reason for the fork was that Gitea was going for-profit and the folks that forked to Forgejo felt they went about that transition in a way that eroded trust. Here's their explanation: https://blog.codeberg.org/codeberg-launches-forgejo.html


Gitea is itself a fork of gogs (Go Git Server)

It is functioning like Open Source should: there was a disagreement over how the project was run, so it got forked.

This used to be more commonplace when projects were run by people, not companies. I wish the practice would come back; we need more forks in Free Software.


It feels bad to "waste" the work that could have otherwise gone into highly-paid billable hours, or at least charity work on other repos that get more use.


I self host Gitea. Very reliable. Painless setup. I wish it had some sort of CI like GitHub Actions or Bitbucket Pipelines, but otherwise I'm totally happy with it.


> I wish it had some sort of CI like github actions or bitbucket pipelines

I use Gitea with Drone CI and it works pretty well: https://www.drone.io/

Some might also prefer the Woodpecker CI fork due to the license: https://woodpecker-ci.org/

I setup Drone as a part of my migration away from GitLab Omnibus and have no complaints so far: https://blog.kronis.dev/articles/goodbye-gitlab-hello-gitea-...

Here's the Drone example in particular: https://blog.kronis.dev/tutorials/moving-from-gitlab-ci-to-d...


It's been added recently. Not sure how they compare.


Gitea supports GitHub Actions-compatible workflows as of version 1.19.


Just self host the community edition of GitLab. It's miles better than Gitea. It's got CI pipelines, it's got a pretty robust issue tracker, it's got wiki pages, it'll integrate with LDAP/AD for authentication, it's got a package repository for self hosting libraries, it's got releases, it's got a service desk to make email -> ticket pipelines, etc.


GitLab CE is far too heavy and requires a minimum of 4 GB to run. It contains lots of components, including PostgreSQL and Redis, and startup takes a long time. Gitea I can run with just 1 GB, or on a Raspberry Pi. It includes a wiki, package repositories, and releases as well. LDAP, service desk: those are enterprise features that I don't need.


> It's miles better than gitea.

GitLab is a crazy setup full of services, with elaborate interdependence, absurd hardware requirements, iffy performance, and all the lack of confidence in security that comes with that (and it only ever runs if you use their Docker images and don't touch anything).

But yeah, it got everything.


Gitea has all these features as well, except maybe the last.


I've got Gitea running on a $5 Vultr instance and it's great.

Upgrades have been painless. Doesn't tax the server.

I was using Gitea when that fork happened and didn't see a reason to migrate. It looked very much like poor communication on the part of Gitea causing a misunderstanding.


I self host Gitea both on my home NAS and a DO droplet. I set up repo syncing between the instances, and it works flawlessly. I've moved most of my projects off GitHub/GitLab and overall I'm very happy with it.


I self-host gitea as a github backup just in case. It's pretty easy and well documented (it's a single executable and you can use sqlite for the database).
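To illustrate how small that setup is: the database choice lives in Gitea's `app.ini`. The section below is a minimal SQLite configuration (the paths are hypothetical; the sketch writes it to a temp file):

```shell
# Minimal [database] section of a Gitea app.ini using SQLite;
# no external database server is needed.
cat > /tmp/app.ini <<'EOF'
[database]
DB_TYPE = sqlite3
PATH    = /var/lib/gitea/data/gitea.db
EOF
```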


Cyberpunk 2077 here we come!


> The only solution is to self-host. Gitea is good.

I don’t understand your thinking and gitea’s marketing. They say in the same breath that it’s “self-hosting” and that they do “Git hosting… similar to GitHub, BitBucket, and GitLab”. — https://docs.gitea.com/


You install Gitea on a server that you own. You use that instance of Gitea to host your git repositories. That is self-hosted git hosting.

Gitea is an open source alternative to GitHub, that you run yourself.


It's a "run your own github" application. akin to Github Enterprise Server or Gitlab CE/EE, except unlike Github Enterprise Server and Gitlab EE, it's open source.


As far as I am aware, they do not offer a hosting service. I believe that statement was meant to convey that the Gitea software, once installed, is a git host similar to the others. I think they were trying to differentiate between a typical remote git repo and all the web components that come with Gitea. They do offer paid support, but that's still for self hosting.


Playing devil's advocate: all kinds of linting, vetting, or security scanning with any degree of smartness beyond a regex would probably fall into my definition of non-human eyes too.

Slippery slope, yes, but in reality there is a legal framework and a ToS in place as well.


These tools are enabled by the owner of the repository. And I think consent has precedence over license terms. But this seems like an escape hatch for using your code as a training material.


If it doesn't explicitly say: we do not use private repos data to train our AI models, I wouldn't even consider assuming any other way than that's exactly what they are doing. They know this is a question everyone wants to know the answer to. Why would they leave any ambiguity? Let me help you answer that: because that's exactly what they are doing.


Also dependency graph is enabled by default. So there's that.


This. I believe the title of this post is clickbait and doesn’t include the context of this verbiage.

Use of data for AI training is a big deal. Implicitly allowing it under this definition will be enough cause for a lawsuit.


> in reality there is a legal framework and a ToS in place as well.

That doesn't give a lot of comfort. ToS can change at any time.


Everything referenced in this webpage refers only to public data, or private data where a user has enabled the dependency graph.

>data from public repositories, and also... data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph

So if you don't enable the Dep Graph feature, your private code is treated according to the standard terms of service.


For anyone wondering:

1. Click your avatar

2. Settings

3. Security / Code security and analysis

4. Dependency graph [Disable all]

You're welcome


Is it on by default? Couldn't they just silently re-enable it with an "oops that's how site updates go sometimes!" later?


Well they could silently do anything. No need to toggle the user-visible checkbox

It's like giving a party some of your possessions to store, and signing a form saying you don't consent to them snooping through them, then worrying "what if they change and forge the contract later?". They wouldn't need to. Either you trust them to keep their word or you don't trust them with your possessions at all


It’s not a checkbox, it’s an action.

…and “automatically disable for all new private repos” is not set unless you explicitly opt in.


Interesting wording. I would call opting in to disabling a feature 'opt out'.

I'm sure github legal prefers your description, though.


When I click that disable all button there is a modal window with a checkbox that says "Enable by default for new repositories"

I'm assuming I definitely want to not check that if I don't want my code stolen?

But I dunno, they have worded this all so confusingly, this is some dark pattern shit. Do I check the box or not?


I believe you DO NOT check that box in the popup modal window.


So give them an inch (Dep Graph feature) and they'll take a mile (loss of confidentiality); that's not any better...


> So give them an inch

You are not giving them an inch, you are receiving an inch (and a whole lot more). And if you don't want that, you can turn the feature off.

What would you want instead? Them not offering the feature at all?


This is a false dichotomy. They can enable the dependency graph feature without repurposing your data for machine learning.


And they can also repurpose your data for ML, or otherwise illicitly pilfer your data, without you turning a random dependency visualization on. There seems to be an idea here that turning it off will change something, like how there are people adamant that Windows 10 has secret tracking features, yet believe they can defeat them using regedit. I'm afraid this is the technical equivalent of "swiper no swiping!"

And to be clear I'm not saying Github or Windows is doing this, I'm just saying either you trust them to keep their word or you don't


I'd like them to limit the scope of the data use to dependency graph ONLY and for it to not be stored. I swear this goes without saying.. but these blanket self granted permissions on the part of GitHub are a serious breach of trust. They don't need this to provide the dependency graph service.


My understanding of the way dependabot works is that they read your repository's manifest files to provide alerts about vulnerable dependencies. Then after you patch the repository they re-read it to identify the new package version you've moved to, and if there were any build errors introduced from the update. That info is anonymised & fed into an aggregated dataset, creating suggestions for other users like "95% of repositories impacted by this CVE upgraded from v1.0.4 to 1.0.5". Dependabot doesn't read through your code (unless your code is in `.manifest` or `.package` files).

That might still be too much for you, everyone's line is at a different place. Dependabot works better the more data it has, so personally I'm fine with that level of detail being returned to an aggregated service, for that very specific purpose.

There are other private providers of dependency scanning out there. I'm fairly sure they all work the same way Github does, just with smaller datasets.


I am not disagreeing with you, but by your own admission it does not read your code, only the project's manifest. Then why on Earth do they ask for so much more when all they need to do is bump some dependency version in the manifest?

As for the aggregate stats, why not limit the scope to access to the build status instead? They don't need anything more to provide this service...


As far as I can see from all of their documentation, they exclusively open a predefined list of common manifest files. They explicitly mention that they won't be able to do dependabot analysis if your deps are listed in a `setup.py` file for example. You'd need readonly access to the entire repository to actually find those manifest files though, as most of the time they're not going to be in the top-level directory.

For companies & people not willing or able to surrender that level of access, it's still possible to use the dependabot API, dissociated from any github repository (for example if you have an Atlassian git repository). You just send it a list of dependencies and an enum indicating its current build status, and they'll reply with a list of suggestions.


I doubt there's any legal way to say "we'll only read your manifests". Programmers may know what that means (and still debate it, honestly), but I'm skeptical the law has the ability to distinguish between the designated utilities of specific files.


If their AI uses my code as the basis to generate similar code, how is that not a derivative work?

If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.

How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?


> If their AI uses my code as the basis to generate similar code, how is that not a derivative work?

How would you know the LLM used your work specifically and then prove that in court?


It's their problem to build lineage in their system.

Breach of licence and copyright laws is still illegal even if it's very well concealed.

Lots of their customers' code requires attribution, or specifies acceptable uses (notably, what kind of license the product built with your code is allowed to have). Beyond simple copyright, to which anyone can lay claim, they're also concealing this.


> It's their problem to build lineage in their system

From a practical standpoint, I don’t see how that is true.

In order to establish (civil) liability in the United States, someone would have to file suit, survive the inevitable standing challenges, and demonstrate on the preponderance of the evidence that they had been harmed by this conduct.

If you can’t prove your code was reproduced by this model, I don’t see how you can successfully do that.


Truly curious, do you feel the same way about art whose creator/s did not go out of their way to copyright images?


It’s like saying you own an image after JPEG compressing it. There are going to be lawsuits, but MS is so big they might just consider settling them part of the cost of making sure they are on the cutting edge of generative AI.


Besides, the "except" part qualifying the "human eyes" is substantial.

GitHub is a Californian company; they are beholden to US regulations that require them to hand over data upon legal request, and they can be prohibited from telling you they did. These requests happen frequently.


GitHub publishes a transparency report where they reveal how many of these requests they processed.

https://github.blog/tag/github-transparency-report/

> In 2022, GitHub received and processed 432 requests to disclose user information, as compared to 335 in 2021. Of those 432 requests, 274 were subpoenas (with 265 of those subpoenas being criminal or from government agencies and 9 being civil), 97 were court orders, and 22 were search warrants.

Concerning national security letters:

> We’re very limited in what we can legally disclose about national security letters and Foreign Intelligence Surveillance Act (FISA) orders. We report information about these types of requests in ranges of 250, starting with zero. As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.


> As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.

Seems like we can increase the lower bound ever so slightly: It couldn't have been zero, otherwise it would affect "0-249 accounts".


Now everyone knows why codeberg was invented. This type of thing is entirely foreseeable. Good luck not having all your IP be copy pasteable to anyone using a LLM.


sigh I moved all my projects (from Bitbucket and Github) to Gitlab. Now it appears Gitlab is feeding its data to Google [1].

I do have a codeberg account so will consider moving my repositories again, although it's probably too late and has all been sucked up by Google already.

[1] https://news.ycombinator.com/item?id=36445526


We need GPLv4 that explicitly says that models trained on GPLv4 code and anything they produce are derivative works.


GitHub/MSFT/OpenAI’s stance is that training AI models is fair use and doesn’t require a license. It doesn’t matter what kind of license you slap on the code. If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use, not a new software license.


> If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use

True, assuming that GitHub/MSFT/OpenAI's stance is correct. Just because they have that stance doesn't mean it is[1].

In the end, these gray areas will be decided in court.

[1] I personally think it probably is, which is why I removed all of my stuff from the open web and these sorts of services -- I see no other way to prevent this use of my data.


The policy says human eyes will never see the contents of your private repositories. Suppose, as you say, the repos are one day fed into an AI. How does that policy then constrain the AI later?

AIUI the AI can't be used to answer any questions based on knowledge from training that included private repositories, since (still AIUI) there's no guarantee that contents from the training won't appear in the output.

But there seems to be a loophole for yes/no questions. An AI that answers only yes/no could be trained on private repositories, or run on them. So they could, say, train an AI to recognise repos that require working hours from their legal staff, or from their support staff, and then run that AI on all repos to locate repos that cause expense in the future. Things like that seem possible.

And an AI might answer questions like this: "Does any repo contain the string 845fjkef5urwejf in a Kotlin file?"


> The policy says human eyes will never see the contents of your private repositories.

I love how they say that as if it's a meaningful distinction in some way.


> Your individual personal or repository data will not be shared with third parties.

OpenAI is a third party..?


GitHub owns the Copilot model; it is based on OpenAI's Codex but they own and operate it themselves. So technically feeding it into their AI would be OK by those terms. That's why Copilot still exists even after Codex was killed off by OpenAI.


We’re still in the nascent stages of code models. There’s ample opportunity to transition to private models before the technology reaches maturity. As for the destiny of current private GitHub repositories, they may contribute to training models (through what’s referred to as “aggregate data”), but the potential backlash could be considerable, particularly considering the bans on using ChatGPT that various firms have implemented.


Indeed. I wonder how many people don't see this. Or don't want to see this. Nobody should assume that a private Github repo is really private.

My employer (one of the largest in the world) has fully migrated to Office 365 and a lot of the new source code is on private Github repositories. But no worries, the humans working at Microsoft will not analyse the data. This is only done by AI.


Is anyone aware of GitLab's policies regarding the privacy of private repositories? Do they have any "AI feeding" mechanisms in place?


Seems like one has to assume corporate repos have them. If they don't already, they could decide to at any time. If that's a problem, self-hosting puts the power back into the hands of the users (and can be done very cheaply - I'm paying ~$5/month for a server that's doing a bunch of stuff including very basic personal git hosting)


How do you think they provide all the services they provide? Secrets scanning, contextual source information, everything. I'm not sure how anyone could have gone this long and been surprised by this.

Next people will be surprised that GitHub has the right to show your source code on a website....


I think they are referring to automated abuse scanners, which have been "made of AI" for decades.

I strongly doubt that this language would support the use of private repositories for training generative systems.


Right, that's a very loose definition of private.


Copilot will AI wash your data


You guys realize that GitHub provides secret scanning and supply-chain (library) issue warnings?

How do you think they do that? Crystal ball?

Not everything is about AI


I think this is probably true. Even the leak of critical software seems related to manually inputting data into ChatGPT: https://mashable.com/article/samsung-chatgpt-leak-details

There is no current reason to believe that Copilot or ChatGPT are trained on private repositories; however, this agreement does permit them to do so.

The concern is: why not? You have an enormous corpus of professional-grade software just sitting there, one bit flip away from access, and you are even permitted (to the letter of the law) to flip that bit. Why wouldn't you, eventually?


>however this agreement does permit them to do so.

How so? Assuming you read the whole agreement and not just take a single line out of context?

"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."


Reading the terms of service is the most important thing; almost everything else is immaterial.

I am working from information derived from the terms of service: https://docs.github.com/en/site-policy/github-terms/github-t...

For a small example of what I mean:

> GitHub considers the contents of private repositories to be confidential to you. GitHub will protect the contents of private repositories from unauthorized use, access, or disclosure in the same manner that we would use to protect our own confidential information of a similar nature and in no event with less than a reasonable degree of care.

When talking about confidentiality, GitHub is specific in mentioning unauthorised access; the implication is that GitHub has the authority to authorise itself.

The next section talks about how they will grant access to third parties, but it's clear that as a first party they exclude themselves from the moniker of "unauthorised".


I agree; however, taking random pieces of a long agreement in isolation is not how you read terms of service. The section you highlight is NOT, for example, the one that talks about what legal rights you give GitHub regarding your code. Construing it as such and then extrapolating from it is taking a piece of text way out of context.


The section cited is the most prominent section in the terms of service regarding private repositories [0].

Everything else in that section relates to personnel (which is not the concern of the commenters here), what happens if you upload copyrighted material, and codes of conduct.

Also, taking segments of the ToS is exactly what will happen in a court of law; unless there is another section that invalidates this one, this fragment will be argued as consent.

Additionally, the context of that section is very clear otherwise. Sorry.

The way the ToS will be argued is: there are many references to processing information, and nothing that explicitly prohibits (or can reasonably be inferred to prohibit) automatic processing; in fact, there are many places where they refer to "automatic processing for services". You can take that to mean just their SAST scanners; they can take it to mean training an AI model. There's no distinction legally here.

[0]: https://docs.github.com/en/site-policy/github-terms/github-t...


You mean like the earlier section that explicitly says "License Grant to Us", which the court would somehow amazingly ignore, according to you? Because clearly the exact legal rights you give GitHub to use your work are not relevant to a discussion of how GitHub is legally allowed to use your work.


I can tell you're a bit defensive about this, so let me be clear as possible: talk to a lawyer.

I am not a lawyer myself, though I consult with Swiss/German lawyers quite often regarding licensing because it is a large part of my role.

The fragment I mentioned is the beginning of section "E"; this document uses letters as the top-level section separator, so a different letter from the prior one is taken as an intentional separation for the purposes of a legal document. What you are referring to is from section "D"; i.e., they are in entirely different contexts.

Even then, you do not need a license to automatically process source code, as per the text of this ToS.

There are more than a few points where they talk about what they have the rights to read, and only one point (which is legally required to be there, fwiw) which states they can't release your source code as if it were theirs.

If you read the Copilot terms of service, they are not granting a license to use the suggestions.

A very pessimistic take is that they are quite well protected legally in using source code derived from these repositories, even without the muddy discussion about training data and derivation.


"why wouldn't you, eventually?"

Because if people found out you were doing that you would lose vast numbers of paying customers, which would cost you a lot of money.


You're talking about a company that puts ads in the start menu and opts users in to all kinds of telemetry.

If people haven't gotten tired of Microsoft's antics yet, they never will.


It's a little different: Microsoft doesn't care about consumers, who stopped being relevant to Microsoft's financials in the Ballmer era. Even consumer Office spending brings in less than LinkedIn, to put into perspective how little they care.

However, enterprise/commercial customers? The same group you're trying to sandwich between a 365 subscription and Azure/GitHub? These are the people who, when they do have evidence that they are being negatively affected, will cause a massive dent in Microsoft's bottom line.


In the intelligence business, they talk about capabilities, not intents.


> The concern is, why not?

They need to reliably filter out secrets, or they are in trouble once Copilot suggests them.


They already scan for secrets so that they can warn you that you checked them in.


Temporarily reading into RAM and applying a regex isn’t the same as feeding your repos to an LLM, which may store parts of them permanently.

Obviously GitHub can read my repos in order to display them on GitHub.com, but access must be fleeting.
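A toy sketch of the pattern-based scan being described, assuming GNU grep and using one well-known pattern (AWS access key IDs); real scanners like GitHub's check many provider-specific patterns:

```shell
# One regex pass over the working tree, with nothing retained afterwards.
# The AKIA... pattern matches AWS access key IDs; it is just one example
# of the provider patterns a real secret scanner would check.
if grep -rnE 'AKIA[0-9A-Z]{16}' .; then
    echo "possible AWS access key ID checked in"
fi
```

This is only the shape of the check, but it illustrates the point: a regex pass needs no lasting copy of the contents.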


[flagged]


In that case they needn't put this in their ToS. It's strange how people will distrust a company with one hand, then take its word as reliable evidence (and think it can be stopped by unticking a checkbox) with the other. If they wished to deceive you, they would do it without telling on themselves.


Any sane business decision-maker would want legal cover for something like this in case it draws a lawsuit. Regardless of whether they themselves think it's okay, doing it entirely on the sly makes it more likely an opposing lawyer can convince a judge/jury that it's not okay. If they have an agreement that the client "accepted", that strengthens their position quite a lot.


Of course it's about AI.

They have a legitimate cause to read it, and while they are at it they use the data for everything else they want.

There is no way for you to know.


Do people want Github to display files in your private repos on github.com? Do you want syntax highlighting and intellisense? Do you want them tokenized for search? Should they be available for cloning when you need it? Should they run security scans and identify vulnerabilities? Should you be able to see git blame, diffs, commits, other metadata? Set up build pipelines? Modify contents via an API?

Tell me: how can GitHub do all of this without programmatically accessing the contents of your private repos from their servers?


As I see it people are basically taking a single line of an agreement that explicitly references other agreements out of context:

"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."


No, I don't want these features. I want reliable remote git storage, which lets me clone a copy of my source code anywhere so I can do these things locally, on my own machine. People forget that this was the original design and intent of Git as a decentralized protocol. As an industry I think we have become too reliant on Github as some kind of way-far-beyond-git magic tool. This might be fine until it isn't--like when Microsoft decides it's time to start training Copilot on your private code.


Use another place to store your git projects then, it's not like Github is the only one. Github is a PRODUCT built around git, not just a place to store your git repositories. I'm pretty sure you will be able to find other services which only exist to host git projects and do nothing else.


You say this as if you think most people have a choice...I've pushed hard in the past for my employer to move off Github but have been met with nothing but blind resistance.


Being concerned for your own personal reasons about Microsoft scanning your code is one thing. How does it make sense for your employer to switch off GitHub?


But that's not your code, that's your employer's code. You might have written it but they fully own it. This doesn't seem like something to be concerned about on their behalf.


Doing remote storage _still_ requires their servers to read and process your repository data.


Then you shouldn't be using Github at all. Problem solved.


Tell this to my CTO.


Why? It's his company, his code, his decisions. You are free to use whatever you want for your own projects.


Also, for what it's worth, I was extremely surprised and concerned when my private repos appeared in code search results. I had to triple-check that it wasn't showing up when logged out or logged in to a dummy account without access to the repo. This behavior is definitely not intuitive and if there's a way to disable search indexing on private repos I'd do so.


Or: how can any provider, even a self-hosted one, provide any of that without automated processing of the contents of private repositories?


1. Avoid VC and investors, and,

2. Build small- to mid-size businesses, that can prioritize healthy growth – like focusing on users, being kind to employees, building brand equity over time.

The pump-and-dump grift cycle of enshittification has got to stop.


As soon as finance capital gets involved, the core objective of the business shifts to primarily providing higher returns for the investors. Small and medium businesses can have diversity, but once they "go public" they are almost all alike.


Then don’t go public.

Accept that you cannot grow at the same speed and focus on deep roots and fanatical customers


Does that ever work? After all, an enterprise that cannot grow at the same speed is effectively shrinking, isn't it?


37signals (and others) proved that a different approach is possible.


Yeah, it can work. Valve is private and basically owns the PC game distribution market, even competing against industry titans like MS.

I’m not sure I know another example, so maybe it’s very rare.


A few companies in tech have plenty of goodwill, even if not all are privately owned: RedHat (currently burning theirs), Mozilla, 37signals, Khan Academy, Mapbox, Komoot, DuckDuckGo, Brave, etc.

Not all these have a viable business model, but they’re not obsessed with growth and brokering user data.


Why does it have to grow?

It can also be stagnant, which is totally acceptable as well.


You can't really be stagnant, due to competition. Not unless you're deep into some niche that's difficult to enter or stick around in.


Yeah, as soon as MBAs and financial investors get involved (which, for "inefficient" small companies, is just a matter of time), the companies change into an exploitative cancer on society like the rest.

There's a whole industry looking for companies that are too nice to their employees and customers, to buy them, exploit them, grab the profits, and leave a husk behind.


I suppose that it's impossible to do a hostile takeover of a private company. All it takes is to stay financially healthy and reject offers to sell!

Much easier said than done. Sometimes the owners just get tired or bored, and want out. They sell the company to the highest bidder, within reason; often there's no crowd clamoring to buy a small, non-growing company.

Then it's usually over.


How exactly is GitHub supposed to work if their computers are not allowed to look at your code? Git doesn't work by magic.


I guess it hinges on the definition of “looking”.

Receiving files, copying them within a file system, and transmitting the contents? Clearly not looking.

Computing a hash for each file? Doesn’t seem like looking either.

Parsing the contents, or tokenizing them for a ML model? That’s looking IMO.

I think the defining factor is the size and nature of the derived state update on the server. Copying and transmitting retains only metadata about the file. Computing a hash produces only a handful of bytes of state about the file, and there’s nothing that can be inferred about the file’s contents from the hash. But if you start storing something like word frequencies, that’s not neutral anymore, it’s a “look” into the contents.
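As a concrete illustration of the hash case: git's own object ID is SHA-1 over a short header plus the raw bytes, so the derived server-side state amounts to 20 bytes from which the contents cannot be recovered.

```shell
# Git's blob ID is SHA-1 over "blob <size>\0" followed by the raw contents.
printf 'hello\n' | git hash-object --stdin
# The same digest computed by hand (sha1sum as found on Linux):
printf 'blob 6\0hello\n' | sha1sum
# Both print ce013625030ba8dba906f756967f9e9ca394464a.
```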


What about parsing and tokenizing the code for code search?


Yes, that’s clearly GitHub looking at the contents. At that point you just have to trust they don’t do anything else with the internal representation they’ve built.

When Gmail was introduced two decades ago, there was some short-lived furore about the fact that Google admitted they were reading your emails for ad targeting purposes. To a lot of people like myself, this seemed a bridge too far when you’re used to servers that don’t “look” at your private stuff. But it certainly didn’t hurt Gmail’s adoption. (I still don’t use Gmail because of this original lingering distaste.)


They don't "read" emails for ad targeting anymore (according to them). But they certainly do for search and spam filtering, which are features that users expect.


Once things hinge on the definition of commonly-used verbs and nouns, you've got a problem.

And if you don't agree, just remember that my claim hinges on the definition of the words "you", "definition", "got" and "a".


The problem is GitHub’s if they’re the ones accessing copyrighted content under poorly defined terms.


Go to github.com, open your repository and open a file. It shows up on the screen. Is that not "looking"?


No, because the server doesn’t update any state based on the transmitted file contents except some metadata like “last viewed”.

The state update based on the file contents happens in my brain, and that’s the “look” action.


They do code highlighting, search indexing, and symbol resolution. I don't know if those can be disabled, but for them GitHub needs to look at the code and update state (indexes and such, maybe caching of highlighted output).


Code highlighting is done in the client; the infrastructure just serves the content for that particular feature.


It's the sharing with partners bit that I don't like. If I want that to happen, I should be able to explicitly grant permissions.


Same way a postman delivers a postcard.


A postman doesn’t provide tools to diff your postcards, search them, scan their content for security issues, or automatically take actions based on the contents of your postcards.

GitHub does all those things with your commits.


A postman doesn't need to show, potentially scan, and parse the contents of some text in order to display it on a website.

Not really analogous.


You'd be surprised at what the United States Postal Service does with all flat paper mail, then!

https://www.usps.com/manage/informed-delivery.htm


A postcard is not a letter; its contents are written in the open, right next to the postal address, and are fully visible to the postman. I imagine that's why the OP chose specifically a postcard for his example.


Exactly.

You expect him to read the address, not the content.

Maybe you don't mind if he reads the content sometimes, but if he read every postcard every time, you would be pissed.


By carefully reading the entire thing and storing a list of deltas from some other postcard when it would help compress it better?


Didn't Amazon say the same about the Alexa data?

Nothing is stored at Amazon.

Ok, stored but no human has access.

Ok, humans have access but only special employees

Ok, some third party contractor get the data too.


It really sucks when you want to keep a secret but keep getting asked the right questions. It would have been much easier for Amazon if people just took the first statement at face value.


Why is this setting hidden behind the “Dependency Graph” feature?

GitHub should be making this blatantly clear and transparent for users, rather than using Facebook-esque cloak-and-dagger language.


Not related to private repositories, but GitHub also keeps records of commits in the "Activity Graph" even if those commits no longer exist.

For example, clearing the full reflog of a repository and then either force-pushing over it or deleting the remote repository (then pushing it again under a different name) will keep the activity attribution; in the second case it might even attribute the commits to the new remote (404ing when you click "Created x commits in y repositories").

Good for team workflows and pull requests, but annoying for public repositories with intentionally rewritten history.


Reflog is data local to a git node, not part of the repository. In particular, reflogs are never transmitted between remotes, so "clearing the reflog" (presumably on your local machine) has nothing to do with removing any data at GitHub. You need to contact support for that, which the docs advise you to do if you've force-pushed over something you want fully deleted.


Thank you, today I learned


No problem. One easy way to gain confidence that data has actually been wiped instead of just obscured is to use one of the gitrevisions [1] specifiers that sources data from the reflog. For example,

    master@{yesterday}
resolves to whatever the reflog thinks the master branch was pointing to 24 hours ago. There are a bunch of places in the GitHub API and UI that look like they only accept a branch or commit OID but actually accept a revision specifier. So, e.g.,

    https://github.com/torvalds/linux/tree/master@{yesterday}
lets you peek into the reflog for the master branch on torvalds/linux.

[1] man 7 gitrevisions
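A couple of other reflog-sourced specifiers from the same man page, sketched in a throwaway repo, make the same point:

```shell
# More reflog-backed gitrevisions(7) specifiers:
git rev-parse 'HEAD@{1}'   # where HEAD pointed one reflog entry ago
git log -g --oneline       # walk the HEAD reflog itself
```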


Thank you very much!

Apart from the misconception, I worded my initial post in an imprecise way.

I did not know whether or not the reflog relates to what's pushed to the remote. I just had the expectation that empty git log and git reflog output would mean that no objects exist except for what represents the tree of the current HEAD.

Still haven't tried the revision specifiers on a repository that I expected to be wiped, but will do that. I need to read more docs about git plumbing and the .git folder.


  > I just had the expectation that an empty git log as well as git reflog output would mean that no objects exist except for what represents the tree of the current HEAD.
It’s a pretty common expectation, but the nature of git’s object model means that generally it’s not true for any git provider. You should assume that data pushed to any git remote will live forever unless you both force-push away all references to the data and GC all remotes (which for GitHub means contacting support, IIRC).
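On a remote you control, both halves of that are explicit commands; a sketch (on GitHub only the first half is in your hands, and support has to do the GC):

```shell
# Destroying an object takes both steps: drop every reference to it,
# then garbage-collect. Until the GC runs, "deleted" data is still there.
git reset --hard HEAD~1                # force away the unwanted commit
git reflog expire --expire=now --all   # drop reflog entries still naming it
git gc --prune=now                     # only now are the objects deleted
```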


Thanks again for your thoughtful responses.


GitHub monoculture is becoming a major issue for the health of the entire software industry


The monoculture will always exist, humans tend to flock together. See also twitter, Reddit.


“…except as described in our Terms of Service.” according to that paragraph


Yes, but that is the standard Terms of Service, which applies to public repositories as well. The terms appear to be the same as for public repositories; there doesn't appear to be a carve-out.


I've worked at companies that store all their source code in private github repos.

And I think - is github impossible to hack?

Will the day come when github is wide open and all that secret source gets drained out?


I posted about this a week or so ago. Someone could come steal all of our "secret source" and still have absolutely no advantage over us. If the value of your business is entirely in the source code, you need to seriously re-investigate what exactly it is that you are doing. AAA gaming and Windows codebases have been leaked, in some cases partially, in others comprehensively. At no point do I recall Valve or Microsoft seeing adverse competition or other meaningful impact from those events. Legal teams were certainly up in arms, but in terms that investors would care about, nothing really happened.

Turns out, you need more than source code to run a business. You need customers, markets & relationships. You need trust. You need a sales team. That's the real thing you should be worried about being stolen. Your goddamn sales team. Obviously the developers would be worried about the code base (and they should be) but at scale it's not the end of the world if some code gets out. Losing your best sales person & all their prospects would be much more catastrophic for your business.


Yeah, it totally makes sense to compare gigantic companies to all other ones. Let's also extrapolate what should happen to a subcompact in a collision by looking at what happens to a tank in those same circumstances.


These are well established business with deeply developed ecosystems (monopolies?) and market traction behind them.

Smaller players, especially threats to OpenAI and whoever else could easily get squashed in such a scenario.


You’re simplifying this to infinity.

Coca Cola’s empire surely isn’t dependent on Coke’s recipe, yet they’re still keeping it a secret. Same goes for the vast majority of for-profit organizations who do not want their core product’s recipe exposed, for a variety of reasons.


> If the value of your business is entirely in the source code, you need to seriously re-investigate what exactly it is that you are doing.

Most software companies' "value" (liquidatable assets) depends almost exclusively on their IP, as embodied in their patent portfolio and their source code. Their people may be more valuable, but people can't be traded or used as collateral.


I think hedge funds would be a little uncomfortable with their source code being leaked.

It’s also a security risk having your source code leaked.


Nothing is impossible to hack - I don't think GitHub themselves would agree that it's impossible to hack. But not using Github, and having your own git hosting also doesn't make it impossible to hack.


> And I think - is github impossible to hack?

I think it's more a question of how much it will cost to host repositories at less risk than on GitHub.


If there is proof that they are training their commercial AI on private professional source code, it could feed a COLOSSAL class-action lawsuit. I understand that 1-3 percent of American GDP goes to lawyers who work on contingency on these kinds of class actions. These guys eat what they kill, and they feed on corporations. That doesn't mean Microsoft won't try it.


This is the kind of privacy that I (along with a large majority of unsophisticated tech users) care about most. I don't care if machines access my data and serve me better ads, I do care whether customer data can easily be viewed by a HR person / angry ex / private investigator / controlling ultra-religious parents / crooked cop of a third-world-country with a bullshit "this person wants to commit suicide" excuse.

I have actually talked with non-tech-savvy people about this, most of them have heard of tech companies tracking their users, and most of them genuinely believe that there are tech company employees out there going through their private lives. This is why I really dislike the use of the word "tracking" in a tech privacy context.


Where is the difference?

A machine will read your data and set a flag in your HR file.

No human read your data, but you still face consequences.


I've heard plenty of non-technical users complain about ads that they think were shown based on tracking; most often the complaint involves suspicions about the use of an idle phone's mic.

I've also talked with non-tech people about how the tracking and data-mining world actually works, and I've never had someone say "so what?". It may simply be that non-tech people aren't as aware of what the tracking does out of ignorance of the problem, not out of genuine acceptance of the practice itself.


> most of them genuinely believe that there are tech company employees out there going through their private lives.

Are you implying this is an incorrect belief? There are reports of employees stalking their former partners at data collection companies.


Github is a great service. However, it is a bit like being in a happy marriage but knowing your spouse is going to murder you in a few years.


More like a happy marriage where your partner drugs your bedtime cocoa and then grates your feet with a cheese grater for some reason that you really don't want to know.


That's hyperbole. I think it's closer to knowing that your spouse may leave you and expose your nudes to the world. It's still not a comforting thought, but there's no risk to your (or your project's) existence.


Well, if the USA arbitrarily decides that your country sucks for them, there is a risk to your project's existence.


And is currently cheating on you with "third-party analysis and tracking partners."


you know that in the future your baby is going to be taken away


You’ve taken this out of context. This was specifically referring to "scanning", i.e. vulnerability scanning. I doubt it will stand the test of law if someone sues them over their "private" data being used to train AI.


GitHub: We are owned by Microsoft. And Microsoft is known to secretly forward much or all of the user data and communications to the NSA (National Security Agency), as part of the PRISM program

(https://www.lawfareblog.com/snowden-revelations)

... but we'll never share your data with anyone!


I’m switching and creating all my new projects at sourcehut. Nothing good ever comes out of Microsoft having dominance.


Just be aware that sourcehut is Opinionated, and some projects are unwelcome there (e.g. even vaguely cryptocurrency-related code).


That's one of the nicest features of SourceHut, IMHO. There's a real person to talk with, and some hard principles behind the service. Not on the same page? Nobody forces you to use it.

The related update, in short, says this:

    You have a legit blockchain project? Tell us, and we'll allow it. Crypto? We most probably don't allow it [0].
The wording is this (verbatim):

    We will exercise discretion when applying this rule. If you believe that your use-case for cryptocurrency or blockchain is not plagued by these social problems, you may ask for permission to host it on SourceHut, or appeal its removal, by contacting support.
I'd rather have my project removed by a person I can talk with than have a corporate chat bot pretend to be human while my source code is harvested without my consent for a service sold back to me at $10/mo.

[0]: https://sourcehut.org/blog/2022-10-31-tos-update-cryptocurre...


Nice feature!


I dunno, I consider it an anti-feature personally.

They don’t allow crypto related code to be hosted there purely because the guy who runs the show dislikes crypto.

What else does he dislike in the future? AI/ML projects are just as divisive as crypto coin projects…


Given his opinions on package managers like npm, I guess anything vaguely adjacent doesn't feel welcome.


SourceHut planned to ban the Go module mirror due to the bandwidth it consumed; then they talked with Russ Cox, and the Go team agreed to implement caching in Go (the -reuse flag, IOW) [0].

I think this is a major win, because assuming everyone has unlimited 10 Gbps symmetric dark fiber at home, at the office, in the car, and on their mobile phone is wrong on so many levels.

[0]: https://sourcehut.org/blog/2023-01-09-gomodulemirror/


When the product is free...


except you can't get free private repos anymore... oh wait...

Yeah they give you a lot more on the free account now, don't they? :/


So naturally repositories in Github Enterprise accounts are excluded?


It depends on the terms of the agreement.


When the software is proprietary...


This is the only answer


OK, correct me if I am wrong, but Git is designed to not need a hub, right?

We've maybe been lazy with how we centralise repos in this way.

But whatever GitHub is doing, Vercel/AWS/CF can also do when you deploy, and maybe they already do, to detect bad actors?


Git doesn't need a hub. But a hub is convenient enough to use compared to self-hosting.

Just like many of us here know how to freely and safely download pirated torrent MP3s, yet pay for Spotify/Apple Music as it's more convenient (and legal, though I doubt many care about that, especially here).
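For the curious, the hub-less setup really is small; a minimal sketch, where the host name and paths are placeholders:

```shell
# A "hub" is just a bare repository on any machine you can SSH into.
# example.com and the paths below are placeholders for illustration.
ssh user@example.com 'git init --bare /srv/git/myproject.git'

# Then, from an existing working copy:
git remote add origin user@example.com:/srv/git/myproject.git
git push -u origin main
```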


Sounds like Gmail's "no human is reading your emails" line from 20 years ago


Right. And all the GPL, MIT, and CC ShareAlike-with-attribution material I've been stashing on GitHub is going into the vast code anonymizer called Copilot.

The good news for me, the open-source developer: it looks like the closed-source code people write with Copilot's help will feed into that learning set too, alongside all the open-source code already in it.

I guess code just wants to be free -- free as in speech, expensive as in yachts and fine wine for VCs and private equity pirates. Plus ça change, plus ça reste la même chose. Sigh.


"Computer eyes" only look to pick what and how to show "human eyes", otherwise there'd be no incentive to look at private repositories.


...but they will see some kind of mashed-up, remixed, synthesized version of it that might be 99.9% similar, so practically this means absolutely nothing.


This reads like someone’s boss drunkenly told the author to “just tell them no human eyes will read it, har har.” The author then just phoned it in.


That explains their move to make private repos free a few years ago.

What frustrates me is that it feels like they never announced their intentions for what they'd do with your code. They use the same kind of verbiage in the T&Cs as they would to say they collect some usage telemetry and might occasionally share it with partners. And that is definitely not the same.


I have been thinking about self-hosting my private repositories for some time now. I also want a simple web interface for convenience, but I don't need a "GitHub clone" like Gogs or Gitea.

What simpler alternatives are there? I found https://github.com/jonashaag/klaus . Is there something else?


You don't need any extra software. Git already has a built-in web interface (https://git-scm.com/book/en/v2/Git-on-the-Server-GitWeb).


Oh wow I didn't know that.. thanks I will have a look at this!


I feel like git services like github would be well-suited to join the fediverse. It's clear to a lot of us that our code is becoming too centralized, but it's also hard to deny the usefulness of being able to search across so much code in a single place.

I'd also expect that users of github would have an easier time than the average person figuring that change out.


It's an interesting thought but GitHub specifically has absolutely no reason to want this; doubly so after the Microsoft acquisition. Sourcehut, Gitea.io, and Codeberg would be the places to start.


Having a computer read your data can be even worse than a human... After all, a human programmed it to find whatever they want, at scale.


How usable are Pijul and its Nest forge these days? Maybe something low-stakes but (for me) highly used, like my dotfiles, would be a good fit for learning my way around over there.

I'm getting a 500 Internal Server Error on their pricing page, which I assume may have something to do with this post on HN. Anyone know what it costs to have private repos there?


As the author, I do. Pijul is usable, we use it for itself, no problem.

The Nest forge is still a bit experimental, since we've recently rewritten it using tech that isn't as ready as it promises (looking at you, Cloudflare Workers).

A private repo is €5/month, plus €0.01 per GB·day for storage above 100 MB.

And we don't do the kind of stuff discussed here, we're just offering storage and collaboration tools, and intend to keep it this way (as well as open sourcing everything).


It's starting to look like GitHub is slowly arriving at its own funeral. Lots of pointless site changes that hide or relocate important functionality. A new Twitter-like feed that tries to recommend all sorts of crap you don't care about. And now this. It's a shame, as GitHub Actions is not half bad.


Have we all settled on what's gonna be the self-hosted version yet?


Git


Git has a web server?



I seriously had no idea lol. What a thing that is.


I was kind of making a joke that Git is much more important than any individual piece of forge software, because Git is a decentralized protocol. Any forge software you feel like running will work.

I personally don't use gitweb, but I do use cgit. It has all the features I care about (making my git repos viewable on the web). If you find that you want different features, you should use a forge that provides those features.


That plus reasonable security (my code isn't worth money) is all I really want...


This really shouldn't be the step it takes for you to think "maybe I shouldn't give Microsoft all the source code that I think is important enough to keep private".


How else did people think GitHub was able to do things like link to function definitions, or scan private repos for stored secrets and alert you?
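As a toy illustration of that kind of machine scanning (GitHub's real scanner matches hundreds of provider token formats and verifies candidates; the pattern below is just the well-known shape of AWS access key IDs, `AKIA` plus 16 characters):

```shell
# Toy secret scan: search a working tree for strings shaped like
# AWS access key IDs. Exits non-zero when nothing matches.
grep -rEn 'AKIA[0-9A-Z]{16}' . && echo "possible leaked credential(s) above"
```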


Confident statement coming from them. What about when it was hacked six years ago and some kid had access to all the private repos?


I don’t know if that’s reassuring or worrying…


you want to minimize trust.

you have to trust linux and kind of trust aws, you don’t have to trust github at all.

not your keys, not your coins. it’s really such a good catchphrase.

for anything non-public, git data goes here:

https://github.com/nathants/git-remote-aws


One wonders if there's a shadow field dedicated to coaxing these datasets which "will never show up, don't worry" out of LLMs.

I accidentally got a few hundred fake(?) names out of the codex models because I asked it to LARP as a professor (I was hoping for better code). Probably almost all hallucinations, but I wouldn't be surprised if some had some truth to them.


We really need to have an open source sw license that specifically forbid any use of the code by AIs


We'd still need a way to know when someone violates it. When you see a copy of your licensed software being used you can at least try to enforce that, but how do you know it was fed through an AI for training?

At best that would be like a Don't Track Me flag. Some will follow it, though ironically it is also useful for fingerprinting and can signal you may be doing something worth tracking.


"Open source" doesn't really help much without the "Free and" prefix, and unfortunately there's no way to reconcile software freedom and banning specific use cases. Dunno if you remember the "can we remove Nazis from ToR" controversy awhile back but it's the same kind of thing. It's either free or it isn't.


I don't know what you guys expect from a Microsoft company.... Seems obvious.


Does anyone have specific and unique enough code in their private repos to check whether Copilot and similar tools will actually recommend their specific private code or methods? (I guess this is the thing people worry about?)


"Rest assured that your code will only ever be viewed by a vast and unfathomable superintelligence whose motives are as yet unclear. Please be cognizant of any code or comments which may inadvertently invoke the shoggoth's displeasure." /s


Given this privacy policy, users should not upload private data to GitHub, even if the repository is set to private. GitHub should only be trusted to host public repositories or non-sensitive data.


If you use a free service, you're the product.


This applies to 'pro & team' paid services too.


Ah yes, it's only memorized by an AI, but don't worry, we'll make some changes to it so that no one will recognize it.


Just buggy machine eyes?


aggregate data is getting a whole new meaning these days.

traditionally I would have juxtaposed "aggregate" with "privacy-preserving", though in light of LLMs, I think that is no longer reasonable.

soon a zip file is also "aggregate" data.


No chance GitHub is anywhere near using private repo source code to train the public GitHub Copilot.

On the other hand, getting a personalized Copilot within your org is something entirely different and probably will be out in the not too distant future.


How do you arrive at that conclusion?

They could feed private repos into copilot. It might make copilot better. It's certainly useful to automate extraction of IP with plausible deniability. If someone notices, they can claim "oh that must be in a public dataset as well".


[flagged]


[flagged]


> Please don't comment about the voting on comments. It never does any good, and it makes boring reading.

https://news.ycombinator.com/newsguidelines.html


Also

> Eschew flamebait. Avoid generic tangents. Omit internet tropes.

OP is also in poor taste


> By default, all public repositories are included in the GitHub Archive Program, a partnership between GitHub and organizations such as Software Heritage Foundation and Internet Archive to ensure the long-term preservation of the world's open source software

You no longer own your projects; they have the final word. Scary times.

Glad I moved away from Microsoft products many years ago, sad for my country



