> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
>
> Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.
That's very concerning wording... I guess even private repositories are being fed into AI; at least, that's what the policy explicitly allows.
This is huge, and unfortunately not surprising at all in the age of massive, ever-growing, out-of-control tech monopolies that do whatever the fuck they want. Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.
Every service and utility gets enshittified sooner or later; it's a given at the moment. I deleted all my private repos; GitHub and all other MS services should be avoided in the future.
If you like the idea of a community-driven fork of Gitea (which still upstreams changes to the Gitea project), you should check out https://forgejo.org
The fork was established when Gitea got entrepreneurial and founded Gitea Ltd. with plans for an enterprise version. https://codeberg.org used to run on Gitea but switched to Forgejo, and the Forgejo project is hosted on Codeberg at https://codeberg.org/forgejo
Forking is always possible in OSS development, but spreading contributors and patches over three projects instead of one with a functioning community is hardly the ideal OSS development model.
Yet, when you lose value alignment with the project, the best thing to do is to abandon ship as soon as possible. Insisting on continued collaboration is bad for every party.
That's OK, but it gets uncomfortable when managing a number of git repositories on an SSH server. I'm using gitolite [1] for that.
The features are basic and managed by editing text files and git-pushing them to a control repository: create repositories, add users and their keys, grant read-only or read-write access. There is no GUI, but once you have a copy of the repo on your machine you can use one of the several git GUIs available for any OS.
With gitolite I don't have to manually set up every single repo and configure access with maybe one user name per project. That would be too much. And what about read-only vs. read-write access?
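For anyone curious, here's a sketch of what that control file looks like (repo and user names are illustrative):

```conf
# conf/gitolite.conf in the gitolite-admin control repository
@devs      = alice bob         # a user group

repo projectA
    RW+    = alice             # read/write, including history rewrites
    R      = bob               # read-only

repo projectB
    RW     = @devs             # read/write, but no force-pushes
```

Adding a user means dropping their public key into keydir/ and pushing; gitolite then creates the repositories and enforces the access rules on the server side.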
A large portion of people don't want to memorise all the commands related to merging, branching, etc., so catering towards the lowest common denominator is important.
Bare repo on a server is exposed to people exactly like Github: a remote URL you put in once and forget about it.
How people use their local git repository is their business, command-line, Sourcetree, GitKraken, what have you, but any of those work with any remotes.
(Sure, git by itself does not provide the other features from the hosting services like issue tracking and pull requests, but not every workflow requires those to be linked directly to the SCM)
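To illustrate how little a bare-repo "remote" involves, here's a minimal sketch; a local path stands in for the server, but over SSH the remote URL would just be something like `user@host:myproject.git`:

```shell
# The bare repository is the whole "server" (a local path here;
# over SSH it would live at e.g. user@host:myproject.git)
mkdir -p /tmp/bare-demo
git init --bare /tmp/bare-demo/myproject.git

# Local working copy: add the remote URL once and forget about it
git init /tmp/bare-demo/work
cd /tmp/bare-demo/work
git checkout -B main
git config user.email you@example.com
git config user.name you
echo hello > README
git add README
git commit -m "initial commit"
git remote add origin /tmp/bare-demo/myproject.git
git push -u origin main
```

From then on, any client (command line, Sourcetree, GitKraken) works against that remote exactly as it would against GitHub.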
I don’t care for issue tracking, but I do like the usability of diffs and merging in web ui apps - my primary job is to look at code not write it (meaning I’d fail basic git merge questions) but I’ve also found out the hard way that just because I know my way around a shell doesn’t mean I can force my views on the people my company hires, and I own the company.
I do genuinely appreciate that, but that's the point: the graphical clients that do visual diffing and merging like the ones I listed all work with a bare repo as the remote.
Heck I think even the Github Desktop application also works with non-Github repositories, and they would be the only ones that would have any interest in locking people in.
Unless you mean specifically the UX of having a URL you can copy to a specific line of a specific commit in a repository, which indeed is not possible without a standard URI scheme (which does not exist) or a web client.
I get your point. At the same time I find it funny how Linus was checking patches via email, deciding what got merged into the Linux kernel. Now every service needs all the replicated enterprisey features.
It is not a personal criticism of you. I find it interesting that git gave us all this efficiency, and the enterprise removes it by adding complexity back because employees supposedly cannot be bothered to learn their tools (or cannot be mandated to), or plainly prefer a nicer UI. Not a crime, but I can see how big corporations become inefficient with this type of thinking, when applied to hundreds of tools and processes.
I use gitolite as well, it's great. Currently working on integrating it into a CI/CD pipeline, which admittedly proves to be a slight challenge, but I'm sure I'll get there eventually.
I, as a Linux user, built a similar system myself by getting an FTP account, mounting it locally with curlftpfs and then using git on the mounted filesystem.
It's a sad situation that if you desire exposure and community building you must maintain a fork on Github, but that's how it is for smaller projects. I am in a similar situation, with some of my projects with main repos hosted on sourcehut, but most of external engagement comes from clones on github. It is what it is, and we do what we must. :)
It looks like they are a pragmatic project that prefers having contributors to being ideologically pure. It's not like there isn't an official repository hosted on Gitea: https://gitea.com/gitea
Every time someone parrots this, I have to wonder if they did more than 5 minutes of reading - it's one of the top issues on the issue tracker and they've outright stated they will move once Gitea is at a spot where they are not losing functionality and history.
I did not parrot anything.
This is the first time I have heard of Gitea. I googled it, and the first thing I noticed was that it was hosted on GitHub. It was an original thought.
I did not care enough to open their issue tracker. I still don't. It is ironic, not a bit, a lot. That statement was a bit sarcastic.
>Whatever reads in the TOS now, they can and will just reword it when they need it. There's no trust.
This is what is crazy to me. You can agree to terms, build infrastructure around terms you agreed to, then those terms can completely change. Don't like it? Click disagree and we'll close your account, no problem!
And, thanks to the politics around social media censorship, we have way too many people willing to say, "Don't like the terms? Don't use the platform!" to the point of normalization. Sad.
The other solution is political. There's a reason that governments regulate and define economic rules of the road. This is a good example of where governments need to step in. The link between generative AI and the data it is trained on needs to be carefully thought through and properly handled especially given the capitalist nature of our economy.
The emergence of machine intelligence* and its control by Capital was not foreseen by Karl Marx, and the intervening period between the heat death of the Capitalist system and the Workers' Utopia has been indefinitely extended.
There's an awful lot of very smart people who have studied economics for the majority of their lives who disagree with this. There are also alternatives to capitalism that don't entirely involve govt control.
Do you have experience with self-hosting Gitea? I am on the fence about going with Gitea because of the recent fork of the project (Forgejo). It seems that many contributors are now contributing mainly to Forgejo.
The reason for the fork was that Gitea was going for-profit and the folks that forked to Forgejo felt they went about that transition in a way that eroded trust. Here's their explanation: https://blog.codeberg.org/codeberg-launches-forgejo.html
It is functioning like Open Source should: there was a disagreement over how the project was run, so it got forked.
This used to be more commonplace when projects were run by people, not companies. I wish the practice would come back; we need more forks in Free Software.
It feels bad to "waste" the work that could have otherwise gone into highly-paid billable hours, or at least charity work on other repos that get more use.
I self-host Gitea. Very reliable. Painless setup. I wish it had some sort of CI like GitHub Actions or Bitbucket Pipelines, but otherwise I'm totally happy with it.
Just self-host the community edition of GitLab. It's miles better than Gitea. It's got CI pipelines, a pretty robust issue tracker, wiki pages, LDAP/AD integration for authentication, a package repository for self-hosting libraries, releases, a service desk to make email -> ticket pipelines, etc.
GitLab CE is far too heavy and requires a minimum of 4GB to run. It bundles lots of components, including PostgreSQL and Redis, and startup takes a long time. With Gitea I can run with just 1GB, or on a Raspberry Pi. It includes wikis, package repositories, and releases as well. LDAP and service desk are enterprise features that I don't need.
GitLab is a crazy setup full of services, with elaborate interdependence, absurd hardware requirements, iffy performance, and all the lack of confidence in security that comes from this (and it only ever runs if you use their Docker images and don't touch anything).
I've got Gitea running on a $5 Vultr instance and it's great.
Upgrades have been painless. Doesn't tax the server.
I was using Gitea when that fork happened and didn't see a reason to migrate. It looked very much like poor communication on Gitea's part causing a misunderstanding.
I self-host Gitea both on my home NAS and a DO droplet. I set up repo syncing between the instances, and it works flawlessly. I've moved most of my projects off GitHub/GitLab, and overall I'm very happy with it.
I self-host gitea as a github backup just in case. It's pretty easy and well documented (it's a single executable and you can use sqlite for the database).
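As a rough sketch of how little configuration that takes, an app.ini along these lines is enough for the SQLite setup (section and key names follow Gitea's config conventions; the domain and paths here are illustrative, not canonical defaults):

```ini
; app.ini: minimal self-hosted Gitea using the embedded SQLite driver
[server]
DOMAIN    = git.example.com
HTTP_PORT = 3000

[database]
DB_TYPE = sqlite3
PATH    = /var/lib/gitea/data/gitea.db
```

With SQLite there's no separate database server to run, which is a big part of why the single-binary setup is so painless.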
> The only solution is to self-host. Gitea is good.
I don’t understand your thinking and gitea’s marketing. They say in the same breath that it’s “self-hosting” and that they do “Git hosting… similar to GitHub, BitBucket, and GitLab”. — https://docs.gitea.com/
It's a "run your own GitHub" application, akin to GitHub Enterprise Server or GitLab CE/EE, except that unlike GitHub Enterprise Server and GitLab EE, it's open source.
As far as I am aware, they do not offer a hosting service. I believe that statement was meant to convey that the Gitea software, once installed, is a git host similar to the others. I think they were trying to differentiate between a typical remote git repo and all the web components that come with Gitea. They do offer paid support, but that's still for self-hosting.
Playing devil's advocate: all kinds of linting, vetting, or security scanning with any degree of smartness beyond a regex would probably fall into my definition of non-human eyes too.
Slippery slope, yes, but in reality there is a legal framework and a ToS in place as well.
These tools are enabled by the owner of the repository, and I think consent has precedence over license terms. But this seems like an escape hatch for using your code as training material.
If it doesn't explicitly say "we do not use private repo data to train our AI models", I wouldn't assume anything other than that's exactly what they are doing. They know this is a question everyone wants answered. Why would they leave any ambiguity? Let me help you answer that: because that's exactly what they are doing.
Everything referenced in this webpage refers only to public data, or private data where a user has enabled the dependency graph
>data from public repositories, and also... data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph
So if you don't enable the Dep Graph feature, your private code is treated according to the standard terms of service.
Well, they could silently do anything. No need to toggle the user-visible checkbox.
It's like giving a party some of your possessions to store, and signing a form saying you don't consent to them snooping through them, then worrying "what if they change and forge the contract later?". They wouldn't need to. Either you trust them to keep their word or you don't trust them with your possessions at all
And they can also repurpose your data for ML, or otherwise illicitly pilfer your data, without you turning a random dependency visualization on. There seems to be an idea here that turning it off will change something, like how there are people adamant that Windows 10 has secret tracking features, yet believe they can defeat them using regedit. I'm afraid this is the technical equivalent of "swiper no swiping!"
And to be clear I'm not saying Github or Windows is doing this, I'm just saying either you trust them to keep their word or you don't
I'd like them to limit the scope of the data use to the dependency graph ONLY, and for it not to be stored. I swear this goes without saying... but these blanket self-granted permissions on the part of GitHub are a serious breach of trust. They don't need this to provide the dependency graph service.
My understanding of the way dependabot works is that they read your repository's manifest files to provide alerts about vulnerable dependencies. Then after you patch the repository they re-read it to identify the new package version you've moved to, and if there were any build errors introduced from the update. That info is anonymised & fed into an aggregated dataset, creating suggestions for other users like "95% of repositories impacted by this CVE upgraded from v1.0.4 to 1.0.5". Dependabot doesn't read through your code (unless your code is in `.manifest` or `.package` files).
That might still be too much for you, everyone's line is at a different place. Dependabot works better the more data it has, so personally I'm fine with that level of detail being returned to an aggregated service, for that very specific purpose.
There are other private providers of dependency scanning out there. I'm fairly sure they all work the same way Github does, just with smaller datasets.
I am not disagreeing with you, but by your own admission it does not read your code, only the project's manifest. Then why on Earth do they ask for so much more, when all they need to do is bump some dependency version in the manifest?
As for the aggregate stats: why not limit the scope of access to the build status instead? They don't need anything more to provide this service...
As far as I can see from all of their documentation, they exclusively open a predefined list of common manifest files. They explicitly mention that they won't be able to do dependabot analysis if your deps are listed in a `setup.py` file for example. You'd need readonly access to the entire repository to actually find those manifest files though, as most of the time they're not going to be in the top-level directory.
For companies & people not willing or able to surrender that level of access, it's still possible to use the dependabot API, dissociated from any github repository (for example if you have an Atlassian git repository). You just send it a list of dependencies and an enum indicating its current build status, and they'll reply with a list of suggestions.
I doubt there's any legal way to say "we'll only read your manifests". Programmers may know what that means (and still debate it, honestly), but I'm skeptical the law has the ability to distinguish between the designated utilities of specific files.
If their AI uses my code as the basis to generate similar code, how is that not a derivative work?
If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.
How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?
It's their problem to build lineage in their system.
Breach of licence and copyright laws is still illegal even if it's very well concealed.
Lots of their customers' code requires attribution, or specifies acceptable uses (notably, what kind of license the product built with your code is allowed to have). Beyond simple copyright, which anyone can claim, they're also concealing this.
> It's their problem to build lineage in their system
From a practical standpoint, I don’t see how that is true.
In order to establish (civil) liability in the United States, someone would have to file suit, survive the inevitable standing challenges, and demonstrate on the preponderance of the evidence that they had been harmed by this conduct.
If you can’t prove your code was reproduced by this model, I don’t see how you can successfully do that.
It’s like saying you own an image after JPEG compressing it. There are going to be lawsuits, but MS is so big they might just consider settling them part of the cost of making sure they are on the cutting edge of generative AI.
Besides, the "except" part qualifying the "human eyes" is substantial.
GitHub is a Californian company; they are beholden to US regulations that require them to hand over data upon legal request, and they can be prohibited from telling you they did. These requests happen frequently.
> In 2022, GitHub received and processed 432 requests to disclose user information, as compared to 335 in 2021. Of those 432 requests, 274 were subpoenas (with 265 of those subpoenas being criminal or from government agencies and 9 being civil), 97 were court orders, and 22 were search warrants.
Concerning national security letters:
> We’re very limited in what we can legally disclose about national security letters and Foreign Intelligence Surveillance Act (FISA) orders. We report information about these types of requests in ranges of 250, starting with zero. As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.
Now everyone knows why Codeberg was invented. This type of thing was entirely foreseeable. Good luck not having all your IP be copy-pasteable by anyone using an LLM.
sigh I moved all my projects (from Bitbucket and Github) to Gitlab.
Now it appears Gitlab is feeding its data to Google [1].
I do have a codeberg account so will consider moving my repositories again, although it's probably too late and has all been sucked up by Google already.
GitHub/MSFT/OpenAI’s stance is that training AI models is fair use and doesn’t require a license. It doesn’t matter what kind of license you slap on the code. If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use, not a new software license.
> If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use
True, assuming that GitHub/MSFT/OpenAI's stance is correct. Just because they have that stance doesn't mean it is[1].
In the end, these gray areas will be decided in court.
[1] I personally think it probably is, which is why I removed all of my stuff from the open web and these sorts of services -- I see no other way to prevent this use of my data.
The policy says human eyes will never see the contents of your private repositories. Suppose, as you say, the repos are one day fed into an AI. How does that policy then constrain the AI later?
AIUI the AI can't be used to answer any questions based on knowledge from training that included private repositories, since (still AIUI) there's no guarantee that contents from the training won't appear in the output.
But there seems to be a loophole for yes/no questions. An AI that answers only yes/no could be trained on private repositories, or run on them. So they could, say, train an AI to recognise repos that require working hours from their legal staff, or from their support staff, and then run that AI on all repos to locate repos that cause expense in the future. Things like that seem possible.
And an AI might answer questions like this: "Does any repo contain the string 845fjkef5urwejf in a Kotlin file?"
GitHub owns the Copilot model; it is based on OpenAI's Codex but they own and operate it themselves. So technically feeding it into their AI would be OK by those terms. That's why Copilot still exists even after Codex was killed off by OpenAI.
We’re still in the nascent stages of code models. There’s ample opportunity to transition to private models before the technology reaches maturity. As for the destiny of current private GitHub models, they may contribute to training models (through what’s referred to as “aggregate data”), but the potential backlash could be considerable, particularly when considering the bans on using ChatGPT that various firms have implemented.
Indeed. I wonder how many people don't see this. Or don't want to see this. Nobody should assume that a private Github repo is really private.
My employer (one of the largest in the world) has fully migrated to Office 365 and a lot of the new source code is on private Github repositories. But no worries, the humans working at Microsoft will not analyse the data. This is only done by AI.
Seems like one has to assume corporate repos have them. If they don't already, they could decide to at any time. If that's a problem, self-hosting puts the power back into the hands of the users (and can be done very cheaply - I'm paying ~$5/month for a server that's doing a bunch of stuff including very basic personal git hosting)
How do you think they provide all the services they provide? Secrets scanning, contextual source information, everything. I'm not sure how anyone could have gone this long and been surprised by this.
Next people will be surprised that GitHub has the right to show your source code on a website....
There is no current reason to believe that Copilot or ChatGPT are trained on private repositories; however, this agreement does permit them to do so.
The concern is: why not? You have an enormous corpus of professional-grade software just sitting there, one bit flip away from access, and you are even permitted (to the letter of the law) to flip that bit. Why wouldn't you, eventually?
>however this agreement does permit them to do so.
How so? Assuming you read the whole agreement and not just take a single line out of context?
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
> GitHub considers the contents of private repositories to be confidential to you. GitHub will protect the contents of private repositories from unauthorized use, access, or disclosure in the same manner that we would use to protect our own confidential information of a similar nature and in no event with less than a reasonable degree of care.
When talking about confidentiality, GitHub is careful to mention unauthorised access; the implication here is that GitHub has the authority to authorise themselves.
The next section talks about how they will grant access to third-parties, but it's clear that as a first party they exclude themselves from the moniker of "unauthorised".
I agree; however, taking random pieces of a long agreement in isolation is not how you read terms of service. The section you highlight is NOT, for example, the one that talks about what legal rights you give GitHub regarding your code. Construing it as such and then extrapolating from that is taking a piece of text way out of context.
The section cited is the most prominent section in the terms of service regarding private repositories[0]
Everything else in that section relates to personnel (which is not the concern of the commenters here), what happens if you upload copyrighted material, and codes of conduct.
Also, taking segments of the ToS is exactly what will happen in a court of law; unless there is another section that invalidates this one, this fragment will be argued as consent.
Additionally, the context of that section is otherwise very clear. Sorry.
The way the ToS will be argued is: there are many references to processing information, and nothing that explicitly prohibits, or can reasonably be inferred to inhibit, automatic processing; in fact there are many places where they make reference to "automatic processing for services". You can take that to mean it's just their SAST scanners; they can take it to mean training an AI model. There's no distinction legally here.
You mean like the earlier section that explicitly says "License Grant to Us" which the court would somehow amazingly ignore reading according to you? Because clearly the exact legal rights you give Github to use your work is not relevant to a discussion on how Github is legally allowed to use your work.
I can tell you're a bit defensive about this, so let me be clear as possible: talk to a lawyer.
I am not a lawyer myself, though I consult with Swiss/German lawyers quite often regarding licensing because it is a large part of my role.
The fragment I mentioned is the beginning of section "E"; this document uses letters as the top-level section separator, so a different letter from the prior one is taken as an intentional separation for the purposes of a legal document. What you are referring to is from section "D"; i.e., they are in entirely different contexts from each other.
Even then, you do not need to have a license to automatically process source code, as per the text of this ToS.
There are more than a few points where they talk about what they have the rights to read, and there is only one point (which is legally required to be there, fwiw) which states they can't release your source code as if it's theirs.
If you read the Copilot terms of service, they are not granting you a license to use the suggestions.
A very pessimistic take is that they are quite well protected legally in using source code derived from these repositories, even without the muddy discussion about training data and derivation.
It's a little different. Microsoft doesn't care about consumers; they stopped being relevant to Microsoft's financials in the Ballmer era. Even consumer Office spending brings in less than LinkedIn, to put into perspective how little they care.
However, enterprise/commercial? The same group you're trying to sandwich between a 365 subscription and Azure/GitHub? These are the people who, when they do have evidence that they are being negatively affected, will cause a massive dent in Microsoft's bottom line.
In that case they needn't put this in their ToS. It's strange how people will distrust a company with one hand, then claim their word as reliable evidence (and think they can be stopped by unticking a checkbox) with the other. If they wished to deceive you, they would do it without telling on themselves.
Any sane business decisionmaker would want legal cover for something like this in case it draws a lawsuit. Regardless of whether they themselves think it's okay, doing it entirely on the sly makes it more likely an opposing lawyer can convince a judge/jury that it's not okay. If they have an agreement that the client "accepted" that strengthens their position quite a lot.
Do people want Github to display files in your private repos on github.com? Do you want syntax highlighting and intellisense? Do you want them tokenized for search? Should they be available for cloning when you need it? Should they run security scans and identify vulnerabilities? Should you be able to see git blame, diffs, commits, other metadata? Set up build pipelines? Modify contents via an API?
Tell me – how can Github do all of this without programmatically accessing the contents of your private repos from their servers?
As I see it people are basically taking a single line of an agreement that explicitly references other agreements out of context:
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
No, I don't want these features. I want reliable remote git storage, which lets me clone a copy of my source code anywhere so I can do these things locally, on my own machine. People forget that this was the original design and intent of Git as a decentralized protocol. As an industry I think we have become too reliant on Github as some kind of way-far-beyond-git magic tool. This might be fine until it isn't--like when Microsoft decides it's time to start training Copilot on your private code.
Use another place to store your git projects then, it's not like Github is the only one. Github is a PRODUCT built around git, not just a place to store your git repositories. I'm pretty sure you will be able to find other services which only exist to host git projects and do nothing else.
You say this as if you think most people have a choice...I've pushed hard in the past for my employer to move off Github but have been met with nothing but blind resistance.
Being concerned for your own personal reasons about Microsoft scanning your code is one thing. How does it make sense for your employer to switch off GitHub?
But that's not your code, that's your employer's code. You might have written it but they fully own it. This doesn't seem like something to be concerned about on their behalf.
Also, for what it's worth, I was extremely surprised and concerned when my private repos appeared in code search results. I had to triple-check that it wasn't showing up when logged out or logged in to a dummy account without access to the repo. This behavior is definitely not intuitive and if there's a way to disable search indexing on private repos I'd do so.
Or: how could any provider, even a self-hosted one, provide any of that without automated processing of the contents of private repositories?
2. Build small- to mid-size businesses, that can prioritize healthy growth – like focusing on users, being kind to employees, building brand equity over time.
The pump-and-dump grift cycle of enshittification has got to stop.
As soon as finance capital gets involved, the core objective of business shifts to primarily providing higher returns for the investors. Small to medium businesses can have diversity but once they "go public" they all are almost alike.
A few companies in tech have plenty of goodwill, even if not all are privately owned: RedHat (currently burning theirs), Mozilla, 37signals, Khan Academy, Mapbox, Komoot, DuckDuckGo, Brave, etc.
Not all these have a viable business model, but they’re not obsessed with growth and brokering user data.
Yeah, as soon as MBAs and financial investors get involved (which, for "inefficient" small companies, is just a matter of time), the companies change to become an exploitative cancer on society like the rest.
There's a whole industry looking for companies that are too nice to their employees and customers, to buy them, exploit them, grab the profits, and leave a husk behind.
I suppose that it's impossible to do a hostile takeover of a private company. All it takes is to stay financially healthy and reject offers to sell!
Much easier said than done. Sometimes the owners just get tired or bored, and want out. They sell the company to the highest bidder, within reason; often there's no crowd clamoring to buy a small, non-growing company.
Receiving files, copying them within a file system, and transmitting the contents? Clearly not looking.
Computing a hash for each file? Doesn’t seem like looking either.
Parsing the contents, or tokenizing them for a ML model? That’s looking IMO.
I think the defining factor is the size and nature of the derived state update on the server. Copying and transmitting retains only metadata about the file. Computing a hash produces only a handful of bytes of state about the file, and there’s nothing that can be inferred about the file’s contents from the hash. But if you start storing something like word frequencies, that’s not neutral anymore, it’s a “look” into the contents.
Yes, that’s clearly GitHub looking at the contents. At that point you just have to trust they don’t do anything else with the internal representation they’ve built.
When Gmail was introduced two decades ago, there was some short-lived furore about the fact that Google admitted they were reading your emails for ad targeting purposes. To a lot of people like myself, this seemed a bridge too far when you’re used to servers that don’t “look” at your private stuff. But it certainly didn’t hurt Gmail’s adoption. (I still don’t use Gmail because of this original lingering distaste.)
They don't "read" emails for ad targeting anymore (according to them). But they certainly do for search and spam filtering, which are features that users expect.
They do code highlighting, search indexing, and symbol resolution. I don't know if those can be disabled, but those features require looking at the code and updating state (indexes and such, maybe caching of highlighting results).
Postman doesn’t provide tools to diff your postcards, search them, scan their content for security issues, or automatically take actions based on the contents of your postcards.
A postcard is not a letter; its contents are written in the open, right next to the postal address, and are fully visible to the postman. I imagine that's why the OP chose specifically a postcard for his example.
It really sucks when you want to keep a secret but keep getting asked the right questions. It would have been much easier for Amazon if people just took the first statement at face value.
Not related to private repositories, but GitHub also keeps logs about commits in the "Activity Graph" even if those commits no longer exist.
For example, clearing the full reflog of a repository and then either force-pushing over it, or deleting the remote repository and pushing it again under a different name, will keep the activity attribution. In the second case it might even attribute the commits to the new remote (404ing when clicking "Created x commits in y repositories").
Good for team workflows and pull requests, but annoying for public repositories with intentionally rewritten history.
Reflog is data local to a git node, not part of the repository. In particular, reflogs are never transmitted between remotes, so “clearing the reflog” (presumably on your local machine) has nothing to do with removing any data at GitHub. You need to contact support for that, which is what the docs advise you to do if you’ve force pushed over something you want fully deleted.
No problem. One easy way to gain confidence that data has actually been wiped instead of just obscured is to use one of the gitrevisions [1] specifiers that sources data from the reflog. For example,
master@{yesterday}
resolves to whatever the reflog thinks the master branch was pointing to 24 hours ago. There are a bunch of places in the GitHub API and UI that look like they only accept a branch or commit OID but actually accept a revision specifier, so you can probe those with a reflog-based specifier to check whether supposedly wiped data still resolves.
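A local sketch of how these specifiers behave (assumes git is installed; run in a throwaway directory; all names are illustrative). Reflog-based specifiers resolve against the reflog, so they can surface commits that `git log` no longer shows after a reset:

```shell
# Reflog-based revision specifiers can resolve "deleted" history.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
old=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"
git reset -q --hard HEAD~1      # "second" vanishes from `git log`...
git rev-parse 'HEAD@{1}'        # ...but the reflog entry still resolves it
```

Time-based forms like `master@{yesterday}` work the same way, just keyed on timestamps instead of entry position.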
Apart from the misconception, I worded my initial post in an imprecise way.
I did not know whether or not the reflog relates to what's pushed to the remote. I just had the expectation that an empty git log as well as git reflog output would mean that no objects exist except for what represents the tree of the current HEAD.
Still haven't tried the revision specifiers on a repository that I expected to be wiped, but will do that. I need to read more docs about git plumbing and the .git folder.
> I just had the expectation that an empty git log as well as git reflog output would mean that no objects exist except for what represents the tree of the current HEAD.
It’s a pretty common expectation, but the nature of git’s object model means that generally it’s not true for any git provider. You should assume that data pushed to any git remote will live forever unless you both force-push away all references to the data and GC all remotes (which for GitHub means contacting support, IIRC).
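A minimal local demonstration of that object-model point (assumes git is installed; run in a throwaway directory; the file name is made up): objects orphaned by a history rewrite stay in the object store, addressable by hash, until a GC actually prunes them:

```shell
# Orphaned objects survive a history rewrite until GC'd.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
printf 'hunter2\n' > secret.txt
git add secret.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "oops, committed a secret"
blob=$(git rev-parse HEAD:secret.txt)
git reset -q --hard HEAD~1      # "remove" the secret from history
git cat-file -p "$blob"         # prints: hunter2
```

On a remote you can't run `git cat-file` directly, but the same retention applies server-side until the provider GCs, which is why the advice above is to contact support.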
Yes, but that is the standard Terms of Service, which applies to public repositories as well. The terms appear to be the same as for public repositories; there doesn't appear to be a carve-out.
I posted about this a week or so ago. Someone could come steal all of our "secret source" and still have absolutely no advantage over us. If the value of your business is entirely in the source code, you need to seriously re-investigate what exactly it is that you are doing. AAA gaming and Windows codebases have been leaked. In some cases partially, in others comprehensively. At no point do I recall Valve or Microsoft seeing adverse competition or other meaningful impact from this event. Legal teams were certainly up in arms, but in terms that investors would care about nothing really happens.
Turns out, you need more than source code to run a business. You need customers, markets & relationships. You need trust. You need a sales team. That's the real thing you should be worried about being stolen. Your goddamn sales team. Obviously the developers would be worried about the code base (and they should be) but at scale it's not the end of the world if some code gets out. Losing your best sales person & all their prospects would be much more catastrophic for your business.
Yeah, it totally makes sense to compare gigantic companies to all other ones. Let's also extrapolate what should happen to a subcompact in a collision by looking at what happens to a tank in those same circumstances.
Coca Cola’s empire surely isn’t dependent on Coke’s recipe, yet they’re still keeping it a secret. Same goes for the vast majority of for-profit organizations who do not want their core product’s recipe exposed, for a variety of reasons.
> If the value of your business is entirely in the source code, you need to seriously re-investigate what exactly it is that you are doing.
Most software companies' "value" (liquidatable assets) depends almost exclusively on their IP, as embodied in their patent portfolio and their source code. Their people may be more valuable, but people can't be traded or used as collateral.
Nothing is impossible to hack - I don't think GitHub themselves would agree that it's impossible to hack. But not using Github, and having your own git hosting also doesn't make it impossible to hack.
If there is proof that they are training their commercial AI from private professional source code it could feed a COLOSSAL class action lawsuit. I understand that 1-3 percent of the American GDP goes to lawyers who work on contingency for these kinds of class actions. These guys eat what they kill, and they feed on corporations. That doesn't mean that Microsoft won't try it.
This is the kind of privacy that I (along with a large majority of unsophisticated tech users) care about most. I don't care if machines access my data and serve me better ads; I do care whether customer data can easily be viewed by an HR person / angry ex / private investigator / controlling ultra-religious parents / crooked cop in a third-world country with a bullshit "this person wants to commit suicide" excuse.
I have actually talked with non-tech-savvy people about this, most of them have heard of tech companies tracking their users, and most of them genuinely believe that there are tech company employees out there going through their private lives. This is why I really dislike the use of the word "tracking" in a tech privacy context.
I've heard plenty of non-technical users complain about ads that they think were shown based on tracking; most often the complaint involves suspicion that an idle phone's mic is listening.
I've also talked with non-tech people about how the tracking and data mining world actually works, I've never had someone say "so what?". It may simply be that non-tech people aren't as aware of what the tracking does based on ignorance of the problem, not a genuine acceptance of the practice itself.
More like a happy marriage where your partner drugs your bedtime cocoa and then grates your feet with a cheese grater for some reason that you really don't want to know.
That's hyperbole. I think it's closer to knowing that your spouse may leave you and expose your nudes to the world. It's still not a comforting thought, but there's no risk to your (or your project's) existence.
You’ve taken this out of context. This was specifically referring to “scanning”, i.e. vulnerability scanning. I doubt it would hold up in court if it turns out their “private” data was used to train AI.
GitHub: We are owned by Microsoft. And Microsoft is known to secretly forward much or all of its users' data and communications to the NSA (National Security Agency), as part of the PRISM program.
That's one of the nicest features of SourceHut, IMHO. There's a real person to talk with, and some hard principles behind the service. Not on the same page? Nobody forces you to use it.
The related update, in short, says this:
You have a legit blockchain project, tell us, and we'll allow you. Crypto? We most probably don't allow it [0].
The wording is this (verbatim):
We will exercise discretion when applying this rule. If you believe that your use-case for cryptocurrency or blockchain is not plagued by these social problems, you may ask for permission to host it on SourceHut, or appeal its removal, by contacting support.
I'd rather have my project removed by a person I can talk with, rather than a corporate chat bot pretending to be human, and have my source code not be harvested without my consent for a service that is then sold to me at $10/mo.
SourceHut planned to ban the Go module mirror due to the bandwidth it consumed; then they talked with Russ Cox, and the Go team agreed to implement caching in Go (the -reuse flag, IOW) [0].
I think this is a major win, because assuming everyone has unlimited 10 Gbps symmetric dark fiber at home, at the office, in the car, and on their mobile phone is wrong on so many levels.
Git doesn't need a hub. But a hub is convenient enough to use compared to self-hosting.
Just like many of us here know how to freely download pirated torrent mp3s safely, yet pay for Spotify/Apple Music as it's more convenient (also legal, but I doubt many care about that, especially here).
Right. And all the GPL, MIT, and CC Sharealike-with-attribution material I've been stashing on Github is going into the vast code anonymizer called Copilot.
The good news for me the open-source developer: it looks like the closed-source code people write with the help of Copilot and all the open-source code in its learning set will also feed into that learning set.
I guess code just wants to be free -- free as in speech, expensive as in yachts and fine wine for VCs and private equity pirates. Plus ça change, plus ça reste la même chose. Sigh.
...but they will see some kind of mash-up, remixed, synthesized version of it that might be 99.9% similar so practically this means absolutely nothing.
That explains their move to make private repos free a few years ago.
What frustrates me is that it feels like they didn't announce their intentions of what they'd do with your code. They use the same kind of verbiage in the T&C as they would use to say they collect some usage telemetry and might occasionally share it with partners. And that is definitely not the same.
I have been thinking about self-hosting my private repositories for some time now. I also want a simple web interface for convenience. I don't need a "GitHub clone" like Gogs or Gitea.
I feel like git services like github would be well-suited to join the fediverse. It's clear to a lot of us that our code is becoming too centralized, but it's also hard to deny the usefulness of being able to search across so much code in a single place.
I'd also expect that users of github would have an easier time than the average person figuring that change out.
It's an interesting thought but GitHub specifically has absolutely no reason to want this; doubly so after the Microsoft acquisition. Sourcehut, Gitea.io, and Codeberg would be the places to start.
How usable is Pijul and its Nest forge these days? Maybe something low-stakes but (for me) highly used, like my dotfiles, would be a good fit for learning my way around over there.
I'm getting a 500 Internal Server Error on their pricing page, which I assume may have something to do with this post on HN. Anyone know what it costs to have private repos there?
As the author, I do. Pijul is usable, we use it for itself, no problem.
The Nest forge is still a bit experimental, since we've recently rewritten it using tech that isn't as ready as it promises (looking at you, Cloudflare Workers).
A private repo is 5€/month, plus 0.01€ per GB per day for storage above 100MB.
And we don't do the kind of stuff discussed here, we're just offering storage and collaboration tools, and intend to keep it this way (as well as open sourcing everything).
It's starting to look like GitHub is slowly arriving at its own funeral. Lots of pointless site changes that hide or relocate important functionality, a new Twitter-like feed that tries to recommend all sorts of crap you don't care about, and now this. It's a shame, as GitHub Actions is not half bad.
I was kind of making a joke that Git is much more important than any individual forge software though, because Git is a decentralized protocol. Any forge software you feel like running will work.
I personally don't use gitweb, but I do use cgit. It has all the features I care about (making my git repos available to view on the web). If you find that you want different features, you should use a forge that provides those features.
This really shouldn't be the step it takes for you to think "maybe I shouldn't give Microsoft all the source code that I think is important enough to keep private".
One wonders if there's a shadow field dedicated to coaxing these datasets which "will never show up, don't worry" out of LLMs.
I accidentally got a few hundred fake(?) names out of the codex models because I asked it to LARP as a professor (I was hoping for better code). Probably almost all hallucinations, but I wouldn't be surprised if some had some truth to them.
We'd still need a way to know when someone violates it. When you see a copy of your licensed software being used you can at least try to enforce that, but how do you know it was fed through an AI for training?
At best that would be like a Don't Track Me flag. Some will follow it, though ironically it is also useful for fingerprinting and can signal you may be doing something worth tracking.
"Open source" doesn't really help much without the "Free and" prefix, and unfortunately there's no way to reconcile software freedom and banning specific use cases. Dunno if you remember the "can we remove Nazis from Tor" controversy a while back, but it's the same kind of thing. It's either free or it isn't.
Does anyone have specific and unique enough code in their private repos to test whether Copilot and similar tools will actually recommend their specific private code or methods? (I guess this is the thing people worry about?)
"Rest assured that your code will only ever be viewed by a vast and unfathomable superintelligence whose motives are as yet unclear. Please be cognizant of any code or comments which may inadvertently invoke the shoggoth's displeasure." /s
Given this privacy policy, users should not upload their private data to GitHub, even if the repository is set to private. GitHub should only host public repositories or non-sensitive data.
No chance GitHub is anywhere near using private repo source code to train the public GitHub Copilot.
On the other hand, getting a personalized Copilot within your org is something entirely different and probably will be out in the not too distant future.
They could feed private repos into copilot. It might make copilot better. It's certainly useful to automate extraction of IP with plausible deniability. If someone notices, they can claim "oh that must be in a public dataset as well".
> By default, all public repositories are included in the GitHub Archive Program, a partnership between GitHub and organizations such as Software Heritage Foundation and Internet Archive to ensure the long-term preservation of the world's open source software
You no longer own your projects; they have the final word. Scary times.
Glad I moved away from Microsoft products many years ago, sad for my country