Has GitHub been down more since its acquisition by Microsoft? (nimbleindustries.io)
243 points by tdrnd on June 29, 2020 | 141 comments



GitHub is a lot more enticing now that it has free private repos and the launch of Actions. I suspect they're a lot busier than they were before Microsoft. More features also naturally mean more status tickets.


Has nimbleindustries.io been down more since Github's acquisition by Microsoft?


Sorry... we probably should have put a CDN in front of our humble Linode box earlier. Here's a cache link:

https://web.archive.org/web/20200629103627/https://nimbleind...

Edit: Cloudflare seems to be working now. Wow, that was really easy to set up.


How many req/s before you fell over?


I often wonder what kind of stack websites that fall over under HN traffic have.

From my perspective HN doesn't actually bring that much traffic, so I find it curious which software is so fragile that it falls over when getting HN front page traffic.


Before you have that need it seems like premature optimization. If I had a blog that was getting a handful of views/day, I probably wouldn't bother with CDN/scalability until it became an issue. Or use something like Medium (which I do).


Yeah, I've always thought blogs that didn't render to static pages were just sorta silly - for just a few hits a day, it's very simple, and if you get a flood of traffic, even a small vps could keep up.


:-) Cloudflare is amazing, yes. Everyone should use it by default, I say.


As an individual or small business I agree with you. Their free tier is amazing. And they seem to be awesome people with only the best intentions.

I hope for more diversity, for many reasons. One of which is that they are a known man-in-the-middle that can see the plain text of what appears to the user to be secure communication. They operate in the US, so they can be compelled, and compelled silently, to use their position to spy on people.

That said we use them at work and I use them personally. I'm not doing anything "risky" at all. But there are many cases in history where suddenly the government changes and now you are. And apparently black people are always doing something "risky" in the US.


Man, I really want to say we don't because of the implications of too centralised an internet, but there's no way it isn't a transitive dependency.


What the hell are you building that there has to be Cloudflare in the dependency tree at some point?


Well for our use case their caching infrastructure handles 63% of all our traffic and 70% of our bandwidth.

We'd have to basically quadruple our expenses without CF. Not to mention their available protections.


Any javascript CDN?


I'm sure there are more Node.js modules for that than you can count.


Until they go down too...

I'd vote for more diversity instead.


Stop embracing a single company. Everyone goes bad once they figure they're the monopoly.


It's hard for me to trust your products if you didn't even have Cloudflare in front.

EDIT: I don't mean to imply the product is bad, sorry


I imagine this is just for their blog, which is probably an afterthought.


That's correct, this was only our blog. We've only just recently begun an effort to publish more frequent and more interesting content. Although OP's point is fair.


I would trust it less.


The point is attention to detail. Their blog went down in a matter of minutes.

Can you be sure their product will not?


That’s a statement


Yes


I think it's pretty fair to say we've all been pretty down since GitHub's acquisition by Microsoft.


I'm on sourcehut. What's a microsoft?


A few years from now it will be the brand of the most widely used Linux distribution.


Do we know it's not there already? Can't quickly find any install counts.


Both WSL 1 and WSL 2 are optional installs, so the number is not that big. curl.exe, on the other hand, is installed by default.


an Eager, Easy, and Exciting company to do business with. :)

edit: MS in general, not GitHub specifically.


[flagged]


it must be a very niche market, people who prefer small toilet paper.


This is the comment I wish I had made.

This is just clickbait. The "GitHub down" schadenfreude has really gotten tired.


Taking the numbers at face value, this is a good exercise in a textbook statistical hypothesis test.

Incidents: before, 89; after, 126. What is the chance of this happening if the 'rate' of occurrence has not changed?

Assuming an unknown but constant Poisson rate, we get the probability of observing what has been observed to be 0.00225.

A fortuitous thing about this test is that one does not need to know what that unknown constant rate is.
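
A minimal sketch of that calculation in Python, using the conditional binomial form of the two-sample Poisson comparison. The observation-window lengths are placeholders here, since the exact before/after periods aren't quoted in this thread, so the output won't necessarily match the 0.00225 above:

    from scipy.stats import binomtest

    before, after = 89, 126       # incident counts quoted above
    t_before, t_after = 1.0, 1.0  # observation-window lengths (placeholder: equal)

    # Under a constant Poisson rate, conditional on the total count,
    # the "after" count is Binomial(n, p) with p = t_after / (t_before + t_after).
    n = before + after
    p_null = t_after / (t_before + t_after)

    result = binomtest(after, n, p_null, alternative='greater')
    print(f"one-sided p-value: {result.pvalue:.5f}")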


> Assuming a Poisson rate

That feels like a mighty big assumption. Probably big enough that trying to calculate the probability is more misleading than enlightening.


As I mentioned, my comment is meant as an exercise. If we were to take the numbers more seriously, due diligence is necessary. That said, if we assume that one incident does not affect the other, then the Poisson nature falls out as a natural consequence of that independence and the assumption of a constant rate (our Null hypothesis).

As long as the incidents are spaced out enough that the possibility of one incident affecting another is low, Poisson can be surprisingly realistic. Quite remarkable, given how simple it is. All in all, not that bad an assumption for a back-of-the-envelope calculation in a meeting.

In practice, however, given more time, I would be looking at the statistics of inter-incident times more carefully. If those look sufficiently different from Exponentially distributed, a non-Poisson renewal process might be more appropriate than a Poisson process.
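
A rough version of that check, as a sketch only (the timestamps below are made up for illustration, and estimating the rate from the same data means the standard KS p-value is only approximate):

    import numpy as np
    from scipy import stats

    # Made-up incident timestamps (days since an arbitrary start), purely illustrative.
    incident_times = np.array([3.2, 10.5, 11.1, 19.8, 33.0, 34.7, 51.2])
    gaps = np.diff(np.sort(incident_times))

    # Compare inter-incident gaps against an Exponential with the same mean;
    # a tiny p-value would hint at something other than a simple Poisson process.
    ks_stat, p_value = stats.kstest(gaps, 'expon', args=(0, gaps.mean()))
    print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")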


But why even start with an assumption that is so likely to be wrong? We know that incidents are frequently correlated. We know that scale and complexity add fragility. We know that GitHub has gotten bigger and more complex. The chance of the probability distribution holding constant over 5 years of major growth is basically zero.


We are talking about different things. One is about attributing causes to an increase in failure rate, the other is about verifying whether there is any material increase in the rate at all. My comment addresses the latter as a back of the envelope calculation.

Strictly speaking, when examined with a fine-toothed comb, yes, the assumptions are very likely wrong. All models are wrong [0], but some of them are useful.

The question is whether we can get some useful conclusions from such a simple model. In my experience I have been surprised by how often low failure rates are captured well by Poisson processes. Yes, the assumptions could be wrong, but are they very likely to lead to wrong conclusions? Empirical experience and the math say otherwise.

There are sound reasons for why this happens. If you are interested, you can pick that up from Feller. These [1] [2] links might also help.

Given the data that we have, it's a plenty good first cut, but that's what it is -- a first cut. With more data one can do a more refined analysis.

[0] https://en.wikipedia.org/wiki/All_models_are_wrong

[1] https://en.wikipedia.org/wiki/Poisson_point_process#Approxim...

[2] https://en.wikipedia.org/wiki/Poisson_point_process#Converge...


I, for one, enjoyed your little foray into the field of statistics. So much so, that I'd love to learn some of this stuff as well! I'm a bioinformatics student, so I have a rigorous math background (analysis, linear algebra), but for some reason, our course is quite light on statistics and probability.

What resource would you recommend to get an intuitive grasp of statistics?

To give you an idea about what kind of resource (book) I'm looking for: I'm currently reading Elements of Statistical Learning and I enjoy that it has all the mathematical rigour I need to really understand why all of it works, but also that it's heavy on commentary and pictures, which helps me to understand the math quicker. Counterexamples: Baby Rudin on one side of the spectrum, The Hundred-Page Machine Learning Book on the other.


Hi Eugelo, I am glad you liked it. Feller is not a stats book, nor is it a machine learning book, but you might like it. It is full of cute and relatable exercises.

Books like ESL are front-end books; they cover the shiny stuff and the methods. Feller is more of the back end.


Thanks for the recommendation! When you say "Feller", do you mean this book? [1]

I'm already looking forward to it.

[1]: https://www.amazon.com/Introduction-Probability-Theory-Appli...


That indeed is the book; just to set expectations, it's an old-style book. I don't recall any diagrams. You may find an online copy in the usual places to check it out before you buy.


No worries, I'm carefully weighing each purchase, especially when it's one third of my monthly budget (greetings from Europe!).

It seems to me that it's considered a classic, although I've never heard of it (probably due to my ignorance). Do you have any more such nice recommendations up your sleeve? Don't limit yourself to probability, I'm looking for some reading for the summer :-D


You can try "All of Statistics"; it's a concise but useful and modern take on statistics. For a different approach I quite like Allen Downey's ".. for the Hacker" series. For stochastic processes Parzen's "Stochastic Processes" is a nice and approachable read. If you want to go down the rabbit hole I would recommend graycat's comment stream here on HN. From time to time he posts about books to read.

You said you are familiar with linear algebra. The logical next stop could be Hilbert spaces. This looks at functions as vectors and analyzes their properties using linear algebraic tools that work even in infinite dimensional spaces. It sees quite heavy use in traditional machine learning. Before diving into Hilbert spaces proper, you could revisit linear algebra in Halmos' "Vector Spaces"; there he pretends to teach you linear algebra but actually teaches you about Hilbert spaces -- in other words, he teaches you linear algebra but without the restriction of finite dimensionality.

And you are right, books are so damn expensive. India is somewhat better in the sense that we have 'low price editions': same content but printed on lower quality paper, not the prettiest things, but very student friendly. Note these are legit printings, not pirated copies.


I see where you're coming from and I agree with you overall, in particular how this is a first cut approximation.

But I still want to nitpick the details a bit. If you want to determine whether there was a change in the failure rate, you need to use rate statistics - failures per service-hour. Your analysis is only using the numerator while we know the denominator (the number of services in GitHub that can go out) has increased over time - GitHub Actions and Packages are relatively new.
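
To make that concrete, a tiny sketch where the incident counts are the ones quoted above but the service counts and period lengths are invented purely for illustration:

    # Incident counts are from the article; everything else here is hypothetical.
    incidents_before, incidents_after = 89, 126
    services_before, services_after = 8, 14   # invented status-page component counts
    months = 24                               # invented equal-length periods

    rate_before = incidents_before / (services_before * months)
    rate_after = incidents_after / (services_after * months)

    # With a growing denominator, the per-service rate can fall even as raw counts rise.
    print(f"incidents per service-month: before {rate_before:.2f}, after {rate_after:.2f}")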


Totally agree with your second paragraph. As I said, it was more of a textbook exercise. If the volume of traffic in the two periods were available, it would have been possible to do the kind of analysis you indicate.


Maybe Github has grown significantly since then. More complexity, more down time.


Well, Github introduced way more new features since Microsoft took over.


Also, according to the article itself, "[it] could be all a part of coordinated effort to be more transparent about their service status".


Maybe Microsoft is less averse to complexity - which would still make uptime degradation land squarely on Microsoft’s shoulders. Microsoft seems to be an organization that actually introduces too much complexity to their products imo.


Microsoft has every right to move fast and break things on their watch.


This was exactly my thought.

More or less downtime (as reported by a status page) is probably affected more by changes to policies re: how / when incidents are posted publicly between "Github" and "Microsoft Github".

Subjectively speaking (using Github daily) I haven't noticed a difference. In general, Github has never been extremely reliable even pre-Microsoft.


What was the YoY change in incidents prior to the acquisition? I’d expect incidents to grow as a function of employee count at least.


It could also mean that they are more active and creating more chaos due to more changes.

shameless and useless advertising.


On my team our downtime is very closely correlated with how much we're working and changing the product. I'd be surprised if GitHub is much different.


Or in other words, facebook is more stable on the holidays.


exactly. Before M$ there wasn't much product movement at Github :/


I think they got out of the rut before Microsoft's involvement, but I suspect MS has allowed them to accelerate a lot of it with more investment and a more defined direction.

I've heard that for a number of years the Product org at GitHub essentially considered the product "finished", and that it did not need more features. Things like the rise of GitLab and the "Dear GitHub" letter, plus I believe a change of leadership, helped that completely turn around. Obviously that took time to yield benefits, so I think they've been in a much better place for quite a few years now.


> and a more defined direction

Except for Atom :(


That's pretty much their conclusion

> [..] GitHub has been down more since the acquisition by Microsoft. But that could be all a part of coordinated effort to be more transparent about their service status, an effort that should be applauded.


That's not a conclusion, it's a theory.


Or hypothesis.


Personally, I'd prefer some availability over sparkly features.


Not counting the redesign, there haven't been "sparkly" features. Sparkly features are the kinds of things the old management used to let engineers have free rein to implement.

Github pre-Microsoft was rudderless. They were more prone to implementing silly 3d model diff tools and things instead of supporting enterprise features or building powerful CI/CD tools and automation.

New Github is on the right track.


- Personal Status

- Round Avatars (all catgirl ears are now cut off)

- Collapsing of some similar messages on the Dashboard into one (i missed some things because of this)

- Not just one, but several redesigns (im old)

- The whole "marketplace" functionality (imitation of an app store)

- The "explore" and "trending" functionality (i see this like a Facebook feed)

Not sure about others, but i used GitHub as a git host and issue/MR tracker. All the other stuff is just distraction in my eyes.


> (a bunch of comments about the UX redesign)

I agree. I'll add one: README files with tables now require horizontal scrolling, and that's utterly disappointing.

> The "explore" and "trending" functionality

I do work in ML and computer vision, and this is how I discover all of the new models and code people are using. It's awesome.

> Not sure about others, but i used GitHub as a git host and issue/MR tracker. All the other stuff is just distraction in my eyes.

You're missing out! Deploying code is so easy once you're using CI/CD. Github actions are powerful.


Many of my projects have a fixed scope and thus are 'done' at some point. If you develop like this, CI/CD is rarely worth the effort. It's better suited for projects that suffer from scope creep or regular upstream breakages.


But before that 'done' point, or for a new project, GitHub Actions is a good feature introduced by Microsoft.


Yeah, it's only needed for projects that might change. Anything you can guarantee will never change is fine.


changes = downtime. Has been since the advent of computing


Not if you have a competent devops team and a deployment pipeline.


Nonetheless, we are building and operating complex systems.

Testing something 100% has diminishing returns.

Therefore, you will never be able to prevent all issues. And since you are not aware of how many changes are happening, you don't have a value that could indicate the healthiness of the system.

I personally think that as I'm getting older and better at my job, I still make errors; the amount and severity are going down, though.


I'm sure they have both but complex shit breaks when you change it often.


Your devops team assures software quality?

But well, no QA effort is complete enough to really assure the quality of a large system. Thus, the GP holds true.


It's easy to see from the uptime history [1] that there have been many more incidents in April-June than in January-April.

Don't know if that has anything to do with the Microsoft acquisition, but it is concerning.

[1] https://www.githubstatus.com/uptime?page=1


They're still averaging 99.81% up-time over the last three months. This means they have been down for a total of 4 hours out of 90 days. I don't think it's all that concerning.


The gold standard here is 5 9s, not 2.

They have paying customers that are being inconvenienced by this. We lost time over this multiple times in the last few weeks.

So, not good. IMHO they are having some major issues with their release process that they need to address. Standards have slipped there; they used to be better at this.
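
For a sense of scale, a quick back-of-the-envelope of what various availability levels allow over a 90-day window like the one quoted above:

    # Downtime allowed over a 90-day window at different availability levels.
    window_minutes = 90 * 24 * 60

    for availability in (0.99, 0.9981, 0.999, 0.9999, 0.99999):
        allowed = window_minutes * (1 - availability)
        print(f"{availability:.3%} uptime -> {allowed:8.1f} minutes of downtime")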


The numbers on Github Enterprise might be different. They probably roll out changes to the free/$4 per month Github first.


My company has GH enterprise and these numbers look accurate based on my experience. We’ve noticed a lot of downtime that has impacted us.


Isn't enterprise the self-hosted solution?


I believe Enterprise has a self-hosted option, and you can have Enterprise without the self-hosting.


This kind of depends on whether you have urgent work to handle through GitHub. I've been waiting for over an hour so I could continue my work, but GitHub's PR functionality hasn't shown the latest changes because of the outage currently happening, so I cannot. Really, really annoying, and it partially wastes my planned work for the day.

(Of course I'm working on other stuff in the meantime, but it splits focus unnecessarily.)


Wow, that's stark. I count 6 in April-June (2 major outages) and only 1 in January-March (0 major outages).

I wonder if COVID has affected this somehow. Anecdotally I've heard of at least one other ~peer company with a large rise in incidents/outages since April.

Strange since both companies previously had a strong culture of remote work (maybe 1/4 to 1/3 of eng was remote) going into the pandemic, so I'd be quite surprised if all-WFH contributed somehow...


If GitHub's been migrating to Azure infrastructure on the backend, it's possible that everyone else moving to all-WFH helped to cause it. Teams usage massively increased during lockdown for obvious reasons, and that's definitely Azure-based. If Teams managed to overload Azure capacity, I could see that having knock-on effects on GitHub.


> it's possible that everyone else moving to all-WfH helped to cause it

No, MS prioritising Teams land grabbery is what caused it.


The other company does not use Azure.


I remember there were a few incidents with AWS taking down the whole internet, the Verizon BGP incident, and a few others that are completely not the fault of GitHub. I wonder if those were included or affected the numbers.

And generally speaking Github has been 100x more active post Microsoft acquisition. So I am not surprised at the downtime.

And I will gladly trade another few hours, if not more, of downtime if they could just roll back the side panel design.


This is a thinly veiled advert, and tells you literally nothing concrete, apart from the fact that GitHub's status page is updated more. The article even admits this!



Ironically this link is down for me.


Was about to say the same, but it's working now


Author here, just put a CDN in front of the blog so maybe that will help.


The CDN's not working I'm afraid (SSL error, 403 forbidden if you ignore it)


Not yet :)


Ironically, Firefox tells me to stay away from your website. Which, I will.

update: the website is back and well now from my end.


I can't reproduce, would you mind emailing me a screenshot of this? (I'm the author) colin at any of the domains in my bio.


I would have happily complied. On a welcome note, however, the backend changes made after the initial hiccups brought the website back on its feet.


I am also using Firefox. It blocked Google Analytics on the site, but I did not get a warning.


Has anyone noticed their search has gotten worse? I sometimes will search for some code that I know exists in a repo, but it gives no results, and when searching a little later it shows them as I expect.

We have GitHub Enterprise, and Slack notifications when GitHub has any issues. Nearly every week there's a problem; sometimes it's resolved in a few minutes, other times it goes on for an hour or so. I've pondered the question of whether there have been more outages since the MS acquisition, and in my experience that's a hard yes.


What I don't get is how they can be down for hours on end. If I make a deploy that turns out to be buggy, I'll roll my site back to a stable previous version within minutes. Sure, that's not always possible if I've made incompatible database schema changes, but in my experience those are very, very rare (i.e. I almost always add db columns, and only rarely delete or rename columns - and when I do, I do so after those columns haven't been in use for a while).


I'm sure these downtimes are more about infrastructure problems than just database updates, and I'm also sure that this infrastructure is a little more complex than your site.


I'm sure their deployment process is way more complex than mine. I'm also sure that unlike me, they have engineers dedicated to managing the complexity. GitHub is a company that has blogged about how easy they have made deployment and how they deploy many times a day. I don't believe there are any compelling reasons why that ease of deployment shouldn't also extend to re-deploying previous versions of their service, so they don't need to leave a bad version up for hours.


I think you're missing the point a bit. Not all changes are simple code deploys. They take time to diagnose, mitigate and fix. This is especially true when you have a lot of services that talk to each other, working on infrastructure that can support so many services and users.


It’s often not as easy as deploy -> immediate problem -> rollback. Problems can take a while to diagnose, or may cause some kind of poisoning that needs to be fixed (eg rebuild a lost or corrupt cache), or be in some part of the system that nobody knew was related (eg maybe someone deployed code that talks to a hitherto-unqueried accounting system and that worked fine at 4pm on Thursday but come 9am Monday it melts).

My point is that in big complex systems sometimes there is not a straight line between cause and effect. Sometimes there's just effect and you need to work out the cause.


> If I make a deploy that turns out to be buggy

Unless your deploy reconfigures some networking component that makes a large part of your network inaccessible. Then you need to fix the network issue before you can rollback to a previous version. That may require someone driving up to a datacenter and logging into a racked server.

And then you may need to restore data if the network misconfiguration caused data to be corrupted somehow (I admit this is getting a bit worst-possible-case-scenario) and, if the data got crossed - that one client could see data from another - you'll need to prevent access until you are sure everything is where it should be.

Finally, depending on your scale, the deploy of a new version can take a long time by itself. People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.


> People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.

Even better - then they shouldn't even need to make another deploy, just flip the feature flag back off. And if you need to make network changes, then test those out behind a load balancer in parallel to the existing topology, so you can start routing more traffic to the new setup but can stop doing so if any problems arise. I'm not saying any of this is trivial, but the point is, best practices exist to start deploying pretty much any kind of change in a way that can be undone in minutes or even seconds. When you have access to the resources and talent that GitHub has, there are zero acceptable reasons why your site would ever be down or degraded for hours on end - zero.
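
For what it's worth, the flag part of that is not much code; a minimal sketch of a percentage-based rollout (the names here are hypothetical, and real systems typically use a managed flag service):

    import hashlib

    # Hypothetical percentage-based feature flag: dialing a feature back to 0
    # "undeploys" it without shipping new code.
    ROLLOUT_PERCENT = {"new_sidebar": 10}   # 10% of users see the feature

    def is_enabled(feature: str, user_id: str) -> bool:
        # Hash feature + user id so each user lands in a stable bucket 0-99.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(feature, 0)

    print(is_enabled("new_sidebar", "user-42"))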


There are countless ways for infrastructure to break on its own, without being tied to a specific deploy or feature flag. A few common examples in the db tier alone:

Have you ever encountered a write rate that exceeds your db replica's ability to keep up with async replication? There's nothing to "roll back" in this case, and it takes time to determine whether the increase in write rate is from legitimate usage growth vs some recent feature (possibly deployed hours/days ago) writing more than expected during peak periods vs DDOS/bot activity.

Have you worked on multi-region infrastructure, where traffic is actively served from multiple geographic regions, with fully automated failover during regional outages? It is impossible to fully automate every possible situation -- even Google and Facebook have outages sometimes! Even just as a first step, it's hard to figure out conclusively which situations should be automated vs which ones need to alert humans.

Have you ever implemented read-after-write consistency for multi-region infrastructure, where multiple async DB replicas, caches, and backend file stores are not automatically in sync, but need to appear in sync to users making writes from non-master regions? The network latency between regions is sufficient to make this a complicated problem even when things are stable, let alone when there's other sources of replication lag to consider. There's no "out of the box" solution for this; every company needs to handle it in a way specific to their infrastructure and product.

Have you ever implemented a realistic dev/test environment for a massive infrastructure involving dozens to hundreds of services, and many different data stores, some of which are sharded? Again, no "out of the box" solution exists. You need to do something custom, and there will be plenty of cases where it doesn't accurately mirror production.

Or for a non-technical one: have you ever worked for a medium-to-large size company whose exit was via acquisition, rather than IPO? In my experience this always results in a major increase in attrition of the acquired company's top engineers. With an IPO, early folks are more incentivized to stay on; there's a better feeling of ownership, and the efforts of good talent can directly impact the stock price. But when it's an acquisition by some corporate behemoth, the opposite dynamic is at play: there's very little that the acquired company can do to impact the parent company's stock price, leading to a feeling of helplessness. Couple that with different policies and values mindset (say, a contract with a government agency that puts children in cages) and you can guess what happens.


Which commit would you roll back to if the integer id of your comments table overflows? There are many bugs that are more complex than a buggy commit.


Independent of whether it's been down more frequently since the acquisition, I find Github's status page reporting to be lacking. It's also quite generous with definitions of outages and downtime.

For example, on June 22nd they had an issue whereby half of their nameservers were responding with an empty answer for queries for github.com. A very nice explanation is here: https://news.ycombinator.com/item?id=23605409. So for roughly half of users (including myself) this would have manifested itself as a complete outage. It also lasted for a good couple of hours. Yet on their status page it's listed as a 46 minute degradation only.

So relying on their status page reporting to draw conclusions about availability (as Statusgator seems to) will mean that an overly optimistic picture of availability is presented.
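
For anyone curious how an issue like the June 22nd one surfaces, here is a sketch of querying each authoritative nameserver individually. It assumes the third-party dnspython package, and the results are whatever they happen to be at query time:

    import dns.resolver  # third-party: pip install dnspython

    # Ask for github.com's NS records, then query each authoritative server directly.
    for ns in dns.resolver.resolve("github.com", "NS"):
        ns_host = ns.target.to_text()
        ns_ip = next(iter(dns.resolver.resolve(ns_host, "A"))).address
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ns_ip]
        try:
            answers = resolver.resolve("github.com", "A")
            print(ns_host, [a.address for a in answers])
        except dns.resolver.NoAnswer:
            print(ns_host, "returned an empty answer")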


I really wish there was an option to decouple the core git functionality from the 'fancy' stuff (PRs, actions, issues) etc. When your CI pipeline runs directly from your git repository, not being able to commit/merge means you cannot release to production.

For binary artifacts, you can get around dependencies on a single provider by mirroring, but because git is mutable, you can't just mirror the repo and allow changes there, because you need the same permissions, ssh keys etc. for the repositories and because changes will need to be synced back to the source repository. You might as well not use Github in the first place and self-host (with all the problems that entails).

As far as I know, none of the major git providers offer this - I've experienced outages with GitLab, Bitbucket and GitHub that all affected the production environment (luckily, it's never been critical so far).


That is exactly how GitHub is architected. There have been plenty of times when I've been able to push/pull but GitHub Pages was down or their web backend was down.

Bitbucket is the worst in this regard.


Definitely, in my personal experience.

Part of it might be due to covid, but it started happening before that.

On the positive side, GitHub has had a ton of changes since then, all of which seem to be good ones, so it's understandable that it has more problems now as well.


A number of the staff who left GitHub in protest of the ICE contract were senior SRE. I wonder if the data says anything about the dates of their departures...


Looking at the chart, there were more warns but fewer downtimes so I'd say no. There was one giant downtime that happened, but I don't know why it happened so it's hard to say if there was a cause relevant to the acquisition. The other downtimes that happened seemed to be slightly longer in duration than before.


I'm more inclined to believe this is due to the increased reporting and attention to the status rather than an actual increase in downtime.

However, I have no hard data to suggest that, only my experience as someone who's had to manage and maintain reports like this.


Anyone from GH willing to share - anonymously maybe - some insights? I really find the lack of post-mortems from GitHub outages a bit... weird given GH's audience. I think all of us devs could learn from a properly written public GH post-mortem.


It's Microsoft; when has Microsoft shared a downtime postmortem?


Personally I think it is connected to adding more features to GitHub. Microsoft is rolling out more and more small but also big changes, and as everyone knows, each update can always mean a few minutes of downtime.


This analysis has a significant comparability problem because of added features. The core git-repository service is almost never down, but GitHub Actions has had a lot of reliability problems (you can see this clearly on the status page). But GitHub Actions is new, so it skews the recent availability problems up. Unfortunately, as the post notes, the detailed status information is _also_ new, so it's not really possible to analyze the data accurately.


Irony aside, there is some degree of credibility lost when offering a critical opinion on availability and your cloud hosted site throws a 502


I have zero evidence but I would not be surprised if they are busy absorbing Azure Devops into GitHub and that's causing some hiccups.


This was prompted by the latest "Github is down" (now transformed to "Github was down" by the arrow of time):

https://news.ycombinator.com/item?id=23675864


GitHub is evolving more rapidly than I thought it would. I like their new changes. Deep down, its being with Microsoft just reminds me of the Oracle/Google debacle. It definitely discourages big projects from being hosted there.


There's already one news on front page https://news.ycombinator.com/item?id=23675864

Can we merge them? Mods


> There's already one news on front page

And the comment that is currently at/near the top and mostly discussed happens to link to exactly this site.


Mail hn@ycombinator.com to talk to the mods (see footer).


I mean makes sense. They are probably migrating their infrastructure over to Microsoft’s and such.

Disclaimer: I don’t work at MS so no clue, but have been part of acquisitions


I'm sure MS is super uncomfortable with GitHub's AWS bill. I've just assumed they're fast-tracking a migration to Azure


Yeah, really. I feel the same thing. Can't comment anything right now and I've been trying for the past hour. It dies every three weeks.


How has traffic/attacks changed since acquisition? Microsoft name surely gives them an even larger target on their back.


The charts and diagrams in that post are really nice, how did you make them?


If it has, what is p?


Any chance that prior to acquisition there was more focus on stability -- because it was assumed there would be a buyer -- than upgrading to the latest version of Rails and adding features?


Maybe this downtime was caused by the cloud migration from AWS to Azure?


tl;dr yes - and an exception to Betteridge's law!!! https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...


tl;dr maybe - is that an exception?


Anecdotally, yes. I almost never saw internal error pages before; now I've been seeing them every week for a few weeks. Some people need to be fired.


What's really wonderful is when GitHub goes down I get a flurry of people broadcasting on Slack channels that they can't work and all their builds are failing, and I'm wondering what they want me to do about it. Should I get out and push? Call Bill Gates? Tweet Trump? When GitHub doesn't work that's your excuse to go get coffee, grab a pint, browse Hacker News, etc. - people should be happy when GitHub goes down. (Or self-host gitea and work from that if you're so determined to keep on.)


They are porting things over to ASP.NET on Azure.


If the implication is that asp.net or Azure are unreliable then think again


Such an underrated comment. I applaud.


Yes.

Last time it was down was just 6 days ago: [0]

I think it's time to look for alternatives or, frankly, in the long term, follow what some open-source orgs are doing and self-host instead.

[0] https://news.ycombinator.com/item?id=23604944


Both the link and GitHub comments are down for me too; I am unsure whether it has been down more since the acquisition, but this is unacceptable to me. I have been trying to post a comment to an issue I made for the last few hours and every time it says "You can't comment at this time", which I find very misleading. It should say "Sorry, our system has a problem, please try again later". The implication of the current message is that my permissions are wrong or something else has happened.

This has bumped self-hosting all my repositories much higher up my list.



