GitHub: October 21 Incident Report (blog.github.com)
275 points by pietroalbini on Oct 22, 2018 | 110 comments



I saw some comments on reddit which highlighted a pretty serious problem - many orgs rely on github as a fully integrated CD platform, with everything from code hosting, to running CI hooks, to pushing to staging or prod.

It seems very unwise to have essentially your whole deployment process in the hands of an entity which you not only have no control over, but which has regularly been targeted in attacks by nation-state-level actors because of its role as a code hosting platform.

EDIT: GH hasn't been targeted regularly, but it has been so historically, so this is a plausible thing which might happen again.


It's a tradeoff, just like anything else.

We use Github heavily at my work, and at past jobs as well. But at the same time it's not the ONLY way we can work. If github has an outage, our CD will shut down, but we can still run all tests locally, and we can still push to the server directly, and we can still push code between ourselves manually and review it.

Sure, it's a hiccup in our day when it goes down, but it's not like the entire company grinds to a halt. And the alternative of maintaining servers and systems to replicate all of that would take significantly more time, potentially cost more, and is probably more likely to go down.


> and is probably more likely to go down.

People often say this but I wonder if it's true. Your replication wouldn't be handling the same load as GitHub itself, so perhaps the issues that GitHub experiences would rarely happen to a self hosted version.


On-premises git hosting with Gitea or GitLab, with mirrors to GitHub, seems like a smart idea going forward.
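
For illustration, a minimal git-level sketch of that setup (the URLs are hypothetical, and the self-hosted instance is treated as the source of truth):

  # mirror a self-hosted repo out to GitHub; re-run the push (e.g. from cron) to stay in sync
  $ git clone --mirror git@git.internal.example.com:team/project.git
  $ cd project.git
  $ git remote add github git@github.com:example-org/project.git
  $ git push --mirror github

GitLab and Gitea also offer built-in repository mirroring, but the plain git commands above work with any pair of hosts.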


But then you have to host it and maintain it. It's a slippery slope: how many 3rd-party services do you bring in-house with self-hosted OSS? Pretty soon you're spending a huge chunk of your time doing ops work. And where do you host it? On AWS, which can also go down, or on hardware hosted at your office? With on-premise hosting, now you're in the hardware game too.


Well, your code is your company's IP. It might be prudent to have that IP on-prem (e.g. GitLab). Every business has varying requirements, but I've yet to be employed at one using 3rd-party hosting for source control without an on-site mirror. Disclaimer: my employment has been at megacorps so far; smaller shops may not do this.


with git, you have a mirror on every single developer machine (kinda... depending on what you consider "all" of the code).

We also have a full copy on our CI server (which is hosted on another service, so still not "in-house"), and we have a copy of at least the master branch (and all its history) on our production boxes (which also gets pulled into our whole backup system there).

In a disaster recovery scenario, that's more than enough for me.

Sure, if github blinked out of existence we would probably be at a fraction of our normal productivity for a while until we fully recover everything and find new workflows, but the risk vs reward there is well within the margins of what I'd consider acceptable for a company like us.
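
(As an aside, a bare mirror clone is a cheap way to make sure at least one of those copies has every branch and tag, not just whatever someone happened to check out; a rough sketch, with a hypothetical URL and path:)

  # --mirror copies all refs, not just the current branch
  $ git clone --mirror git@github.com:example-org/project.git /backups/project.git
  # refresh the same copy later, e.g. from cron
  $ git -C /backups/project.git remote update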


> In a disaster recovery scenario

That's not really enough. You need more than "all the code is around here somewhere"; you need a plan with specific steps that have been tested.


Honestly for git, it probably is enough. We're talking about someone deleting your github account or github closing overnight with no warning (it's been acquired by Microsoft, so it's much more likely that the company you're working for will shutter). It should take ~30 minutes to push your repo to another provider including looking up instructions. Unlike database backups, there is rarely any data loss and any data loss should be recoverable. It's also not client-facing, but is a temporary problem similar to wifi going down at your office. An inconvenience and hassle, yes. Long-term problem, no.

Furthermore, the problem with these disaster scenarios is that there are much more dangerous possibilities than your account being deleted. Someone with admin access could insert a back door or sell your source code to someone else. That's honestly scarier.


That's probably the case for valley-style startups where the whole team can fit in a room and they all hack on the same handful of repos, but most "enterprise" customers will have hundreds of repos with not necessarily anybody hacking on most of them at any given moment. It's very good policy for such organizations to have a plan in place to "break glass in case Github is down" with local mirroring of all data and a tested process for doing deploys without Github.


We have exactly that... in our DR plan, there is a section for how to cope with the 3rd party source control provider being unavailable/compromised/etc. Update DNS for the equivalent of "upstream-git.foo.com" to an internal address, and continue business as usual.

It's like you said, smaller shops probably think it's over-planning and overkill, but we do indeed have 100's of projects that are "mission critical", that might not have been touched in 1+ years.
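
To make the DNS trick concrete (hostnames here are hypothetical): every clone and CI job references a neutral name rather than the provider directly, so failing over is a single record change instead of touching hundreds of repos.

  # every repo and CI job points at the neutral name, not github.com
  $ git remote -v
  origin  git@upstream-git.foo.com:team/project.git (fetch)
  origin  git@upstream-git.foo.com:team/project.git (push)

  # during DR, repoint upstream-git.foo.com (a DNS CNAME or A record) from the
  # 3rd-party host to the internal mirror; clients and CI change nothing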


>but I've yet to be employed at one using 3rd party hosting for source control without an on-site mirror.

For me it's actually the opposite, except for one company that had only an internal SCM and no cloud stuff.


I'm not sure self-hosted apps are that hard to maintain. It's mostly set up and forget: the interface or license doesn't randomly change unless you upgrade it as well.

I've used AWS for 10 years, but in the last 5 years I've never seen it just go down randomly, and even if it did, you have room to redeploy with a few clicks (assuming you have your data backed up regularly) instead of waiting for the uncontrollable.

You seem to overstate the ops work a bit.


You know, the time required to host Gitea (or Gogs) is basically nil (you'll have to set your environment up for GitHub too). And by maintaining it, you mean making sure it's online and taking backups? Because I don't see how you save any time on those by using GitHub.

Where you host it is of less relevance, because you can simply take your backup and server script and run them at any different provider any time you want.


I am currently leading the on-premise hosting of a GitLab instance and I can say this with ease: I have been spending 1 day of each week on ops work. Be it helping people, database adjustments, admin stuff, hardware checks, etc.


>It seems very unwise to have essentially your whole deployment process in the hands of an entity which you not only have no control over,

Welcome to the SaaS world, where you offload institutional knowledge and hiring to a 3rd party and roll the dice.


This is why we built and manage our own git servers internally. We run emergency services infrastructure, so we cannot afford to be disconnected from our version control. Ever.

Same reason we don't run our application and supporting infrastructure in a cloud provider, really - apart from the complexity of our particular network, which has a heavy dependency on mobile and satellite links, we cannot guarantee availability to our customers if we don't control the compute, storage, SAN fabric, and as much of the network infrastructure as possible.


Why would a nation-state-level attacker target it for being a code hosting platform? At best they can temporarily suspend operations.


Yeah, we're still facing issues with, erm, Github issues.

Also, while they haven't updated this blog post for a while, their status page has been very up-to-date and informative: https://status.github.com/messages


Yeah, but their recovery estimates were completely off. Pulls, hooks, checks and issues are still unusable.


Releases and GitHub Pages are still completely broken too.


> very up-to-date and informative

Is that satire? It said 2-hour ETA 5 hours ago and the last update was over two hours ago.


I see an update 7 minutes ago.

>12:56 British Summer Time

>The majority of restore processes have completed. We anticipate all data stores will be fully consistent within the next hour.


Every hour they promise something will be done until the next hour. I haven’t been able to work all day so far.


Consider this a lesson on serverlessness. (We have been similarly afflicted, but their git backend seems to be up; and even further, we have rediscovered what we stopped paying attention to: that with Git, a centralized repo is just a convenience, not a requirement.)


Yes, I agree, but here it’s not up to me to choose the infrastructure.

I wonder what the total cost of this ordeal must be. Surely in the tens of millions.


The latest message appears to be:

"We are validating the consistency of information across all data stores. Webhooks and Pages builds remain paused."

Which is a bit scary. Half my requests appear to hit some storage which is still many hours behind. They should be seeing that...


On the plus side, this disastrous calamity by GitHub really made me try out GitLab, and in the process I will now set up a second remote on my repos:

https://stackoverflow.com/questions/11690709/can-a-project-h...

  Quoted:
  "Try adding a remote called "github" instead:

   $ git remote add github https://github.com/Company_Name/repository_name.git

   # push master to github
   $ git push github master

   # Push my-branch to github and set it to track github/my-branch
   $ git push -u github my-branch

   # Make some existing branch track github instead of origin
   $ git branch --set-upstream other-branch github/other-branch"

Actually, I don't know why I pay for GitHub private repos.. I might as well set up two origins, one at GitLab and one at Bitbucket, for all my private repos. Then keep GitHub as a public front-facing portal.
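
The lazy version of that: a single remote can be given several push URLs, so every "git push" lands on all the hosts at once (a sketch; the repo URLs are hypothetical):

  # keep origin's fetch URL as-is, but add every host as a push target
  # (the original URL has to be re-added explicitly once any push URL is set)
  $ git remote set-url --add --push origin git@github.com:me/project.git
  $ git remote set-url --add --push origin git@gitlab.com:me/project.git
  $ git remote set-url --add --push origin git@bitbucket.org:me/project.git

  # one push now updates GitHub, GitLab and Bitbucket together
  $ git push origin master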


Every company has incidents of some kind at some point. It just happens. GitHub is rock solid most of the time and doesn't deserve to be judged on one major incident like this. On the contrary, they need our support. Bitbucket and GitLab have both had problems of the same magnitude.


But a multiple-origin solution seems the most sensible. We have failover for everything in infra and services.. it seems we now need failover for cloud-hosted code too. It just seems logical, especially since all 3 of the big hosting providers have had big incidents.


I think it's more about people not looking for alternatives while things are going smoothly. Now they will, even though the grass may not be greener.


Why does GitHub need my support?


> this disastrous calamity by Github really made me try out Gitlab and in the process

Every company faces problems like this. GitLab infamously lost their entire production database [0] -- and I think we can agree that's more serious. Knee-jerk reactions to incidents will leave you without any "trusted" services, because mistakes happen to everybody at some point. BitBucket has certainly had its own fair share of downtime.

[0]: https://about.gitlab.com/2017/02/10/postmortem-of-database-o...


> GitLab infamously lost their entire production database

What?? The link you posted says

> Database data such as projects, issues, snippets, etc. created between January 31st 17:20 UTC and 23:30 UTC has been lost. Git repositories and Wikis were not removed as they are stored separately.

> It's hard to estimate how much data has been lost exactly, but we estimate we have lost at least 5000 projects, 5000 comments, and roughly 700 users. This only affected users of GitLab.com, self-hosted instances or GitHost instances were not affected.

How is that "their entire production database"? You make it sound so much worse than it was. While it was a horrible incident, they did not lose their whole production database.


Technically they did, for a brief period (24 hours??) at least; they just restored it from backup.

That was the final, last-ditch backup too; something like 5 out of 6 of the planned backups weren't actually working and nobody had realised.

So you're right, they didn't lose it, but they came pretty damn close!


A picture of the Google Docs post on what they tried (which no longer appears to be available online) can be found here:

https://femto.pw/6bsm.png

As a reference to the above poster.


> Actually, I don't know why I pay for Github private repo's.. I might as well set-up two origins, one at Gitlab and one at Bitbucket, for all my privates. Then keep Github as a public front-facing portal.

You do realize that every cloned tree can be a git "repo", regardless of the machine it's on, right? GH and GL surround the repo with some other things (bug reports etc) to rope you in but if you are cloning from one to the other you already aren't migrating that stuff as well, so it's not really clear what additional value that provides.

Is there something I'm not seeing?


Which brings into question... what is so special about the "front-end" features that hosts like GitHub provide? Why not use another third-party service that integrates with your GitHub/GitLab/Bitbucket repo, storing the metadata of all the pull requests, code reviews, discussions, etc., inside a "meta repo" itself? I see this incident as an important prompt to take steps to decouple important "meta" features from the cloud host itself. It would be better to keep the cloud hosts as a bare-bones skeleton that we can swap in/out at will (GitHub, GitLab, etc.), and then use another layer on top to provide all the nice fancy features.


Based on what I read on their engineering blog they will eventually have a similar outage.


Ugh, I detest the term "abundance of caution" as used in that message. Weasel words designed to stop you thinking about the problem, because who wouldn't want them to be overly cautious?


It's not that bad. X definitely modifies Y but it may also modify Z. You are informing the audience of this fact and it provides them with 'warm fuzzies'.


> During this time, information displayed on GitHub.com is likely to appear out of date;

If the new data is not presented, users will typically retry, which may result in duplicated new content.


Happened to me already. Created a new branch but it kept informing me that it didn't exist; I retried a few times and shortly after it was created with a bunch of duplicates.


Same here, but last night I was trying to post a comment to another repo, refreshed and tried again like 5 times before realizing there was an outage and it wasn't on my end. This morning that issue has 5 dupe comments (so I look like an idiot), and deleting them does nothing; they just reappear when I refresh the issue.


As long as the same problem doesn't occur again, I have no hard feelings against them.


Wow, is this still going on? I noticed this yesterday and figured it would be fixed in a few hours, but [GitHub's status page][1] is still showing red. Is this the longest outage GitHub has had?

[1]: https://status.github.com/messages


This incident report tells very little. I hope they release what actually happened and how it affected their services. And how they are going to avoid it in the future. Almost any issue can be publicly described as: stuff broke because of network.


"Partition failure" is a relatively specific term of art that implies it was a partial database failure, so "stuff broke because we lost part of a database".

It may not be in GitHub's best interest to describe much more than that because which database, which tables, and how they were partitioned can be secret performance sauce.

Postgres partitioning: https://www.postgresql.org/docs/9.1/static/ddl-partitioning.... Redis partitioning: https://redis.io/topics/partitioning


"Network partition" is unrelated to partial database failure or database partitioning. It means that database servers got disconnected from each other. Normally this shouldn't be a problem, but as far as I know they don't use proper distributed algorithms and so it's possible that disconnected servers each became masters and were serving requests independently, which is why they speak about inconsistencies. I believe this problem is commonly known as split brain [1].

[1] https://en.wikipedia.org/wiki/Split-brain_(computing)


It's not "unrelated", it's an intertwined phenomenon, especially in practical consequences. In Redis, as one primary example I already mentioned, all database partitioning is network partitioning (and vice versa). The logical object is still a "database" even if the physical object suffering problems is the "network".

For similar reasons that you suppose they "don't use proper distributed algorithms" (which seems an overly harsh way to put it), I simply presumed the logical entity in question is still a "database", even in the case of a network partition problem. I don't think it was either Postgres or Redis in this case, but they are simple enough examples to illustrate the overall problem.

Either way, my case still seems to stand that we aren't likely to get more information about the specifics of the partition failure because likely both the logical (which "database") and physical layer specifics (which datacenters/"network") are things that are internal to GitHub that they may not be able to publicly describe much more than what they have.


It doesn't really matter what they used; I just pointed out that their incident report is BS. I can blame virtually any production problem on the network, but the ugly truth is that it is not the network's fault in most cases, but the fault of engineers who developed the system with unreasonable assumptions.


I don't understand why code hosting platforms like GitHub, GitLab or BitBucket have so many issues regularly. Is there anything special about it?


> Is there anything special about it?

Yup. You notice when they're down. Whereas when your self-hosted git server goes offline for a couple of hours, nobody else notices.


I am not talking about git hosting per se, but compared to other SaaS companies it seems GitHub/GitLab and Bitbucket are down much more often.


Facebook had an outage recently, Google has had outages. Hulu has an outage every week. You can check https://downdetector.com/ for recent history of many major services.

It's a fair bet that developers are more likely to notice and complain about software services being down on the internet.


I don't know about GitLab and Bitbucket, but GitHub's uptime is relatively phenomenal, given its load and feature set. I've never been in a programming environment where it's been the weakest link among hosted services.


[citation needed]


Lots of users. Also, a lot of package managers rely on hosting platforms like GitHub to host their packages, so if Github breaks, a lot of CI processes around the world break.


Which is kind of ridiculous. If your CI breaks because GitHub is down, it means it's not caching dependencies locally, but keeps re-downloading them every time it runs (e.g. every commit), generating tons of waste and unnecessary load on the hosting service.

Or, to put it bluntly, if your CI works like this, it's contributing to climate change.
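
For anyone wondering what "caching dependencies locally" looks like in practice, a crude sketch (the paths and dependency URL are hypothetical):

  # keep a local mirror of each dependency; only the very first fetch needs GitHub
  CACHE=/var/cache/ci/some-dep.git
  if [ ! -d "$CACHE" ]; then
    git clone --mirror https://github.com/example/some-dep.git "$CACHE"
  fi
  # try to refresh, but fall back to the cached copy if GitHub is unreachable
  git -C "$CACHE" remote update || echo "GitHub unreachable, building against cached copy"
  git clone "$CACHE" ./some-dep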


I think you are wrong. Our CI infra caches all dependencies, but it depends on GitHub for new internal code pushes (kind of the point). If GitHub is not sending events, CI doesn't kick in.

You're ignoring half of the problem. If you don't receive events from GitHub because they are down, your CI doesn't work either -- dependency caching doesn't matter at that point.


That's assuming you're putting your own organization's code on GitHub. Then of course if GitHub doesn't work, neither does the CI that's hooked to it. This is a separate topic.


Or it's something like cargo (Rust's package manager) checking whether any dependencies have a newer version by consulting the package registry (which is stored on GitHub for no apparent reason).


That’s an excellent point. How can you tell if the CI system uses caching, other than waiting for a github outage to notice something broke?


Many CI systems provide a build log, do they not? Look for a “git fetch” in it instead of a “git clone”.


Check in documentation, or when in doubt, run a job, redirect GitHub to 127.0.0.1 in /etc/hosts on the CI server, and run that job again.
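
A rough sketch of that test on the CI box, with hostnames adjusted to whatever your jobs actually pull from:

  # simulate a GitHub outage for the next job, then undo it
  $ echo "127.0.0.1 github.com" | sudo tee -a /etc/hosts
  # ...run the job and see what breaks...
  # remove the override again (assumes no other github.com entries in the file)
  $ sudo sed -i '/github.com/d' /etc/hosts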


Define regularly.

I can recall only 2 incidents this year. I think that's not too bad considering the level of traffic they have to contend with.


Considering this has been going on for several hours now, their availability is down to 99.9% at best and dropping by the hour. Their business SLA is 99.95% (though I have no idea exactly what it covers), so it's quite possible that they are in breach.

Still not bad, but 2 incidents like this a year is usually considered unacceptable for infrastructure service providers.


I wonder how they define their SLA though. If only some of the features are down, does it impact the SLA?


> Define regularly

More often than what is considered a standard 99.99% uptime SLA? (about an hour per year.)

You seem to be making it out like a couple of days a year of lost developer productivity is no big deal.

That said, these things happen and you should probably check your workflows if you're all that blocked by GitHub being down.


> More often than what is considered a standard 99.99% uptime SLA?

GitHub SLA is 99.95% and apparently exclusive to Business Cloud customers[1].

[1] https://github.com/pricing


I wasn't saying it applied. Just what's expected from a large international company relied on by so many.


That works out to about 4-5 hours of allowed downtime per year.


What's special about it is that you have terabytes of data to keep available, and at scale git does not play well with technologies like NFS or cloud object storage like S3, so each major provider either pays a lot of money to specialized vendors or has homegrown solutions to deal with the problem.

So on top of your usual problems with keeping a cloud service up and running, you also have that git IO problem to contend with, and to rub salt in the wound, that wrinkle also makes it difficult to fully adopt many "standard" cloud architectures or vendors (such as AWS) which work for non-IO-heavy applications: you always have this major part of your infrastructure that has this special requirement holding you back at least partially (and that can hurt your availability for related services which are not even IO-heavy).

(That said, it's hard to guess whether that was the problem, a contributing factor, or unrelated entirely based on the details provided here.)

source: I work at Atlassian (though not on the Bitbucket team) and occasionally chat to current and former Bitbucket devs on this topic.


I suspect it's just that you hear about them all.


It's just noticeable when it happens during someone's work day.


No, that's what a typical single organization with a typical RDBMS-centric, non-resilient architecture can provide. But of course it's very hard for organizations at certain sizes to do better; it's something they have to start with.


It’s hard.

But really, do they have that many problems?


It would be interesting to know what the cause of the network partition failure was.


Something that struck me yesterday as this started is that GitHub isn't really just a DVCS hosting solution; GitHub is a social network.

It’s easy to change the remote origin of a git repository. It’s not hard to migrate a project to Gitlab etc, and duplicate all the technical features of Github.

What’s really hard is replacing the social graph. If you’re a large project with a lot of contributors, onboarding everyone is not going to be easy. About as easy as convincing all your friends to stop using Facebook.


> Further, this incident only impacted website metadata stored in our MySQL databases, such as issues and pull requests.

Basically the only things I was going to do on Github today :)


Why bother telling people five incorrect estimates for service recovery?

https://twitter.com/sneakdotberlin/status/105434537971655884...

After one or two wrong guesses it just shows that you’re making it up.


I hope they are going to provide a better RCA than that!


Sure, but I don't think they have the time since the outage is still ongoing.


Don't try creating a new repo; it will create part of the metadata but not allow you to see or use the repo, and the repo name gets taken.


I created a new repository at about 9 UTC, and it started to work some of the time at 12 UTC — I've pushed the code from my computer, and pulled it from elsewhere.

However, it's still intermittently failing as of 14 UTC, so I haven't managed a Maven release build yet.


I created a new repository 30 minutes ago.

As I initialized it with a readme and an ignore file, I had to clone it. Cloning only succeeded by doing `watch git clone` and waiting a few minutes. But it worked.


GitHub also silently published a private repo of mine this week. I checked audit logs for both the owner user and the org and it didn’t show a permissions change anywhere.

Netlify has caused me to stop using GitHub Pages and between the clownshoes outage reports and the security issue I am now a GitLab user.

This is GitHub’s jump the shark episode. :(


Gitlab has had its own issues.


This was a head-on blow for MS?



As I rearrange today's todo list, I recall wishing that I'd used the Microsoft purchase event to encourage folks to increase their familiarity with gitlab. So I now note:

https://gitlab.com/explore/projects is a live feed of project activity, suitable for code surfing. It also sorts by stars and trending.


Let's do a quick back-of-the-envelope calculation:

GitHub reports 28,337,706 users as of 2018-06-05 [1]. Let's assume 50% of these are active. Let's also assume that, due to the unavailability of GH, around 2 usable hours per developer are lost. Another assumption is that each developer contributes around 50 US$ per hour.

That means this outage has cost us users: 28337706 * .5 * 2 * 50 ≈ 1.42 billion US$.

Perhaps not use MySQL for such critical systems?

[1] https://github.com/search?q=type:user&type=Users


The assumption that developer time is lost when GitHub is unavailable is wrong. The whole idea of Git being a distributed VCS is that it does not require any connection to the main server (i.e. GitHub) to work with a local copy of the repository. If GitHub is down, I can still do my work locally and then push changes to GitHub when it's back online. The only case where I may get blocked is when I need to fetch project dependencies hosted on GitHub to do the initial build of my project, which doesn't happen very often.


But you are ignoring a huge component of using github. If you only use github as a central, shared repo, sure you lost almost nothing. Their git infra seemed to be operational throughout this, but if you use the major features of github: issues, PRs, review, webhooks for CI, etc, you probably did lose out on developer time.

My team and I pushed tons of code to origin today (JST btw), but we were almost at a standstill as far as merging to master and closing out branches. GitHub being down had a huge effect on the process -- review, merging and CI success. Merging was our big one since we have protected branches via GitHub -- master; CI was second since we received no webhook events. So either we wait it out (~8 hours) or we throw out our process and do something different until they fix it. The latter didn't seem like a reasonable course of action.

github != git

It's easy to say, "well they are down, that doesn't affect git", but the reality is that a lot of orgs don't just use git. They use GitHub. That fact envelops a lot of process, routine, infra, schedule, money, etc. Luckily, it's only been one day. But developer time has definitely been lost. You can't reasonably say otherwise if you use GitHub.

edit: I will say I don't fully agree with the GP. Blaming the downtime on MySQL is silly and coming up with dollars lost is over the top. Things happen. That comment was a weird attack on a reasonably stable database. GitHub had a bad day; it happens.


Almost everything you mentioned can be done without GitHub. Add a new remote and have teammates push branches to that remote. Merge to master locally. Review the old-school way with 'git diff'. Apart from issues and CI, progress can still be made if you use the Git features that make it distributed in the first place. If you depend on GitHub that much, you should think about a fallback strategy. Otherwise your business model is just unsound.
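
Roughly, with a hypothetical internal box standing in as the temporary shared remote (it just needs a bare repo created on it first):

  # hypothetical internal host acting as the temporary shared remote
  $ git remote add fallback git@buildbox.internal:team/project.git
  $ git push fallback my-branch

  # review and merge without GitHub
  $ git fetch fallback
  $ git diff master...fallback/my-branch
  $ git checkout master
  $ git merge --no-ff fallback/my-branch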


Going forward, I see that the sensible solution for all small companies relying on cloud-hosted git is to always have a secondary cloud provider at all times.


> Which brings into question... what is so special about the "front-end" features that hosts like GitHub provide? ... bare-bones skeleton that we can swap in/out at will

Your comment disappeared. I think you had a good point, and it'd be cool if it could be a real thing, but...

I don't disagree. But our team realized that we rely almost too much on GitHub; we just decided to put up with it. Is there a solution that doesn't depend on running your own "github"? My infra lead and I had the usual fun, tongue-in-cheek chat that began with... them: "maybe it's time we switch to gitlab", me: "can we be up tomorrow?" We've had that same discussion many times before.

In the end, it comes down to process. If you buy into the features, PRs, CI hooks, etc., then it's really hard to just say "well, we can maintain and replicate the alternative for the .1% edge case". Otherwise, you might as well just use that and not GitHub, GitLab, etc. It's hard to decouple from GitHub. They do that by nature -- dev still continues; they are a piece of the process puzzle. I think abstracting them away just complicates things unnecessarily.


Ah yeah, it disappeared because I thought I would get more reads if I moved it up in the post chain lol.

May I ask what makes your company feel secure about using cloud-hosted solutions? I mean, can't a disgruntled employee easily clone the git repo to his own GitHub account? I suppose they could do that anyway, just by copying the repo somewhere, but having an entire company's secret code in the cloud just seems to remove too many of the barriers that protect the code.


Average 50 percent active (as in daily usage)? 2 hours lost?

I'd estimate a couple of orders of magnitude less...


I don't think 2 hours is an exaggeration for an active user. The 50% might be too high, I agree.


2 hours lost? Maybe 2 hours affected, but then just don't push/fetch during that time. Very few would have to stop and wait.

The average loss I'd put at about 10 minutes (most of it being "huh, wonder what's up with github").


You lose quite a bit more. People keep checking if it's back online. The usual flow is disrupted. Questions might be asked about whether branches can be merged. Integration with CI might fail. Managers start asking questions why features cannot be deployed. Some of the tooling I wrote even would stop working because the API was unavailable.

Unfortunately, it's easier to downvote than to come up with a better estimate of the total cost of a 13-hour downtime of GitHub.


That's a vast exaggeration. You're assuming every single one of GitHub's active users: 1) was active during the exact incident time and 2) is a business user earning US wages.


How would you do the calculation of the costs of this outage? Don't forget that the 50 US$ is lost added value, not wages.


The GitHub team seems to be VERY unprofessional. A 15-hour outage means ~99.82% availability, which is extremely bad. 9 hours ago they also said that they would fix the problem within 2 hours... still not fixed!!!


In order to objectively assess their level of professionalism, you'd need to know what happened, and what's going on in there right now. Think about it this way - can someone break things where you work right now to cause an outage of this magnitude? For all places that I've worked the answer would be "yes, of course".


It doesn't matter, really. It's a black box from a business perspective. Some users have lost faith, and some people will migrate to other solutions. Regardless of how fair or unfair the incident was, it is a fact that it was poor uptime, especially for such an important cloud provider for code.


Making decisions based on a single event is risky. One could even argue that an event of this magnitude is likely to cause significant improvements in reliability. And you just paid the price for that as a user, so unless they keep failing repeatedly, you may be better off sticking with them.


> One could even argue that an event of this magnitude is likely to cause significant improvements in reliability.

Doesn't happen in practice and usually the whole thing is just blamed on "process" with subsequent "process changes".


You need to judge them by the recovery in the context of what happened. GitHub has a world class engineering team, but major outages still happen. Even in the context of a major outage, they have best-in-class status pages and frequent around-the-clock updates.

I helped manage a hosted DVCS and CI system in my previous job, and do you know what we would've called an outage that happened at 7:00PM and had recovered partially before start of business?

A Tuesday.

Wait for the RCA to come out before throwing any stones.



