I know that they have to be apologetic like this, but the simple fact is that GitHub's uptime is fantastic.
I run http://CircleCi.com, and so we have upwards of 10,000 interactions with GitHub per day, whether API calls, clones, pulls, or webhooks. A seriously, seriously small number of them fail. They know what they're doing, and they do a great job.
I'd like to welcome the github ops/dbas to the club of people who've learned the hard way that automated database failover usually causes more downtime than it prevents.
Though it turns into an MMM pile-on, the tool doesn't matter so much as the scenarios do. Automated failover is simply unlikely to make things better, and likely to make things worse, in most scenarios.
Automated database failover is absolutely mandatory for HA environments (as in, there is no way to run a 5 9s system without it) but, poorly done, results in actually reducing your uptime (which is a separate concept from HA).
I've been in a couple of environments in which developers have successfully rolled out automated database failover, and my takeaway is that it's usually not worth the cost: with very, very few exceptions, most organizations can take the several minutes of downtime needed to do a manual failover.
In general, when rolling out these operational environments, they are only ready when you've found and demonstrated 10-12 failure cases and come up with workarounds.
In other words - if you can't demonstrate how your environment will fail, then it's not ready for an HA deployment.
With the possible exception of life-safety systems, credit card processing, stock exchanges, and other "high $ per second" applications, I just don't see getting HA right on transactional databases as worth the effort. Properly rehearsed, a good Ops/DBA (and, in the right environment, NOC) team can execute a decent failover in just a few minutes, and there aren't that many environments (beyond the exceptions listed above) that can't take two or three 5-minute outages a year.
The alternative is your HA manager decides to act wacky on you, and your database downtime is extended.
For some reason, this is rarely (almost never, in my practical experience) a problem with HA systems in networking. With just a modicum of planning, HA Routers, Switches, and Load Balancers Just Seem to Work (tm).
Likewise, HA storage arrays are bulletproof to the point where a lot of reasonably conservative companies are comfortable picking up a single array/frame.
But HA transactional databases - still don't seem to be there.
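For scale, here's the back-of-the-envelope math on what those nines actually allow (a quick sketch, not tied to anyone's SLA, just the downtime budget per nine):

```python
# Quick sketch of the arithmetic behind "two or three 5-minute outages a year".
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (2, 3, 4, 5):
    allowed_fraction = 10 ** -nines               # e.g. 3 nines -> 0.1% allowed downtime
    budget = MINUTES_PER_YEAR * allowed_fraction  # minutes of downtime per year
    print(f"{nines} nines ({(1 - allowed_fraction) * 100:.3f}% uptime): "
          f"{budget:7.1f} min/year of downtime")

# Two or three 5-minute outages (10-15 min/year) fit easily inside four nines
# (~53 min/year) but blow straight through five nines (~5.3 min/year).
```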
Automated failover in the case of too much load is usually not what you want to do. Automated failover in the case of hw/network failure is usually what you want to do. Differentiating the former from the latter is left as an exercise for the reader.
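One hedged sketch of what that differentiation might look like in practice (the thresholds, names, and promotion action below are made up for illustration, not any particular HA tool's behavior):

```python
# Hypothetical sketch only: fail over automatically when the primary is
# unreachable (likely hw/network failure), but page a human when it is merely
# slow (likely load). Thresholds and names are illustrative, not a real API.
import socket
import time

CONNECT_TIMEOUT = 2.0   # assumed: can't even open a TCP connection -> "down"
LATENCY_BUDGET = 0.5    # assumed: connection works but is this slow -> "overloaded"

def probe_primary(host, port=3306):
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=CONNECT_TIMEOUT).close()
    except OSError:
        return "down"
    # A real check would also run a trivial query (SELECT 1) and look at
    # replication lag; TCP connect latency stands in for that here.
    return "overloaded" if time.monotonic() - start > LATENCY_BUDGET else "ok"

def decide(host):
    state = probe_primary(host)
    if state == "down":
        return "promote_replica"   # hardware/network failure: automate
    if state == "overloaded":
        return "alert_operator"    # load problem: failing over just moves it
    return "noop"
```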
Here are the makings of a bad week (Monday, of all things):
- MySQL schema migration causes high load, automated HA solution causes cascading database failure
- MySQL cluster becomes out of sync
- HA solution segfaults
- Redis and MySQL become out of sync
- Incorrect users have access to private repositories!
Cleanup and recovery take time; all I can say is, I'm glad it was not me who had that mess to clean up. I'm sure they are still working on it, too!
This brings to mind some of my bad days... the OOM killer decides your Sybase database is using too much memory. A hardware error on a DRBD master causes silent data corruption (this took a lot of recovery time on TBs of data). I've also been bitten by a MySQL master/slave becoming out of sync. That is a bad place to be in... do you copy your master database to the slaves? That takes a long time, even on a fast network.
The lack of any negative response on this thread is a testament both to the thoroughness of the post-mortem, and the outstanding quality of GitHub in general.
In GitHub we trust. I can't imagine putting my code anywhere else right now.
We host our status site on Heroku to ensure its availability during an outage. However, during our downtime on Tuesday our status site experienced some availability issues.
As traffic to the status site began to ramp up, we increased the number of dynos running from 8 to 64 and finally 90. This had a negative effect since we were running an old development database addon (shared database). The number of dynos maxed out the available connections to the database, causing additional processes to crash.
Ninety dynos for a status page? What was going on there?
At the time of the outage, the status site was seeing upwards of 30,000 requests/minute.
As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led to us massively increasing dynos as quickly as we could, since it appeared (at the time) that CPU burn was a significant cause of the slowness. This was in part caused by all the dynos repeatedly crashing. That's how we ended up going from the previous 8 to 90.
Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.
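For what it's worth, the connection math makes the crash loop easy to see. A rough sketch with assumed numbers (the shared-database cap and per-dyno pool size below are purely illustrative, not Heroku's actual figures):

```python
# Assumed numbers for illustration only: suppose the old shared-database addon
# allows ~20 concurrent connections and each dyno keeps a pool of 2.
CONNECTION_LIMIT = 20   # assumed cap on the shared database plan
POOL_PER_DYNO = 2       # assumed connection pool size per dyno

for dynos in (8, 64, 90):
    wanted = dynos * POOL_PER_DYNO
    over = wanted - CONNECTION_LIMIT
    print(f"{dynos:2d} dynos want {wanted:3d} connections "
          f"(limit {CONNECTION_LIMIT}): {'over by ' + str(over) if over > 0 else 'fits'}")

# Past the cap, new dynos can't check out a connection at all, crash, restart,
# and look like a CPU problem, so adding dynos only makes it worse.
```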
What prevented you from just caching the status page and then refilling the cache manually every X seconds? I'm sure a status page that is a few seconds old, given the system-wide meltdown, wouldn't have been an unreasonable compromise?
Anyone tested S3's static page hosting under heavy load? I would think you could just update the static file as a result of some events fired by your internal monitoring process.
We use S3 behind a 1-second max-age CloudFront distribution to serve The Verge liveblog. It's been nothing but rock solid. We essentially create a static site and push up JSON blobs. See here:
This is really interesting -- thanks for sharing. It seems to me that you could probably have nginx running on a regular box and then CloudFront as a caching CDN to avoid the S3 update delay.
Probably could figure that out, yeah. But we didn't want to take any chances given how important it was to get our live blog situation under control.
[edit]
Which is to say, we wanted a rock solid network and to essentially be a drop in a bucket of traffic, even at the insane burst that The Verge live blog gets.
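A rough sketch of that push-static-JSON pattern (the bucket and key names are made up, and boto3 is used only for illustration; the thread doesn't say what tooling The Verge actually uses):

```python
# Minimal sketch of "push static JSON to S3, serve via a short-TTL CDN".
import json
import boto3

s3 = boto3.client("s3")

def publish_status(payload: dict) -> None:
    s3.put_object(
        Bucket="example-liveblog-static",   # hypothetical bucket
        Key="live.json",
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
        CacheControl="max-age=1",           # matches the 1-second CDN TTL above
    )

publish_status({"status": "ok", "updated_at": 1347374400})
```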
Could you say more about using both the Cache-Control and a query string of epoch time? In particular the query string has me puzzled. On its face it seems to decrease your cache hit ratio, with no/little benefit.
I'm assuming the epoch time is the client's local time. The clock skew across the client population increases the number of cache keys active at any one time.
The incrementing query string also forces a new cache key once per second. Those would force a cache miss and a complete request to S3 even when the content has not changed. It's even worse with the skew, as you now force a cache miss per second for each unique epoch time in your clients' clocks.
Without the query string the cache could do a conditional GET for live.json. That would save latency & bytes as the origin could respond with a 304 instead of the complete 200.
Great point. I don't speak for the guys that made the decision to append the timestamp to the query, but I assume our concern was intermediate network caches that don't honor low TTLs. Though I don't know how well founded that is, we won't ever have to deal with the issue if we take control of it with the URL string.
It'd be interesting to see how wide the key space is due to clock skew. I suppose we could specify some number and consider it a global counter that is incremented every second; then when someone comes in for the first time they can be synced with the global incrementing counter. That counter is used to ensure a fresh CloudFront hit.
I think at the end of the day, these issues haven't been a huge concern for a one-month emergency project, but they are good points.
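To make the conditional-GET suggestion above concrete, here's a hedged sketch using the requests library (the URL is a placeholder and handle_update is a stand-in, not anyone's real client code):

```python
# Revalidating with an ETag lets the origin answer 304 (no body) when
# live.json hasn't changed, instead of a fresh cache key and a full 200
# every second.
import time
import requests

URL = "https://example.cloudfront.net/live.json"   # hypothetical endpoint

def handle_update(doc):
    print("new payload:", doc)                      # stand-in for re-rendering

etag = None
while True:
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(URL, headers=headers, timeout=5)
    if resp.status_code == 304:
        pass                                        # unchanged: no body re-sent
    elif resp.ok:
        etag = resp.headers.get("ETag")
        handle_update(resp.json())
    time.sleep(1)
```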
S3 is great for static content. I was taking the AWS ops course and the instructor mentioned some very large organizations redirect their site to S3 when under DDOS so they can remain on-line. In fact, he said that AWS recommended this solution to them?! Can you fathom someone who is under DDOS, and you tell them, hey, just redirect that our way ;)
Well, I have to say... replication-related issues like this are why I/we are now using a Galera-backed DB cluster. No need to worry about which server is active/passive. You can technically have them all live all the time. In our case we have two live and one failover that only gets accessed by backup scripts and some maintenance tasks.
Once we got the kinks worked out it has been performing amazingly! Wonder if GitHub looked into this kind of a setup before selecting the cluster they did.
Sure. Maybe I should do a writeup for it on my blog at some point in the near future :).
The two main issues we encountered both had to do with search for products/categories on our sites. The first was that Galera/WSREP doesn't support MyISAM replication (it has beta support, but I wouldn't trust it). This meant that we had to transition our fulltext data to something else. The something else in this case was Solr, which has been a much better solution anyway (the fulltext-based search was legacy anyway, so I can kind of count this as a win).
The second issue, the one that was causing random OOM crashes, was partly due to a bug and partly due to the way the developer responsible for the search changes implemented things. The bug part is that Galera doesn't specifically differentiate between a normal table and a temp table. When you have very, very small/fast temporary tables that are created and truncated before the creation of the table is replicated across the cluster, it can leave some of these tables open in memory (memory leak, whoo!). We were able to work around this and have been happy ever since.
If there's any interest I can do a larger writeup about actual implementation of the cluster, caveats and the like.
If Github hasn't gotten their custom HA solution right, will you?
Digging into their fix, they disabled automatic failover -- so all DB failures will now require manual intervention. While this addresses this particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Also, their MySQL replica's misconfiguration upon switching masters is tied to their (stopgap) approach to preventing the hot failover. So, the second problem was due to a misuse/misunderstanding of maintenance-mode.
How is it possible that the slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?
There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!
I think people tend to overestimate the value of nines to the user. It's chiefly a management/VC/busybody metric that has gained importance mainly because it's a high-level, easy-to-understand abstraction. "Well, how much was it down?" Then they spend zillions on failover software, hardware, and talent that could be supplanted by one fewer nine and a simpler architecture.
And really, just to get a dig in here, I believe Arrington shares a big part of the blame for this state of affairs with all of his Dvorak-caliber ignorant harping about Twitter back in the day.
"There is also lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!"
Seriously, why would a status page need to query a db?
I assume that the status server is not actively checking every Github server/service whenever someone pings it. It probably polls the servers every X seconds. The best place to store that type of data is in a DB.
"16 of these repositories were private, and for seven minutes from 8:19 AM to 8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the repository's list of collaborators or team members"
One of those repos was mine. :( Fortunately it was a fresh Rails app without anything important. However, it does make me rethink the security of storing my code on github.
The update strategy of master first is interesting. I've always seen it the other way: update the standby, flip to the standby, verify, then update the original master.
Auto-increment DB keys once again cause horribleness. Nothing new there, I suppose.
And as mentioned, the multi-dyno + DB-read status page is craaaazy. Why oh why isn't this a couple of static objects? Automagically generate and push if you want. Give 'em a 60-second TTL and call it a day. Put them behind a different CDN & DNS than the rest of your site for bonus points.
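A minimal sketch of that "automagically generate and push" idea (service names, health-check URLs, and the output path are all hypothetical; the point is that page views never touch a database):

```python
# Poll on a schedule and publish a static snapshot for the web server
# (or an S3 sync job) to serve.
import json
import os
import time
import urllib.request

SERVICES = {                                   # assumed health-check endpoints
    "web": "https://example.com/up",
    "api": "https://api.example.com/up",
}

def check(url):
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return "up" if resp.status == 200 else "degraded"
    except OSError:
        return "down"

while True:
    snapshot = {name: check(url) for name, url in SERVICES.items()}
    snapshot["generated_at"] = int(time.time())
    with open("status.json.tmp", "w") as f:    # write, then atomically replace
        json.dump(snapshot, f)
    os.replace("status.json.tmp", "status.json")
    time.sleep(60)                             # matches the 60-second TTL idea
```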
Interesting to read about github using MySQL instead of Postgres. Anyone know why? I am just curious because of all the MySQL bashing I hear in the echo chamber.
Genuine question: GitHub is built upon git, which is a rock-solid system for storing data, and in these reports we read that GitHub relies a lot on MySQL, so... did the GitHub guys ponder using git as their data store? Just as an example, in git one can add comments on commits; would it be possible to use that for the GitHub comment function? Or maybe it already is?
Generally, Git will be way too slow for that. Git is typically our bottleneck, since you're dealing with so much overhead and disk access to perform functions.
Databases are best for, well, performing relational queries. In the case of commenting on a commit, if you store them only in the repository it becomes non-trivial to ask "show me all of the comments by this user" unless you have an intermediary cache layer (in which case you're back where you started).
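To make that concrete, here's a hedged sketch (not GitHub's schema, just an illustration of the difference in lookup shape between a relational store and a repo-only store):

```python
# With comments in a relational store, "all comments by this user" is one
# indexed query; with comments kept only in the repo (e.g. git notes) you'd
# walk every annotated commit and filter. All names here are hypothetical.
import sqlite3
import subprocess

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (commit_sha TEXT, author TEXT, body TEXT)")
db.execute("CREATE INDEX idx_author ON comments (author)")

rows = db.execute(
    "SELECT commit_sha, body FROM comments WHERE author = ?", ("some-user",)
).fetchall()

# The git-only equivalent: list every note, open each one, and parse out an
# author yourself -- O(history) work per query unless you build an index,
# at which point you've reinvented the database layer.
notes = subprocess.run(["git", "notes", "list"],
                       capture_output=True, text=True).stdout.splitlines()
for line in notes:
    note_obj, annotated_commit = line.split()
    body = subprocess.run(["git", "show", note_obj],
                          capture_output=True, text=True).stdout
    # ...filter on whatever author convention you invented for the note body.
```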
Thanks for answering. Tell me if I'm wrong but MySQL would be behind a caching layer anyway, so the choice would be between cached git or cached git + mysql.
In git, logging commits on a file from an author is also a kind of join, and it is surprisingly fast, so using git as a data store is a weird idea that I cannot take out of my head.