I know that they have to be apologetic like this, but the simple fact is that GitHub's uptime is fantastic.
I run http://CircleCi.com, and so we have upwards of 10,000 interactions with GitHub per day, whether API calls, clones, pulls, or webhooks. A seriously, seriously small number of them fail. They know what they're doing, and they do a great job.
I'd like to welcome the github ops/dbas to the club of people who've learned the hard way that automated database failover usually causes more downtime than it prevents.
Though it turns into an MMM pile-on, the tool doesn't matter so much as the scenarios do. Automated failover is simply unlikely to make things better, and likely to make things worse, in most scenarios.
Automated database failover is absolutely mandatory for HA environments (as in, there is no way to run a 5 9s system without it) but, poorly done, results in actually reducing your uptime (which is a separate concept from HA).
I've been in a couple of environments in which developers have successfully rolled out automated database failover, and my takeaway is that it's usually not worth the cost: with very, very few exceptions, most organizations can take the several minutes of downtime needed to do a manual failover.
In general, when rolling out these operational environments, they are only ready when you've found and demonstrated 10-12 failure cases and come up with workarounds.
In other words - if you can't demonstrate how your environment will fail, then it's not ready for an HA deployment.
With the possible exception of life-safety systems, credit card processing, stock exchanges, and other "high $ per second" applications, I just don't see getting HA right on transactional databases as worth the effort. Properly rehearsed, a good Ops/DBA (and, in the right environment, NOC) team can execute a decent failover in just a few minutes, and there aren't that many environments (beyond the exceptions listed above) that can't take two or three 5-minute outages a year.
The alternative is your HA manager decides to act wacky on you, and your database downtime is extended.
For some reason, this is rarely (almost never, in my practical experience) a problem with HA systems in networking. With just a modicum of planning, HA Routers, Switches, and Load Balancers Just Seem to Work (tm).
Likewise, HA storage arrays are bulletproof to the point where a lot of reasonably conservative companies are comfortable picking up a single array/frame.
But HA transactional databases - still don't seem to be there.
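For scale, here's the back-of-the-envelope math on what those nines actually allow (a quick sketch, not tied to anyone's SLA, just the downtime budget per nine):

```python
# Quick sketch of the arithmetic behind "two or three 5-minute outages a year".
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (2, 3, 4, 5):
    allowed_fraction = 10 ** -nines               # e.g. 3 nines -> 0.1% allowed downtime
    budget = MINUTES_PER_YEAR * allowed_fraction  # minutes of downtime per year
    print(f"{nines} nines ({(1 - allowed_fraction) * 100:.3f}% uptime): "
          f"{budget:7.1f} min/year of downtime")

# Two or three 5-minute outages (10-15 min/year) fit easily inside four nines
# (~53 min/year) but blow straight through five nines (~5.3 min/year).
```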
Automated failover in the case of too much load is usually not what you want to do. Automated failover in the case of hw/network failure is usually what you want to do. Differentiating the former from the latter is left as an exercise for the reader.
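One hedged sketch of what that differentiation might look like in practice (the thresholds, names, and promotion action below are made up for illustration, not any particular HA tool's behavior):

```python
# Hypothetical sketch only: fail over automatically when the primary is
# unreachable (likely hw/network failure), but page a human when it is merely
# slow (likely load). Thresholds and names are illustrative, not a real API.
import socket
import time

CONNECT_TIMEOUT = 2.0   # assumed: can't even open a TCP connection -> "down"
LATENCY_BUDGET = 0.5    # assumed: connection works but is this slow -> "overloaded"

def probe_primary(host, port=3306):
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=CONNECT_TIMEOUT).close()
    except OSError:
        return "down"
    # A real check would also run a trivial query (SELECT 1) and look at
    # replication lag; TCP connect latency stands in for that here.
    return "overloaded" if time.monotonic() - start > LATENCY_BUDGET else "ok"

def decide(host):
    state = probe_primary(host)
    if state == "down":
        return "promote_replica"   # hardware/network failure: automate
    if state == "overloaded":
        return "alert_operator"    # load problem: failing over just moves it
    return "noop"
```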
Here are the makings of a bad week (Monday, of all things):
- MySQL schema migration causes high load, automated HA solution causes cascading database failure
- MySQL cluster becomes out of sync
- HA solution segfaults
- Redis and MySQL become out of sync
- Incorrect users have access to private repositories!
Cleanup and recovery take time; all I can say is, I'm glad it was not me who had that mess to clean up. I'm sure they are still working on it, too!
This brings to mind some of my bad days... the OOM killer decides your Sybase database is using too much memory. A hardware error on a DRBD master causes silent data corruption (this took a lot of recovery time on TBs of data). I've also been bitten by a MySQL master/slave becoming out of sync. That is a bad place to be in... do you copy your master database to the slaves? That takes a long time, even on a fast network.
The lack of any negative response on this thread is a testament both to the thoroughness of the post-mortem, and the outstanding quality of GitHub in general.
In GitHub we trust. I can't imagine putting my code anywhere else right now.
We host our status site on Heroku to ensure its availability during an outage. However, during our downtime on Tuesday our status site experienced some availability issues.
As traffic to the status site began to ramp up, we increased the number of dynos running from 8 to 64 and finally 90. This had a negative effect since we were running an old development database addon (shared database). The number of dynos maxed out the available connections to the database, causing additional processes to crash.
Ninety dynos for a status page? What was going on there?
At the time of the outage, the status site was seeing upwards of 30,000 requests/minute.
As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led to us massively increasing dynos as quickly as we could, since it appeared (at the time) that CPU burn was a significant cause of the slowness. This was in part caused by all the dynos repeatedly crashing. That's how we ended up going from the previous 8 to 90.
Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.
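For what it's worth, the connection math makes the crash loop easy to see. A rough sketch with assumed numbers (the shared-database cap and per-dyno pool size below are purely illustrative, not Heroku's actual figures):

```python
# Assumed numbers for illustration only: suppose the old shared-database addon
# allows ~20 concurrent connections and each dyno keeps a pool of 2.
CONNECTION_LIMIT = 20   # assumed cap on the shared database plan
POOL_PER_DYNO = 2       # assumed connection pool size per dyno

for dynos in (8, 64, 90):
    wanted = dynos * POOL_PER_DYNO
    over = wanted - CONNECTION_LIMIT
    print(f"{dynos:2d} dynos want {wanted:3d} connections "
          f"(limit {CONNECTION_LIMIT}): {'over by ' + str(over) if over > 0 else 'fits'}")

# Past the cap, new dynos can't check out a connection at all, crash, restart,
# and look like a CPU problem, so adding dynos only makes it worse.
```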
What prevented you from just caching the status page and then refilling the cache manually every X seconds? I'm sure a status page that is a few seconds old, given the system-wide meltdown, wouldn't have been an unreasonable compromise?
Anyone tested S3's static page hosting under heavy load? I would think you could just update the static file as a result of some events fired by your internal monitoring process.
We use S3 behind a 1-second max-age CloudFront distribution to serve The Verge liveblog. It's been nothing but rock solid. We essentially create a static site and push up JSON blobs. See here:
This is really interesting -- thanks for sharing. It seems to me that you could probably have nginx running on a regular box and then CloudFront as a caching CDN to avoid the S3 update delay.
Probably could figure that out, yeah. But we didn't want to take any chances given how important it was to get our live blog situation under control.
[edit]
Which is to say, we wanted a rock solid network and to essentially be a drop in a bucket of traffic, even at the insane burst that The Verge live blog gets.
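A rough sketch of that push-static-JSON pattern (the bucket and key names are made up, and boto3 is used only for illustration; the thread doesn't say what tooling The Verge actually uses):

```python
# Minimal sketch of "push static JSON to S3, serve via a short-TTL CDN".
import json
import boto3

s3 = boto3.client("s3")

def publish_status(payload: dict) -> None:
    s3.put_object(
        Bucket="example-liveblog-static",   # hypothetical bucket
        Key="live.json",
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
        CacheControl="max-age=1",           # matches the 1-second CDN TTL above
    )

publish_status({"status": "ok", "updated_at": 1347374400})
```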
Could you say more about using both the Cache-Control and a query string of epoch time? In particular the query string has me puzzled. On its face it seems to decrease your cache hit ratio, with no/little benefit.
I'm assuming the epoch time is the client's local time. The clock skew across the client population increases the number of cache keys active at any one time.
The incrementing query string also forces a new cache key once per second. Those would force a cache miss and a complete request to S3 even when the content has not changed. It's even worse with the skew, as you now force a cache miss per second for each unique epoch time in your clients' clocks.
Without the query string the cache could do a conditional GET for live.json. That would save latency & bytes as the origin could respond with a 304 instead of the complete 200.
Great point. I don't speak for the guys that made the decision to append the timestamp to the query, but I assume our concern was intermediate network caches that don't honor low TTLs. Though I don't know how well founded that is, we won't ever have to deal with the issue if we take control of it with the URL string.
It'd be interesting to see how wide the key space is due to clock skew. I suppose we could specify some number and consider it a global counter that is incremented every second; then when someone comes in for the first time they can be synced with the global incrementing counter. That counter is used to ensure a fresh CloudFront hit.
I think at the end of the day, these issues haven't been a huge concern for a one-month emergency project, but they are good points.
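To make the conditional-GET suggestion above concrete, here's a hedged sketch using the requests library (the URL is a placeholder and handle_update is a stand-in, not anyone's real client code):

```python
# Revalidating with an ETag lets the origin answer 304 (no body) when
# live.json hasn't changed, instead of a fresh cache key and a full 200
# every second.
import time
import requests

URL = "https://example.cloudfront.net/live.json"   # hypothetical endpoint

def handle_update(doc):
    print("new payload:", doc)                      # stand-in for re-rendering

etag = None
while True:
    headers = {"If-None-Match": etag} if etag else {}
    resp = requests.get(URL, headers=headers, timeout=5)
    if resp.status_code == 304:
        pass                                        # unchanged: no body re-sent
    elif resp.ok:
        etag = resp.headers.get("ETag")
        handle_update(resp.json())
    time.sleep(1)
```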
S3 is great for static content. I was taking the AWS ops course and the instructor mentioned some very large organizations redirect their site to S3 when under DDOS so they can remain on-line. In fact, he said that AWS recommended this solution to them?! Can you fathom someone who is under DDOS, and you tell them, hey, just redirect that our way ;)
Well, I have to say... replication-related issues like this are why I/we are now using a Galera-backed DB cluster. No need to worry about which server is active/passive. You can technically have them all live all the time. In our case we have two live and one failover that only gets accessed by backup scripts and some maintenance tasks.
Once we got the kinks worked out it has been performing amazingly! Wonder if GitHub looked into this kind of a setup before selecting the cluster they did.
Sure. Maybe I should do a writeup for it on my blog at some point in the near future :).
The two main issues we encountered both had to do with search for products/categories on our sites. The first was that Galera/WSREP doesn't support MyISAM replication (it has beta support, but I wouldn't trust it). This meant that we had to transition our fulltext data to something else. The something else in this case was Solr, which has been a much better solution anyway (the fulltext-based search was legacy anyway, so I can kind of count this as a win).
The second issue, the one that was causing random OOM crashes, was partly due to a bug and partly due to the way the developer responsible for the search changes implemented things. The bug part is that Galera doesn't specifically differentiate between a normal table and a temp table. When you have very, very small/fast temporary tables that are created and truncated before the creation of the table is replicated across the cluster, it can leave some of these tables open in memory (memory leak, whoo!). We were able to work around this and have been happy ever since.
If there's any interest I can do a larger writeup about actual implementation of the cluster, caveats and the like.
If Github hasn't gotten their custom HA solution right, will you?
Digging into their fix, they disabled automatic failover -- so all DB failures will now require manual intervention. While this addresses this particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Also, their MySQL replica's misconfiguration upon switching masters is tied to their (stopgap) approach to preventing the hot failover. So, the second problem was due to a misuse/misunderstanding of maintenance-mode.
How is it possible that the slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?
There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!
I think people tend to overestimate the value of nines to the user. It's chiefly a management/VC/busybody metric that has gained importance mainly because it's a high-level, easy-to-understand abstraction. "Well, how much was it down?" Then they spend zillions on failover software, hardware, and talent that could be supplanted by one fewer nine and a simpler architecture.
And really, just to get a dig in here, I believe Arrington shares a big part of the blame for this state of affairs with all of his Dvorak-caliber ignorant harping about Twitter back in the day.
"There is also lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!"
Seriously, why would a status page need to query a db?
I assume that the status server is not actively checking every Github server/service whenever someone pings it. It probably polls the servers every X seconds. The best place to store that type of data is in a DB.
"16 of these repositories were private, and for seven minutes from 8:19 AM to 8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the repository's list of collaborators or team members"
One of those repos was mine. :( Fortunately it was a fresh Rails app without anything important. However, it does make me rethink the security of storing my code on github.
The update strategy of master first is interesting. I've always seen it the other way: update the standby, flip to the standby, verify, then update the original master.
Auto-increment DB keys once again cause horribleness. Nothing new there, I suppose.
And as mentioned, the multi-dyno + DB-read status page is craaaazy. Why oh why isn't this a couple of static objects? Automagically generate and push if you want. Give 'em a 60-second TTL and call it a day. Put them behind a different CDN & DNS than the rest of your site for bonus points.
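A minimal sketch of that "automagically generate and push" idea (service names, health-check URLs, and the output path are all hypothetical; the point is that page views never touch a database):

```python
# Poll on a schedule and publish a static snapshot for the web server
# (or an S3 sync job) to serve.
import json
import os
import time
import urllib.request

SERVICES = {                                   # assumed health-check endpoints
    "web": "https://example.com/up",
    "api": "https://api.example.com/up",
}

def check(url):
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return "up" if resp.status == 200 else "degraded"
    except OSError:
        return "down"

while True:
    snapshot = {name: check(url) for name, url in SERVICES.items()}
    snapshot["generated_at"] = int(time.time())
    with open("status.json.tmp", "w") as f:    # write, then atomically replace
        json.dump(snapshot, f)
    os.replace("status.json.tmp", "status.json")
    time.sleep(60)                             # matches the 60-second TTL idea
```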
Interesting to read about github using MySQL instead of Postgres. Anyone know why? I am just curious because of all the MySQL bashing I hear in the echo chamber.
Genuine question: GitHub is built upon git, which is a rock-solid system for storing data, and in these reports we read that GitHub relies a lot on MySQL, so... did the GitHub guys ponder using git as their data store? Just as an example, in git one can add comments on commits; would it be possible to use that for the GitHub comment function? Or maybe it already is?
Generally, Git will be way too slow for that. Git is typically our bottleneck, since you're dealing with so much overhead and disk access to perform functions.
Databases are best for, well, performing relational queries. In the case of commenting on a commit, if you store them only in the repository it becomes non-trivial to ask "show me all of the comments by this user" unless you have an intermediary cache layer (in which case you're back where you started).
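To make that concrete, here's a hedged sketch (not GitHub's schema, just an illustration of the difference in lookup shape between a relational store and a repo-only store):

```python
# With comments in a relational store, "all comments by this user" is one
# indexed query; with comments kept only in the repo (e.g. git notes) you'd
# walk every annotated commit and filter. All names here are hypothetical.
import sqlite3
import subprocess

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (commit_sha TEXT, author TEXT, body TEXT)")
db.execute("CREATE INDEX idx_author ON comments (author)")

rows = db.execute(
    "SELECT commit_sha, body FROM comments WHERE author = ?", ("some-user",)
).fetchall()

# The git-only equivalent: list every note, open each one, and parse out an
# author yourself -- O(history) work per query unless you build an index,
# at which point you've reinvented the database layer.
notes = subprocess.run(["git", "notes", "list"],
                       capture_output=True, text=True).stdout.splitlines()
for line in notes:
    note_obj, annotated_commit = line.split()
    body = subprocess.run(["git", "show", note_obj],
                          capture_output=True, text=True).stdout
    # ...filter on whatever author convention you invented for the note body.
```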
Thanks for answering. Tell me if I'm wrong but MySQL would be behind a caching layer anyway, so the choice would be between cached git or cached git + mysql.
In git, logging commits on a file from an author is also a kind of join, and it is surprisingly fast, so using git as a data store is a weird idea that I cannot take out of my head.