Google App Engine Broken For 4 Hours And Counting

jacquesm · on July 2, 2009

This is a bit like flying vs driving. If you're in the driver's seat you have control - you hope - and your destiny is in your own hands; if you're in a plane, someone else is driving (unless you are an airline pilot). The accident rate per mile is lower for planes, but when something goes wrong it usually does so in ways that make the headlines. Still, more people die driving than flying.

When the 'cloud' goes down (or at least some part of it) then you'll notice this immediately because of the large number of sites going down all at once. But when you compare it with the accumulated downtime of all those users had they not been 'cloud users' but hosted on their own kit then it is very well possible that the balance is still in favour of hosting in the cloud.

nickb · on July 2, 2009

Nice analogy but like all analogies, it's too simplistic and flawed. You left out one important and critical part: the hypothetical passenger in your example is tied to a specific plane/airline. If you don't like your pilot or plane type, you cannot move to a different airline or request a different plane or a different pilot since you're chained to the specific plane.

Due to Google App Engine's API lock-in, you're stuck with them as a provider... quite possibly forever due to heavy BigTable dependency.

Even though I'm a huge fan of cloud computing, I'd rather use a strategy that uses platforms/planes that are built from reusable parts and allow you to switch your plane/airline provider as you please. Don't like Delta? Just go to AA counter and you don't have to change your luggage, clothing etc.

Until there's a second provider that offers full compatibility with GAE, I'd avoid GAE like the plague.

drcode · on July 2, 2009

I'm sorry sir, but if my pilot is stuck in a thunderstorm and I don't think he knows what he's doing, I can't "request a different plane."

jerf · on July 2, 2009

It's a metaphor, not a description or some sort of iron law of physics. If I was on Google App Engine right now and there was a competitor that I could switch to, then I damn well could be up, especially if I took the opportunity to keep both options actively available for myself. No matter how hard "switching planes in midair" might be, it's just a metaphor.

eggnet · on July 2, 2009

Yes but if you survive the flight you can switch after you land.

netsp · on July 3, 2009

That's not a fair comparison. You can't switch mid flight or mid cloud crash. But you can evaluate safety records every time you feel inclined and swap your airlines at some point.

If GAE fails to live up to the better-than-DIY-on-average promise, you can't leave.

imajes · on July 2, 2009

or at least carry a parachute. business continuity plans should really be part of the spec for everyone who's making money from their app.

nickb · on July 2, 2009

Does that account for 6 hours of downtime with minimal information as to what's going on? Good luck with that!

rjurney · on July 2, 2009

Actually, if this thing works then there is no longer an API lock-in with Google App Engine: http://code.google.com/p/appscale/

Assuming you can still dump your data out of Googlage?

jshen · on July 2, 2009

All understanding of the real world is simplistic and flawed. Even your analogy, because there is no good cloud-as-a-service provider that offers anything as good as BigTable for distributed storage plus MapReduce.

rapind · on July 3, 2009

I think a better analogy for cloud computing is electricity. For mission critical applications investing the time and effort into a power / app backup is probably a good idea, but the onus is on us.

vorador · on July 3, 2009

Actually, as django abstracts the GAE-api, there's still a way to escape.

asmithmd1 · on July 2, 2009

Great analogy. I would rather have all the big brains at Google working to solve the problem than my puny brain trying to restart my server.

enjo · on July 3, 2009

More specifically:

We have no real network administrators. Within our engineering team we collectively have the skill to be effective at systems administration, but the hardware side is really a complete mystery.

Co-location mostly solves that, but the cloud takes it a step further. By running in a virtualized environment we can handle what we're good at and let others build complex data centers to scale our traffic.

When you're bootstrapping a start-up, that's just huge.

vaksel · on July 2, 2009

exactly, shit happens, if you host on your own, you are just as vulnerable to power outages etc. + at least when a cloud goes down, they have hundreds of pros trying to fix it

imajes · on July 2, 2009

hope not. too many cooks?

wvenable · on July 2, 2009

I do my own hosting, but I agree with this completely. Downtime at any scale is inevitable, but with Google's massive infrastructure you're far better off.

gizmo · on July 2, 2009

A lot of sites get 99.99% uptime easily. It's not difficult when you know what you're doing and have plenty of redundancy. When you depend on any cloud stuff you can't do any kind of graceful degradation. Google goes down, you go down. Google goes bankrupt, you go bankrupt.

Considering how cheap dedicated servers are moving a service to the cloud makes little sense (exceptions notwithstanding).

arockwell · on July 2, 2009

99.99% is only 52 minutes of downtime a year... I don't think any of S3, Rackspace, Google AppEngine, or even www.amazon.com have uptime that good this year.

Getting that kind of uptime is much harder than it sounds and for a lot of websites not worth the extra cost. How much money would you pay to go from 99.9% uptime (~9 hours/year) to 99.99% (52 minutes)?
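
For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch. It assumes a 365-day year and treats availability as a simple fraction of wall-clock time; the function name is made up for illustration and does not come from any library or SLA.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Three, four and five nines, as discussed in the thread.
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Under those assumptions three nines works out to roughly 8.8 hours a year and four nines to roughly 52.6 minutes, which matches the "~9 hours" and "52 minutes" quoted above.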

sho · on July 2, 2009

I disagree. For a small to medium size site, getting 99.99% uptime should basically be the default. If you have a competent staff and a decent provider, about the only thing that will take you down is a power/equipment failure at the DC - which does happen, from time to time, admittedly. Yeah, Rackspace had some problems this year, but most of the other tier 1 hosts have been rock solid.

Obviously it's more complex for a large site but for the vast majority I would say that level of uptime is the rule rather than the exception.

Three nines is pretty unacceptable in this day and age. Providers might only guarantee that level of uptime but if they really were down that much I'd run a mile. Netcraft is your friend!

update: edited post to better reflect reality

lief79 · on July 3, 2009

How are you defining a site and uptime? Are you talking about dynamically switching systems to allow for upgrades without downtime?

I'm mostly wondering how it compares with the setup at my current location.

I have worked on systems that were designed for this, but I'm not sure if it's cost effective for most web applications. Most things can wait on occasion.

Thank you.

ErrantX · on July 3, 2009

And what happens when an engineering fluke hits your KV store and locks it up, and it takes you 10 hours to fix? :)

You see, you're not actually paying for the 3 9's or the 5 9's. That is just corporate bullshit. You're paying for the promise that whatever the problem, someone will be able to fix it in a few hours at most.

ryanvm · on July 2, 2009

The hardware may be inexpensive, but "knowing what you're doing" and being able to pull 5 nines is definitely not cheap.

moe · on July 2, 2009

5 nines is not something you get on paper worth the ink normally. SLAs in the civil sector generally top out at 3 nines. Very, very rarely you talk about 4 and those are dealt out by the insurance company, not by the service provider.

I'm talking about real SLAs with compensation here, mind you, not the toilet paper you get from every cheapo ISP.

skorgu · on July 2, 2009

The cloud is just an enormous SPOF. I like having my own kit somewhere in the loop if only to throw a proper error message.

fauigerzigerk · on July 2, 2009

The bad thing is we can do nothing but wait. The good thing is we don't have to do anything but wait ;-)

drcode · on July 2, 2009

Now fixed again, as per my own app and as per http://groups.google.com/group/google-appengine-downtime-not...

drcode · on July 2, 2009

Not sure why I was downvoted: It just got fixed a few minutes ago, as stated on the link and as per my own app.

drcode · on July 2, 2009

I guess anything I post in this thread is going to be downvoted :)

defied · on July 2, 2009

At least their communication is good: http://groups.google.com/group/google-appengine-downtime-not...

jbox · on July 2, 2009

I disagree. The GAE status page was down for hours: http://code.google.com/status/appengine

What's the point of having a status page if it's only up as long as your service? We shouldn't have to hunt around a Google Group for information about what's going on.

jpeterson · on July 2, 2009

What's the point of hosting in a cloud if there's still a single point of failure like this? I realize that it's currently free, but I thought one of the main advantages of moving to the cloud was redundancy and fault tolerance.

tybris · on July 2, 2009

Cloud outages may not be frequent, but they sure are noticeable.

bryanwoods · on July 2, 2009

I'm so tired of reading about web servers/services going down.

jrockway · on July 2, 2009

You are supposed to say something clever like, "Too bad TechCrunch isn't hosted on App Engine..."

peter123 · on July 2, 2009

This is worse than the 8-hr outage of S3 sometime ago... most apps could still respond without S3 static assets. If your entire app is hosted on AppEngine, you're screwed for 4 hrs and counting...

slig · on July 2, 2009

> This is worse than the 8-hr outage of S3 sometime ago

I don't think so. JavaScript files hosted on S3 would hang the page loading, and without CSS/images the app would be useless too.

peter123 · on July 2, 2009

But you could quickly recover by hosting those JS files yourself and relinking them. If your app is coupled tightly with AppEngine APIs, then there is nowhere else you can host your app.

tlrobinson · on July 3, 2009

Theoretically you could fire up AppScale on EC2: http://code.google.com/p/appscale/

drusenko · on July 3, 2009

What about your data? Most web applications have persistent data of some sort that is vital to the user experience -- without it, you don't have much of a site.

[Edit: And keeping a hot copy of that data is a lot harder than it sounds]

danw · on July 3, 2009

GAE still had read-only access to data so it would be possible to backup and move elsewhere

sschueller · on July 2, 2009

At least we got a detailed explanation from AWS as to what happened and what was put in place to prevent it from happening again.

ezmobius · on July 2, 2009

multi-tenant architectures are the geocities of cloud computing. This is the main problem with something like gae, if they have an internal problem it takes down their entire cloud and all the apps with it.

grandalf · on July 2, 2009

Which means that a lot more people are mad and it gets fixed sooner. Have you ever had a problem with cable TV or phone service that just affects your home? It can take weeks. If it affects a whole city, it is fixed within hours.

drcode · on July 2, 2009

hmm... My appengine site is running just fine

Update: Actually, my app is in the "read only mode" they described... the moment I tried to update anything it went to hell :)

andrewljohnson · on July 2, 2009

I concur, my blog on App Engine is down.