Heroku is down (heroku.com)
117 points by jackmoore on June 7, 2012 | 98 comments



When will heroku have a multizone offering?

I know most sites hosted on Heroku are pet projects with no real need for that type of uptime, but this kind of downtime makes it impossible to use them as an enterprise customer.


Agreed. We are finally scaling from "small scale production" to having actual customers we want to take care of. Worse, we began a huge marketing push this week and had been seeing promising results. Now we'll probably miss a boatload of potential customers.

Heroku had been a joy to work with until now. I ask of you all (as most of you are far more experienced in these matters than I), what is the traditional practice to mitigate this sort of risk? Paying for hosting with two separate companies?

EDIT: I understand the benefits of cloud hosting over hiring a sysadmin. At the same time, I'm interested to learn about what possible solutions there are. I'm at the point where I don't know what I don't know, and even the name of a topic or technology would be a huge help.


Heroku is really good. You will probably have more downtime than them unless you hire a whole team of operations and admins and pay a lot of money for hosting. It may also take a while to build out the deployment and development tools they have.


Honestly, that's pious bullshit. People keep saying, every time Heroku gets mentioned, that it'd do better on availability than you would yourself, but you'd have to be borderline incompetent for a more traditional VPS / hardware setup to rack up as much downtime as Heroku and AWS combined.

We moved all but one of our Rails apps off of Heroku precisely because of the frequent downtime -- or, rather, that was the last straw; there were other issues, notably the difficulty in debugging production issues, that had us already debating such. Heroku has gotten somewhat better, but it's still down far more than anything else that we use. (And we have services spread across Linode, Rackspace, AWS and the mentioned one app on Heroku.)

You'll have better uptime in most cases with a standard nginx / passenger setup on a $20 VPS than you will with Heroku.


They are at 99.97% uptime. It's not that easy to get to even 99.5%; just the flakiness of a standard uplink will put you below that.


Huh? The flakiness of a standard uplink will have you down for 44 hours a year? On what planet? With our non-AWS hosting providers we tend to see 2-4 hours of network issues per year. Heroku's uptime also hasn't historically been anywhere near 99.97%. They were down for several days last year in The Great AWS Failure.


This is in the context of hosting it yourself, not using a different large-scale business provider.


I don't think anyone ever considers hosting it on a desktop in the closet on DSL at home/office.

The competition for something like Heroku is EC2, VPS, or dedicated servers, in commercial colocation facilities. A good hosting facility is going to be a lot closer to 99.995% uptime for network and power to the box, but you can of course screw up past that point on your own.


> It's not that easy to get to even 99.5%

99.9% is very doable. If your sites get less than that, you should consider hiring better staff.

Also, the Heroku figure (99.97%) is a lie; just skim their status page.


Having administered several "standard uplinks" from 100 Mbit/sec all the way to multiple ten Gbit/sec links, I can safely say that if your transit isn't 99.999%, you need new transit.


Yeah, I think Heroku is better than other options for being a fast/easy way to deploy, with a great UI and some useful tools, but the combination of lack of visibility into internals for debugging, Heroku outages, and EC2 outages is the reason I don't use it for anything in production.

I'd seriously consider Heroku if they were in multiple regions (fuck AZs, those are a lie). There are huge advantages to a PaaS in terms of speed and lack of hassle, and a well run PaaS is better than most developers at operations, patching, security, etc. (We're kind of unique in that we're better at operations than development, though.)


On June 1st (this month) they had routing errors. On June 5th (the day before yesterday) they had routing errors again.

My apps were triggering errors left and right, and I was bashing my head against the keyboard because I noticed the errors before their status update and believed it was our fault.

Plus, this last week the latency of requests has been incredibly high, while the traffic to my apps has not increased. Again, I assumed it was my fault.

I'm contemplating moving back to EC2 or Linode.


We've recently begun to move our entire infrastructure to external providers (VPS and cloud) because we felt that we couldn't guarantee a sufficiently high level of quality and had to devote too much time and money to operations.

In just over four months we've probably already had more downtime than in the past 5 years, despite paying quite a lot for "redundant", clustered offerings. There was one nasty bug in the hypervisor that killed the entirety of one of our providers' VPS infrastructure for over a day.

Our experience with "the cloud" was a little bit better, but we aren't entirely satisfied.

I don't know, maybe we should just order services from three different providers and mirror our applications ourselves as a fail-over mechanism.


Heroku status is currently showing production uptime as 99.97%. That's not too shabby, and definitely not as bad as you make out.


He implied that Heroku is not his sole provider. I'm skeptical of that figure, too, as it's probably something on the Heroku sales site (99.7% to the 30-byte health check!), or it's manually updated, in which case it's probably rounded.


Yes. I recommend having a look at either Proxmox (proxmox.com), which can do OpenVZ and KVM simultaneously, or XCP for Xen-based stuff; see http://www.xen.org/products/cloudxen.html


http://AppFog.com has Ruby, Python and Node, and multiple regions today: Ireland, Singapore and the US, with Rackspace and HP coming soon. All free and backed by Cloud Foundry.


> maybe we should just order services from three different providers and mirror our applications ourselves as a fail-over mechanism.

Yes, yes, a thousand times yes. And dedicated servers don't cost much more than VPSes.


They went down literally the moment our demo thing started. I swear this is because of me.


Same here! This happened last year too! Annoying.


What luck.


As a newbie, I've really enjoyed deploying on Heroku. However these outages really terrify me. Is there an easy way to host a redundant version of your app outside of Heroku?


Ask a simple question, get a simple answer: No. This is no knock on Heroku, either. No hosting solution can make that easy: it is an enterprise requirement which implies six figures of investment and a dedicated ops team with no newbies on it.


I'm sorry, but that's completely untrue. I can't speak for using Heroku specifically, but in the end hosting is just running an app. It's fairly trivial to make an app multi-provider these days through a plethora of methods. DNS with a low TTL is on the easiest-to-approach end, provided your app is designed for it and aware of the complications, which implies thinking about your database and other supporting architecture from the perspective of a multihomed setup. If your app isn't idempotent or designed to be multihomed, it'll be harder, but six figures of investment is insanely high even in that awful case.

Perspective: I could throw an app on Rackspace and Amazon, with a replicated database, for under $200, in about a day.
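As a rough illustration of the DNS-with-a-low-TTL approach described above, here is a sketch of a health check that repoints a hostname at whichever provider is currently answering. The provider IPs, the hostname, and update_dns_record() are placeholders, not anything provider-specific; a real setup would call your DNS host's API and run the check from somewhere outside both providers.

    # Sketch: poll both providers, keep DNS pointed at a healthy one.
    import time
    import urllib.request

    PROVIDERS = {
        "provider_a": "198.51.100.10",   # hypothetical app IP at provider A
        "provider_b": "203.0.113.20",    # hypothetical app IP at provider B
    }
    CHECK_PATH = "/health"               # assumed lightweight health endpoint
    TTL = 60                             # low TTL so clients re-resolve quickly

    def is_healthy(ip):
        """Return True if the app answers its health check at this IP."""
        try:
            with urllib.request.urlopen("http://%s%s" % (ip, CHECK_PATH), timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def update_dns_record(ip):
        """Placeholder: point www.example.com at `ip` via your DNS host's API."""
        print("would repoint www.example.com -> %s (TTL %ds)" % (ip, TTL))

    if __name__ == "__main__":
        while True:
            healthy = [ip for ip in PROVIDERS.values() if is_healthy(ip)]
            if healthy:
                update_dns_record(healthy[0])  # prefer the first healthy provider
            time.sleep(30)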


One of the problems with using multiple providers is that you can't use any provider-specific features.

For example, I really like AWS's security groups and ELBs. Those serve as my firewall and my load balancer and SSL terminator.

Replicating the application to another service means configuring and testing all that on my own.

If I use Heroku and their logging system, then replicating it to another provider means I need to be an rsyslog expert.

I don't really want to be an expert on rsyslog, postgresql configuration, floating IPs for HA LB, the best IO scheduler for file systems, etc. As someone who is in charge of all the sysadmin duties, and is solely responsible for writing all the business and db logic for several e-commerce sites, I want to spend my time on writing code. Not fucking around with figuring out the syntax for iptables.


Ultimately this is why it would be nice for the code/configuration to be independent of the operations. I agree it is unreasonable to expect a random developer to build AND MAINTAIN the entire stack for every project. PaaS makes a lot of sense, especially as a starting place.

The solution is either to have a PaaS provider who ruthlessly eliminates single points of failure (there isn't one, currently), or use some standardized software system which can be operated by multiple independent operators with nothing shared. Unfortunately, the only vendor-independent infrastructure is the physical server, various forms of VPS, etc. -- it's all at the IaaS level. As far as I know there's no PaaS type thing with a common interface which arbitrary providers can operate, with some kind of marketplace for users to pick operators independently from the technology.


> The solution is either to have a PaaS provider who ruthlessly eliminates single points of failure (there isn't one, currently)

This isn't possible by definition, right? The PaaS provider itself becomes the single point of failure.

Rings a bit like a "Who created God?" argument. "What single entity can I use to defend against failures by a single entity?"


Theoretically, sure, but it's an engineering and economic thing.

You can mitigate specific risks, and you try to prioritize those based on cost, frequency, and severity. If there were a great redundant provider with good authentication on accounts, a strong balance sheet and business, and sane policies on managing accounts, you would be fairly safe using just that provider. After all, you yourselves could always be handed a court order to cease providing services, say if you do something some troll has patented. It kind of depends on your application, too -- if I were doing a WikiLeaks, a bitcoin exchange, a torrent site or some other legally at-risk business, I'd want country-level separation across multiple providers, at least as a cold backup. A casual game for Facebook or mobile? Not really much of a concern.


That's exactly why you don't want to use those features, for what it's worth.


That means that I need to know about (and program Chef to do):

logging

security hardening

firewalls

backups (both point-in-time and complete backups, both for the database and for any other artifacts like image uploads)

testing backup recovery (both the point-in-time and complete backups; a minimal sketch follows after this list)

replication

HA

LB

SSL termination

postgresql configuration

monitoring (security, system stats, application availability, individual process running)

performance analysis (both at application and system level)

method for deploying updates (to the application and to the system)

And the list goes on; you could easily make a career out of focusing on PostgreSQL configuration alone. These aren't simple things to do.

Different providers have different limitations on what you can do here. For example, AWS doesn't support multicast, which limits your options for doing High Availability.
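To give a sense of the weight of just one item from that list, here is a minimal sketch of a complete database backup plus a basic restore test, written as a plain script rather than a Chef recipe; the database name, scratch database and backup path are assumptions, and point-in-time recovery (WAL archiving) is a separate, larger job on top of this.

    # Sketch: dump the production DB and prove the dump actually restores.
    import datetime
    import subprocess

    DB_NAME = "app_production"      # assumed production database
    SCRATCH_DB = "restore_check"    # throwaway DB used to test the dump
    BACKUP_DIR = "/var/backups/pg"  # assumed backup location

    def take_backup():
        """Dump the database in custom format and return the dump path."""
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
        path = "%s/%s-%s.dump" % (BACKUP_DIR, DB_NAME, stamp)
        subprocess.check_call(["pg_dump", "-Fc", "-f", path, DB_NAME])
        return path

    def test_restore(path):
        """Restore the dump into a scratch database to prove it is usable."""
        subprocess.call(["dropdb", SCRATCH_DB])  # ignore failure if it doesn't exist yet
        subprocess.check_call(["createdb", SCRATCH_DB])
        subprocess.check_call(["pg_restore", "-d", SCRATCH_DB, path])

    if __name__ == "__main__":
        dump = take_backup()
        test_restore(dump)
        print("backup written and restore-tested: %s" % dump)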


Know about? Program Chef to do? How about: be on call 24/7 to handle random problems in? The best, most competent ops teams can all tell you horror stories about black swan outages traceable to some of the least likely components in the stack you just outlined.

Almost all of the things you outlined have in the past blown up in real-world deployments, often at cloud hosting providers where you didn't know about it because a well-paid, well-trained ops team hid the drama from you.

So yeah, definitely factor that in to the cost of hosting at a cloud provider. You're absolutely right.


I think before you put an app on the public Internet for people to consume, you should be comfortable with everything you listed plus more. You might not be doing it for the app at hand, but you should know it at least. Especially if you're doing a solo founder thing, you might be having to implement all of these things before you can afford ops. If you're deploying services without some of the knowledge in your list, too, you're setting yourself up for trouble.

It's not an insurmountable barrier, for sure. Knowing how to operate your own code is a great feeling.


You do realize that it's a 6-figure investment to hire an engineer comfortable with all of those things plus more to deploy a multihomed setup?


A) My comment was very specifically focused on you, acting as a solo developer/founder, not a person you'd hire.

B) A multihomed setup is not a black box of mystery, and is nowhere near as expensive as people are saying from the hip.


> My comment was very specifically focused on you, acting as a solo developer/founder, not a person you'd hire.

Then "you," having acquired years of experience in designing and implementing distributed services, are making at least a 6-figure investment in opportunity cost.


I am intrigued. Do you have any quick links on hand, or pointers on how one would go about learning how to set up something like this? Or just learning about how to distribute services in general?

That is to say, I know my way around Linux and have set up single/standalone servers for all sorts of services, but have never known remotely where to begin on the distributed side of things, much less understood what is required to make something "designed to be multihomed".


I learned from trial and error and also on the job. In my experience, how-tos are typically antiquated or not terribly clear. My best advice would be to dig into ServerFault and Google what you don't understand.


I did too; for instance, I learned how to terminate DS1s and PRIs by trial-and-error punching down every possible combination of little colored wires, and to this day remember that it's ESF/B8ZS and not AMI because that setting change, at 10:30PM on a Friday night, was when I got the Livingston box to light up properly.

Fun? Yes. Miss it? Fuck no. Recommend people relive the 2012 version of the experience? Huh? Get back to work writing code. Re-read the Wikipedia page on "comparative advantage" first if you need to motivate yourself.


The context of the original question was improving the availability of a product deployed to Heroku or other PaaS, which implies a lack of dedicated operations staff.

An experienced ops team willing to be on call 24x7 is easily six figures by itself.


If you are running an app that is utilized by more people than just yourself, there is never "a lack of dedicated operations staff". That situation simply does not exist in any known realm. In the case of a solo developer, you are the operations staff. You are on call, 24x7.

With that in mind, everything I talk about in my reply is doable by one person, or a solo developer. A cursory understanding of administration goes far, and if you're deploying to Heroku, you should know enough about administration to not be completely in the dark when things fall apart.

Deploying a mass-consumable Web service implies that you have accepted the fact that you are now a developer and administrator.


If you use Cloudflare it will serve a static version of the site until it is back online.


I always use Cloudflare in front of Heroku; mainly because of that and the cheap & simple SSL set-up that they have.


Do note that Cloudflare throws malware/captcha notices at legitimate users, which is extremely annoying.

When I get such a screen I immediately close the website I'm visiting. And no, contrary to what they tell me, I do not have "a virus" nor is my computer part of a botnet.


Also, careful with Cloudflare SSL. It only secures between the browser and Cloudflare's data center. The connection between your server and Cloudflare remains non-SSL.


You actually have the option to enable SSL both on (customer->cloudflare) and (cloudflare->your server).


And, as I understand it, you can do:

customer => Cloudflare (SSL)

then

Cloudflare => Heroku (SSL through Heroku piggyback)


If status.heroku.com goes down due to 500 errors (overloading), use https://status-old.heroku.com/


The incident-specific page is often up when the main page is not; in this case:

https://status.heroku.com/incidents/372


It seems like they are quick and post a lot of updates on this status page each time something's wrong. (I am not using Heroku, but I had the 'joy' of trying to run a site on Dreamhost for a while, so I have seen the worst.)


Red Hat was timing its blog postings in anticipation of this:

https://openshift.redhat.com/community/blogs/new-openshift-r...


There's no big announcement there that I see. They release a blog post like this at least once a week.



What is it supposed to do?


I think it's similar to http://isitchristmas.com/ . It's a simple answer that's always the same, to drive the point home: you are always the one responsible for your uptime, no matter whether you choose dedicated hosting, the cloud, your own closet, etc. You can't outsource responsibility.


It's so scary to think that our entire startup almost died with this glitch. Our enterprise customers use our product in the mornings... this happened at 9-fucking-am right during our peak hours.

Obviously it's our fault for not having a redundant system / for trusting heroku. Though I still can't help but be a little pissed as I email 100 people about how sorry I am for their service disruption. Heroku didn't email me.


You can subscribe to notifications on http://status.heroku.com.


Thanks so much! Though I can't imagine anyone who wouldn't want these notifications...


Interesting to see that www.heroku.com runs on their own platform, and is just as susceptible to platform downtime.

On the other hand, it's surprising to me that when Heroku goes down, it goes down hard: all hosted sites, including their own, are unavailable.


Totally agree with this assessment. I don't know how their routing mesh works exactly, but it seems that they're coupled too closely. Either everybody is mostly working fine, or everybody's down.


What would really help in these situations is a failover to our maintenance pages or some other static page we could provide instead of showing the Heroku "Application Error".

This is probably harder than it seems, especially when the outage is related to their routing infrastructure.


Well, there is an option to have error pages on your Heroku domains, but that is probably dependent on at least a minimum level of the routing layer working.

https://devcenter.heroku.com/articles/error-pages


Yeah, I have that set up, but it's not being used at the moment with this outage. If full redundancy with multiple regions is a huge challenge, perhaps some interim basic routing redundancy would go a long way, so we could at least display a branded error page during an outage.


I agree. There is a very simple way to do this which I just posted (since it apparently hasn't occurred to Heroku). All they need to do is change the default app error message once they have confirmed they have a platform issue. So within a few minutes of an outage, a better message is displayed to our customers that does NOT make it look like we screwed up. http://blog.pardner.com/2012/06/dear-heroku-dammit-quit-blam...


How would that help?


Status update on https://status.heroku.com/incidents/372 :

"We have confirmed widespread errors on the platform. Our engineers are continuing to investigate."


Also can follow @herokustatus on twitter: https://twitter.com/#!/herokustatus

Edit: Also IRC #heroku

Edit 2: heh... I just noticed the "subscribe to notifications" link on the incident page: https://status.heroku.com/incidents/372



What's crap about these outages is that custom 500 error pages that you've created aren't shown.


I checked 5 sites, all single dyno: 3 down, 2 up


2 of 3 up, all single-dyno static-ish sites; but it could be Cloudflare propping them up with their cache.


update: all 5 down


I have 4 apps on 20 paid dynos. All of them down.


Heroku: Awesome when it's up, but when it's down, it's really really down...


Heroku is a great platform and has come a really long way -- but it could really use some competition. Some of the main issues that would be resolved quickly with competition are:

- it would be possible to have the non-DB part of your heroku app be spread across multiple availability zones.

- worker pricing would be much lower or based on actual CPU cycles used.

- add-on providers would be vetted more thoroughly before getting a spot in the add-on store.

- they'd reopen the #heroku irc channel for informal support


You mean like AWS, Engine Yard, AppFog, dotCloud?


IaaS != PaaS


Amazon excluded, those are PaaS.


I haven't looked at DotCloud, but the others are not really competitors.



Eventually, they need to offer multiple independent regions, so if you're inclined (like Netflix does with AWS) you can develop a way to fail-over to another region.

At least then, it won't be "Heroku is down", it will be "Heroku West is down" or some such thing.


http://AppFog.com has multiple regions today: Ireland, Singapore and the US, adding Rackspace and HP soon. All free and backed by Cloud Foundry.


What's their usual outage duration? I'm supposed to demo a project in a few hours.


They keep a historical archive of incidents at https://status.heroku.com/past

Last time I can recall this happening to us it was down for several hours.


502 Bad Gateway . . . is there a page that displays the status of the status page?


Goes to show that the five nines is a bunch of crap. More like two nines.


5 nines is 5 minutes of downtime per year. 4 nines is 52 minutes of downtime per year.

See: http://en.wikipedia.org/wiki/High_availability#Percentage_ca... for a handy chart and follow along with any of your favorite *aaS providers!
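For reference, turning an availability percentage into allowed downtime is just arithmetic over the 8,760 hours in a year; this small sketch reproduces the figures quoted in this thread (roughly 44 hours/year at 99.5%, 2.6 hours at 99.97%, 52 minutes at 99.99%, 5 minutes at 99.999%):

    # Allowed downtime per year for a given availability percentage.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for availability in (99.5, 99.9, 99.97, 99.99, 99.999):
        downtime_hours = HOURS_PER_YEAR * (1 - availability / 100.0)
        print("%.3f%% uptime -> %.0f min (%.1f h) of downtime per year"
              % (availability, downtime_hours * 60, downtime_hours))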


What kind of service guarantees does Heroku have?


Confirmed. One of my apps is down. I hope they resolve it soon, but I don't pay anything at the moment so I am not too upset.


My apps are finally back up, how's everyone else doing?


Things seem to be working for me, but Heroku still seems a little off. I got an error when I tried to view my app logs, though, which is a little scary.


Still getting sporadic errors.


My demo site is still down.


And it's back up....

Just a glitch in the matrix. It occurs when they switch over local control.


We are still down: http://warsocial.com

Edit: Came back for a second. Went down again. Came back. Went down.


I was going to get details from status.heroku.com but that's down too.


it's up sometimes :)

Potential Platform Issues 5m+ We have confirmed widespread errors on the platform. Our engineers are continuing to investigate.

This is the message for both Production and Development.


status.heroku.com is up for me. My apps aren't though.


It is up for me:

Potential Platform Issues (Jun 7, 2012 15:55 UTC)

Update: We have confirmed widespread errors on the platform. Our engineers are continuing to investigate. (Posted Jun 7, 2012 15:58 UTC)

Issue: Our automated systems have detected potential platform errors. We are investigating. (Posted Jun 7, 2012 15:55 UTC)



