I know most sites hosted on Heroku are pet projects with no real need for that kind of uptime, but this kind of downtime makes it impossible to use them for enterprise customers.
Agreed. We are finally scaling from "small scale production" to having actual customers we want to take care of. Worse, we began a huge marketing push this week and had been seeing promising results. Now we'll probably miss a boatload of potential customers.
Heroku had been a joy to work with until now. I ask of you all (as most of you are far more experienced in these matters than I), what is the traditional practice to mitigate this sort of risk? Paying for hosting with two separate companies?
EDIT: I understand the benefits of cloud hosting over hiring a sysadmin. At the same time, I'm interested to learn about what possible solutions there are. I'm at the point where I don't know what I don't know, and even the name of a topic or technology would be a huge help.
Heroku is really good. You will probably have more downtime than them unless you hire a whole team of operations and admins and pay a lot of money for hosting. It may also take a while to build out the deployment and development tools they have.
Honestly, that's pious bullshit. People keep saying, every time Heroku gets mentioned, that it'd do better on availability than you would yourself, but you'd have to be borderline incompetent to rack up as much downtime on a more traditional VPS / hardware host as the combined Heroku / AWS downtime.
We moved all but one of our Rails apps off of Heroku precisely because of the frequent downtime -- or, rather, that was the last straw; there were other issues, notably the difficulty in debugging production issues, that had us already debating such. Heroku has gotten somewhat better, but it's still down far more than anything else that we use. (And we have services spread across Linode, Rackspace, AWS and the mentioned one app on Heroku.)
You'll have better uptime in most cases with a standard nginx / passenger setup on a $20 VPS than you will with Heroku.
Huh? The flakiness of a standard uplink will have you down for 44 hours a year? On what planet? With our non-AWS hosting providers we tend to see 2-4 hours of network issues per year. Heroku's uptime also hasn't historically been anywhere near 99.97%. They were down for several days last year in The Great AWS Failure.
I don't think anyone ever considers hosting it on a desktop in the closet on DSL at home/office.
The competition for something like Heroku is EC2, VPS, or dedicated servers, in commercial colocation facilities. A good hosting facility is going to be a lot closer to 99.995% uptime for network and power to the box, but you can of course screw up past that point on your own.
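To put those percentages in concrete terms, here's the back-of-the-envelope arithmetic behind the figures quoted in this thread (the labels on each figure are just my shorthand for the claims above, not anyone's actual SLA):

```ruby
# Downtime-per-year arithmetic for the uptime percentages in this thread.
HOURS_PER_YEAR = 24 * 365 # 8760; ignoring leap years

def downtime_hours(uptime_pct)
  (1 - uptime_pct / 100.0) * HOURS_PER_YEAR
end

{
  99.5   => "roughly the '44 hours a year' figure",
  99.97  => "the claimed Heroku uptime",
  99.995 => "a good colo facility (network + power)",
  99.999 => "transit worth paying for",
}.each do |pct, note|
  printf("%.3f%% uptime => %6.2f hours/year down (%s)\n",
         pct, downtime_hours(pct), note)
end
```

Note how 99.5% works out to about 44 hours a year, which is where that number comes from, while 99.995% is under half an hour.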
Having administered several "standard uplinks" from 100 Mbit/sec all the way to multiple ten Gbit/sec links, I can safely say that if your transit isn't 99.999%, you need new transit.
Yeah, I think Heroku is better than other options as a fast/easy way to deploy, with a great UI and some useful tools, but the combination of lack of visibility into internals for debugging, Heroku outages, and EC2 outages is the reason I don't use it for anything in production.
I'd seriously consider Heroku if they were in multiple regions (fuck AZs, those are a lie). There are huge advantages to a PaaS in terms of speed and lack of hassle, and a well run PaaS is better than most developers at operations, patching, security, etc. (We're kind of unique in that we're better at operations than development, though.)
On June 1st (this month) they had routing errors. On June 5th (the day before yesterday) they had routing errors again.
My apps were triggering errors left and right, and I was bashing my head against the keyboard because I noticed the errors before their status update went out and believed it was our fault.
Plus, this last week the latency of requests has been incredibly high, while traffic to my apps has not increased. Again I assumed it was my fault.
We've recently begun to move our entire infrastructure to external providers (vps and cloud) because we felt that we couldn't guarantee a sufficiently high level of quality and had to devote too much time and money to operations.
In just over four months we've probably already had more downtime than in the past five years, despite paying quite a lot for "redundant", clustered offerings. One nasty bug in the hypervisor killed the entirety of one of our providers' VPS infrastructure for over a day.
Our experience with "the cloud" was a little bit better, but we aren't entirely satisfied.
I don't know, maybe we should just order services from three different providers and mirror our applications ourselves as a fail-over mechanism.
He implied that Heroku is not his sole provider. I'm skeptical of that figure, too, as it's probably something on the Heroku sales site (99.7% to the 30-byte health check!), or it's manually updated, in which case it's probably rounded.
Yes. I recommend having a look at either ProxMox (proxmox.com), which can do OpenVZ and KVM simultaneously, or XCP for Xen-based stuff - see http://www.xen.org/products/cloudxen.html
http://AppFog.com has Ruby, Python and Node, with multiple regions today (Ireland, Singapore and the US) and Rackspace and HP coming soon. All free and backed by CloudFoundry.
As a newbie, I've really enjoyed deploying on Heroku. However these outages really terrify me. Is there an easy way to host a redundant version of your app outside of Heroku?
Ask a simple question, get a simple answer: No. This is no knock on Heroku, either. No hosting solution can make that easy: it is an enterprise requirement which implies six figures of investment and a dedicated ops team with no newbies on it.
I'm sorry, but that's completely untrue. I can't speak for using Heroku specifically, but in the end hosting is just running an app. It's fairly trivial to make an app multi-provider these days through a plethora of methods. DNS with a low TTL is on the easiest-to-approach end if your app is designed for it and aware of the complications, which implies thinking about your database and other supporting architecture from the perspective of a multihomed setup. If your app isn't idempotent or designed to be multihomed, it'll be harder, but six figures of investment is insanely high even in that awful case.
Perspective: I could throw an app on Rackspace and Amazon, with a replicated database, for under $200, in about a day.
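As a rough sketch of the DNS-with-low-TTL approach described above: run a health check against each provider and point your A record (with a 30-60 second TTL) at the first healthy one. All hostnames here are made up, and the actual record update is left to whatever API your DNS provider exposes.

```ruby
require "net/http"
require "uri"

# Hypothetical health-check endpoints on each provider; substitute your own.
PROVIDERS = {
  "aws"       => "http://app-aws.example.com/health",
  "rackspace" => "http://app-rackspace.example.com/health",
}

# True if the endpoint answers 200 within the timeout.
def healthy?(url, timeout: 5)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.request_uri).code == "200"
  end
rescue StandardError
  false
end

# Pick the first healthy provider. The check function is injected so the
# failover policy is testable without touching the network.
def pick_provider(providers, check)
  pair = providers.find { |_name, url| check.call(url) }
  pair && pair.first
end
```

You'd run this from cron, and whenever the chosen provider changes, push the new A record through your DNS provider's API. The hard part, as noted above, is the database replication underneath, not this loop.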
One of the problems with using multiple providers is that you can't use any of the provider-specific features.
For example, I really like AWS's security groups and ELBs. Those serve as my firewall and my load balancer and SSL terminator.
Replicating the application to another service means configuring and testing all that on my own.
If I use heroku and use their logging system, then replicating it to another provider means I need to be an rsyslog expert.
I don't really want to be an expert on rsyslog, postgresql configuration, floating IPs for HA LB, the best IO scheduler for file systems, etc. As someone who is in charge of all the sysadmin duties, and is solely responsible for writing all the business and db logic for several e-commerce sites, I want to spend my time on writing code. Not fucking around with figuring out the syntax for iptables.
Ultimately this is why it would be nice for the code/configuration to be independent of the operations. I agree it is unreasonable to expect a random developer to build AND MAINTAIN the entire stack for every project. PaaS makes a lot of sense, especially as a starting place.
The solution is either to have a PaaS provider who ruthlessly eliminates single points of failure (there isn't one, currently), or use some standardized software system which can be operated by multiple independent operators with nothing shared. Unfortunately, the only vendor-independent infrastructure is the physical server, various forms of VPS, etc. -- it's all at the IaaS level. As far as I know there's no PaaS type thing with a common interface which arbitrary providers can operate, with some kind of marketplace for users to pick operators independently from the technology.
Theoretically, sure, but it's an engineering and economic thing.
You can mitigate specific risks, and you try to prioritize those based on cost, frequency, and severity. If there were a great redundant provider with good authentication on accounts, a strong balance sheet and business, and sane policies on managing accounts, you would be fairly safe using just that provider. After all, you could always get a court order to cease providing services, yourselves, like if you do something some troll has patented. It kind of depends on your application, too -- if I were doing a wikileaks, a bitcoin exchange or torrent site or some other legally at risk business, I'd want country-level separation across multiple providers, at least as a cold backup. Casual game for facebook or mobile, not really much of a concern.
That means that I need to know about (and program chef to do):
- logging
- security hardening
- firewalls
- backups (both point-in-time and complete, for the database and for any other artifacts like image uploads)
- testing backup recovery (both the point-in-time and complete backups)
- replication
- HA (high availability)
- LB (load balancing)
- SSL termination
- PostgreSQL configuration
- monitoring (security, system stats, application availability, individual processes running)
- performance analysis (at both the application and system level)
- a method for deploying updates (to the application and to the system)
And the list goes on... You could easily make a career out of focusing on PostgreSQL configuration alone, for example. These aren't simple things to do.
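For a sense of what "program Chef to do" means in practice, here's roughly what one small slice of that list (a package, a firewall rule, a config file) looks like as a Chef recipe. The resource bodies and file paths are illustrative sketches, not taken from any particular cookbook:

```ruby
# Sketch of a Chef recipe covering two items from the list above.
# Paths and rules are illustrative; adapt to your distro and policy.
package "postgresql"

service "postgresql" do
  action [:enable, :start]
end

# Firewall: the iptables syntax complained about above, written down once.
execute "allow-https" do
  command "iptables -A INPUT -p tcp --dport 443 -j ACCEPT"
  not_if  "iptables -C INPUT -p tcp --dport 443 -j ACCEPT"
end

# PostgreSQL configuration, templated and hooked to a restart.
template "/etc/postgresql/9.1/main/postgresql.conf" do
  source   "postgresql.conf.erb"
  notifies :restart, "service[postgresql]"
end
```

And that's before monitoring, backups, backup-recovery testing, or any of the rest of the list, each of which is its own recipe (or several).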
Different providers have different limitations on what you can do here. For example, AWS doesn't support multicast, which limits your options for doing High Availability.
Know about? Program Chef to do? How about: be on call 24/7 to handle random problems in? The best, most competent ops teams can all tell you horror stories about black swan outages traceable to some of the least likely components in the stack you just outlined.
Almost all of the things you outlined have in the past blown up in real-world deployments, often at cloud hosting providers where you didn't know about it because a well-paid, well-trained ops team hid the drama from you.
So yeah, definitely factor that in to the cost of hosting at a cloud provider. You're absolutely right.
I think before you put an app on the public Internet for people to consume, you should be comfortable with everything you listed, plus more. You might not be doing it all for the app at hand, but you should at least know it. Especially if you're doing the solo-founder thing, you may have to implement all of these things before you can afford ops. And if you're deploying services without some of the knowledge in your list, you're setting yourself up for trouble.
It's not an insurmountable barrier, for sure. Knowing how to operate your own code is a great feeling.
> My comment was very specifically focused on you, acting as a solo developer/founder, not a person you'd hire.
Then "you," having acquired years of experience in designing and implementing distributed services, are making at least a 6-figure investment in opportunity cost.
I am intrigued. Do you have any quick links on hand, or pointers on how one would go about learning how to set up something like this? Or just learning about how to distribute services in general?
That is to say, I know my way around Linux and have set up single/standalone servers for all sorts of services, but have never known remotely where to begin on the distributed side of things, much less understand what's required to make something "designed to be multihomed".
I learned from trial and error and also on the job. In my experience, how-tos are typically antiquated or not terribly clear. My best advice would be to dig into ServerFault and Google what you don't understand.
I did too; for instance, I learned how to terminate DS1s and PRIs by trial-and-error punching down every possible combination of little colored wires, and to this day remember that it's ESF/B8ZS and not AMI because that setting change, at 10:30PM on a Friday night, was when I got the Livingston box to light up properly.
Fun? Yes. Miss it? Fuck no. Recommend people relive the 2012 version of the experience? Huh? Get back to work writing code. Re-read the Wikipedia page on "comparative advantage" first if you need to motivate yourself.
The context of the original question was improving the availability of a product deployed to Heroku or other PaaS, which implies a lack of dedicated operations staff.
An experienced ops team willing to be on call 24x7 is easily six figures by itself.
If you are running an app that is utilized by more people than just yourself, there is never "a lack of dedicated operations staff". That situation simply does not exist in any known realm. In the case of a solo developer, you are the operations staff. You are on call, 24x7.
With that in mind, everything I talk about in my reply is doable by one person, or a solo developer. A cursory understanding of administration goes far, and if you're deploying to Heroku, you should know enough about administration to not be completely in the dark when things fall apart.
Deploying a mass-consumable Web service implies that you have accepted the fact that you are now a developer and administrator.
Do note that Cloudflare throws malware/captcha notices at legitimate users, which is extremely annoying.
When I get such a screen I immediately close the website I'm visiting. And no, contrary to what they tell me, I do not have "a virus" nor is my computer part of a botnet.
Also, careful with Cloudflare SSL. It only secures between the browser and Cloudflare's data center. The connection between your server and Cloudflare remains non-SSL.
It seems like they're quick and post a lot of updates on their status page each time something goes wrong. (I am not using Heroku, but I had the 'joy' of trying to run a site on Dreamhost for a while, so I have seen the worst.)
I think it's similar to http://isitchristmas.com/ . It's a simple answer that's always the same, to drive the point home: you are always the one responsible for your uptime, no matter whether you choose dedicated hosting, the cloud, your own closet, etc. You can't outsource responsibility.
It's so scary to think that our entire startup almost died with this glitch. Our enterprise customers use our product in the mornings... this happened at 9-fucking-am right during our peak hours.
Obviously it's our fault for not having a redundant system / for trusting heroku. Though I still can't help but be a little pissed as I email 100 people about how sorry I am for their service disruption. Heroku didn't email me.
Interesting to see that www.heroku.com runs on their own platform, and is just as susceptible to platform downtime.
On the other hand, it's surprising to me that when Heroku goes down, it goes down hard: all hosted sites, including their own, become unavailable.
Totally agree with this assessment. I don't know how their routing mesh works exactly, but it seems that they're coupled too closely. Either everybody is mostly working fine, or everybody's down.
What would really help in these situations is a failover to our maintenance pages or some other static page we could provide instead of showing the Heroku "Application Error".
This is probably harder than it seems, especially when the outage is related to their routing infrastructure.
Well, there is an option to have custom error pages on your Heroku domains, but that probably depends on at least a minimum level of the routing layer working.
Yeah, I have that set up, but it's not being used during this outage. If full redundancy with multiple regions is a huge challenge, perhaps some interim basic routing redundancy would go a long way, so we could at least display a branded error page during an outage.
I agree. There is a very simple way to do this which I just posted (since it apparently hasn't occurred to Heroku). All they need to do is change the default app error message once they have confirmed they have a platform issue. So within a few minutes of an outage, a better message is displayed to our customers that does NOT make it look like we screwed up. http://blog.pardner.com/2012/06/dear-heroku-dammit-quit-blam...
Heroku is a great platform and has come a really long way -- but it could really use some competition. Some of the main issues that would be resolved quickly with competition are:
- it would be possible to have the non-DB part of your Heroku app spread across multiple availability zones.
- worker pricing would be much lower or based on actual CPU cycles used.
- add-on providers would be vetted more thoroughly before getting a spot in the add-on store.
- they'd reopen the #heroku irc channel for informal support
Eventually, they need to offer multiple independent regions, so if you're inclined (like Netflix does with AWS) you can develop a way to fail-over to another region.
At least then, it won't be "Heroku is down", it will be "Heroku West is down" or some such thing.