Heroku Dynos Unable to Start

sparkman55 · on June 23, 2014

I know it's popular to be a "developer-centric" organization these days, but please, please, don't schedule risky maintenance operations for 10 AM on a Monday, when your tools are in heavy usage.

Every SaaS product I've built has analyzed traffic, and performed migrations/deployments at off-hours (generally, after midnight in our dominant time zone). In this case, the outage may have resulted in hundreds (thousands?) of paged administrators across the world, but at least fewer end-users would have been affected.

Also, at scale, it's a good idea to deploy to a single cluster/zone first, and check error rates before deploying to the larger environment.
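A staged rollout of this kind can be sketched as a simple gate: deploy to one zone, compare its error rate with the pre-deploy baseline, and continue only if it has not risen noticeably. This is an illustrative sketch, not Heroku's process; `ZoneStats`, `safe_to_continue`, and the thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ZoneStats:
    """Request counters observed in one zone (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # A zone with no traffic has no measurable error rate; report 0.0.
        return self.errors / self.requests if self.requests else 0.0


def safe_to_continue(canary: ZoneStats, baseline: ZoneStats,
                     max_increase: float = 0.01,
                     min_requests: int = 1000) -> bool:
    """Return True if the canary zone looks healthy enough to roll out further.

    max_increase and min_requests are assumed thresholds, not recommendations.
    """
    if canary.requests < min_requests:
        return False  # not enough traffic yet to judge the deploy
    return canary.error_rate <= baseline.error_rate + max_increase


# Made-up numbers, purely for illustration.
baseline = ZoneStats(requests=50_000, errors=100)  # 0.2% errors before the deploy
healthy = ZoneStats(requests=5_000, errors=15)     # 0.3% errors after the deploy
broken = ZoneStats(requests=5_000, errors=400)     # 8% errors after the deploy

print(safe_to_continue(healthy, baseline))  # True
print(safe_to_continue(broken, baseline))   # False
```

In practice the counters would come from whatever monitoring is already in place, and the gate would also wait for a minimum soak time before widening the rollout.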

It's pretty scary that the 'professionals' to whom I've trusted my business aren't more savvy when it comes to high availability...

kawsper · on June 23, 2014

Are you sure that Heroku is peaking at 10 AM on a Monday?

I believe they have off-hours, but as an international product it could be interesting to know when. I know a lot of European products with European users that all depend on Heroku.

sparkman55 · on June 23, 2014

I'm not sure that Heroku is peaking at 10 AM on Monday, but I'm reasonably sure it's not a traffic nadir for them.

For example, 6 PM on Saturday PDT is 1 AM on Sunday GMT and 9 AM on Sunday in Hong Kong. This would almost certainly be a time of lower deployments / requests globally.
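The time-zone arithmetic for a candidate window is easy to get wrong by a day, so it is worth checking mechanically. A minimal sketch using Python's standard `zoneinfo` module; the specific date (Saturday, June 21, 2014) and the helper name `window_in` are assumptions chosen for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumed candidate window: 6 PM Pacific on Saturday, June 21, 2014 (PDT, UTC-7).
window = datetime(2014, 6, 21, 18, 0, tzinfo=ZoneInfo("America/Los_Angeles"))


def window_in(zone: str) -> datetime:
    """Return the same instant expressed in another IANA time zone."""
    return window.astimezone(ZoneInfo(zone))


for zone in ("UTC", "Asia/Hong_Kong"):
    local = window_in(zone)
    # isoweekday(): Monday is 1, Sunday is 7
    print(zone, local.isoformat(), "weekday", local.isoweekday())
```

Both conversions land on Sunday, June 22: 01:00 in UTC and 09:00 in Hong Kong.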

zzen · on June 23, 2014

They have a separate EU hosting region that has separate maintenance (and uptime that's actually significantly better than the US region): https://status.heroku.com/uptime?region=EU

So yeah, it was a lame time to schedule maintenance.

alrs · on June 23, 2014

I disagree, strongly.

If you've architected for IaaS, you have sufficient redundancy and a plan for graceful degradation. I'd rather be dealing with an outage after lunch than at 4 AM.

sparkman55 · on June 23, 2014

This is exactly the "developer-centric" mindset that is, frankly, misguided.

Sure, I would rather deal with an outage during business hours, but that means that many of my Customers are also dealing with an outage. If the outage were at 4 AM, most of my Customers would be asleep, and wouldn't even notice the outage.

alrs · on June 23, 2014

If you want NASA reliability, build a NASA-grade 24/7 operation across three timezones.

If your plan is just to have the nerds skip sleep periodically, you're an asshole.

sparkman55 · on June 23, 2014

Exactly! Heroku should be building that 24/7 operation so I don't have to. Isn't that the whole point of PaaS?

alrs · on June 23, 2014

I'm sure that's the plan. But if you really really really need faultless 24/7 you're looking for something akin to "NASA as a service", and I assume that such a service wouldn't be able to invoice using something as pedestrian as a credit card.

GVIrish · on June 23, 2014

You don't need to make your developers work a 24-hour cycle to do off-hours maintenance. You plan ahead, give the team doing the update the day before off, and give them a reasonable window to work with.

Maybe I'm wrong, but if a deployment goes wrong at peak hours, it's going to cost the customer a lot more money than if it goes wrong during off-hours. If you cost your customers money with an outage, they're not going to care that you scheduled a developer-centric maintenance window.

pardner · on June 23, 2014

There is no reason they cannot perform maintenance for US zones and EU zones at different times and avoid M-F mid-day for everyone. So it's probably "just easier" for Heroku this way.

pardner · on June 23, 2014

Compounding the issue of irresponsibly scheduling maintenance for US zones mid-day in the US, they ALSO broke the ability to scale worker processes to zero, so there was literally NO way to wind your app down and prevent worker jobs from firing off.

When the platform started getting wonky, we shut off our worker dynos so the system would not fire off emails while it was in a known-screwed-up state.

The console said workers were set to zero.

But... in our logs we watched the (supposedly off) workers continue to fire off.

Nice FUBAR all around, Heroku.

sprite · on June 23, 2014

Started getting tons of emails from end users. Went to investigate and was greeted with this:

 !
 !    Heroku has temporarily disabled this feature, please try again shortly.
 !    See http://status.heroku.com for current Heroku platform status.

Hopefully they will be back up soon. I also wonder if we will get any sort of reimbursement. I currently spend around $3k/month with Heroku.

silasb · on June 23, 2014

Makes zero sense to do a 2-hour update on a Monday at 10 AM PDT.

Horrible horrible timing. I'll likely be getting blamed for this since I recommended Heroku.

alrs · on June 23, 2014

Do you want to start working night shifts? The people who know how to do this stuff command six-figure salaries and are highly in demand.

Every startup is looking for the design/dev/ops unicorn. You want to start trying to find nocturnal unicorns?

stevepike · on June 23, 2014

These don't need to be night shifts. Back when I worked in bigco, we had very competent ops teams around the world - we'd even send developers abroad for 3-12 month periods to train with them and share information.

Smaller companies use Heroku so we don't have to build the same expertise in house. They're able to charge a premium not because their developers have to work crazy hours, but because there's a certain base cost to having these kinds of distributed teams.

guywithabike · on June 23, 2014

This was caused by their scheduled maintenance: https://status.heroku.com/incidents/641

I think they jinxed themselves: "Running apps will not be affected."

sprite · on June 23, 2014

Seems everything is returning to normal now. Here is a New Relic screenshot from the outage: http://i.imgur.com/WB7U2mz.png

shravan · on June 23, 2014

Request queueing on New Relic is going haywire on our app right now.

maxisnow · on June 23, 2014

We had a similar problem on our app, but were able to restart with more dynos.

stefangomez · on June 23, 2014

PX dynos seem not to have been affected.

sprite · on June 23, 2014

I can't even get in to New Relic.