I know it's popular to be a "developer-centric" organization these days, but please, please, don't schedule risky maintenance operations for 10 AM on a Monday, when your tools are in heavy usage.
Every SaaS product I've built has analyzed traffic, and performed migrations/deployments at off-hours (generally, after midnight in our dominant time zone). In this case, the outage may have resulted in hundreds (thousands?) of paged administrators across the world, but at least fewer end-users would have been affected.
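For what it's worth, the "analyze traffic" step doesn't need to be fancy. A rough sketch, assuming you already have request counts bucketed by hour of day; the function name and the sample numbers are made up for illustration:

```python
from typing import Sequence


def quietest_window(hourly_requests: Sequence[int], window_hours: int = 2) -> int:
    """Return the start hour (0-23) of the lowest-traffic window.

    hourly_requests holds 24 request counts, index 0 = midnight in your
    dominant time zone. Windows are allowed to wrap past midnight.
    """
    if len(hourly_requests) != 24:
        raise ValueError("expected exactly 24 hourly buckets")

    def load(start: int) -> int:
        # Total requests in the window beginning at `start`.
        return sum(hourly_requests[(start + i) % 24] for i in range(window_hours))

    return min(range(24), key=load)


# Made-up numbers: flat traffic with a dip at 3-5 AM.
sample = [50] * 24
sample[3], sample[4] = 5, 4
start = quietest_window(sample, window_hours=2)
print(f"Quietest 2-hour window starts at {start:02d}:00")
```

Averaging the counts over several weeks first would smooth out one-off spikes.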
Also, at scale, it's a good idea to deploy to a single cluster/zone first, and check error rates before deploying to the larger environment.
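A minimal sketch of that single-zone-first idea, not how Heroku or anyone else actually does it. The `deploy`, `error_rate` and `rollback` callables, the zone names and the 1% threshold are all hypothetical stand-ins for whatever your own tooling provides:

```python
from typing import Callable, List, Sequence


def canary_rollout(
    zones: Sequence[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    rollback: Callable[[str], None],
    max_error_rate: float = 0.01,  # assumed threshold, tune for your service
) -> List[str]:
    """Deploy to the first zone, check its error rate, then do the rest.

    Returns the zones left running the new release (empty if the canary failed).
    """
    if not zones:
        return []
    canary, rest = zones[0], list(zones[1:])

    deploy(canary)
    if error_rate(canary) > max_error_rate:
        # The canary looks unhealthy: undo it and leave every other zone alone.
        rollback(canary)
        return []

    for zone in rest:
        deploy(zone)
    return [canary] + rest


# In-memory stand-ins so the sketch runs on its own.
live = set()
observed = {"zone-a": 0.002}  # pretend the canary sees 0.2% errors
released = canary_rollout(
    ["zone-a", "zone-b", "zone-c"],
    deploy=live.add,
    error_rate=lambda zone: observed.get(zone, 0.0),
    rollback=live.discard,
)
print(released)
```

In practice you would wait for some soak time between deploying the canary and reading its error rate; that is left out here to keep the sketch short.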
It's pretty scary that the 'professionals' to whom I've trusted my business aren't more savvy when it comes to high availability...
Are you sure that Heroku is peaking at 10 AM on a Monday?
I believe they have off-hours, but as an international product it could be interesting to know when. I know a lot of European products with European users that all depend on Heroku.
I'm not sure that Heroku is peaking at 10 AM on Monday, but I'm reasonably sure it's not a traffic nadir for them.
For example, 6 PM on Saturday PDT is 1 AM on Sunday GMT and 9 AM on Sunday in Hong Kong. This would almost certainly be a time of lower deployments / requests globally.
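Conversions like this are easy to get wrong by a day, so here is a quick way to check one. The date is an arbitrary Saturday during US daylight time, and the fixed offsets (PDT = UTC-7, Hong Kong = UTC+8) are hard-coded so it runs without a time zone database:

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets; fine for a one-off check, not for anything DST-aware.
PDT = timezone(timedelta(hours=-7), "PDT")
GMT = timezone.utc
HKT = timezone(timedelta(hours=8), "HKT")


def describe(moment: datetime, tz: timezone) -> str:
    """Render a moment as 'Weekday HH:MM' in the given zone."""
    return moment.astimezone(tz).strftime("%A %H:%M")


# Saturday, 1 June 2013, 6 PM PDT (arbitrary example date).
window = datetime(2013, 6, 1, 18, 0, tzinfo=PDT)

for label, tz in (("PDT", PDT), ("GMT", GMT), ("HKT", HKT)):
    print(label, describe(window, tz))
# PDT Saturday 18:00
# GMT Sunday 01:00
# HKT Sunday 09:00
```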
They have a separate EU hosting region that has separate maintenance (and uptime that's actually significantly better than the US region's): https://status.heroku.com/uptime?region=EU
So yeah, it was a lame time to schedule maintenance.
If you've architected for IaaS you have sufficient redundancy and a plan for graceful degradation. I'd rather be dealing with an outage after lunch than at 4 AM.
This is exactly the "developer-centric" mindset that is, frankly, misguided.
Sure, I would rather deal with an outage during business hours, but that means that many of my customers are also dealing with an outage. If the outage were at 4 AM, most of my customers would be asleep, and wouldn't even notice it.
I'm sure that's the plan. But if you reallyreallyreally need faultless 24/7 you're looking for something akin to "NASA as a service", and I assume that such a service wouldn't be able to invoice using something as pedestrian as a credit card.
You don't need to make your developers work a 24-hour cycle to do off-hours maintenance. You plan ahead, give the team doing the update the day off beforehand, and give them a reasonable window to work with.
Maybe I'm wrong but if a deployment goes wrong at peak hours it's going to amount to a lot more lost money for the customer than it will if a deployment goes wrong at off hours. If you cost your customers money with an outage they're not going to care that you scheduled a developer-centric maintenance window.
There is no reason they cannot perform maintenance for US zones and EU zones at different times and avoid M-F mid-day for everyone. So it's probably "just easier" for Heroku this way.