
a cleaner solution is to deploy configuration changes from scratch, by deploying clean-slate instances with those changes made.

All of us who have built big cloud-server clusters have dreamed of this plan at least once. But there are big practical problems.

Relaunching infrastructure is easy in theory, but from time to time it becomes very difficult. There is nothing like being blocked on a critical upgrade because your Amazon region has temporarily run out of your size of instance, or because the control layer is having a bad day, or because you've accidentally hit your instance limit in the middle of a deployment, or...

A much bigger issue is that bandwidth is finite, so "big" data is hard to move. This is a matter of physical law. It's all well and good to declare that you're never going to apply a MySQL patch in place: you're just going to launch a new instance with the new version and then switch over. But however fast you manage to launch the new instance (and you will be hard put to launch an instance faster than you can apply a patch and restart a daemon...), you will be limited by the need to copy over the data. Have you ever tried copying half a terabyte of data over a network in an emergency while the customer is on the phone? It is very annoying, because it is often physically impossible to do quickly: cloud infrastructure isn't generally built for that, and when it is, it costs money that your customer will not want to spend for the luxury of faster, cleaner patch-application.
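The arithmetic behind that complaint is easy to sketch. A minimal back-of-envelope calculation, assuming an illustrative 1 Gbps effective link (the link speed is my assumption, not a figure from the comment):

```python
# Back-of-envelope: how long does moving "big" data over a network take?
# The 1 Gbps link speed is an illustrative assumption, not a measurement.

def transfer_time_seconds(data_bytes: float, link_bits_per_sec: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return data_bytes * 8 / link_bits_per_sec

half_terabyte = 500e9   # 500 GB, the example from the comment
gigabit_link = 1e9      # assumed 1 Gbps effective throughput

seconds = transfer_time_seconds(half_terabyte, gigabit_link)
print(f"{seconds / 60:.0f} minutes")  # ~67 minutes, best case
```

And that is the best case: real transfers over shared cloud networking rarely sustain line rate, so an hour-plus copy is the floor, not the ceiling, while a patch-and-restart takes seconds.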

A solution to this is to use cloud storage like EBS. Now your data sits in EBS and you just detach its drive and reattach it to a new instance. That actually works okay, provided you're happy with the bandwidth and reliability of EBS, which lots of people aren't – and, as those people will cheekily point out, you have now solved the "relaunches are slow" problem by replacing it with an "everything is uniformly slow" problem. Moreover, detaching and reattaching EBS volumes isn't instantaneous either. You have to cleanly shut down and cleanly detach and cleanly restart, and there's like 12 states to that process, and all of them occasionally fail, and if you don't want your service to go down for thirty seconds every time you apply a patch you need a ton of engineering.
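The shape of that multi-state problem can be sketched as a toy state machine. The step names below loosely mirror the EC2/EBS volume lifecycle but are my own illustrative labels, not the AWS API:

```python
# Illustrative sketch of why "just move the EBS volume to a new instance"
# involves many steps, each of which can fail. Step names loosely mirror
# the EC2/EBS lifecycle; this is a toy model, not the AWS API.
from typing import Optional

STEPS = [
    "stop-services",       # quiesce writes on the old instance
    "unmount-filesystem",
    "detach-volume",       # volume: in-use -> detaching -> available
    "attach-volume",       # volume: available -> attaching -> in-use
    "mount-filesystem",
    "start-services",      # the downtime clock stops only here
]

def migrate(fails_at: Optional[str] = None) -> list:
    """Walk the happy path; raise if any step 'fails' (as each sometimes does)."""
    done = []
    for step in STEPS:
        if step == fails_at:
            raise RuntimeError(f"step {step!r} failed; manual recovery needed")
        done.append(step)
    return done
```

In practice each transition is also asynchronous (you poll until the volume leaves "detaching"), which is where the thirty seconds of downtime per patch comes from, and hiding it is the "ton of engineering" the comment mentions.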

Which brings us to the other problem: Complexity. Most programmers are not running replicated services with three-nines-reliable failover that never breaks replication. But even if you are, because you've got the budget for excellent infrastructure and a great team, it will always - for values of "always" measured in several more years, anyway - be more complicated and risky to fail over a critical production service than to apply, say, a security patch to 'vi' in place on a running server. 'vi' is not in your critical path. If you accidentally break 'vi' on a live server (and you won't, because vi is older than dirt and solid as a rock), you will have a good laugh and roll it back. Why risk a needless failover, which always has a chance of failure, when you could just apply the damn patch and thereby mitigate risk?

At Google scale that argument probably stops applying. But most people don't run at that scale and it will take decades to migrate everyone to a system that does, if that even happens.

So, "dustbin of history", maybe, someday, but in the long run we are all retired, and I will be retired before our dream becomes reality. ;)



The bulk of your comment - your second, third and fourth paragraphs - focuses on issues of speed, bandwidth and reliability in a third-party hosting/cloud-based architecture, which are a design-time tradeoff, so I don't see them as strictly relevant (though anecdotally informative).

Your fifth paragraph describes problems related to operations process, which are entirely avoidable.


Well, okay. Give my regards to Saint Peter and all the angels!


In the parlance of our times: "Do you even lift?"



