Yeah, it is surprising that you can't roll back when you first start with Terraform, but as you gain more experience with it, you realize that not rolling back means you can resume instead. And if you need to roll back, you can do so by just running destroy. It's actually a feature, not a bug.
No, it’s not.
I want to leave my infrastructure in a consistent state.
I am in state A and want to move to state B. I want it to work. I don’t want a half-assed attempt to make it work.
What does Terraform bring to the table? I have to use HCL to describe my infrastructure in terms that are NOT cloud-agnostic (so it just introduces another layer), and in the face of adversity it throws its hands in the air, and now you’ve got to figure out what went wrong, manually, by yourself. This is what I call True Devops (TM).
I have seen Terraform crap out and fail to recover. It cannot move forward, it cannot roll back, it cannot destroy. It’s stuck. At that point you start praying that someone really understands the underlying cloud and knows the shenanigans Terraform plays, so they can fix it now and also keep Terraform happy moving forward.
We’re talking basic stuff here.
I don’t want to go into more advanced issues like losing network connectivity, the Terraform process crashing (think OOM conditions) or being killed, or non-responsive cloud APIs.
Not to mention that destroying the infrastructure you’ve created almost never works (unless it’s trivial infrastructure).
Based on what I’ve seen up until now, I would not use Terraform in a production environment.
If I had experienced what you just described, I would probably have the same opinion, but after the initial learning curve, I haven't really had any of the problems you've listed. The only times I've had to manually modify cloud resources to fix something were because I was doing it wrong in the first place.
On the other hand, CloudFormation is not perfect either. The rollback does not work 100% of the time, and I've had it roll back a set of templates that took 45 minutes to deploy because of some inconsequential timeout that could have been ignored. I've also had pre-built templates developed by AWS fail outright, which is strange considering AWS built them.