Hi, I'm Mike and I work in Engineering at Atlassian. Here's our approach to back...

Rantenki · on April 14, 2022

Hey Mike; Not dumping on you personally, but the RTO claims to be 6 hours. I can understand that being a target, but we're at 32X that RTO target, with a communicated target date of another 12 or so days IIRC. That's literally two orders of magnitude longer than the RTO. I don't think any rational person would take that document seriously at this point.

I'll also ask (since nobody else has answered, I may as well ask you as well):

1. Are the customers actually being restored from backups (and additionally, by a standard process)?

2. Will the recovery also include our integrations, API keys, configuration and customization?

miketria · on April 15, 2022

Hi Ranteki, you're right that the RTO for this incident is far longer than any of the ones listed on the doc I linked above. That's because our RPO/RTO targets are set at the service level and not at the level of a "customer". This is part of the problem and demonstrates a gap both in what the doc is meant to express and a gap in our automation. Both will be reviewed in the PIR. Also, the answer to (1) and (2) is yes.

atlasgone · on April 14, 2022

A friend in Atlassian engineering said the numbers on the trust site are closer to wishful thinking than actual capabilities, and that there has been an engineering wide disaster recovery project running because things were in such bad shape. The recovery part hasn't even started. If Atlassian could actually restore full products in under six hours, they should have been able to restore a second copy of the products exclusively for the impacted customers.

bradknowles · on April 16, 2022

Nah. The RTO/RPO assumes that only one customer that has a failure big enough to require a restore.

When the entire service is hosed, that's a totally different set of circumstances, and you have to look at what the RTO/RPO are for basically restoring the entire service for all customers. And since the have more than a thousand customers, it totally makes sense that it would take orders of magnitude longer to restore the entire service.

palijer · on April 14, 2022

I think this document and incident is a decent example of common DR planning failure patterns.

It is explained here that Atlassian runs regular DR planning meetings with the engineers spending time planing out potential scenarios, as well as quarterly tests of backups and tracking findings from them.

So, with those two things happening, I the imagine recovery time objectives of <6 hours was taking a typical "we deleted data from a bad script run affecting a lot of customers" scenario into account with the metrics from the quarterly backup tests.

That doesn't even come close to the recovery time we are currently seeing now however. We're coming up on 2 orders of magnitude more than that.

The above doc seems pretty far our of line with what is currently happening.

sizzle · on April 13, 2022

How’s the atmosphere internally Mike? Must be crazy times there. I know this isn’t your fault, so hang in there. Cheers!

MuffinFlavored · on April 14, 2022

Whose fault is it? Is it any one person/team’s fault? Management? Culture?

“Corporations are people too”

mdoms · on April 14, 2022

400 tenants doesn't seem like that much scale though...? What will happen if there's an incident affecting more than 0.18% of tenants?

mulletbum · on April 14, 2022

It's 400 tenants scattered across all their servers. So they are most likely having to build out servers to pull the data then put it in place. 10x the problem that just restoring a single server would be.

encryptluks2 · on April 13, 2022

You mean your poor practices and bad design. The only way to prevent this type of issue in the future is to admit the failures.