
Hi, I'm Mike and I work in Engineering at Atlassian. Here's our approach to backup and data management: https://www.atlassian.com/trust/security/data-management - we certainly have the backups and have a restore process that we keep to. However, this incident stressed our ability to do this at scale, which has led to the very long times to restore.


Hey Mike; not dumping on you personally, but that doc claims an RTO of 6 hours. I can understand that being a target, but we're already at 32x that target, with a communicated completion date another 12 or so days out IIRC. Counting those extra days we're looking at roughly 80x, which is close to two orders of magnitude longer than the stated RTO. I don't think any rational person would take that document seriously at this point.

I'll also ask (since nobody else has answered, I may as well ask you as well):

1. Are the affected customers actually being restored from backups (and, if so, via a standard process)?

2. Will the recovery also include our integrations, API keys, configuration and customization?


Hi Ranteki, you're right that the RTO for this incident is far longer than any of the ones listed in the doc I linked above. That's because our RPO/RTO targets are set at the service level, not at the level of an individual customer. This is part of the problem and demonstrates a gap both in what the doc is meant to express and in our automation. Both will be reviewed in the PIR. Also, the answer to (1) and (2) is yes.


A friend in Atlassian engineering said the numbers on the trust site are closer to wishful thinking than actual capabilities, and that there has been an engineering-wide disaster recovery project running because things were in such bad shape. The recovery part of that project hasn't even started. If Atlassian could actually restore full products in under six hours, they should have been able to restore a second copy of the products exclusively for the impacted customers.


Nah. The RTO/RPO assumes that only one customer has a failure big enough to require a restore.

When the entire service is hosed, that's a totally different set of circumstances, and you have to look at what the RTO/RPO would be for basically restoring the entire service for all customers. And since they have more than a thousand customers, it totally makes sense that it would take orders of magnitude longer to restore the entire service.
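For a rough sense of the scaling (these numbers are my own assumptions for illustration, not anything Atlassian has published), compare one restore event against hundreds of selective restores with limited parallelism:

    # Back-of-envelope only; every number here is an assumption.
    published_rto_hours = 6     # the single-event target from the trust doc
    affected_tenants = 400      # sites impacted in this incident
    concurrent_restores = 10    # assumed cap on parallel restore pipelines

    naive_total = affected_tenants / concurrent_restores * published_rto_hours
    print(f"~{naive_total:.0f} hours, i.e. ~{naive_total / 24:.0f} days")
    # ~240 hours, i.e. ~10 days, before any per-tenant verification

Even under those generous assumptions you land in the multi-week range, which is roughly the gap people are seeing.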


I think this document and incident are a decent example of common DR planning failure patterns.

It's explained there that Atlassian runs regular DR planning meetings, with engineers spending time planning out potential scenarios, as well as quarterly backup tests whose findings are tracked.

So, with those two things happening, I imagine the recovery time objective of <6 hours was set with a typical "we deleted data from a bad script run affecting a lot of customers" scenario in mind, backed by the metrics from the quarterly backup tests.

That doesn't even come close to the recovery time we are seeing now, however. We're coming up on two orders of magnitude more than that.

The above doc seems pretty far out of line with what is currently happening.


How’s the atmosphere internally Mike? Must be crazy times there. I know this isn’t your fault, so hang in there. Cheers!


Whose fault is it? Is it any one person/team’s fault? Management? Culture?

“Corporations are people too”


400 tenants doesn't seem like that much scale though...? What will happen if there's an incident affecting more than 0.18% of tenants?


It's 400 tenants scattered across all their servers. So they most likely have to build out servers to pull the data and then put it back in place. That's 10x the problem that restoring a single server would be.
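A purely hypothetical sketch of what that per-tenant pipeline might look like (invented steps and names, not Atlassian's actual tooling), just to show why the unit of work is "one tenant extracted from shared infrastructure" rather than "one server":

    from concurrent.futures import ThreadPoolExecutor

    def restore_tenant(tenant_id: str) -> str:
        # Hypothetical steps, described rather than implemented:
        # 1. Stand up a staging host and load the shard backup containing this tenant.
        # 2. Extract only this tenant's rows, attachments, and metadata.
        # 3. Re-create integrations, API keys, configuration, and customisations.
        # 4. Merge into the live shard and verify before handing the site back.
        return f"{tenant_id}: restored"

    affected = [f"site-{n}" for n in range(400)]  # stand-in for the impacted sites
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(restore_tenant, affected))
    # Each step above is slow and needs verification, so even with parallel
    # workers the total looks nothing like a single wholesale server restore.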


You mean your poor practices and bad design. The only way to prevent this type of issue in the future is to admit the failures.



