Gmail had a vaguely similar outage years ago. [1] tl;dr:
1. Different root cause. There was a bug in a refactoring of Gmail's storage layer (IIRC a missing asterisk meant a pointer to an important bool was set to null, instead of the bool itself being set to false). It slipped through code review, automated testing, and the team's dedicated early test servers, so it got rolled out to some fraction of real users. Online data was lost or corrupted for 0.02% of users (a huge amount of email).
2. There were tape backups, but the tooling wasn't ready for a restore at scale. It was all hands on deck to get those accounts back to an acceptable state, and it took four days to get back to basically normal (iirc no lost mail, although some got bounced).
3. During the outage, some users could log in and see something frightening: an empty/incomplete mailbox, and no banner or anything telling them "we're fixing it".
4. Google communicated more openly, sooner, [2] which I think helped with customer trust. Wow, Atlassian really didn't say anything publicly for nine days?!?
Aside from the obvious "have backups and try hard to not need them", a big lesson is that you have to be prepared to do a mass restore, and you have to have good communication: not only traditional support and PR communication but also within the UI itself.
Even though you are no longer there... I had a friend who recently had her Gmail inbox mysteriously emptied, with all emails seemingly permanently deleted. She paid for Google One to be able to talk to support, and they said the data is gone. Do you know if there's any way to recover it? She is heartbroken over all the attachments she will never get to see again.
[1] https://static.googleusercontent.com/media/www.google.com/en...
[2] https://gmail.googleblog.com/2011/02/gmail-back-soon-for-eve...