They claim they test backups quarterly, yet they don't have a procedure in place to restore operations. We all know a backup isn't tested until you've successfully restored everything from it.
This is not an engineering mistake; it is a flat-out lie.
Well, their explanation makes sense. These are multi-tenant environments where not every tenant was affected; sensibly, the backups appear to be divided by environment, not by tenant. You can't blindly revert to an environment's last backup in this scenario, although you'd think they would have had to do this kind of restore before.
You can imagine the problems with restoring one individual tenant's data into an otherwise active database with many tenants, e.g. any cross-tenant primary keys will have shifted since that one tenant's last backup. Separating the backups by tenant wouldn't help with the restoration.
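To make the key problem concrete, a toy sketch assuming a shared table keyed by one global sequence with a tenant discriminator column (schema, ids, and database name are made up for illustration):

    # Toy illustration of the cross-tenant key problem. Schema and ids are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=demo")  # assumed throwaway database
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS issues (
                id        BIGSERIAL PRIMARY KEY,
                tenant_id TEXT NOT NULL,
                title     TEXT NOT NULL
            )""")
        # Tenant A's backup, taken earlier, contained a row with id 100. Since then
        # the shared sequence has moved on and that id has been handed to tenant B:
        cur.execute("INSERT INTO issues (id, tenant_id, title) VALUES (100, 'B', 'live row')")
        # Naively replaying tenant A's backed-up row now collides with B's live data:
        cur.execute("INSERT INTO issues (id, tenant_id, title) VALUES (100, 'A', 'restored row')")
        # -> psycopg2.errors.UniqueViolation: duplicate key value violates
        #    unique constraint "issues_pkey"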
It’s actually a real pain to handle “database per tenant”. With Postgres, for example, that’d mean separate database connections per tenant, which is wildly unscalable with lots of (particularly small) tenants per server.
1) It's super common even in multitenant systems to have a common database with configuration information (for example) which serves all tenants, and tenant-specific databases used alongside that to host their private data (rough sketch of this layout below).
2) Back when sharding started to become a popular scaling pattern, tenants were not always split up along the tenant boundary but by some other reliable key. Obviously this isn't true multitenancy, and I think most DBAs would discourage the pattern today. However, given the age of Atlassian's products (and assuming a fast-and-loose engineering culture, which has been alluded to elsewhere), it's entirely possible that parts of these products, or even the products as a whole, use this kind of sharding.
Bottom line, we can only hypothesize unless and until someone from Atlassian actually details their architecture (which may have happened? I dunno, I haven't been paying that much attention to it…)
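For what it's worth, a minimal sketch of pattern 1, assuming a shared "catalog" database that maps each tenant to its own private database (all names here are hypothetical, not anything Atlassian has described):

    # Hypothetical layout: a shared catalog database that records which private
    # database each tenant lives in. Names are made up for illustration.
    import psycopg2

    CATALOG_DSN = "host=db.internal dbname=catalog user=app"  # assumed shared config DB

    def dsn_for_tenant(tenant_id: str) -> str:
        """Look up which private database holds this tenant's data."""
        conn = psycopg2.connect(CATALOG_DSN)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT db_host, db_name FROM tenants WHERE tenant_id = %s",
                    (tenant_id,),
                )
                row = cur.fetchone()
        finally:
            conn.close()
        if row is None:
            raise KeyError(f"unknown tenant: {tenant_id}")
        host, dbname = row
        return f"host={host} dbname={dbname} user=app"

    def connect_for_tenant(tenant_id: str):
        """Tenant-private queries go to the tenant's own database. The catalog never
        depends on those databases, so each one can be restored independently."""
        return psycopg2.connect(dsn_for_tenant(tenant_id))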
> 1) It's super common even in multitenant systems to have a common database with configuration information (for example) which serves all tenants, and tenant-specific databases used alongside that to host their private data.
Yeah, for sure. This is definitely what I'd expect to see, but I would also expect that to make individual client restores pretty easy, assuming the individual client backups themselves weren't trashed.
One would expect the shared config database to have no dependency on any of the individual client databases, so those could be moved/dropped/restored at will, independently of the shared config database.
> 2) Back when sharding started to become a popular scaling pattern, tenants were not always split up along the tenant boundary but by some other reliable key.
I guess that makes sense; after all, it does allow large/demanding clients to span multiple databases.
Why not restore the tenants to a different environment that is not otherwise active? At Atlassian's scale you would expect them 1) not to be running all the things on one server anyway, 2) to have some existing ability to move tenants between environments for legal or performance reasons, 3) to have the ability to back up/restore single tenants, and so on. I'm not disputing that the predicament they are in right now is real for them. But I have worked at much smaller outfits where this worked fine, not because they were smaller, but because they had that particular shit in order, which is entirely a matter of priorities.
Nothing, and I mean absolutely nothing, that Atlassian has to offer is rocket-surgery kind of hard... yet here we are... not being particularly surprised at all.
I can imagine plenty of issues, but it's definitely a limitation in their design, and I'd be surprised if they haven't run into it before. Surely they've had a tenant destroy their instance and request a restore before.
They're in control of the architecture: rollback, backup, and recovery should all be considerations.
It is a complex problem, but it's one worth solving. Just spitballing, but I think you could reduce some of the difficulty of maintaining it by shifting it away from ops to development. Keep the disaster-recovery-level database backups for that rainy day, but make customer-level backup/restore an integrated feature, developed and maintained like any other service.
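To illustrate, a rough sketch of what the export half of such a feature could look like, assuming a shared Postgres database where every tenant-owned table carries a tenant_id column (the table names below are made up):

    # Rough sketch of a customer-level export. Assumes each tenant-owned table
    # has a tenant_id column; the table names are made up for illustration.
    import psycopg2
    from psycopg2 import sql

    TENANT_TABLES = ["projects", "issues", "comments", "attachments"]  # hypothetical

    def export_tenant(dsn: str, tenant_id: str, out_dir: str) -> None:
        """Dump one tenant's rows, table by table, as CSV files."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for table in TENANT_TABLES:
                copy_stmt = sql.SQL(
                    "COPY (SELECT * FROM {} WHERE tenant_id = {}) TO STDOUT WITH CSV HEADER"
                ).format(sql.Identifier(table), sql.Literal(tenant_id))
                with open(f"{out_dir}/{table}.csv", "w") as f:
                    cur.copy_expert(copy_stmt.as_string(conn), f)

The matching restore would be the same loop with COPY ... FROM STDIN after clearing that tenant's current rows; the point is that it becomes an application feature exercised regularly, not a rainy-day ops runbook.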
I wouldn't be surprised if a lot of the time was spent just waiting on the ops team to perform restores. If it's a manual, labor-intensive process, it'll likely take them a while to work through the entire list.
I don't have a ton of experience doing this, but in a lot of multitenant systems you just give each tenant their own database in the first place. It solves multiple classes of problems.
A single Postgres instance can (at least theoretically) hold about 4 billion databases.
Most of the multi-tenant SaaS products I've worked with do NOT have per-tenant databases. I'm sure some do, but the bulk of the multi-tenant products use one (or several) larger databases.
> I'm sure some do, but the bulk of the multi-tenant products use one (or several) larger databases.
In that case, you're trading isolation for ease of development. That said, having a schema per user (even if in the same physical database) seems like a nice approach, if you can stomach the overhead and added ops complexity.
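For what it's worth, a rough sketch of the schema-per-tenant variant in Postgres (all identifiers here are made up for illustration):

    # Rough sketch of schema-per-tenant within a single Postgres database.
    # Schema and table names are made up for illustration.
    import psycopg2
    from psycopg2 import sql

    def create_tenant_schema(conn, tenant: str) -> None:
        """Each tenant gets its own schema holding its own copy of the tables."""
        schema = sql.Identifier(f"tenant_{tenant}")
        with conn.cursor() as cur:
            cur.execute(sql.SQL("CREATE SCHEMA IF NOT EXISTS {}").format(schema))
            cur.execute(sql.SQL(
                "CREATE TABLE IF NOT EXISTS {}.issues (id BIGSERIAL PRIMARY KEY, title TEXT)"
            ).format(schema))
        conn.commit()

    def use_tenant(conn, tenant: str) -> None:
        """Point the session at one tenant's schema; unqualified table names resolve there."""
        with conn.cursor() as cur:
            cur.execute(sql.SQL("SET search_path TO {}").format(
                sql.Identifier(f"tenant_{tenant}")))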
It also creates multiple classes of problems. There are known issues with PostgreSQL's handling of databases containing tens of thousands of tables (you will need a disproportionate amount of memory to handle that use case on a busy DB server).
Using a single database with tenant isolation by a discriminator key (preferably enforced by row-level security) is a lot more efficient.
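A minimal sketch of what that enforcement can look like in Postgres (table, column, and setting names are illustrative, not from any particular product):

    # Minimal sketch of discriminator-key tenancy enforced by Postgres row-level
    # security. Table/column/setting names are illustrative.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS issues (
        id        BIGSERIAL PRIMARY KEY,
        tenant_id UUID NOT NULL,
        title     TEXT NOT NULL
    );
    ALTER TABLE issues ENABLE ROW LEVEL SECURITY;
    -- Each session only sees (and can only touch) rows for its own tenant.
    CREATE POLICY tenant_isolation ON issues
        USING (tenant_id = current_setting('app.current_tenant')::uuid);
    """

    def run_as_tenant(conn, tenant_id: str, query: str, params=()):
        """Pin the session to one tenant; RLS then filters rows even if a query
        forgets its WHERE tenant_id clause (assuming the app role is neither the
        table owner nor BYPASSRLS)."""
        with conn.cursor() as cur:
            cur.execute("SELECT set_config('app.current_tenant', %s, false)", (tenant_id,))
            cur.execute(query, params)
            return cur.fetchall()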
Of course you can't blindly restore, but it seems that's what they 'test'. Either they're completely incompetent, or they don't test the real procedure.