They claim they test backups quarterly, yet they don't have a procedure in place to restore operations. We all know a backup isn't tested until you've successfully restored everything from it.
This is not an engineering mistake; it is a flat-out lie.
Well, their explanation makes sense. These are multi-tenant environments where not every tenant was affected; sensibly, the backups appear to be divided by environment, not by tenant. You can't blindly revert to an environment's last backup in this scenario, although you'd think they would have had to do this kind of restore before.
You can imagine the problems with restoring one individual tenant's data into an otherwise active database with many tenants, e.g. any cross-tenant primary keys will have shifted since that one tenant's last backup. Separating the backups by tenant wouldn't help with the restoration.
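To make the key problem concrete, a toy sketch assuming a shared table keyed by one global sequence with a tenant discriminator column (schema, ids, and database name are made up for illustration):

    # Toy illustration of the cross-tenant key problem. Schema and ids are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=demo")  # assumed throwaway database
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS issues (
                id        BIGSERIAL PRIMARY KEY,
                tenant_id TEXT NOT NULL,
                title     TEXT NOT NULL
            )""")
        # Tenant A's backup, taken earlier, contained a row with id 100. Since then
        # the shared sequence has moved on and that id has been handed to tenant B:
        cur.execute("INSERT INTO issues (id, tenant_id, title) VALUES (100, 'B', 'live row')")
        # Naively replaying tenant A's backed-up row now collides with B's live data:
        cur.execute("INSERT INTO issues (id, tenant_id, title) VALUES (100, 'A', 'restored row')")
        # -> psycopg2.errors.UniqueViolation: duplicate key value violates
        #    unique constraint "issues_pkey"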
It’s actually a real pain to handle “database per tenant”. With Postgres, for example, that’d mean separate database connections per tenant, which is wildly unscalable with lots of (particularly small) tenants per server.
1) It's super common even in multitenant systems to have a common database with configuration information (for example) which serves all tenants, and tenant-specific databases used alongside that to host their private data (rough sketch of this layout below).
2) Back when sharding started to become a popular scaling pattern, tenants were not always split up along the tenant boundary but by some other reliable key. Obviously this isn't true multitenancy, and I think most DBAs would discourage the pattern today. However, given the age of Atlassian's products (and assuming a fast-and-loose engineering culture, which has been alluded to elsewhere), it's entirely possible that parts of these products, or even the products as a whole, use this kind of sharding.
Bottom line, we can only hypothesize unless and until someone from Atlassian actually details their architecture (which may have happened? I dunno, I haven't been paying that much attention to it…)
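For what it's worth, a minimal sketch of pattern 1, assuming a shared "catalog" database that maps each tenant to its own private database (all names here are hypothetical, not anything Atlassian has described):

    # Hypothetical layout: a shared catalog database that records which private
    # database each tenant lives in. Names are made up for illustration.
    import psycopg2

    CATALOG_DSN = "host=db.internal dbname=catalog user=app"  # assumed shared config DB

    def dsn_for_tenant(tenant_id: str) -> str:
        """Look up which private database holds this tenant's data."""
        conn = psycopg2.connect(CATALOG_DSN)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT db_host, db_name FROM tenants WHERE tenant_id = %s",
                    (tenant_id,),
                )
                row = cur.fetchone()
        finally:
            conn.close()
        if row is None:
            raise KeyError(f"unknown tenant: {tenant_id}")
        host, dbname = row
        return f"host={host} dbname={dbname} user=app"

    def connect_for_tenant(tenant_id: str):
        """Tenant-private queries go to the tenant's own database. The catalog never
        depends on those databases, so each one can be restored independently."""
        return psycopg2.connect(dsn_for_tenant(tenant_id))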
> 1) It's super common even in multitenant systems to have a common database with configuration information (for example) which serves all tenants, and tenant-specific databases used alongside that to host their private data.
Yeah, for sure. This is definitely what I'd expect to see, but I would also expect that to make individual client restores pretty easy, assuming the individual client backups themselves weren't trashed.
One would expect the shared config database to have no dependency on any of the individual client databases, so those could be moved/dropped/restored at will, independently of the shared config database.
> 2) Back when sharding started to become a popular scaling pattern, tenants were not always split up along the tenant boundary but by some other reliable key.
I guess that makes sense; after all, it does allow large/demanding clients to span multiple databases.
Why not restore the tenants to a different environment that is not otherwise active? At Atlassian's scale you would expect them 1) not to be running all the things on one server anyway, 2) to have some existing ability to move tenants between environments for legal or performance reasons, 3) to have the ability to back up/restore single tenants, and so on. I'm not disputing that the predicament they are in right now is real for them. But I have worked at much smaller outfits where this worked fine, not because they were smaller, but because they had that particular shit in order, which is entirely a matter of priorities.
Nothing, and I mean absolutely nothing, that Atlassian has to offer is rocket-surgery kind of hard... yet here we are... not being particularly surprised at all.
I can imagine plenty of issues, but it's definitely a limitation in their design, and I'd be surprised if they haven't run into it before. Surely they've had a tenant destroy their instance and request a restore before.
They're in control of the architecture: rollback, backup, and recovery should all be considerations.
It is a complex problem, but it's one worth solving. Just spitballing, but I think you could reduce some of the difficulty of maintaining it by shifting it away from ops to development. Keep the disaster-recovery-level database backups for that rainy day, but make customer-level backup/restore an integrated feature, developed and maintained like any other service.
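To illustrate, a rough sketch of what the export half of such a feature could look like, assuming a shared Postgres database where every tenant-owned table carries a tenant_id column (the table names below are made up):

    # Rough sketch of a customer-level export. Assumes each tenant-owned table
    # has a tenant_id column; the table names are made up for illustration.
    import psycopg2
    from psycopg2 import sql

    TENANT_TABLES = ["projects", "issues", "comments", "attachments"]  # hypothetical

    def export_tenant(dsn: str, tenant_id: str, out_dir: str) -> None:
        """Dump one tenant's rows, table by table, as CSV files."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for table in TENANT_TABLES:
                copy_stmt = sql.SQL(
                    "COPY (SELECT * FROM {} WHERE tenant_id = {}) TO STDOUT WITH CSV HEADER"
                ).format(sql.Identifier(table), sql.Literal(tenant_id))
                with open(f"{out_dir}/{table}.csv", "w") as f:
                    cur.copy_expert(copy_stmt.as_string(conn), f)

The matching restore would be the same loop with COPY ... FROM STDIN after clearing that tenant's current rows; the point is that it becomes an application feature exercised regularly, not a rainy-day ops runbook.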
I wouldn't be surprised if a lot of the time was spent just waiting on the ops team to perform restores. If it's a manual, labor-intensive process, it'll likely take them a while to work through the entire list.
I don't have a ton of experience doing this, but in a lot of multitenant systems you just give each tenant their own database in the first place. It solves multiple classes of problems.
A single Postgres instance can (at least theoretically) hold about 4 billion databases.
Most of the multi-tenant SaaS products I've worked with do NOT have per-tenant databases. I'm sure some do, but the bulk of the multi-tenant products use one (or several) larger databases.
> I'm sure some do, but the bulk of the multi-tenant products use one (or several) larger databases.
In that case, you're trading isolation for ease of development. That said, having a schema per user (even if in the same physical database) seems like a nice approach, if you can stomach the overhead and added ops complexity.
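For what it's worth, a rough sketch of the schema-per-tenant variant in Postgres (all identifiers here are made up for illustration):

    # Rough sketch of schema-per-tenant within a single Postgres database.
    # Schema and table names are made up for illustration.
    import psycopg2
    from psycopg2 import sql

    def create_tenant_schema(conn, tenant: str) -> None:
        """Each tenant gets its own schema holding its own copy of the tables."""
        schema = sql.Identifier(f"tenant_{tenant}")
        with conn.cursor() as cur:
            cur.execute(sql.SQL("CREATE SCHEMA IF NOT EXISTS {}").format(schema))
            cur.execute(sql.SQL(
                "CREATE TABLE IF NOT EXISTS {}.issues (id BIGSERIAL PRIMARY KEY, title TEXT)"
            ).format(schema))
        conn.commit()

    def use_tenant(conn, tenant: str) -> None:
        """Point the session at one tenant's schema; unqualified table names resolve there."""
        with conn.cursor() as cur:
            cur.execute(sql.SQL("SET search_path TO {}").format(
                sql.Identifier(f"tenant_{tenant}")))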
It also creates multiple classes of problems. There are known issues with PostgreSQL's handling of databases containing tens of thousands of tables (you will need a disproportionate amount of memory to handle that use case on a busy DB server).
Using a single database with tenant isolation by a discriminator key (preferably enforced by row-level security) is a lot more efficient.
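A minimal sketch of what that enforcement can look like in Postgres (table, column, and setting names are illustrative, not from any particular product):

    # Minimal sketch of discriminator-key tenancy enforced by Postgres row-level
    # security. Table/column/setting names are illustrative.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS issues (
        id        BIGSERIAL PRIMARY KEY,
        tenant_id UUID NOT NULL,
        title     TEXT NOT NULL
    );
    ALTER TABLE issues ENABLE ROW LEVEL SECURITY;
    -- Each session only sees (and can only touch) rows for its own tenant.
    CREATE POLICY tenant_isolation ON issues
        USING (tenant_id = current_setting('app.current_tenant')::uuid);
    """

    def run_as_tenant(conn, tenant_id: str, query: str, params=()):
        """Pin the session to one tenant; RLS then filters rows even if a query
        forgets its WHERE tenant_id clause (assuming the app role is neither the
        table owner nor BYPASSRLS)."""
        with conn.cursor() as cur:
            cur.execute("SELECT set_config('app.current_tenant', %s, false)", (tenant_id,))
            cur.execute(query, params)
            return cur.fetchall()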
Of course you can't blindly restore, but it seems that's what they 'test'. Either they're completely incompetent, or they don't test the real procedure.