
They claim they test backups quarterly, yet they don't have a procedure in place to restore the operation. We all know your backup is not tested until you have restored everything successfully. This is not an engineering mistake, it is a flat-out lie.


Well, their explanation makes sense. These are multi-tenant environments where not every tenant was affected; sensibly, the backups appear divided by environment, not tenant. You can’t blindly revert to an environment’s last backup in this scenario, although you’d think they would have done it before.


Not having per tenant backups is sensible? Seems like a bit of an oversight. Doesn't really matter if the hosting is multi-tenant or not.


You can imagine problems restoring one individual tenant's data to an otherwise active database with many tenants, e.g. any cross-tenant primary keys will have shifted since that tenant's last backup. Separating the backups wouldn't help with the restoration.
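To make the key problem concrete, here is a minimal sketch using SQLite from Python's standard library. The `issues` table, the tenant names, and the `restore` helper are all made up for illustration; nothing here reflects Atlassian's actual schema. It only shows how a shared key sequence keeps moving after one tenant's backup is taken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, tenant TEXT, title TEXT)")

# State at backup time: both (hypothetical) tenants share one key sequence.
conn.executemany(
    "INSERT INTO issues (id, tenant, title) VALUES (?, ?, ?)",
    [(1, "acme", "Login bug"), (2, "globex", "Slow report"), (3, "acme", "Typo")],
)
backup_acme = conn.execute(
    "SELECT id, tenant, title FROM issues WHERE tenant = 'acme' ORDER BY id"
).fetchall()

# After the backup: acme's rows are deleted by mistake, globex keeps working
# and is handed a key that acme used to own.
conn.execute("DELETE FROM issues WHERE tenant = 'acme'")
conn.execute("INSERT INTO issues (tenant, title) VALUES ('globex', 'New ticket')")


def restore(rows):
    """Naively re-insert backed-up rows with their original primary keys."""
    restored, collisions = [], []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO issues (id, tenant, title) VALUES (?, ?, ?)", row
            )
            restored.append(row[0])
        except sqlite3.IntegrityError:
            # The key now belongs to another tenant's row.
            collisions.append(row[0])
    return restored, collisions


restored, collisions = restore(backup_acme)
print("restored ids:", restored, "colliding ids:", collisions)
```

Having the tenant's rows in a separate backup file does not change this: the conflict comes from the live database having moved on, not from how the backup is stored.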


I don't have Postgres/MySQL scaling experience beyond wrangling some largeish monoliths with < 100 databases.

But,

    You can imagine problems restoring one 
    individual tenant's data to an otherwise 
    active database with many tenants

    any cross-tenant primary keys
Why would multiple tenants share a database? Sharing a database server, yes, but sharing databases and mingling primary keys and such?

That's such a recipe for disaster; giving each client their own database seems like the easiest win in the world.

But I'm not intimate with Atlassian products. Maybe they have some products where that's not practical for some reason.


It’s actually a real pain to handle “database per tenant”. Now for Postgres, for example, that’d mean database connections per tenant, which is wildly unscalable with lots of (particularly small) tenants per server.
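A rough back-of-the-envelope version of that, in Python. Every number is invented for illustration; the only fact assumed is that a Postgres connection is bound to a single database, so a pool cannot be shared across tenant databases.

```python
# All numbers below are made up for illustration.
TENANTS = 5_000                # hypothetical tenant count on one server
POOL_PER_DATABASE = 2          # hypothetical minimum pool size per tenant database
SERVER_CONNECTION_LIMIT = 500  # hypothetical max_connections setting


def connections_needed(databases: int, pool_size: int) -> int:
    """Each database needs its own pool, so the totals multiply."""
    return databases * pool_size


per_tenant = connections_needed(TENANTS, POOL_PER_DATABASE)
shared = connections_needed(1, 50)  # one shared database with a single pool of 50

print(per_tenant, shared, per_tenant > SERVER_CONNECTION_LIMIT)
```

With many small tenants, most of those per-tenant connections would sit idle while still counting against the server limit.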


Couple of scenarios come to mind...

1) It's super common even in multitenant systems to have a common database with configuration information (for example) which serves all tenants, and tenant-specific databases used alongside that to host their private data.

2) Back when sharding started to be a popular scaling pattern, tenants were not always split up by the tenant boundary but by some other reliable key. Obviously this isn't true multitenancy and I think most DBAs would discourage the pattern today. However, given the age of the products at Atlassian (and assuming a fast-and-loose engineering culture, which has been alluded to elsewhere) it's entirely possible that parts of these products as well as the entire product itself may use this kind of sharding.

Bottom line, we can only hypothesize unless and until someone from Atlassian actually details their architecture (which may have happened? I dunno, I haven't been paying that much attention to it…)


    1) It's super common even in multitenant 
    systems to have a common database with 
    configuration information (for example) 
    which serves all tenants, and tenant-specific 
    databases used alongside that to host their 
    private data.
Yeah, for sure. This is definitely what I'd expect to see, but I would also expect that to make individual client restores pretty easy, assuming the individual client backups themselves weren't trashed.

One wouldn't imagine that the shared config database would have a dependency on any of the individual client databases and that they could therefore be moved/dropped/restored at will, independently of the shared config database.

    2) Back when sharding started to be a popular 
    scaling pattern, tenants were not always split 
    up by the tenant boundary but by some other 
    reliable key. 
I guess that makes sense. After all, it does allow large/demanding clients to span multiple databases.


Why not restore the tenants to a different environment that is not otherwise active? At Atlassian's scale you would expect them 1) not to be running all the things on one server anyway, 2) to have some existing ability to move tenants between environments for legal or performance reasons, 3) to have the ability to back up and restore single tenants, and so on. I am not arguing that the predicament they are in now isn't real, for them, now. But I have worked at much smaller outlets where this worked fine, not because they were smaller, but because they had that particular shit in order, which is entirely a matter of priorities.

Nothing, and I mean absolutely nothing, that Atlassian has to offer is rocket surgery-kind of hard... yet, here we are... not being particularly surprised at all.


Depending on the size of the tenant’s data, moving to a new environment could take a while.

But the idea of doing a full restore to a new environment, then only enabling accounts for the impacted tenants, is a good one.
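A toy version of that flow, assuming a hypothetical `tenants` table with an `enabled` flag; the function names are made up as well. SQLite's `Connection.backup` stands in for whatever full-environment restore mechanism actually exists.

```python
import sqlite3


def restore_to_new_environment(backup: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the whole backup into a fresh database and disable every tenant."""
    fresh = sqlite3.connect(":memory:")
    backup.backup(fresh)  # copies the full database, all tenants included
    fresh.execute("UPDATE tenants SET enabled = 0")
    fresh.commit()
    return fresh


def enable_impacted(env: sqlite3.Connection, impacted: list[str]) -> None:
    """Switch on only the tenants that actually need the restored data."""
    env.executemany(
        "UPDATE tenants SET enabled = 1 WHERE name = ?", [(n,) for n in impacted]
    )
    env.commit()


def active_tenants(env: sqlite3.Connection) -> list[str]:
    rows = env.execute("SELECT name FROM tenants WHERE enabled = 1 ORDER BY name")
    return [name for (name,) in rows]


# Stand-in for the environment-level backup (hypothetical tenants).
backup = sqlite3.connect(":memory:")
backup.execute("CREATE TABLE tenants (name TEXT PRIMARY KEY, enabled INTEGER NOT NULL)")
backup.executemany(
    "INSERT INTO tenants VALUES (?, 1)", [("acme",), ("globex",), ("initech",)]
)
backup.commit()

env = restore_to_new_environment(backup)
enable_impacted(env, ["acme", "initech"])
print(active_tenants(env))
```

The unaffected tenant (`globex` here) is present in the restored copy but never enabled, so its stale data is not served from the new environment.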


I can imagine plenty of issues but it's definitely a limitation in their design and I'd be surprised if they haven't run into it before. Surely they've had a tenant destroy their instance and request a restore before.

They're in control of the architecture: rollback, backup, and recovery should all be considerations.


It is a complex problem but it’s one worth solving. Just spitballing, but I think you could reduce some of the difficulty of maintaining it by shifting it away from ops to development. Keep the disaster-recovery-level database backups for that rainy day, but make customer-level backup/restore an integrated feature developed and maintained like other services.
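As a sketch of what "backup/restore as an application feature" might look like, here is a tiny export/import pair in Python over SQLite. The `notes` table and both function names are hypothetical. The export deliberately leaves out primary keys so the import can take fresh ones; a real schema with foreign keys would need an ID-remapping step that is not shown here.

```python
import json
import sqlite3


def export_tenant(conn: sqlite3.Connection, tenant: str) -> str:
    """Serialize one tenant's rows without their database-assigned ids."""
    rows = conn.execute(
        "SELECT body FROM notes WHERE tenant = ? ORDER BY id", (tenant,)
    ).fetchall()
    return json.dumps({"tenant": tenant, "notes": [body for (body,) in rows]})


def import_tenant(conn: sqlite3.Connection, payload: str) -> int:
    """Replace one tenant's rows from an export; other tenants are untouched."""
    data = json.loads(payload)
    conn.execute("DELETE FROM notes WHERE tenant = ?", (data["tenant"],))
    conn.executemany(
        "INSERT INTO notes (tenant, body) VALUES (?, ?)",
        [(data["tenant"], body) for body in data["notes"]],
    )
    conn.commit()
    return len(data["notes"])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, tenant TEXT NOT NULL, body TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO notes (tenant, body) VALUES (?, ?)",
    [("acme", "first"), ("globex", "theirs"), ("acme", "second")],
)

snapshot = export_tenant(conn, "acme")

# Later: acme's data is wiped by accident while globex keeps writing.
conn.execute("DELETE FROM notes WHERE tenant = 'acme'")
conn.execute("INSERT INTO notes (tenant, body) VALUES ('globex', 'newer')")

restored_count = import_tenant(conn, snapshot)
acme_after = json.loads(export_tenant(conn, "acme"))["notes"]
globex_after = json.loads(export_tenant(conn, "globex"))["notes"]
print(restored_count, acme_after, globex_after)
```

Because it goes through the same data model as the rest of the application, a feature like this can be tested in CI like any other service, which is the point of moving it from ops to development.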


I wouldn't be surprised if a lot of the time was spent just waiting on the ops team to perform restores. If it's a manual, labor-intensive process, it's likely to take them a while to work through the entire list.


How do you implement per tenant backups? Not every db system cleanly separates where each tenant’s data is stored.


I don't have a ton of experience doing this, but with a lot of multitenant you just give each tenant their own database in the first place. It solves multiple classes of problems.

On a single Postgres instance you can (at least theoretically) have 4 billion databases per instance.


Most of the multi-tenant SaaS products I've worked with do NOT have per tenant databases. I'm sure some do, but the bulk of the multi-tenant products use one (or several) larger databases.


> I'm sure some do, but the bulk of the multi-tenant products use one (or several) larger databases.

In that case, the tradeoff between isolation and ease of development is made. That said, having a schema per user (even if in the same physical database) seems like a nice approach, if you can stomach the overhead and added ops complexity.


It also creates multiple classes of problems. There are known issues with PostgreSQL's handling of databases containing tens of thousands of tables (you will need a disproportionate amount of memory to handle that use case on a busy db server). Using a single database with tenant isolation by a discriminator key (preferably enforced by row-level security) is a lot more efficient.
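For anyone unfamiliar with the discriminator-key idea, a minimal sketch in Python over SQLite follows. Row-level security itself is a PostgreSQL feature and is not shown; this only imitates the effect at the application layer. The `docs` table and the `TenantScope` helper are hypothetical.

```python
import sqlite3


class TenantScope:
    """Hypothetical helper: every statement is filtered by one tenant_id."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: int) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def add_doc(self, title: str) -> None:
        self._conn.execute(
            "INSERT INTO docs (tenant_id, title) VALUES (?, ?)",
            (self._tenant_id, title),
        )

    def titles(self) -> list[str]:
        rows = self._conn.execute(
            "SELECT title FROM docs WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [title for (title,) in rows]


# One shared table for all tenants; tenant_id is the discriminator key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, tenant_id INTEGER NOT NULL, title TEXT NOT NULL)"
)

acme, globex = TenantScope(conn, 1), TenantScope(conn, 2)
acme.add_doc("Roadmap")
globex.add_doc("Budget")
acme.add_doc("Runbook")

print(acme.titles(), globex.titles())
```

The weakness of doing this only in application code is that one forgotten `WHERE tenant_id = ?` leaks data, which is why enforcing the same filter in the database is preferable where it is available.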


4 billion databases and 10k connections - I sense a problem :D

And I don't think even pgbouncer would help.


Of course you can't blindly restore, but it seems that's what they 'test'. Either they are completely incompetent or they don't test the real procedure.


They'd test monthly, but a full restore takes a quarter XD



