How Figma's databases team lived to tell the scale (figma.com)
568 points by pinser98 10 months ago | 223 comments



1. They mention that the largest tables ran into several TBs, and they would have soon topped the max IOPS supported by RDS. RDS for PostgreSQL peaks at 256,000 IOPS for a 64 TB volume. For a multi-AZ setup, this costs ~$70K/mo.

2. Let's assume the final outcome was a 5-way shard with each shard supporting ~50,000 IOPS and ~12 TB data. For a multi-AZ setup, this costs ~$100K/mo.

3. It took 9 months to shard their first table. Since it required application changes as well, let's assume this was 9mo * 20 work days/mo * (3 DB engineers + 2 app engineers) = 900 work days. Even at $100K avg. annual pay for an engineer, this is ~$400K.

4. A PostgreSQL-compatible NewSQL like YugabyteDB should cost ~$15K/mo to match top-of-the-line RDS performance. So Figma spent ~25x ($400K/$15K) to implement horizontal sharding in-house, and is still on RDS, which costs ~6x ($100K/$15K).


> A PostgreSQL-compatible NewSQL like YugabyteDB should cost ~$15K/mo to match top-of-the-line RDS performance.

"to match" is doing a lot of work here. It's extremely unwise that a "compatible" database will have the same performance characteristics and no performance cliffs.

> So Figma spent ~25x ($400K/$15K)

They nearly got acquired for $20B, I don't think they give a hoot about 400K if it means keeping their stack the same and getting to keep all of the existing organizational knowledge about how to keep the thing online.


They're also using sticker pricing, which as we know is likely far from what they actually paid.

(And the rest of the argument sounds a lot like our former employer with their...choice of document DB :) which I'll defend too!)


That’s the problem with case studies, isn’t it? What works for one company in one industry might be a death sentence for another organization with slimmer margins…


Figma, worth $10 billion, was migrating what seems like their core production data. They probably didn't want to bet the company on a comparatively small software vendor like Yugabyte.

Most likely the engineering cost was much much higher than your quotes, but still insignificant compared to the potential risks. And migrating from RDS to not-RDS could easily not have been cheap in engineering time either, depending on how much software they've built around it.


Past a certain data size, migrations are always a nightmare. For a much longer time than what you initially estimated, you are managing two systems with all the related operational costs and complexity, as well as all of the IOPS and bandwidth spent migrating the data.

I imagine scaling out RDS instead mitigated a lot of those costs.


Their rationale for this choice is covered in the article somewhat extensively near the top.

> Additionally, over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house. While migrating, we would have had to rebuild our domain expertise from scratch. Given our very aggressive growth rate, we had only months of runway remaining. De-risking an entirely new storage layer and completing an end-to-end-migration of our most business-critical use cases would have been extremely risky on the necessary timeline. We favored known low-risk solutions over potentially easier options with much higher uncertainty, where we had less control over the outcome.

TL;DR they were working in a short timeline, with a limited team size, and wanted to minimise any risks to the business.

Clearly cost is an issue for Figma, but downtime, or worse data loss, would have a ginormous impact on their business and potential future growth. If your product is already profitable, your user base is growing fast, and your ARR is that high, why would you risk that growth and future ARR just to save a few $10Ks a month? A very low risk DB migration that lets you keep scaling and raking in more money is way better than a high risk migration that might save some cash in the long term, but also risks killing your primary business if it goes wrong.


Ok, what risk? CockroachDB is already proven technology and costs marginally more (if you use their serverless setup, it's free until you hit real scale). At the startups I've been at that hit scale, scaling SQL was always a massive undertaking and affected product development every single time.

If you don't want downtime, don't use databases that require downtime to do a migration?

Netflix, Roblox, every single online gambling website all use CockroachDB.


Sounds like their discomfort was in the migration path to 'any other database' alongside not having the experience with another database to mitigate any unknown unknowns.

> During our evaluation, we explored CockroachDB, TiDB, Spanner, and Vitess. However, switching to any of these alternative databases would have required a complex data migration to ensure consistency and reliability across two different database stores.


> ensure consistency and reliability across two different database stores.

This is the main known known, and it's a hard thing to attain.

My favorite story on that is the testing of a Tendermint consensus implementation [1]. The testing process found a way to break the consensus, and the reason was that the protocol implementation and the KV store it controlled used different databases.

[1] https://jepsen.io/analyses/tendermint-0-10-2


Never used cockroach so pardon my ignorance, but are there no operational challenges with running/using them? Or are they the same challenges? And how compatible is it from an application developer perspective?


The managed service is hassle-free and it's auto-sharded, so you don't have traditional scaling issues. You do need to think about how your index choices spread writes and reads across the cluster to avoid hotspots. It's almost completely compatible with the Postgres wire protocol, but it doesn't support things like extensions for the most part.


There are TONS of operational issues running Cockroach. At the last company I was at, Cockroach was probably overused as a magical way to run multiple DCs and keep things consistent without high developer overhead, but it was the #1 source of large outages. So much so that we'd run a Cockroach cluster segmented out for a single microservice to limit the blast radius when it eventually failed.

That, and it's comically more expensive than Postgres; if you think IOPS are expensive, wait till you see the service contract.


CRDB is Postgres-compatible, so the wire protocol and SQL syntax are all Postgres. It should be a 1:1.


Are all the corresponding latencies for every query one to one too?


In ye olden times I used to stop bosses from throwing away the slowest machine we had, and try to get at least one faster machine.

It’s still somewhat the case, but at the time the world was rotten with concurrent code that only worked because an implicit invariant (almost) always held. One that was enforced by the relative time or latency involved with two competing tasks. Get new motherboards or storage or memory and that invariant goes from failing only when the exact right packet loss happens, to failing every day, or hour, or minute.

Yes, it’s a bug, but it wasn’t on your radar and the system was trucking along yesterday and now everything is on fire.

The people who know this think the parent is a very interesting question. The people who don’t, tend to think it’s a non sequitur.


This is the important question.

We evaluated several horizontally scalable DBs and Cockroach was by far the slowest for our access patterns.


Except for the un-implemented features which they might need.

It also uses serializable isolation and in their implementation reads are blocked by writes unlike in Postgres. Those are both significant changes that can have far reaching application impacts


IOPS isn't a linear thing here. The vacuums needed to prevent transaction wraparound (VACUUM FREEZE) can't be throttled and are much more expensive than regular vacuums. By splitting the tables they are likely reducing the need for those vacuums (by a large margin) and significantly reducing IOPS needs.
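
If you want to see how close each database is to one of those forced freeze vacuums, a quick check looks something like this (a minimal sketch with psycopg2; the connection string is a placeholder):

    # Rough check of how close each database is to an anti-wraparound (freeze) vacuum.
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@localhost:5432/postgres")  # placeholder
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT datname,
                   age(datfrozenxid) AS xid_age,
                   current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
            FROM pg_database
            ORDER BY age(datfrozenxid) DESC
        """)
        for name, xid_age, freeze_max in cur.fetchall():
            # As xid_age approaches freeze_max_age, autovacuum forces aggressive
            # freeze vacuums on that database whether it is convenient or not.
            print(f"{name}: {xid_age:,} / {freeze_max:,} ({xid_age / freeze_max:.0%})")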


bhai, use the western numbering system, for clarity's sake. Thanks for the summary


One thought that comes up: Wouldn’t it be easier to have each customer in their own (logical) database? I mean, you don’t need transactions across different customers, right? So you’re essentially solving a harder problem than the one you’ve got.

Not sure postgres (logical) databases would scale that well, but I don't see a principled reason why they couldn't. Has anyone explored this further?


Yes, we've been doing that at my place basically since the start. Each tenant is a schema in postgres. Works perfectly fine on the one hand, i.e. your tables don't grow to 'infinity' just because you're adding more and more tenants. If there's a particular tenant that has lots of data, only that tenant's indexes and tables grow huge and become slower because of that particular reason etc. If a tenant leaves, you keep the schema around for some time, so they can come back and then at some point you just drop the schema!

It does mean having to upgrade each schema individually, which also makes it both easier and harder. Easier because the tables are smaller, so any schema changes that require things like say a table lock, are locking for a smaller amount of time and won't affect more than the one tenant at a given time. It also means that you can get into an inconsistent state of course, where some of your tenants have all the latest DB upgrades, while it failed on another subset.

At some point Postgres's internal tables become a bit of a "problem", as you want to run as many of these updates in parallel as you can for speed, which could lead to contention on Postgres' administrative side. You'll also still need to shard across multiple actual RDS instances, because you still have many tenants running against a single physical piece of hardware that will show its limitations if too many large or active tenants happen to be on the same shard.

And then you have the problem of keeping a record of which physical RDS instance (i.e. shard) the tenant is on. Your application code will need to look that up (and cache that info for some time ;)) and you have some choice there as well. I.e. do you shard those as well and juggle load balancing as basically a 1:1 mapping to shards or do you have your application layer connect to all database shards and handle any tenants? One is more complicated I would say while the other could run out of connections depending on how you need to scale the application layer and what kind of RDS instance you have.
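
The lookup layer itself can stay tiny; a minimal sketch of what I mean (the directory table, DSNs and TTL are all made up, and in production you'd want a proper cache rather than a module-level dict):

    import time
    import psycopg2

    # Hypothetical directory DB mapping tenant -> physical shard DSN.
    DIRECTORY_DSN = "postgresql://app:secret@directory-db:5432/tenants"
    CACHE_TTL_SECONDS = 300
    _cache = {}  # tenant_id -> (shard_dsn, fetched_at)

    def shard_dsn_for(tenant_id):
        """Return the DSN of the shard holding this tenant, with a small TTL cache."""
        hit = _cache.get(tenant_id)
        if hit and time.time() - hit[1] < CACHE_TTL_SECONDS:
            return hit[0]
        with psycopg2.connect(DIRECTORY_DSN) as conn, conn.cursor() as cur:
            cur.execute("SELECT shard_dsn FROM tenant_directory WHERE tenant_id = %s",
                        (tenant_id,))
            row = cur.fetchone()
        if row is None:
            raise KeyError(f"unknown tenant {tenant_id}")
        _cache[tenant_id] = (row[0], time.time())
        return row[0]

    def connect_for(tenant_id):
        """Open a connection to the tenant's shard and pin search_path to its schema."""
        conn = psycopg2.connect(shard_dsn_for(tenant_id))
        with conn.cursor() as cur:
            cur.execute("SELECT set_config('search_path', %s, false)", (tenant_id,))
        return conn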


This is a very common approach and scales quite well. I worked for a company that had thousands of customers and each had their own schema. A single master database that kept track of which customer is on what physical db cluster, and this was globally replicated (EU,ANZ, NA).

Certainly needs a bunch of tooling, but worked well. Some apps were stateless and could connect to any physical cluster. Some others were sticky and only connected to a subset.

Similar architecture in my current company as well and we serve nearly a thousand customer instances served across 4 physical clusters.

We do have some basic tools to provision new customers on the emptiest cluster, move customers from one cluster to another etc


I recall a popular rails gem[1] once upon a time that provided multi-tenancy via postgres schemas.

As it turns out, even the company that initially developed the gem ended up ditching the approach due to some of the issues you outlined above.

Managing separate schemas feels like one of those insidious decisions that make things simple _initially_ but get you into a world of hurt when you need to scale. The company has since gone defunct, but they have an article where they discuss why they ditched the approach [2]. TL;DR: it's too difficult to maintain and scale.

[1] https://github.com/influitive/apartment#tenants-on-different... [2] https://web.archive.org/web/20201108191323/https://influitiv...


This is the conclusion I came to when faced with the same quandary.


Let's address these one by one based on our experience (the part of the journey that I've been there at least as the implementation of the solution predates me but I live with it).

Migrations

    Things slow down past 100 [...] No one wants friction in their deployment process, especially as we’re attempting to deploy daily or more frequently.
We have more than a thousand schemas per database shard. That is why I said you want things to run in parallel for speed, yes. However, we deploy way more than daily and it's not really an issue in that sense. Schema updates are not that frequent but you do of course need to take into account that you will have to make the schema updates as a separate PR, wait and check that it worked and then deploy your actual change making use of the changes in application code. Which honestly isn't much different from ensuring that your BE changes and FE changes are compatible or made in the right order so that you don't get failed requests because an old FE happens to call a new BE node or vice versa :shrug:
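
For what it's worth, the parallel runner itself can stay fairly dumb; a rough sketch (the DSN and the DDL are placeholders, and real tooling also needs to record which schemas have already been migrated):

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import psycopg2

    SHARD_DSN = "postgresql://app:secret@shard-01:5432/app"      # placeholder
    MIGRATION = "ALTER TABLE invoices ADD COLUMN due_date date"  # placeholder DDL
    PARALLELISM = 8  # keep modest to avoid contention on the catalog

    def tenant_schemas():
        with psycopg2.connect(SHARD_DSN) as conn, conn.cursor() as cur:
            cur.execute("""SELECT nspname FROM pg_namespace
                           WHERE nspname NOT IN ('public', 'information_schema')
                             AND nspname NOT LIKE 'pg_%'""")
            return [r[0] for r in cur.fetchall()]

    def migrate(schema):
        conn = psycopg2.connect(SHARD_DSN)
        try:
            with conn, conn.cursor() as cur:  # one transaction per tenant schema
                cur.execute(f'SET LOCAL search_path TO "{schema}"')
                cur.execute(MIGRATION)
            return f"{schema}: ok"
        except Exception as exc:              # one failed schema shouldn't stop the rest
            return f"{schema}: FAILED ({exc})"
        finally:
            conn.close()

    with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
        for future in as_completed([pool.submit(migrate, s) for s in tenant_schemas()]):
            print(future.result())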

Database resources

    "Need too large of an instance" r3.4xl
We seem to have m5.large. That has 2 virtual cores. Some r5.larges etc. Their r3.4xl has 16?! So not sure what kind of load pattern they have :shrug:

Client Memory Bloat

    Ruby ActiveRecord something
Yeah well, we don't have that, so not sure what to say :shrug: Definitely not a generalized reason to say "can't do, is too complicated for scaling".

Record Identification

    One major drawback of schemas as tenants is that your sequence generators will reside independently in each tenant. 
I respectfully disagree. This is a great advantage because it means just because you get more and more tenants (some of which churn and you throw away their data anyway) your identifiers don't grow past limits as easily. In fact, in most cases the identifiers never run out of runway at all.

They complain about "what if you need to join this data somewhere else, then you need to also add the tenantId". Yeah, so? We also have a data warehouse where we do just that. Not a problem at all. We also have other services than our main one, which use different database technologies where we use the tenant as part of the key (for loads that actually benefit from being in a NoSQL-type DB), and there we do not have sharding other than what the NoSQL database does by itself, so to speak, by dividing the keyspace.

That's it. End of article. Basically, we have none of these issues. They don't mention their actual scale. The only number they mention is the "hundreds of schemas". We have more than ten times that number per physical shard and have tens of thousands of tenants total. Again :shrug:


That's still 2 orders of magnitude smaller than the scale of Figma—they would need to somehow manage millions of Postgres schemas. I don't think it's a realistic possibility


https://news.ycombinator.com/item?id=39711815 Shopify seems to have millions of customers. Granted, they have MySQL and it might not use database schemas and something more MySQL specific. In general though ...


"We can attach a shop_id to all shop-owned tables and have it serve as a sharding key. Moving a shop from one shard to another involves selecting all records from all tables that have the desired shop_id and copying them to another MySQL shard. For this post, it’s helpful to think of each pod as a MySQL shard."

Sounds like this is just normal sharding. Completely different from having a totally separate schema like GP proposes (with that approach you don't use sharding keys at all).


I don't see how it follows from that. "Attach" seems very generic.

But I do see now, searching for that part of the info specifically, they mention

    Peppy Peppers’ orders can be migrated by running SELECT FROM orders WHERE shop_id = 1 FOR UPDATE on Pod 1’s database

Yeah that's just having the tenant id in the table directly. Oh well!


FWIW, we at Bytebase are building a batch change feature to make this less painful: https://www.bytebase.com/docs/change-database/batch-change/


> a single physical piece of hardware that will show its limitations if too many large or active tenants happen to be on the same shard

Shopify has pretty much mastered this https://shopify.engineering/mysql-database-shard-balancing-t...


They have since migrated to Vitess


Actually, Apple does this for iCloud! They use FoundationDB[1] to store billions of databases, one for each user (plus shared or global databases).

See: https://read.engineerscodex.com/p/how-apple-built-icloud-to-...

Discussed on HN at the time: https://news.ycombinator.com/item?id=39028672

[1]: https://github.com/apple/foundationdb https://en.wikipedia.org/wiki/FoundationDB


> store billions of databases

This is sort of true and sort of false. When you think of a "database", if you're thinking of a Postgres database, you're way off the reality of what "database" means here.

FoundationDB has a concept called "layers", and essentially they have created a layer that looks like a separate database on top of a layer that is separately encrypted groups of keys. They don't have billions of FoundationDB clusters or machines, and at the infra level, i.e. the instances of the FoundationDB server software, it's unaware of individual "databases".

A closer analogy would be like having billions of tables, but even that isn't accurate because in relational databases a table is usually a notably more static concept than data in a table. The closest analogy would be that each of the billions of users has a bunch of rows with a user-id field on them, and there's a proxy that filters everything such that you can view the table as if it only had one user's data in it.

To be clear, FoundationDB is awesome and Apple have done some really cool stuff with it, but it's less crazy/impressive than it sounds.


This sounds like a pretty standard multitenant datastore. Everything has a user/group Id on it, and a logical layer that locks a connection to a specific group.


FoundationDB is really just a set of structures and tools to build any type of database you want on top of a solid foundation.

"FoundationDB decouples its data storage technology from its data model. FoundationDB’s core ordered key-value storage technology can be efficiently adapted and remapped to a broad array of rich data models. Using indexing as an example, FoundationDB’s core provides no indexing and never will. Instead, a layer provides indexing by storing two kinds of key-values, one for the data and one for the index."

https://apple.github.io/foundationdb/layer-concept.html

Then existing standard layers like the Record layer, providing "(very) roughly equivalent to a simple relational database" providing structured types, index, complex types, queries, etc.

https://github.com/FoundationDB/fdb-record-layer

Or one for documents, which speaks the MongoDB wire protocol

https://github.com/FoundationDB/fdb-document-layer


Yeah, the advantage or difference here is that these "layers" are a common design pattern with FoundationDB, several ship in FDB by default, and you're encouraged to make more, so the database certainly has better support than just adding a column for TenantID, but still you're right that it's not too out there.


The problem - conceptually - is made much simpler this way; we make use of this at work.

However, you will still have shared-resource problems: a rogue query that destroys IOPS in one tenant ends up bringing down all tenants, etc. There are in theory databases that solve this as well, but my experience has been that at that point what you buy into is a bad version of resource sharing - i.e. what an operating system does - and you're better off using OS mechanisms.

In other words: yes, but you still have noisy neighbours, and may be better off running lots of small fully separated DBMSes than a big logically separated one


If tenants are on separate databases how would that be an issue?


Because database, in Postgres terms, doesn’t mean physical node. It’s more akin to a VM than anything else. The term for an installation of Postgres is database cluster, which can contain N databases, each of which can contain M schemas.


Thanks! Is there a good primer to this terminology that clarifies these terms on various popular database and cloud platforms?

It seems there is good potential for confusion unless we use the same terms consistently when discussing architecture and design across teams.

Even the term RDS (Relational Database Service) is sometimes said to be inaccurate since it is a "Relational Database SERVER as a Service"

A few related terms that cause confusion:

"Schema" could refer to a Database Schema or in some contexts, a single table's logical data structure. (Or a single data set's data structure -- or a standard for one, like JSON Schema)

Data Catalog products like "AWS Glue Data Catalog", which only store the metadata or schemas of the tables they crawl, refer to the entities they store as "Databases" and "Tables" (but no "Schemas"), and the documentation includes guides that talk about "creating a database"[1] and "creating a table"[2] in AWS Glue. There has to be a better way to refer to all these entities without using the word schema with so many meanings -- or calling both physical tables and their metadata "tables". Otherwise this is needlessly confusing and hard for beginners.

---

EDIT: even more madness. This StackOverflow discussion [3] has more examples of confusing usage:

> On Oracle .... A Schema is effectively a user. More specifically it's a set of tables/procs/indexes etc owned by a user. Another user has a different schema (tables he/she owns) however user can also see any schemas they have select priviliedges on.

> MySQL: database/schema :: table

> SQL Server: database :: (schema/namespace ::) table

> Oracle: database/schema/user :: (tablespace ::) table

[1]: https://docs.aws.amazon.com/glue/latest/dg/define-database.h...

[2]: https://docs.aws.amazon.com/glue/latest/dg/tables-described....

[3]: https://stackoverflow.com/questions/298739/what-is-the-diffe...


Not AFAIK. In MySQL, the only time “cluster” is used is to refer to NDB Cluster, which is a distributed DB product. Schema means a logical grouping of tables, the same (more or less) as Postgres.

As to schema itself, yes, it's heavily overloaded, and you just have to grok it from context. I can talk about a table's schema or a DB's schema, or, as you mentioned, JSON schema. Although the latter is helped by simply not using JSON in a relational DB.

You must remember that SQL is an ancient language, and the relational model is even older. There are going to be oddities.


Separate logical databases, within the same RDBMs, so sharing CPU, disks, RAM etc


This seems to be an architecture Cloudflare is aiming to support with their SQLite service. One database per customer, each database located in the customer’s primary region.


I think there's quite a few people chasing similar ideas, like Azure's Durable Entities.

I've been calling it the Lots of Little Databases model vs the Globe Spanning Gorilla.

Like the Spanner paper points out, even if your distributed database semantically appears to be a single giant instance, in practice performance means developers avoid using distributed joins, etc., because these can lead to shuffling very large amounts of intermediate results across the network. So the reality ends up leaking through the illusion of being on a single giant machine, and people end up writing workarounds for distributed joins like async materialization.

If we give up the single machine illusion we get a lot of simplification, at the cost of features devs were unlikely to use anyhow. I see having consistent distributed commit but without cross shard joins as a really interesting alternative.

And besides scalability I like the extra security rope of fine grained partitioning from the start.

I'll write a blog post along these lines if I get anything worthwhile done.


An advantage worth noting is that having actually separated databases means you physically can't perform these expensive operations, so a junior dev can't write incredibly inefficient code that would bring down your entire infra.


Also makes it a lot harder for devs to do some footgun and leak data across domains.


Bring down the infra, or leave you footing a 6-figure bill at the end of the month.


"Lots of Little Databases" reminded me of https://www.actordb.com/ which does lots of server-side sqlite instances, but the project now looks defunct.


Interesting. Durable Entities strikes me as closer to Cloudflare's Durable Objects (both in name and in design as actors backed by persistent storage).


That would be fantastic. Unfortunately it's not true. D1 doesn't support the one database per customer approach unless you have just a handful of customers that you can set up manually.

You have to create each database manually using wrangler or the website. Then you have to create a binding for each database in wrangler.toml so that the database becomes accessible as a variable in your Workers code. Then you have to change your Worker source code to do something with that variable. Then you redeploy.

The issue is that Workers cannot create or list databases. There's no API for it.


They have said they're working on some sort of solution (which they intend to have available by the time D1 exits beta) that will allow some sort of dynamic bindings specifically to address this use case.

In the meantime, though it's not a great solution, one can create and query databases through the REST API instead https://developers.cloudflare.com/api/operations/cloudflare-...


>They have said they're working on some sort of solution

That's great to hear, thanks!


This works great until (1) your largest customer outgrows the largest available DB (happens sooner than you'd think for large companies) or (2) you do need transactions across different customers, say to facilitate some kind of sharing. Going all-in on the isolated tenant strategy means when you hit one of these cases it's a nightmare to unwind and rearchitect your entire DB layer.


I'd respond by saying (1) is more rare than you're asserting.

There is a huge long tail of companies with datasets that won't fit on a single machine but can be handled by a dozen or so, and where no single customer or service is near the limits of an individual node.

These customers are poorly served at the moment. Most of the easy to implement SaaS options for what they need would be hugely costly vs a small fleet of db servers they administer. Meanwhile, most of the open source options are cargo culting Google or Facebook style architecture, which is a huge burden for a small company. I mean do you really want to run K8S when you have 10 servers in total?

I think there's a lot of interesting stuff happening in this end of the market that's not trying to be a mini Google, like Fly.io.

As for (2), I think a middle ground is supporting cross-shard transactions but not joins. This works well enough for VoltDB et al.


> that won't fit on a single machine

It's rarely an issue with the number of bytes and more an issue with hot shards. Whatever shard Google is on (that is, Google the Figma customer) is surely going to be one _hell_ of a hot shard. They have more designers/engineers/PMs actively using Figma than most startups have _users total_. You don't need more than one really really hot customer for this to become more than a hypothetical problem.

When you start to think about it that way, suddenly you need to seriously consider your iops (if you're on RDS) or how much redundancy that physical machine's SSDs have (if you're running it on your own boxes).


Google still only has ~180k employees, and obviously not all of them use figma, and obviously not all of their figma users are performing actions simultaneously. I'd be surprised if it broke 10k QPS (would an org like Google even have 10k peak active user sessions? Seems doubtful). Human generated traffic tends to just not reach that large of scales unless you're trying to fit the entire planet on one instance.

RDS can be absurdly limited with IOPS, granted, but a modern laptop for example ought to be up to the task. Realistically you could probably be fine even on RDS but you might need to pay through the nose for extra IOPS.


Figma is more or less a desktop application that happens to run in a web browser.

If I use Photoshop to edit a .psd file I don’t think “man that psd file should really be stored in a single planet-sized database of all psd files in existence”. It’s just a file on my computer.

Figma requires a little bit more intermingling of data than Photoshop, it has multiuser support for one, so a pure local storage based approach wouldn’t work. But, at its heart it’s still based on the document model. When you open a Figma document it’s its own isolated little universe, the connections with resources outside the document are limited, and that matches user expectations.


Agreed - it's not like you'd ever have to do SELECT * FROM Shapes WHERE type = 'circle'. Could they have stored each document as a file, stored references to the files in the database, opened the file in the backend when someone opened it in the frontend, and written it back when they've stopped editing it?


Can you give an example of when a single customer has outgrown the largest available DB?


A figma customer won't exceed the requirements of an i3.metal...


I worked at a place that did this with MySQL. Every tiny, trial account got their own database. Every huge, actual customer got their own database. Migrations were kinda painful. I would think carefully about doing it this way.


we've been doing this for 20k databases with mysql for the last 10+ years. It solves more problems than it creates. Migrations are trickier, but you get sharding, data isolation and easier backups that way.


I'm not saying it's always a bad idea, you just need to think about what you're doing. This was closer to 15 years ago now. We had to develop a bunch of our own tooling, and make our own modifications to frameworks that are now ancient history.


I imagine it only gets you so far. What do you do about customers like Walmart or Oracle? Hundreds, if not thousands, of users all leaving hundreds of comments on Figma files every day. If you want good latency without giving up strong consistency (which the article says they want) you'll need to keep sharding.


I bet it gets you further than you imagine. It entirely depends on the backend services and what they touch, but in this scenario you would be deploying/scaling that service based on the customer's seat size. I suspect that even for large enterprise customers, the users actively touching Figma are not reaching the thousands, but I am happy to be wrong.

After all, Stackoverflow is running off of a handful of machines.


A single db can handle that load easily


Do you mean an RDBMS running on a single, big-iron, bare-metal server?


They were running their whole business on one RDS instance 4 years ago. Do you think they now have one customer larger than all their customers combined 4 years ago?


> Figma’s database stack has grown almost 100x since 2020

The first sentence of the article seems to suggest it's possible?


Nile is a Serverless Postgres that virtualizes tenants/customers. It is specifically built for SaaS companies similar to Figma https://www.thenile.dev/. I am the CEO of Nile.


This, seriously. The long-term maintenance, tribal knowledge & risks associated with this giant hack will be greater than anything they'd ever have expected. Inb4 global outage post-mortem & key-man dependency salaries.

There's virtually no excuse for not spinning up a pg pod (or two) for each tenant - heck, even a namespace with the whole stack.

Embed your 4-phase migrations directly in your releases / deployments, slap on a py script to manage progressive rollouts, and you're done.

Discovery is automated, blast / loss radius is reduced to the smallest denominator, you can now monitor / pin / adjust the stack for each customer individually as necessary, sort the release ordering / schedule based on client criticality / sensitivity, you can now easily geolocate the deployment to the tenant's location, charge by resource usage, and much more.

And you can still query & roll-up all of your databases at once for analytics with Trino/DBT with nothing more but a yaml inventory.

No magic, no proprietary garbage.


Figma has millions of customers. The idea of having a Postgres pod for each one would be nearly impossible without completely overhauling their DB choice.


You are making a major conflation here. While they do have millions of users, they were last reported to only have ~60k tenants.

Decently sized EKS nodes can easily hold nearly 800 pods each (as documented); that'd make it 75 nodes. Each EKS cluster supports up to 13,500 nodes. Spread them across a couple of regions to improve your customer experience and you're looking at 20 EKS nodes per cluster. This is a nothingburger.

Besides, it's far from rocket science to co-locate tenant schemas on medium-sized pg instances, monitor tenant growth, and re-balance schemas as necessary. Tenants' contracts do not evolve overnight, and certainly do not grow by orders of magnitude week over week - a company using Figma has either 10 seats, 100 seats, 1,000, or 10,000 seats. It's easy to plan ahead for. And I would MUCH rather think about re-balancing a heavy-hitter customer's schema to another instance every now and then (which can be 100% automated too) than face a business-wide SPOF and have to hire L07+ DBAs to maintain a proprietary query parser / planner / router.

Hell, OVH does tenant-based deployments of Ceph clusters, with collocated/coscheduled SSD/HDD hardware and does hot-spot resolution. And running Ceph is significantly more demanding and admin+monitoring heavy.


Reported where? Does that include a monolithic "free tenant" that would be larger than thousands of their other tenants put together? Every Figma user has their own personal workspace along with the workspace of any organizations they may be invited to


Thousands of directories over thousands of installations? It’s not that far fetched.


It sounds like they actually did something like this. Their shard key selection could be customer, project, folder or something in their data model at a reasonably high logical level in their hierarchy.


I believe physalia [0] explores this concept at production scale quite well.

[0] https://blog.acolyer.org/2020/03/04/millions-of-tiny-databas...


Spanner's interleaved tables seem like a similar solution, i.e. you interleave customer data so it all ends up on the same set of hosts for performance, while still having the ability to create transactions across customers.


I'm curious, for anyone who has done this: what's the point of going all the way from one-db to one-db-per-customer? Why not just split the customers across 2 databases, then 3, etc.? Seems like the same level of system complexity, but you avoid the lots-of-databases scaling problem.

You probably don’t even need to be able to migrate people between shards… just put everyone on one db until you hit 50% utilization, then spin up a fresh db and put all new customers on that one. Repeat whenever you run out of databases under 50%.
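
Roughly this, as a sketch (how you measure "utilization" is the real question; here it's just a number you track per database):

    # Naive placement: all new customers go to the newest database until it crosses
    # 50% of its budget, then a fresh one is provisioned. "Utilization" is an abstract
    # fraction you'd derive from storage/IOPS/row counts in practice.
    databases = [{"name": "db-001", "utilization": 0.62},
                 {"name": "db-002", "utilization": 0.31}]

    def provision_database():
        db = {"name": f"db-{len(databases) + 1:03d}", "utilization": 0.0}
        databases.append(db)
        return db

    def place_new_customer():
        current = databases[-1]            # newest database takes all new signups
        if current["utilization"] >= 0.5:  # crossed 50%: spin up a fresh one
            current = provision_database()
        return current["name"]

    print(place_new_customer())  # -> db-002, since it is still under 50%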


I have pondered about this for quite some time and came to the conclusion that it would make schema migrations more difficult to handle. I think Shopify is using an approach which is similar to what you are describing. The advantage is that you don't end up with hot shards because you can move around large customers independently.

In practice there isn't a big difference, they just colocate several customers according to their sharding key in the same logical database.


I worked in a place that had a database for each tenant, and the schema migrations were a real pain. Every time, everything goes smoothly except for a few databases that have an edge case that screws up the whole migration.


I remember Postgres table spaces being used to separate customers at a previous job - I can't remember how migrations were handled (pretty sure they were applied per table space) but I don't think it was a problem (at our scale anyway).


>There is usually not much point in making more than one tablespace per logical file system, since you cannot control the location of individual files within a logical file system. However, PostgreSQL does not enforce any such limitation, and indeed it is not directly aware of the file system boundaries on your system. It just stores files in the directories you tell it to use.

Seems like the limitation is the logical file system, which will probably work for most users.


I've done something like this before, with each customer getting their own schema within a single Postgres instance.


If customers are not of similar size, and you have a lot of customers, managing all the different databases can be a big headache.

Having a more granular partition key makes it easier to shard the data more evenly.


Ongoing product development with migrations, tweaking indexes etc. becomes really hard. Every small database tweak now has to be deployed over 1000s of databases.


Seems they’ve built out a PG version of MySQL’s Vitess

Query rewriting seems interesting, having a layer between your DB and your application would also allow various ACL stuff as well


Given that sharding has become a pretty mature practice, is it still worth considering the NewSQL solutions like CRDB, Yugabyte, and TiDB for the sake of auto sharding, given that these NewSQL databases usually trade throughput and latency for auto-sharding and multi-region support? Another added cost is learning how to operate NewSQL databases, assuming one is already familiar with either MySQL or Postgres.


In the OP they said they built their own sharding solution because it was risky to move to a different (newsql) solution and they already had sharding expertise with PG.

I think if starting from scratch it makes sense to look at these newsql DBs that are built to horizontally scale from the very beginning.


Sharding mysql and postgres has been a shitshow at every company I've worked at.


Thanks. This is the kind of information I look for. Could you give more specifics? Why were they shit shows? If the tables are naturally shardable, say by user ID as described in Figma's case, would the situation be different?


Why? How did they shard?


Hmm, wonder why they didn't try FoundationDB.

Interesting that they had problems with vacuuming. I always thought that part of Postgres was the worst part. I vaguely remember that you needed twice the space to vacuum successfully, which hopefully has changed in later versions.

An article about why vacuum is needed in pg (as compared to mysql/innodb).

http://rhaas.blogspot.com/2011/02/mysql-vs-postgresql-part-2...


The article mentions they're trying very hard to stick with Postgres. FoundationDB is great but doesn't even have a SQL access layer, let alone a Postgres SQL access layer. :)


Do you know of any performant relational-DB layer on top of FoundationDB? It seems their use case would need at least simple joins, which raw FoundationDB lacks.


They want pretty full SQL support, so FoundationDB would be out. I’d think “buy” options for them would be more like Citus, Yugabyte, Cockroach or Spanner.


I cannot help but think this all sounds pretty much like a hack (a clever one, though).

We do not handle, let's say, low-level I/O buffering/caching by ourselves anymore, right? (at least the folks doing web development/SaaS). We rely instead on the OS APIs, and that's good. I think we are missing something similar but for DB sharding. It seems to me that we are still missing some fundamental piece of technology/infrastructure to handle horizontal data sharding.


Citus and Vitess are examples of horizontal data sharding technology for PostgreSQL and MySQL respectively.

At Figma's size there are sometimes reasons to roll your own, which I think they explain pretty clearly in the article. They wanted a solution they could incrementally engineer onto their existing stack without doing a full rewrite or lift-and-shift to something else.


I’m no computer scientist but I think the fundamental problem is that the CAP theorem makes it really tricky to do in a “cost free” way. You fundamentally need to sacrifice one, at least a tiny bit.


This is Hacker News after all


1 postgres

2 postgres replicas+sharding

3 Cassandra / dynamo

If you are at stage 2, you are using bandaids. You need to start architecting a transition of your most heavily used data views to truly scalable databases like Cassandra / Dynamo. I have not used FoundationDB, but aphyr liked it.

That transition takes time, evaluation, prototyping, and organizational learning at the application-programmer, support-engineer, and management levels.


"Additionally, over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house". Isn't the whole point of paying AWS the premium is that they do it for you ?


> NoSQL databases are another common scalable-by-default solution that companies adopt as they grow. However, we have a very complex relational data model built on top of our current Postgres architecture and NoSQL APIs don’t offer this kind of versatility.

As I understand it, NoSQL is for people who need a backend that ingests just about any unstructured data, for teams that may not have a complex relational model worked out/stabilized. Postgres has this in its native jsonb datatype, but they wouldn't need to use that much since it sounds like they already have a good data model. What am I missing here?


Using NoSQL might not be the best idea in this case. I've seen it backfire for many companies. They start with NoSQL, but then end up needing relational features as their business grows. This leads to performance issues, redundancy, and data sync problems early on, which shouldn't be happening.

Especially in the early days, NoSQL companies used to market their databases as general-purpose databases that scale easily, but that obviously hasn't always been the case.

I usually recommend starting with a relational database like PostgreSQL. If scaling becomes necessary later on, you can invest in sharding the database. Figma's approach seems reasonable given the tools available at the time.

I've helped small companies switch from NoSQL to SQL because the benefits of NoSQL weren't worth the trade-offs at their stage of growth. In case anyone is in a similar boat: https://mongotosqlmigration.com/


What is your exact question? To me it makes sense that you’d not want to use NoSQL if you’re dealing with data that’s already relational, and heavily leveraging features common in relational DBs that may not come out of the box with NoSQL DBs.

They’re saying basically that NoSQL DBs solve a lot of horizontal scaling problems but aren’t a good fit for their highly relational data, is my understanding. Not that they can’t get NoSQL functionality at eg the query level in relational DBs.


I think they're equating relational databases with databases with ACID guarantees, as that's basically a full overlap on a Venn diagram.

And we all know that ACID has to go at some scale, even if that scale keeps getting pushed further out as our hardware gets better.

(Same with the relational guarantees that eat performance... But only once you've reached a certain amount of throughput and data)


If scaling isn’t an issue, use a relational database.

“NoSQL” only really pays off when you need to scale horizontally. And even then, one of the horizontally scaling SQL solutions might be a better bet.


if you have postgres, just use https://github.com/FerretDB/FerretDB


I see they don't mention Citus (https://github.com/citusdata/citus), which is already a fairly mature native Postgres extension. From the details given in the article, it sounds like they just reimplemented it.

I wonder if they were unaware of it or disregarded it for a reason; I'm currently in a situation similar to the one described in the blog, trying to shard a massive Postgres DB.


I have worked on teams that have sharded and partitioned PostgreSQL ourselves, somewhat like Figma (in the Postgres 9.4-ish time frame), as well as teams that have utilized Citus. I am a strong proponent of Citus and point colleagues in that direction frequently, but depending on how long ago Figma was considering this path, I will say that there were some very interesting limitations to Citus not that long ago.

For example, it was only 2 years ago that Citus allowed joining data in "local" tables with data retrieved from distributed tables (https://www.citusdata.com/updates/v11-0). The same major update also enabled _any_ node to handle queries; previously, all queries (whether or not they modified data) had to go through the "coordinator" node in your cluster. This could turn into a pretty significant bottleneck, which had ramifications for your cluster administration and the choices made about how to shape your data (what goes into local tables, reference tables, or distributed tables).
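
For anyone who hasn't used it, pointing Citus at an existing schema is not much code, which is a big part of the appeal (a sketch with illustrative table/column names; it assumes the extension is available on your cluster):

    import psycopg2

    # Minimal Citus setup on an existing Postgres database (names are illustrative).
    with psycopg2.connect("postgresql://app:secret@coordinator:5432/app") as conn:
        with conn.cursor() as cur:
            cur.execute("CREATE EXTENSION IF NOT EXISTS citus")  # typically needs superuser
            # Shard the big table by tenant; rows for one tenant stay colocated.
            cur.execute("SELECT create_distributed_table('documents', 'tenant_id')")
            # Small, rarely-written lookup tables get replicated to every node.
            cur.execute("SELECT create_reference_table('plans')")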

Again, huge fan of Citus, but it's not a magic bullet that makes it so you no longer have to think about scale when using Postgres. It makes it _much_ easier and adds some killer features that push complexity down the stack such that it is _almost_ completely abstracted from application logic. But you still have to be cognizant of it, sometimes even altering your data model to accommodate it.


You also benefit from the tailwind of the CitusData team making continued improvement to the extension, whereas an in-house system depends on your company's ability to hire and retain people to maintain + improve the in-house system.

It's hard to account for the value of benefits that have yet to accrue, but this kind of analysis, even if you pretty heavily discount that future value, tilts the ROI in favor of solutions like Citus, IMO. Especially if your time horizon is 5+ or 10+ years out.

Like you said, if they made this decision 3ish years ago, you would have had to be pretty trusting on that future value. A choice, made today, hinges less on that variable.


Huh, I would have thought the opposite. Companies at Figma size are easily able to hire talent to maintain a core part of their engineering stack. On the other hand, they retain no control of Citus decision making. Those tailwinds could easily have been headwinds if they went in a direction that did not suit Figma.


I think this is true for things higher up the "stack", but doesn't necessarily apply to tech like Postgres [and Citus, IMO].

The line separating "build in-house" vs "use OSS" exists, and it's at a different layer of the stack in every company. IMO, for most companies in 2024, the line puts Citus on the same side as Postgres.

FWIW, I would have assumed that Citus would be on the other end of the line, until I had to look into Citus for work for a similar reason that Figma did. You can pick and choose among the orthogonal ideas they implement that most cleanly apply to the present stage of your business, and I would've chosen to build things the same way they did (TBH, Figma's choices superficially appear to be 1:1 to Citus's choices).


I thought of that as well. The only thing I could think of is that they mentioned that they don't want to move off of RDS, and there is 0% chance of Citus coming to AWS since Microsoft bought them.


Before clicking on the article I assumed it was Citus, and was surprised when it wasn’t.

Maybe because CitusData was bought by Microsoft around the same time, so Microsoft could create “Azure Cosmos DB for Postgres Cluster”, yet another one of Microsoft’s typical product naming crapshoots.


> yet another one of Microsoft’s typical product naming crapshoots.

Well said. I haven't seen any company as terrible as Microsoft at naming things. Anyone know why?


Naming things is hard.

At a previous employer, I saw several cool-ish open source projects instantly doomed to obscurity by picking a name that either completely duplicated the name of an existing OSS project or were guaranteed to have terrible SEO for another reason.

However, Microsoft seems to have a unique crossover of fragmented business units and centralized marketing. That's why you end up with Azure -> Subproject -> Actual Product/Service word soup. Previously, they did this with the Windows Live brand from 2005-2012, and "Xbox" for a wide range of gaming projects (many of which were on PC).


related, Microsoft on Microsoft marketing:

https://www.youtube.com/watch?v=EUXnJraKM3k


The committee wanted Cosmos, Azure, and Postgres all in the name and wouldn't compromise.


AWS is putting up a good fight


Figma uses AWS RDS, and RDS doesn't list Citus as a supported extension.


This is my guess of why they didn't use Citus. They weren't interested in the options of (1) going multi-cloud [DB in Azure Cosmos / Backend(s) in AWS] (2) going all-in on Azure [DB in Azure Cosmos / Backend(s) in Azure] (3) self-managing Postgres+Citus in EC2.

It'd be interesting to compare the expected capex of developing this in-house solution + the opex of maintaining it vs the same categories of expected costs for option (3) – because I imagine that's probably the most palatable option.

They also may have pre-paid for dedicated RDS instances for the next X years (before this horizontal scaling initiative began, to boot), as AWS allows companies to do this at a pretty steep discount rate, which would probably tilt them away from (3).


Especially because Option 3 lets you go waaaay farther on vertical scaling, since you can get native NVMe drives (they mentioned hitting IOPS limits for RDS), more exotic instance classes with far more RAM, and do stuff like ZFS for native compression and snapshots.


I would love to see a comparison of the major PostgreSQL services such as Citus, EDB, Crunchy, Neon, and some OSS distributions/packages.


how "massive" is massive in your case?


I've had CitusDB running across 68 bare metal machines (40 vCPU, 768GiB ram, 20TiB of storage each + 40GiB network links) and it ran decently well.

Not sure what your definition of massive is, I think Spanner would easily beat it.

Also, it's very use-case dependent; you can't "just use" Citus for everything. It's not quite as flexible as a bog-standard pgsql install due to the way its sharding works, so you have to be a tad more careful with your data model.


Is there a reason there's comparatively little storage in your machines in relation to RAM or even CPUs?

Do your machines do compute heavy loads or something?

For a DB I'd expect a lot more storage per node


NVMe SSDs aren't so large unfortunately.

a 1U server has capacity for 8 drives; we used 2 slots for the OS (RAID1) and 2 slots for the WAL volume, leaving only 4 slots in RAID10.

So I'm already cheating a little and claiming WAL storage was part of total storage.


Shouldn't you try to get something with PCIe bifurcation in this case?

I doubt you're saturating the PCIe bus bandwidth on any of them?

I imagine your DB is extremely high performance, though!


What is your definition of "decently well", and is your primary cluster (without replicas) above 1PB?


They said 20TiB * 68, which I think is 1.5PB.


That could be all of the nodes, or just the primaries without replicas.


Around ten heavily-updated (50-400k updated rows/min) tables ranging between 500M and 5B rows, with a couple tables over 40B rows each (5TB each IIRC).


Where's the fun in that? I'm not being snarky either. Maybe it's not the best decision business-wise, but I guarantee it was more challenging and more fun. There's something to be said for that.


It looks like Figma mostly just implemented the Instagram sharding paper, and then did the obvious next level of semi-intelligent yet delicate and fragile query routing.

Definitely a technically challenging and overall fun / satisfying engineering exercise!

https://instagram-engineering.com/sharding-ids-at-instagram-...
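
For reference, the scheme in that post packs a custom-epoch millisecond timestamp (~41 bits), a logical shard id (13 bits), and a per-shard sequence (10 bits) into one 64-bit id. A rough Python translation (the original is a Postgres stored procedure, IIRC, and the epoch below is just an example value):

    import time

    EPOCH_MS = 1293840000000  # example custom epoch (Jan 1 2011 UTC); the post picks its own

    def make_id(shard_id, seq, now_ms=None):
        """Pack (ms since epoch, 13-bit shard id, 10-bit sequence) into one 64-bit int."""
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        return ((now_ms - EPOCH_MS) << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

    def unpack(generated_id):
        """Recover (ms timestamp, shard id, sequence) from a generated id."""
        return ((generated_id >> 23) + EPOCH_MS,
                (generated_id >> 10) & 0x1FFF,
                generated_id & 0x3FF)

    print(unpack(make_id(shard_id=1341, seq=7)))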


They came up with a really over-engineered, over-complicated attempt at splitting data into multiple databases. I'm not sure this was the best idea or even a good idea, but then I also don't understand how they got into a situation where they had only a few months before their current database was overwhelmed and the whole system collapsed.


Can you offer insight into what a better approach might have been?


As others have mentioned, moving to per tenant databases can really simplify things at scale and doesn't leave a massive amount of engineering complexity and debt in its wake.

I feel sorry for the team managing this 5 years from now.


Moving to a per-tenant database sounds like even more work to me than moving to shards. Moving to per-tenant means rewriting _everything_ - moving to shards has you rewriting a lot less.


What do you mean by "rewriting everything"? Or maybe your definition of "per-tenant database" is different from mine. In our product, it's just a small layer which routes requests to the target organization's DB. When an organization is created, we create a new DB. Most of application code has no idea there are different DBs under the hood.

There are logical DBs (folders/files on a single physical server), and there's a few physical servers. We're currently at the stage where the first physical server hosts most of smaller organizations, and all other physical servers are usually dedicated servers for larger clients with high loads.


If you didn't write your application with multi-tenant in mind from the start I would expect you would need to review almost every line of code that touches a database to make a transition like this.


In our code, the only DB-related piece of code which is aware of multi-tenant databases is the getDBConnection(accountId) function. Once you have the connection, you execute exact same SQL queries as before. The function is hidden deep inside the framework/infrastructure layer, so application code is completely unaware of it.


This is possible to do, but it's a lot of engineering. You can provide the experience of a single DB while each tenant is placed in their own dedicated Postgres compute. This would help the application stay the same while tenants are moved to independent computes (you can even move only a few tenants and leave the rest on a shared Postgres compute).


Exactly this. But at this point, I don't even want to give them advice, I don't really like their service. I like Lunacy more.


Could you use Aurora Limitless for this instead? https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-au...


I doubt even VC money can afford this service.

Serverless Aurora is incredibly expensive for most workloads. I have yet to find a use case for any SaaS product that is used >4 hours a day. Since all my products span at least 3 time zones there is at least 12 hours of activity a day.


We found this out the hard way in a small startup. The per query and I/O expense was through the roof.


Did it work though? Did you achieve unlimited scaling? Because if so you should compare the price to the price of a team of great minds such as in this article, working for 2 years to get a solution.

I bet it would still be cheaper to pay people than to pay Amazon, but I'm curious about the numbers.


It worked fine, the problem was our workloads were minuscule and it still cost $3,000 a month to support 50ish users on a lightly used platform.

The same load in a non-Serverless instance was around $100/month.


What products?


"Limitless" refers to the bill, not just the scale.


At my company, we were given an early talk on Limitless. Never once did the reps mention that it ran on Serverless. Dear lord, that's going to be a hard no from me. Unless they've dramatically changed pricing for Limitless as opposed to normal Serverless, that'll be through the roof.


Haha alright I get the picture. Too expensive.


Yes, but that wasn't available when they did this migration


Hm, looks like it's only available as a preview too. I was wondering why I hadn't seen it mentioned before.


Coming from Google, where Spanner is this magical technology that supports infinite horizontal sharding with transactions and has become the standard storage engine for everything at Google (almost every project not using Spanner was moving to Spanner), I'm curious how Figma evaluated Cloud Spanner. Cloud Spanner does have a postgres translation layer, though I don't know how well it works.

It seems like they've (hopefully only temporarily) given up real transactional support with their horizontal postgres scheme?


Never a good idea to rely on Google proprietary tech (unless you are Google)... it could be sunset at any time without warning. I use GCP but I try my best to stay Google agnostic (avoid GCP-only offerings, etc) so that I can move to AWS if Google pulls the rug out from under me.


I'm biased having worked on GCP, but I think GCP actually has a very good track record of not sunsetting entire products or removing core functionality. When I worked on AppEngine, I would often see apps written 10+ years ago still chugging along.

It is true though that GCP sometimes sunsets specific product functionality, requiring changes on the customers' part. Some of these are unavoidable (eg committing to apply security patches to Python 2.7 given that the rest of the world is mostly not upstreaming these patches anymore), but not all of them.

I would certainly use Cloud Spanner externally now that I've left the company, and IMO spanner has such a compelling featureset that it's a strong reason to use GCP for greenfield development. The problem though is that it's only available on GCP, and could become expensive at large scale.


My previous employer was a GCP customer. Google did pull occasional shenanigans totally breaking us with random upgrades without notifying us. Their support wouldn’t acknowledge it was their fault until we were mad at them.

My newer employer is AWS. Their offerings are a lot more stable and support is helpful.

If you want to build a serious business I would avoid GCP. Google doesn't really give a shit about winning Cloud. Ads is their core business.


Ugh I'm so sorry, you should definitely have been notified in advance about breaking changes, unless they were accidental bugs and regressions. That was something we took pretty seriously for the products I worked on.

I have definitely enjoyed using AWS support. But I've also encountered some broken/janky functionality (eg using custom domain + api gateway + websockets + Lambda for the apparently niche-task of having a real website use the Lambda-websockets integration) on AWS that I don't think would have made it to production at Google. I also really dislike how they handle project-level logging compared to how it's done on GCP.

Ultimately I do think some GCP products like Bigquery, Cloud Run (I'm biased here), and Spanner as well as general DevEx/reliability factors are compelling enough for GCP to be worth serious consideration, even if AWS offers better support.


>It is true though that GCP sometimes sunsets specific product functionality, requiring changes on the customers' part. Some of these are unavoidable (eg committing to apply security patches to Python 2.7 given that the rest of the world is mostly not upstreaming these patches anymore), but not all of them.

A good example is probably IoT. I've heard first-hand anecdotes of very difficult migrations off this service.


GCP products have a much better track record than Google consumer products when it comes to support since there are usually enterprise customers with multi-year contracts worth tens, if not hundreds, of millions of dollars using them.


That’s what AppEngine customers thought.


Didn't they also hammer folks using google maps in a business / cloud context?

I was also an old app engine user, but bailed ages ago (original app engine).


I worked on AppEngine (well, Serverless Compute) at Google for over 4 years and left on Friday. Did something happen to AppEngine in the last week?


I’m talking about the pricing disaster that happened in 2017/2018, where user prices went up 10x-100x because Google wanted to kill the product without actually killing it.


IoT is one example of a big backbone service that was sunset.


It had barely any usage though from what I can tell from searching about it.

Not that it’s any solace to those affected.


AI Platform is a recent example that’s been deprecated.


Wasn't it just superseded by Vertex AI? According to the banner announcement in the docs: > All the functionality of legacy AI Platform and new features are available on the Vertex AI platform.


My perspective from working both inside and outside of Google:

The external spanner documentation doesn’t seem as good as the internal documentation, in my opinion. Because it’s not generally well known outside of G, they ought to do a better job explaining it and its benefits. It truly is magical technology but you have to be a database nerd to see why.

It’s also pretty expensive and because you generally need to rewrite your applications to work with it, there is a degree of lockin. So taking on Spanner is a risky proposition - if your prices get hiked or it starts costing more than you want, you’ll have to spend even more time and money migrating off it. Spanner’s advantages over other DBs (trying to “solve” the CAP theorem) then become a curse, because it’s hard to find any other DB that gives you horizontal scaling, ACID, and high availability out of the box, and you might have to solve those problems yourself/redesign the rest of your system.

Personally I would consider using Cloud Spanner, but I wouldn’t bet my business on it.


If you really have that much data and traffic, the $ costs start to add up to multiple engineer comp costs. At that point it’s cheaper to move to something you have good control over.

I.e sharding at application layer and connecting to the DB instance replica where the customer data is hosted.


Depends. The cost may pay for itself but the engineers you have already may have higher ROI things to do. It's also nice to have operational stuff managed for you. Personally I'd be happy to pay extra for the kinds of problems Spanner solves to free myself up to do other things (to a point, ofc).

> sharding at application layer and connecting to the DB instance replica where the customer data is hosted.

Spanner does global consistency/replication. If having good performance per-tenant globally is a concern, this helps a lot, and is hard to implement on your own. It can also ultimately save you money by limiting cross-region traffic.


Global consistency is expensive, both latency-wise and cost-wise. In reality most apps don't need global serializability across all objects. For instance, you probably don't need serializability across different tenants, organizations, workspaces, etc. Spanner provides serializability across all objects IIUC - so you pay for it whether you need it or not.

The other side of something like Spanner is the quorum-based latency is often optimized by adding another cache on top, which instantly defeats the original consistency guarantees. The consistency of (spanner+my_cache) is not the same as the consistency of spanner. So if we're back to app level consistency guarantees anyway, turns out the "managed" solution is only partial.

Ideally the managed db systems would have flexible consistency, allowing me to configure not just which object sets need consistency but also letting me configure caches with lag tolerance. This would let me choose trade-offs without having to implement consistent caching and other optimization tricks on top of globally consistent/serializable databases.


While it doesn't help much with the costs of replication, Spanner can be configured with read only replicas that don't participate in voting for commits, so they don't impact the quorum latency.

Reads can then be done with different consistency requirements, e.g., bounded staleness (which guarantees data less stale than the time bound requested).

See https://cloud.google.com/spanner/docs/reference/rest/v1/Tran... or https://cloud.google.com/spanner/docs/reads#read_types and https://cloud.google.com/spanner/docs/create-manage-configur...
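
Roughly, from the Node.js client, that looks like the sketch below (written from memory; treat the getSnapshot/exactStaleness option names and units as assumptions to check against the docs above, and the project/instance/table names as placeholders):

    import { Spanner } from "@google-cloud/spanner";

    async function readBalances() {
      const db = new Spanner({ projectId: "my-project" }) // placeholder names
        .instance("my-instance")
        .database("my-db");

      // Strong read: always sees the latest committed data, pays quorum latency.
      const [fresh] = await db.run("SELECT balance FROM accounts WHERE id = 1");

      // Stale read: pinned ~15 seconds in the past, so it can be served by a
      // nearby read-only replica without waiting on the voting quorum.
      const [snapshot] = await db.getSnapshot({ exactStaleness: 15 }); // assumed option shape
      const [stale] = await snapshot.run("SELECT balance FROM accounts WHERE id = 1");
      snapshot.end();

      return { fresh, stale };
    }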


See also: "Strict-serializability, but at what cost, for what purpose?"

https://muratbuffalo.blogspot.com/2022/08/strict-serializabi....


I wouldn't say "infinite"; it's still susceptible to read hotspotting, and while fine-grained locking enables generally higher write throughput, you can still get into a situation where interconnected updates end up being pretty slow.

That said, it's way better than anything else I've used in my career.


Ping time from AWS data centers to GCP ones


The problem is nobody outside Google trusts them to run or operate anything.

Edit: To the Googlers downvoting these comments. Your behavior only reinforces our views.


Google means: a good chance it gets discontinued right after you've worked out the bugs and finally have a stable system.


This is definitely not the case with their core cloud products.

> almost every project not using Spanner was moving to Spanner

This even includes Datastore. Even Datastore moved to Spanner.


Why would anyone marry themselves to Google? That sounds like the most boneheaded move possible.

First, you should never be beholden to a single vendor for the most critical technological underpinnings. You're backing yourself into a corner.

But more importantly, Google can't even figure out how to prioritize their cloud efforts. That's not a good partnership to be in for anyone except Google.

I wouldn't care if my solution was 10x worse than Cloud Spanner from a technology perspective. It'd be 1000x better from a strategy perspective.

You can hire engineers to do consistency at scale. It's your core competency and you can't just handwave and outsource that, lest you wind up stuck in a crevasse. Hire smart engineers and do the work yourself. It'll pay off.


You can't; that's why Spanner only exists at Google. The tech is that good, and it's not found anywhere else.


But you don't need it. There are countless ways to deal with the problems Spanner solves, and by choosing your own, you can focus on the subset that matters to your case and not be tied down to Google.


Is it just me, or did they just recreate MongoDB's sharding feature? But now they have to maintain it. They would have been better off going with MongoDB or another document database; then you wouldn't have to worry about schema changes, since the schema lives in your data model and you could just map changed fields. They should have just used a tenant-based design: if a customer was hindering performance for neighboring tenants, they could migrate them over to their own large server.


How does the economics work with this?

* 21 engineers worked on it

* average salary 200K per year

* 18 months i.e 1.5 years

* 50% of the bandwidth

* 1x extra cost for all the pending features for the next 3 years

so 21 * 200000 * 1.5 * 0.5 * 2 = 6.3 Million

* since they mentioned they were running the largest instance in 2020, which is r5.24xlarge, i.e. $15.84 per hour

* they grew 100x since 2020

* assuming they went with reserved instances, they would have gotten about a 50% discount

* at this scale, they would have gotten at least a 20-30% enterprise commitment discount from AWS; let's assume a 25% discount

so 15.84 * 730 * 12 * 100 * 0.5 * 0.75 = 5.2 Million

If they were to go with PlanetScale, and even if PlanetScale charged them 30% extra, i.e. 6.76 million, they are only saving about 1.56 million per year. So it will take them at least 4 more years to recoup their cap-ex investment in building this in-house, not to mention all the extra maintenance on top of the new features they are yet to implement: reliability, observability, version upgrades, etc.

what am I missing?
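
For the record, plugging the assumptions above into a quick script (these are the estimates from this comment, not Figma's actual figures) reproduces the totals and puts the break-even at about four years:

    // All inputs are the assumptions above, not actual Figma figures.
    const engCost = 21 * 200_000 * 1.5 * 0.5 * 2;           // one-time, ~$6.3M
    const rdsAnnual = 15.84 * 730 * 12 * 100 * 0.5 * 0.75;  // ~$5.2M / year
    const planetscaleAnnual = rdsAnnual * 1.3;              // ~$6.76M / year
    const annualSavings = planetscaleAnnual - rdsAnnual;    // ~$1.56M / year
    console.log((engCost / annualSavings).toFixed(1), "years to break even"); // ~4.0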


"Schema changes must be coordinated across all shards to ensure the databases stay in sync. Foreign keys and globally unique indexes can no longer be enforced by Postgres."

One of my best practices for RDBMS data after executing a few sharding projects is to mandate no use of foreign keys even for v1 of a project.
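
Concretely, that usually means keeping the column and its index but dropping the REFERENCES clause, and enforcing the relationship in application code. A hedged sketch with node-postgres (the projects/files schema here is hypothetical):

    import { Pool } from "pg";

    const db = new Pool(); // connection to the tenant's shard (hypothetical)

    // files.project_id is a plain indexed column with no REFERENCES clause,
    // so Postgres never has to enforce the relationship across shards.
    async function createFile(projectId: number, name: string) {
      // Application-level referential check in place of the foreign key.
      const parent = await db.query("SELECT 1 FROM projects WHERE id = $1", [projectId]);
      if (parent.rowCount === 0) throw new Error("project does not exist");
      await db.query(
        "INSERT INTO files (project_id, name) VALUES ($1, $2)",
        [projectId, name],
      );
      // Not transactional with deletes of the parent row, so a periodic job
      // that sweeps for orphaned rows is a common companion.
    }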


Citus seems incredibly close to what they built; I wonder why they did not use it.


This is an intriguing article, clearly showing the team's fondness for Postgres. However, Postgres is an OLTP product. I'm curious what system Figma's data team uses for their data analysis tasks.


They mentioned analyzing query logs in Snowflake in this very article. So... at least they use Snowflake for some things?


How do transactions work when you end up querying different shards?


I imagine that's one of the reasons they have "colos" - you can aim to colocate tables that are likely to be part of the same transaction in the same shard by putting them in the same colo group.
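
A rough sketch of the idea (table and host names are invented; the point is that both tables in the colo share a sharding key, so the transaction never leaves one shard):

    import { Pool } from "pg";

    // Hypothetical colo: files and file_comments always share the file_id
    // sharding key, so a transaction keyed by file_id stays on one shard.
    const shards = [
      new Pool({ host: "shard-0.db.internal" }),
      new Pool({ host: "shard-1.db.internal" }),
    ]; // hosts invented; real routing is usually range- or lookup-based

    const shardFor = (fileId: number) => shards[fileId % shards.length];

    async function addComment(fileId: number, body: string) {
      const client = await shardFor(fileId).connect();
      try {
        await client.query("BEGIN");
        await client.query(
          "UPDATE files SET comment_count = comment_count + 1 WHERE id = $1",
          [fileId],
        );
        await client.query(
          "INSERT INTO file_comments (file_id, body) VALUES ($1, $2)",
          [fileId, body],
        );
        await client.query("COMMIT"); // single-shard transaction, plain Postgres ACID
      } catch (e) {
        await client.query("ROLLBACK");
        throw e;
      } finally {
        client.release();
      }
    }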


IIUC they never have db transactions across shards, because customers are basically data silos, which makes sharding quite straightforward.


Given the list of authors and acknowledgees, what I'd really like to read is the differences between this solution and Dropbox's.


horizontal sharding is such a foundational skillset in working with databases - it's essential to database architecture


Will be interesting to see if Figma open sources this as an alternative to Vitess.


amazing team, but the author doesn't seem to understand the details


"Sorry, you’ve stumbled upon a temporary technical issue. Please try refreshing this page in a moment."

Have they? :) I am excited to read the article when it loads


It worked for me on the first try, but here's an archive link in case it goes down again: https://archive.is/xusR7


This might be another case of “death by HN”


Am I the only one finding the layout of this blog distracting? It's kind of disappointing from a UX company. The images are also massive; the page was 42.21 MB!

Good article nonetheless! Always appreciate it when companies like Figma document technical challenges.


Odd, it looks perfectly bog-standard for me on Firefox and iOS, aside from the lavender bg which I quite like. I even disabled my ad-blocker to see if it made a difference, no changing bg colors for me...


Maybe someone designed this castle in Figma.


It seems like someone at Figma decided to use the latest CSS tricks they had just discovered. Changing background color? Come on.


I thought this was a bug at first. Does anyone actually want this?


The article would have benefited from less annoying corporate speak.


Honestly it's depressing that I remember doing this same thing more than a decade ago and we still have to spend developer cycles on this nonsense in 2024.


This is such a familiar story. Company starts with a centralized database, runs into scaling problems, then spends man-years sharding it. Just use a distributed database.


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html


If every startup used every scale-proof method available from the beginning they'd never have the time or resources to build the products to get the customers that would require them to use the scale-proof methods


I agree with your general point but there are relational distributed databases, open source and "serverless", that are compatible with MySQL (planetscale, tidb, vitess...) and postgres (citus, cockroachdb...)

edited for examples.


And many of these are substantially more expensive than RDS, or have terrible INSERT performance, or require making the correct decisions really early about what to use as partition keys. There's no free lunch, sadly.


One reason they don't is unhelpful comments like these: alluding to products without naming them.


From the article: "During our evaluation, we explored CockroachDB, TiDB, Spanner, and Vitess". Three of them are open source if I am not mistaken.


I don't think these were mature products when Figma started to be developed in 2012.


This assumes that using a distributed database from the start doesn't come with its own penalties/drawbacks (particularly in 2016).

Particularly considering that, per the Figma note, they consider their data highly relational (i.e. presumably lots of joins in queries).

What database would you have chosen?


That's a fair response for then, but not today where you're spoiled for choice. Spanner was released as a service in 2017, which isn't far off.


Well, Spanner has its own disadvantages, like having your entire stack tightly coupled to a Google Cloud product (i.e. you have to sacrifice being cloud agnostic).


Spanner is quite expensive though.


There are many kinds of apps for which using a "distributed" database is an antipattern... and useless for making them scale.

Most apps don't need more than "tenant-per-company", and then maybe distribution across a small set of servers.

It's just that the kind of app that shows up here (and has these problems) tends to be a niche app like social networks and complex chat apps.


Doesn't seem a crazy way to start a company. RDS will have scaling problems but is very mature and fairly easy to use early on when you're working on your MVP

I've used CRDB early on at a startup. There was some overhead. You don't get all the nice PSQL features

Although it did save us a giant migration later on



