
It is amazing how many large-scale applications run on a single or a few large RDBMS. It seems like a bad idea at first: surely a single point of failure must be bad for availability and scalability? But it turns out you can achieve excellent availability using simple replication and failover, and you can get huge database instances from the cloud providers. You can basically serve the entire world with a single supercomputer running Postgres and a small army of stateless app servers talking to it.


Well, it's always been that way.

The big names started using NoSQL-type stuff because their instances got 2-3 orders of magnitude larger, and that didn't work. Doing all the denormalization adds a lot of other overhead and problems, but if you literally have multi-PB metadata stores, it's not like you have a choice.

Then everyone started copying them without knowing why.... and then everyone forgot how much you can actually do with a normal database.

And hardware has been getting better and cheaper, which makes it only more so.

Still not a good idea to store multi-PB metadata stores in a single DB though.


> Then everyone started copying them without knowing why

People tend to have a very bad sense of what constitutes large scale. It usually maps to "larger than the largest thing I've personally seen". So they hear "Use X instead of Y when operating at scale", and all of a sudden we have people implementing a distributed datastore for a few MB of data.

Having gone downward in scale over the last few years of my career it has been eye opening how many people tell me X won't work due to "our scale", and I point out I have already used X in prior jobs for scale that's much larger than what we have.


100% agree. I've also run across many cases where no one bothered to even attempt any benchmarks or load tests on anything (either old or new solutions), compare latency, optimize anything, etc.

Sometimes making 10+ million dollar decisions off that gut feel with literally zero data on what is actually going on.

It rarely works out well, but hey, have to leave that opening for competition somehow I guess?

And I'm not talking about 'why didn't they spend 6 months optimizing that one call which would save them $50' type stuff. I mean they have literally zero idea what is going on, what the actual performance issues are, etc.


Yep. I've personally been in the situation where I had to show someone that I could do their analysis in a few seconds using the proverbial awk-on-a-laptop when they were planning on building a hadoop cluster in the cloud because "BIG DATA". (Their Big Data was 50 gigabytes.)


I remember going to a PyData conference in... 2011 (maybe off by a year or two)... and one of the presenters making the point that if your data was less than about 10-100TB range, you were almost certainly better off running your code in a tight loop on one beefy server than trying to use Hadoop or a similar MapReduce cluster approach. He said that when he got a job, he'd often start by writing up the generic MapReduce code (one of the advantages is that it tends to be very simple to implement), starting the job running, and then writing a dedicated tight loop version while it ran. He almost always finished implementing the optimized version, got it loaded onto a server, and completed the analysis long before the MapReduce job had finished. The MapReduce implementation was just there as "insurance" if, eg, he hit 5pm on Friday without his optimized version quite done, he could go home and the MR job might just finish over the weekend.


I suggest a new rule of thumb: if the data can fit on a micro SD card [1], then it's smaller than a thumb, so it can't be big data. ;-)

[1] https://www.amazon.com/SanDisk-Extreme-microSDXC-Memory-Adap...


My rule of thumb has been “if I could afford to put the data in RAM it’s not that big a deal”.


Also "If the data could fit on a laptop or desktop I can buy at a retail store"


Uh oh, a lot of ML startups just freaked out! Name brand 1TB microSDXC cards are less than $200 on Amazon.


It's self reinforcing too. All the "system design" questions I've seen have started from the perspective of "we're going to run this at scale". Really? You're going to build for 50 million users -from the beginning-? Without first learning some lessons from -actual- use? That...seems non-ideal.


Place I’ve recently left had a 10M record MongoDB table without indexes which would take tens of seconds to query. Celery was running in cron mode every 2 seconds or so, meaning jobs would just pile up and redis eventually ran out of memory. No one understood why this was happening, so just restart everything after the pagerduty alert…


Yikes. Don’t get me wrong, it’s always been this way to some extent - not enough people who can look into a problem and understand what is happening to make many things actually work correctly, so iterate with new shiny thing.

It seems like the last 4-5 years though have really made it super common again. Bubble maybe?

Huge horde of newbs?

Maybe I’m getting crustier.

I remember it was SUPER bad before the dot-com crash, all the fake it ‘til you make it too. I even had someone claim 10 years of Java experience who couldn’t write out a basic class on a whiteboard at all, and tons of folks starting out who literally couldn’t write a hello world in the language they claimed experience in, and this was before decent GUI IDEs.


> It seems like the last 4-5 years though have really made it super common again. Bubble maybe?

Cloud providers have successfully redefined the baseline performance of a server in the minds of a lot of developers. Many people don't understand just how powerful (and at the same time cheap) a single physical machine can be when all they've used is shitty overpriced AWS instances, so no wonder they have no confidence in putting a standard RDBMS in there when anything above 4GB of RAM will cost you an arm and a leg, therefore they're looking for "magic" workarounds, which the business often accepts - it's easier to get them to pay lots of $$$$ for running a "web-scale" DB than paying the same amount for a Postgres instance, or God forbid, actually opting for a bare-metal server outside of the cloud.

In my career I've seen a significant amount of time & effort being wasted on workarounds such as deferring very trivial tasks onto queues or building an insanely-distributed system where the proper solution would've been to throw more hardware at it (even expensive AWS instances would've been cost-effective if you count the amount of developer time spent working around the problem).


Just to give a reference for those that don't know: I rent a dedicated server that has 128GB of RAM, a 16-core processor (32 threads), 2TB of local SSD storage and virtually unlimited traffic for $265 USD a month. A comparable VM on AWS would be around $750 a month (if you reserve it long term) and then of course you will pay out the nose for traffic.


Technically we were a tech startup with 10+ “senior” engineers which scraped the entire web ;D


The one of those most likely to be humming along fine is Redis, in my experience. I once ssh'd to the Redis box (EC2), which was hugely critical to the business: 1 core, and the instance had been up for 853 days, just chilling and handling things like a boss.


This is funny, because I suffer from the opposite issue... every time I try to bring up scaling issues on forums like HN, everyone says I don't actually need to worry because it can scale up to size X... but my current work is with systems at 100X size.

I feel like sometimes the pendulum has swung too far the other way, where people deny that there ARE people dealing with actual scale problems.


In this case it might be helpful to mention the solutions you’ve already tried/evaluated and the reasons why they’re not suitable. Without those details you’re no different from the dreamers who think their 10GB database is FAANG-scale so it’s normal that you get the usual responses.


I mean what percentage of companies are web scale or at your scale? I would guess around 1% being really generous. So it makes sense that the starting advice would be to not worry about scaling.


I get it, and I can't even say I blame the people for responding like that.

I think it is the same frustration I get when I call my ISP for tech support and they tell me to reboot my computer. I realize that they are giving advice for the average person, but it sucks having to sit through it.


Nothing quite as anger inducing as knowing WHY it is that way, but also knowing you are stuck, it makes no sense for you, and it sucks ass.

My new fav rant is the voice phone system for Kaiser, which makes me say 'Yes or No' constantly - but literally can only hear me somehow if I'm yelling. And they don't tell you to press a number to say yes or no until after you've failed several times with the voice system.

All human convos have zero issues - the line isn't even a little faint.


Probably true - hopefully you can prefix your question with 'Yes, this is 10 Exabytes - no, I'm not typo'ng it' to save some of us from foot-in-mouth syndrome?


That is probably a good idea, get that out of the way up front.

I feel similar frustrations with commenters saying I am doing it wrong by not moving everything to the cloud… I work for a CDN, we would go out of business pretty quickly if we moved everything to the cloud. Oh well.


Yes, exactly. When people cite scaling concerns and/or big data, I start by asking them what they mean by scale and/or big. It's a great way to get down to brass tacks quickly.

Now when dealing with someone convinced that their single TB of data is google scale, the harder issue is changing that belief. But at least you know where they stand.


That sounds like you're not giving enough detail. If you don't mention the approximate scale that you have right now, you can't expect people to glark it from context.


Same. I think there's this idea that 5 companies have more than 1PB of data and everyone else is just faking it. My field operates on many petabytes of data per customer.


Yes, the set of people truly operating "at scale" is more than FAANG and far, far less than the set of people believing they operate "at scale". This means there are still people in that middle ground.

One gotcha here is not all PBs are equal. My field also is a case where multi-PB datastores are common. However for the most part those data sit at rest in S3 or similar. They'll occasionally be pulled in chunks of a couple TB at most. But when you talk to people they'll flash their "do you know how big our storage budget is?" badge at the drop of a hat. It gets used to explain all sorts of compute patterns. Meanwhile, all they need is a large S3 footprint and a machine with a reasonable amount of RAM.


Large solid state drives really changed the game; your working set doesn't have to fit in RAM for decent performance anymore.


With Postgres you'll want to tune those cost parameters, however. E.g. lowering the random page cost will change how the planner does things on some queries. But don't just blindly change it -- modify the value and run the benchmark again. The point is that the SSD is not 10x the cost of RAM (0.1 vs 1.0). In our case the planner moved a few queries to what I always assumed was the slower sequential scan -- but whether it's actually slower depends on your table size (how tall and how wide). I mean, PG works awesome w/o tweaking that stuff, but if you've got a few days to play with these values it's quite educational.
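A rough illustration of that tune-and-benchmark loop (a sketch, assuming psycopg2; the DSN, my_table, and the filter column are placeholders for your own):

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
    cur = conn.cursor()

    query = "SELECT * FROM my_table WHERE some_indexed_col BETWEEN 1000 AND 2000"

    def best_time(sql, runs=5):
        # Run a few times and keep the best, so caching noise doesn't dominate.
        best = float("inf")
        for _ in range(runs):
            start = time.perf_counter()
            cur.execute(sql)
            cur.fetchall()
            best = min(best, time.perf_counter() - start)
        return best

    # random_page_cost defaults to 4.0; people often lower it on SSDs.
    for cost in (4.0, 1.1):
        cur.execute(f"SET random_page_cost = {cost}")
        cur.execute("EXPLAIN " + query)
        plan_top_line = cur.fetchall()[0][0]
        print(cost, round(best_time(query), 4), plan_top_line)

The point is just to see, for your own tables and queries, when the plan flips and whether the flip actually helps.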


Ideally you offer developers both a relational data store and a fast key-value data store. Train your developers to understand the pros and cons and then step back.

There’s nothing inherently wrong with a big db instance. The cloud providers have fantastic automation around multi-az masters, read replicas and failover. They even do cross region or cross account replication.

That being said. Multi PB would raise an eyebrow.


Most relational databases are fast enough for smaller key/value. Replace key/value with a blob store (still technically k/v) and I'm onboard


Even when it’s not a cloud provider, in fact, especially when it’s not a cloud provider: you can achieve insane scale from single instances.

Of course these systems have warm standbys, dedicated backup infrastructure and so it’s not really a “single machine”; but I’ve seen 80TiB Postgres instances back in 2011.


We are currently pushing close to 80tb mssql on prem instances.

The biggest issue we have with these giant dbs is they require pretty massive amounts of RAM. That's currently our main bottle neck.

But I agree. While our design is pretty bad in a few ways, the amount of data that we are able to serve from these big DBs is impressive. We have something like 6 dedicated servers for a company with something like 300 apps. A hand full of them hit dedicated dbs.

Were I to redesign the system, I'd have more tiny dedicated dbs per app to avoid a lot of the noisy neighbor/scaling problems we've had. But at the same time, It's impressive how far this design has gotten us and appears to have a lot more legs on it.


Can I ask you how large tables can generally get before querying becomes slower? I just can't intuitively wrap my head around how tables can grow from 10gb to 100gb and why this wouldnt worsen query performance by x10. Surely you do table partitions or cycle data out into archive tables to keep up the query performance of the more recent table data, correct?


> I just can't intuitively wrap my head around how tables can grow from 10gb to 100gb and why this wouldnt worsen query performance by x10

SQL Server data is stored as a B-tree structure. So a 10GB -> 100GB growth ends up being roughly a halving of query performance (since lookup cost grows with log n), assuming good indexes are in place.

Filtered indexes can work pretty well for improving query performance. But ultimately we do have some tables which are either archived if we can or partitioned if we can't. SQL Server native partitioning is rough if the query patterns are all over the board.

The other thing that has helped is we've done a bit of application data shuffling. Moving heavy hitters onto new database servers that aren't as highly utilized.

We are currently in the process of getting read only replicas (always on) setup and configured in our applications. That will allow for a lot more load distribution.
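To put rough numbers on the log-n point above (a back-of-the-envelope sketch; the row-per-leaf and fanout figures are assumptions, and the real values depend on row and key width):

    import math

    def btree_page_reads(total_rows, rows_per_leaf=100, fanout=500):
        """Approximate page reads for a point lookup: internal levels + one leaf."""
        leaves = math.ceil(total_rows / rows_per_leaf)
        return 1 + math.ceil(math.log(leaves, fanout))

    for rows in (10_000_000, 100_000_000, 1_000_000_000):
        print(f"{rows:>13,} rows -> ~{btree_page_reads(rows)} page reads per lookup")

    # Depth only grows by one level per ~fanout-times growth in size,
    # so a 10x larger table is nowhere near 10x slower to search.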


The issue with b-tree scaling isn't really the lookup performance issues, it is the index update time issues, which is why log structured merge trees were created.

EVENTUALLY, yes even read query performance also would degrade, but typically the insert / update load on a typical index is the first limiter.


If there is a natural key and updates are infrequent then table partitioning can help extend the capacity of a table almost indefinitely. There are limitations of course but even for non-insane time series workloads, Postgres with partitioned tables will work just fine.
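For reference, declarative range partitioning on a time column looks roughly like this (a sketch; the measurements table and columns are made up, partitioning syntax is Postgres 10+ and the index on the partitioned parent needs 11+):

    import psycopg2

    ddl = """
    CREATE TABLE measurements (
        recorded_at timestamptz NOT NULL,
        device_id   bigint      NOT NULL,
        value       double precision
    ) PARTITION BY RANGE (recorded_at);

    CREATE TABLE measurements_2022_04 PARTITION OF measurements
        FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');

    CREATE TABLE measurements_2022_05 PARTITION OF measurements
        FOR VALUES FROM ('2022-05-01') TO ('2022-06-01');

    -- Queries filtering on recorded_at only touch the relevant partitions,
    -- and old months can be detached or dropped without bloating the live table.
    CREATE INDEX ON measurements (device_id, recorded_at);
    """

    with psycopg2.connect("dbname=mydb") as conn:   # placeholder DSN
        with conn.cursor() as cur:
            cur.execute(ddl)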


A lot depends on the type of queries. You could have tables the size of the entire internet and every disk drive ever made, and they'd still be reasonably fast for queries that just look up a single value by an indexed key.

The trick is to have the right indexes (which includes the interior structures of the storage data structure) so that queries jump quickly to the relevant data and ignore the rest. Like opening a book at the right page because the page number is known. Sometimes a close guess is good enough.

In addition, small indexes and tree interior nodes should stay hot in RAM between queries.

When the indexes are too large to fit in RAM, those get queried from storage as well, and at a low level it's analogous to the system finding the right page in an "index book", using an "index index" to get that page number. As many levels deep as you need. The number of levels is generally small.

For example, the following is something I worked on recently. It's a custom database (written by me) not Postgres, so the performance is higher but the table scaling principles are similar. The thing has 200GB tables at the moment, and when it's warmed up, querying a single value takes just one 4k read from disk, a single large sector, because the tree index fits comfortably in RAM.

It runs at approximately 1.1 million random-access queries/second from a single SSD on my machine, which is just a $110/month x86 server. The CPU has to work quite hard to keep up because the data is compressed, albeit with special query-friendly compression.

If there was very little RAM so nothing could be kept in it, the speed would drop by a factor of about 5, to 0.2 million queries/second. That shows you don't need a lot of RAM, it just helps.

Keeping the RAM and increasing table size to roughly 10TB the speed would drop by half to 0.5 million queries/second. In principle, with the same storage algorithms a table size of roughly 1000TB (1PB) would drop it to 0.3 million queries/second, and roughly 50,000TB (50PB) would drop it to 0.2 million. (But of course those sizes won't fit on a single SSD. A real system of that size would have more parallel components, and could have higher query performance.) You can grow to very large tables without much slowdown.


Thanks for the great info. What is this custom database? Why did you reinvent the wheel? Is it as durable as Postgres? Any links?


> What is this custom database?

The current application is Ethereum L1 state and state history, but it has useful properties for other applications. It's particularly good at being small and fast, and compressing time-varying blockchain-like or graph data.

As it's a prototype I'm not committing to final figures, but measurement, theory and prototype tests project the method to be significantly smaller and faster than other implementations, or at least competitive with the state of the art being researched by other groups.

> Why did you reinvent the wheel?

Different kind of wheel. No storage engine that I'm aware of has the desired combination of properties to get the size (small) and speed (IOPS, lower read & write amplification) in each of the types of operations required. Size and I/O are major bottlenecks for this type of application; in a way it's one of the worst cases for any kind of database or schema.

It's neither a B-tree nor an LSM-tree, (not a fractal tree either), because all of those are algorithmically poor for some of the operations required. I found another structure after being willing to "go there" relating the application to low-level storage behaviour, and reading older academic papers.

These data structures are not hard to understand or implement, once you get used to them. As I've been working on and off for many years on storage structures as a hobby (yeah, it's fun!), it's only natural to consider it an option when faced with an unusual performance challenge.

It also allowed me to leverage separate work I've done on raw Linux I/O performance (for filesystems, VMs etc), which is how random-access reads are able to reach millions/s on a single NVMe SSD.

> Is it as durable as Postgres?

Yes.

Modulo implementation bugs (because it won't have the scrutiny and many eyes/years of testing that Postgres does).

> Any links?

Not at this time, sorry!


What's the structure that your custom DB uses?


The important point is that many (though not all) queries are executed by looking things up in indexes, as opposed to searching through all of the data in the table. The internal pages of a B-Tree index are typically a fraction of 1% of the total size of the index. And so you really only need to store a tiny fraction of all of the data in memory to be able to do no more than 1 I/O per point lookup, no matter what. Your table may grow, but the amount of pages that you need to go through to do a point lookup is essentially fixed.

This is a bit of a simplification, but probably less than you'd think. It's definitely true in spirit - the assumptions that I'm making are pretty reasonable. Lots of people don't quite get their head around all this at first, but it's easier to understand with experience. It doesn't help that most pictures of B-Tree indexes are very misleading. It's closer to a bush than to a tree, really.
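A quick way to sanity-check the "fraction of 1%" claim (assumed numbers: ~8KB pages and a few hundred entries per internal page; exact values depend on key size):

    # Each internal page points at ~'fanout' pages below it, so the internal
    # pages add up to roughly 1/fanout of the leaf pages (a geometric series).
    fanout = 400                # entries per internal page (assumed)
    leaf_pages = 10_000_000     # ~80 GB of leaf data at 8 KB per page

    internal = 0
    level = leaf_pages
    while level > 1:
        level = -(-level // fanout)   # ceiling division: pages on the next level up
        internal += level

    print(f"internal pages: {internal:,} "
          f"({100 * internal / (leaf_pages + internal):.3f}% of the index)")

For these numbers that comes out around 25,000 internal pages, about a quarter of a percent of the index, which is why keeping them hot in RAM is cheap.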


At my old workplace we had a few multi-TB tables with several billion rows in a vanilla RDS MySQL 5.7 instance (although it was obviously a sizable instance type); simple single-row SELECT queries on an indexed column (i.e. SELECT * FROM table WHERE external_id = 123;) would be low single-digit milliseconds.

Proper indexing is key of course, and metrics to find bottlenecks.


Well, any hot table should be indexed (with regards to your access patterns) and, thankfully, the data structures used to implement tables and indexes don't behave linearly :)

Of course, if your application rarely makes use of older rows, it could still make sense to offload them to some kind of colder, cheaper storage.


Think of finding a record amongst many as e.g. a binary search. It doesn't take 10 times as many tries to find a thing (row/record) amongst 1000 as it does amongst 100.
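The same point in a few lines (a toy sketch; a real B-tree branches far more widely than 2, so the effect is even stronger):

    import math

    # Worst-case probes for binary search over a sorted list of n items.
    for n in (100, 1_000, 1_000_000, 1_000_000_000):
        print(f"{n:>13,} items -> {math.ceil(math.log2(n))} probes")

    # 100 items -> 7 probes; 1,000,000 items -> 20 probes:
    # 10,000x the data for roughly 3x the work.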


I don't think that it's amazing.... I think that the new and shiny databases tried to make you think that it was not possible...


React makes us believe everything must have a 1-2s response to clicks and that the maximum table size is 10 rows.

When I come back to web 1.0 apps, I’m often surprised that they do a round-trip to the server in less than 200ms and reload the page seamlessly, including a full 5ms SQL query that fetches 5k rows and returns them in the page (= a full 1MB of data, with basically no JS).


There’s shit tons of money to be made for both startups and developers if they convince us that problems solved decades ago aren’t actually solved so they can sell you their solution instead (which in most cases will have recurring costs and/or further maintenance).


Plus, HTML compression on wire is insane, easily 50%+.
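Easy to check for yourself on any mostly-HTML page (a quick sketch; the exact ratio depends on the page, but repeated tags compress very well):

    import gzip
    import urllib.request

    url = "https://example.com/"   # substitute any mostly-HTML page
    html = urllib.request.urlopen(url).read()
    compressed = gzip.compress(html)
    print(f"{len(html)} bytes -> {len(compressed)} bytes "
          f"({100 * (1 - len(compressed) / len(html)):.0f}% smaller)")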


Scaling databases vertically, like Oracle DB, was the norm in the past. It is possible to serve a large number of users, and a lot of data, from a single instance. There are some things worth considering though. First of all, no matter how reliable your database is, you will have to take it down eventually to do things like upgrades.

The other consideration that isn't initially obvious, is how you may hit an upper bound for resources in most modern environments. If your database is sitting on top of a virtual or containerized environment, your single instance database will be limited in resources (CPU/memory/network) to a single node of the cluster. You could also eventually hit the same problem on bare metal.

That said, there are some very high density systems available. You may also not need the ability to scale as large as I am talking about, or you may choose to shard and scale your database horizontally at a later time.

If your project gets big enough you might also start wanting to replicate your data to localize it closer to the user. Another strategy might be to cache the data locally to the user.

There are positives and negatives with a single node or a cluster. If Retool's database had been clustered, they would have been able to do a rolling upgrade though.


> scalability

You can scale quite far vertically and avoid all the clustering headaches for a long time these days. With EPYCs you can get 128C/256T, 128PCIe lanes (= 32 4x NVMes = ~half a petabyte of low-latency storage, minus whatever you need for your network cards), 4TB of RAM in a single machine. Of course that'll cost you an arm and a leg and maybe a kidney too, but so would renting the equivalent in the cloud.


It's all fun and games with the giant boxen until a faulty PSU blows up a backplane, you have to patch it, the DC catches on fire, support runs out of parts for it, network dies, someone misconfigures something etc etc.

Not saying a single giant server won't work, but it does come with its own set of very difficult-to-solve-once-you-build-it problems.


I agree in principle. But one major headache for us has been upgrading the database software without downtime. Is there any solution that does this without major headaches? I would love some out-of-the-box solution.


The best trick I know of for zero-downtime upgrades is to have a read-only mode.

Sure, that's not the same thing as pure zero-downtime but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well selected time of day.

While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.

I've even worked with a team that used this trick to migrate between two data centers without visible end-user downtime.

A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.

Depends how much you can get done in 5 seconds though!
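A minimal sketch of the read-only-mode idea, assuming a WSGI-style app where a flag (here a file on disk, purely illustrative) gates write requests:

    import os

    READ_ONLY_FLAG = "/etc/myapp/read_only"   # hypothetical maintenance flag file
    WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

    class ReadOnlyMiddleware:
        """Reject writes with 503 while the maintenance flag exists."""
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            if (environ["REQUEST_METHOD"] in WRITE_METHODS
                    and os.path.exists(READ_ONLY_FLAG)):
                start_response("503 Service Unavailable",
                               [("Content-Type", "text/plain"),
                                ("Retry-After", "300")])
                return [b"Read-only maintenance in progress, try again shortly.\n"]
            return self.app(environ, start_response)

    # usage: application = ReadOnlyMiddleware(application)

Touch the flag file, do the migration, remove the flag; reads never stop.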


>The best trick I know of for zero-downtime upgrades is to have a read-only mode.

I've done something similar, although it wasn't about upgrading the database. We needed to not only migrate data between different DB instances, but also between completely different data models (as part of refactoring). We had several options, such as proper replication + schema migration in the target DB, or by making the app itself write to two models at the same time (which would require a multi-stage release). It all sounded overly complex to me and prone to error, due to a lot of asynchronous code/queues running in parallel. I should also mention that our DB is sharded per tenant (i.e. per an organization). What I came up with was much simpler: I wrote a simple script which simply marked a shard read-only (for this feature), transformed and copied data via a simple HTTP interface, then marked it read-write again, and proceeded to the next shard. All other shards were read-write at a given moment. Since the migration window only affected a single shard at any given moment, no one noticed anything: for a tenant, it translated to 1-2 seconds of not being able to save. In case of problems it would also be easier to revert a few shards than the entire database.


I like this approach.

I'm picturing your migration script looping over shards. It flips it to read-only, migrates, then flips back to read-write.

How did the app handle having some shards in read-write mode pre-migration and other shards in read-write mode post-migration simultaneously?


Yes, it simply looped over shards, we already had a tool to do that.

The app handled it by proxying calls to the new implementation if the shard was marked as "post-migration", the API stayed the same. If it was "in migration", all write operations returned an error. If the state was "pre-migration", it worked as before.

I don't remember the details anymore, but it was something about the event queue or the notification queue which made me prefer this approach over the others. When a shard was in migration, queue processing was also temporarily halted.

Knowing that a shard is completely "frozen" during migration made it much easier to reason about the whole process.


Depends on the database - I know that CockroachDB supports rolling upgrades with zero downtime, as it is built with a multi-primary architecture.

For PostgreSQL or MySQL/MariaDB, your options are more limited. Here are two that come to mind, there may be more:

# The "Dual Writer" approach

1. Spin up a new database cluster on the new version.

2. Get all your data into it (including dual writes to both the old and new version).

3. Once you're confident that the new version is 100% up to date, switch to using it as your primary database.

4. Shut down the old cluster.

# The eventually consistent approach

1. Put a queue in front of each service for writes, where each service of your system has its own database.

2. When you need to upgrade the database, stop consuming from the queue, upgrade in place (bringing the DB down temporarily) and resume consumption once things are back online.

3. No service can directly read from another service's database. Eventually consistent caches/projections service reads during normal service operation and during the upgrade.

A system like this is more flexible, but suffers from stale reads or temporary service degradation.
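For concreteness, the dual writes in step 2 of the "Dual Writer" approach tend to look something like this at the application level (illustrative only; old_db/new_db stand in for whatever client objects you actually use, and keeping the two stores consistent this way is the hard part):

    import logging

    log = logging.getLogger("dual_write")

    def save_order(order, old_db, new_db):
        # The old cluster stays the source of truth until cutover.
        old_db.insert("orders", order)
        try:
            new_db.insert("orders", order)
        except Exception:
            # Don't fail the user's request over the shadow copy; record the
            # miss so it can be backfilled before cutover.
            log.exception("dual write to new cluster failed: order %s", order["id"])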


Dual writing has huge downsides: namely you're now moving consistency into the application, and it's almost guaranteed that the databases won't match in any interesting application.


I'd think using built-in replication (e.g. PostgreSQL 'logical replication') for 'dual writing' should mostly avoid inconsistencies between the two versions of the DB, no?


Yes, though I've only ever seen people use the term "dual writing" to refer to something at a higher-than-DB-level.

The way I've done this involves logical replication also: https://news.ycombinator.com/item?id=31087197


Plus you need to architect this yourself, with all the black magic involved to not mess something up.


Preconditions:

1. Route all traffic through pgbouncer in transaction pooling mode.

2. Logically replicate from old to new.

For failover:

1. Ensure replication is not far behind.

2. Issue a PAUSE on pgbouncer.

3. Wait for replication to be fully caught up.

4. Update pgbouncer config to point to the new database.

5. Issue a RELOAD on pgbouncer.

6. Issue a RESUME on pgbouncer.

Zero downtime; < 2s additional latency for in-flight queries at time of op is possible (and I've done it at scale).
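The pgbouncer steps can be driven over its admin console (a sketch, assuming psycopg2 and a pgbouncer admin user; the replication check and the config rewrite in step 4 are left as placeholders):

    import time
    import psycopg2

    # pgbouncer's admin console is reached by connecting to the special
    # "pgbouncer" database; PAUSE/RELOAD/RESUME are console commands.
    admin = psycopg2.connect("dbname=pgbouncer user=pgbouncer host=127.0.0.1 port=6432")
    admin.autocommit = True   # the console doesn't support transactions
    cur = admin.cursor()

    def wait_for_replication_caught_up():
        # Placeholder: e.g. compare pg_current_wal_lsn() on the old primary
        # with the subscriber's replayed LSN and loop until they match.
        time.sleep(1)

    def point_pgbouncer_at_new_db():
        # Placeholder: edit pgbouncer.ini so the [databases] entry
        # points at the new primary's host/port.
        pass

    cur.execute("PAUSE;")              # 2. hold new queries, drain in-flight ones
    wait_for_replication_caught_up()   # 3.
    point_pgbouncer_at_new_db()        # 4.
    cur.execute("RELOAD;")             # 5. pick up the new [databases] entry
    cur.execute("RESUME;")             # 6. traffic now flows to the new primary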


The way I've done it with MySQL since 5.7 is to use multiple writers of which only one is actively used by clients. Take one out, upgrade it, put it back into replication but not serving requests until caught up. Switch the clients to writing to the upgraded one then upgrade the others.


This is such a huge problem. It's even worse than it looks: because users are slow to upgrade, changes to the database system take years to percolate down to the 99th percentile user. This decreases the incentive to do certain kinds of innovation. My opinion is that we need to fundamentally change how DBMSs are engineered and deployed to support silent in-the-background minor version upgrades, and probably stop doing major version bumps that incorporate breaking changes.


The system needs to be architected in a certain way to allow upgrades without downtime. Something like Command and Query Responsibility Segregation (CQRS) would work. An update queue serves as the explicit transaction log keeping track of the updates from the frontend applications, while the database at the end of the queue applies updates and serves as the querying service. Upgrading the live database just means having a standby database running the new version replay all the changes from the queue to catch up to the latest changes, pausing the live database from taking new changes from the queue once the new DB has caught up, switching all client connections to the new DB, and shutting down the old DB.
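A very rough sketch of that catch-up-and-switch flow (pseudocode-ish Python; update_queue, old_db, new_db and router are stand-ins for whatever infrastructure you actually have):

    def upgrade_via_log_replay(update_queue, old_db, new_db, router):
        # 1. The new-version database starts from a snapshot and replays the
        #    update log until it is nearly caught up.
        while new_db.applied_offset() < update_queue.head_offset() - 100:
            new_db.apply(update_queue.read_from(new_db.applied_offset()))

        # 2. Briefly stop the old database from consuming new updates so the
        #    new one can reach exactly the same offset.
        old_db.pause_consumption()
        while new_db.applied_offset() < update_queue.head_offset():
            new_db.apply(update_queue.read_from(new_db.applied_offset()))

        # 3. Point all query traffic at the upgraded database, then retire the old one.
        router.switch_reads_to(new_db)
        old_db.shutdown()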


Cassandra can do it since it has cell level timestamps, so you can mirror online writes and clone existing data to the new database, and there's no danger of newer mutations being overwritten by the bulk restored data.

Doing an active no-downtime database migration basically involves having a coherent row-level merge policy (assuming you AT LEAST have a per-row last updated column), or other tricks. Or maybe you temporarily write cell-level timestamps and then drop it later.

Or if you have data that expires on a window, you just do double-writes for that period and then switch over.


Migrate the data to a new host having the new version.


> It is amazing how many large-scale applications run on a single or a few large RDBMS. It seems like a bad idea at first: surely a single point of failure must be bad for availability and scalability?

I'm pretty sure that was the whole idea of RDBMS: to separate the application from the data. You lose badly -- on transactions, query planning, security, etc. -- the very moment some of your data is in a different place, so Codd thought "what if even different applications could use a single company-wide database?" Hence the "have everything in a single database" part should be the last compromise you're forced to make, not the first one.


Caches also help a ton (redis, memcache...)


Also HTTP caching. It's always funny to me when people, not you in particular, reach for Redis when they don't even use HTTP caching.
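For reference, the HTTP caching being left on the table is often just a couple of response headers (a sketch using Flask purely as an example framework; the route, max-age and data loader are arbitrary):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def load_products():
        # Stand-in for the real database query.
        return [{"id": 1, "name": "example"}]

    @app.route("/api/products")
    def products():
        resp = jsonify(load_products())
        # Let browsers and CDNs reuse the response for five minutes, and let
        # clients revalidate cheaply with If-None-Match after that.
        resp.headers["Cache-Control"] = "public, max-age=300"
        resp.add_etag()
        return resp.make_conditional(request)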


They didn't build their APIs with an understanding of HTTP verbs (ala RESTful). Mistakes such as POST with a query in body to search for X.


POST with a query as payload is not a problem if the search is a resource.


I have been on so many interview loops where interviewers faulted the architecture skill or experience of candidates because they talked about having used relational databases or tried to use them in design questions.

The attitude “our company = scale and scale = nosql” is prevalent enough that even if you know better, it’s probably in your interest to play the game. It’s the one “scalability fact” everyone knows, and a shortcut to sounding smart in front of management when you can’t grasp or haven’t taken the time to dig in on the details.


And a lot of applications can be easily sharded (e.g. between customers). So you can have a read-heavy highly replicated database that says which customer is in which shard, and then most of your writes are easily sharded across RDBMS primaries.

NewSQL technology promises to make this more automated, which is definitely a good thing, but unless you are Google or have a use case that needs it, it probably isn't worth adopting yet until the implementations are more mature.
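The customer-to-shard lookup described above tends to look something like this (illustrative; the directory table, query_one helper and per-shard pools are stand-ins):

    from functools import lru_cache

    # A small, heavily replicated "directory" database maps each customer to
    # the RDBMS shard that holds all of that customer's rows.

    @lru_cache(maxsize=100_000)
    def shard_for_customer(customer_id, directory_db):
        row = directory_db.query_one(
            "SELECT shard_name FROM customer_shards WHERE customer_id = %s",
            (customer_id,))
        return row["shard_name"]

    def connection_for_customer(customer_id, directory_db, shard_pools):
        # shard_pools: {"shard_a": pool, "shard_b": pool, ...} -- one ordinary
        # connection pool per shard primary.
        return shard_pools[shard_for_customer(customer_id, directory_db)]

The directory is read-heavy and tiny, so it replicates everywhere cheaply, while the writes land on whichever ordinary RDBMS primary owns the customer.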


I would love to have stats of real world companies on this front.

Stuff like “CRUD enterprise app. 1 large-ish Postgres node. 10k tenants. 100 tables with lots of foreign key lookups, 100gb on disk. Db is… kinda slow, and final web requests take ~1 sec.”

The toughest thing is knowing what is normal for multi-tenant data with lots of relational info in use (compared to larger, more popular companies that tend to have relatively simple data models).


plus caching, indexes and smart code algorithms go a looooooooong way

a lot of "kids these days" dont seem to learn that

by that I mean young folks born into this new world with endless cloud services and scaling-means-Google propaganda

a single modern server-class machine is essentially a supercomputer by 80s standards and too many folks are confused about just how much it can achieve if the software is written correctly



