We are currently pushing close to 80 TB on our on-prem MSSQL instances.
The biggest issue we have with these giant DBs is they require pretty massive amounts of RAM. That's currently our main bottleneck.
But I agree. While our design is pretty bad in a few ways, the amount of data that we are able to serve from these big DBs is impressive. We have something like 6 dedicated servers for a company with around 300 apps. A handful of them hit dedicated DBs.
Were I to redesign the system, I'd have more tiny dedicated DBs per app to avoid a lot of the noisy neighbor/scaling problems we've had. But at the same time, it's impressive how far this design has gotten us, and it appears to have a lot more legs left in it.
Can I ask how large tables can generally get before querying becomes slower? I just can't intuitively wrap my head around how tables can grow from 10 GB to 100 GB and why this wouldn't worsen query performance by 10x. Surely you partition tables or cycle data out into archive tables to keep up the query performance of the more recent data, correct?
> I just can't intuitively wrap my head around how tables can grow from 10 GB to 100 GB and why this wouldn't worsen query performance by 10x
SQL Server data is stored as a B-tree structure, so a 10 GB -> 100 GB growth ends up being roughly a halving of query performance rather than a 10x slowdown (lookup cost grows with log n, not linearly), assuming good indexes are in place.
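If you want to see this concretely, SQL Server will tell you how many levels an index actually has. A quick check (table name is hypothetical; DETAILED mode reads every page, so it's slow on a big table):

    -- index_depth is the number of B-tree levels; it barely moves
    -- even when the table grows 10x.
    SELECT index_id, index_level, page_count, index_depth
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'DETAILED');

Even very large, healthy indexes usually come back only 3-4 levels deep.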
Filtered indexes can work pretty well for improving query performance. But ultimately we do have some tables which are either archived if we can or partitioned if we can't. SQL Server native partitioning is rough if the query patterns are all over the board.
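For anyone who hasn't used them, a filtered index is just an index with a WHERE clause, along these lines (names made up):

    -- Index only the rows the hot queries touch, so the index stays
    -- small and cheap to maintain even if the table doesn't.
    CREATE NONCLUSTERED INDEX IX_Orders_Open
        ON dbo.Orders (CustomerID, CreatedAt)
        WHERE Status = 'Open';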
The other thing that has helped is a bit of application data shuffling: moving heavy hitters onto database servers that aren't as highly utilized.
We are currently in the process of getting read-only replicas (Always On) set up and configured in our applications. That will allow for a lot more load distribution.
The issue with B-tree scaling isn't really lookup performance; it's index update time, which is why log-structured merge trees were created.
EVENTUALLY, yes, even read query performance would degrade too, but the insert/update load on a typical index is usually the first limiter.
If there is a natural key and updates are infrequent, then table partitioning can extend the capacity of a table almost indefinitely. There are limitations, of course, but even for non-insane time-series workloads, Postgres with partitioned tables will work just fine.
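A rough sketch of what that looks like with Postgres declarative partitioning (table and column names made up):

    -- Parent table is partitioned by time; each month is its own table,
    -- so recent-data queries only touch recent partitions and old months
    -- can be detached or dropped cheaply.
    CREATE TABLE readings (
        sensor_id bigint      NOT NULL,
        ts        timestamptz NOT NULL,
        value     double precision
    ) PARTITION BY RANGE (ts);

    CREATE TABLE readings_2024_01 PARTITION OF readings
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');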
A lot depends on the type of queries. You could have tables the size of the entire internet and every disk drive ever made, and they'd still be reasonably fast for queries that just look up a single value by an indexed key.
The trick is to have the right indexes (which includes the interior structures of the storage data structure) so that queries jump quickly to the relevant data and ignore the rest. Like opening a book at the right page because the page number is known. Sometimes a close guess is good enough.
In addition, small indexes and tree interior nodes should stay hot in RAM between queries.
When the indexes are too large to fit in RAM, those get queried from storage as well, and at a low level it's analogous to the system finding the right page in an "index book", using an "index index" to get that page number. As many levels deep as you need. The number of levels is generally small.
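Rough numbers, assuming an interior node holds around 500 keys (the real fanout depends on key size, so treat this as illustrative; using SQL Server's LOG(x, base) just to do the arithmetic):

    -- Levels needed is roughly log (base fanout) of the row count:
    -- about 4 levels for a billion rows, about 5 for a trillion.
    SELECT CEILING(LOG(1e9,  500)) AS levels_for_1e9_rows,
           CEILING(LOG(1e12, 500)) AS levels_for_1e12_rows;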
For example, the following is something I worked on recently. It's a custom database (written by me) not Postgres, so the performance is higher but the table scaling principles are similar. The thing has 200GB tables at the moment, and when it's warmed up, querying a single value takes just one 4k read from disk, a single large sector, because the tree index fits comfortably in RAM.
It runs at approximately 1.1 million random-access queries/second from a single SSD on my machine, which is just a $110/month x86 server. The CPU has to work quite hard to keep up because the data is compressed, albeit with special query-friendly compression.
If there were very little RAM, so nothing could be kept in it, the speed would drop by a factor of about 5, to 0.2 million queries/second. That shows you don't need a lot of RAM; it just helps.
Keeping the RAM and increasing the table size to roughly 10 TB, the speed would drop by about half, to 0.5 million queries/second. In principle, with the same storage algorithms, a table size of roughly 1,000 TB (1 PB) would drop it to 0.3 million queries/second, and roughly 50,000 TB (50 PB) would drop it to 0.2 million. (But of course those sizes won't fit on a single SSD. A real system of that size would have more parallel components, and could have higher query performance.) You can grow to very large tables without much slowdown.
The current application is Ethereum L1 state and state history, but it has useful properties for other applications. It's particularly good at being small and fast, and compressing time-varying blockchain-like or graph data.
As it's a prototype I'm not committing to final figures, but measurement, theory and prototype tests project the method to be significantly smaller and faster than other implementations, or at least competitive with the state of the art being researched by other groups.
> Why did you reinvent the wheel?
Different kind of wheel. No storage engine that I'm aware of has the desired combination of properties to get the size (small) and speed (IOPS, lower read & write amplification) in each of the types of operations required. Size and I/O are major bottlenecks for this type of application; in a way it's one of the worst cases for any kind of database or schema.
It's neither a B-tree nor an LSM-tree (not a fractal tree either), because all of those are algorithmically poor for some of the operations required. I found another structure after being willing to "go there", relating the application to low-level storage behaviour and reading older academic papers.
These data structures are not hard to understand or implement, once you get used to them. As I've been working on and off for many years on storage structures as a hobby (yeah, it's fun!), it's only natural to consider it an option when faced with an unusual performance challenge.
It also allowed me to leverage separate work I've done on raw Linux I/O performance (for filesystems, VMs etc), which is how random-access reads are able to reach millions/s on a single NVMe SSD.
> Is it as durable as Postgres?
Yes.
Modulo implementation bugs (because it won't have the scrutiny and many eyes/years of testing that Postgres does).
The important point is that many (though not all) queries are executed by looking things up in indexes, as opposed to searching through all of the data in the table. The internal pages of a B-Tree index are typically a fraction of 1% of the total size of the index. And so you really only need to store a tiny fraction of all of the data in memory to be able to do no more than 1 I/O per point lookup, no matter what. Your table may grow, but the amount of pages that you need to go through to do a point lookup is essentially fixed.
This is a bit of a simplification, but probably less than you'd think. It's definitely true in spirit - the assumptions that I'm making are pretty reasonable. Lots of people don't quite get their head around all this at first, but it's easier to understand with experience. It doesn't help that most pictures of B-Tree indexes are very misleading. It's closer to a bush than to a tree, really.
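To put rough numbers on that "fraction of 1%" claim, assuming 8 KB pages and an interior fanout of about 500 (both assumptions):

    -- ~100 GB of leaf pages is ~13 million pages; the interior pages
    -- above them come to a few tens of thousands (~0.2% of the leaves),
    -- and that small interior is all you need cached to stay at ~1 read
    -- per point lookup.
    SELECT leaf_pages,
           leaf_pages / 500 + leaf_pages / (500 * 500) + 1 AS interior_pages
    FROM (SELECT 100 * 1024 * 1024 / 8 AS leaf_pages) AS t;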
At my old workplace we had a few multi-TB tables with several billion rows in a vanilla RDS MySQL 5.7 instance (although it was obviously a sizable instance type), and simple single-row SELECT queries on an indexed column (i.e. SELECT * FROM table WHERE external_id = 123;) would be low single-digit milliseconds.
Proper indexing is key of course, and metrics to find bottlenecks.
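A minimal sketch of that setup (table/index names hypothetical, MySQL syntax):

    -- The secondary index that turns the single-row lookup into a handful
    -- of page reads instead of a scan.
    CREATE INDEX idx_orders_external_id ON orders (external_id);

    -- EXPLAIN should report an index lookup (type=ref,
    -- key=idx_orders_external_id) rather than a full table scan.
    EXPLAIN SELECT * FROM orders WHERE external_id = 123;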
Well, any hot table should be indexed (with regards to your access patterns) and, thankfully, the data structures used to implement tables and indexes don't behave linearly :)
Of course, if your application rarely makes use of older rows, it could still make sense to offload them to some kind of colder, cheaper storage.
Think of finding a record amongst many as, e.g., a binary search. It doesn't take 10 times as many tries to find a thing (row/record) amongst 1000 as it does amongst 100.
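Quick arithmetic on that (using SQL Server's LOG(x, base) just for the math):

    -- 10x more rows only costs ~3 extra comparisons in a binary search:
    -- ~7 steps to search 100 items vs ~10 steps to search 1000.
    SELECT CEILING(LOG(100,  2)) AS steps_for_100,
           CEILING(LOG(1000, 2)) AS steps_for_1000;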