Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The kitchen sink database used by everybody is such a common problem, yet it is repeated over and over again. If you grow it becomes significant tech debt and a performance bottleneck.

Fortunately, with managed DBs like RDS it is really easy to run individual DB clusters per major app.



Management problem masquerading as a tech problem.

Being shared between applications is literally what databases were invented to do. That’s why you learn a special dsl to query and update them instead of just doing it in the same language as your application.

The problem is that data is a shared resource. The database is where multiple groups in an organization come together to get something they all need. So it needs to be managed. It could be a dictator DBA or a set of rules designed in meetings and administered by ops, or whatever.

But imagine it was money. Different divisions produce and consume money just like data. Would anyone imagine suggesting either every team has their own bank account or total unfettered access to the corporate treasury? Of course not. You would make a system. Everyone would at least mildly hate it. That’s how databases should generally be managed once the company is any real size.


Why would you make it a shared resource if you don’t have to?

Decades of experience have shown us the massive costs of doing so - the crippled velocity and soul crushing agony of dba change control teams, the overhead salary of database priests, the arcane performance nightmares, the nuclear blast radius, the fundamental organizational counter-incentives of a shared resource .

Why on earth would we choose to pay those terrible prices in this day and age, when infrastructure is code, managed databases are everywhere and every team can have their own thing. You didn’t have a choice previously, now you do.


You wouldn’t but in any decent sized organization you will have to. If it is an organization that needs to exist there will be some common set of critical data.


In my experience, isolated (repeated) data storage paradigm is even more common at large organizations. They share data via services, ETLs, event buses, etc.


That’s just not true though, I’ve worked at decent sized companies without shared RDBMs, so you don’t have to.

You DO have to share data in other ways, usually datawarehouse or services, but that is not the same thing.


To me this is semantics. So it’s a data warehouse rather than a database. Ok. Or we share data from a common source via “services” - ok but that’s another word for a database and a client (using http to do the talking doesn’t really change anything).

I’m not saying literally every source of data has to be shared and centrally managed. I’m also not saying “rdbms accessed via traditional client and queried via sql” when I say database. I’m just saying a shared database of some shape is inevitable.


Ok, but the OP and the article are talking specifically about a directly shared rdbms scenario, not some nebulous concept of shared data.

Also, operationally it’s not “semantics” at all. You don’t get into (many) operational problems with analysts sharing a datawarehouse. You absolutely do with online apps sharing a rdbms, they aren’t the same thing.


No, neither the article nor the post I originally replied to specify anything more specific than “database”. Database != RDBMS.

A data warehouse is a type of database and is does need to be managed. Your assertion that it is easier to manage is orthogonal to my assertion that there will always be a central database to manage in an organization of decent size.


...I worked at a large software organization where larger teams had their own bank account, and there was a lot of internal billing, etc, mixed with plenty of funny-money to go along with it. That's not a contradiction, though, it perfectly illustrated your point for me.


The moment you have two databases is the moment you need to deal with data consistency problems.

If you can't do something like determine if you can delete data, as the article mentions, you won't be able to produce an answer to how to deal with those problems.


The downside is then you have many, many DBs to fight with, to monitor, to tune, etc.

This is rarely a problem when things are small, but as they grow, the bad schema decisions made by empowering DBA-less teams to run their own infra become glaringly obvious.


Not a downside to me. Each team maintains their own DB and pays for their own choices.

In the kitchen sink model all teams are tied together for performance and scalability, and some bad apple applications can ruin the party for everyone.

Seen this countless times doing due diligence on startups. The universal kitchen sink DB is almost always one of the major tech debt items.


> Not a downside to me. Each team maintains their own DB and pays for their own choices.

This is how you end up with the infamous "jira and confluence have two different markdown flavors" issue.


I don't think Jira and Confluence different markdown setup is due to them not sharing their databases. It is just poor product management from Attlasian.


My point is that forcing these arbitrary decisions is poor product management.


I’m a DBRE, which means it’s somehow always my fault until proven otherwise. And even then, it’s usually on me to work around the insane schema dreamt up by the devs.

Multi-tenant DBs can work fine as long as every app has its own users, everyone goes through a connection pooler / load balancer, and every user has rate limits. You want to write shitty queries that time out? Not my problem. Your GraphQL BFF bullshit is trying to make 10,000 QPS? Nope, sorry, try again later.

EDIT: I say “not my problem,” but as mentioned, it inevitably becomes my problem. Because “just unblock them so the site is functional” is far more attractive to the C-Suite than “slow down velocity to ensure the dev teams are doing things right.”


Or, you just avoid doing multi tenet from the start and none of those become your problem to unblock. What’s the downside?


Done that as well; it still becomes my problem because teams without RDBMS knowledge eventually break it, and… then I get paged.

Full Stack is a lie, and the sooner companies accept that and allow people to specialize again, and to pay for the extra headcount, the better off everyone will be.


I disagree I guess. Multiple companies I’ve worked at have broken up their shared db into many dbs that individual teams own the operations of, and it works just fine. At significant scale in traffic and # of eng. No central dbas needed - smaller databases require much less skills to manage. The teams that own them learn enough.


I agree. My gripe was everybody in the same schema with a global “app” user.


You forgot the modern mantra - dev team is always right!


Bad schema decisions are made regardless of whether you’re one database or 50. At least with many databases the problems are localized.


But then the DB Team – if you have one – is responsible for 50 databases, each full of their own unique problems.

This will undoubtedly go over poorly, but honestly I think every data decision should be gated through the DB Team (again, if you have them). Your proposed schema isn’t normalized? Straight to jail. You don’t want to learn SQL? Also straight to jail. You want to use a UUIDv4 as a primary key? Believe it or not, jail.

The most performant and referentially sound app in the world, because of jail.


No single team should not be responsible for all databases. If such team exists they will either become bottleneck for every other team (by auditing carefully each schema change) or become bloated and not utilized 90% of time, or (most common) they will become nearly useless or even harmful - they will not be really responsible and they will act as dumb proxy - they will introduce latency to the schema updates, but they will not bother to check them very well (why would they? they are not responsible for the whole product, just for the database), some DB refactoring/migrations will be totally abandoned because DB team will make them too painful.

DB team could act as an auditor and expert support, but they should never be fully responsible for DB layer.


> If such team exists they will either become bottleneck for every other team (by auditing carefully each schema change)

That’s the point. Would you send a backend code review to a frontend team? Why do DBs not deserve domain expertise, especially when the entire company depends on them?

> they are not responsible for the whole product, just for the database

I assure you, that’s a lot to be responsible for at scale.

> DB team could act as an auditor and expert support, but they should never be fully responsible for DB layer.

Again, the issue here is when the DB gets borked enough that a SME is required to fix it, they effectively do become responsible, because no CTO is going to accept, “sorry, we’ll be down for a couple of days because our team doesn’t really know how this thing works.”

And if your answer is, “AWS Premium Support,” they’ll just tell you to upsize the instance. Every time. That is not a long-term strategy.


What's the best non serial option for PKs in your view? Or do you prefer dual PK approach?


What’s wrong with uuidv4 as PK?


Serial integers always work better than any uuid as PKs, but the thing with uuid4 is that it disrupts any kind of index or physical ordering you decide to put on your data.

Uuids are really for external communication, not in-system organization.


FWIW this isn’t true anymore with newer uuid schemes like v7 that are roughly time sortable.


Serial index forces a synchronisation point on every entity that can create records. If this is only ever a single database that’s fine but plenty of apps can’t scale this way.


They don't. Clustered databases deal with parallel generation of them just fine.

They require periodic synchronization. What isn't a big deal at all and is required by many other database features.


If you have a sharded DB, each instance can get its own range of ints, which are periodically refreshed.

PlanetScale uses int PKs [0], and they seem to have scaled just fine.

[0]: https://github.com/planetscale/discussion/discussions/366


Anything non-k-sortable in a B[+,-]tree will cause a ton of page splits. This is a more noticeable performance impact in RDBMS with a clustered index (MySQL's InnoDB, MS SQL Server) [0], but it also impacts Postgres [1] in multiple [2] ways.

[0]: https://www.percona.com/blog/uuids-are-popular-but-bad-for-p...

[1]: https://www.cybertec-postgresql.com/en/unexpected-downsides-...

[2]: https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-pa...


It's because I hate databases and programming separately. I would rather slow code then have to dig into some database procdure. Its just another level of separation thats too mentally hard to manage. Its like... my queries go into a VM and now I have to worry about how the VM is performing.

I wish and maybe there is a programming language with first class database support. I mean really first class not just let me run queries but almost like embedded into the language in a primal way where I can both deal with my database programming fancyness and my general development together.

Sincerely someone who inherited a project from a DBA.


The language you’re talking about is APEX. I believe it comes from Oracle and is the backend language for Salesforce development. You’ll like the first class database support but that’s about it.


> I mean really first class not just let me run queries but almost like embedded into the language

Not quite embedded into the OS, but Django is a damn good ORM. I say that as a DBRE, and someone obsessed with performance (inherent issues with interpreted languages aside).


The closest thing to what you're describing is Prisma in Node. It generates a Typescript file from your schema so you get code completion on your data. And it exists somewhere between a query builder and a traditional ORM.

I have worked in many languages with many ORMs and this has been my personal favorite.


Until Prisma can manage JOINs [0] there is no way I can recommend it.

[0]: https://github.com/prisma/prisma/discussions/12715


The support for JOINs is coming, currently under a feature flag [0]

[0]: https://github.com/prisma/prisma/issues/5184#issuecomment-18...


But the migration stuff is a horrible joke. No way to just rollback a broken migration. https://www.prisma.io/docs/orm/prisma-migrate/workflows/gene...


Lots of interesting comments on this one. Anyone have any good resources for learning how not to fuck up schema/db design for those of us who will probably never have a DBA on the team?


Good question. We don't have a DBA either. I've learned SQL as needed and while I'm not terrible, it's still daunting when making the schema for a new module that might require 10-20 tables or more.

One thing that has worked well for us is to alway include the top-most parent key in all child tables down yhe hierarchy. This way we can load all the data for say an order without joins/exists.

Oh and never use natural keys. Each time I thought finally I had a good use-case, it has bitten me in some way.

Apart from that we just try to think about the required data access and the queries needed. Main thing is that all queries should go against indexes in our case, so we make sure the schema supports that easily. Requires some educated guesses at times but mostly it's predictable IME.

Anyway would love to see a proper resource. We've made some mistakes but I'm sure there's more to learn.


Not to pick on you, but is SQL not basic knowledge for every software engineer these days? Or have times changed?


Perhaps I undersold myself a little. By the time I got my first job I was fairly well versed in SQL querying, and these days I feel comfortable writing what I'd consider complex queries. That is with various window functions, nested queries, recursion (though I try to avoid that) etc, and I have a good handle on what the query optimizer likes and doesn't like.

But schema design is something else. I still take my time doing that.

Especially since our application is written with backwards compatibility in mind, so changing schema after it's deployed is something we try very hard to avoid.

But yeah, when hiring we require they are comfortable writing "normal" SQL queries (multiple joins, aggregation etc).


Times have changed. If you have C# programmers and they can't do it in Entity Framework/LINQ, then they can't do it.


This seems like a stereotype from 2010s and disconnected from reality today.


Nope. None of my below 30 colleagues know SQL. They use ORM in REPL or visual tools.


LINQPad is awesome and EF Core is just this good so I can see why some would just choose not to deal with SQL.

With that said, this still sounds like a strange situation - most colleagues, acquaintances and people I consulted know they way around SQL and dropping down to 'dbset.FromSql($"SELECT {...' is very commonplace out of the need to use sprocs, views or have tighter control over the query.


I had not updated LINQPad in a while and just saw the price this year. Eeesh. I now live in a .NET Interactive (Jupyter like) environment.


> not to fuck up schema/db design

The neat thing is, you don't. Nobody ever avoids fucking up db design.

The best you can do is decide what is really important to get right, and not fuck that part up.


Wow, what an astute comment! Thank you!

P.S. to the original person concerned about this though… for your own sake and your successors, please keep trying.


Assuming that was sarcastic, you are free to try, I guess everyone needs to try it once.

Just do the exercise of deciding what is really important first, so you can make sure you succeed for that stuff.


Sigh. Sorry. Not sarcastic, I actually really appreciated your comment. Been working in this space 20+ years and you speak truth.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: