When posts like these come up, I'd like to remind people that context matters when making technical decisions. What works for large companies with huge scale (GitHub, Google, Facebook) may not work for you.
As a counter point to the linked issue, I operate a few small applications. Foreign-keys (and constraints in general) are great at ensuring that invalid data doesn't find its way into your database. Yes, they have a performance cost. Yes, they make sharding more difficult. In my experience, at smaller scale the trade-offs are worth it. YMMV
In the same vein, I'd like to remind people that you are probably not a "temporarily low-scale big-data company", in the same vein as a temporarily embarrassed millionaire. In lots of cases going for the very long term scalable solution will be an impediment to your growth, and I'd suggest dealing with those issues when the chance that you need them is on the horizon, rather than across the globe.
CQRS is one of the biggest examples I've seen personally for this - your flexibility will go down, work to do "basic" things will go up, and you will really lose a lot of speed solving for 10 y/o company problems with technology rather than 1-5 y/o company problems in terms of product market fit.
> In lots of cases going for the very long term scalable solution will be an impediment to your growth, and I'd suggest dealing with those issues when the chance that you need them is on the horizon, rather than across the globe.
My favorite example is a bootstrapped startup that I co-founded. We had a couple thousand users and less than a gigabyte of data. The app was a run-of-the-mill CRUD app. I built the entire back-end API as a single service that could be built and deployed on heroku in minutes.
At some point, I left the project, and my co-founder hired an expensive consultant to review the system and provide feedback. One of his suggestions was to break up the entire service into a suite of microservices. And this was for a service that barely averaged a handful of active users at any one time. And a team of ~2 engineers working on the back end code. Microservices.
> At some point, I left the project, and my co-founder hired an expensive consultant to review the system and provide feedback.
People get pretty unhappy if they hire a consultant and don't get some drastic change recommendations. "Everything is good" doesn't sit well when handing over cash.
As a consultant I have never been hired by someone whose system was working. I am willing to bet the consultant was hired to add a feature, could not figure the installed code, and proposed to redo it the only way he or she was used to.
I have been hired a few times to review working systems. Basically get an external set of eyes on specific things. And “yes, this all looks good, you might want to tweak a little here and keep an eye on that once you grow significantly.” is an entirely accepted outcome of such reviews.
I thought people hire consultants so that they can sell unpopular changes as recommendations from an external authority instead of letting the blame hit management directly.
That's often more what you'd bring management consultants in for. A technology consultant is more likely just to tell you to dump whatever tech you are using in favor of whatever the flavor of the month is.
> People get pretty unhappy if they hire a consultant and don't get some drastic change recommendations. "Everything is good" doesn't sit well when handing over cash.
I suggest that this is largely untrue when the consultant being hired is a security consultant.
There is demand out there for security consultants who will rubber-stamp your existing software.
The biggest issue with CQRS I've seen is people thinking CQRS means you need multiple, duplicate data structures, mappers, a few Kafka topics and a PhD, when IN REALITY all it means is you put methods that return data without modifying it in one interface/class and methods that have side effects in another interface/class - which is really just a good application of interface segregation.
Moreover, you now have a great rule of thumb for your brand new junior developers: "If you're handling a GET request, use this class, and if not use this other class" making code reviews easier to do and helping to enforce better structure in your code. Plus tons of other benefits.
The rest of that stuff that isn't CQRS you might eventually need if you scale, but is easy to add once you have the bones in place.
Like you, I've found the general idea of CQRS/ES[0] incredibly valuable because it both enforces a functional approach to state (e.g. 'current state' is a fold over events), and it forces people to really think about the domain. (E.g. which bits are really atomic units, or what can be fixed up if it goes wrong in some way.)
It also forces people to think about something that's usually glossed over: Consistency. If you have an RDBMS backing you, you tend to not think about the fact that what the user on a web page sees is already out of date when they see it, so any responsible application should track when that data was read and reject updates if the backing data has changed since it was read... but very few applications even attempt to do this (it's really hard and a huge amount of boilerplate with most APIs). With CQRS/ES you are really forced to think about these things up front -- and you can actually avoid most of the issues. Whether that increases or decreases 'productivity' I don't know, but I do know that thinking about these things increases correctness.
For me, correctness is paramount. If you don't mind bugs, I can give you an infinitely fast solution.
I’ll tell you this - it decreases it, especially for small teams. You have to do a lot of work to make your query engine actually good and reliable, and reporting is also a huge bear to deal with. You have to cache somewhere to get your queries anywhere near real-time, and it takes a lot of reinvention.
I don't understand this response. Most of our use cases have always done fine with a simple JSON document store -- no reinvention or anything. Most of our Queries are just a simple PostgreSQL table with an Aggregate ID and a JSON column.
Remember: For smallish applications you can still have your application itself be a monolith, so no need for complicated setups with message queues (Kafka, whatever), etc. etc. In this case you can basically rely on near-instant communication from the Command side of things to the Query side of things. You can also have the frontend (Web) subscribe to updates.
(We have our own in-house library to do all of this so YMMV. I'm not sure what the commodity CQRS/ES libraries/frameworks are doing currently.)
> any responsible application should track when that data was read and reject updates if the backing data has changed since it was read... but very few applications even attempt to do this (it's really hard and a huge amount of boilerplate with most APIs).
it is done very simple by versioning. Usually implemented transparently for the data layer API clients.
It's simple. I'm just saying that (for most frameworks) it's a lot of work and boilerplate. It's also easy to miss individual cases during code review, etc.
You kind of miss the part where he says "Usually implemented transparently for the data layer API clients." Transparently as in no work, no boilerplate.
I guess your mileage might vary, but Java has JPA/Hibernate, and .NET has Entity Framework, and they both make it easy, so I'm going to be surprised if any major framework or language doesn't make this easy.
Exactly, CQRS gets such a bad rep because of all the pieces that (typically) surround it. But really as just a way of organizing code, it's so handy.
The freedom and flexibility to create your read models and write models the way they are needed just completely sidesteps a lot of design issues that creep up. Plus, it makes the code so easy to follow when everything follows the pattern; picking out where state changes happen becomes trivial for example.
To be clear, I'm referring to the (in my circles) usual parlance of not "pure" CQRS, which is as you describe and could entirely be in a OOP "Model" layer, but what usually goes with it - things like event sourcing for auditing, cached read services (eventual consistency), etc. Should've been more specific but was just pattering off the first big complexity thing that came into my mind.
This seems to overstate the difficulty in implementing CQRS, and by that I assume you mean event sourcing. Kafka is definitely not needed, or even necessarily a good idea, nor is a PhD.
I'm not sure what you experience with CQRS has been but CQRS is not a solution for scale it is a solution for long-term maintenance costs of a growing codebase by enforcing the decoupling of services.
I like where you're going for that. Why don't we just call it 'temporarily embarrassed big data company'?
On one project where we seemed to handle engineering strategy well, one of the considerations we'd make is whether paying the tech debt would accompany an inflow of cash or not. If it didn't we'd better tackle it now so we don't bleed out.
It's probably another way to say "spend money to make money". Management is more willing to spend out of a problem when they've had a fresh fix.
The trick is, though, can you pay down that debt fast enough that the customers don't get upset while waiting. If someone's firing up a big ad push you'd better be prepared. And people, like me, who get caught flat-footed once (or have friends who were), don't want it to happen a second time. So they get opinions on 'last responsible moment' that might differ from the idealists.
If your requests come in hard and fast then you’ll have a big problem with tech debt, but if you can schedule it properly you can do it as a part of adding functionality and people tend to be happy, because once the functionality is in it can be enhanced much more quickly.
>In lots of cases going for the very long term scalable solution will be an impediment to your growth, and I'd suggest dealing with those issues when the chance that you need them is on the horizon, rather than across the globe.
My attitude is:
"Man if this takes off and I have to make some changes on how we do this....I should celebrate!"
Thing I've tried and mostly failed to get across. There is usually plenty of money and resources to throw at products that are making money. Vs the amount for those that aren't.
As long as you're not painting yourself into a corner a lot of things can be fixed 'later'. Later when your not desperate to get the thing off the ground. Later when you can throw two engineers at it for six months. Later when the pain points are well understood.
> I'd like to remind people that you are probably not a "temporarily low-scale big-data company", in the same vein as a temporarily embarrassed millionaire.
This point is valid in the same way as telling a startup founder to give up the startup and invest in the S&P 500. It's trivially true, but ultimately useless advice.
Building technology is hard precisely because there are so many tradeoffs. Speed to market vs scalability is one of many. It's silly to pretend that there is some kind of rote obviousness to never caring about scale in the early stage, anymore than there is rote obviousness to just investing in the S&P.
Ex post, most startup founders/investors should have just invested in the S&P, and most early stage tech leads should have just used Wordpress or Rails.
I didn't say never care about scale in the early stage. I've found it a more common problem that people in startups are building systems with too much complexity to solve problems they faced at previous employers, in order to "save themselves time" in the future that ends up never coming. There are certainly some technology companies for which scalability is the most important thing - Instagram being the biggest example I can think of. The point of me saying so was to remind people - despite what you may think, you are not an Instagram, and more than likely your business doesn't require you to be one.
Plenty of early stage tech leads DO use wordpress and rails because their company doesn't need something more. You just don't hear about them often because they won't be sexy enough for HN. YAGNI.
OK so then does YAGNI apply to investment? Why not let existing firms handle it in the S&P?
> YAGNI (continued)
over-engineering is definitely a problem, but so is hindsight bias.
I think that most non-trivial tech builds involve some smart investments and some dumb ones. It's essentially a portfolio of decisions made in the face of uncertainty. YAGNI is cynicism and hindsight bias, and over-generalization.
I think this is less obvious than it should be because of the popularity of the highly cynical "agile" approaches that consider a project that does not plan beyond the next sprint as somehow not taking on code debt.
>> OK so then does YAGNI apply to investment? Why not let existing firms handle it in the S&P?
Yes, don't rebuild your own balanced portfolio or pay someone to "beat the market" for you; just buy a COTS ETF, put it in the server closet and forget about it for a decade.
> CQRS is one of the biggest examples I've seen personally for this: your flexibility will go down, work to do "basic" things will go up
The only time I've worked on a codebase that had implemented CQRS, it was when I was given to maintain a crud application to manage maybe 20 different records. Rarely used. The people who had developed it in .net mvc were of such technical prowess that, besides the CQRS pattern, the frontend still had a functional shopping cart (yes, you clicked on "save" and your "order" went to the shopping cart, relic of the mvc demo app).
Sometimes I wonder if I've just been unlucky or the world is all like this.
As someone currently fighting a battle with an LoB application written without foreign-key constraints with hundreds of thousands of rows of corrupted data because of bugs in sprocs that assigned the wrong value to the wrong foreign key column because they were similarly named - THIS!
The reply in the GitHub thread we’re talking about makes it clear that they still perform FK validation - it’s just performed in the application code rather than the DBMS.
I note there is another alternative: deferred constraints - or just run a query to check for invalid rows at 3am every morning.
I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app. And the “Rails Way” is to create these constraints in the model. Foreign key constraints weren’t a first-class member in Rails until v4 (if memory serves correctly).
I was once a full-time Rails dev and really loved the framework (I don’t write as many user facing applications these days). Most of the Omakase trade offs didn’t bother me. But I never understood the disdain for foreign keys. For 99.99% of web apps, you want them (dare I say, need them).
Even in Rails 3 I would add them by hand in the migration files. Very few applications will ever actually care about sharding. We were pulling in millions at the last Rails company I worked for and were just fine with a master and a single replica.
If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition. Your goal should be to outgrow what MySQL (or Postgres) can do for you in master/replica mode. If you do, you’ll likely be independently wealthy...
If data integrity matters at all, model-based checks (or periodic queries for orphaned data) will not suffice. Just put a foreign key in the table where it belongs and let your DB do what it does best. ACID is an amazing thing...
> I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app
Maybe originally, but lack of foreign key usage is certainly not Rails specific today. Large MySQL shops generally don't use foreign keys, full stop, for the exact reasons Shlomi described in the original comment.
Facebook does not use foreign keys either. In my experience, same thing is true at all the other large MySQL-based companies. And those companies make up a majority of the largest sites on the net btw -- including most of the consumer-facing social networking and user-generated content products/apps out there.
This does not mean that, for example, Facebook is full of data that would violate constraints. There are other asynchronous processes that detect and handle such problems.
> If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition.
Do you mean, stay with your current RDBMS of choice and shard while also removing the FKs? Or do you mean transition to some other system? (and if so, what?)
The former path is bad: sharding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.
The latter path is also bad: only very recent NewSQL DBs have sharding built-in, and you probably don't want to bet your existing rocket ship startup on your ability to transition to one under pressure. Especially given much higher latency profiles of those DBs, meaning a very thorough caching system is also required, and now you have all the fun of figuring out multi-region read-after-write cache consistency while also transitioning to an entirely new DB :)
“[S]harding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.”
I typically espouse app logic to check state and use foreign keys to ensure enforcement (eg, race conditions that are very hard to ensure at the app level but are built into many RDBMS). Foreign key failures are just treated like any other failure mode.
But honestly, I haven’t been part of a company that have really hit those upper limits that require sharding. They exist, yes. But most companies will never need to worry about it. Which is my point.
I agree that most companies won't ever need to shard, and pre-sharding (or worrying about extreme scalability challenges in general) is usually unwise premature optimization. But it does depend on the product category.
Social networking and user-generated content products (including GitHub) need to be built with scale somewhat in mind: if the product reaches hockey-stick growth, a lot of scalability work will need to be completed very quickly or the viral opportunity is lost and the company will fail. I'm not saying they should pre-shard, but it does make sense to skip FKs with future sharding in mind.
This was nearly a decade ago, but in one case I helped shard my employer's DBs (and roll out the related application-level changes) with literally only a few hours to spare before the main unsharded DB's drives filled up. If we also had to deal with removing FKs at the same time, we definitely wouldn't have made it in time and the product literally would have gone read-only for days or weeks, probably killing the company. Granted, these situations are incredibly rare, but they really do happen!
It’s at its core a case by case decision, I think FK are also a net negative in data ingestion scenari where the data set is big enough.
Trying to make sure everything is where it needs to be at any given time, everything is inserted in the right order and the data is always consistent brings exponential amount of conplexity when it could all be checked at the end and pruned for invalid data. And usually DB integrity will not be enough, you’ll want business level validation that all is OK, so there will be app level checks anyway.
This looks nice, it’s still limit to a single commit though. That’s where it’s a PITA for anything that won’t (or we don’t want to) fit a single commit.
In particular splitting commits allows to ingest data in parralel (for instance if we import stores and store owners, both could be ingested separately without caring at first if each references a valid entity)
At least with mysql, and probably with postgres as well, you can temporarily turn off foreign key checks for a set of statements. So you can still get the benefits of foreign key constraints by default but when they do more harm than good you can turn them off. With the added benefit that turning off FK constraints screams "I am doing something unusual and dangerous - this requires extra caution."
App level data integrity is usually critical. Ingestion into a reporting database is an entirely different construct (where I agree FK constraints are burdensome).
It's been a while, but IIRC at the time Rails got started MySQL actually did not even support foreign key constraints. Since that was the DBMS of choice, it wasn't much of choice.
InnoDB's foreign key support predates the existence of Rails by several years. However, InnoDB wasn't the default storage engine for MySQL at the time, so that may be a factor.
I wonder how different the state of database application development would be today if all those cheap whitelabel webhosts powered by cPanel or Plesk (where most of us got started, I imagine) opted for PostgreSQL instead of MySQL - which would have influenced the major MySQL adopters like phpNuke, phpBB, WordPress, etc.
The application code isn’t enforcing FK constraints because that is impossible for the application to do. Their database is almost 100% guaranteed to be corrupt as a result.
Application code has bugs. Application code can fail in ways that result in corruption. A primary job of the database is to keep itself from getting corrupted. Enforcing foreign key violations is only something the database can do correctly. Punting the responsibility to a higher layer will result in corruption.
> A primary job of the database is to keep itself from getting corrupted
That would be where the road splits. If you handle the database as a very fast and structured storage application, a lot of these assumptions go away, with then a different set of tradeoffs.
Some data corruption could be fine if you can guarantee the critical cases, just as bugs in the code are fine as long as the useful cases are covered.
I remember a database with scheduling entries in it that could get duplicated depending on the sharding, but it didn’t matter because the app handled the case gracefully.
I think understanding the tradeoffs that match the best the use case is the most important point, always assuming that a DB has to guarantee integrity can be a burden preventing from looking at all the options.
There are 100% foreign key violations in their database. That is not the same as their database being corrupt.
They have engineered for, and understand the implications of, foreign key violations. Typically, it's as simple as "This row can be deleted", and that can cascade - at a totally different rate than you'd find in a database and with totally different performance characteristics.
Foreign key violations are data corruption!!! It violates the rules of how the data relates and can and will screw up any number of things that depend on the rules being enforced.
Reporting and bi data might get hosed.
Account management might get hosed.
Who knows what happens when FK rules are violated because by definition they should never be violated. It puts all applications on top into a undefined state, leading to bugs and god knows what else.
You're missing an important point, it violates a certain set of rules of how the data relates. That certain set of rules typically makes it easier to write most general data applications.
If your rules define a foreign key as optional, you can easily engineer an application around it.
Great examples include NoSQL and Ruby duck-typing. In both cases, you infer actions based on the data structure, rather than making explicit assumptions.
The whole fact that you made a foreign key implies that that is now a rule of your database. If anything is in that column that does not adhere to the foreign key, that means your database is corrupted.
Then you have an implicit rule that is enforced by a hope and a prayer that some junior dev never makes a mistake, your senior engineers are clairvoyant and understand every single aspect of your systems 100% with zero off-days, your code review process catches every single possible edge case (especially the edge cases that you never knew existed), your QA process is 100% and never makes mistakes, your servers never crash in ways that leave things in an inconsistent state, etc.
Or you could, you know, simply add a foreign key constraint and never, ever, ever have corrupt table relations.
Why people fight their tools is beyond me. There is almost zero reason to defend not using foreign keys.
i suggest being open to the concept that other perfectly capable humans may, in fact, design systems with different underpinning assumptions than those you appear to presuppose always must apply to everyone and everything.
It's important to remove the DBA or Developer hat and realize that you are working together on a singular system (or maybe it's even the same person, etc.)
"corruption" implies that the desired results have not been achieved, that essential data has been lost or compromised, and this clearly is not the case with such a deliberate design decision. one should be prepared to accept this possibility in order to not be merely a fanatic and unreasonable.
These are self-imposed rules, so breaking them is wholly up to yourself. And there are trade-offs involved that may often make it acceptable relax them.
You seem to be rather invested in one set of arbitrary definitions. If people do fine even in situations where they "by definition" shouldn't, it's the definitions that have been found lacking, not the people.
You can keep saying that, but it does not make it true. Properly modeled data does not need those constraints. Well written software handles these correctly. Mediocrely written software fails and complains loudly. Badly written software might get hosed. You're at just as much risk (or more, in my opinion) of that with bad schema changes as with not having foreign key constraints.
Do you not check input from your javascript front-end before you save it? Even if that front-end does its own validation? (No, you see, but there is only one web front-end.... we don't need the backend to validate the input!!!!)
In what way is letting the database ensure it isn't getting fed crap any different?
Why do developers constantly think it is okay to let unvalidated user input hit their database? Any client calling your database is a hostile client that will feed your database bad data. Arguing otherwise is complete ignorance.
> In what way is letting the database ensure it isn't getting fed crap any different?
Because one canonical validation layer is generally enough. You can see how having two separate partial validation layers could cause problems, right? And you can't put all the validation in the database, for non-trivial apps.
And "clients" shouldn't be talking to the database, no matter if you have foreign keys or not. That's a totally separate issue.
LOL, if I read the data directly from the db instead of via the application's API then sure, I lose the application's guarantees. But, y'know, same might be said if I just go and read the DBs files from a sidecar shell script or something.
> It puts all applications on top into a undefined state
If you want to operate assuming your data is always corrupt because your engineers don't understand how to use the tools provided by their database.... Seems like an awful lot of work to re-invent a wheel that your DB server can solve for you. I guess that is on you though....
1. Over my career almost none of it has been debugging problems due to a lack of foreign keys.
Further to that, instead of having lazy cascading deletes moving the goal posts on you. Data missing its parent for example is an indication something is wrong. It's a useful diagnostic.
2. Most string data is safe to represent as empty. Handling null on the other hand is a different situation and often induces warnings or other side effects depending on the language. Forcing the null checking to your boundaries is much like forcing your state/mutation to the boundaries in FP. Hell eveb though I think it's insane, hexagonal arch works on this idea as well.
> or just run a query to check for invalid rows at 3am every morning
One of the nice things you get from database FK constraints is when something messes up the data, you get an exception thrown in the code that's messing it up, complete with a backtrace and whatever inputs and timestamps you care to log.
I'll jump in here as well in case someone is considering throwing out foreign keys due to this. I would argue, never make compromises like this unless you have a scaling problem that is so bad, that there is no other way around it. Don't get me wrong, there are places where this is the case, but 99% of software does not have this issue. Use foreign keys and save yourself the headache in the future. Also bear in mind, that if you ever hit that scaling problem, you are making so much money, that it's a nice problem to have.
I would add to this if you have the scaling problem, and are planning to throw out foreign keys, and start sharding: you have essentially moved to a distributed data store, but you are using a query language not designed for it. You will face challenges learning what parts of the database system you can and cannot use safely, and enforcing these constraints will be fraught.
You are essentially moving to NoSQL. Maybe it makes more sense to just own up to being on NoSQL, and using a data store and data access patterns that were actually built for the task? It should be something to think about, anyway.
Certainly consider this option if you're planning for that scale from the start as a more meaningful alternative to simply saying "no foreign keys" from the start. I won't necessarily say it's overengineering; there are entire problem classes where tracking millions and billions of records are on the table, particularly in event monitoring.
I've worked on exactly one project where performance concerns lead to removing Fkeys.
The compromise we came to was to enforce them in dev and qa in order to catch bugs, and relax them in prod.
I still strongly believe that database constraints are a developer's best friend, in that you can trivially make your data structures fight back against misuse. This makes several classes of bugs obvious. But like anything, there are times they are not optimal.
I'll also say I think there are very few cases where a small performance advantage outweighs the costs, and would be hesitant to head down this path with a team less competent than the folks at Github.
I've worked on projects where adding foreign keys increased performance tremendously. A good relational database can take advantage of them in all sorts of ways in execution planning. Even scenarios such as some types of sharding, a relational database sometimes can use foreign keys for useful data proximity information, to better co-locate "families" of records together.
You need to know your database engine, how normalized you are or are not, how appropriate your data is for its relational model. It's not always your FK constraints that are the bottleneck you might think they are. Sometimes their "slowness" is hiding a problem elsewhere in your schema, a bad shard or an index that could be better or a key-value pool with soft version controlled enums pretending to relational data.
Sometimes "optimal" isn't a simple spectrum but a checklist of trade-offs and no "right" answer.
Two of Google's big internal databases don't use foreign keys (Spanner[0] and F1 [1]), rather they use hierarchies (parent/child tables). It definitely limits you a lot more than foreign keys, but seems to make sharding easier. It works well for some use cases, but takes some getting used to. Cockroachdb also has this functionality [2].
Someone two degrees of separation from me coined the term "medium data". I don't remember their name, but I absolutely love the term.
You are "big data" when black swan events (like hardware failure taking down a database node) become regular enough that you begin to statistically model it. Before that, you might just have a lot of data, but you aren't having "big data" problems.
Or, as a wise man once said, "Premature optimization is the root of all evil."
It's a different part of the field of course, but so many people yield opportunity cost chasing some difficult architectural problem that will likely never impact you, and if it does you'll have resources to throw at it.
Many startups won't even have slave production databases, much less have to worry about network distributed sharding, complex caching strategies, database optimization beyond query analysis.
Also, if you do end up sharding a relational database, it is often by identifying subgraphs of document-like structures that you can shard on. Need to ensure these subgraphs do not have foreign key relations with one another, but you can maintain the valuable foreign key relations WITHIN the subgraphs.
Practical example: database of user profiles where suddenly you have billions? You can skill keep foreign keys on user to email-addresses or user-to-comments while eliminating cross-user foreign keys.
I would also add a good rule of thumb: when you make the design decision to remove a database feature, you need to assume that you now need to handle that feature yourself, or your data will get corrupted. For example, when removing FK constraints, or transactional boundaries, or introduce irregular checkpointing, you now need to implement a data repair system, because your data will get broken. At which point you are probably going to end up using a change event log (your own transaction log) and a system that replays the logs in order to repair and rebuild the database.
Even at very large companies, hacks can work surprisingly well for a surprising amount of time. As much as we harp on technical debt (which includes myself), we rarely appreciate how much time can be saved by implementing the most straightforward solution so we can focus on hard things.
I'm not sure if Guido was just being amicable, but in the recent Dropbox retirement post he said, “There was a small number of really smart, really young coders who produced a lot of very clever code that only they could understand,” said van Rossum. “That is probably the right attitude to have when you're a really small startup.”
In addition to context + YMMV, I want to mention, sharding is NOT THE ONLY solution for scaling[1], albeit it's a popular technique. So that makes Shlomi Noach's first point weaker.
Additionally, at least in Sql Server, trusted foreign keys can give the CBO more, good options because of the guaranteed referential integrity on reads. FKs are killer on larger writes, though.
+1 million times this. Seems like everywhere I work with are concerned about google size data problems when the apps I'm working on are at best 1000 concurrent users...
I agree. No two systems are comparable. Must be very confusing for the people ramping up. A solution fits the problem, not the other way around. Keep it simple stupid. KISS
The number of times I’ve seen serious data corruption because “foreign keys are bad and we can just enforce it in code” is amazing. There is zero excuse to not use FK’s
Any database that doesn’t use FK’s is almost guaranteed to have crap in it that didn’t get cleaned up, resulting in data corruption (and yes, dangling stuff in tables count as data corruption).
Developers aren’t perfect. Shit will slip through even with the most rigorous process. The database should always provide tools to keep its self from getting corrupted. This is why good database systems have constraints like FK’s. If your database software makes using those tools painful (cough MySQL), get a better database system.
You can turn off overcommit, in which case you'll get a null return instead of OOM killing.
But also, you can still get a malloc failure without actually be running out of memory if the allocator can't find a big enough contiguous chunk of address space.
This is highly unlikely on a 64 bit system, but if you try to malloc gigabytes on a 32 bit machine you might see it.
And also, you never know when this theoretical case actually happens but it did happen to me: at one point someone may use your code on an Arduino and unexpectedly, it works! But memory allocation could fail more than you expect!
I think this is a bit hyperbolic of an expression, but I do want to reinforce that all applications need to confirm data integrity, you either confirm it when saving the data, when extracting the data, or have to very carefully balance various consistency concerns and either enforce consistency before saving data or enforce consistency after saving (but before extracting) data.
Things like RDBMS's provide some really nice utilities for data consistency enforcement in the form of FKs, these are not the most performant approaches in all cases but they're optimized to be good for nearly all use cases.
If your business need /Requires/ moving off of FKs onto some other data integrity guarantee, then just understand that you're going to be reinventing the wheel. It might turn out to be a better wheel for your car, but it's going to be expensive. And... unless you specifically hire specialized people who comprehend data guarantees and ACID properties, you'll probably do it wrong. This sort of an undertaking is for mature companies only.
> I do want to reinforce that all applications need to confirm data integrity, you either confirm it when saving the data, when extracting the data, or have to very carefully balance various consistency concerns and either enforce consistency before saving data or enforce consistency after saving (but before extracting) data.
There are two kinds of error handling. The “happy error” and the “fuck you asshole” error handing. Good systems have both.
Yes, the application should check it isn’t about to violate a FK constraint. Just like it usually checks that it isn’t about to violate a unique constraint. That way the application can fail in happy, controlled way. But if the application doesnt do the right thing than the DB server still should puke all over the transaction with a “fuck you, don’t feed me bullshit” error.
Just like data entry over the web. The javascript can return all kinds of nice happy messages to the user about missing fields and stuff. But the backend should always validate and enforce correctness even if it’s mode of failure is some ugly “piss off, don’t feed me crap” message.
Just like you should never trust user provided data coming from an HTTP post, a database should never trust the INSERT or UPDATE is valid and won’t corrupt the database. Both backend systems can return mean old ugly errors when shit is bad and let the front end do pre-validation that can do happy nice errors. The backend always has to enforce its own data integrity. Period.
Oh I absolutely agree and even when a DB is properly configured with references all cleared defined and constrained it's absolutely a good UX thing to pre-check as much as possible.
But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity. It is stupidly expensive and unless you have a few billion in the bank there is absolutely no reason to even consider it, but if you're dealing with data volumes like GitHub then it's conceivable that all the other salves for enforcing data integrity fall short. In that case there are ways to approach removing FKs, but, when you do so, you're not (logically speaking) giving up the data-integrity from FKs, you are replacing FKs as a tool for data-integrity with another tool for data-integrity (one that will probably be very similar to FKs) - under this guise DB FKs can stop making sense (though also having any sort of RDBMS engine likely also stops making sense as you're essentially adopting the functional responsibility for being an RDBMS into the primary application).
Being pedantic in this case doesn’t help the cause. Too many developers don’t understand database theory at all and will read this
> But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity
And not the rest of your post. Yes technically you are right but it is stupidly expensive and nobody should do it.
The problem with being technically correct is, again, people will stop at the sentence I quoted and go build yet another FK free system. Said system will undoubtedly fill up with corrupt bullshit data that eventually leads to exciting mystery bugs in production that has everybody scratching their heads. I’ve seen it time and time again....
I feel like reading HN should come with a warning on the tin that "If you're reading a long technical comment, taking away just part of it is dangerous" - were I speaking to someone in the business side of a corp that asked "Hey do we need these FK things, some developers have been saying they're slow" I'd say 'Yes, we absolutely do need FKs' then go on to talk to the developers, double check I wasn't at one of the about dozen of companies with data at a scale that FKs as implemented in RDBMSes (especially postgres, mysql tends to drop off in performance much easier without heavy tweaking) is insufficient, and then tell them that FKs do work and they probably really just need to read up a lot on indexes and stop throwing around table locks like it's christmas.
Unique constraint is a good example, because it reminds us about race conditions.
The app can check that it won't violate a unique constraint before doing an insert/update, but in between that check and actually doing the insert/update, some other process may have changed data, such that unique constraint can be violated.
So when the rdbms catches this, it'snot just a "fuck you for giving me bad data" condition if the implication there is that it was a bug in app code, and it's a failsafe. It isn't necessarily a bug at all -- unless you intended it to be the app's responsibility to use db-level locks and/or transactions to guarantee this can't happen without the uniqueness constraint -- but then why not just use a uniqueness constraint, the tool the db is actually giving you for this?
Mature rdbms's sometimes don't get the recognition they deserve for being amazing at enforcing data consistency and integrity under concurrency, with pretty reasonable performance for a wide variety of usage patterns.
Foreign key constraint can be similar; you can do all the checking you want, but some other process can delete the object with the pk right before you set the fk to it.
If you have app usage patterns such that you really can't afford database data integrity enforcement (what level of use that is of course depends on your rdbms)... you are forced to give up a lot of things the rdbms is really good at, and reinvent them (in a way that is more performant than the db??) or figure out how to make sure all code written never actually assumes consistent data.
The disdain some developers tend to have for their data storage system is pretty unreal. It’s like they see it as a necessary evil instead of their friend.
They often react to FK’s as if they cramp their style, where their style is to just insert whatever into the database without regards for what And where it is.
I had a job where the database was intentionally lacking FKs so that users could input data out of order, for example create a shipping route Foo that goes to a warehouse Bar, before the warehouse existed. Let's just say it was a suboptimal design.
The engineer https://en.wikipedia.org/wiki/Niki_Lauda from Rush movie fame, if he's working on a race-car-level-system where every little bit of performance you can tweak out counts, like wringing a towel dry, every last bit... and in that mindset, FKs are removed NOT from ignorance!
Not using FKs essentially means eventual consistency, and it's on you to ensure it. Obviously, eventual consistency is something a lot of systems have to settle for, often for reasons that have nothing to do with architecture, but just because of circumstances where you have to synchronize disparate systems and there's no way to eliminate all but one.
I'll note that there are very large systems that do provide ACID-like consistency (e.g., Spanner, Lustre), so it's not at all that "large -> eventual consistency".
I won't pass judgment on projects where eventual consistency was a given from the start, even if I might think they could have done better, but too, I wouldn't make either approach a hard and fast rule: circumstances vary. What bothers me is the extent to which I see "triggers bad", "FKs bad", "business logic in the DB bad" blanket statements out there.
Eh. The issue arises when you have a very poorly designed schema and missing application-level operations to perform cleanup and deletion.
You don’t want to be caught in that situation because then you are handcuffed and cannot clean the database properly. For example, from the perspective of business operations you might have something that creates 1 single thing where under the hood 5+ DB objects are all created in a transactional manner and a very specific order. So how do you unroll that? You gotta carefully construct the rollback commands or just get rid of the bowling lane bumpers and slash and burn.
So you gotta commit and go all-in. If you make the choice to do it all within the app later, at least you keep that flexibility.
It all depends on your processes and your team. If you don’t design things properly and institute the correct procedures for designing and operating the DB, having or not having constraints at the DB level is inconsequential to the other issues you will face.
Bullshit. Even with the most perfect developers and perfect system, letting application code enforce constraints will lead to data corruption period. Some application will crash or something and leave dangling garbage behind and boom you are fucked.
Would you ever trust that form data is valid because the JavaScript front end “validated” it? No! Why the hell are you going to trust that everything sent to the database is valid? No constraints like FK’s is exactly like not validating input because “the JavaScript layer got it”.
It is out of ignorance that people argue otherwise, sorry. Too many people don’t understand relational database...
Well, what's true is that if you leave it to the application, you now have two problems: you have an RDBMS you use half-way, and an app that needs to implement the RDBMS features that you're not using from the RDBMS. Since most app devs are not RDBMS devs and don't want to be, they're bound to get a few things wrong.
The most likely thing to go out the window in an ORM-type app is concurrency. Take a big lock around DB ops and you're good, right? But then, too, there goes performance.
Or, if it's not concurrency that goes out the window, you have to implement eventual consistency...
The worst thing I've seen too much of is application-level JOINs and such. Next are recursive queries: done in the app. All that absolutely destroys performance. Then you have custom query languages that are just never good enough.
So I am mostly in agreement with you, but "b.s." is too strong for me. You can get all of this right in the app, though at a terrible cost, and there are times when maybe it's actually for the best.
Let me give you an example of how I might build something that's somewhere in the middle. Think of Uber or Lyft. I might have a single widely-replicated RDBMS for users, and maybe another for drivers, and maybe per-city in-memory-only ride DB for tracking requests and current rides, and one more, possibly sharded DB for billing. The ride DBs would send billing updates via some pub/sub scheme (e.g., like PG's NOTIFY, or something else). Recovering from reboots of the ride DB is easy: rely on the apps to recover the state. Most operations only have to hit one DB at a time. The whole thing is eventually consistent for user/driver ratings and billing, which seems rather fine to me given that the operations to synchronize are essentially just aggregation updates. In such an app there's not a lot of room for application logic in the DB -- some yes, and I'd put as much as possible in the DB.
There are plenty of apps where some degree of eventual consistency makes sense. What doesn't make sense is to end up hand-coding JOINs, GROUP BYs, and so on. And my preference is to use FKs and triggers as much as possible, but within limits (see all the above).
Sorry. I have just seen way to many systems fail in way too many mysterious ways because "foreign keys are bad" and "the application code can do the validation". In fact, I wager that 100% of all databases that don't have FK's have some kind of data corruption that is causing at least some kind of user-visible issues.
It is frustrating to me because even the most green behind the error developer would cringe at somebody saying "we don't need the backend to validate user input because the front-end javascript does it for us". This is the same thing.
Never trust your inputs. Your application code calling the database server is exactly like some javascript client calling your backend system. Your database has tools to keep it from being fed bullshit input. For some extremely frustrating reason, people think it is perfectly okay to have the database trust its user input.
The result is in almost every single instance where those tools aren't used, the inevitable corrupt data will cause some kind of hard to reproduce user-impacting issue. I've seen it so many times it makes my head hurt.
I should have clarified that I don't think FK's are bad ... just that if you are going to use them you need to plan for that.
Like you, I have been bitten by this in many different environments on both sides of the fence. In certain environments the absence of FK's has caused major headache and likewise in other environments the presense of them and the lack of a robust db design has meant the FK's cause more harm than good.
Thanks. As a mere occasional user of databases I was confused and trying to figure out how you could not use FKs without recreating essentially the same thing in code.
Every time I have seen a database use foreign keys there has been data corruption, because everyone thought foreign keys were declarative and not procedural. Just because you have a foreign key doesn't mean it was always there or that it applied on every transaction. You can turn them off at the connection level and you in fact must turn them off for almost any kind of bulk data load.
You should read shlomi's post he gives good reasons. Specifically this is a github issue on his tool gh-ost which is for online schema migration, and FK's pose lots of problems for online schema migrations.
Maybe shitty toy database systems like MySQL let you disable constraints on a per session basis. A real database system might let you defer them until the end of a transaction—which is the only correct way to operate. Once you commit that transaction, what you put into the system better fucking make sense and it is the job of the database system (and only the database system) to enforce that.
Like I said before if your database system makes using constraints hard or lets you shoot yourself in the foot (lol at disabling constraints per session), don’t just walk but run away from that system.
What other database has been used at such a giant scale at so many companies, than MySQL? I'm sure Oracle and SQL Server are used at big companies, but nowhere near Facebook scale.
While appropiate to define in few words some of MySQL's colossal mistakes, this isn't the kind of language that will sway heads that have been comfortably using MySQL because those defects are just "what DB's do".
Any database that would let you disable constraints on a session basis is a toy database. Such an operation doesn’t even make sense because at some point the relational integrity has to be enforced for the entire table. You can’t just have parts of a table be relationally correct. That is like saying 1 + 1 = 3. It is a completely illogical statement.
However I would not at all be surprised to learn MySQL supports such a thing. Which supports my assertion it is a toy used (or at least installed by) people who have no understanding of relational database architecture.
An alternative view is that your statements are incorrect. Do you have much direct experience with high-volume OLTP database workloads, or are you basing your views of MySQL on something else?
Once you are stuck with MySQL it is very, very, very hard to get an organization to switch--not only from a technical standpoint but a political one.
I bet you any competent engineer who knows their shit about DB in those companies regrets using MySQL. I bet their code is full of hacks, crappy schemas, and all kinds of work arounds because they chose mysql. I've seen it in every company that uses MySQL. The lengths people go to avoid schema changes is astonishing.
It is much, much better to start with a real database like PostgreSQL because whatever you pick is going to be what your entire org uses from now until eternity.
Cool, so I'm going to assume that means your answer to my question of "Do you have much direct experience with high-volume OLTP database workloads?" is "no". Given your "bet" as well as comments about schema change difficulty, I'm also going to assume you did not click through to my profile...
And to clarify, I'm not saying that to toot my own horn. My point was, I primarily work on open source schema management and related consulting. I talk to large companies about MySQL schema changes literally every single day. The comment about "the lengths people go to avoid schema changes is astonishing" simply does not gel with reality among large-scale MySQL users.
As for the random blind accusations about bad code, engineer incompetence, etc that's just rude, mean-spirited, and misinformed. I personally know a lot of exceptional database engineers who work on MySQL-related infrastructure at the companies I listed above. Why crap on other people's work that you haven't seen and know nothing about?
Agreed. I think MySQL is still dragging the bad reputation it got in the early 2000s, which is unfair considering how much it improved. And I write that I someone that used to hate MySQL for all its shortcuts. I have one app in production based on MySQL. I have been thinking of switching to PostgreSQL for years. But the truth is that, the more MySQL improves, the less the switch is justified :) The engineering effort put by Google, Facebook, Oracle, etc. in MySQL and InnoDB during the last ten years is impressive.
Qualifying MySQL as a "toy" database in 2019 is obviously wrong.
But I think most companies you mentioned don't use MySQL in the usual way, as they would use a "standard" enterprise database like Oracle, SQL Server or PostgreSQL.
These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture. For example, YouTube uses Vitess "over" MySQL.
Companies like Instagram are known to do something similar with PostgreSQL.
My point is that maybe you and the parent comment are not thinking about the same use case.
> These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture.
Yes and no. Often it's both. I say this first-hand, having performed significant work on the database tier for two of the companies I listed, and consulted for several others.
For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.
And in any case, why does it matter if there's another layer involved? It's still MySQL powering mission-critical global-scale use-cases. And for example with YouTube, literally the primary benefit of Vitess is that your application can treat it as a single normal unsharded MySQL installation, so those interactions are still very MySQLy.
> For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.
I didn't know about that. That's interesting!
> And in any case, why does it matter if there's another layer involved?
I was writing that in the context of the parent comment about "disabling constraints". I can see why disabling constraints makes sense in a sharded environment, with an intermediate layer like Vitess. But the benefit of disabling constraints is less clear when using MySQL directly in a non-sharded environment.
Since you're here, I'd like to ask why you would use MySQL over PostgreSQL in a new project nowadays?
Regarding MySQL, the two main advantages I can think of are that tables are organized as clustered index (instead of a heap in PostgreSQL, which can be an advantage or a drawback depending on the workload) and the replication tooling.
On the other hand, PostgreSQL has a lot of useful features that I miss in MySQL: table elimination/join removal (exists in MariaDB but not in MySQL), indexes bitmap scan (to combine indexes efficiently), partial indexes, transactional DDL, LISTEN/NOTIFY, materialized views, row-level security, table functions like generate_series.
To explain more re: FB and having another layer on top of MySQL, there are a bunch of separate sharded MySQL tiers there. It's split by workload -- for example, the access pattern, schema, and sharding key differs completely between the main social graph, Messenger data, ad market, financial transaction data, etc. And then there's also the internal MySQL database-as-a-service, which allows any engineer to provision one or many databases for any other purpose. Overall, some of these things have services on top that use MySQL more as low-level storage, and others use MySQL in a more traditional fashion.
re: "disabling constraints", that's kind of orthogonal. The large MySQL users simply don't create foreign key constraints in the first place; there's nothing to disable :) Whereas MySQL's ability to disable constraints for a single session is a feature intended to make things like logical dump/restore easier and faster, schema management easier, etc. Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables), which is needlessly complex if the tables are new/empty, and very slow if restoring a logical dump which is already known to be referentially consistent.
As for MySQL vs Postgres, IMO both databases are close enough in major functionality that for many use-cases it's best to just go with what you already know, can hire for, and can operate. There are special cases where one is better than the other, for example personally I'd go with MySQL for social networking / UGC / very high volume OLTP, and go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores.
Ideally with MySQL you're just using it for OLTP, and deferring to separate systems for OLAP, fulltext search, etc. In a way that's "more UNIXy" but it's also potentially an operational headache.
In terms of specific feature comparison, you already have a great list there. A couple other things on the MySQL side I'd mention are InnoDB's buffer pool (smarter caching than relying on the OS cache as pg does) as well as the existence of MyRocks storage engine (LSM-based system offering better compression than pretty much anything else of comparable performance level for OLTP).
That all said -- Postgres is an awesome database, and I'd say that Postgres is more closely aligned with the textbook definition of a proper relational database. But then again I'd also say something similar about FreeBSD (vs Linux) for server operating systems, yet for practical purposes I always go with Linux anyway :)
> Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables)
This problem alone justifies the ability to temporarily disable constraints. I notice PostgreSQL, which is a toy too ^__^, offers something similar.
> Go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores
I agree that PostgreSQL is a really good match for these use cases.
This really makes it seem like you have never used a database system at scale. There are reasons why systems like MySQL let you turn them off, and they are some of the same reasons why pretty much everyone who uses a database at scale has settled on MySQL. Also its why as is mentioned in the issue once you actually scale your database foreign keys become a nightmare. If all you have is a toy project you can feel free to use any database you want, but if you actually need to serve traffic you might want to rethink your priorities.
I am a bit doubtful of this, especially within PostgreSQL you need to specifically go out of your way to create a NOT VALID constraint. I know that MySQL of old would default to an engine that didn't actually enforce key relationships (which was terrible but at least well documented) but in the modern world DBs will tend toward enforcement unless you specifically work against it.
FK’s don’t just maintain “stronger” data integrity, they are the only way to maintain relational data integrity. Application code cannot maintain that relational integrity, period.
Any developer who thinks application code can enforce relational integrity is naive and does not understand relational database systems. It is impossible for any layer above the database itself to keep things from getting corrupt.
This is not remotely true. You can do this with pretty much any database that supports SERIALIZABLE isolation-level transactions correctly.
The point is that application code has bugs, and it's a lot easier to specify your constraints declaratively in a single place using a purpose-built DSL (SQL) than it is to enforce them procedurally every time you access the database.
if you are avoiding the use of foreign keys for the high and lofty goal of "performance" and 100% uptime while you run your fancy online migration utilities, SERIALIZABLE isolation is the last thing you would be using as it does what it says, serialiazes transactions and locks things like crazy. lots of waiting on locks, lots of deadlock potential. Throw in InnoDB's quirky implementations of isolation levels and you'd be in for a world of fun.
I agree with your general gist, but it's not impossible, you just, essentially, are writing the engine yourself.
I've seen a few applications that keep some data hot-loaded into a shared slice of memory and have tight controls over the altering and saving of that data - and that's sort of one of the easier ways to partially avoid a full reliance on FKs, if you want to use write-through caching then you're essentially accepting the cost of implementing data integrity checks in the application layer which can be done - if you're very very careful.
Unless you're working with immutable primary keys (i.e. the values can neither be updated nor deleted), it is actually not possible to reliably enforce a foreign key constraint outside the database.
Speaking in terms of theory, it absolutely is. RDBMSs aren't magic boxes, they stage data for insertion, validate it, execute that insertion - all while littering the WAL (or equivalent) with the steps necessary for the guarantee of that operation. You absolutely can write this logic into your application, at that point your application may basically be Postgres but it's possible.
If you attempt this and don't exhaust a small rainforest's worth of notepad paper and a few truckloads of dry erase markers then you're probably doing it wrong - but the problem is solvable.
There are some unsolvable problems in this realm (like holding to ACID while also ensuring that all valid communications from a client are properly processed) but you can accomplish anything your RDBMS does, you just need to be really really absurdly careful and know what you're doing at a theoretical level.
But, if you want to ensure the data is never visible in an inconsistent state, you either need to use db-level concurrency-related features like transactions or locks; or some kind of lock or other concurrency-control features at the app level while guaranteeing the db has no clients other than your app.
It seems difficult to wind up with better performance characteristics by doing this than using the higher-level abstractions the database is already supplying for this purpose, like constraints. I suppose it's possible with enough hours by enough experts, as you suggest. I doubt that's what github actually did though, they probably instead decided to write app logic that was resilient to visible data inconsistency. Which sounds difficult and dangerous to me, but github is a pretty reliable app, so I guess something worked.
Ensuring foreign key consistency without foreign key constraints is not just "use transactions", you have to be careful and intentional about how you are using them (and ensure all clients do). Why would you choose this over just using the foreign key constraint which takes care of it for you, using the same underlying technology?
Perhaps that's a clearer way to say what I was trying to say originally.
NoSQL got popular because Google needed it for their billion-user planet-scale system, and then a million startups pretended they were Google. As soon as Google built their planet-scale NoSQL system, they immediately started replacing it with a planet-scale SQL system (Spanner)
And one of the reasons NoSQL didn't revolutionize anything is because people realized it's worthwhile to put on your seatbelt. It's either arrogant or naive to think that automated safety measures aren't going to catch any of one's mistakes.
Worked at a bank years ago. One of their DBAs eschewed FKs in production. Databases were designed with FKs which were enforced in Dev and QA. If your app survived testing (automated, QA team ... the entire gamut) without producing FK violation exceptions, your app could be promoted to production, where FKs were not enforced, making things pretty fast.
I feel like this was a stopgap on the way to eliminating FKs. I have no idea what his roadmap was because I wasn't in that group. But his process made for good thought fodder.
This approach is quite scary to me and I would have argued very vocally against their rejections of FKs. There are ways to rely on replication for read serving, periodic disabling of FKs during batch inserts, FK integrity checks on replications, ... these can all address the performance issues inherent in FKs, it's also (generally) quite possible to attempt to architecturally disentangle too large networks of interdependent data in an effort to reduce how large any single data store you are relying on will get. But disabling FKs in production is... eh.
I'd much rather see the opposite, declare the FKs on a testing environment NOT VALID, run application code, VALIDATE CONSTRAINT those you'd marked and see if anything is in violation - ensuring that application code won't (in the normal course of operation) generally hit FK constraints can greatly reduce your performance hit from having them... then in prod you can turn them on and be confident in your comparably lighter performance loss to data integrity.
FKs add read locks to referenced rows. It limits concurrency and it is observable. FKs constrain your ability to incrementally widen 32-bit FKs once you go over 2 billion rows, if you start out with 32-bit PKs. Two concrete reasons to avoid FKs in production, or at least disable them for longer running transactions.
I think this perspective is something you only get once you've run bigger databases in production.
These sound like implementation details. In PostgreSQL FKs just add a shared write lock to the foreign key columns of the referenced table (i.e. the locks prevent anyone from updating the primary key of the referenced row). Also what you said about 32-bits is not true in PostgreSQL either as far as I can remember.
The costs of having FKs in PostreSQL are:
1) If you update the primary key of the referenced table you might see issues with locking. But this is rare in real world applications.
2) Performance overhead from having to check all FKs and taking the locks. This can be a big issue on some workloads and may force you to add extra indexes. A PostgreSQL specific issue is that FK checks cannot be done in batches.
3) Adding new FKs block writes from the referenced table and dropping FKs lock out both readers and writers from the referenced table. This is a limitation of the implementation.
not sure about mysql, but in postgresql an update on a table with a foreign key will take 'FOR KEY SHARE' locks on the referenced table, which is a weaker type of lock. updates on the referenced table that do not update (primary) key columns (changing a pk is very uncommon anyway) will suffice with a 'FOR NO KEY UPDATE' lock, which does not get blocked by 'FOR KEY SHARE' locks. in fact, the main reason postgresql has these weaker 'FOR KEY SHARE' and 'FOR NO KEY UPDATE' lock types is for handing of foreign keys.
Also worth noting here, if you ARE updating the primary key on the referenced table, then locks MUST be taken to ensure data consistency. If that PK isn't locked, than by the very problem definition you have open transactions relying on the original PK value.
Those locks would be very challenging to accomplish at the application level.
This sounds like an implementation specific detail (no doubt for mysql) that for all you know has been fixed in later versions but gets passed around as if it is a permeant truth that applies to all database servers.
Yes, but it can be done without locking out reads. The only thing you need to lock against is someone changing the primary key of the referenced row or deleting the row.
PostgreSQL has implemented this minimum level of necessary locking for quite many years now.
I never said anything about locking out reads. Did I?
FWIW, I just tested in Postgres. Locks like I said it does:
A: create table parent(id int, value int, unique(id));
A: create table child(id int, parent_id int references parent(id));
A: insert into parent values (1, 10);
A: begin;
A: insert into child values (1, 1);
B: begin;
B: select 1 from parent where id = 1 for update;
B: (blocks)
The situation in MySQL is worse because it'll block updates on any field, not just row locks.
I have specific experience of this due to use of database locks at the application level to avoid deadlocks (different lock orders) and inconsistent updates (updates based on reads across multiple tables that may have separate racing updates) by locking rows up front. For understandable schema reasons, what is logically a parent entity is the natural thing to lock, but for understandable performance reasons, FKs to the parent entity are distributed through some fairly large tables.
Ok, now think this the rest of the way through. Without FKs, how do you, at the application level, ensure that A and B don't commit separate changes (A to child, B to parent) that break consistency?
If A tries to insert a child for 1, and B changes the id to 2.... OOPS! And from both's perspective it looks perfectly safe.
A neat idea for sure. However I'd be concerned that suddenly you have two different applications. Example, if you delete a row with a cascading deletion elsewhere, that will work on dev, but on prod it would leave dangling data.
Though, I imagine you could simply not use cascading behaviors.. Neat idea though. Scary, but neat hah.
But seriously, there are a host of issues one must overcome for this plan to work. And there were indeed managed permissions for DELETE. But ultimately, I wasn't aware of much of the implementation detail ...
My feeling on it though is that it just shifts all the faith to your test data being true to not only does but conceivably could happen in prod.
All very well testing with a bunch of constraints, but if it turns out one actually isn't exercised and then happens in production?
It also means you can't use them for validation or whatever. Which, I don't know, maybe you shouldn't anyway, but if you turn them off in prod you can't.
That's fine when one application is using the database. I work in the enterprise space, and we have at least 5 different applications all working with the same database. Each of these applications has their own team. Expecting each team to handle constraint checking uniformly in their applications isn't always feasible, so we use foreign keys.
GitHub's approach might work fine for small, focused teams who have exclusive control over a product. However, that's not often the reality of most organizations.
> That's fine when one application is using the database.
It's not even fine for a single application. Said application has bugs in it. Said application can (and will) crash in unexpected ways that leave the database in an invalid state. And if you think that "oh, that will never happen to me". You just haven't been around long enough. It will happen, every single time. Any system without foreign keys will have corrupt data in it. Said system will have user-visible issues as a result.
Just like only the backend can truly validate user input from a webform, only the database can validate its input.
Until a programmer forgets to wrap the new feature in a transaction, or just has a bug in their logic that breaks the integrity. Or a db admin working directly in the database breaks it. Or a schema migration. It's not impossible, sure, but it requires you and everyone working with the data two be very, very disciplined, and is only as effective as the weakest link.
I fixed one too many data corruptions to do away with FKs. If we are large enough that FKs are causing performance issues, then we are large enough to invest in and deploy another performance boosting solution while still protecting the integrity of our database from contractors/employees running around with direct SQL access (another nightmare unto itself, but one that I as a developer have less control over than if there are FKs or not).
Wow. No discussion at all regarding the correctness of their databases. Is all the data correct, or are there FKs referencing missing PKs? Actually, even discovering this would be difficult, since any FK violation could be just viewing the middle of what would be a consistent update in flight.
I get that cross-shard FKs are especially difficult. But any discussion of this topic without addressing correctness concerns is woefully incomplete.
It's not like you can avoid incorrect data just with foreign keys. Say you have an invoice model. There's a boolean value indicating that the invoice is final and a numerical value for invoice number. Final invoices must have invoice numbers. But a bug in your system manages to update an invoice so that it's final, but missing an invoice number. That row is incorrect and it's gonna cause an issue somewhere.
I can only imagine how many production databases contain incorrect data.
> There's a boolean value indicating that the invoice is final and a numerical value for invoice number. Final invoices must have invoice numbers. But a bug in your system manages to update an invoice so that it's final, but missing an invoice number.
If you're finalizing the invoice, you're hopefully doing something like this, right?
UPDATE invoices SET
final = TRUE,
invoice_number = @InvoiceNumber
WHERE id = @InvoiceId;
(Where @InvoiceNumber is some variable the application's substituting into the query)
If so, then the problem you present should never happen (unless the DB doesn't do atomic updates by default, but wrapping the update in a transaction should provide the necessary guarantees to prevent the problem from happening; if your DB doesn't support transactions, then you really should be switching to a different DB yesterday).
If that somehow could happen, though (i.e. you don't trust the application to be bug-free, which is a reasonable attitude), a CHECK constraint (as others have pointed out) would make the database enforce that:
ALTER TABLE invoices
ADD CONSTRAINT final_invoices_have_invoice_number
CHECK (final = FALSE OR invoice_number IS NOT NULL);
Of course, I'd also be wondering why an invoice would ever exist without an invoice number (it's easy enough to just make the invoice number NOT NULL - or, better yet, make it the primary key), but hey, if that's the business requirement, then that's the business requirement.
>you're hopefully doing something like this, right?
Well, in an ideal case, yes, you would be doing it like that. The reality might be different, especially when using an ORM.
>Of course, I'd also be wondering why an invoice would ever exist without an invoice number
Draft invoices do not have an invoice number, since they don't really exist anywhere. You can delete a draft invoice and nothing has happened. But if you have an invoice number, that's a record that must be kept.
> Well, in an ideal case, yes, you would be doing it like that.
I mean, it'd be either that specific case or a bug. And if it's a bug, then the check constraint as described previously would catch it and prevent it entirely.
Another option, though, would be to not even bother with a separate field for "final"; if the only two states are "draft" and "final", and finality is conditional on there being an invoice number, then the application logic can be greatly simplified:
SELECT CASE
WHEN invoice_number IS NULL THEN 'Draft'
ELSE 'Final'
END AS state
FROM invoices WHERE -- yadda yadda yadda
Most databases can cache this as a computed value column on the table itself or in a view or what have you.
> Draft invoices do not have an invoice number, since they don't really exist anywhere.
Sure they do: they exist in your database. Unless your company has a strict rule about invoice numbers always being sequential with no gaps (and that'd be pretty darn strict, in my experience), you might as well pre-assign it.
> Draft invoices do not have an invoice number, since they don't really exist anywhere. You can delete a draft invoice and nothing has happened. But if you have an invoice number, that's a record that must be kept.
Must invoices be sequential or something? Why can't I just assign the draft it's to-be number, and know it's a draft because FINAL=false/0
The main criticism seems to be that the FK relationship makes migrating the referenced table difficult. But why not remove the FK with ALTER TABLE before the migration, migrate, and add the FK back again (which will catch any missing primary keys), preferably inside a transaction?
The method you propose might work, but not necessarily well. Catching missing primary keys at the end is a problem because then you have a bunch of data integrity issues to sort out, preventing you from re-applying the constraint, in a live system where apps are relying on the database to do integrity checks, meaning that there's a decent chance that the running system is creating integrity issues faster than you can sort them out. Possibly orders of magnitude faster.
Doing everything inside a transaction may invite concurrency issues around locking, maybe even deadlocks. Which is again a problem in a live system, because you might be causing services to fail.
I'm not sold on FKs being something you should never do, ever, but I will say that I used to work at a place that pushed their database servers very hard on real-time tasks, and also tended to avoid foreign keys, and my biggest complaint was not data integrity, it was just that there was extra effort that needed to go into documenting the logical foreign keys (the keys still exist, they're just not enforced by a database constraint), since you couldn't just read them from the database schema.
I'd like to say that this can be something where you default to having FKs, and remove them as you hit performance problems, except that the poster is right: Developers have a habit of leaning on database constraints instead of doing their own sanity checks, and they simply don't know all the places they're doing it - they're not necessarily even aware they're doing it when they're doing it - meaning that removing FK's after the fact doesn't get you to "constraints are enforced by the apps", it gets you to "constraints aren't enforced". I think you probably want to try and anticipate ahead of time if you'll have a problem, so that everyone can know from the get-go whether they'll be working without a safety net.
Is it even possible to realistically do FK checks in app code? Nobody uses serializable transactions and with other isolation levels I think it's possible to check for FK, it's OK now, insert new row, but another transaction simultaneously deletes that row and both transactions happily commit. It's called phantom read. Probably it's very rare event, but it'll happen.
At the place I worked that skipped FK checks, they used append-only database schemata. If you limit yourself to CR operations, a lot of things get easier than they are with CRUD - ensuring (a somewhat weak form of) database consistency without transactions, yeah, but also things like audit trails and slurping new data into the BI system.
It's not "you can't add checks because they might fail"; it's "you can't temporarily remove checks in a live system where the database's users are used to leaning on them, because when you try to re-apply them you just might find yourself in a world of fail".
Adding a foreign key to a massive table can take a long time, since all existing row data must be validated against the constraint. With logical replication, long operations like this can be quite disruptive, even with a multi-threaded logical replication scheme.
Sadly I don't think this is actually true. From that page:
DDL statements, atomic or otherwise, implicitly end any transaction that is active in the current session, as if you had done a COMMIT before executing the statement. This means that DDL statements cannot be performed within another transaction, within transaction control statements such as START TRANSACTION ... COMMIT, or combined with other statements within the same transaction.
Whereas with real transactional DDL, the DDL can be done within large transactions, multiple ALTER's could be rolled back, no changes are visible outside the transaction...
MySQL 8 was extremely new at the time this comment was posted. It's likely that mysql limitations were a big part of the thinking. The mention of pt-online-schema-change at the bottom makes me think so.
I didn't quite understand the migration issue TFA talks about, but at least with PG, there's more than enough clever ways to make schema changes incrementally.
Ahh to fkey or not to fkey. The problem with doing referential integrity in the app is what happens if a change is made to the database outside of the app?
I guess it depends. If you’re at the scale where you can’t refactor your database to keep up and you, after profiling or some other method, have proven that moving integrity enforcement into the model of your app is faster then by all means do that.
That being said: foreign keys help the query optimizer make better decisions. If you’re using and ORM though that doesn’t matter as you’ll get generic queries and you’re likely not profiling your queries anyway.
They also are a way of self documenting the relationship between things.
If you’re going to get rid of foreign keys then why not denormalize so that the data is easier to query and less likely to get corrupted? Reduce the number of tables say into a single table a-la the DynamoDB approach
There seems to be a certain mysql tinge to this value calculation. I use foreign keys and like it (I use Postgres). I do not encounter the problems the author does. I omit them when they are a problem for some reason. This is rare in routine work.
True, but the people I know who run PostgreSQL at that scale do not complain about foreign keys. Not since PostgreSQL added separate lock levels to improve concurrency with foreign keys.
Because I've never thought about this, I'll ask the dumb question...
A shopping cart has many items. An item belongs to a shopping cart. In a relational database, without foreign keys, how do you associate the shopping cart with the items?
The suggestion is to not use foreign key constraints in the database.
The suggestion is not to eschew columns which might be JOINed on in a query.
I.e. "Don't enforce FK relationships with a constraint in the DB. Make sure the values in both tables which may be joined are consistent by using application code to enforce this."
> Don't enforce FK relationships with a constraint in the DB. Make sure the values in both tables which may be joined are consistent by using application code to enforce this.
For those reading at home who aren't good with databases. This is exactly the same statement as:
"Don't enforce valid inputs on the backend. The javascript front-end will enforce it for us".
Always validate your inputs at the appropriate layer. Your backend can never trust input from a web front-end (even if there is "only one web front-end"). Likewise, your database should never trust input from its clients (even if there is "only one client"). Ignore this rule at your own peril.
Are your engineers omnipotent beings who never fail, never make mistakes and understand the system entirely (even all the parts they don't know about)? Is your QA process perfect? Your servers never crash in weird states?
A foreign key is a hard-constraint that must be enforced.
There's no stopping you from storing item IDs in a theoretical shopping cart table without FK constraints though. It's just that the onus is on your application to provide the guarantee that the item exists.
This becomes a little clearer if your item IDs are not simply auto-incrementing IDs (which have their own challenges in a large distributed system), but instead are SKUs, or things which have actual meaning.
If you have a need for a ID that is unique but otherwise meaningless in a distributed system, feeding a concatenated node ID, timestamp (to ms), and looping transaction counter (mod a sufficiently large number) into a secure hash like SHA-256 should take care of it.
Sure, that satisfies the uniqueness aspect if that's all you're looking for in an ID. But it doesn't satisfy the predictability aspect.
You can't know what the ID will be ahead-of writing the item to the database, nor store the item in a shopping cart without retrieving the item via a different piece of information (which is pretty likely to be SKU in this case)
When there aren't any connections to known unique keys (like SKUs for shopping), PK predictability isn't really possible (or necessary). The above formula is just an alternative to using auto-incrementing integers when in a distributed environment.
In a non-relational model, you'd generally copy the items/products and their quantity into the shopping cart.
That's a huge change from the relational model, and takes much more space than a normalized approach, but it also comes with some advantages. For example, it makes sure that the price of the items in the cart remain constant, even when you change product prices. That way, you don't have to inform the user "oh, your cart just got 8% more expensive" before checkout.
Agreed. Invoices, for example, are a representation of history so they need the exact data that was used at the time. They can have a reference to a product id but at any time it's possible for the data associated with that product to be completely changed from the meaning at the time of the invoice. This happens with, e.g., UPCs, which can be reused. On the other hand, if you have assemblies of products then you probably want those items to update themselves automatically.
It's confusion over the terminology. There are two definitions of "foreign key" in wide use:
(1) When one table's key occurs as a value in another table, in that second table that column is sometimes called a foreign key. For example, table A has id, table B has A_id, and people refer to A_id as "a foreign key".
(2) When you define this relationship explicitly in the database, so that the database can enforce it (and/or to document it), that's called a "foreign key constraint" but also sometimes just a "foreign key".
This article just means they don't use #2. It doesn't mean they don't use #1.
The phrasing was probably clearer in the context of the discussion they were having at the time because the comment above it linked some docs for the software in question (qh-ost), and under the "Limitations" section, those docs said, "Foreign key constraints are not supported." So to them it was clear they were talking about database constraints.
I agree, whether or not a given data model has foreign keys is a math question with a correct answer (up to isomorphism of the data model); if it does, where in a system they should or should not be enforced is an entirely separate, implementation (vs semantic) level issue.
you are talking about relation, in this context fk is an abstract concept
linked article describes fk as an dbms construct - a schema constraint
relations and conceptual foreign keys may exist without dmbs constraints, in such scenario dmbs does not guard validity of conceptual "foreign keys" and treats them as usual data
The association would still exist implicitly in the code that uses the database. The columns could stay the same, and the code could still use the item's hypothetical cart_id column to get the items in a cart.
You lose the explicitness and safety of referential integrity in the database, but it's a trade off that can enable other things, like sharding, as mentioned by the GitHub employee.
You do not have to have a foreign key defined to do a join across tables. Foreign keys just enforce that all the values in the given column of a table are present in another given column on a specified table (usually a second table). They prevent you from entering data that does not match up.
A FK declaration uses named field(s) in your table. You can omit the FK but the associative fields remain. An FK is a constraint on what values can be in those fields.
An item should never be deleted from a catalog. That would break a lot more than carts.
If you need to track the inventory status/availability of an item, then you store that in the database. The specific implementation (field, list of current items, etc) depends on your needs.
All of these are mostly issues with the database engine implementation:
1. "FKs are in your way to shard your database." => not a problem for distributed databases that can query and duplicate data across shards/servers; alternatively, if the referenced data just isn't there in the database, a foreign key is not usable by definition
2. "FKs are a performance impact" => the app either just got the foreign key value from the database, so it should be cached in memory in a properly implemented database engine, or otherwise, the application relies on the database checking so you need the foreign key for correctness
3. "FKs don't work well with online schema migrations." => not a problem with database engines that properly support online schema changes without locking, downtime or table renaming hacks
The performance impact is at write time, and having the data cached in memory is irrelevant. It's the locking overhead that's non-trivial, especially if each table has several FKs, when operating in a high-volume OLTP environment.
As for "database engines that properly support online schema changes without locking, downtime or table renaming hacks", please name some. In my experience, every major DBMS has cases where certain ALTERs block concurrent DML, which is extremely problematic on very large tables. Or even cases where there's no locking, but the operation still takes a very long time, which is conceptually problematic for logical replication.
> FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.
You have to check referential integrity somewhere. You can do it in the application or you can let the RDBMS do it for you, but it will have to happen somewhere. So the overhead argument is kind of bullshit. Yes, managing the integrity in the app might provide more flexibility, but it comes with a price of, well, having to do your own integrity checking and introducing more bug space into your code.
> FKs don't work well with online schema migrations.
...
> Doing schema migration of P is just not going to work. Recall that gh-ost renames the table at the end.
So basically, the problem is not foreign keys, the problem is gh-ost implementation. Since they don't use FKs, they implemented their migrations in such a way as to make it impossible to support foreign keys.
Yes, in the parent-child example, to migrate the parent you would have to migrate the child. But it is not an impossible thing to do. Create the new parent, import the data, map the record ids from the old parent to the new parent, create new child, drop the foreign key, import the data, update the FK field using the map and create a new FK. Rename the tables and you are done.
> When eventually you want to shard or extract data out, you need to change & test the app to an unknown extent.
Sharding your data is not something you do on a whim. And you will need to test your app. But you were going to test your app anyway, weren't you? Or did you think that you could split your database into multiple databases without any testing? As for unknown extent, that is kinda bullshit. Look for places that catch integrity exceptions. Thats what you need to look at. Look at your database schema and if any of your tables are "ON CASCADE DELETE", go and punch your DBA in the face. If you are the DBA you might want to start running before someone else finds out.
> > FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.
>
> You have to check referential integrity somewhere. You can do it in the application or you can let the RDBMS do it for you, but it will have to happen somewhere. So the overhead argument is kind of bullshit. Yes, managing the integrity in the app might provide more flexibility, but it comes with a price of, well, having to do your own integrity checking and introducing more bug space into your code.
Perhaps they meant "latency" rather than "performance". With eventual consistency you can have lower latency, and spend more cycles fixing referential integrity later.
Of course, sometimes things fail in eventually-consistent systems (e.g., you order some item from a vendor who claims to have stock but oops! they don't, so your order will take a few more days to ship than they claimed and now you're a bit sad but hey, it all works out in the end, right?).
Github is probably past the stage of premature optimization.
Presume (for argument's sake) that we accept the premise that foreign keys do not scale. Then the question is whether there exists some sort of realistic migration strategy once usage grows past that. If it does, then using FK makes total sense until we hit that point, doesn't it? And if true this should apply to 99% of users.
But if there exists no such migration strategy, then maybe the guy has a point?
This thread makes it sound like it's impossible to make a medium scale application work without FKs constrains. I work on a commercial application that depends on relational relationships without FKs and the number of data corruption cases that we run into on regular basis is zero. Data integrity is handled at the application side, unexpected crashes are accounted for by heavy use of transactions, and everything works just fine.
To be clear, I'm not advocating against the use of foreign keys. But not using them is perfectly doable and not at all what this thread would have you believe.
> It may even rely on FK to cascade deletes (shudder)
Is this just the author's personal taste, or is there something about mysql that makes ON DELETE CASCADE a bad idea? In postgresql it's a useful tool for maintaining database consistency.
I'm guessing it's the possibility for data loss. So someone accidentally deletes a user and it cascade deletes all invoices referenced to that user (deleting invoices is a big no-no in accounting).
These kinds of things are exactly why foreign keys are so helpful to keep things from getting corrupted. Invoices should always reference a user in the user table.
The relationship should not be "ON DELETE CASCADE" but probably "ON DELETE NO ACTION". Then when some junior dev tries to issue a DELETE on the users table, the DB will correctly tell them to get bent and throw an error because there are dangling invoices still associated with the user. Because you are a smart person, you've also prohibited DELETE and UPDATE's against the invoices table for the db user that is used. The application cannot delete the invoice, and it cannot delete the user either--which is the only correct course of action. The database can now protect itself from hostile input from all your junior devs. Exactly why you want FK's. You can't trust user input.
The fact that the author of the article shudders about cascading deletes demonstrates that they have zero authority to speak about databases--something that is very common in the developer community. DB's aren't mysterious black boxes. They are really fucking cool tools that can do all kinds of neat shit to keep your data nice, safe, cozy and warm (they do way more than that too... a good database is perhaps one of the most underutilized tools developers have in their toolbox)
If the author is thinking of that kind of situation, he's right that ON DELETE CASCADE would be inappropriate, but he's also incredibly wrong to not want a foreign key enforcing that relationship. Deleting a user associated with invoices that should not be deleted is a big no-no too. A foreign key would not let the user be deleted until after the associated invoices are deleted.
There are plenty of other situations where ON DELETE CASCADE is the best option.
the problem with ON DELETE CASCADE isn't the CASCADE, it's the DELETE. Almost always, you want to to mark an entity disbled, not really DELETE it. If you want to DELETE it (for GDPR?) you should have something in place to fail your delete unless you've properly defined how to clean up danglig keys. (Perhaps what you need to do is delete the non-primary-key fields containing user data, but keep the row for relational integrity)
ON CASCADE DELETE is used for deleting logical sub-components (a row owned by another row) in a well normalized database -- it's for deleting your dadress book when you delete your account. It's not for deleting all your friends when your account is deleted.
> It's not for deleting all your friends when your account is deleted.
Very nice summary up until that point. But in a well normalized db the friends relationship (many-to-many) would be its own table. Friend relationships with the deleted user would be deleted, as they should be. Friends would not. (You still have time to edit for a better example.)
Okay, then don't do ON DELETE CASCADE. If you really want to allow the user to be deleted entirely, then ON DELETE SET NULL (or ON DELETE SET DEFAULT pointing to the ID of some "null" user) should do the trick. Otherwise, if you want to retain the customer info (which if you're saving the invoices you almost certainly do), then ON DELETE RESTRICT is what you want. This is all defined in the context of the invoice's schema, mind you, so you'd have to make a conscious effort to cascadedly delete your invoices on user deletion when defining that foreign key constraint.
Use the database to do your bidding. Easier to stay sane that way :)
I mean, sure, don't use on delete cascade in that case. But there's a big difference between "sometimes bad" and "always bad". The link seems to argue "always bad".
If I have data which much always be deleted if its owner is deleted, what's the process? Manually issue exhaustive delete statements in a transaction?
Is on delete cascade always bad or just a bad default? (It's obviously a bad default, I'm not arguing that.)
The performance of my team's application at Myntra (Bangalore) was badly affected by foreign keys on MySQL and we decided to drop them.
The trick we used was to drop them only in production, not in test environments. In test environments they acted as guardrails to ensure that our application did not break the constraint.
I had a database with no RI and it was a nightmare, filled with bugs.
So we added RI and it was awesome, and the bugs did go away with time.
But then we became dogmatic and had to have RI for everything, including things like maintaining local RI lookup tables for non local data. This just bogged us down so much that we couldn’t move.
Today, we use RI strictly within a single database. For foreign keys to remote databases, we assume the remote database knows what it’s talking about and that the incoming data has already been checked.
This works really well for us and in particular sharding is not a problem (sharding being a design decision not an implementation decision IMO)
Someone said elsewhere that Rails does the RI so the DB doesn’t need to. I don’t use Rails, and I don’t bet, but I’d wager that it’s faster to do RI inside Postgres than inside Rails.
The idea that we should throw out ALL RI is just dogmatic nonsense. I should know :)
Rails in 2010 was just coming around to adopting foreign keys, and the rubygem Foreigner made it happen before it was a proper rails feature.
The idea of rails enforcing unique constraints and FKs back then was... Optimistic. The design constraints assumed exactly one rails instance would access the database at any given time to ensure consistency.
I thought foreign keys were necessary for the consistency part of ACID compliance. Once you remove the integrity constraint, the database cannot guarantee consistency, even with isolation and atomicity.
Sure, removing the constraint will get you a huge burst in performance, but now you are in charge of guaranteeing consistency outside of transactions in a distributed system involving at least one client and one server. That's a tall order.
As Runar Bjarnson put it: "Constraints liberate, liberties constrain,"[1]. If you give something the freedom to be inconsistent at any extreme scale, Murphy's law requires that anything that can go wrong, will go wrong.
It is hubris to think that the average developer who hasn't spent years becoming a consistency expert is going to be able to guarantee that data won't become corrupted better than the engineers that build and maintain and acacademics that study rdbms systems full-time.Even experts get this wrong all the time [2]. To all junior devs out there, until you know you need to abandon a guarantee available to you in a given system, exhaust every other possible option.
Consistency, in the context of ACID, just means that the integrity constraints that exist are enforced. It does not mean that constraints that don't exist should be enforced.
My understanding is that it's the other way around. Without ACID you can't guarantee FK correctness. You can't be sure another transaction might add a key to a row you're about to delete.
Many people on this thread have asked if there is a rigorous way to migrate data between relational schemas that contain non-trivial foreign keys, and the answer is yes: In the open-source categorical query language CQL, data migrations necessarily respect foreign keys - a property guaranteed by the mathematics of category theory and the automated theorem prover that ships with CQL. http://categoricaldata.net
Oh man, that migrations point. I remember at my last company needing to handle this. We had some (huge) read-only tables that were referenced by other tables and we needed up update the read-only tables in batches from raw mysql dumps. The keys in the dumps were stable, so we could swap out the table and the references would still be valid. Since the referenced tables were huge, they took a long time to upload the batches, and we needed a migration that was ACID and fast.
The process ended up looking like this:
1. Create a `foo_load` and a `foo_unload` database, dropping any that already exist.
2. Populate `foo_load` with all of my new data.
4. Rename `foo.table` to `foo_unload.table` and `foo_load.table` to `foo.table`. This is atomic, fast, and updates references to point to `foo_unload.table`.
At this point the application reads from the new data, but referential integrity checks still verify against the old data.
5. Update each reference to `foo_unload.table` to point to `foo.table` using ALTER TABLE DROP FOREIGN KEY, ADD FOREIGN KEY. Again this is atomic and runs quickly.
6. Drop the `foo_unload` database.
Now this process is slow, but the switchover is 2 atomic steps and requires locking the tables for only a few milliseconds. The process is safe as well, if it crashes at any step, running the script again will cause it to safely recover. Well, with one more step zero:
0. If `foo_unload` exists, do the same rename step as step 5.
This solution worked for us for a few years. Eventually we swapped out the whole architecture around these tables and ended up with a more traditional ETL flow that didn't involve moving round huge mysql dumps.
One great thing about not enforcing FKs is that it makes integration testing a lot easier. You can load just the data you need to test with, and none of the FKs need to point to rows that exist that aren't within the testing scope.
This approach isn't for everyone. It works well with DDD where aggregates form contextual table boundaries and is eventually consistent by default.
If you ain't testing against the full system state, then I'd be hard-pressed to call that "integration testing".
If you're generating the test data from scratch, then it shouldn't be hard to generate the dependent data while you're at it. If you're testing against (anonymized) production data, then it shouldn't be hard to pull the dependent data while you're at it. In either case, you should be validating the integrity of those data relationships as part of the test criteria.
It isn’t hard, but often very cumbersome. Oh, so you want to test invoicing, for this Customer, which must have a Delivery Address and an Invoicing Adress which both needs valid Postal Codes. And the Customer must have a Contact person. Then we need the Product, that must consist of at least one Article, and each Article must be connected to the Company that we bought it from, with Addresses and Contact Persons and half a dozen other entities. But we have forgotten that the article also need to be in a Category so we know which Sales tax or VAT that needs to applied, for the State or Country of Sale. Then of course, which Sales Person, belonging to a Sales office, with contact person, addresses etc should get credited for the sale. And we still haven’t been able to actually create the order yet, because we haven’t created the delivery options.
I think you get my point. I’ve easily used more than a full day just to get enough data in a naked system to make just the simplest test. I’m very grateful for tools like tsqlt that make unit testing possible (by temporarily turning of FKs)
That all seems pretty reasonable to me, and darn well should be included when "test[ing] invoicing":
- You surely want to make sure invoice creation depends on a valid billing address at the very least, right? If that breaks, you're gonna have a lot of rather irate AR clerks.
- You surely want to make sure the Contact is aware of the new invoice on the order, right? In fact, that might very well be part of the Contact's performance metrics, so if that breaks, you're gonna have a lot of rather irate sales/support reps (whatever "Contact" means in this context).
- You surely want to make sure your invoices are against valid items, right? And you surely want to make sure that expected revenue correctly ties back to an inventory movement, which in turn ties back to an inventory receipt, which in turn ties back to a paid invoice to a vendor, right? If that chain breaks, you're gonna have a lot of rather irate accountants and financial auditors.
- You surely want to make sure you're getting the right sales tax calculations, right? If that breaks, you're gonna have a lot of rather irate accountants, financial auditors, and tax collectors.
- You surely want to make sure the Sales Person and Sales Office both get credit for the invoice, right? If that breaks, you're gonna have a lot of rather irate sales reps and managers thereof.
So... no, I ain't exactly getting your point, lol. If you're changing invoicing, then all of the dependencies and dependents of invoicing ought to be tested.
But more to my point:
"very grateful for tools like tsqlt that make unit testing possible"
Unit testing != integration testing. If you're unit testing, then sure, turn off foreign keys and practice your quick draw while you cowboy it up. If you're integration testing, then that inherently means testing the whole system as a whole; what you call "cumbersome" I call "the bare minimum of comprehensiveness".
"I’ve easily used more than a full day just to get enough data in a naked system to make just the simplest test."
That's usually pretty easy to script, even with foreign key constraints. It might take you a day, but future days should be able to call upon that same script, saving you quite a bit of time :)
I don't think we disagree about much. My only objection was to
> then it shouldn't be hard to generate the dependent data while you're at it
which makes in sound like somewhat light work. My point was only that it isn't. And of course we script a lot of this test-data-creation, but then you need different kind of data for different tests, and need to add some flexibility. And after a while, just generating test data becomes somewhat complex in itself. And all of the test-data scripts needs to be maintained and changed whenever the model changes. None of this is done in a breeze.
One trick here is that if you're integration-testing, your tests are hopefully already generating this data for you (if not, then you should probably be writing more tests, lol), in which case you'd make the tests you do want to run dependent on the tests which would generate that data (for example, the Invoice tests would be dependent on the Sales Order and Customer and Product and Address and Contact and Sales Rep and etc. tests, so those tests will generate the requisite data, and the Invoice tests will use that data to create and test Invoices).
I'm confused, how do you design a database where things in A can't reference things in B? It's not relational at that point right?
Or is this just about the problems of DBs where you enforce referential integrity? In which case, if that's such a problem, like just don't enforce referential integrity...right?
I have been perpetually annoyed at the SQL/RDBMS/relational calculus model. It always feels like a huge context shift from imperative programming. after many years of writing SQL, I noticed that many other people end up writing SQL statements that look more or less like computer programs (CASE statements, subselects, etc).
It all came to a head when I naively asked an experienced SQL developer how to represent a tree in SQL and learned one way to do it was to have CHILD nodes with FK references to PARENT nodes. So any time you want to get all the CHILDREN of a PARENT you have to query all children to see if they have a FK to the appropriate PARENT.
This always seemed strange and backwards to me and when I did a deeper dive I learned that tree representations in SQL are a rabbit hole of insanity all arranged around referential integrity.
> I did a deeper dive I learned that tree representations in SQL are a rabbit hole of insanity all arranged around referential integrity.
It’s not crazy, just impractical.
Trees are mostly represented the same way in memory, except you don’t have to access them in table-format, which admittedly for tree-data is very, very cumbersome.
I'm not sure what data structure/language environment you mean, but in programming languages, trees are normally represented using pointers (typically from the parent to the child and back).
All of this literally goes back to hierarchical databases which were mostly replaces by relational databases for what appear to be mainly performance and implementation reasons.
So now you have just created a many-to-many relationship. Due to a bug in software, a child could now be linked to multiple parents, or even no parents. Part of good database design is preventing corruption of these sorts of relationships.
> I think it's more about the expressivity of pointers and arrays compared to the "backwardness-feeling" of index lookups.
Greater expressivity in ways to store bad data is never a good thing.
> It all came to a head when I naively asked an experienced SQL developer how to represent a tree in SQL and learned one way to do it was to have CHILD nodes with FK references to PARENT nodes. So any time you want to get all the CHILDREN of a PARENT you have to query all children to see if they have a FK to the appropriate PARENT.
One far-less-painful solution: use what's called a "closure table" to track the ancestor-descendant relationships not just at the parent-child level, but also grandparent-grandchild, greatgrandparent-greatgrandchild, etc.
For example (assuming Postgres, and eliding some NOT NULL constraints for readability):
Then, whenever inserting a new node (assuming that Postgres allows specifying integer primary key values on insert, which I don't recall if it restricts by default):
-- First node
INSERT INTO node VALUES (0, NULL, "foo");
INSERT INTO node_closure VALUES
(0, 0, 0); -- self
-- Second node (child of first node)
INSERT INTO node VALUES (1, 0, "foobar");
INSERT INTO node_closure VALUES
(1, 1, 0), -- self
(1, 0, 1); -- parent
-- Third node (child of first node; sibling of second node)
INSERT INTO node VALUES (2, 0, "foobaz");
INSERT INTO node_closure VALUES
(2, 2, 0), -- self
(2, 0, 1); -- parent
-- Fourth node (child of second node, grandchild of first node)
INSERT INTO node VALUES (3, 1, "foobarbaz");
INSERT INTO node_closure VALUES
(3, 3, 0), -- self
(3, 1, 1), -- parent
(3, 0, 2); -- grandparent
The upside is that it's now trivial to query for a node and all its descendants:
SELECT descendant.name
FROM node AS descendant
JOIN node_closure ON descendant.id = node_closure.descendant_id
JOIN node AS ancestor ON ancestor.id = node_closure.ancestor_id
WHERE ancestor.name = 'foo';
Or more succinctly (if you already know the ID):
SELECT name FROM node JOIN node_closure ON node.id = node_closure.descendant_id
WHERE node_closure.ancestor_id = 0;
Either of which gives you:
----
name
----
foo
foobar
foobaz
foobarbaz
As part of this upside, since you're not having to loop through every level of ancestry, reads for deeply-nested descendants are much faster.
The downside is that you have to do extra inserts to an extra table. The inserts themselves can be automated by adding a trigger on node which automatically creates/updates/deletes node_closure rows as necessary (which is why you'd still want the parent_id in the main table: so that the trigger can grab that and build the closure rows, and so that if the closure table gets out-of-sync you can fall back on that and rebuild it), but that still leaves a performance impact on writes (pretty negligible for shallow descendants, but it gets worse for deeper ones).
Personally, I'd opt for a closure table if I know that reads are going to be more common than writes. If writes are more common than reads, then sure, some sort of crazy recursive query might be preferable.
GitHub is a special case, perhaps in all of computerdom, in that most of their public data is stored not in (SQL) databases, but in git repositories. That's rather unique!
So we know they must already have a great indexing and search system in place, and they can't use foreign key constraints on most of their data, anyway. Git simply doesn't have that. It's not that kind of database.
Given this environment, it's not surprising they don't use FKs. It wouldn't surprise me if they don't use JOINs much! Throw it all in a repo, and let the indexer sort it out.
That's great for them, but my data isn't shaped like that. Data types and referential integrity are more important to me than history and bisection.
Also, to be noted, Github was a Rails app since long before there was native support for adding foreign keys. I've worked on a lot of large monorails that were created in Rails' early days and none of them have foreign keys.
GitHub is a Rails app. [1] Rails does not create foreign keys by default, so most Rails apps are doing exactly this.
I think you may be missing the distinction between the database feature "foreign keys", and the general concept of an entry that refers to another entry. Rails does the latter, but not the former, unless you explicitly do it yourself.
(Maybe Rails has changed dramatically in the last few years, but up until at least Rails 4, this was the case)
1: Obviously it's also way more, but the main codebase is a Rails app, and has been forever. This thread is from 2016, so that was probably even more true then than now.
The summary is basically this: if you say that a User has_many foos, then the foos table will have a user_id in it. When you ask for the foos a user has, the SQL will be emitted that has the "user_id=1" or whatever clause on it.
If you add a foreign key constraint, the database itself will verify that the foos have valid user_ids. If you don't, then it won't. You're basically giving up the ability for the database to verify the relationship for you. Like any kind of checks, this is more restrictive, but you get some value out of it. The question is, is the validation worth the restrictiveness. The OP comment argues no. Many would argue yes. It really just depends.
They're not talking about whether or not a field is considered a foreign key to another table, but whether or not to define those keys in the database itself so that it also creates a constraint. Really the post is about having a constraint in the database, not the fact that the fields can be considered foreign keys.
To me the issue is that the data model should be defined in a single place. If you're using foreign keys, then the development process needs to take that into account and provide a sensible way to modify the data model in a way that the FKs keep up with the evolution of the product. It's going to mean a lot more migrations for sure.
I personally have never seen foreign keys implemented correctly -- they've captured an outdated version of the data model and are just in the way. For me, they're similar to stored procedures, usually to be avoided.
This decision makes sense from the perspective of the person who owns doing database migrations, or who owns the performance of the database.
However, the other 99% of the engineers in the company end up dealing with the fallout of not having FKs -- more bugs, and slower development velocity.
One alternative is to have FKs in smaller, sharded databases, where you don't have to migrate between shards. For example, all the database rows related to one repo could be co-located in a shared with strong FKs between them.
My advice, put constraints that make sense as you're developing an application, and marvel at how often you accidentally violate them while writing trivial code. You'll realize pretty quickly that even when you know the whole area of what you're working on you still screw up. FKs might not add a lot of value on bet the company, mission critical, heavily tested code. But on less rigorous codebases they will save your ass.
Application vs Integration databases: https://martinfowler.com/bliki/IntegrationDatabase.html. If you're going to have one application writing to your database, having your constraints live in your application is definitely a viable option.
Granted, I have never worked with a database the size of Github, so I've never had to deal with sharding issues. But, even though foreign keys can be a PITA in some situations, I prefer the DBMS to do it for me than rely on the application. Foreign keys have always kept my back. I've been burned a few times when they weren't used.
I'm curious what these comments think they're saying. What effort went into GitHub such that they feel they can make the claim that FKs are an anti pattern. Do they think they're just better than everyone else or is there some other practice that makes this claim more interesting than simple hubris? Me a
Last time a DBA suggested me dropping foreigb keys for performance reasons, i suggested him to write code ensuring data consistency with distributed parallel writes. Haven't heard anything about FKs from him since.
Enable FKs for test and development, disable them in production, for tables that see billions of rows.
FKs have other downsides. They add locks on the referenced rows, to ensure they don't disappear before the transaction commits. If you start out with 32-bit primary keys, you have a major pain when you go over 2 billion records - incrementally upgrading all your FKs before your PK breaks your referential integrity, since FKs generally need to have the same type on each side.
Never used them in my personal projects either eventhough I learned about them in school. It just didn’t make sense to me, why would I constrain myself in the database when I could manage all of this in the application logic?
And if you want to experiment with different schema it’s just a nightmare.
Foreign keys should definitely be used in most schema designs. While I can agree that they are problematic when doing a schema migration, a schema migration would only happen "very rarely" whereas inserts/deletes happen regularly. I believe this comment is more of a reflection of the specific database (MySQL) which has relatively poor FK performance and adjunct issues when compared to other SQL databases.
Also, from a process perspective of migrating a schema, the ideal mechanism would be to incrementally copy from the existing schema to the new schema (meaning there is no breakage in the interim). Granted (and somewhat obviously) for databases as large as what GitHub is likely using, this is not possible -- and so a "full migration" at once is their only option. However, part of that "full migration", to me, would be to _remove_ the original foreign keys from the schema, rename/delete/alter tables/fields/indices, and then after all of that is complete, as a final step, re-add referential integrity (ie, foreign keys).
I'd specifically like to point out the following:
1) This comment gives advice about all SQL databases based on the conclusion / operation of only one database (MySQL), whereas the same base facts do not apply to others (eg, PostgreSQL, Microsoft SQL Server, ...), and should not be used to evaluate whether using FKs makes sense for other databases or not.
2) Schema migrations should be done rarely, if ever. A better approach would have been to start with a schema design that was less prone to _needing_ change. Regardless, even if schema migrations are done with _regularity_, that frequency would still be orders of magnitude less regular than, eg, code changes or CRUD operations. As such, throwing away FKs simply to accommodate either a poor or non-thoroughly thought-out schema is tossing the baby with the bathwater, especially since the purpose of FKs is to reduce the ability of other tiers in the application from losing or otherwise corrupting the structured data storage.
3) FKs are useful when doing CRUD operations. They are not useful during schema changes. As such, the proper process in this case would be to temporarily remove FKs, perform the schema migration, and then re-institute the FKs. This may not always be possible, due to, eg, active business logic components or some such that can't allow the database to be taken offline to perform a migration (see point #2). In that instance, there are a number of other mitigations that can be put in place in the logic tiers, such as being able to recognize different schema revisions, and deploying that _prior_ to doing a schema change. As such, one would first migrate an active "smart" logic tier that can recognize different schema revisions, and allow for incremental migration of schemas without enacting a cluster-wide lock/disabling of reads/writes.
2b) And keep your schema changes as incremental and backwards-compatible as possible. Add new tables, add columns to tables, but don't drop or rename anything. If you need a flag day, in PG terms just setup a new schema, use triggers to sync the two, and migrate your apps to use the new schema -- when they're all migrated, drop the sync triggers and the old schema.
Dunno why you are getting downvoted. MySQL is such a poor database that is always leads to hacks. The people who use MySQL almost always get burnt by something and figure all DBs are like that. Then people who have never worked with non-toy databases assume all databases are like MySQL and dismiss the whole field, which is a complete shame.
The fact is, a real database system like PostgreSQL is one of the coolest, most underutilized tools developers have in your toolbox. Good databases can do so much cool shit to help you. It is amazing how well a good database system can take your weird crazy query and return something in mere milliseconds.
while most rdbms-based projects will continue using foreign key constraints (and all will be referentially cast in stone), and postgressives will enjoy 100000x query speedups by remembering to include your hornsnaggle_tree_index_lookup_speeder_upper index,
there is a certain fanatical mindset from the rdbms world that sounds a bit like:
never forget
always include jquery with every web app you make.
it's made by real profeshunalz who know more than you and lots of people already included it and you should simply just learn to accept that all front-end web problems have already all been solved by jqueremy and you should simply refactor it all in terms of jqueremy.
anything else is korrupshun.
In other words, always only ever consider the same tools and techniques. LOL
A mere search for other database types produces pages upon pages of "why sql is beating nosql" and all stack overflow answers always doom you to the 7th level of hell for anything but 3rd normalized form. hey let's all use lookup tables for many to many. it's a cult. it's nuts.
anyone who has spent any amount of time with database-backed applications is surely aware of the tradeoffs they are making when choosing non-relational databases of any kind.
The obsession with forcing every all all use cases into the rdbms mindset makes me say "well then, let's just flatten everything even further into mapped key-value a la ohm." in which one encodes the key with path info and the value with the value, etc.
if the act of suffering for the sake of suffering and breaking things into iddy bidd y pieces is virtuous in and of itself, surely this is a more true pure and rigorous religion to bend every single thing to.
(before smarty pants says "um, querying" may I remind you of adding another entire secondary index of any kind via adding it with the id of the first key as it's value.)
I'm not advocating everyone do this. no freakouts pls.
ideally we would just persist our objects, which is coming with persistent memory, but i digress...
This Lance fellow seems to have taken plenty of "you have no idea what you are talking about moron!" abuse, despite having a phd in computation theory and writing erp software since the dinosaurs roamed the earth. (I'm also not advocating his solutions for any and all problems, but not taking his ideas seriously is a mistake IMO)
https://codeburst.io/databases-the-future-buggy-whips-of-sof...
mongodb is an incredibly powerful tool for when you want the whole json, all the json, and packed as bson. When you want to store/dish out blobs and not join nor have no use for the parts separately, then you can do that nicely.
(besides this usual story about "for when you don't know your schema or have lots of irregular semi-structured data, stripe, your mailgun webhooks, whatever. It's a great DB, for some things, despite the people saying "since it didn't do this one thing i really obsess over and think all dbs should always do that therefore its' crap"
(well, copying the same master-slave with single-write master failover or vertically scale story as the rdbms people was probably a weak point...)
pg as a jsonb store with secondary indices on actual columns LOL...
I guess we just basically need to accept that the same tools can be used as primitives/components in different contexts in terms of the actual application.
Some people use Redis as a primary datastore.
zealotry is best avoided, I guess..
look, this silo-ing stuff is also silly, and a product of the corporate dysfunctional mentality:
split everyone up and get DBA's warring with Devs...
oh, I also forgot to add that many applications mix-n-match data components to create a data layer with a standard API that then may then be implemented with various backends for various purposes, each being applications in their own right, etc.
this is seldom discussed in such debates but is actually quite common for many applications.
You have columns that refer to the PKs of other tables, just without constraints. So your invoice table would have a column `user_id`, which tells you which user the invoice is for. But if you enter an invoice for user ID 999,999 when you have no users, the database eats it happily, since it has no opinion as to what that data should be.
At GitHub they should be using CockroachDB (:P), a scalable relational database that also supports schema changes. Then their FKs would work just fine (albeit with some performance implications).
As a counter point to the linked issue, I operate a few small applications. Foreign-keys (and constraints in general) are great at ensuring that invalid data doesn't find its way into your database. Yes, they have a performance cost. Yes, they make sharding more difficult. In my experience, at smaller scale the trade-offs are worth it. YMMV