A Simple Guide to Five Normal Forms in Relational Database Theory

novum · on Aug 1, 2010

I do a lot of interviews for software developers, and I ask a very simple database normalization question that close to half of the candidates miss completely.

Something like: "You're tracking doctors, hospitals, and patients. Doctors can see multiple patients, and doctors can staff multiple hospitals". All I'm looking for you to say is that you'd put junction tables between the main entities.

It's so, so crucial that developers understand the basics of database design, and normalization is a huge part of that.

Groxx · on Aug 1, 2010

In this case, the many to many relationships should be obvious, but I feel I must point out:

Based on that text specifically, all that's needed is that the patient and hospital each have a doctor ID column. How many would get it if you specifically pointed out that they need to consider real-world circumstances in this, and / or if you enumerated all the requirements?

Many many many times, when stating a hypothetical question, real-world handling responses are not desirable to the questioners. If the interviewee realizes this, they also realize that they risk losing points no matter how they answer: answering based on the specified requirements fails with you, but answering based on real-world uses fails with people testing against over-engineering / people religious about specific answers to specific questions. And being verbose risks both groups, based pretty much solely on the interviewer, of whom they have effectively no knowledge.

Or have you done this, and it just wasn't included in the example above?

skybrian · on Aug 2, 2010

There's no sure way to win with everyone, but usually you get more points by asking the interviewer what sort of answer he or she is looking for than by blindly making assumptions. Part of the point is to see whether people have good communication skills and can deal with ambiguity.

Groxx · on Aug 2, 2010

Certainly. And if that's part of the test, then by all means do that.

But if it's not, you might be artificially deflating some fantastic people because a) they guessed wrong, and b) you're not aware of your implicit expectations. Be aware, it's in your best interest in many many ways.

SkyMarshal · on Aug 2, 2010

In what real-world instance/s would you prefer a doctor id column in Employee and Hospital tables to junction tables?

You'd either end up with the same employee or hospital records listed multiple times with different doctor ids (lots of repeated data), or with a field in each employee or hostpital record containing a list of doctor ids (searching by or joining on doctor id requires less simple SQL).

I would think the obvious answer to someone familiar with relational databases would be the OP's expected answer - normalize all three tables, then use junction tables to enable many-many relationships.

What am I missing?

Groxx · on Aug 2, 2010

No instances (that I'm aware of).

What you're missing is that the question-as-stated does not necessarily reflect the real world. The real world does need junction tables. A fully-correct answer to the question-as-stated does not. Change the nouns to anything you desire; the stated problem remains precisely the same, but for this interviewer the answer is different. But not for all interviewers - some do want answers off only what was stated, not off what baggage our language and culture applies to those words.

The question is an artificial, massively-simplified hypothetical problem. What was removed in simplification cannot be inferred with accuracy unless it is stated, and depending on the interviewer, there is always a risk that your answer, however accurate, may be "wrong". It's a guaranteed loss (statistically) for the person being interviewed, and thus most likely for the interviewer as well.

zzzeek · on Aug 4, 2010

thank you, that's the first thing I thought. "doctors can have multiple patients, OK, so patient->doctor_id right ?"

the way interviewers flip between "real world answer" and "assume X, Y, Z, Q, P, R don't exist answer" drives me batty. Write a union of two sets ? Sure. x.union(y). Oh now its "hypothetical" time, right lets break out the pointers..

novum · on Aug 1, 2010

I usually include some extra details about each entity to imply that each should be its own table.

Groxx · on Aug 2, 2010

Ah, sorry, I left that part vague. I meant 3 tables, not just two. The point I was trying to make was that your correct answer required junction tables; other interviewers' correct answers require _only_ doctorID, because that's all your text required. And still other interviewers dock points for answering "cleverly" and covering both bases.

Without explicit context, how is the interviewee to know which the interviewer wants? They risk a lower review score no matter how they answer.

arethuza · on Aug 2, 2010

Asking the interviewer for more information? The ability to ask questions is far important than whether anyone can remember the details of database normal forms!

tbrownaw · on Aug 1, 2010

Nice sanity test buried in there too, how many candidates try to put a DoctorID field on the patient and hospital tables (ie, assume a patient can only see one doctor, or a hospital only has one doctor working there)?

timwiseman · on Aug 2, 2010

As a senior DBA that has done many interviews for more junior DBA/database developer, I too have been surprised by how many candidates cannot answer a question like that at all.

With that said, many developers (other than database developers) really do not need to understand that. They don't need to understand it because they will normally have a good DBA/database developer that handles that part for them. In fact, large projects may have many specialized DBA roles including Architects focused specifically on schema design.

djb_hackernews · on Aug 1, 2010

devils advocate, but why are you asking database engineer questions during a software developer interview?

brunoc · on Aug 2, 2010

Wouldn't you agree that most programmers will have to use databases and should at least have a basic working knowledge of how they work?

Database specialists usually end up with very specific knowledge of a particular platform, for example Oracle where there are endless ways to tweak for performance, or replication or whatever. Modeling with RDBMS is usually very straightforward and ends up being only a minuscule fraction of what you do day to day.

djb_hackernews · on Aug 2, 2010

programmers? While I hate the word I would agree some basic working knowledge is necessary. But if you post a job description looking for a php programmer, and then you bring people in and start asking questions about schema design where even a correct answer isn't correct because now you also want your programmer to be a software designer who does requirements gathering as well it's your own fault for expecting to be interviewing architects instead of programmers.

If you are looking for someone to do schema design, that damn well better be in the job description given to candidates before they apply, and your software engineers damn well better be compensated for the extra required skill.

Confusion · on Aug 2, 2010

Because I've never had a database engineer available to create my schemas for me. And frankly, I've never had a problem designing or maintaining a schema: it's not rocket science. Writing queries and getting performance out of the DB, that can be rocket science.

strlen · on Aug 2, 2010

> Writing queries and getting performance out of the DB, that can be rocket science.

I tend to ask questions more about database internals, but it's a crucial job for an engineer to understand how to get performance and integrity for their queries.

Sometimes times you may get world-class DBA/DBE (my current employer is one of those places) who can do an engineer's job (in terms of tuning the queries, etc...) but that is really an engineer's job.

Shorel · on Aug 2, 2010

It totally depends on the use you plan for the DB.

For a DataWarehouse, you probably want to de-normalize the DB if you expect good performance when calculating those cubes.

kingkilr · on Aug 2, 2010

Gah, I've been in this business too long (I'm 19, WTF?), my first question is, can doctors be patients as well?

prodigal_erik · on Aug 2, 2010

I would strongly urge a doctor who gets sick to either use a pseudonym or go to some other hospital at which he's not listed as a doctor. MUMPS code and its schemaless shared-everything database (for medical records!) is terrifying enough all by itself before you give it edge cases like "this doctor is a patient".

http://en.wikipedia.org/wiki/MUMPS

Confusion · on Aug 2, 2010

No, that would not be a good thing to support. There can be Doctors and Patients that refer to the same real person, but you don't normalize these entities with respect to a single Person column. It's just not worth it, because you too rarely care that a Doctor can also be a Patient. In such a case, if it poses a problem (only when Patient is assigned himself as Doctor?), it will sort itself out in The Real World.

nradov · on Aug 2, 2010

The HL7 RIM actually allows for this. A Person can have multiple roles including Patient and LicensedEntity (doctor). http://www.hl7.org/v3ballot/html/infrastructure/rim/rim.htm

Confusion · on Aug 2, 2010

True, this example of doctors and patients is just the wrong sort of example, because that world has already been standardized into oblivion. However, in general, for your relatively small app for company X, you wouldn't go this far. In general, the fact that a supplier can sometimes also be a client is usually not very interesting.

ciupicri · on Aug 2, 2010

Interesting question, but does it really matter? You're not tracking persons in general like a government would do, you're only tracking staff and clients. It would matter that a doctor is also a patient only if he/she would have some kind of preferential treatment, e.g. a discount or a better room.

kingkilr · on Aug 2, 2010

Or if you're storing information about them, say their home address, better to have it in one place than two.

lolipop1 · on Aug 2, 2010

No, simply because they are from different domains and by extension from and for different systems. And as such, a natural ID (or arguably a GUID) can also be said to be more fitting to establish that kind of relation.

Another1 · on Aug 1, 2010

junction! really! i looked up this word online and only found this http://en.wikipedia.org/wiki/Perl_6#Junctions

the relational term would be relation, please stop interviewing ppl for jobs you are not qualified

brown9-2 · on Aug 1, 2010

Wouldn't the term "junction" tell you what you need to know about this table?

"Junction", "join table", "intersection table" - really, you are going to judge someone's competency based on which flavor of a synonym they use?

giardini · on Aug 2, 2010

Why downmod him so? His implication that the term "junction table" is nonstandard terminology is correct.

Despite 45 years in IT and a good 25 years of relational databases, I don't recall having heard (or read) the term "junction table" in a technical discussion. A cursory Google search indicates it may be recent terminology from the Microsoft corner of the universe.

Confusion · on Aug 2, 2010

  Why downmod him so?

Probably because many here don't like someone badmouthing another over something so small.

arethuza · on Aug 2, 2010

I don't think I've ever heard the term either. However, in the context I don't think it's that difficult to work out what is meant.

sandee · on Aug 1, 2010

http://en.wikipedia.org/wiki/Junction_table

sovande · on Aug 2, 2010

You just added that to Wikipedia now didn't you :-)

epochwolf · on Aug 2, 2010

Easiest edit: 18:07, 18 February 2006

mbateman · on Aug 1, 2010

You "looked the word up online" just now and feel qualified to judge whether or not someone is a qualified interviewer?

EDIT: Meh, I have been trolled.

Groxx · on Aug 1, 2010

I hadn't seen the Perl6 junctions before... that's an utterly awesome language feature o_O

I really need to take a close look at Perl 6 some time.

btilly · on Aug 2, 2010

You understand everything most developers should ever need to know about normalization if you get the following two sentences.

The key, the whole key, and nothing but the key. So help me Codd.

(For those who don't get it, the first sentence describes first, second and third normal forms, and the second sentence names the guy who is responsible for most of the theory behind them.)

sovande · on Aug 1, 2010

Thanks, nice to get a refresher on the fifth normal form. Though in practice I have yet to see anything behind the third normal form used; "The Key, the whole Key, and nothing but the Key, so help me Codd"

ergo98 · on Aug 2, 2010

Almost everyone honours the 4th normal form without thinking about it. I think the "the 3rd normal form is the practical level" just became a bit of a meme.

slim · on Aug 1, 2010

Don't take it for granted.

With modern software design and especially OOP the risk of data inconsistency is less of a problem (I do web apps for entreprise and it simply never happened to me).

On the other hand duplication of information is a great way to scale an application. And incidentally offers data security since you can cross check your data in case of corruption.

ergo98 · on Aug 2, 2010

>With modern software design and especially OOP the risk of data inconsistency is less of a problem

This sounds mighty hand-wavy. How does "modern software design" reduce the risk of data inconsistency?

>On the other hand duplication of information is a great way to scale an application

Most relational systems have facilities for duplicating information to scale. Materialized views, for instance, generate vile offenses to all normal forms but perhaps the first, hyper optimized for consumers, but it is guaranteed coherent and consistent, and happens with barely any work.

Seriously, we've been solving these performance issues for years. Every time some, failing a better word, noob writes up their big internet paper on why the relational model fails (with "modern" software, which is chuckleworthy), the world gets just a little bit dumber.

The bit about checking data for corruption is just disturbing.

slim · on Aug 21, 2010

> How does "modern software design" reduce the risk of data inconsistency?

By grouping code by business logic, code manipulating the same set of tables tend to land in the same place. Sometimes it even gets refactored.

> Seriously, we've been solving these performance issues for years.

Who are you?

Another1 · on Aug 1, 2010

If you want to learn the relational model using a fairly slim book, i cant recommend "Practical Issues in Database Management" by Fabian Pascal enough for you!

And always remember, relational theory is one thing, and popular RDBMS is another.

maxdemarzi · on Aug 2, 2010

Software developers and SQL developers can both usually figure out how to get what they want out of the database...

The problem is the software developer may rely on loops ( and variables, and cursors and temporary tables and dynamically generated sql, the horror) to do his thing while the SQL developer will use set theory to avoid as much of that as possible.

They both rely on what they know, it's just that one persons knowledge is better suited for programming and the other is specific to relational databases.

timwiseman · on Aug 2, 2010

You are entirely right, but permit me to respectfully add a few points I think are critical.

First, the two domains of knowledge are not mutually exclusive. Almost all SQL experts/DBAs are also able to do many forms of more conventional programming, and do them properly. Similarly, many software developers know SQL reasonably well and can write good set based code when they choose too.

Second, if working in the other domain, you pay a hefty price for not approaching that domain on its own terms. Using loops and cursors in MS SQL Server is enormously less efficient than the same solution done in sets (Oracle has a slightly narrower gap, but still a gap. I suspect the same is true of all RDBMS but I can only speak to those from experience). On the flip side, if a normal database developer is shoe horning data into an RDBMS that is not relational, that will cause complications as well.

grk · on Aug 2, 2010

I've always thought that 5NF is a bit overkill. Didn't find one decent example that would make me say "oh, so that's why they do it".

I must say though, after the Database Systems course at DTU, when I come up with a DB design it's almost always in 4NF to start with :). Man, I hated this course.

alan-crowe · on Aug 2, 2010

I know so little about relational databases that I'm worried about being misled by an unreliable guide found on the internet. Reading suspiciously, this guide seems garbled.

Section 3, second and third normal forms, says "Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form."

There seems to be a stray word "us". Ignoring that, this is cute word play that doesn't quite make sense. If your table has non-key fields you are inevitably providing facts about the non-key fields.

Continuing,

"We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee."

This is the wrong way round. Usually there are lots of employees and a few deparments. Each employee works for just one department, but each department has many employees. Thus the "department of an employee" is a many-to-one relation, or function, which takes an employee and yields a department. The one-to-many relationship here is the employee list of a department.

The example for 3.2 seems to be opening the wrong can of worms. Suppose that Mr Strauss, who works for the department of waltz in Vienna, is seconded to the department of piety in Rome, in order to teach them some dance steps. Then we want his row in the database to read

(Straus, Waltz, Rome)

So one can of worms is sticking generic labels on your fields. If you label your fields (Employee, Department, Employee-location) there is no problem. If you label your fields (Employee, Department, Department-location) you have a problem, but it is obvious. If you label your fields (Employee, Department, Location) you are heading for trouble as some users of the database fill in the location of department and other users of the database fill in the location of the employee.

Hmm, second and third normal form are suspiciously similar, differing only because we regard some fields as belonging to the key. Is the article trustworthy?

cmurphycode · on Aug 1, 2010

I studied this summary before my most recent interview. It's a little abstract (in terms of lacking motivation/examples) for those who don't have experience in the field, but if you read it enough times, you get a decent handle on the idea.

eru · on Aug 3, 2010

I don't have any experience in the field, but I did not find it too abstract. Rather not abstract enough. (But on the other hand, I am a mathematician, and know about relations in theory.)

dennisgorelik · on Aug 2, 2010

Did anyone notice how many grammatical mistakes that article has?