Natural Keys vs. Surrogate Keys

Posted by lec** on Tuesday, January 15 2013 @ 12:48:16 GMT

Natural keys vs. surrogates. The topic of much debate among database guys and gals. My own conclusion is that natural keys can be used, but surrogates should be used by default. The only time I was sure about using natural keys is for a database of my own compositions, where every successive composition belongs to a fixed catalog and is assigned an ID equal to the last assigned ID, plus one. In this way, compositions can be recorded in the database when they are finished, but their key still reflects their chronological order of coming to existence which the machine database could not perceive if it was allowed to create keys for them.

The problem with certain natural keys is authority. For example, unlike my example above with composition IDs, there is no central authority that must be contacted when issuing an ISBN.

Sure, if you want a valid ISBN you're going to have to go to a publisher who can assign a unique one to your book. But if you publish a book privately, it's entirely possible (however horrific the idea) to not give it an ISBN at all, or give it a fake one just to make people squirm.

What about books published before the ISBN was introduced? If you want those in a database, you'll have to enter something for the ISBN if it is a primary key. Also, there have been cases when dupicate ISBNs were issued accidentally.

The underlying problem is that the ISBN is not really a natural key. Clearly, if not every book has one and if they aren't truly guaranteed to be unique, it is not a key.

The ISO 3166-1 alpha-2 specification allows for 26^2 possible country codes. Granted, almost half of them are ununsed even now - and if every country in the world split up into two countries, database normalization would probably be somewhat overshadowed by the ensuing chaos - so it seems like enough, but the argument of natural keys vs. surrogate keys isn't a pracical one. The makers of IPv4 thought that 32-bit addresses would never be used up, and nothing could be further from the truth. The natural vs. surrogate key argument is a philosophical one.

Clearly, for any entities that don't have a fixed count in space and time, anything that is not an ascending (i.e. uniquely changing) sequence of numbers maintained by a central authority is not guaranteed to be a primary key for them. Obviously, using the so-called "natural" or "surrogate" keys is personal choice and judgement that carries responsibility.

Natural keys are easier to read and understand for humans and sometimes bring certain benefits even to the machine-stored database. Often though, such keys are rare or are tricky cases (like the ISBN) because they look like they SHOULD be unique natural keys, but are not due to human error or bad design.

Either way, unless you are getting unique codes for something very specific, very manageable from a very reliable, fully trusted centralised source (like release ID numbers from a music label that employs a database administrator with a crippling obsessive-compulsive disorder), or unless you are otherwise absolutely sure you will always get a unique natural key with every such entity, it is safe to assume all natural key candidates are in fact just possibly-unique attributes.

Yes, I agree it would be wonderful if every table of books in the world would have the same unambiguous ID and we could use them to always uniquely identify a book -- but real-world data is unpredictable and random. Unless you can actually find a source that guarantees at the pain of death that something will be unique, you won't really be getting a true natural key, but a fake attribute that can "usually" be a key.

There is a known problem in the database world that applies to normalization that also applies here: going too far. Perhaps the whole point is moot because either scenario works. The idea of this whole text was really to provoke you into thinking about natural keys before you use them, both theoretically and practically, because they are often less suited in both regards compared to a nice integer sequence.

The Big Conclusion
Essentially, if you actually had a natural key for some entity, using it would be fine, probably preferable to an artificial key. However, don't trust that something really is a key just by looks alone. Maybe it's a key now, and tomorrow it won't be. Relying on your own database to create keys is a compromise you should be prepared to accept, perhaps not willingly, but more willingly than using a broken so-called natural so-called key.