Surrogate versus Natural Primary Keys

Warning: Random SQL related musings

I broadly agree with this article on primary keys in database design - in effect saying that autoincrement/serial/identity columns are a Bad Thing. There are some cases where I don’t think there is any realistic option though, and some cases where an artificial primary key is more pleasant.

Firstly, the effectiveness of a natural primary key does depend on you having enough data to uniquely identify the records - which is not always a situation you are in. For example, a table with First Name, Last Name, Date of Birth is clearly not going to be enough for uniquely identifying any kind of extensive system, so lets say for something like a medical register you go for: First Name, Middle Name, Last Name, Address, Date of Birth, Date Registered.

This would probably be reliably unique, but isn’t of much use as a primary key - there are too many columns there, and it’s going to be impractical, and a little bloated to pass all that data around. This includes tracking it through client application, and having to include it in any related tables - i can certainly envisage a doctor-patient relationship where there were 10 columns of key.

Given the size of the primary key, I’m pretty sure that the performance penalties in the situation of having an extra unique index aren’t going to be so relevant. I have to say that dismissing the surrogate key as a lazy attempt to get out of typing a “few extra characters” is a sketchy one when it comes to the development - in my experience, development time is as important as performance, and there are plenty of cases when getting something out is more important than getting it quick.

I’ve seen a trade off used where people construct a primary key, usually a varchar of some kind, from the unique data, and then use that as the primary key. Admittedly, it’s not something I’ve tried myself, but it might give you some of the uniqueness benefits without a great big multicolumn key - though that would depend on the function you used to create the constructed key.

To be fair, this is not really disagreeing with what Joshua Drake put in his article - it’s just that for real world applications I think you’re likely to run into more trouble with a poorly thought our natural key than you are with a poorly thought out surrogate key.

3 Responses to “Surrogate versus Natural Primary Keys”

  1. Shogz Says:

    I think I understand what you’re getting at.

    It looks like in some cases, especially the doctor one you mention, that the record basically -is- the primary key.

    The issue you’re going to face there is not so much bandwidth or tossing too much data around but data protection, if you’re passing almost all personal details around as a header within a system, and necessarily to its child applications you’re going to have a lot of likely unencrypted data, being used as an identifier when in fact, it IS that same data which you’re attempting to identify.

    Remind me why you cant take the most specific data, say firstname and dob and just add an incremental number? Is it just sloppy?

  2. Ian Says:

    Yeah, data protection is actual an interesting point, the fact is that there may well be a requirement to hide some uniquely identifying information - though that could be done by an intermediate layer. The incremental number you suggest is basically a surrogate primary key, and that is how an awful lot of databases are designed. The disadvantage is that you end up with a less normalised DB structure.

  3. Stu Says:

    It’s a tricky one, like most development choices there has to be balance between the ‘perfect’ system and the one you can actually implement.

    In Joshuas example you could have seperate out a name table, then first name, last name and have a unique key of first name + last name but thats a lot of hassle for a ‘perfect’ system (otherwise we’d take our dbs to 5nf every time).

    Looking closer I’d agree something is wrong with the default way of creating an incrementing id column, if only because it seems like were re-implementing something the underlying system should (and probably) does already do under the covers.

    I started writing this thinking that performance issues would be a problem, but in practice probably not - the system should internally be passing around integer ids of some sort even if you are acting on text fields.
    Databases are read more than written, so internal id’s would only need recalculation on writing.

    In general databases are good (well postgres:) working with sql it can feel powerful and limiting at the same time (why can’t you pass in arrays/tables for instance).

    Until we work out where were going next were going to be stuck adding id columns in most of our implementations.

    There are interesting choices

    ORM is interesting, but it is building a whole layer on top of sql, so maybe object database are the way to go ?

    CouchDB, looks good too, but I really haven’t looked at this at all yet.

    Both ORM and CouchDB both seem to be trying to make the database like the language you already use (RAILS, django do a sort of similar thing).

    It would be really easy to rant here about the fragmentation of the platforms we have to work with, but luckily somebody already has
    http://steve-yegge.blogspot.com/2006/09/bloggers-block-3-dreaming-in-browser.html
    “M, V and C don’t need to be three separate languages.”

    Right now, edging towards lisp, especially when you can you can effectively write the html, sql, javascript all in it, I think i just need an excuse to actually use it :)

    html in lisp:
    http://www.newartisans.com/blog_files/hunchentoot.primer.php#unique-entry-id-48

    [Keep it SQL.........S++]

Leave a Reply