Natural keys as Primary keys
Any Database designer looking at the post title would cringe. So would I for the next time. This post is a walk through of an example why it is a horrendous idea to do the above. Yup it’s my embarrassing story of bad design.
With the beginning of LeanAgri’s product development, one of the things I have stressed on as extremely important is database design. I am in no way an expert at Data modelling, but I enjoy thinking of how to represent real world data, how we can make it as future proof as possible.
One of my major learnings from Giles, one of the most fantastic engineers I worked with was, Code is dynamic, you can always change it when a new use-case comes up. This doesn’t hold true for data, atleast when you are using relational databases. Relationships are hard to change and migrations can take hours of downtime on production systems.
I don’t see a lot of people talking about data modelling. I think this is one of the most important part of LeanAgri’s architecture that I obsess over. Because a single bad decision can throw you in the gutter in the future.
Natural vs Surrogate keys
Let’s talk about modelling data and what should be the unique identifier for the data (primary key).
A natural key represents something physical. For eg: In a Library management setup, you might be tempted to use Author name and Book name as Primary keys for their respective table. After all, they obviously point to the data correctly.
A surrogate key has no physical meaning but is present as a record in the same row as our data. Examples can be a randomly generated unique number stored in the row containing the Author details.
Thinking about it, it makes more sense to use a Natural key against a Surrogate key when our data is unique. Let’s dwell further into the real world example from LeanAgri architecture, which will show you why this might not be a good idea:
The requirement
The problem was to model a list of fertilisers in our system. I decided to take the name of the fertiliser as the primary key because why not? A fertiliser must be represented only once in our model and a fertiliser’s name is a good enough identifier for representing itself.
Hints that something was going wrong
With time, we had to build a translatable system, where we could translate the fertiliser name in different languages. Now what should be the primary key of this model? Should we have a key defined by the name in each language?
When I decided that it was a blunder
One of my colleagues was adding fertiliser information and mispelled a fertiliser name. We had used this object in various relationships already. There’s absolute no way to change this name without running a query that also replaces all Foreign key relationships.
What’s the right way?
I would quote this Stackoverflow answer :
The formal primary key should always be a surrogate key. Never anything else. Everything is changeable, even when the users swear on their mothers’ graves that the value cannot be changed is is truly a natural key that can be taken as primary. It isn’t primary. Only surrogates can be primary.
(These aren’t my words, but wake me up from deep sleep and you would get this response from me.)
Fixing it (Changing to a model with no Natural keys)
It’s definitely possible to fix it, but doing it 2 years later with millions of records is painful, so do it whenever you realize that you messed up.
There’s no quick fix way to just replace your Primary key(since it is referenced in other tables). So you need to create a new table and add references to the new table’s primary key everywhere. Here’s what my migrations in Django looked like.
Step 1
– Take a Backup!!!! Even if you do all steps below correctly, you(Murphy) WILL screw something up.
Step 2
– Created a new table for Fertilisers(this time with name as unique but not a Primary Key!)
– Write a migration script which copies data from the existing model to the new model.
Step 3
– Duplicate all relationships for the existing model (ForeignKey and M2M relationships)
– Write a migration script which copies data from existing relationships to new relationships.
Step 4
– Finally remove the original model and all the relationships defined on that model.
Fixing this took away a weekend, but I believe it was worth the saved hours trying to live with the problem.