Database normalization

Database normalization provides a set of easy-to-apply principles that can help the developer structuring its database and improve MySql performance.

Normalizing a database is to organize its data more efficiently thanks to normal rules or principles. For more efficiency, building tables and the relations between the tables will be the result of the normalization process. DB normalization ensures :

  • an optimized structure that does not waste disk space (the same data should not be found in more than one table for obvious reasons : updating several tables for the same data is absurd)
  • logical dependencies (implies storing relevant data inside or related to the right table)

Normal forms

I’ll stick here to the 3 first rules, guidelines that db-literates call Normal Forms. These rules should add enough efficiency to your database. Keep in mind you may sometimes prefer not to apply a rule for practical reasons. Indeed, normalization consists in adding relation tables, for example, that will require the developer to change his MySql queries.

Normal forms must be applied in the right order : further normalisation can only take place if the previous normalisation rule has already been applied.

For more convenience, I’ll take the example of a discography database that will store albums of famous artists.

Creating the database on the fly with no normalization experience could result in the creation of one single table :

records_table :

record_ID : 1 (defined as Primary_KEY)

artist : Fleetwood Mac is a British-American band that formed in 1967…

record_title : The pious bird of good omen

record_style : Blues, Rock, Pop

record_label : blue horizon

record_release : 1969

With this very simple table, one can easily understand that :

  • a SELECT query cannot get information about the artist without a record_ID
  • an UPDATE query that wants to modify the artist biography will have to do so in every row that is associated with that artist
  • a DELETE query will delete the record information as well as the artist information

These limitations may produce anomalies such as data corruption, MySql resource problems (with LIKE subqueries),… So, it’s time to normalize that database and we will do it thanks to the relational theory.

Relational theory

1NF => 1st Normal Form : the atomic form

Each table field should have an atomic value, i.e. only one single value in a column of your table. In the above example, the fields ‘artist’ and ‘record_style’ contain more than one value. This is why applying 1NF will result in the creation of 2 extra tables :

records_table :

record_ID : 1 (defined as Primary_KEY)

record_title : The pious bird of good omen

record_release : 1969

and

artists_table :

artist_ID : 1 (defined as Primary Key)

artist_name : Fleetwood Mac

artist_bio : Fleetwood Mac is a British-American band that formed in 1967…

and

styles_table (with 3 rows) :

style_ID : 1 (defined as Primary Key)

style_name : blues

style_ID : 2

style_name : pop

style_ID : 3

style_name : rock

and

labels_table :

label_ID : 1 (defined as Primary Key)

label_name : Blue Horizon

You’ll easily understand it is now easy to add an artist without adding a record from that artist, you can delete a record without deleting artist information,…

Data is now separated in different tables. We still have to create relations between these tables. Database relations can be of the following types:

1 to 1 : one element of table A corresponds to one element of table B and that element from table B corresponds to only one element of table A

1 to N : a record from the records_table can have only one label but that label may have more than one records. In that case, we add a foreign key to the records table containing the ID of the label

N to N : a record from the records_table can have more than one artist and an artist from the artists_table can have more than one record. In that case, we implement a third table called records_artists_table, a relational table that will link the records_table with the artists_table

So, let’s add a foreign key to the records_table :

records_table : 1 to N relationship

record_ID : 1 (defined as Primary_KEY)

record_title : The pious bird of good omen

record_release : 1969

label_ID : 1

and create the 2 new tables :

records_artists_table : N to N relationship

record_ID : 1

artist_ID : 1

We could also create the same kind of relationship between the records and the styles of music :

records_styles_table : N to N relationship

record_ID : 1

style_ID : 1

2NF => 2nd Normal Form : the relationships to keys

This Normal Form ensures non-key columns strictly depend on the primary key. So, in case of a composite primary key (a row in the table is identified thanks to the record_ID and the author_ID), all non-key columns must depend on the 2 fields, and not one of them only. Imagine the following table structure :

records_reviews table :

record_ID : 1 (defined as primary key with the following field)

author_ID : 1 (defined as primary key with the previous field)

review : This compilation album is one of my favourites…

author_website : http://www.domain.com

In the above example, the field author_website doesn’t strictly depend on the composite key record_ID-author_ID. It only depends on the author_ID. Applying the second normal form will result in moving the author_website field to the author table.

3NF => 3rd Normal Form

With the third normal form, every field must depend on the primary key. In other words, a field that depends on another column than a primary key violates the 3NF. Addresses in a table are a good example of 3NF violation. Imagine the following table structure :
labels_table :
label_ID : 1 (primary key)
label_name : Blue Horizon
label_zip : 1770
label_city : Sherborn
label_state : MA
In this table, the fields label_city and label_state strictly depend on the label_zip field, and not to the label_ID primary key. In order to make this table 3NF compliant, we have to create a zip table :
zip table :
zip_ID : 1770
zip_city : Sherborn
zip_state : MA
Then, the labels table will look like the following :
labels_table :
label_ID : 1 (primary key)
label_name : Blue Horizon
zip_ID : 1770
For some reasons, the zip_state field may also require a specific states table :
states_table :
state_ID : 1 (primary key)
state_name : MA
Normalization as stated above allows you to build a clean database structure but also implies more complexity in your application. Indeed, you’ll have to change your database queries. Your application will deal with more tables, so you’ll have to join tables in your queries, which can result in less performance for your applications.
I would advise developers to be aware of normalization, and to stop normalizing their tables just before the tables structure becomes too complicated. In our example, retrieving one row from the labels table results in creating the following query :
SELECT * FROM labels as A INNER JOIN zip as B ON B.zip_ID=A.zip_ID INNER JOIN states as C ON C.state_ID=B.state_ID WHERE A.label_ID=’1′ LIMIT 1;
Webliography :

MySql Introduction to db normalization

W.J. Gilmore’s introduction to database normalization

W. Kent’s simple guide to 5 normal forms