How to Normalize Relational Databases With SQL Code?

Sanjana Bhakat 31 Oct, 2023 • 9 min read

Introduction

Data is the new oil in this century. The database is the major element of a data science project. To generate actionable insights, the database must be centralized and organized efficiently. If a corrupted, unorganized, or redundant database is used, the results of the analysis may become inconsistent and highly misleading. So, we are going to learn how to normalize an RDBMS with the help of SQL programming. This article attempts to explain how to remove inconsistencies in relational databases with SQL using normalization and a dummy database, for example. Before delving deeper, one must know a few related terms.

Learning Objectives

  • Know why normalizing a database is important
  • Understand the normal forms used in the industry
  • Learn how to convert a given database into the desired Normal Form using SQL.

This article was published as a part of the Data Science Blogathon.

What is a Relational Database?

Relational Database SQL

Imagine a retail store wanting to store and manipulate data on their sales. Ideally, this data will contain the Item Code, Item Name, Price of each unit, Number of units sold, Category, and Date of Purchase. So, all these attributes are the names of the columns, and a particular row contains data on the purchase of a particular product on that day. Hence, here interrelated data has been collected and stored in rows and columns. This is what we call a “Relational Database.” An RDBMS (Relational Database Management System) is used to store this data.

If we go for a formal definition of a database, it can be said that it stores interrelated data from which users can efficiently retrieve information. Relational databases are structured, self-describing, and storing data according to predefined relationships.

But real-time data stored in this model can be messy most of the time, making it inconsistent and hard to work with. So, to easily manipulate and work with these data tables, we normalize the database.

Terms that are Commonly used in an RDBMS

Now, let’s delve deeper into the practical aspects of normalizing a database. But first, we must know a few terms to help us on the journey.

  • Data Dictionary: Contains all information about the objects in a database.
  • Metadata: Contains information about what the data stores, sources of data, etc.
  • Schema: A Schema is the internal design of the database, i.e., how the data looks at different levels to users.
  • Primary Key: Primary key is that attribute used to identify a tuple or row in a database uniquely. If it contains more than one attribute, it is called a composite key. It cannot contain redundant or null values.
  • Candidate Key: The candidate key is the attribute or combination of attributes that can uniquely identify a tuple.
  • Super key: A Super key is a set of attributes used to identify a tuple in a related database uniquely. This being said, each attribute in the Super Key may contain repeated or redundant values (which is not in the case of primary).
  • Foreign Key: A foreign key is a primary key of another table that becomes an attribute or field in a different table.
  • Functional Dependency: A functional dependency is a relation or a specific constraint between two attributes or two sets of attributes in a database. A —>B means that B depends on A or a functional dependency exists between B and A.

For more information on terms related to the type of database used in this article, click here.

This article mainly focuses on how to normalize a database. But first,

Why and When Should we Normalize a Relational Database?

Why and When Should we Normalize a Relational Database using SQL

We normalize a database because:

  • It makes it easier to organize the database,
  • It sequentially decomposes tables and hence, reduces data duplication,
  • It reduces the storage space taken by the database,
  • It removes specific anomalies and makes updating, deleting, and inserting data easy.

We normalize the data if:

  • The database is messy and takes up much storage space.
  • Updating or deleting a record from the database leads to inconsistency
  • The database has multiple values in a single cell.
  • No such attribute or column is present in the database that can help identify every tuple uniquely.
  • Duplicate or repetitive values are present in the database.

How to Convert a Relational Database to Different Normal Forms?

So, let’s use a practical example to understand how normalization works. Consider a hospital that maintains its database in the form:

How to Convert a Relational Database to Different Normal Forms with SQL
 
Step 1: First, you need to identify the relationships that exist between each attribute of the database. Here the relationships are:

One-to-one:

  • Patient_Id and Patient_Name
  • Disease_Id and Disease
  • Test_Id and Test

One-to-many: 

  • Category and Disease_Id/Disease
  • Category and Doctor_Name

Many-to-one:

  • Doctor_Name and Category

Many-to-many:

  • Doctor_Name and Patient_Id/ Patient_Name
  • Test and Patient_Name/ Patient_Id

Step 2: Once you know the relationships between the attributes, you need to normalize your data. To do this, you must convert your data to the best Normal Form (NF) that suits your business purpose. This article will look at normalizing a database using SQL (Structured Query Language). Click here to learn the basics and more about the SQL syntaxes used.

We can create the table in PostgreSQL using the following SQL code:

create table Patient_Information(
Patient_id varchar(10) NOT NULL,
P_Name varchar(100),
Age integer,
Marital_Status varchar(10),
Gender varchar(10),
Disease_Id varchar(10),
Disease varchar(100),
Category varchar(100),
Doctor_Name varchar(20),
Test_Id varchar(10),
Test varchar(40),
Price float,
PRIMARY KEY(Patient_Id)
);

Note: All SQL codes mentioned further in the blog are extended from those mentioned earlier.

  • First Normal Form or 1NF

To convert the first table to 1NF, we must ensure that every column consists only of a single data, and the attribute domain must not change. So, the relationship will become:

Patient_Information:

1200

If the database is not in 1NF, then the design is considered a bad database design. So, one must certainly apply
transformations to bring a database to 1NF before applying further normalization techniques.

Now we insert data into the table using the code:

INSERT INTO Patient_Information
VALUES('P001', 'Anna Smith', 18, 'F', 'HYTSH',
'Hypothyroidism', 'Dr Rohit Shaw', 'T4', 'Thyroxine', 400);
  • Second Normal Form or 2NF

To convert into 2NF, the relation must be in 1NF, and no partial dependency should be present.So, let’s consider the table as follows:

Patient_Id Disease_Id Test_Id Test Doctor_Name
P001 HYTSH T4 Thyroxine Dr. Rohit Shaw
P001 HYTSH TSH Thyroid Stimulating Hormone Dr. Rohit Shaw
P002 HRTBLCK ECG Echocardiography Dr. Rima Dhara
P002 HYPTSN BP Blood Pressure Dr. Rima Dhara
P003 PCOD USG Ultrasonography Dr. Shalini Thakur

So, this part of the table stores the disease name with the disease’s id and the doctor’s id who specializes in treating that specific disease.

  • The first advantage we get here is that we need not store long names of diseases in the patient’s information table.
  • Secondly, by knowing the disease’s id, we can easily find which doctor is or can be allocated to the patient.
  • Primary key: Patient_id+Disease_id

But we see that the doctor’s name only depends on the disease’s id, not the patient’s. Hence a partial dependency will exist. To remove this partial dependency, we divide the table into two:Patient_Info:

Patient_Id Disease_Id Test_Id Test
P001 HYTSH T4 Thyroxine
P001 HYTSH TSH Thyroid Stimulating Hormone
P002 HRTBLCK ECG Echocardiography
P002 HYPTSN BP Blood Pressure
P003 PCOD USG Ultrasonography

Disease_Info:

Disease_Id Disease Doctor_Name
HYTSH Hypothyroidism Dr. Rohit Shaw
HRTBLCK Heart Blockage Dr. Rima Dhara
HYPTSN Hypertension Dr. Rima Dhara
PCOD Poly Cystic Ovarian Syndrome Dr. Shalini Thakur

We can do this in SQL using the code:

CREATE TABLE Patient_Info ASSELECT Patient_Id, Disease_Id,Test_Id,
Test
FROM Patient_Information;
CREATE TABLE Disease_Info AS
SELECT Disease_Id, Disease,
Doctor_Name

FROM Patient_Information;
ALTER TABLE Patient_InfoADD PRIMARY KEY (Patient_Id, Disease_Id, Test_Id);
ALTER TABLE Disease_InfoADD PRIMARY KEY (Disease_Id);

  • Third Normal Form or 3NF

The database or relation is in third normal form if it is in 2NF and no transitive dependency is present.

Consider this part of the main table:

Patient_Id Disease_Id Disease Test_Id Price
P001 HYTSH Hypothyroidism T4 400
P001 HYTSH Hypothyroidism TSH 350
P002 HRTBLCK Heart Blockage ECG 1100
P002 HYPTSN Hypertension BP 50
P003 PCOD Poly Cystic Ovarian Syndrome USG 1200

Now, the table stores more information:

  • Primary key: Patient_id+Disease_id.
  • But the attribute Price depends on the attribute Test_Id. Both of them are non-prime attributes. Hence, a transitive dependency exists.

To remove it and convert the table to 3NF, we follow the approach:Patient_Info:

Patient_Id Disease_Id Disease
P001 HYTSH Hypothyroidism
P001 HYTSH Hypothyroidism
P002 HRTBLCK Heart Blockage
P002 HYPTSN Hypertension
P003 PCOD Poly Cystic Ovarian Syndrome

Test_Price:

Test_Id Test Price
T4 Thyroxine 400
TSH Thyroid Stimulating Hormone 350
ECG Echocardiography 1100
BP Blood Pressure 50
USG Ultrasonography 1200

We can achieve the following using the SQL code:

ALTER TABLE Patient_InfoDROP COLUMN Test_Id,
Test;
ALTER TABLE Patient_InfoADD COLUMN Disease; INSERT INTO Patient_Info(Disease)
SELECT Disease,

FROM Disease_Info;
CREATE TABLE Test_Price AS
SELECT Test_Id,
Test,
Price
FROM Patient_Information;
ALTER TABLE Test_PriceADD PRIMARY KEY(Test_Id);

  • Boyce-Codd Normal Form or 3.5NF or BCNF

For the database to be in Boyce-Codd Normal Form, it must be in 3NF. Also, in BCNF, A must be a super key for a functional dependency from A to B. Consider the part from our main table:

Patient_Id Category Doctor_Name
P001 Endocrinology Dr. Rohit Shaw
P001 Endocrinology Dr. Rohit Shaw
P002 Cardiology Dr. Rima Dhara
P002 Cardiology Dr. Rima Dhara
P003 Gynecology Dr. Shalini Thakur

Here, we see that:

  • Primary Key: Patient_Id+Category (since it can uniquely identify all the tuples in the table). That means the prime attributes here are: Patient_Id and Category.
  • But a dependency Doctor_Name —->Category exists, meaning a prime attribute depends on a non-prime attribute. This type of dependency is not allowed in BCNF.

So, to convert the relation to BCNF, we divide the table into two:Patient_Doctor:

Patient_Id Doctor_Id
P001 D110
P002 D200
P003 D201

Doctor_Info:

Doctor_Id Doctor_Name Category
D110 Dr. Rohit Shaw Endocrinology
D200 Dr. Rima Dhara Cardiology
D201 Dr. Shalini Thakur Gynecology

SQL Code:

ALTER TABLE Patient_Information
ADD COLUMN Doctor_Id DEFAULT Dr;
UPDATE Patient_Information
SET Doctor_Id='D110',
WHERE Doctor_Name='Dr. Rohit Shaw';
UPDATE Patient_Information
SET Doctor_Id='D200',
WHERE Doctor_Name='Dr. Rima Dhara';
UPDATE Patient_Information
SET Doctor_Id='D201',
WHERE Doctor_Name='Dr.Shalini Thakur';
CREATE TABLE Patient_Doctor AS
SELECT DISTINCT Patient_Id, Doctor_Id,
FROM Patient_Information;
ALTER TABLE Patient_Doctor
ADD PRIMARY KEY(Patient_Id, Doctor_Id);
CREATE TABLE Doctor_Info AS
SELECT DISTINCT Doctor_Id,Doctor_Name,Category,
FROM Patient_Information;
ALTER TABLE Doctor_Info
ADD PRIMARY KEY(Doctor_Id);
  • Fourth Normal Form or 4NF

For a database to be in 4NF, it must be in BCNF, and no multivalued dependency should be present.

Consider another part of the main table:

Patient_Id Marital_Status Disease_Id
P001 U HYTSH
P001 U HYTSH
P002 M HRTBLCK
P002 M HYPTSN
P003 U PCOD

Here, there is no relation between Marital_Status and Disease_Id.Both are independent of each other. Hence, a multivalued dependency exists. So, we divide the table as follows:Patient_MaritalInfo:

Patient_Id Marital_Status
P001 U
P001 U
P002 M
P002 M
P003 U

Patient_Disease:

Patient_Id Disease_Id
P001 HYTSH
P001 HYTSH
P002 HRTBLCK
P002 HYPTSN
P003 PCOD

SQL Code:

CREATE TABLE Patient_MaritalInfo AS SELECT DISTINCT Patient_Id, Marital_Status, FROM Patient_Information;ALTER TABLE Patient_MaritalInfoADD PRIMARY KEY (Patient_Id),CREATE TABLE Patient_Disease AS SELECT Patient_Id, Disease_Id, FROM Patient_Information;ALTER TABLE Patient_DiseaseADD PRIMARY KEY(Patient_Id,Disease_Id);

Benefits we get from Normalizing this Data

The benefits we get from normalizing the dummy database are:

  • We now have tables containing patient attributes, doctor’s attributes, disease attributes, and test attributes separately.
  • Also, tables show the relationship between those attributes’ tables using foreign keys.
  • Hence, our database is more organized, and the relationships between each table are well-defined. This makes the data more efficient for analyzing, updating, and retrieving. For example: If the hospital has to update the age of a particular patient, then that record needs to be updated only in the table containing the patient’s attributes. Once that is done, this record will automatically be updated for all the related tables since they are related or modeled using foreign keys (these constraints can be declared in SQL).

Hence, now we can see how a messy database can be reduced to these normal forms. Normalization helps reduce space and makes updating, deleting, and working with the data easier. Well, there are other normal forms viz:

  • 5NF: Here, the relation must be in 4NF, and no join dependency should be present.
  • 6NF: Here, the relation must be in 5NF, and all join dependencies must be trivial i.e., the relation becomes irreducible.

But most databases can be organized with forms up to 4NF. Based on the business purpose, it is up to you to decide which normal form suits you best.

Conclusion

That’s it! You are all done with organizing your database. Now your database is efficient enough for further analysis. You can further transfer this database to a DBMS or from the SQL server; you can interact with your data using queries to meet the business requirements.

Here are some key takeaways from the blog:

  • Updation anomaly, insertion anomaly, deletion anomaly, and data redundancy make a database inefficient for analysis and querying. Hence, we use normalization to organize related data efficiently and eliminate anomalies.
  • Five major types of normal forms are commonly used in the industry: 1NF, 2NF, 3NF, BCNF, and 4NF. Normal forms like 5NF and 6NF are not commonly used in the industry. The normal forms must be used depending on the context.
  • Normalization helps organize interrelated data tables by decomposing them into sub-components or small tables based on constraints. This eliminates certain dependencies.
  • One of the best ways to achieve a normalized database is by using SQL (Structured Query Language). In this blog, you can see how one can normalize and decompose the dummy database using queries in SQL.

I hope this blog helps in your learning journey. Meanwhile, you can check more articles related to databases here.

Have a great learning ahead!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 

Sanjana Bhakat 31 Oct 2023

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers

Clear