How to Normalize Relational Databases With SQL Code?

Sanjana Bhakat 31 Oct, 2023 • 9 min read

Introduction

Data is the new oil in this century. The database is the major element of a data science project. To generate actionable insights, the database must be centralized and organized efficiently. If a corrupted, unorganized, or redundant database is used, the results of the analysis may become inconsistent and highly misleading. So, we are going to learn how to normalize an RDBMS with the help of SQL programming. This article attempts to explain how to remove inconsistencies in relational databases with SQL using normalization and a dummy database, for example. Before delving deeper, one must know a few related terms.

Learning Objectives

Know why normalizing a database is important
Understand the normal forms used in the industry
Learn how to convert a given database into the desired Normal Form using SQL.

This article was published as a part of the Data Science Blogathon.

What is a Relational Database?

Imagine a retail store wanting to store and manipulate data on their sales. Ideally, this data will contain the Item Code, Item Name, Price of each unit, Number of units sold, Category, and Date of Purchase. So, all these attributes are the names of the columns, and a particular row contains data on the purchase of a particular product on that day. Hence, here interrelated data has been collected and stored in rows and columns. This is what we call a “Relational Database.” An RDBMS (Relational Database Management System) is used to store this data.

If we go for a formal definition of a database, it can be said that it stores interrelated data from which users can efficiently retrieve information. Relational databases are structured, self-describing, and storing data according to predefined relationships.

But real-time data stored in this model can be messy most of the time, making it inconsistent and hard to work with. So, to easily manipulate and work with these data tables, we normalize the database.

Terms that are Commonly used in an RDBMS

Now, let’s delve deeper into the practical aspects of normalizing a database. But first, we must know a few terms to help us on the journey.

Data Dictionary: Contains all information about the objects in a database.
Metadata: Contains information about what the data stores, sources of data, etc.
Schema: A Schema is the internal design of the database, i.e., how the data looks at different levels to users.
Primary Key: Primary key is that attribute used to identify a tuple or row in a database uniquely. If it contains more than one attribute, it is called a composite key. It cannot contain redundant or null values.
Candidate Key: The candidate key is the attribute or combination of attributes that can uniquely identify a tuple.
Super key: A Super key is a set of attributes used to identify a tuple in a related database uniquely. This being said, each attribute in the Super Key may contain repeated or redundant values (which is not in the case of primary).
Foreign Key: A foreign key is a primary key of another table that becomes an attribute or field in a different table.
Functional Dependency: A functional dependency is a relation or a specific constraint between two attributes or two sets of attributes in a database. A —>B means that B depends on A or a functional dependency exists between B and A.

For more information on terms related to the type of database used in this article, click here.

This article mainly focuses on how to normalize a database. But first,

Why and When Should we Normalize a Relational Database?

We normalize a database because:

It makes it easier to organize the database,
It sequentially decomposes tables and hence, reduces data duplication,
It reduces the storage space taken by the database,
It removes specific anomalies and makes updating, deleting, and inserting data easy.

We normalize the data if:

The database is messy and takes up much storage space.
Updating or deleting a record from the database leads to inconsistency
The database has multiple values in a single cell.
No such attribute or column is present in the database that can help identify every tuple uniquely.
Duplicate or repetitive values are present in the database.

How to Convert a Relational Database to Different Normal Forms?

So, let’s use a practical example to understand how normalization works. Consider a hospital that maintains its database in the form:

Step 1: First, you need to identify the relationships that exist between each attribute of the database. Here the relationships are:

One-to-one:

Patient_Id and Patient_Name
Disease_Id and Disease
Test_Id and Test

One-to-many:

Category and Disease_Id/Disease
Category and Doctor_Name

Many-to-one:

Doctor_Name and Category

Many-to-many:

Doctor_Name and Patient_Id/ Patient_Name
Test and Patient_Name/ Patient_Id

Step 2: Once you know the relationships between the attributes, you need to normalize your data. To do this, you must convert your data to the best Normal Form (NF) that suits your business purpose. This article will look at normalizing a database using SQL (Structured Query Language). Click here to learn the basics and more about the SQL syntaxes used.

We can create the table in PostgreSQL using the following SQL code:

create table Patient_Information(
Patient_id varchar(10) NOT NULL,
P_Name varchar(100),
Age integer,
Marital_Status varchar(10),
Gender varchar(10),
Disease_Id varchar(10),
Disease varchar(100),
Category varchar(100),
Doctor_Name varchar(20),
Test_Id varchar(10),
Test varchar(40),
Price float,
PRIMARY KEY(Patient_Id)
);

Note: All SQL codes mentioned further in the blog are extended from those mentioned earlier.

First Normal Form or 1NF

To convert the first table to 1NF, we must ensure that every column consists only of a single data, and the attribute domain must not change. So, the relationship will become:

Patient_Information:

1200

If the database is not in 1NF, then the design is considered a bad database design. So, one must certainly apply
transformations to bring a database to 1NF before applying further normalization techniques.

Now we insert data into the table using the code:

INSERT INTO Patient_Information
VALUES('P001', 'Anna Smith', 18, 'F', 'HYTSH',
'Hypothyroidism', 'Dr Rohit Shaw', 'T4', 'Thyroxine', 400);

Second Normal Form or 2NF

To convert into 2NF, the relation must be in 1NF, and no partial dependency should be present.So, let’s consider the table as follows:

Patient_Id	Disease_Id	Test_Id	Test	Doctor_Name
P001	HYTSH	T4	Thyroxine	Dr. Rohit Shaw
P001	HYTSH	TSH	Thyroid Stimulating Hormone	Dr. Rohit Shaw
P002	HRTBLCK	ECG	Echocardiography	Dr. Rima Dhara
P002	HYPTSN	BP	Blood Pressure	Dr. Rima Dhara
P003	PCOD	USG	Ultrasonography	Dr. Shalini Thakur

So, this part of the table stores the disease name with the disease’s id and the doctor’s id who specializes in treating that specific disease.

The first advantage we get here is that we need not store long names of diseases in the patient’s information table.
Secondly, by knowing the disease’s id, we can easily find which doctor is or can be allocated to the patient.
Primary key: Patient_id+Disease_id

But we see that the doctor’s name only depends on the disease’s id, not the patient’s. Hence a partial dependency will exist. To remove this partial dependency, we divide the table into two:Patient_Info:

Patient_Id	Disease_Id	Test_Id	Test
P001	HYTSH	T4	Thyroxine
P001	HYTSH	TSH	Thyroid Stimulating Hormone
P002	HRTBLCK	ECG	Echocardiography
P002	HYPTSN	BP	Blood Pressure
P003	PCOD	USG	Ultrasonography

Disease_Info:

Disease_Id	Disease	Doctor_Name
HYTSH	Hypothyroidism	Dr. Rohit Shaw
HRTBLCK	Heart Blockage	Dr. Rima Dhara
HYPTSN	Hypertension	Dr. Rima Dhara
PCOD	Poly Cystic Ovarian Syndrome	Dr. Shalini Thakur

We can do this in SQL using the code:

CREATE TABLE Patient_Info ASSELECT Patient_Id, Disease_Id,Test_Id, TestFROM Patient_Information;
CREATE TABLE Disease_Info ASSELECT Disease_Id, Disease, Doctor_NameFROM Patient_Information;ALTER TABLE Patient_InfoADD PRIMARY KEY (Patient_Id, Disease_Id, Test_Id);ALTER TABLE Disease_InfoADD PRIMARY KEY (Disease_Id);

Third Normal Form or 3NF

The database or relation is in third normal form if it is in 2NF and no transitive dependency is present.

Consider this part of the main table:

Patient_Id	Disease_Id	Disease	Test_Id	Price
P001	HYTSH	Hypothyroidism	T4	400
P001	HYTSH	Hypothyroidism	TSH	350
P002	HRTBLCK	Heart Blockage	ECG	1100
P002	HYPTSN	Hypertension	BP	50
P003	PCOD	Poly Cystic Ovarian Syndrome	USG	1200

Now, the table stores more information:

Primary key: Patient_id+Disease_id.
But the attribute Price depends on the attribute Test_Id. Both of them are non-prime attributes. Hence, a transitive dependency exists.

To remove it and convert the table to 3NF, we follow the approach:Patient_Info:

Patient_Id	Disease_Id	Disease
P001	HYTSH	Hypothyroidism
P001	HYTSH	Hypothyroidism
P002	HRTBLCK	Heart Blockage
P002	HYPTSN	Hypertension
P003	PCOD	Poly Cystic Ovarian Syndrome

Test_Price:

Test_Id	Test	Price
T4	Thyroxine	400
TSH	Thyroid Stimulating Hormone	350
ECG	Echocardiography	1100
BP	Blood Pressure	50
USG	Ultrasonography	1200

We can achieve the following using the SQL code:

ALTER TABLE Patient_InfoDROP COLUMN Test_Id, Test; ALTER TABLE Patient_InfoADD COLUMN Disease; INSERT INTO Patient_Info(Disease)SELECT Disease,FROM Disease_Info; CREATE TABLE Test_Price ASSELECT Test_Id, Test, PriceFROM Patient_Information;ALTER TABLE Test_PriceADD PRIMARY KEY(Test_Id);

Boyce-Codd Normal Form or 3.5NF or BCNF

For the database to be in Boyce-Codd Normal Form, it must be in 3NF. Also, in BCNF, A must be a super key for a functional dependency from A to B. Consider the part from our main table:

Patient_Id	Category	Doctor_Name
P001	Endocrinology	Dr. Rohit Shaw
P001	Endocrinology	Dr. Rohit Shaw
P002	Cardiology	Dr. Rima Dhara
P002	Cardiology	Dr. Rima Dhara
P003	Gynecology	Dr. Shalini Thakur

Here, we see that:

Primary Key: Patient_Id+Category (since it can uniquely identify all the tuples in the table). That means the prime attributes here are: Patient_Id and Category.
But a dependency Doctor_Name —->Category exists, meaning a prime attribute depends on a non-prime attribute. This type of dependency is not allowed in BCNF.

So, to convert the relation to BCNF, we divide the table into two:Patient_Doctor:

Patient_Id	Doctor_Id
P001	D110
P002	D200
P003	D201

Doctor_Info:

Doctor_Id	Doctor_Name	Category
D110	Dr. Rohit Shaw	Endocrinology
D200	Dr. Rima Dhara	Cardiology
D201	Dr. Shalini Thakur	Gynecology

SQL Code:

ALTER TABLE Patient_Information
ADD COLUMN Doctor_Id DEFAULT Dr;
UPDATE Patient_Information
SET Doctor_Id='D110',
WHERE Doctor_Name='Dr. Rohit Shaw';
UPDATE Patient_Information
SET Doctor_Id='D200',
WHERE Doctor_Name='Dr. Rima Dhara';
UPDATE Patient_Information
SET Doctor_Id='D201',
WHERE Doctor_Name='Dr.Shalini Thakur';
CREATE TABLE Patient_Doctor AS
SELECT DISTINCT Patient_Id, Doctor_Id,
FROM Patient_Information;
ALTER TABLE Patient_Doctor
ADD PRIMARY KEY(Patient_Id, Doctor_Id);
CREATE TABLE Doctor_Info AS
SELECT DISTINCT Doctor_Id,Doctor_Name,Category,
FROM Patient_Information;
ALTER TABLE Doctor_Info
ADD PRIMARY KEY(Doctor_Id);

Fourth Normal Form or 4NF

For a database to be in 4NF, it must be in BCNF, and no multivalued dependency should be present.

Consider another part of the main table:

Patient_Id	Marital_Status	Disease_Id
P001	U	HYTSH
P001	U	HYTSH
P002	M	HRTBLCK
P002	M	HYPTSN
P003	U	PCOD

Here, there is no relation between Marital_Status and Disease_Id.Both are independent of each other. Hence, a multivalued dependency exists. So, we divide the table as follows:Patient_MaritalInfo:

Patient_Id	Marital_Status
P001	U
P001	U
P002	M
P002	M
P003	U

Patient_Disease:

Patient_Id	Disease_Id
P001	HYTSH
P001	HYTSH
P002	HRTBLCK
P002	HYPTSN
P003	PCOD

SQL Code:

CREATE TABLE Patient_MaritalInfo AS SELECT DISTINCT Patient_Id, Marital_Status, FROM Patient_Information;ALTER TABLE Patient_MaritalInfoADD PRIMARY KEY (Patient_Id),CREATE TABLE Patient_Disease AS SELECT Patient_Id, Disease_Id, FROM Patient_Information;ALTER TABLE Patient_DiseaseADD PRIMARY KEY(Patient_Id,Disease_Id);

Benefits we get from Normalizing this Data

The benefits we get from normalizing the dummy database are:

We now have tables containing patient attributes, doctor’s attributes, disease attributes, and test attributes separately.
Also, tables show the relationship between those attributes’ tables using foreign keys.
Hence, our database is more organized, and the relationships between each table are well-defined. This makes the data more efficient for analyzing, updating, and retrieving. For example: If the hospital has to update the age of a particular patient, then that record needs to be updated only in the table containing the patient’s attributes. Once that is done, this record will automatically be updated for all the related tables since they are related or modeled using foreign keys (these constraints can be declared in SQL).

Hence, now we can see how a messy database can be reduced to these normal forms. Normalization helps reduce space and makes updating, deleting, and working with the data easier. Well, there are other normal forms viz:

5NF: Here, the relation must be in 4NF, and no join dependency should be present.
6NF: Here, the relation must be in 5NF, and all join dependencies must be trivial i.e., the relation becomes irreducible.

But most databases can be organized with forms up to 4NF. Based on the business purpose, it is up to you to decide which normal form suits you best.

Conclusion

That’s it! You are all done with organizing your database. Now your database is efficient enough for further analysis. You can further transfer this database to a DBMS or from the SQL server; you can interact with your data using queries to meet the business requirements.

Here are some key takeaways from the blog:

Updation anomaly, insertion anomaly, deletion anomaly, and data redundancy make a database inefficient for analysis and querying. Hence, we use normalization to organize related data efficiently and eliminate anomalies.
Five major types of normal forms are commonly used in the industry: 1NF, 2NF, 3NF, BCNF, and 4NF. Normal forms like 5NF and 6NF are not commonly used in the industry. The normal forms must be used depending on the context.
Normalization helps organize interrelated data tables by decomposing them into sub-components or small tables based on constraints. This eliminates certain dependencies.
One of the best ways to achieve a normalized database is by using SQL (Structured Query Language). In this blog, you can see how one can normalize and decompose the dummy database using queries in SQL.

I hope this blog helps in your learning journey. Meanwhile, you can check more articles related to databases here.

Have a great learning ahead!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Sanjana Bhakat 31 Oct 2023

Beginner Data Engineering Data Warehouse Database Programming