How to Normalize Relational Databases With SQL Code?
Data is the new oil in this century. The database is the major element of a data science project. To generate actionable insights, the database must be centralized and organized efficiently. If a corrupted, unorganized, or redundant database is used, the results of the analysis may become inconsistent and highly misleading. So, we are going to learn how to normalize an RDBMS with the help of SQL programming.
This article attempts to explain how to remove inconsistencies in relational databases using normalization and a dummy database, for example. Before delving deeper, one must know a few related terms.
- Know why normalizing a database is important
- Understand the normal forms used in the industry
- Learn how to convert a given database into the desired Normal Form using SQL.
This article was published as a part of the Data Science Blogathon.
Table of Contents
What is a Relational Database?
Imagine a retail store wanting to store and manipulate data on their sales. Ideally, this data will contain the Item Code, Item Name, Price of each unit, Number of units sold, Category, and Date of Purchase. So, all these attributes are the names of the columns, and a particular row contains data on the purchase of a particular product on that day. Hence, here interrelated data has been collected and stored in rows and columns. This is what we call a “Relational Database.” An RDBMS (Relational Database Management System) is used to store this data.
If we go for a formal definition of a database, it can be said that it stores interrelated data from which users can efficiently retrieve information. Relational databases are structured, self-describing, and storing data according to predefined relationships.
But real-time data stored in this model can be messy most of the time, making it inconsistent and hard to work with. So, to easily manipulate and work with these data tables, we normalize the database.
Terms that are Commonly used in an RDBMS
Now, let’s delve deeper into the practical aspects of normalizing a database. But first, we must know a few terms to help us on the journey.
- Data Dictionary: Contains all information about the objects in a database.
- Metadata: Contains information about what the data stores, sources of data, etc.
- Schema: A Schema is the internal design of the database, i.e., how the data looks at different levels to users.
- Primary Key: Primary key is that attribute used to identify a tuple or row in a database uniquely. If it contains more than one attribute, it is called a composite key. It cannot contain redundant or null values.
- Candidate Key: The candidate key is the attribute or combination of attributes that can uniquely identify a tuple.
- Super key: A Super key is a set of attributes used to identify a tuple in a related database uniquely. This being said, each attribute in the Super Key may contain repeated or redundant values (which is not in the case of primary).
- Foreign Key: A foreign key is a primary key of another table that becomes an attribute or field in a different table.
- Functional Dependency: A functional dependency is a relation or a specific constraint between two attributes or two sets of attributes in a database. A —>B means that B depends on A or a functional dependency exists between B and A.
For more information on terms related to the type of database used in this article, click here.
This article mainly focuses on how to normalize a database. But first,
Why and When Should we Normalize a Relational Database?
We normalize a database because:
- It makes it easier to organize the database,
- It sequentially decomposes tables and hence, reduces data duplication,
- It reduces the storage space taken by the database,
- It removes specific anomalies and makes updating, deleting, and inserting data easy.
We normalize the data if:
- The database is messy and takes up much storage space.
- Updating or deleting a record from the database leads to inconsistency
- The database has multiple values in a single cell.
- No such attribute or column is present in the database that can help identify every tuple uniquely.
- Duplicate or repetitive values are present in the database.
How to Convert a Relational Database to Different Normal Forms?
So, let’s use a practical example to understand how normalization works. Consider a hospital that maintains its database in the form:
- Patient_Id and Patient_Name
- Disease_Id and Disease
- Test_Id and Test
- Category and Disease_Id/Disease
- Category and Doctor_Name
- Doctor_Name and Category
- Doctor_Name and Patient_Id/ Patient_Name
- Test and Patient_Name/ Patient_Id
Step 2: Once you know the relationships between the attributes, you need to normalize your data. To do this, you must convert your data to the best Normal Form (NF) that suits your business purpose. This article will look at normalizing a database using SQL (Structured Query Language). Click here to learn the basics and more about the SQL syntaxes used.
We can create the table in PostgreSQL using the following SQL code:
create table Patient_Information(
Patient_id varchar(10) NOT NULL,
Note: All SQL codes mentioned further in the blog are extended from those mentioned earlier.
- First Normal Form or 1NF
To convert the first table to 1NF, we must ensure that every column consists only of a single data, and the attribute domain must not change. So, the relationship will become:
To convert into 2NF, the relation must be in 1NF, and no partial dependency should be present.So, let’s consider the table as follows:
|P001||HYTSH||T4||Thyroxine||Dr. Rohit Shaw|
|P001||HYTSH||TSH||Thyroid Stimulating Hormone||Dr. Rohit Shaw|
|P002||HRTBLCK||ECG||Echocardiography||Dr. Rima Dhara|
|P002||HYPTSN||BP||Blood Pressure||Dr. Rima Dhara|
|P003||PCOD||USG||Ultrasonography||Dr. Shalini Thakur|
So, this part of the table stores the disease name with the disease’s id and the doctor’s id who specializes in treating that specific disease.
- The first advantage we get here is that we need not store long names of diseases in the patient’s information table.
- Secondly, by knowing the disease’s id, we can easily find which doctor is or can be allocated to the patient.
- Primary key: Patient_id+Disease_id
But we see that the doctor’s name only depends on the disease’s id, not the patient’s. Hence a partial dependency will exist. To remove this partial dependency, we divide the table into two:Patient_Info:
|P001||HYTSH||TSH||Thyroid Stimulating Hormone|
|HYTSH||Hypothyroidism||Dr. Rohit Shaw|
|HRTBLCK||Heart Blockage||Dr. Rima Dhara|
|HYPTSN||Hypertension||Dr. Rima Dhara|
|PCOD||Poly Cystic Ovarian Syndrome||Dr. Shalini Thakur|
We can do this in SQL using the code:
CREATE TABLE Patient_Info AS
CREATE TABLE Disease_Info AS
SELECT Disease_Id, Disease,
ALTER TABLE Patient_Info
ADD PRIMARY KEY (Patient_Id, Disease_Id, Test_Id);
ALTER TABLE Disease_Info
ADD PRIMARY KEY (Disease_Id);
- Third Normal Form or 3NF
The database or relation is in third normal form if it is in 2NF and no transitive dependency is present.
Consider this part of the main table:
|P003||PCOD||Poly Cystic Ovarian Syndrome||USG||1200|
Now, the table stores more information:
- Primary key: Patient_id+Disease_id.
- But the attribute Price depends on the attribute Test_Id. Both of them are non-prime attributes. Hence, a transitive dependency exists.
To remove it and convert the table to 3NF, we follow the approach:Patient_Info:
|P003||PCOD||Poly Cystic Ovarian Syndrome|
|TSH||Thyroid Stimulating Hormone||350|
We can achieve the following using the SQL code:
ALTER TABLE Patient_Info
DROP COLUMN Test_Id,
ALTER TABLE Patient_Info
ADD COLUMN Disease;
INSERT INTO Patient_Info(Disease)
CREATE TABLE Test_Price AS
ALTER TABLE Test_Price
ADD PRIMARY KEY(Test_Id);
- Boyce-Codd Normal Form or 3.5NF or BCNF
For the database to be in Boyce-Codd Normal Form, it must be in 3NF. Also, in BCNF, A must be a super key for a functional dependency from A to B. Consider the part from our main table:
|P001||Endocrinology||Dr. Rohit Shaw|
|P001||Endocrinology||Dr. Rohit Shaw|
|P002||Cardiology||Dr. Rima Dhara|
|P002||Cardiology||Dr. Rima Dhara|
|P003||Gynecology||Dr. Shalini Thakur|
Here, we see that:
- Primary Key: Patient_Id+Category (since it can uniquely identify all the tuples in the table). That means the prime attributes here are: Patient_Id and Category.
- But a dependency Doctor_Name —->Category exists, meaning a prime attribute depends on a non-prime attribute. This type of dependency is not allowed in BCNF.
So, to convert the relation to BCNF, we divide the table into two:Patient_Doctor:
|D110||Dr. Rohit Shaw||Endocrinology|
|D200||Dr. Rima Dhara||Cardiology|
|D201||Dr. Shalini Thakur||Gynecology|
ALTER TABLE Patient_Information
ADD COLUMN Doctor_Id DEFAULT Dr;
WHERE Doctor_Name='Dr. Rohit Shaw';
WHERE Doctor_Name='Dr. Rima Dhara';
WHERE Doctor_Name='Dr.Shalini Thakur';
CREATE TABLE Patient_Doctor AS
SELECT DISTINCT Patient_Id, Doctor_Id,
ALTER TABLE Patient_Doctor
ADD PRIMARY KEY(Patient_Id, Doctor_Id);
CREATE TABLE Doctor_Info AS
SELECT DISTINCT Doctor_Id,Doctor_Name,Category,
ALTER TABLE Doctor_Info
ADD PRIMARY KEY(Doctor_Id);
- Fourth Normal Form or 4NF
For a database to be in 4NF, it must be in BCNF, and no multivalued dependency should be present.
Consider another part of the main table:
Here, there is no relation between Marital_Status and Disease_Id.Both are independent of each other. Hence, a multivalued dependency exists. So, we divide the table as follows:Patient_MaritalInfo:
CREATE TABLE Patient_MaritalInfo AS
SELECT DISTINCT Patient_Id, Marital_Status,
ALTER TABLE Patient_MaritalInfo
ADD PRIMARY KEY (Patient_Id),
CREATE TABLE Patient_Disease AS
SELECT Patient_Id, Disease_Id,
ALTER TABLE Patient_Disease
ADD PRIMARY KEY(Patient_Id,Disease_Id);
Benefits we get from Normalizing this Data
The benefits we get from normalizing the dummy database are:
- We now have tables containing patient attributes, doctor’s attributes, disease attributes, and test attributes separately.
- Also, tables show the relationship between those attributes’ tables using foreign keys.
- Hence, our database is more organized, and the relationships between each table are well-defined. This makes the data more efficient for analyzing, updating, and retrieving. For example: If the hospital has to update the age of a particular patient, then that record needs to be updated only in the table containing the patient’s attributes. Once that is done, this record will automatically be updated for all the related tables since they are related or modeled using foreign keys (these constraints can be declared in SQL).
Hence, now we can see how a messy database can be reduced to these normal forms. Normalization helps reduce space and makes updating, deleting, and working with the data easier. Well, there are other normal forms viz:
- 5NF: Here, the relation must be in 4NF, and no join dependency should be present.
- 6NF: Here, the relation must be in 5NF, and all join dependencies must be trivial i.e., the relation becomes irreducible.
But most databases can be organized with forms up to 4NF. Based on the business purpose, it is up to you to decide which normal form suits you best.
That’s it! You are all done with organizing your database. Now your database is efficient enough for further analysis. You can further transfer this database to a DBMS or from the SQL server; you can interact with your data using queries to meet the business requirements.
Here are some key takeaways from the blog:
- Updation anomaly, insertion anomaly, deletion anomaly, and data redundancy make a database inefficient for analysis and querying. Hence, we use normalization to organize related data efficiently and eliminate anomalies.
- Five major types of normal forms are commonly used in the industry: 1NF, 2NF, 3NF, BCNF, and 4NF. Normal forms like 5NF and 6NF are not commonly used in the industry. The normal forms must be used depending on the context.
- Normalization helps organize interrelated data tables by decomposing them into sub-components or small tables based on constraints. This eliminates certain dependencies.
- One of the best ways to achieve a normalized database is by using SQL (Structured Query Language). In this blog, you can see how one can normalize and decompose the dummy database using queries in SQL.
I hope this blog helps in your learning journey. Meanwhile, you can check more articles related to databases here.
Have a great learning ahead!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.