How to Join Multiple Dataframes in Python

Gyan Prakash Tripathi Last Updated : 14 Jun, 2020

4 min read

The Challenge of Merging Multiple Dataframes in Python

Here’s a scenario that trips up almost every fresher and aspiring data scientist:

You are working on a project where data is being collected from several sources. Before you can get to the exploring and model-building part, you would need to first join these multiple datasets (in the form of tables, dataframes, etc.). How can you do this without losing any information?

This might sound like a simple scenario but it can be intimidating for a lot of newcomers, especially those who are unfamiliar with Python programming.

Drilling down further into this, I can broadly classify this into two scenarios:

First, the data with similar attributes may be distributed into multiple files. For example, suppose you are provided with multiple files each of which stores the information of sales that occurred in a particular week of the year. Thus, you will have 52 files for the whole year. Each file will have the same number and names of the columns.
Second, you may require combining information from multiple sources. For example, let’s say you want to get the contact information of people who have bought your products. Here you have two files – the first one with sales information and a second one with information about the customers.

I will show you how to work with both scenarios and join multiple dataframes in Python.

Understanding the Problem at Hand

I’ll take a popular and easy-to-understand example for the purpose of this article.

Let’s consider the example of examinations in a particular school. There are various subjects being taught with different teachers assigned to each subject. They update their own files regarding the student marks and overall performance. We’re talking about multiple files here!

For this article, we will use two such files that I have created to demonstrate the working of functions in Python. The first file contains data about class 12th students and the other one has data for class 10th. We will also use a third file that stores the names of students along with their Student ID.

Note: While these datasets are created from scratch, I encourage you to apply what you’ll learn on a dataset of your choice.

Step-by-Step Process for Merging Dataframes in Python

Here’s how we’ll approach this problem:

Load the Datasets in Python
Combine Two Similar Dataframes (Append)
Combine Information from Two Dataframes (Merge)

Step 1: Loading the Datasets in Python

We will use three separate datasets in this article. First, we need to load these files into separate dataframes.

The first two dataframes contain the percentage of students along with their Student ID. In our first dataframe, we have the marks for class 10 students while the second dataframe contains marks for the students in 12th standard. The third dataframe contains the names of students along with their respective Student ID.

We can use the ‘head’ function to check the first few rows of each dataframe:

Step 2: Combining Two Similar Dataframes (Append)

Let’s combine the files of class 10th and 12th in order to find the average marks scored by the students. Here, we will use the ‘append’ function from the Pandas library:

Output: ((50,3),(50,3),(100,3))

As you can see from the output, the append function adds the two dataframes vertically.

The resultant dataframe is allMarks. The shapes of all three dataframes are compared above.

Next, let’s have a look at the content of ‘allMarks’ and calculate the mean:

Output: 49.74

Step 3: Combining Information from Two Dataframes (Merge)

Now, let’s say we want to find the name of the student who came first among both the batches. Here, we do not need to add the dataframes vertically. We will have to scale it horizontally in order to add one more column for the name of students.

To do this, we will find the maximum marks scored:

Output: 100

The maximum marks achieved by a student are 100. Now, we will use the ‘merge’ function to find the name of this student:

Finally, the resultant dataframe has names of students mapped along with their marks.

The merge function requires a necessary attribute on which the two dataframes will be merged. We need to pass the name of this column is in the ‘on’ argument.

Another important argument of merge is ‘how’. This specifies the type of join you want to perform on the dataframes. Here are the different join types you can perform (SQL users will be very familiar with this):

Inner join (performed by default if you don’t provide any argument)
Outer join
Right join
Left join

We can also sort the dataframe using the ‘sort’ argument. These are the most commonly used arguments while merging two dataframes.

Now, we will see the rows where the dataframe contains 100 ‘Exam points’:

Three students have got 100 marks, out of which two are in class 10th. Well done!

End Notes

Pretty straightforward, right? No need to trip yourself up over this anymore! You can go ahead and apply this to any dataset of your choice. My recommendation is to pick up the food forecasting challenge that contains 3 different files.

If you are a newcomer to Python for data science, you can enroll in this free course.

Gyan Prakash Tripathi

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Beginner Python Structured Data Technique

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

How to Join Multiple Dataframes in Python

The Challenge of Merging Multiple Dataframes in Python

Understanding the Problem at Hand

Step-by-Step Process for Merging Dataframes in Python

Step 1: Loading the Datasets in Python

Step 2: Combining Two Similar Dataframes (Append)

Step 3: Combining Information from Two Dataframes (Merge)

End Notes

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID

SAPISID