“I have two different tables in join Python but I’m not sure how to join them. What criteria should I consider? What are the different ways I can join these tables using merge pandas dataframes?”
Sound familiar? I have come across this question plenty of times on online discussion forums. Working with one table is fairly straightforward, but things become challenging when we have data spread across two or more tables. We can use different programming languages to work with tables, but today we will be focussing on python.
Python programming already has functions for basic tasks like python string joins and concatenating lists, but it becomes much easier if we use packages to perform tasks. One such package is the pandas library. It is a package that one must have heard of if you are learning python programming or working on machine learning projects.
This is where the concept of Joins comes in. I cannot emphasize the number of times I have used these Joins in Pandas! They’ve come in especially handy during data science hackathons when I needed to join multiple tables quickly. This article will mainly introduce you to the different types of joins in pandas and how to use them. It will also tell you how to deal with redundancy or duplicate values in the resulting dataframes.
Learning Objectives
Note: If you’re new to the world of Pandas and Python, I recommend taking the below free courses:
Looking to learn SQL joins? We have you covered! Head over here to learn all about SQL joins.
The join()
method is like a friendly string conductor that brings together individual strings into a harmonious whole. Imagine you have a group of strings (like a band), and you want to join them using a specific separator (like a musical note). That’s where join()
steps in!
join()
Do?join()
method takes all the items in an iterable (like a list or tuple) and joins them into a single string.Pythonresult_string = separator.join(iterable)
AI-generated code. Review and use carefully.
separator
: A string that acts as the glue between the items.iterable
: Any collection of strings (e.g., a list, tuple, or even a string itself).I’m sure you’re quite familiar with e-commerce sites like Amazon and Flipkart these days. We are bombarded by their advertisements when visiting non-related websites – that’s the power of targeted marketing!
We’ll take a simple problem from a related marketing brand here. We are given two tables – one containing product data and the other with customer-level information.
We will use these tables to understand hands-on how the different types of joins work using Merge Pandas.
Inner join is the most common type of join you’ll be working with. It returns a dataframe with only those rows that have common characteristics.
An inner join requires each row in the two joined dataframes to have matching column values. This is similar to the intersection of two sets.
Let’s start by importing the Pandas library:
import pandas as pd
For this tutorial, we have two dataframes – product and customer. The product dataframe contains product details like Product_ID, Product_name, Category, Price, and Seller_City. The customer dataframe contains details like id, name, age, Product_ID, Purchased_Product, and City.
Our task is to use our joining skills and generate meaningful information from the data. I encourage you to follow along with the code we’ll cover in this tutorial.
import pandas as pd
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})
print(product)
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
In the above code snippet, we have passed a dict in the pandas DataFrame constructor. In the dictionary are key-value pairs, where keys are python strings, and the values are iterable objects; these iterable objects could be lists, tuples, json, etc.
Let’s say we want to know about all the products sold online and who purchased them. We can get this easily using an inner join.
In Pandas is our friend here. By default, the merge pandas function performs an inner join. The merge function only takes series or dataframe objects as input for merging; if we pass a different data type, it will result in a typeerror. The two dataframe / series objects are passed in the merge function with a comma as the separator. For example:
pd.merge(product,customer,on='Product_ID')
The return value type of the merge function is dataframe.
There is another function in pandas called the concat function. From the name, it suggests that we can concatenate strings using this function, but it actually just concatenates two dataframes.
Here, I have performed inner join on the product and customer dataframes on the ‘Product_ID’ column.
But what if the column name strings are different in the two dataframes? Then, we have to explicitly mention both column names str.
‘left_on’ and ‘right_on’ are two parameters through which we can achieve this. ‘left_on’ is the name of the key in the left dataframe and ‘right_on’ in the right dataframe, so the
pd.merge(product,customer,left_on='Product_name',right_on='Purchased_Product')
Also, we cannot pass empty strings in left_on or right_on arguments
Let’s try the above code in the live coding window below!!
import pandas as pd
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})
print("Product Dataframe: \n\n")
print(product.head())
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
print("\n\n Customer Dataframe: \n\n")
print(customer.head())
print("\n\n Merged Dataframe: \n\n")
print(pd.merge(product,customer,on='Product_ID'))
Let’s take things up a notch. The leadership team now wants more details about the products sold. They want to know about all the products the seller sells to the same city, i.e., both the seller and the customer belong to the same city.
In this case, we have to perform an inner join on both Product_ID and Seller_City of the product and Product_ID and City columns of the customer dataframe.
So, how can we do this?
Don’t scratch your head! In the above example, We only passed single strings in the left_on or right_on arguments. We can pass an iterable object of column names to the left_on and right_on arguments, which can be a list or a tuple, as they are iterable in left_on or right_on:
pd.merge(product,customer,how='inner',left_on=['Product_ID','Seller_City'],right_on=['Product_ID','City'])
Here the list items are column names, which are the names of the columns on which we want to join two dataframes.
Here’s another interesting task for you. We have to combine both dataframes in order to find all the products that are not sold and all the customers who didn’t purchase anything from us.
We can use Full Join for this purpose.
Full Join, also known as Full Outer Join, returns all those records which either have a match in the left or right dataframe.
When rows in both the dataframes do not match, the resulting dataframe will have NaN for every column of the dataframe that lacks a matching row.
We can perform Full join by passing the how parameter as ‘outer’ to the merge() function, which takes input python strings as arguments:
pd.merge(product,customer,on='Product_ID',how='outer')
We cannot pass non-string values in the how parameter.
Did you notice what happened here? All the non-matching rows of both the dataframes have NaN values for the columns of other dataframes. But wait – we still don’t know which row belongs to which dataframe.
For this, Pandas provides us with a fantastic solution. By mentioning the indicator argument as True in the function, a new column of name _merge will be created in the resulting dataframe:
pd.merge(product,customer,on='Product_ID',how='outer',indicator=True)
As you can see, the _merge column mentions which row belongs to which dataframe.
Now, let’s say the leadership team wants information about only those customers who bought something from us. You guessed it – we can use the concept of Left Join here.
Left join, also known as Left Outer Join, returns a dataframe containing all the rows of the left dataframe.
All the non-matching rows of the left dataframe contain NaN for the columns in the right dataframe. It is simply an inner join plus all the non-matching rows of the left dataframe filled with NaN for columns of the right dataframe.
Performing a left join is actually quite similar to a full join. Just change the how argument to a different str that is ‘left’:
pd.merge(product,customer,on='Product_ID',how='left')
Here, you can clearly see that all the unsold products contain NaN for the columns belonging to the customer dataframe.
Similarly, if we want to create a table of customers including the information about the products they bought, we can use the right join.
Right join, also known as Right Outer Join, is similar to the Left Outer Join. The only difference is that all the rows of the right dataframe are taken as it is and only those of the left dataframe that are common in both.
Similar to other joins, we can perform a right join by changing the how argument to ‘right’:
pd.merge(product,customer,on='Product_ID',how='right')
Take a look carefully at the above dataframe – we have NaN values for columns of the product dataframe. Pretty straightforward, right?
Duplicate values can be tricky obstacles. They can cause problems while performing joins. These values won’t give an error but will create redundancy in our resulting dataframe. I’m sure you can imagine how harmful that can be!
Here, we have a new dataframe product_dup with duplicate details about products:
product_dup=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,103,107],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','Shoes','Laptop',
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Fashion','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,2999.0,79999.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','Chennai','Bengalore']
})
Let’s see what happens if we perform an inner join on this dataframe:
pd.merge(product_dup,customer,how='inner',on='Product_ID')
As you can see, we have duplicate rows in the resulting dataset as well. To solve this, there is a validate argument in the merge() function, which we can set to ‘one_to_one,’ ‘one_to_many,’ ‘many_to_one,’ and‘many_to_many.’
This ensures that there exists only a particular mapping across both the dataframes. If the mapping condition is not satisfied, then it throws a MergeError. To solve this, we can delete duplicates before applying join:
pd.merge(product_dup.drop_duplicates(),customer,how='inner',on='Product_ID')
But, if you want to keep these duplicates, then you can give validate values as per your requirements, and it will not throw an error:
pd.merge(product_dup,customer,how='inner',on='Product_ID',validate='many_to_many')
By now, you must have a strong expertise of the one-of-a-kind types of joins to be had in Pandas and how to utilize the merge() feature to mix dataframes successfully. Mastering those joins will significantly beautify your potential to control and analyze records, making your data science and gadget learning tasks a great deal more efficient and insightful.
Remember, at the same time as joins are powerful, they could introduce complexity, particularly while managing large datasets or redundant facts. There are also other functions in Pandas, such as concat() and append() that can be used to merge dataframes. I encourage you to explore and apply them in your next project alongside what you’ve learned about joins in this tutorial.
Key Takeaways
If you have any queries or feedback on this article, feel free to share it in the comments section below. I have listed some insightful and comprehensive articles and courses related to data science and Python below.
Courses:
Python Tutorials:
We can perform the join operation on two dataframes in pandas using multiple functions like the pandas join function. We can also use the pandas merge function to do a merge operation on the dataframes. So there are multiple ways to join two dataframes in pandas.
Mainly, there are 4 types of joins for a dataframe. They are: Left join, Right join, Inner join, and Cross join.
The most frequently used joins for dataframes and tables are left join and inner join. In inner join, all the records with matching keys are present, whereas, in left join, all the records from the left table are present.
The inner join only finds and returns all matching rows; it does not return any unmatched rows. But in the case of the outer join, it returns all matching rows and some unmatched ones as well. The outer join is of two types: the left outer join and the right outer join.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
Thanks! This was a helpful and clear explanation.
I'm glad that you liked my work.
Helpful! Is there a way I could do something like: A left outer join B, where B.name contains A.name (but does not equal it)?
Very nicely explained!
It is very nice explained!! thank you so much.
Thanks Abhishek! And how to have this sql script in pandas: select * from t1 left join t2 on t1_id = t2_id where id2 is null this selects rows ids that are available 'only' on left table
very nice expaination. cleared my queries on joins in pandas
Loved this article. Simple and easy to understand 🤗
Very clear and concise explanation. Removed all my doubts. Thank you very much, Abishek Sharma! :-)