Introduction to Pandas DataFrame.query() function
Overview
- Sometimes during EDA(Exploratory Data Analysis) selecting or filtering data frames can be a tedious task
- Pandas DataFrame query() enables users to analyze and filter the data just like where clause in SQL
Introduction
Python is one of the dominantly used programming languages by data scientists. The reason for the proliferation of this language is that it is easy to use and enjoys strong community support, with dedicated libraries for data analysis and predictive modeling. Pandas is one such library that is widely accepted by the data science community for performing data cleaning and analysis.
Sometimes during EDA(Exploratory Data Analysis) selecting or filtering data frames can be a tedious task. Hence pandas provide some functions that make this task a cakewalk even for beginners. The query is a highly useful function. It enables users to analyze and filter the data just like where clause in SQL. Also, it helps in compiling several conditions simultaneously in a memory-efficient manner as it does not uses temporary variables.
In this article, we are going to see a very interesting pandas data frame function Query.
Table of Contents
- Syntax
- Filter Employees of IT Department
- Employees in IT Department and Not Living in India
- Implementation of OR condition
- Using Environment Variable
- Why use Query?
Syntax
Let’s have a look at the syntax before jump into the query.
pandas.DataFrame.query(expr,inplace=false,**kwargs) expr:The condition to filter the data. inplace: weather the query modify the original data or return a modified copy. **kwargs: Any other keyword argument you want to pass
Now we will look at some examples of how this amazing function eases your Analysis.
Let’s first create a Dataframe for the demo
import pandas as pd import numpy as np df= pd.DataFrame({'Name':['Vikram','John', 'Alex','Paul','Andrew','Rafel' ], 'Age':[28,39,21,50,35,43], 'Department':['HR', 'Finance','IT','HR','IT','IT'], 'Country':['USA','India','Germany','USA','India','India'] }) print(df.head())
Output
Now, let’s Filter the dataset using the query function provided by pandas for its DataFrame objects.
Filter employees of IT department
In this example want to select only the employees who work for the IT department.
import pandas as pd import numpy as np df= pd.DataFrame({'Name':['Vikram','John', 'Alex','Paul','Andrew','Rafel' ], 'Age':[28,39,21,50,35,43], 'Department':['HR', 'Finance','IT','HR','IT','IT'], 'Country':['USA','India','Germany','USA','India','India'] }) print(df.query('Department == "IT"'))
Here is the required result. We have all the IT department employees present in the dataset.
Employees in-department IT and not living in India
Let’s take another example of satisfying two conditions simultaneously using ‘and’.
import pandas as pd import numpy as np df= pd.DataFrame({'Name':['Vikram','John', 'Alex','Paul','Andrew','Rafel' ], 'Age':[28,39,21,50,35,43], 'Department':['HR', 'Finance','IT','HR','IT','IT'], 'Country':['USA','India','Germany','USA','India','India'] }) print(df.query('Department == "IT" and Country != "India" '))
In the dataset, there is only one employee who satisfies both requirements.
Let’s see the implementation of the OR condition
Here we present another example showing how can you implement the ‘OR’ clause. that means selecting the records satisfying either the first condition or the other.
import pandas as pd import numpy as np df= pd.DataFrame({'Name':['Vikram','John', 'Alex','Paul','Andrew','Rafel' ], 'Age':[28,39,21,50,35,43], 'Department':['HR', 'Finance','IT','HR','IT','IT'], 'Country':['USA','India','Germany','USA','India','India'] }) print(df.query('Age>= 40 or Country =="India"'))
Output
Using Environment variable
Sometimes we want to use the environment variables in the expression, we can do this using @ before the variable name. Just like SQL where you can use a subquery to reference for selecting rows where conditions are satisfied. The ‘@’ method saves the value that you want to compare against. as shown in the example below.
In the example Location is a local variable. To use it in a query expression you just need to put @ before it and it’s done.
import pandas as pd import numpy as np df= pd.DataFrame({'Name':['Vikram','John', 'Alex','Paul','Andrew','Rafel' ], 'Age':[28,39,21,50,35,43], 'Department':['HR', 'Finance','IT','HR','IT','IT'], 'Country':['USA','India','Germany','USA','India','India'] } Location= " India" print(df.query( Country == @Location'))
Output
Note: Always remember if there is any space in your column name, but that between a pair of backtick (“) like strings while using in expression or replace the space with “_”
Why use Query?
For a data scientist or software programmer time and space, complexities are important aspects while considering the ways to perform a task. The query method is highly efficient for larger datasets. As it saves the memory also it offers cleaner syntax to the user.
End Notes
To summarize, here we have seen pandas.query a highly useful function to slice and dice the dataframe. If you are an absolute beginner in python, I will recommend you to enroll in our Python course(Free) to enhance your fundamentals in python.
In the end, I have a task for you. Just pick a dataset and try different methods of filtering the data and compare them for time and memory efficiency. It will be great if you want to share your findings in the comments.
Also, comment below if you have any doubt.