Pandas Groupby for Data Aggregation

Aniruddha Bhandari Last Updated : 18 Oct, 2024

10 min read

What if I told you that we could derive effective and impactful insights from our dataset in just a few lines of code? That’s the beauty of the Pandas GroupBy function! I have lost count of the number of times I’ve relied on GroupBy to quickly summarize data and aggregate it in a way that’s easy to interpret.

This helps not only when we’re working on a data science project and need quick results but also in hackathons! When time is of the essence (and when is it not?), the Pandas groupby() method saves us a ton of effort by summarizing data in a single line of code. If you are familiar with GROUP BY in SQL, this article will be even easier for you to understand!

Loving GroupBy already? In this tutorial, I will first explain the GroupBy function using an intuitive example before picking up a real-world dataset and implementing GroupBy in Python. Let’s begin aggregating!

In this article, you will learn how to use Pandas groupby() to organize your data and how to apply aggregate functions to the resulting groups for insightful analysis.

Learning Objectives

Understanding the syntax and functionality of the groupby() method is important for efficient data grouping.
Familiarizing yourself with different types of aggregation functions available in pandas, including sum(), mean(), count(), max(), and min(), is necessary to perform effective data analysis.
Knowing how to apply various aggregation functions to grouped data enables data analysts to extract useful insights from large data sets.

What is the Pandas groupBy Function?
Understanding the Dataset & Problem Statement
First Look at Pandas GroupBy
The Split-Apply-Combine Strategy
Loop Over GroupBy Groups
Applying Functions to GroupBy Groups
Conclusion

What is the Pandas groupBy Function?

The groupby function in Pandas is a tool that helps you organize data into groups based on certain criteria, like the values in a column. This makes it easier to analyze and summarize your data.

Let me take an example to elaborate on this. Let’s say we are trying to analyze the weight of a person in a city. We can easily get a fair idea of their weight by determining the mean weight of all the city dwellers. But here’s a question: would the weight be affected by the gender of a person?

We can group the city dwellers into different gender groups and compute their mean weight. This would give us a better insight into the weight of a person living in the city. But we can probably get an even better picture if we further separate these gender groups into different age groups and then take their mean weight (because a teenage boy’s weight could differ from that of an adult male)!

You can see how separating people into separate groups and then applying a statistical value allows us to make better analyses than just looking at the statistical value of the entire population. This is what makes GroupBy so great!

GroupBy allows us to group our data based on different features and get a more accurate idea about it. It is a one-stop shop for deriving deep insights from our data!
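
The weight example can be sketched on a tiny made-up table. The column names and values below are hypothetical and only illustrate grouping by one key and then by two:

```python
import pandas as pd

# Hypothetical data: a handful of city dwellers (all values are made up)
people = pd.DataFrame({
    'Gender':   ['m', 'f', 'm', 'f', 'm', 'f'],
    'AgeGroup': ['teen', 'teen', 'adult', 'adult', 'adult', 'teen'],
    'Weight':   [55, 50, 80, 62, 76, 48],
})

# Mean weight of the whole population
overall_mean = people['Weight'].mean()

# Mean weight per gender
by_gender = people.groupby('Gender')['Weight'].mean()

# Mean weight per gender and age group (two grouping keys)
by_gender_age = people.groupby(['Gender', 'AgeGroup'])['Weight'].mean()

print(overall_mean)
print(by_gender)
print(by_gender_age)
```

Each extra grouping key narrows the groups, so the means describe more specific slices of the population.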

Understanding the Dataset & Problem Statement

We will be working with the Big Mart Sales dataset from our DataHack platform. It contains attributes related to the products sold at various stores of BigMart. The aim is to find out the sales of each product at a particular store.

Right, let’s import the libraries and explore the data:

import pandas as pd
import numpy as np

# Load the Big Mart Sales training data
df = pd.read_csv('train_v9rqX0R.csv')

We have some NaN (missing) values in our dataset, mostly in the Item_Weight and Outlet_Size columns. I will handle the missing values for Outlet_Size right away, but we’ll handle the missing values for Item_Weight later in the article using the GroupBy function!
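
The code for the Outlet_Size step is not shown here. One possible approach, which is an assumption and not necessarily what was done for this article, is to replace the missing sizes with the most frequent size. A minimal sketch on a hypothetical stand-in column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Outlet_Size column of the real dataset
toy = pd.DataFrame({'Outlet_Size': ['Medium', np.nan, 'Small', 'Medium', np.nan]})

# Count missing values per column
print(toy.isna().sum())

# Assumed strategy: fill missing sizes with the most frequent size
most_common = toy['Outlet_Size'].mode()[0]
toy['Outlet_Size'] = toy['Outlet_Size'].fillna(most_common)
print(toy)
```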

First Look at Pandas GroupBy

Let’s group the dataset by outlet location type. The syntax is simple: we just call groupby() on the DataFrame and pass the column name:

df.groupby('Outlet_Location_Type')

Output:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018C4E098288>

GroupBy has conveniently returned a DataFrameGroupBy object. It has split the data into separate groups. However, it won’t do anything unless it is being told explicitly to do so. So, let’s find the count of different outlet location types:

df.groupby('Outlet_Location_Type').count()

We did not tell GroupBy which column to aggregate, so it applied count() to every remaining column and returned the number of non-null values in each one.
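
Because count() skips nulls, the numbers can differ from column to column. A small sketch on a hypothetical miniature of the dataset (values are made up) contrasts it with size(), which counts rows:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset with one missing weight
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 1', 'Tier 2', 'Tier 2', 'Tier 2'],
    'Item_Weight': [9.3, np.nan, 17.5, 8.9, 12.0],
    'Item_Outlet_Sales': [3735.1, 443.4, 2097.3, 732.4, 994.7],
})

grouped = toy.groupby('Outlet_Location_Type')

# count() reports the non-null values of every remaining column
counts = grouped.count()

# size() reports the number of rows in each group, nulls included
sizes = grouped.size()

print(counts)
print(sizes)
```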

Fortunately, a GroupBy object supports column indexing just like a Pandas DataFrame!

So let’s find out the total sales for each location type:

df.groupby('Outlet_Location_Type')['Item_Outlet_Sales']

Here, GroupBy has returned a SeriesGroupBy object. No computation will be done until we specify an aggregation function:

df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].sum()

Awesome! Now, let’s understand the work behind the GroupBy function in Pandas.

The Split-Apply-Combine Strategy

You just saw how quickly you can get an insight into grouped data using the Pandas GroupBy function. But, behind the scenes, a lot is taking place, which is important to understand to gauge the true power of GroupBy.

GroupBy employs the Split-Apply-Combine strategy coined by Hadley Wickham in his paper in 2011. Using this strategy, a data analyst can break down a big problem into manageable parts, perform operations on individual parts and combine them back together to answer a specific question.

I want to show you how this strategy works in GroupBy by working with a sample dataset to get the average height for males and females in a group. Let’s create that dataset:

data = {
    'Gender': ['m', 'f', 'f', 'm', 'f', 'm', 'm'],
    'Height': [172, 171, 169, 173, 170, 175, 178]
}
df_sample = pd.DataFrame(data)
df_sample

Output:

  Gender  Height
0      m     172
1      f     171
2      f     169
3      m     173
4      f     170
5      m     175
6      m     178

Splitting the data into separate groups:

f_filter = df_sample['Gender']=='f'

print(df_sample[f_filter])


m_filter = df_sample['Gender']=='m'

print(df_sample[m_filter])

Applying the operation that we need to perform (average in this case):

f_avg = df_sample[f_filter]['Height'].mean()

m_avg = df_sample[m_filter]['Height'].mean()

print(f_avg,m_avg)

Output:

170.0 174.5

Finally, combining the result to output a DataFrame:

df_output = pd.DataFrame({'Gender':['f','m'],'Height':[f_avg,m_avg]})

df_output

Output:

  Gender  Height
0      f   170.0
1      m   174.5

All these three steps can be achieved by using GroupBy with just a single line of code! Here’s how:

df_sample.groupby('Gender').mean()

Output:

        Height
Gender
f        170.0
m        174.5

Now that is smart! Have a look at how GroupBy did that in the image below:

You can see how GroupBy simplifies our task by doing all the work behind the scenes without us having to worry about a thing!

Now that you understand the Split-Apply-Combine strategy, let’s dive deeper into the GroupBy function and unlock its full potential.

Loop Over GroupBy Groups

Remember the GroupBy object we created at the beginning of this article? Don’t worry, we’ll create it again:

obj = df.groupby('Outlet_Location_Type')

obj

We can display the row indices in each group through the groups attribute of the GroupBy object:

obj.groups

We can even iterate over all of the groups:

for name,group in obj:
    print(name,'contains',group.shape[0],'rows')

But what if you want to get a specific group out of all the groups? Well, don’t worry. Pandas has a solution for that too.

Just provide the specific group name when calling get_group on the group object. Here, I want to check out the features for the ‘Tier 1’ group of locations only:

obj.get_group('Tier 1')

Now isn’t that wonderful! You have the entire Tier 1 features to work with and derive wonderful insights! But wait, didn’t I say that GroupBy is lazy and doesn’t do anything unless explicitly specified? Alright then, let’s see GroupBy in action with the aggregate functions.
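
To try the group inspection calls from this section without the full dataset, here is a minimal sketch on a hypothetical five-row frame (the values are made up; only the column name matches the real data):

```python
import pandas as pd

# Hypothetical miniature with the same grouping column as the real dataset
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 1', 'Tier 2', 'Tier 3'],
    'Item_MRP': [249.8, 48.3, 141.6, 182.1, 53.9],
})

toy_obj = toy.groupby('Outlet_Location_Type')

# .groups maps each group name to the row labels that belong to it
group_labels = {name: list(idx) for name, idx in toy_obj.groups.items()}
print(group_labels)

# Iterating yields (name, sub-DataFrame) pairs
row_counts = {name: group.shape[0] for name, group in toy_obj}
print(row_counts)

# get_group returns the rows of a single group
tier1 = toy_obj.get_group('Tier 1')
print(tier1)
```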

Applying Functions to GroupBy Groups

The apply step is arguably the most important step of GroupBy: this is where we can perform a variety of operations using aggregation, transformation, filtration, or even our own function!

Let’s have a look at these in detail.

Aggregation

We have looked at some aggregation functions in the article so far, such as count, sum, and mean. These compute a summary statistic for each group. Here are some of the common aggregate functions in Pandas:

count() – Number of non-null observations
sum() – Sum of values
mean() – Mean of values
median() – Arithmetic median of values
min() – Minimum
max() – Maximum
mode() – Mode (available on a Series; with GroupBy it is applied through agg(), for example with pd.Series.mode)
std() – Standard deviation
var() – Variance
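
Most of these can be called directly on a grouped column, while the mode needs a small helper. A minimal sketch on hypothetical data (values are made up, and taking the first mode on ties is my own choice here):

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 1', 'Tier 1', 'Tier 2', 'Tier 2'],
    'Outlet_Size': ['Small', 'Small', 'Medium', 'High', 'High'],
    'Item_Outlet_Sales': [100.0, 300.0, 200.0, 50.0, 150.0],
})
grouped = toy.groupby('Outlet_Location_Type')

# Built-in reductions are methods of the grouped column
sales_median = grouped['Item_Outlet_Sales'].median()

# The mode is computed per group with a helper; iloc[0] picks the
# first mode if several values are tied
size_mode = grouped['Outlet_Size'].agg(lambda s: s.mode().iloc[0])

print(sales_median)
print(size_mode)
```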

But the agg() function in Pandas gives us the flexibility to perform several statistical computations all at once! Here is how it works:

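# Select numeric columns explicitly; recent pandas versions raise on text columns and prefer string names over np.mean
df.groupby('Outlet_Location_Type')[['Item_MRP', 'Item_Outlet_Sales']].agg(['mean', 'median'])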

We can even run GroupBy with multiple grouping columns to get better insights from our data:

df.groupby(['Outlet_Location_Type', 'Outlet_Establishment_Year'], as_index=False).agg(
    {'Outlet_Size': pd.Series.mode,
     'Item_Outlet_Sales': 'mean'
     }
)

Notice that I have used different aggregation functions for different column names by passing them in a dictionary with the corresponding operation to be performed. This allowed me to group and apply computations on nominal and numeric features simultaneously.

Also, I have set the as_index parameter to False. This way, the grouping columns are returned as regular columns instead of becoming the index of the result.
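
The effect of as_index is easiest to see side by side. A minimal sketch on a hypothetical four-row frame (values are made up):

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 1', 'Tier 2', 'Tier 2'],
    'Outlet_Establishment_Year': [1999, 1999, 2007, 2007],
    'Item_Outlet_Sales': [100.0, 300.0, 50.0, 150.0],
})

keys = ['Outlet_Location_Type', 'Outlet_Establishment_Year']

# Default: the grouping keys become a MultiIndex on the result
indexed = toy.groupby(keys).agg({'Item_Outlet_Sales': 'mean'})

# as_index=False: the grouping keys stay as ordinary columns
flat = toy.groupby(keys, as_index=False).agg({'Item_Outlet_Sales': 'mean'})

print(indexed)
print(flat)
```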

With named aggregation, we can also give the aggregated columns readable names. Because we group by two columns here, the result has a multi-level row index:

df.groupby(['Outlet_Type', 'Item_Type']).agg(
    mean_MRP=('Item_MRP', 'mean'),
    mean_Sales=('Item_Outlet_Sales', 'mean')
)

It is amazing how a name change can improve the understandability of the output!

Transformation

Transformation allows us to perform a computation on each group and get back a result with the same shape and index as the original data, so it lines up row by row with the DataFrame. This is done using the transform() function.

We will try to impute the missing values in the Item_Weight column using the transform() function.

The Item_Fat_Content and Item_Type will affect the Item_Weight, don’t you think? So, let’s group the DataFrame by these columns and handle the missing weights using the mean of these groups:

df['Item_Weight'] = df.groupby(['Item_Fat_Content', 'Item_Type'])['Item_Weight'].transform(
    lambda x: x.fillna(x.mean())
)

“Using the Transform function, a DataFrame calls a function on itself to produce a DataFrame with transformed values.”

You can read more about the transform() function in this article.
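
To see what the group-wise imputation does, here is a minimal sketch on a hypothetical five-row frame (values are made up, and it groups by a single column to stay short):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with two missing weights
toy = pd.DataFrame({
    'Item_Type': ['Dairy', 'Dairy', 'Dairy', 'Snacks', 'Snacks'],
    'Item_Weight': [10.0, np.nan, 14.0, 6.0, np.nan],
})

# transform returns a Series aligned with the original rows,
# so it can be assigned straight back to the column
filled = toy.groupby('Item_Type')['Item_Weight'].transform(
    lambda x: x.fillna(x.mean())
)
print(filled)
```

Each missing weight is replaced by the mean of its own group, and the result keeps the original row order.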

Filtration

Filtration allows us to discard certain values based on computation and return only a subset of the group. We can do this using the filter() function in Pandas.

Let’s take a look at the number of rows in our DataFrame presently:

df.shape

Output:

(8523, 12)

If I wanted to keep only those groups whose Item_Weight values have a standard deviation below 3, I could use the filter function to do the job:

def filter_func(x):
    return x['Item_Weight'].std() < 3

df_filter = df.groupby(['Item_Weight']).filter(filter_func)
df_filter.shape

Output:

(8510, 12)

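GroupBy has returned a DataFrame containing only the rows of those groups for which filter_func returned True, that is, groups whose Item_Weight standard deviation is below 3. Note that this example groups by Item_Weight itself, so every group holds a single repeated value: its standard deviation is effectively 0 when the group has several rows and NaN when it has only one. Since NaN < 3 is False, the single-row groups are the ones that get dropped. Grouping by a different column, such as Item_Type, makes this kind of filter more meaningful.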
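
Here is a minimal sketch of the same filter on a hypothetical frame (values are made up), grouping by Item_Type instead so that the groups actually vary in spread:

```python
import pandas as pd

# Hypothetical miniature: three item types with different spreads
toy = pd.DataFrame({
    'Item_Type': ['Dairy', 'Dairy', 'Dairy', 'Snacks', 'Snacks', 'Seafood'],
    'Item_Weight': [10.0, 11.0, 12.0, 5.0, 20.0, 8.0],
})

def low_spread(group):
    # True when the sample standard deviation of the group's weights is below 3
    return group['Item_Weight'].std() < 3

kept = toy.groupby('Item_Type').filter(low_spread)
print(kept)
```

Only the Dairy rows survive: Snacks has too large a spread, and the single Seafood row has an undefined (NaN) standard deviation, so its comparison is False.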

Applying Our Own Functions

Pandas’ apply() function applies a function along an axis of the DataFrame. When using it with the GroupBy function, we can apply any function to the grouped result.

For example, if I wanted to center the Item_MRP values around the mean of their establishment year group, I could use the apply() function to do just that:

df_apply = df.groupby(['Outlet_Establishment_Year'])['Item_MRP'].apply(lambda x: x - x.mean())
df_apply

Here, the values have been centered, and you can check whether the item was sold at an MRP above or below the mean MRP for that year.
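
Because centering produces one value per original row, the same computation can also be written with transform(), which keeps the original row index. Depending on the pandas version and the group_keys setting, apply() may instead add the group key as an extra index level. A minimal sketch on hypothetical data (values are made up):

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    'Outlet_Establishment_Year': [1999, 2007, 1999, 2007],
    'Item_MRP': [100.0, 50.0, 200.0, 150.0],
})

# Subtract each year's mean MRP from the rows of that year
centered = toy.groupby('Outlet_Establishment_Year')['Item_MRP'].transform(
    lambda x: x - x.mean()
)
print(centered)
```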

Conclusion

I’m sure you can see how amazing the Pandas GroupBy function is and how useful it can be for analyzing your data. I hope this article helped you understand the function better! But practice makes perfect, so start with the super impressive datasets on our very own DataHack platform. Moving forward, you can read about how you can analyze your data using a pivot table in Pandas.

I hope you enjoyed the article and now have a clear understanding of how to group and aggregate data with Pandas GroupBy.

Key Takeaways

groupby() is a powerful Pandas method that lets you group data based on one or more columns.
You can apply many operations to a GroupBy object, including aggregation functions like sum(), mean(), and count(), as well as lambda functions and other custom functions through apply().
The output of a groupby() operation can be a Pandas Series or DataFrame, depending on the operation and the data structure.

Q1. Can we use groupby without aggregate function in pandas?

A. Yes, we can use groupby without an aggregate function in pandas. In this case, groupby will return a GroupBy object that can be used to perform further operations.

Q2. What is the difference between groupby and groupby agg in pandas?

A. groupby() splits a DataFrame into groups based on one or more columns and returns a GroupBy object; on its own, it computes nothing. That object can then be used for a variety of operations on the groups, such as aggregating, transforming, filtering, or applying custom functions. agg() is a method of the GroupBy object meant specifically for aggregation. It allows us to specify one or more aggregation functions to apply to each group and returns a DataFrame or Series containing the results.

Q3. How can we handle missing values in a grouped DataFrame?

A. You can use the dropna method to handle missing values in a grouped DataFrame. This method can be applied before or after the grouping process to ensure that the missing values do not affect the analysis.
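
Missing values in the grouping column itself are a related case: by default, groupby() leaves those rows out, and the dropna parameter (available in pandas 1.1 and later) controls this. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with missing values in the grouping column
toy = pd.DataFrame({
    'Outlet_Size': ['Small', np.nan, 'Small', np.nan, 'High'],
    'Item_Outlet_Sales': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# By default, rows whose grouping key is missing are left out
default_sums = toy.groupby('Outlet_Size')['Item_Outlet_Sales'].sum()

# dropna=False keeps them as their own group
with_nan_sums = toy.groupby('Outlet_Size', dropna=False)['Item_Outlet_Sales'].sum()

print(default_sums)
print(with_nan_sums)
```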

Q4. What is the role of categorical data in groupby operations?

A. Categorical data plays a significant role in groupby operations as it allows for efficient grouping and aggregation. When using categorical data, pandas can perform the groupby operation faster and use less memory. You can convert a column to a categorical type using astype('category') before grouping.
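
One thing to watch with categorical keys is the observed parameter, which decides whether categories that never occur in the data still show up in the result. A minimal sketch on hypothetical data; it uses pd.Categorical rather than astype('category') only so that an unused category can be declared:

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 1', 'Tier 3'],
    'Item_Outlet_Sales': [100.0, 200.0, 300.0, 400.0],
})

# Declare 'Tier 2' as a valid category even though no row uses it
toy['Outlet_Location_Type'] = pd.Categorical(
    toy['Outlet_Location_Type'], categories=['Tier 1', 'Tier 2', 'Tier 3']
)

# observed=True reports only the categories that occur in the data
observed_only = toy.groupby('Outlet_Location_Type', observed=True)['Item_Outlet_Sales'].sum()

# observed=False also reports unused categories (the sum of an empty group is 0)
all_categories = toy.groupby('Outlet_Location_Type', observed=False)['Item_Outlet_Sales'].sum()

print(observed_only)
print(all_categories)
```

Passing observed explicitly is a safe habit, since its default value differs between pandas versions.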

Q5. How can we work with unique values in groupby operations?

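A. To get the unique values within each group, you can use the unique method on a grouped column. If you only need to know how many distinct values there are, use the nunique method, which returns the number of unique values in each group.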
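
A minimal sketch of both calls on hypothetical data (values are made up):

```python
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    'Outlet_Type': ['Grocery', 'Grocery', 'Grocery', 'Supermarket', 'Supermarket'],
    'Item_Type': ['Dairy', 'Snacks', 'Dairy', 'Dairy', 'Dairy'],
})
grouped = toy.groupby('Outlet_Type')['Item_Type']

# Number of distinct item types per outlet type
distinct_counts = grouped.nunique()

# The distinct item types themselves, one array per outlet type
distinct_values = grouped.unique()

print(distinct_counts)
print(distinct_values)
```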

Q6. How can we use dictionaries in groupby operations?

A. Dictionaries can be used in groupby operations to specify different aggregation functions for different columns. You can pass a dict to the agg method where keys are column names and values are aggregation functions.


