The Clever Ingredient that decides the rise and the fall of your Machine Learning Model- Exploratory Data Analysis

Megha Setia 24 Nov, 2020 • 5 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Well! We all love cakes. If you take a closer look at the baking process, you will notice how the right blend of ingredients, plus one clever leavening agent (baking powder), decides the rise and the fall of your cake.

`Baking the cake` might sound off-track in a technical article, but I find it a relatable and delicious analogy for understanding the importance of EDA in the Data Science pipeline.

Exploratory Data Analysis is to the Data Science pipeline what the clever leavening agent (baking powder) is to baking a cake.

Before your mouth starts watering for cake, as mine already is, let’s understand it better.

What exactly is Exploratory Data Analysis?

Exploratory Data Analysis is an approach to data analysis that employs a variety of techniques to:

Gain intuition about the data.
Conduct sanity checks (to be sure that the insights we draw actually come from the right dataset).
Find out where data is missing.
Check if there are any outliers.
Summarize the data.

Let’s take the famous `BLACK FRIDAY SALES` case study to understand why we need EDA.

Exploratory Data Analysis - Black Friday Sales Data

The core problem is to understand customer behavior by predicting the purchase amount. Stated that way, though, it is too abstract and leaves you baffled about what to do with the data, especially when you have so many different products across various categories.

Before reading further, give a little thought to this question: would you put all the ingredients available in the kitchen into the oven, just as they are, to bake the cake?

Obviously, the answer is no! Before you feed the entire dataset as it is into a Machine Learning model, you would want to:

Draw out important Insights
1. Variable identification (whether the data contains categorical variables, numerical variables, or a mix of both).
2. The behavior of variables (whether their values range over 0-10 or 0-1 million).
3. Relationships between variables (how the variables depend on each other).
Check Data Consistency
1. Is all the data present? (If we have collected data for three years, a single missing week can be a problem in later stages.)
2. Are there any missing values?
3. Are there any outliers in the dataset? (e.g. a person aged 2000 years is definitely an anomaly)
Feature Engineering
1. Create new features from the existing raw features in the dataset.
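
The first consistency check above (no week missing from the collection period) can be sketched with pandas. This is only an illustration: the `date` column, the helper name `find_missing_weeks`, and the sample rows are all hypothetical and are not part of the Black Friday data.

```python
import pandas as pd

def find_missing_weeks(df, date_col="date"):
    """Return the week-start dates (Mondays) that have no rows in df."""
    dates = pd.to_datetime(df[date_col])
    # Map every record to the Monday that starts its week.
    week_starts = (dates - pd.to_timedelta(dates.dt.weekday, unit="D")).dt.normalize()
    # Every Monday we expect to see between the first and the last record.
    expected = pd.date_range(week_starts.min(), week_starts.max(), freq="7D")
    return expected.difference(pd.DatetimeIndex(week_starts.unique()))

# Hypothetical data: nothing was recorded in the week starting 2020-01-13.
sales = pd.DataFrame({"date": ["2020-01-06", "2020-01-08", "2020-01-21", "2020-01-27"]})
missing_weeks = find_missing_weeks(sales)
print(missing_weeks)
```

An empty result means every week in the range has at least one record; anything else is worth investigating before modelling.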

**EDA, in essence, can make or break any machine learning model.**

Steps In Exploratory Data Analysis

Exploratory Data Analysis process

There are 5 steps in EDA:

Variable Identification: In this step, we identify every variable by discovering its type. According to our needs, we can change the datatype of any variable.
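
A minimal sketch of this step with pandas, assuming a small hand-made frame that only borrows column names from the Black Friday data (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical rows shaped like the Black Friday data; the values are made up.
df = pd.DataFrame({
    "City_Category": ["A", "B", "C", "B"],
    "Marital_Status": [0, 1, 0, 1],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Discover the type pandas inferred for each variable.
print(df.dtypes)

# Marital_Status is stored as 0/1 integers but is really a label,
# so convert it (and City_Category) to the categorical dtype.
df["Marital_Status"] = df["Marital_Status"].astype("category")
df["City_Category"] = df["City_Category"].astype("category")

categorical_cols = df.select_dtypes(include="category").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
print(categorical_cols, numerical_cols)
```

Splitting the columns into two lists like this makes it easy to route each variable to the right kind of analysis later on.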
Statistics plays an important role in data analysis. It is a set of rules and concepts for analyzing and interpreting data. Different types of analysis need to be done depending on the requirements. Let’s study them.
Univariate Analysis: In Univariate Analysis, we study the individual characteristics of every feature/variable available in the dataset. There are two types of features: continuous and categorical. In the image below, I have given a cheat sheet of the graphical techniques that can be applied to analyze them.

Continuous Variable:

To showcase Univariate Analysis on one of the continuous variables of the Black Friday Sales dataset, `Purchase`, I have created a function that takes the data as input and plots a KDE graph describing the characteristics of the feature.
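
The original function is not reproduced in this text, so here is a minimal sketch of what such a helper could look like, assuming pandas, matplotlib, and scipy are installed. The name `plot_kde` and the randomly generated purchase amounts are hypothetical stand-ins.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_kde(data, feature="Purchase"):
    """Plot a kernel density estimate of one continuous feature."""
    fig, ax = plt.subplots(figsize=(8, 4))
    data[feature].plot.kde(ax=ax)  # pandas delegates the estimate to scipy
    # Mark mean and median so any skew is visible at a glance.
    ax.axvline(data[feature].mean(), linestyle="--", label="mean")
    ax.axvline(data[feature].median(), linestyle=":", label="median")
    ax.set_xlabel(feature)
    ax.set_title(f"Distribution of {feature}")
    ax.legend()
    return ax

# Hypothetical purchase amounts standing in for the real column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Purchase": rng.gamma(shape=2.0, scale=4000.0, size=500)})
ax = plot_kde(demo)
```

When the mean line sits well to one side of the median line, the distribution is skewed, which is useful to know before choosing a transformation or a model.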

Categorical Variable

To showcase Univariate Analysis on the categorical variables of the Black Friday Sales dataset, `City_Category` and `Marital_Status`, I have created a function that takes the data and the features as input and returns a count plot showing the frequency of each category in the feature.
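
As a rough sketch of such a count-plot helper (not the original code), assuming pandas and matplotlib; the function name `plot_counts` and the six sample rows are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_counts(data, features):
    """Draw one bar chart of category frequencies per feature."""
    fig, axes = plt.subplots(1, len(features),
                             figsize=(5 * len(features), 4), squeeze=False)
    for ax, feature in zip(axes[0], features):
        counts = data[feature].value_counts().sort_index()
        ax.bar(counts.index.astype(str), counts.values)
        ax.set_xlabel(feature)
        ax.set_ylabel("count")
    fig.tight_layout()
    return axes[0]

# Hypothetical rows; only the column names come from the dataset.
demo = pd.DataFrame({
    "City_Category": ["A", "B", "B", "C", "B", "C"],
    "Marital_Status": [0, 1, 0, 0, 1, 0],
})
axes = plot_counts(demo, ["City_Category", "Marital_Status"])
```
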
Bivariate Analysis: In Bivariate Analysis, we study the relationship between any two variables, which can be categorical-continuous, categorical-categorical, or continuous-continuous (as shown in the cheat sheet given below, along with the graphical techniques used to analyze them).

In Black Friday Sales, we have categorical independent variables and a continuous target variable, so we can do categorical-continuous analysis to understand the relationship between them.
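
One simple way to do a categorical-continuous analysis is to summarise the target per category. The sketch below assumes pandas; `mean_by_category` is a hypothetical helper, and the sample values are invented so that the largest group and the highest-spending group differ.

```python
import pandas as pd

def mean_by_category(data, category, target="Purchase"):
    """Summarise a continuous target for each level of a categorical feature."""
    return (
        data.groupby(category)[target]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
    )

# Hypothetical rows: B has the most customers, C the highest average purchase.
demo = pd.DataFrame({
    "City_Category": ["A", "B", "B", "B", "C", "C"],
    "Purchase": [8000, 9000, 9500, 8500, 10000, 11000],
})
summary = mean_by_category(demo, "City_Category")
print(summary)
```

Reading the `count` and `mean` columns side by side is what exposes cases where the most frequent category is not the one with the highest average.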

Inference:
From the two analyses above, Univariate Analysis showed that the number of customers is highest in city category B. But Bivariate Analysis between `City_Category` and `Purchase` tells a different story: the average purchase is highest in city category C. Inferences like these give us better intuition about the data, which in turn helps with data preparation and Feature Engineering.

It is important to note that relying only on Univariate and Bivariate Analysis can be quite misleading, so the inferences drawn from them should be validated with Hypothesis Testing. We can use a t-test, a chi-square test, or ANOVA to quantify whether two samples differ significantly from each other. Here I have created a function that analyzes the relationship between a continuous and a categorical variable and returns the t-statistic.
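
The function itself is not shown in this text. A minimal sketch of a comparable helper, assuming `scipy.stats.ttest_ind`, might look like the following. The Welch variant (`equal_var=False`), the name `two_group_ttest`, and the sample rows are my assumptions, not the original implementation.

```python
import pandas as pd
from scipy import stats

def two_group_ttest(data, category, target="Purchase", alpha=0.05):
    """Welch's t-test comparing the mean of target between two categories."""
    levels = data[category].dropna().unique()
    if len(levels) != 2:
        raise ValueError("a t-test needs exactly two groups; use ANOVA for more")
    group_a = data.loc[data[category] == levels[0], target].dropna()
    group_b = data.loc[data[category] == levels[1], target].dropna()
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    # The decision compares the p-value, not the t-statistic, with alpha.
    return t_stat, p_value, bool(p_value < alpha)

# Hypothetical rows in which both groups have the same mean purchase (250).
demo = pd.DataFrame({
    "Marital_Status": [0, 0, 0, 0, 1, 1, 1, 1],
    "Purchase": [100, 200, 300, 400, 110, 190, 310, 390],
})
t_stat, p_value, significant = two_group_ttest(demo, "Marital_Status")
print(t_stat, p_value, significant)
```

Returning the p-value alongside the t-statistic keeps the comparison with the significance level explicit.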

In Univariate Analysis we observed a large difference between the number of married and unmarried customers. The t-test gives a t-statistic of 0.89. Strictly speaking, it is the p-value, not the t-statistic, that is compared with the 0.05 significance level; but a t-statistic this small in magnitude corresponds to a p-value well above 0.05, so we conclude that there is no significant difference between the average purchase of single and married customers.
Missing Value Treatment: The primary purpose of this step is to find out whether there is a specific reason why values are missing and to decide how to treat them. If we don’t treat them, they can distort the patterns in the data, which in turn can degrade the model’s performance. Common treatments include filling them with the mean, median, or mode, or using imputers.
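
A minimal sketch of the fill-based treatments, assuming pandas. Using the median for numeric columns and the mode for everything else is just one reasonable default, and `impute_missing` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_missing(data):
    """Fill numeric columns with the median and other columns with the mode."""
    filled = data.copy()
    for col in filled.columns:
        if not filled[col].isna().any():
            continue  # nothing to treat in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

# Hypothetical rows with one gap in each column.
demo = pd.DataFrame({
    "Purchase": [8000.0, np.nan, 12000.0, 10000.0],
    "City_Category": ["A", "B", None, "B"],
})
print(demo.isna().sum())  # how many values are missing per column
clean = impute_missing(demo)
print(clean)
```

Checking `isna().sum()` first matters: a column that is mostly empty may be better dropped or investigated than filled.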
Outlier Removal: It is essential to understand whether outliers are present, as some predictive models are sensitive to them, and to treat them accordingly.
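
The article does not name a specific detection method. One common choice is the interquartile range (IQR) rule, sketched here with pandas as an assumption; the function name `iqr_outliers` and the ages are invented, echoing the "age 2000" example.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical ages with one impossible value.
ages = pd.Series([23, 31, 27, 45, 38, 2000, 29])
mask = iqr_outliers(ages)
print(ages[mask])     # the flagged values
cleaned = ages[~mask]  # one possible treatment: drop them
```

Whether to drop, cap, or keep a flagged value depends on whether it is a data error or a genuine extreme, so the mask is best treated as a prompt for inspection.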

End Notes

In this article, I have briefly discussed the importance of EDA in the Data Science pipeline and the steps involved in a proper analysis. I have also shown how wrong or incomplete analysis can be quite misleading and can considerably affect the performance of machine learning models.

“If you don’t roast your data, you are just another person with an opinion.”;)


Responses From Readers

Sanya Narang 11 Oct, 2020

This piece of knowledge is amazing.

M RACHIT 12 Oct, 2020

Thank you for this content, I was googling this information from many days.

Aashish 12 Oct, 2020

Really informative and crisp explanation

Arsalan 12 Oct, 2020

Amazing Work, Good and Detailed Explanation

Tanushree vinayak 12 Oct, 2020

Amazing doc 🙂 well detailed touches the necessary points succinctly.. Thanks for the share 👍

Abdullah 11 Apr, 2021

Wish you explained the univariate and bivariate analysis' codes in detail. Great article nevertheless!
