Well! We all love cakes. If you take a closer look at the baking process, you will notice how the right blend of ingredients and one clever leavening agent, baking powder, can decide the rise and fall of your cake.
`Baking the cake` might sound off-track in a technical article, but I believe it is a relatable and delicious analogy for understanding the importance of EDA in the Data Science pipeline.
If baking the cake is the Data Science pipeline, then the clever leavening agent (baking powder) is Exploratory Data Analysis.
Before your mouth starts watering for a cake, as mine already is, let’s understand:
What exactly is Exploratory Data Analysis?
Exploratory Data Analysis is an approach to data analysis that employs a variety of techniques to:
- Gain intuition about the data.
- Conduct sanity checks (to be sure that the insights we draw actually come from the right dataset).
- Find out where data is missing.
- Check if there are any outliers.
- Summarize the data.
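The goals above map onto a handful of pandas calls. The following is a minimal sketch, assuming pandas and NumPy are installed; the helper name `first_look` and the toy rows are made up for illustration and are not the real Black Friday data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Black Friday data; every value here is made up.
df = pd.DataFrame({
    "City_Category": ["A", "B", "B", "C", "B", None],
    "Marital_Status": [0, 1, 0, 1, 1, 0],
    "Purchase": [8370.0, 15200.0, np.nan, 19215.0, 7969.0, 1422.0],
})

def first_look(data: pd.DataFrame) -> dict:
    """Collect the basic facts EDA usually starts with (hypothetical helper)."""
    return {
        "shape": data.shape,                          # sanity check: rows x columns
        "dtypes": data.dtypes.astype(str).to_dict(),  # numeric vs. text variables
        "missing": data.isnull().sum().to_dict(),     # where data is missing
        "summary": data.describe(),                   # min/max hint at outliers
    }

report = first_look(df)
print(report["shape"])
print(report["missing"])
print(report["summary"])
```

The `min` and `max` rows of the summary are often the quickest hint that an outlier is present.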
Let’s take the famous `BLACK FRIDAY SALES` case study to understand why we need EDA.
The core problem is to understand customer behavior by predicting the purchase amount. But isn’t that too abstract? It can leave you baffled about what to do with the data, especially when you have so many different products across various categories.
Before reading further, give a little thought to this question: would you put all the ingredients available in the kitchen into the oven, just as they are, to bake the cake?
Obviously, the answer is no! Before you feed the entire dataset as it is into a Machine Learning model, you would want to:
- Draw out important insights
  - Variable identification (whether the data contains categorical variables, numerical variables, or a mix of both).
  - The behavior of variables (whether values range from 0-10 or from 0-1 million).
  - Relationships between variables (how variables depend on each other).
- Check data consistency
  - Ensure all the data is present. (If we have collected data for three years, any missing week can be a problem in later stages.)
  - Are there any missing values?
  - Are there any outliers in the dataset? (e.g. a person aged 2000 years is definitely an anomaly.)
- Feature engineering
  - Create new features from the existing raw features in the dataset.
**EDA, in essence, can make or break any machine learning model.**
Steps In Exploratory Data Analysis
There are 5 steps in EDA:
- Variable Identification: In this step, we identify every variable by discovering its type. According to our needs, we can change the datatype of any variable.
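As a rough illustration of this step, the sketch below inspects the stored types and converts columns whose type does not match their meaning. It assumes pandas is installed; the toy rows are invented, and treating `Marital_Status` as an integer-coded category is an assumption about how the data is stored.

```python
import pandas as pd

# Invented rows; only the column names come from the Black Friday dataset.
df = pd.DataFrame({
    "City_Category": ["A", "B", "C", "B"],
    "Marital_Status": [0, 1, 0, 1],   # assumed integer-coded, but really a category
    "Purchase": [8370, 15200, 1422, 7969],
})

# Step 1: discover the type of every variable.
print(df.dtypes)

# Step 2: change the datatype where the stored type does not match the meaning.
df["Marital_Status"] = df["Marital_Status"].astype("category")
df["City_Category"] = df["City_Category"].astype("category")

# Split the columns by kind so later steps can treat them differently.
categorical_cols = df.select_dtypes(include="category").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
print(categorical_cols, numerical_cols)
```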
Statistics plays an important role in data analysis. It is a set of rules and concepts for the analysis and interpretation of data. Different types of analysis need to be done depending on the requirements. Let’s study them.
- Univariate Analysis: In Univariate Analysis, we study individual characteristics of every feature/variable available in the dataset. There are two types of features – Continuous and Categorical. In the image below, I have given a cheat sheet of various graphical techniques that can be applied to analyze them.
To showcase univariate analysis on one of the continuous variables of the Black Friday Sale dataset, `Purchase`, I have created a function that takes the data as input and plots a KDE graph explaining the characteristics of the feature.
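The function itself is not reproduced here. A minimal sketch of what such a function might look like follows, assuming pandas, Matplotlib, and SciPy are installed; the name `plot_kde`, the mean and median markers, and the synthetic data are illustrative assumptions, not the author's code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_kde(data: pd.DataFrame, feature: str = "Purchase"):
    """Draw a kernel density estimate of one continuous feature; return the Axes."""
    fig, ax = plt.subplots(figsize=(8, 4))
    values = data[feature].dropna()
    values.plot.kde(ax=ax)  # pandas delegates the density estimate to SciPy
    # Mark mean and median so any skew is easy to spot.
    ax.axvline(values.mean(), linestyle="--", color="tab:red", label="mean")
    ax.axvline(values.median(), linestyle=":", color="tab:green", label="median")
    ax.set_xlabel(feature)
    ax.set_title(f"KDE of {feature}")
    ax.legend()
    return ax

# Synthetic, right-skewed values standing in for the real Purchase column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Purchase": rng.gamma(shape=2.0, scale=4000.0, size=500)})
ax = plot_kde(toy)
```

In a script, call `plt.show()` afterwards to display the figure; a gap between the mean and median lines is a quick visual cue that the distribution is skewed.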
To showcase univariate analysis on the categorical variables of the Black Friday Sale dataset, `City_Category` and `Marital_Status`, I have created a function that takes the data and the features as input and returns a count plot showing the frequency of each category in the feature.
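Again, the original function is not shown. One way it could be sketched, assuming pandas and Matplotlib are installed, is below; `plot_counts` is a hypothetical name and the rows are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_counts(data: pd.DataFrame, features: list):
    """Draw one bar chart of category frequencies per feature; return the Axes."""
    fig, axes = plt.subplots(
        1, len(features), figsize=(5 * len(features), 4), squeeze=False
    )
    for ax, feature in zip(axes[0], features):
        # Count rows per category; sort by label so the bars have a stable order.
        counts = data[feature].value_counts().sort_index()
        counts.plot.bar(ax=ax, rot=0)
        ax.set_xlabel(feature)
        ax.set_ylabel("count")
    fig.tight_layout()
    return axes[0]

# Invented rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "City_Category": ["A", "B", "B", "C", "B", "C"],
    "Marital_Status": [0, 1, 0, 0, 1, 0],
})
axes = plot_counts(toy, ["City_Category", "Marital_Status"])
```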
- Bivariate Analysis: In bivariate analysis, we study the relationship between any two variables, which can be categorical-continuous, categorical-categorical, or continuous-continuous (as shown in the cheat sheet given below, along with the graphical techniques used to analyze them).
In Black Friday Sales, we have categorical independent variables and a continuous target variable, so we can do categorical-continuous analysis to understand the relationship between them.
From the above two analyses, we observed in univariate analysis that the number of customers is highest in city category B. But bivariate analysis between `City_Category` and `Purchase` tells a different story: the average purchase is highest in city category C. Such inferences give us better intuition about the data, which in turn helps with data preparation and feature engineering.
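The contrast between the two views can be reproduced with a `groupby`. This is a sketch on toy rows that were deliberately constructed to mirror the pattern described (most customers in B, highest average in C); they are not results from the real dataset.

```python
import pandas as pd

# Toy rows, built so that B has the most customers but C spends the most on average.
df = pd.DataFrame({
    "City_Category": ["A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "Purchase": [7000, 9000, 8000, 9500, 8500, 10000, 11000, 12000, 10000],
})

# Univariate view: how many rows fall in each category?
counts = df["City_Category"].value_counts()

# Bivariate view: how does the continuous target vary across categories?
by_city = df.groupby("City_Category")["Purchase"].agg(["count", "mean", "median"])

print(by_city)
print("Most customers:", counts.idxmax())
print("Highest average purchase:", by_city["mean"].idxmax())
```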
It is important to note that relying only on univariate and bivariate analysis can be quite misleading, so the inferences drawn from them should be validated with hypothesis testing. We can run a t-test, a chi-square test, or ANOVA, which let us quantify whether samples differ significantly from each other. Here I have created a function to analyze continuous-categorical relationships that returns the t-statistic value.
In univariate analysis, we observed a large difference between the number of married and unmarried customers. The t-test returns a value of 0.89. Strictly speaking, it is the p-value, not the t-statistic, that is compared with the significance level of 0.05; reading 0.89 as the p-value, it is well above 0.05, so we cannot reject the null hypothesis, which indicates no significant difference between the average purchase of single and married customers.
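A sketch of such a test, assuming SciPy is installed, is shown below. The function name `compare_group_means`, the choice of Welch's variant (`equal_var=False`), and the random toy data are assumptions for illustration; the sketch returns both the t-statistic and the p-value, since the p-value is what gets compared with the significance level.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_group_means(data, category, target, alpha=0.05):
    """Two-sample t-test on `target` between the two levels of `category`."""
    levels = data[category].dropna().unique()
    if len(levels) != 2:
        raise ValueError("a two-sample t-test needs exactly two groups")
    a = data.loc[data[category] == levels[0], target].dropna()
    b = data.loc[data[category] == levels[1], target].dropna()
    # Welch's t-test: does not assume the two groups share a variance.
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    return {
        "t_statistic": float(t_stat),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),  # reject H0 only if p < alpha
    }

# Random toy data: both groups are drawn from the same distribution.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Marital_Status": rng.integers(0, 2, size=400),
    "Purchase": rng.normal(loc=9000, scale=5000, size=400),
})
result = compare_group_means(toy, "Marital_Status", "Purchase")
print(result)
```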
- Missing Value Treatment: The primary goal of this step is to find out whether there is a specific reason why values are missing and to decide how to treat them. If we don’t treat them, they can distort the patterns in the data, which in turn can degrade the model’s performance. Some ways to treat missing values are filling them with the mean, median, or mode, or using imputers.
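A minimal pandas sketch of the simple fill strategies follows. The rows are invented, and using the median for the numeric column and the mode for the categorical one is just one reasonable choice, not a rule.

```python
import numpy as np
import pandas as pd

# Invented rows with one missing value per column.
df = pd.DataFrame({
    "City_Category": ["A", "B", None, "B", "C"],
    "Purchase": [8000.0, np.nan, 12000.0, 9000.0, 100000.0],
})

# First find out how much is missing, and where.
print(df.isnull().sum())

filled = df.copy()
# Numeric column: the median is less affected by extreme values than the mean.
filled["Purchase"] = filled["Purchase"].fillna(filled["Purchase"].median())
# Categorical column: fall back to the most frequent category (the mode).
filled["City_Category"] = filled["City_Category"].fillna(
    filled["City_Category"].mode()[0]
)
print(filled)
```

Working on a copy keeps the original frame available, which helps when investigating why the values were missing in the first place.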
- Outlier Removal: It is essential to check for outliers, because some predictive models are sensitive to them, and to treat them accordingly.
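One common way to flag outliers is the interquartile range (IQR) rule, sketched below. This is a swapped-in illustration rather than the method used in the case study; the helper `iqr_outlier_mask`, the factor `k=1.5`, and the `Age` values are assumptions, with the 2000-year-old person borrowed from the earlier example.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical ages, including the obviously anomalous 2000.
ages = pd.Series([23, 31, 27, 45, 38, 29, 2000], name="Age")
mask = iqr_outlier_mask(ages)

print(ages[mask])     # the flagged rows
cleaned = ages[~mask] # the data with flagged rows removed
```

Whether flagged rows should be dropped, capped, or kept depends on the model and on whether the value is a data error or a genuine extreme.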
In this article, I have briefly discussed the importance of EDA in the Data Science pipeline and the steps involved in a proper analysis. I have also shown how wrong or incomplete analysis can be quite misleading and can considerably affect the performance of machine learning models.
“If you don’t roast your data, you are just another person with an opinion.”;)