Understanding how **EDA** is done in Python

Various steps involved in the Exploratory Data Analysis

Performing EDA on a given dataset

**Exploratory data analysis**, popularly known as **EDA**, is the process of performing initial investigations on a dataset to discover its structure and content. It is often known as *Data Profiling*. It is an essential step in the data analysis journey, which runs from business understanding all the way to the deployment of the models created.

EDA is where we get a basic understanding of the data in hand, which then helps us in the further process of *Data Cleaning & Data Preparation*.

We will be covering a wide range of topics under EDA starting from the basic data exploration (structure based) to the normalization and the standardization of the data. In this article, we will be using the **Python** programming language to perform the EDA steps.

Let’s see what we are going to cover!

Introducing the Dataset

Importing the Python Libraries

Loading the Dataset in Python

Structure-Based Data Exploration

Handling Duplicates

Handling Outliers

Handling Missing Values

Univariate Analysis

Bivariate Analysis

For this article, we will be using the __Black Friday dataset__ which can be downloaded from here.

Let’s import all the Python libraries we will be needing for our analysis, namely *NumPy*, *Pandas*, *Matplotlib* and *Seaborn*.

Now let’s load our dataset into Python. We will be reading the data from a CSV (comma-separated values) file into a **Pandas DataFrame**, naming it df (the name used in all the code that follows).
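As a rough sketch of the loading step: the file name of the downloaded dataset is not given here, so the example below reads a tiny in-memory CSV instead. The column names come from this article, but the three rows are made up purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file.
# Only the column names come from the article; the values are invented.
csv_text = """User_ID,Gender,Purchase
1000001,F,8370
1000002,M,15200
1000003,M,1422
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With the real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.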

Let’s begin with the basic exploration of the data we have!

It is the very first step in EDA, which can also be referred to as *Understanding the MetaData*! That’s correct, ‘Data about the Data’.

It is here that we get the description of the data we have in our data frame.

Let’s try now.

**Display the FIRST 5 Observations**

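**Display the Number of Variables & Number of Observations**

df.shape gives us a tuple having 2 values: the number of observations (rows) followed by the number of variables (columns). Note that shape is an attribute, not a method, so it is written without parentheses.

df.dtypes

This gives us the *type of variables* in our dataset.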
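Here is a minimal sketch of these first structure checks on a made-up miniature data frame. Only the column names are taken from this article; the values are assumptions for illustration.

```python
import pandas as pd

# Made-up miniature frame; only the column names come from the article.
df = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000003, 1000004, 1000005],
    "Gender": ["F", "F", "M", "M", "F", "M"],
    "Purchase": [8370, 15200, 1422, 1057, 7969, 15227],
})

first_rows = df.head()       # first 5 observations by default
n_rows, n_cols = df.shape    # an attribute, so no parentheses
print(first_rows)
print(n_rows, "observations,", n_cols, "variables")
print(df.dtypes)             # one dtype per variable
```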

**Count the Number of Non-Missing Values for each Variable**

df.count()

This gives the number of non-missing values for each variable and is extremely useful while *handling missing values* in a data frame.

Now to know about the *characteristics of the data* set we will use the df.describe() method which by default gives the summary of all the *numerical* variables present in our data frame.

df.describe()

Using the *df.describe()* method we get the following characteristics of the numerical variables: count (number of non-missing values), mean, standard deviation, and the 5 point summary, which includes minimum, first quartile, second quartile (the median), third quartile, and maximum.

**What about the categorical variables?**

df.describe(include = 'all')

By providing the *include* argument and assigning it the value *‘all’*, we get the summary of the categorical variables too. For the categorical variables, we get the characteristics: count (number of non-missing values), unique (number of unique values), top (the most frequent value), and freq (the frequency of the most frequent value).
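To make the difference concrete, here is a small sketch comparing the two calls on a made-up data frame with one categorical and one numerical variable (the values are assumptions for illustration).

```python
import pandas as pd

# Invented values; one categorical and one numerical variable.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "M"],
    "Purchase": [100, 200, 300, 400],
})

numeric_summary = df.describe()            # numerical variables only
full_summary = df.describe(include="all")  # adds unique, top and freq rows

print(numeric_summary)
print(full_summary)
```

In the second summary, the rows that do not apply to a variable (for example, mean for Gender) are shown as NaN.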

df.info()

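With just this one command, *df.info()*, we get the __complete information__ of the data in hand: the number of observations, the name, non-missing count, and type of each variable, along with the memory used by the data frame.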

With this, we are done with the **Structure-Based** Exploratory Data Analysis and now it’s time to get into the **Content-Based** Exploratory Data Analysis.

Handling duplicates involves 2 steps: *Detecting duplicates* and *Removing duplicates*.

**To check for the duplicates in our data**

df.duplicated()

Here, duplicates mean exactly the same **observations** repeating themselves. df.duplicated() returns *True* for every observation that repeats an earlier one. As we can see, there are no duplicate observations in our data, and hence each observation is unique.

However, **to remove the duplicates (if any)** we can use the code:

df.drop_duplicates()

Further, we can see that there are duplicate values in some of the variables like *User_ID*. How can we remove those?

df.drop_duplicates(subset='User_ID')

This by default keeps just the first occurrence of each duplicated value in the *User_ID* variable and drops the rest of them. Hold on! Here we do not want to remove the duplicate values from the *User_ID* variable **permanently**. By default, *drop_duplicates()* returns a new data frame and leaves df untouched, so just to see the output without making any permanent change to our data frame, we can state that explicitly and write the command as:

df.drop_duplicates(subset='User_ID' , inplace=False)

As we can see, the values in the *User_ID* variable are all unique now.

So this is how detection and removal of duplicated observations/values are done in a data frame.
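The duplicate handling described in this section can be sketched on a tiny made-up data frame. The Product_ID column and all the values are assumptions for illustration.

```python
import pandas as pd

# Invented rows: the rows at index 2 and 3 are exact copies of each other.
df = pd.DataFrame({
    "User_ID": [1, 1, 2, 2, 3],
    "Product_ID": ["A", "B", "A", "A", "C"],
})

n_dup_rows = df.duplicated().sum()                   # repeats of an earlier row
deduped_rows = df.drop_duplicates()                  # drops exact repeats only
one_per_user = df.drop_duplicates(subset="User_ID")  # first row per User_ID

print(n_dup_rows)
print(deduped_rows)
print(one_per_user)
print(len(df))  # the original frame still has all 5 rows
```

Summing the boolean output of `duplicated()` is a quick way to count duplicates without scanning the whole column of True and False values.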

*What are Outliers?* Outliers are the __extreme values__ on the low and the high side of the data. Handling Outliers involves 2 steps: Detecting outliers and Treatment of outliers.

**Detecting Outliers**

For this we consider any variable from our data frame and determine the *upper cutoff* and the *lower cutoff* with the help of any of the 3 methods, namely:

- Percentile Method
- IQR Method
- Standard Deviation Method

Let’s consider the *Purchase* variable. Now we will be determining if there are any outliers in our data set using the **IQR (Interquartile Range) Method**. What is this method about? You will get to know about it as we go along the process, so let’s start. First, we find the minimum (p0), maximum (p100), first quartile (q1), second quartile (q2), third quartile (q3), and the IQR (interquartile range, q3 - q1) of the values in the Purchase variable.

p0=df.Purchase.min()
p100=df.Purchase.max()
q1=df.Purchase.quantile(0.25)
q2=df.Purchase.quantile(0.5)
q3=df.Purchase.quantile(0.75)
iqr=q3-q1

Now that we have all the values we need, we find the lower cutoff (**lc**) and the upper cutoff (**uc**) of the values.

lc = q1 - 1.5*iqr
uc = q3 + 1.5*iqr

lc

uc

We have the upper cutoff and the lower cutoff, what now? We will be using the convention:

**If lc < p0 → There are NO Outliers on the lower side**

**If uc > p100 → There are NO Outliers on the higher side**

print( "p0 = " , p0 ,", p100 = " , p100 ,", lc = " , lc ,", uc = " , uc)

Clearly lc < p0 so there are no outliers on the lower side. But uc < p100 so there are outliers on the higher side. We can get a pictorial representation of the outlier by drawing the **box plot**.

df.Purchase.plot(kind='box')
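The IQR method used above can be wrapped into a small helper. This is only a sketch: the function name iqr_cutoffs and the sample values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd


def iqr_cutoffs(values: pd.Series, k: float = 1.5):
    """Return (lower cutoff, upper cutoff) using the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented values with one obviously extreme observation.
purchase = pd.Series([10, 12, 14, 16, 18, 20, 100])

lc, uc = iqr_cutoffs(purchase)
outliers = purchase[(purchase < lc) | (purchase > uc)]

print("lc =", lc, ", uc =", uc)
print(outliers)
```

Filtering with the two cutoffs lists the outlying values directly, which complements the comparison of lc and uc against the minimum and maximum.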

Now since we have detected the outliers it is time to treat those.

Do not worry about the data loss as here we are not going to remove any value from the variable but rather **clip** them. In this process, we replace the values falling outside the range with the lower or the upper cutoff accordingly. By this, the outliers are removed from the data and we get all the data within the range.

Clipping all values greater than the upper cutoff to the upper cutoff :

df.Purchase.clip(upper=uc)

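To finally treat the outliers and make the changes permanent, we assign the clipped values back to the variable (calling clip with inplace=True on df.Purchase may leave df unchanged in recent versions of Pandas):

df['Purchase'] = df.Purchase.clip(upper=uc)
df.Purchase.plot(kind='box')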
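Here is a minimal sketch of clipping on made-up values, showing that no observation is dropped and only the extreme value is pulled in to the cutoff. The cutoff of 28 is an assumed value for illustration.

```python
import pandas as pd

# Invented values; 100 lies above the assumed upper cutoff.
df = pd.DataFrame({"Purchase": [10, 12, 14, 16, 18, 20, 100]})
uc = 28.0  # hypothetical upper cutoff

n_before = len(df)
# Assigning the result back makes the change permanent in df.
df["Purchase"] = df["Purchase"].clip(upper=uc)

print(df["Purchase"].tolist())
print(n_before, len(df))  # same number of observations before and after
```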

What are Missing Values? Missing Values are the **unknown values** in the data. Handling them involves 2 steps: Detecting the missing values and Treatment of the missing values.

**Detecting the Missing Values**

df.isna()

*df.isna()* returns *True* for the missing values and *False* for the non-missing values.

Here we are going to find out the __percentage of missing values__ in each variable.

df.isna().sum()/df.shape[0]*100

And we get from the output that we do have missing values in our data frame in 2 variables: *Product_Category_2* and *Product_Category_3*, so detection is done.
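As a small sketch of the detection step, the percentage of missing values can also be computed as the mean of the True/False output of isna(). The values below are made up; only the column names come from this article.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names come from the article.
df = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 5.0],
})

# The mean of a boolean column is the share of True values.
pct_missing = df.isna().mean() * 100
with_missing = pct_missing[pct_missing > 0].index.tolist()

print(pct_missing)
print(with_missing)
```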

To treat the missing values we can opt for a method from the following :

- Drop the variable
- Drop the observation(s)
- Missing Value Imputation

For variable *Product_Category_2*, 31.56% of the values are missing. We should not drop such a large number of observations nor should we drop the variable itself hence we will go for imputation. __Data Imputation__ is done on the __Series__. Here we replace the missing values with some value which could be static, mean, median, mode, or an output of a predictive model.

Since it is a __categorical variable__, let’s impute the values by *mode*.

df.Product_Category_2.mode()[0]
df['Product_Category_2'] = df.Product_Category_2.fillna(df.Product_Category_2.mode()[0])

Done!

df.isna().sum()
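Mode imputation can be sketched on a few made-up values as follows. Here the result of fillna() is assigned back to the column, which is one way to make the change stick.

```python
import numpy as np
import pandas as pd

# Invented values with two missing entries.
df = pd.DataFrame({"Product_Category_2": [8.0, np.nan, 8.0, 14.0, np.nan]})

# mode() returns a Series (there can be ties), so take the first entry.
mode_value = df["Product_Category_2"].mode()[0]
df["Product_Category_2"] = df["Product_Category_2"].fillna(mode_value)

print(mode_value)
print(df["Product_Category_2"].tolist())
```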

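For variable *Product_Category_3*, 69.67% of the values are missing, which is a lot, hence we will go for dropping this variable. Since it is now the only variable that still has missing values, dropping every column that contains a missing value removes exactly this one.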

df.dropna(axis=1,inplace=True)

df.dtypes
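One thing worth keeping in mind is that dropna(axis=1) drops every variable that has at least one missing value. If other variables still had gaps, naming the variable explicitly would be the safer option. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Invented values: two variables still contain missing values.
df = pd.DataFrame({
    "Purchase": [100, 200, 300],
    "Product_Category_2": [8.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, np.nan, 5.0],
})

only_pc3_dropped = df.drop(columns=["Product_Category_3"])  # named variable only
all_incomplete_dropped = df.dropna(axis=1)                  # every variable with a gap

print(list(only_pc3_dropped.columns))
print(list(all_incomplete_dropped.columns))
```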

Analysis using Charts

In univariate analysis, we use a *single variable* and plot charts on it. Here the charts are created to see the *distribution* and the *composition* of the data depending on the type of variable, namely categorical or numerical.

**For Continuous Variables:** To see the distribution of data we create Box plots and Histograms.

__Distribution of Purchase__

**Histogram**

df.Purchase.hist()
plt.show()

We created this histogram using the *hist()* method of the *Series* but there is another method too known as *plot()* by which we can create many more charts.

df.Purchase.plot(kind='hist' , grid = True)
plt.show()

We have another way to create this chart by directly using **matplotlib**!

plt.hist(df.Purchase)
plt.grid(True)
plt.show()

**Box Plot**

df.Purchase.plot(kind='box')
plt.show()

plt.boxplot(df.Purchase)
plt.show()
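To see what a histogram actually counts, here is a small sketch using NumPy's histogram function on made-up values. The choice of 5 bins is only for illustration and is not the setting used by the charts above.

```python
import numpy as np
import pandas as pd

# Invented values spread evenly between 5 and 95.
purchase = pd.Series([5, 15, 25, 35, 45, 55, 65, 75, 85, 95])

# counts[i] is the number of values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(purchase.to_numpy(), bins=5)

print(counts)
print(edges)
```

Each bar of a histogram is one of these counts drawn over its bin.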

**For Categorical Variables:**

- To see the distribution of data we create frequency plots like Bar charts, Horizontal Bar charts, etc.
- To see the composition of data we create Pie charts.

__Composition of Gender__

df.groupby('Gender').City_Category.count().plot(kind='pie')
plt.show()

__Distribution of Marital_Status__

sns.countplot(x=df.Marital_Status)
plt.show()

__Composition of City_Category__

df.groupby('City_Category').City_Category.count().plot(kind='pie')
plt.show()

__Distribution of Age__

sns.countplot(x=df.Age)
plt.show()

__Composition of Stay_In_Current_City_Years__

df.groupby('Stay_In_Current_City_Years').City_Category.count().plot(kind='pie')
plt.show()

__Distribution of Occupation__

sns.countplot(x=df.Occupation)
plt.show()

__Distribution of Product_Category_1__

df.groupby('Product_Category_1').City_Category.count().plot(kind='barh')
plt.show()
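The numbers behind these frequency and pie charts are simply counts per category. As a small sketch on made-up values, value_counts() gives them directly:

```python
import pandas as pd

# Invented values for a categorical variable.
gender = pd.Series(["M", "M", "F", "M", "F", "M"], name="Gender")

counts = gender.value_counts()                      # heights of a bar chart
shares = gender.value_counts(normalize=True) * 100  # slices of a pie chart, in %

print(counts)
print(shares.round(2))
```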

In bivariate analysis, we take *two variables* at a time and create charts on them. Since we have 2 types of variables, Categorical and Numerical, there can be 3 cases in bivariate analysis:

**Numerical & Numerical:** To see the relationship between the 2 variables we create Scatter Plots and a Correlation Matrix with a Heatmap on top.

__Scatter Plot__

Since there is only 1 numerical variable in our dataset, we cannot create a true Scatter plot here. But how would we do so? Let’s take a **hypothetical example** in which we treat all the numeric variables (having dtype int or float) as numerical variables.

Considering the 2 categorical variables *Product_Category_1* and *Product_Category_2*:

df.plot(x='Product_Category_1',y='Product_Category_2',kind = 'scatter')
plt.show()

plt.scatter(x=df.Product_Category_1 , y=df.Product_Category_2)
plt.show()

Finding a correlation between all the numeric variables.

df.select_dtypes(['float64' , 'int64']).corr()

Creating a heatmap using *Seaborn* on top of the correlation matrix obtained above to visualize the correlation between the different numerical columns of the data. This is especially helpful when we have a large number of variables.

sns.heatmap(df.select_dtypes(['float64' , 'int64']).corr(),annot=True)
plt.show()
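To get a feel for how a correlation matrix reads, here is a minimal sketch on made-up columns. The column names x, y, z and label are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Invented columns: y rises with x, z falls as x rises, label is non-numeric.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "c", "d"],
})

# Keep only the numeric columns before computing correlations.
corr = df.select_dtypes("number").corr()
print(corr)
```

Values close to 1 mean the two variables move together, values close to -1 mean they move in opposite directions, and the diagonal is always 1.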

**Numerical & Categorical**

- To see the composition of data we create pie charts.
- To see the comparison between the 2 variables we create bar and line charts.

**Comparison between Purchase and Occupation: Bar Chart**

df.groupby('Occupation').Purchase.sum().plot(kind='bar')
plt.show()

summary=df.groupby('Occupation').Purchase.sum()
plt.bar(x=summary.index , height=summary.values)
plt.show()

sns.barplot(x=summary.index , y=summary.values)
plt.show()
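The summary plotted in these bar charts is a grouped total. As a small sketch on made-up values, it can help to look at the grouped mean as well, since a total is also affected by how many observations each group has.

```python
import pandas as pd

# Invented values: occupation 1 has more observations but smaller purchases.
df = pd.DataFrame({
    "Occupation": [0, 0, 1, 1, 1],
    "Purchase": [100, 300, 50, 50, 50],
})

total = df.groupby("Occupation")["Purchase"].sum()     # what the bar charts plot
average = df.groupby("Occupation")["Purchase"].mean()  # per-observation view

print(total)
print(average)
```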

__Comparison between Purchase and Age: Line Chart__

df.groupby('Age').Purchase.sum().plot(kind='line')
plt.show()

__Composition of Purchase by Gender: Pie Chart__

df.groupby('Gender').Purchase.sum().plot(kind='pie')
plt.show()

__Comparison between Purchase and City_Category: Area Chart__

df.groupby('City_Category').Purchase.sum().plot(kind='area')
plt.show()

__Comparison between Purchase and Stay_In_Current_City_Years: Horizontal Bar Chart__

df.groupby('Stay_In_Current_City_Years').Purchase.sum().plot(kind='barh')
plt.show()

__Comparison between Purchase and Marital_Status__

sns.boxplot(x='Marital_Status',y='Purchase',data=df)
plt.show()

**Categorical & Categorical:** To see the relationship between the 2 variables we create a crosstab and a heatmap on top.

**Relationship between Age and Gender:** Creating a crosstab showing the data for Age and Gender

pd.crosstab(df.Age,df.Gender)

**Heatmap**: Creating a Heat Map on top of the crosstab.

sns.heatmap(pd.crosstab(df.Age,df.Gender))
plt.show()

__Relationship between City_Category and Stay_In_Current_City_Years__

sns.heatmap(pd.crosstab(df.City_Category,df.Stay_In_Current_City_Years))
plt.show()
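Here is a minimal sketch of what a crosstab contains, using made-up values. The age group labels are assumptions for illustration. The normalize argument turns the counts into shares within each row, which can make groups of different sizes easier to compare.

```python
import pandas as pd

# Invented values; the age group labels are hypothetical.
df = pd.DataFrame({
    "Age": ["18-25", "18-25", "26-35", "26-35", "26-35"],
    "Gender": ["F", "M", "M", "M", "F"],
})

counts = pd.crosstab(df["Age"], df["Gender"])                        # raw counts
row_share = pd.crosstab(df["Age"], df["Gender"], normalize="index")  # each row sums to 1

print(counts)
print(row_share.round(2))
```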

Finally, we have come to the end of this article. In this article, we took a sample data set and performed exploratory data analysis on it using the Python programming language and the Pandas DataFrame. However, this was just a basic idea of how EDA is done. You can definitely explore it as far as you want and try performing the steps on bigger datasets as well.

