# Rapid-Fire EDA process using Python for ML Implementation

**Understand the ML best practice and project roadmap**

A customer typically decides to implement ML (Machine Learning) for an identified business problem only after multiple discussions with stakeholders from both sides – Business, Architecture, Infrastructure, Operations, and others. This is quite normal for any new product or application development.

But the ML world is quite different, because new application development starts from a set of requirements, in the form of sprint plans or a traditional SDLC, and the next release plan depends on the customer.

In an ML implementation, however, we need to initiate the activities below first.


**Identify the data source(s) and Data Collection**

- The organization’s key application(s) – internal or external applications, or websites
- Streaming data from the web (Twitter/Facebook or any other social media)

Once you’re comfortable with the available data, you can start work on the rest of the **Machine Learning process model**.

**Machine Learning process**

Let’s jump into the EDA process (Step 3 in the picture above). In data preparation, EDA takes most of the effort and is an unavoidable step. We will zoom into it in detail now. Are you ready?

## Exploratory Data Analysis(EDA)

**What is EDA?** **Exploratory Data Analysis** is an unavoidable and major step in fine-tuning the given dataset(s): analyzing it through different forms of analysis to understand the insights and key characteristics of its entities – columns and rows – by applying Pandas, NumPy, statistical methods, and data-visualization packages.

**Outcomes of this phase**

- Understanding the given dataset and helping clean it up.
- A clear picture of the features and the relationships between them.
- Guidelines on which variables are essential and which non-essential ones to leave behind or remove.
- Handling missing values and human errors.
- Identifying outliers.
- Maximizing insight into the dataset.

This process is time-consuming but very effective. The activities below are involved during this phase; they vary depending on the available data and on acceptance from the customer.

Hopefully you now have some idea; let’s implement all of this using the **Automobile – Predictive Analysis** dataset.

#### Import Key Packages

```python
print("######################################")
print("        Import Key Packages           ")
print("######################################")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import preprocessing
```

```
######################################
        Import Key Packages
######################################
```

**1. Load .csv files**
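The loading code itself isn’t shown in the article, so here is a minimal sketch of this step. The file name `auto-mpg.csv` and the column list are assumptions based on the well-known Auto MPG dataset; a tiny inline sample stands in for the real file so the sketch runs as-is.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the real CSV file (assumed columns).
sample_csv = io.StringIO(
    "mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name\n"
    "18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu\n"
    "15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320\n"
)

# In practice you would point read_csv at the downloaded file:
# df_cars = pd.read_csv("auto-mpg.csv")
df_cars = pd.read_csv(sample_csv)
print(df_cars.shape)   # (rows, columns)
print(df_cars.head())
```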

**EDA (Exploratory Data Analysis)**

**2. Dataset Information**

```python
print("############################################")
print("          Info Of the Data Set              ")
print("############################################")
df_cars.info()
```

*Observation:*

- *We can see the features/columns, their data types, and the null counts.*
- *The horsepower and name features are object type in the given dataset.*

**Let’s go and see the given dataset file**

**3. Data Cleaning/Wrangling:**

The process of cleaning and unifying messy and complex datasets for easy access and analysis.

**Action:**

- replace('?', 'NaN')
- Converting the "horsepower" object type into int

```python
df_cars.horsepower = df_cars.horsepower.str.replace('?', 'NaN').astype(float)
df_cars.horsepower.fillna(df_cars.horsepower.mean(), inplace=True)
df_cars.horsepower = df_cars.horsepower.astype(int)
print("######################################################################")
print("        After Cleaning and type conversion in the Data Set            ")
print("######################################################################")
df_cars.info()
```

*Observation:*

- *horsepower is now int type.*
- *name is still object type; that is fine, since we are going to drop it during the EDA phase.*

**4. Group by names**

- Correcting the brand names (since they are misspelled, we have to correct them)

```python
# Longer variants come first in each pattern so they match before their
# shorter prefix (e.g. 'mercedes-benz' before 'mercedes').
df_cars['name'] = df_cars['name'].str.replace('chevroelt|chevrolet|chevy', 'chevrolet', regex=True)
df_cars['name'] = df_cars['name'].str.replace('maxda|mazda', 'mazda', regex=True)
df_cars['name'] = df_cars['name'].str.replace('mercedes-benz|mercedes benz|mercedes', 'mercedes', regex=True)
df_cars['name'] = df_cars['name'].str.replace('toyouta|toyota', 'toyota', regex=True)
df_cars['name'] = df_cars['name'].str.replace('vokswagen|volkswagen|vw', 'volkswagen', regex=True)
df_cars.groupby(['name']).sum().head()
```

**After correcting the names**

**5. Summary of Statistics**

```python
display(df_cars.describe().round(2))
```

**6. Dealing with Missing Values**

Fill in the missing values of horsepower with the mean horsepower value.

```python
meanhp = df_cars['horsepower'].mean()
df_cars['horsepower'] = df_cars['horsepower'].fillna(meanhp)
```

**7. Skewness and Kurtosis**

Finding the skewness and kurtosis of the mpg feature.

```python
print("Skewness: %f" % df_cars['mpg'].skew())
print("Kurtosis: %f" % df_cars['mpg'].kurt())
```

```
Skewness: 0.457066
Kurtosis: -0.510781
```
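To build intuition for those numbers, the sketch below (on synthetic data, not the cars dataset) compares a right-skewed sample against a symmetric one. A positive skew like mpg's 0.46 indicates a longer right tail, while a negative (excess) kurtosis indicates tails lighter than a normal distribution's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A strongly right-skewed sample (exponential) vs. a symmetric one (normal).
skewed = pd.Series(rng.exponential(scale=2.0, size=10_000))
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))

# pandas reports excess kurtosis, so a normal sample sits near 0.
print("Skewness (exponential): %f" % skewed.skew())     # large and positive
print("Skewness (normal):      %f" % symmetric.skew())  # close to 0
print("Kurtosis (normal):      %f" % symmetric.kurt())  # close to 0
```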

**8. Replacing Categorical Values**

Replacing the categorical variable’s numeric codes with the actual region names.

```python
df_cars['origin'] = df_cars['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
df_cars.head()
```

**9. Create Dummy Variables**

Values like ‘america’ cannot be read into an equation, so we create three simple true/false columns with titles equivalent to “Is this car American?”, “Is this car European?”, and “Is this car Asian?”. These will be used as independent variables without imposing any kind of ordering between the three regions. Let’s apply the code below.

```python
cData = pd.get_dummies(df_cars, columns=['origin'])
cData
```

**10. Removing Columns**

For this analysis, we won’t be needing the car name feature, so we can drop it.

```python
df_cars = df_cars.drop('name', axis=1)
```

**11. Univariate Analysis:** “Uni” + “Variate”. **Univariate** means single-variable (one-feature) analysis. Univariate analysis basically tells us how the data in each feature is distributed. Just a sample below.

```python
# Note: distplot is deprecated in recent seaborn releases;
# histplot/displot are the modern equivalents.
sns_plot = sns.distplot(df_cars["mpg"])
```

**12. Bivariate Analysis:** “Bi” + “Variate”. **Bivariate** means two variables or features analyzed together to see how they are related to each other. Generally we perform it to find the relationship between the dependent and independent variables, but you can apply it to any two variables/features in the given dataset to understand how they relate.

```python
fig, ax = plt.subplots(figsize=(5, 5))
sns.countplot(x=df_cars.origin.values, data=df_cars)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[0] = 'America'
labels[1] = 'Europe'
labels[2] = 'Asia'
ax.set_xticklabels(labels)
ax.set_title("Cars manufactured by Countries")
plt.show()
```

**Exploring the range and distribution of numerical variables**

```python
fig, ax = plt.subplots(6, 2, figsize=(15, 13))
sns.boxplot(x=df_cars["mpg"], ax=ax[0, 0])
sns.distplot(df_cars['mpg'], ax=ax[0, 1])
sns.boxplot(x=df_cars["cylinders"], ax=ax[1, 0])
sns.distplot(df_cars['cylinders'], ax=ax[1, 1])
sns.boxplot(x=df_cars["displacement"], ax=ax[2, 0])
sns.distplot(df_cars['displacement'], ax=ax[2, 1])
sns.boxplot(x=df_cars["horsepower"], ax=ax[3, 0])
sns.distplot(df_cars['horsepower'], ax=ax[3, 1])
sns.boxplot(x=df_cars["weight"], ax=ax[4, 0])
sns.distplot(df_cars['weight'], ax=ax[4, 1])
sns.boxplot(x=df_cars["acceleration"], ax=ax[5, 0])
sns.distplot(df_cars['acceleration'], ax=ax[5, 1])
plt.tight_layout()
```

**Plot Numerical Variables**

```python
plt.figure(1)
f, axarr = plt.subplots(4, 2, figsize=(10, 10))
mpgval = df_cars.mpg.values
axarr[0, 0].scatter(df_cars.cylinders.values, mpgval)
axarr[0, 0].set_title('Cylinders')
axarr[0, 1].scatter(df_cars.displacement.values, mpgval)
axarr[0, 1].set_title('Displacement')
axarr[1, 0].scatter(df_cars.horsepower.values, mpgval)
axarr[1, 0].set_title('Horsepower')
axarr[1, 1].scatter(df_cars.weight.values, mpgval)
axarr[1, 1].set_title('Weight')
axarr[2, 0].scatter(df_cars.acceleration.values, mpgval)
axarr[2, 0].set_title('Acceleration')
axarr[2, 1].scatter(df_cars["model_year"].values, mpgval)
axarr[2, 1].set_title('Model Year')
axarr[3, 0].scatter(df_cars.origin.values, mpgval)
axarr[3, 0].set_title('Country Mpg')
# Rename x-axis labels as USA, Europe and Asia
axarr[3, 0].set_xticks([1, 2, 3])
axarr[3, 0].set_xticklabels(["USA", "Europe", "Asia"])
# Remove the blank plot from the subplots
axarr[3, 1].axis("off")
f.text(-0.01, 0.5, 'Mpg', va='center', rotation='vertical', fontsize=12)
plt.tight_layout()
plt.show()
```

**Observation:**

Let’s pull more information out of these seven charts:

- Well, nobody manufactures 7-cylinder engines. Why? Does anyone know?
- 4-cylinder cars have better mileage than the others and are the most manufactured.
- 8-cylinder engines have low mileage – of course, they focus more on pickup (fast cars).
- 5-cylinder cars compete, performance-wise, with neither 4-cylinder nor 6-cylinder ones.
- Displacement, weight, and horsepower are inversely related to mileage.
- More horsepower means lower mileage.
- Year on year, manufacturers have focused on increasing engine mileage.
- Cars manufactured in Japan focus mostly on mileage.

**13. Multivariate Analysis:** means more than two variables or features analyzed together to see how they are related to each other.

```python
sns.set(rc={'figure.figsize': (11.7, 8.27)})
cData_attr = df_cars.iloc[:, 0:7]
# diag_kind='kde' plots a density curve instead of a histogram on the
# diagonal (kde = kernel density estimation)
sns.pairplot(cData_attr, diag_kind='kde')
```

**Observation**

- *The relationship between ‘mpg’ and the other attributes is not really linear.*
- *However, the plots also indicate that a linear fit would still capture quite a bit of useful information/pattern.*
- *Several assumptions of classical linear regression seem to be violated, including the assumption of no heteroscedasticity.*

**14.Distributions of the variables/features.**

```python
df_cars.hist(figsize=(12, 8), bins=20)
plt.show()
```

**Observation**

- The acceleration of the cars in the data is normally distributed, and most of the cars have an acceleration value of around 15.
- Half of the cars in the data (51.3%) have 4 cylinders.
- Our output/dependent variable (mpg) is slightly skewed to the right.

Let’s visualize the distribution of the features of the cars

**15. Correlation** – a heatmap of the relationships between the features.

How do you read it? Very simple:

- Darker colors represent a positive correlation.
- Lighter colors/white tend towards a negative correlation.

```python
plt.figure(figsize=(10, 6))
sns.heatmap(df_cars.corr(), cmap=plt.cm.Reds, annot=True)
plt.title('Heatmap displaying the relationship between the features of the data',
          fontsize=13)
plt.show()
```

**Relationship between Miles Per Gallon (mpg) and the other features**

- There is a relationship between the mpg variable and the other variables, which satisfies the first assumption of linear regression.
- mpg has a **strong negative** correlation with displacement, horsepower, weight, and cylinders: as any one of those variables increases, mpg decreases.
- There are **strong positive** correlations among displacement, horsepower, weight, and cylinders. This violates the no-multicollinearity assumption of linear regression.
- Multicollinearity hinders the performance and accuracy of a regression model. To avoid it, we have to get rid of some of these variables through feature selection.
- The other variables, i.e. acceleration, model year, and origin, are **NOT** highly correlated with each other.

So, I trust you were able to see EDA in full flow here. There are many more functions in it, but if you do the EDA process clearly and precisely, there is a 99% guarantee that you can carry out model selection, hyperparameter tuning, and deployment effectively without further cleaning or cleansing of your dataset. You then have to continuously monitor whether the data and model output remain sustainable for prediction, classification, or clustering.

## Frequently Asked Questions

**Q1. What does EDA mean in data?**

A. EDA stands for Exploratory Data Analysis. It is a crucial step in data analysis where analysts examine and summarize the main characteristics, patterns, and relationships within a dataset. EDA involves techniques such as data visualization, statistical analysis, and data cleaning to gain insights, detect anomalies, identify trends, and formulate hypotheses before applying further modeling or analysis techniques.

**Q2. Why do we perform exploratory data analysis?**

A. Exploratory Data Analysis (EDA) is performed to understand and gain insights from the data before conducting further analysis or modeling. It helps in identifying patterns, trends, and relationships within the dataset. EDA also helps in detecting and handling missing or erroneous data, validating assumptions, selecting appropriate modeling techniques, and making informed decisions about data preprocessing, feature engineering, and model selection.

Will get back to you all with another topic shortly, until then bye! Cheers! – Shanthababu!