Shanthababu Pandian — April 8, 2021

Understand the ML best practice and project roadmap

When a customer wants to implement ML (Machine Learning) for an identified business problem, it starts with multiple discussions among stakeholders from both sides – Business, Architecture, Infrastructure, Operations, and others. This is quite normal for any new product/application development.

But in the ML world, this is quite different, because for new application development we start from a set of requirements, in the form of sprint plans or a traditional SDLC, and the next release plan depends on the customer.


But in an ML implementation, we need to initiate the activities below first.

Identify the data source(s) and Data Collection

    • The organization’s key application(s) – internal or external applications or websites
    • Streaming data from the web (Twitter, Facebook, or any other social media)

Once you’re comfortable with the available data, you can start work on the rest of the Machine Learning process model. 

Machine Learning process


Let’s jump into the EDA process (Step 3 of the process above). Within data preparation, EDA takes the most effort and is an unavoidable step. We will zoom into it in detail now. Are you ready?!


Exploratory Data Analysis (EDA)

What is EDA? Exploratory Data Analysis is an unavoidable and major step in fine-tuning a given dataset: analyzing it through different forms of analysis to understand the insights and key characteristics of its entities, such as columns and rows, by applying Pandas, NumPy, statistical methods, and data-visualization packages.

The outcomes of this phase are as follows:

  • Understanding the given dataset and helping clean it up.
  • Getting a clear picture of the features and the relationships between them.
  • Providing guidelines on essential variables and removing non-essential ones.
  • Handling missing values and human errors.
  • Identifying outliers.
  • Maximizing the insights drawn from the dataset.
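Two of these outcomes, handling missing values and identifying outliers, can be checked in a few lines of pandas. A minimal sketch on a hypothetical mini-dataset (the values below are invented; the real auto-mpg data is loaded later):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset standing in for auto-mpg (values invented)
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 14.0, 44.0],
    "horsepower": [130.0, 165.0, np.nan, 215.0, 52.0],
})

# Missing values per column
print(df.isnull().sum())  # horsepower shows one missing entry

# Flag outliers in mpg with the classic 1.5 * IQR rule
q1, q3 = df["mpg"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["mpg"] < q1 - 1.5 * iqr) | (df["mpg"] > q3 + 1.5 * iqr)]
print(len(outliers))
```

The same two checks (`isnull().sum()` and the IQR rule) apply unchanged to the full dataset once it is loaded.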

This process is time-consuming but very effective. The activities involved during this phase vary depending on the available data and the customer’s acceptance.


I hope you now have some idea; let’s implement all of this using the Automobile – Predictive Analysis dataset.

Import Key Packages

print("######################################")
print("       Import Key Packages            ")
print("######################################")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error
from sklearn import preprocessing

######################################
Import Key Packages
######################################

1. Load the .csv file

df_cars = pd.read_csv("auto-mpg.csv")

Let’s see the data through a DataFrame.

df_cars.head(5)

2. Dataset Information

print("############################################")
print("          Info Of the Data Set")
print("############################################")
df_cars.info()

Observation:

  • We can see each feature/column, its data type, and the non-null count.
  • horsepower and name are object types in the given dataset.

Let’s go and look at the given dataset file.

3. Data Cleaning/Wrangling:
the process of cleaning and unifying messy and complex datasets for easy access and analysis.

Action:

  • Replace '?' with NaN
  • Convert the "horsepower" object type into int

df_cars.horsepower = df_cars.horsepower.str.replace('?', 'NaN', regex=False).astype(float)
df_cars.horsepower.fillna(df_cars.horsepower.mean(), inplace=True)
df_cars.horsepower = df_cars.horsepower.astype(int)
print("######################################################################")
print("          After cleaning and type conversion in the Data Set")
print("######################################################################")
df_cars.info()

Observation:

  • We can see each feature/column, its data type, and the non-null count.
  • horsepower is now an int type.
  • name is still an object type; we’re going to drop it during the EDA phase.

4. Group by names

  • Correcting the brand names (since they are misspelled, we have to correct them). Note that the alternation order matters: longer patterns like 'mercedes-benz' must come before 'mercedes', or the shorter prefix matches first and leaves '-benz' behind.

df_cars['name'] = df_cars['name'].str.replace('chevroelt|chevrolet|chevy', 'chevrolet', regex=True)
df_cars['name'] = df_cars['name'].str.replace('maxda|mazda', 'mazda', regex=True)
df_cars['name'] = df_cars['name'].str.replace('mercedes-benz|mercedes benz|mercedes', 'mercedes', regex=True)
df_cars['name'] = df_cars['name'].str.replace('toyota|toyouta', 'toyota', regex=True)
df_cars['name'] = df_cars['name'].str.replace('vokswagen|volkswagen|vw', 'volkswagen', regex=True)
df_cars.groupby(['name']).sum().head()

5. Summary of Statistics
display(df_cars.describe().round(2))

6. Dealing with Missing Values

Fill in the missing values of horsepower with the mean horsepower value. (We already imputed these during cleaning in step 3, so this is a no-op here, shown for completeness.)

meanhp = df_cars['horsepower'].mean()
df_cars['horsepower'] = df_cars['horsepower'].fillna(meanhp)

7. Skewness and Kurtosis

Finding the skewness and kurtosis of the mpg feature:

print("Skewness: %f" %df_cars['mpg'].skew())
print("Kurtosis: %f" %df_cars['mpg'].kurt())

Skewness: 0.457066
Kurtosis: -0.510781
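A positive skew like this one is often reduced with a log transform before modeling. A quick illustration on an invented right-skewed sample (these values are not from the dataset):

```python
import numpy as np
import pandas as pd

# Invented right-skewed sample: a long right tail, like mpg but exaggerated
s = pd.Series([10, 12, 13, 14, 15, 18, 25, 40, 60], dtype=float)

# The log transform compresses the tail and pulls the skew down
print(f"Skewness before log: {s.skew():.3f}")
print(f"Skewness after log:  {np.log(s).skew():.3f}")
```

With mpg's mild skew of 0.46, a transform is optional; it matters much more for heavily skewed features.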

8. Categorical Variable Mapping

Replacing the categorical codes with their actual values:

df_cars['origin'] = df_cars['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
df_cars.head()

 

9. Create Dummy Variables

Values like ‘america’ cannot be read into an equation. So we create 3 simple true-or-false columns titled, in effect, “Is this car American?”, “Is this car European?”, and “Is this car Asian?”. These will be used as independent variables without imposing any kind of ordering between the three regions. Let’s apply the code below.

cData = pd.get_dummies(df_cars, columns=['origin'])
cData

10. Removing Columns 

For this analysis, we won’t need the car name feature, so we can drop it.

df_cars = df_cars.drop('name',axis=1)

11. Univariate Analysis: “Uni” + “Variate” = univariate, meaning analysis of one variable or feature. Univariate analysis basically tells us how the data in each feature is distributed. A sample is below.

sns_plot = sns.histplot(df_cars["mpg"], kde=True)

12. Bivariate Analysis: “Bi” + “Variate” = bivariate, meaning two variables or features analyzed together to see how they relate to each other. Generally we perform this to find the relationship between the dependent and independent variables, but you can apply it to any two variables/features in the dataset to understand how they relate.

fig, ax = plt.subplots(figsize = (5, 5))
sns.countplot(x = df_cars.origin.values, data=df_cars)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[0] = 'America'
labels[1] = 'Europe'
labels[2] = 'Asia'
ax.set_xticklabels(labels)
ax.set_title("Cars manufactured by Countries")
plt.show()

# Exploring the range and distribution of numerical Variables

fig, ax = plt.subplots(6, 2, figsize = (15, 13))
sns.boxplot(x= df_cars["mpg"], ax = ax[0,0])
sns.histplot(df_cars['mpg'], kde=True, ax = ax[0,1])
sns.boxplot(x= df_cars["cylinders"], ax = ax[1,0])
sns.histplot(df_cars['cylinders'], kde=True, ax = ax[1,1])
sns.boxplot(x= df_cars["displacement"], ax = ax[2,0])
sns.histplot(df_cars['displacement'], kde=True, ax = ax[2,1])
sns.boxplot(x= df_cars["horsepower"], ax = ax[3,0])
sns.histplot(df_cars['horsepower'], kde=True, ax = ax[3,1])
sns.boxplot(x= df_cars["weight"], ax = ax[4,0])
sns.histplot(df_cars['weight'], kde=True, ax = ax[4,1])
sns.boxplot(x= df_cars["acceleration"], ax = ax[5,0])
sns.histplot(df_cars['acceleration'], kde=True, ax = ax[5,1])
plt.tight_layout()
Plot Numerical Variables
plt.figure(1)
f,axarr = plt.subplots(4,2, figsize=(10,10))
mpgval = df_cars.mpg.values
axarr[0,0].scatter(df_cars.cylinders.values, mpgval)
axarr[0,0].set_title('Cylinders')
axarr[0,1].scatter(df_cars.displacement.values, mpgval)
axarr[0,1].set_title('Displacement')
axarr[1,0].scatter(df_cars.horsepower.values, mpgval)
axarr[1,0].set_title('Horsepower')
axarr[1,1].scatter(df_cars.weight.values, mpgval)
axarr[1,1].set_title('Weight')
axarr[2,0].scatter(df_cars.acceleration.values, mpgval)
axarr[2,0].set_title('Acceleration')
axarr[2,1].scatter(df_cars["model_year"].values, mpgval)
axarr[2,1].set_title('Model Year')
axarr[3,0].scatter(df_cars.origin.values, mpgval)
axarr[3,0].set_title('Country Mpg')
# origin already holds the region names (replaced in step 8), so no relabeling is needed
# Remove the blank plot from the subplots
axarr[3,1].axis("off")
f.text(-0.01, 0.5, 'Mpg', va='center', rotation='vertical', fontsize = 12)
plt.tight_layout()
plt.show()

Observation:

So let’s find out more from these 7 charts.

  • Well, nobody manufactures 7-cylinder engines. Why? Does anyone know?
  • 4-cylinder engines have better mileage than the others and are the most manufactured.
  • 8-cylinder engines have a low mileage count; of course, they focus more on pickup (fast cars).
  • 5-cylinder engines compete, performance-wise, with neither 4-cylinder nor 6-cylinder ones.
  • Displacement, weight, and horsepower are inversely related to mileage.
  • More horsepower means lower mileage.
  • Year on year, manufacturers have focused on increasing engine mileage.
  • Cars manufactured in Japan focus mostly on mileage.
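Several of these chart readings can be cross-checked numerically with a groupby. A sketch on an invented mini-sample that mirrors the pattern (the real figures come from the full dataset):

```python
import pandas as pd

# Invented mini-sample mirroring the pattern seen in the charts
df = pd.DataFrame({
    "cylinders": [4, 4, 4, 6, 6, 8, 8, 8],
    "mpg":       [30.0, 33.0, 29.0, 20.0, 19.0, 14.0, 13.0, 15.0],
})

# Mean mileage per cylinder count: 4 > 6 > 8, as the scatter plots suggest
print(df.groupby("cylinders")["mpg"].mean())
```

Running `df_cars.groupby('cylinders')['mpg'].mean()` on the real data gives the actual averages behind the charts.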

13. Multivariate Analysis: more than two variables or features analyzed together to see how they relate to each other.

sns.set(rc={'figure.figsize':(11.7,8.27)})
cData_attr = df_cars.iloc[:, 0:7]
sns.pairplot(cData_attr, diag_kind='kde')   
# plot a density curve instead of a histogram on the diagonal – kernel density estimation (kde)

Observation:

  • The relationships between ‘mpg’ and the other attributes are not really linear.
  • However, the plots also indicate that a linear model could still capture quite a bit of useful information/pattern.
  • Several assumptions of classical linear regression seem to be violated, including the assumption of no heteroscedasticity.

14. Distributions of the Variables/Features

df_cars.hist(figsize=(12,8),bins=20)
plt.show()
Observation:
  • The acceleration of the cars is normally distributed, and most of the cars have an acceleration of around 15.
  • About half of the cars (51.3%) in the data have 4 cylinders.
  • Our output/dependent variable (mpg) is slightly skewed to the right.
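Shares like the 51.3% figure come from a normalized value count. A sketch on an invented sample; on the real data, `df_cars['cylinders'].value_counts(normalize=True)` gives the actual shares:

```python
import pandas as pd

# Invented sample of cylinder counts (not the real distribution)
cyl = pd.Series([4, 4, 4, 4, 6, 6, 8, 8])

# normalize=True turns raw counts into fractions that sum to 1
share = cyl.value_counts(normalize=True)
print(share)
```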

Next, let’s visualize the relationships between the features of the cars.

15. Correlation – a heatmap of the relationships between the features.

How to read it? Very simple:

  • A dark color represents a strong positive correlation.
  • A light/white color leans towards a negative correlation.
plt.figure(figsize=(10,6))
sns.heatmap(df_cars.corr(numeric_only=True),cmap=plt.cm.Reds,annot=True)
plt.title('Heatmap displaying the relationship between the features of the data',
         fontsize=13)
plt.show()

Relationship between miles per gallon (mpg) and the other features:

  • There is a relationship between the mpg variable and the other variables, which satisfies the first assumption of linear regression.
  • mpg has a strong negative correlation with displacement, horsepower, weight, and cylinders.
    • This implies that as any one of those variables increases, mpg decreases.
  • There are strong positive correlations among displacement, horsepower, weight, and cylinders themselves.
    • This violates the no-multicollinearity assumption of linear regression.
    • Multicollinearity hinders the performance and accuracy of a regression model. To avoid this, we have to get rid of some of these variables through feature selection.
  • The other variables, i.e. acceleration, model year, and origin, are not highly correlated with each other.
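A standard tool for that feature selection is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 − R²). The `variance_inflation_factor` imported from statsmodels at the top does exactly this; the from-scratch sketch below shows the idea on an invented sample where two columns are nearly collinear (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

def vif(df):
    """VIF per column: 1 / (1 - R^2) of regressing that column
    (with an intercept) on all the other columns."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        X = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(X)), X])  # add intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Invented sample: weight is almost a linear copy of displacement
rng = np.random.default_rng(0)
disp = rng.uniform(100, 400, 50)
df = pd.DataFrame({
    "displacement": disp,
    "weight": disp * 10 + rng.normal(0, 20, 50),  # nearly collinear
    "acceleration": rng.uniform(8, 25, 50),       # independent
})
print(vif(df).round(1))
```

A common rule of thumb is to drop or combine features with a VIF above 5–10; here the two collinear columns blow up while the independent one stays near 1.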

So, I trust you were able to understand EDA in full flow here. There are still many more functions to it, but if you do the EDA process clearly and precisely, there is a 99% guarantee that you can carry out model selection, hyperparameter tuning, and deployment effectively without further cleaning or cleansing of your dataset. You still have to continuously monitor that the data and the model output remain fit to predict, classify, or cluster.

Will get back to you all with another topic shortly, until then bye! Cheers! – Shanthababu!
