RapidFire EDA process using Python for ML Implementation
Understanding ML best practices and the project roadmap
A customer typically decides to implement ML (Machine Learning) for the identified business problem(s) after multiple discussions with stakeholders from both sides: Business, Architecture, Infrastructure, Operations, and others. This is quite normal for any new product/application development.
But in the ML world, things are different: for conventional application development, we start from a set of requirements in the form of sprint plans or a traditional SDLC document, and the next release plan depends on the customer. In an ML implementation, we need to initiate the activities below first.
Identify the data source(s) and Data Collection

 The organization’s key application(s): internal or external applications or websites
 Streaming data from the web (Twitter, Facebook, or any other social media)
Once you’re comfortable with the available data, you can start working on the rest of the Machine Learning process model.
Machine Learning process
Let’s jump into the EDA process (Step 3 in the picture above). Within data preparation, EDA takes the most effort and is an unavoidable step. We will now zoom into it in detail. Are you ready?
Exploratory Data Analysis(EDA)
What is EDA? Exploratory Data Analysis is an unavoidable and major step: fine-tuning the given dataset(s) through different forms of analysis to understand the key characteristics of its entities (columns and rows), by applying Pandas, NumPy, statistical methods, and data-visualization packages.
The outcomes of this phase are as below:
 Understanding the given dataset and helping to clean it up.
 A clear picture of the features and the relationships between them.
 Guidelines for essential variables and removal of non-essential variables.
 Handling of missing values and human errors.
 Identification of outliers.
 Maximized insight into the dataset.
This process is time-consuming but very effective. The activities below are involved during this phase; they vary depending on the available data and on acceptance from the customer.
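Before walking through the individual steps, the first pass over any dataset can be condensed into one overview call. This is a minimal sketch of such a helper; the function name `eda_overview` and the tiny sample frame are my own illustration, not part of the dataset or the original code.

```python
import pandas as pd

def eda_overview(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count, null count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),
    })

# Tiny illustrative frame (not the real auto-mpg data)
sample = pd.DataFrame({"mpg": [18.0, 15.0, None], "cylinders": [8, 8, 4]})
print(eda_overview(sample))
```

Running this on the real `df_cars` would immediately surface the object-typed columns and null counts that the steps below deal with one by one.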
I hope you now have some idea; let’s implement all of this using the Automobile – Predictive Analysis dataset.
Import Key Packages
print("######################################")
print("        Import Key Packages           ")
print("######################################")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import statsmodels as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import preprocessing
######################################
Import Key Packages
######################################
1. Load the .csv file
df_cars = pd.read_csv("autompg.csv")
Let’s see the data through the DataFrame.
df_cars.head(5)
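As a side note, an alternative to cleaning the '?' placeholders after loading (as we do below) is to declare them as missing at read time with `na_values`. A minimal sketch, using a small synthetic CSV in place of the real autompg.csv file (the two rows are illustrative):

```python
import io
import pandas as pd

# Synthetic rows mimicking the auto-mpg layout; '?' marks a missing horsepower
csv_text = """mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
25.0,4,98.0,?,2046,19.0,71,1,ford pinto
"""

# na_values='?' converts the placeholder to NaN while parsing,
# so horsepower arrives as a numeric column instead of object dtype
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(df["horsepower"].dtype)  # float64, because NaN forces a float column
```

With this approach, the object-to-numeric conversion in step 3 below becomes unnecessary; only the NaN imputation remains.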
EDA (Exploratory Data Analysis)
2. Dataset Information
print("############################################")
print("        Info Of the Data Set")
print("############################################")
df_cars.info()
Observation:
 We can see the features/columns/fields and their data types, along with the null counts.
 The horsepower and name features are of object type in the given dataset.
Let’s go and look at the given dataset file.
3. Data Cleaning/Wrangling:
The process of cleaning and unifying messy, complex datasets for easy access and analysis.
Action:
 Replace '?' with 'NaN'.
 Convert the 'horsepower' column from object type to int.
# '?' is a regex metacharacter, so pass regex=False to replace it literally
df_cars.horsepower = df_cars.horsepower.str.replace('?', 'NaN', regex=False).astype(float)
df_cars.horsepower.fillna(df_cars.horsepower.mean(), inplace=True)
df_cars.horsepower = df_cars.horsepower.astype(int)
print("######################################################################")
print("        After cleaning and type conversion in the Data Set")
print("######################################################################")
df_cars.info()
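An equivalent, slightly more compact route is `pd.to_numeric(errors='coerce')`, which turns any non-numeric token (including '?') into NaN in one step, with no string replacement needed. A sketch on a toy series (the three values are illustrative):

```python
import pandas as pd

hp = pd.Series(["130", "?", "165"])
# errors='coerce' maps anything that fails numeric parsing to NaN
hp_num = pd.to_numeric(hp, errors="coerce")
# Impute with the mean, then cast down to int as in the article
hp_filled = hp_num.fillna(hp_num.mean()).astype(int)
print(hp_filled.tolist())  # → [130, 147, 165]  (mean 147.5 truncates to 147)
```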
Observation:
 We can see the features/columns/fields and their data types, along with the null counts.
 horsepower is now of int type.
 name is still of object type in the given dataset, since we are going to drop it during the EDA phase.
4. Group by Names
 Correcting the brand names (since they are misspelled, we have to correct them).
# The misspelled variants are regex alternations; regex=True makes the intent explicit
df_cars['name'] = df_cars['name'].str.replace('chevroelt|chevrolet|chevy', 'chevrolet', regex=True)
df_cars['name'] = df_cars['name'].str.replace('maxda|mazda', 'mazda', regex=True)
df_cars['name'] = df_cars['name'].str.replace('mercedes|mercedes-benz|mercedes benz', 'mercedes', regex=True)
df_cars['name'] = df_cars['name'].str.replace('toyota|toyouta', 'toyota', regex=True)
df_cars['name'] = df_cars['name'].str.replace('vokswagen|volkswagen|vw', 'volkswagen', regex=True)
df_cars.groupby(['name']).sum().head()
After correcting the names:
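The chained replacements above can also be expressed as a single mapping dictionary applied to the brand token, which keeps every correction in one place and avoids regex entirely. A sketch on a toy series; the `fixes` dict lists the misspelled variants and the sample names are illustrative:

```python
import pandas as pd

# Misspelling → canonical brand (keys are the variants seen in the raw data)
fixes = {"chevroelt": "chevrolet", "chevy": "chevrolet",
         "maxda": "mazda", "toyouta": "toyota",
         "vokswagen": "volkswagen", "vw": "volkswagen"}

names = pd.Series(["chevroelt impala", "toyouta corolla", "vw rabbit"])
# Normalize just the first token (the brand), leaving model names untouched
brand = names.str.split().str[0].replace(fixes)
print(brand.tolist())  # → ['chevrolet', 'toyota', 'volkswagen']
```

Because `Series.replace` matches whole values, this variant cannot accidentally rewrite substrings inside model names, which is a risk with regex alternations.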
5. Descriptive Statistics
display(df_cars.describe().round(2))
6. Dealing with Missing Values
Fill in the missing values of horsepower with the mean of the horsepower values.
meanhp = df_cars['horsepower'].mean()
df_cars['horsepower'] = df_cars['horsepower'].fillna(meanhp)
7. Skewness and Kurtosis
Finding the skewness and kurtosis of the mpg feature.
print("Skewness: %f" % df_cars['mpg'].skew())
print("Kurtosis: %f" % df_cars['mpg'].kurt())
Skewness: 0.457066
Kurtosis: 0.510781
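To demystify the number pandas reports: `Series.skew()` returns the bias-adjusted Fisher-Pearson coefficient, which can be reproduced by hand from the central moments. A sketch on a toy series (the five values are illustrative):

```python
import numpy as np
import pandas as pd

x = pd.Series([18.0, 15.0, 36.0, 25.0, 30.0])
n = len(x)
m2 = ((x - x.mean()) ** 2).mean()           # second central moment
m3 = ((x - x.mean()) ** 3).mean()           # third central moment
g1 = m3 / m2 ** 1.5                          # biased sample skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)     # bias-adjusted, what pandas reports
print(np.isclose(G1, x.skew()))              # → True
```

A positive value, like the 0.457 for mpg above, means the right tail is longer: a handful of unusually high-mileage cars pull the distribution to the right.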
8. Categorical Variable Mapping
Replacing the numeric category codes with the actual region names.
df_cars['origin'] = df_cars['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
df_cars.head()
9. Create Dummy Variables
Values like ‘america’ cannot be read into an equation, so we create three simple true/false columns with titles equivalent to “Is this car American?”, “Is this car European?”, and “Is this car Asian?”. These will be used as independent variables without imposing any kind of ordering between the three regions. Let’s apply the code below.
cData = pd.get_dummies(df_cars, columns=['origin'])
cData
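One practical variant: passing `drop_first=True` keeps only k-1 of the k indicator columns, which avoids the "dummy variable trap" (the k columns always sum to 1, so together with a regression intercept they are perfectly collinear). A sketch on a toy frame with the same three regions:

```python
import pandas as pd

origin = pd.DataFrame({"origin": ["america", "europe", "asia", "america"]})
# drop_first=True drops the alphabetically first level ('america'),
# which becomes the implicit baseline category in a regression
dummies = pd.get_dummies(origin, columns=["origin"], drop_first=True)
print(list(dummies.columns))  # → ['origin_asia', 'origin_europe']
```

A row with both indicators False is then read as "american", so no information is lost.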
10. Removing Columns
For this analysis, we won’t be needing the car name feature, so we can drop it.
df_cars = df_cars.drop('name',axis=1)
11. Univariate Analysis: “Uni” + “Variate”. Univariate means analysis of a single variable or feature. Univariate analysis basically tells us how the data in each feature is distributed. A sample is below.
sns_plot = sns.distplot(df_cars["mpg"])
12. Bivariate Analysis: “Bi” + “Variate”. Bivariate means two variables or features analyzed together to see how they are related to each other. We generally perform it to find the relationship between the dependent variable and an independent variable, but you can apply it to any two variables/features in the given dataset to understand how they are related.
fig, ax = plt.subplots(figsize=(5, 5))
sns.countplot(x=df_cars.origin.values, data=df_cars)
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[0] = 'America'
labels[1] = 'Europe'
labels[2] = 'Asia'
ax.set_xticklabels(labels)
ax.set_title("Cars manufactured by Countries")
plt.show()
# Exploring the range and distribution of numerical Variables
fig, ax = plt.subplots(6, 2, figsize=(15, 13))
sns.boxplot(x=df_cars["mpg"], ax=ax[0, 0])
sns.distplot(df_cars['mpg'], ax=ax[0, 1])
sns.boxplot(x=df_cars["cylinders"], ax=ax[1, 0])
sns.distplot(df_cars['cylinders'], ax=ax[1, 1])
sns.boxplot(x=df_cars["displacement"], ax=ax[2, 0])
sns.distplot(df_cars['displacement'], ax=ax[2, 1])
sns.boxplot(x=df_cars["horsepower"], ax=ax[3, 0])
sns.distplot(df_cars['horsepower'], ax=ax[3, 1])
sns.boxplot(x=df_cars["weight"], ax=ax[4, 0])
sns.distplot(df_cars['weight'], ax=ax[4, 1])
sns.boxplot(x=df_cars["acceleration"], ax=ax[5, 0])
sns.distplot(df_cars['acceleration'], ax=ax[5, 1])
plt.tight_layout()
plt.figure(1)
f, axarr = plt.subplots(4, 2, figsize=(10, 10))
mpgval = df_cars.mpg.values
axarr[0, 0].scatter(df_cars.cylinders.values, mpgval)
axarr[0, 0].set_title('Cylinders')
axarr[0, 1].scatter(df_cars.displacement.values, mpgval)
axarr[0, 1].set_title('Displacement')
axarr[1, 0].scatter(df_cars.horsepower.values, mpgval)
axarr[1, 0].set_title('Horsepower')
axarr[1, 1].scatter(df_cars.weight.values, mpgval)
axarr[1, 1].set_title('Weight')
axarr[2, 0].scatter(df_cars.acceleration.values, mpgval)
axarr[2, 0].set_title('Acceleration')
axarr[2, 1].scatter(df_cars["model_year"].values, mpgval)
axarr[2, 1].set_title('Model Year')
axarr[3, 0].scatter(df_cars.origin.values, mpgval)
axarr[3, 0].set_title('Country Mpg')
# Rename the x-axis labels as USA, Europe and Asia
axarr[3, 0].set_xticks([1, 2, 3])
axarr[3, 0].set_xticklabels(["USA", "Europe", "Asia"])
# Remove the blank plot from the subplots
axarr[3, 1].axis("off")
f.text(0.01, 0.5, 'Mpg', va='center', rotation='vertical', fontsize=12)
plt.tight_layout()
plt.show()
Observation:
So let’s find out more information from these 7 charts:
 Well, nobody manufactures 7-cylinder engines. Why? Does anyone know?
 4-cylinder cars have better mileage than the others and are the most manufactured.
 8-cylinder engines have a low mileage count; of course, they focus more on pickup (fast cars).
 5-cylinder engines, performance-wise, compete with neither the 4-cylinder nor the 6-cylinder ones.
 Displacement, weight, and horsepower are inversely related to mileage.
 More horsepower means lower mileage.
 Year on year, manufacturers have focused on increasing the mileage of their engines.
 Cars manufactured in Japan majorly focus on mileage.
13. Multivariate Analysis: more than two variables or features are analyzed together to see how they are related to each other.
sns.set(rc={'figure.figsize': (11.7, 8.27)})
cData_attr = df_cars.iloc[:, 0:7]
# diag_kind='kde' plots a density curve instead of a histogram on the diagonal
# (kde = kernel density estimation)
sns.pairplot(cData_attr, diag_kind='kde')
Observation
 The relationship between ‘mpg’ and the other attributes is not really linear.
 However, the plots also indicate that a linear model would still capture quite a bit of useful information/pattern.
 Several assumptions of classical linear regression seem to be violated, including the assumption of homoscedasticity (no heteroscedasticity).
14.Distributions of the variables/features.
df_cars.hist(figsize=(12, 8), bins=20)
plt.show()
 The acceleration values in the data are normally distributed, and most cars take about 15 seconds to accelerate (in this dataset, acceleration is the 0–60 mph time in seconds).
 Half of the total number of cars (51.3%) in the data has 4 cylinders.
 Our output/dependent variable (mpg) is slightly skewed to the right.
Having visualized the distributions of the features, let’s now look at how they are related.
15. Correlation: a heatmap of the relationships between the features.
How to read it? Very simple:
 A dark color represents a positive correlation.
 A light color/white tends towards a negative correlation.
plt.figure(figsize=(10, 6))
sns.heatmap(df_cars.corr(), cmap=plt.cm.Reds, annot=True)
plt.title('Heatmap displaying the relationship between\nthe features of the data', fontsize=13)
plt.show()
Relationship between Miles Per Gallon (mpg) and the other features:
 We can see that there is a relationship between the mpg variable and the other variables, which satisfies the first assumption of linear regression.
 mpg has a strong negative correlation with displacement, horsepower, weight, and cylinders.
 This implies that as any one of those variables increases, mpg decreases.
 There are strong positive correlations among displacement, horsepower, weight, and cylinders themselves.
 This violates the no-multicollinearity assumption of linear regression.
 Multicollinearity hinders the performance and accuracy of a regression model. To avoid it, we have to get rid of some of these variables through feature selection.
 The other variables, i.e. acceleration, model_year, and origin, are not highly correlated with each other.
So, I trust that you were able to understand EDA in full flow here; there are still many more functions to explore. If you carry out the EDA process cleanly and precisely, there is a 99% guarantee that you can proceed to model selection, hyperparameter tuning, and deployment effectively without further cleaning or cleansing of your dataset. You still have to continuously monitor the data and confirm that the model output remains fit to predict, classify, or cluster.
Will get back to you all with another topic shortly, until then bye! Cheers! – Shanthababu!