Exploratory Data Analysis – The Go-To Technique to Explore Your Data!
This article was published as a part of the Data Science Blogathon.
Introduction
Exploratory Data Analysis (EDA) is one of the most underrated and under-utilized approaches in any Data Science project. It is the first step data scientists perform: they study the data to extract valuable information and non-obvious insights, which ultimately helps during model building.
Before you model the data and test it, you need to build a relationship with the data. You can build this relationship by exploring the data, plotting it against the target variable, and observing how it behaves. This process of analysis before modeling is called Exploratory Data Analysis.
In this article, we are going to perform a hands-on EDA on a complex dataset from Kaggle (Advanced House Prediction). The link to the dataset is given below:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The lifecycle of a Data Science Project
1) Exploratory Data Analysis
2) Feature Engineering
3) Feature Selection
4) Hyperparameter tuning
5) Model Building and deployment
Let us perform EDA on this complex dataset, whose training file has 81 columns: an Id column, 79 independent features, and 1 target variable (sale price). It is a regression problem.
EDA will contain some basic steps like analyzing missing values, numerical and categorical features’ distribution, outliers, multicollinearity, etc. We will see each one of the steps one by one.
Missing Values
Most of the time the data we obtain contains missing values, and we need to find out whether there is any relationship between the missing data and the sale price (target variable). Depending on that, we replace the missing values with something like the median of that column.
The Python code for this step collects the features that have missing values into a list. For each such feature, we then replace missing values with 1 and non-missing values with 0, and plot this indicator against the median sale price to see whether there is a relationship between null values and the target variable.
Output (the numbers are fractions of rows, so 0.1774 means about 17.7% of rows are missing, despite the % label):
LotFrontage 0.1774 % missing values
Alley 0.9377 % missing values
MasVnrType 0.0055 % missing values
MasVnrArea 0.0055 % missing values
BsmtQual 0.0253 % missing values
BsmtCond 0.0253 % missing values
BsmtExposure 0.026 % missing values
BsmtFinType1 0.0253 % missing values
BsmtFinType2 0.026 % missing values
FireplaceQu 0.4726 % missing values
GarageType 0.0555 % missing values
GarageYrBlt 0.0555 % missing values
GarageFinish 0.0555 % missing values
GarageQual 0.0555 % missing values
GarageCond 0.0555 % missing values
PoolQC 0.9952 % missing values
Fence 0.8075 % missing values
MiscFeature 0.963 % missing values
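As a rough illustration of how such a listing could be produced, here is a minimal sketch. It assumes the data is in a pandas DataFrame; the tiny DataFrame below is made-up toy data standing in for the Kaggle training file, and the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle training data (values are made up).
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0],
    "Alley": [np.nan, np.nan, "Pave", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Columns that contain at least one missing value.
features_with_na = [col for col in df.columns if df[col].isnull().any()]

# isnull().mean() is the fraction of missing rows (0.4 means 40%).
missing_fraction = df[features_with_na].isnull().mean().round(4)

for col, frac in missing_fraction.items():
    print(f"{col}: {frac:.4f} of rows missing ({frac:.1%})")
```

Printing both the fraction and the percentage avoids confusing a value like 0.4 with 0.4%.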
Since many features have missing values, we need to find the relationship between the null values and the target variable (sale price).
This is one of the plots. It shows that the null values of the LotFrontage feature have an impact on the target variable: the median sale price differs between houses with and without a LotFrontage value. So yes, there is a relationship between the two, and we need to replace the null values with something substantial like the median of that particular feature.
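The comparison described above can be sketched roughly as follows. This is only an illustration on made-up toy data; the helper name `median_target_by_missing` is hypothetical and not taken from the original code.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with some missing LotFrontage values.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0, 60.0],
    "SalePrice": [200000, 230000, 210000, 250000, 190000, 180000],
})

def median_target_by_missing(data, feature, target="SalePrice"):
    """Median of `target` where `feature` is present (0) vs missing (1)."""
    indicator = data[feature].isnull().astype(int).rename(f"{feature}_missing")
    return data.groupby(indicator)[target].median()

medians = median_target_by_missing(df, "LotFrontage")
print(medians)

# One possible replacement: fill the gaps with the column median.
filled = df["LotFrontage"].fillna(df["LotFrontage"].median())
print(filled)
```

If the two medians are clearly different, the fact that a value is missing carries information about the price, which is the reason for handling it deliberately instead of dropping the rows.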
Numerical Features
Since this is a large dataset, we need to look separately at the different types of variables: date-time (year) features, discrete and continuous numerical features, and categorical features, and at how each behaves with respect to the target variable.
There are 39 numerical features in this dataset. Columns holding strings, or a mix of strings and numbers, have the data type object, which we can check using the dtypes attribute.
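A minimal sketch of separating numerical from non-numerical columns, assuming a pandas DataFrame. The toy data is made up, and `select_dtypes` is just one way to do the split.

```python
import pandas as pd

# Toy data (made up): two integer columns, one float, one string column.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

# dtypes shows the data type pandas inferred for each column.
print(df.dtypes)

# Numeric columns (ints and floats) vs everything else.
numerical_features = df.select_dtypes(include="number").columns.tolist()
categorical_features = df.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_features)
print("Categorical:", categorical_features)
```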
Date-time variables (year features or temporal variables)
The Python code for this step finds the year features and shows how these four features behave with respect to the target variable.
We can see here that as YrSold increases, the median sale price decreases. This looks like an anomaly, since house prices usually rise over time, so we need to do more analysis before drawing conclusions. This shows the importance of EDA and how it can affect our conclusions.
Instead of comparing the sale price with the YrSold feature directly, let us compare the sale price with the difference between YrSold and each of the other year features.
Now we can compare the median sale price with the number of years since the house was built or remodeled, and draw conclusions such as: as the value on the X-axis increases (the house gets older), the price decreases.
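The year-difference idea can be sketched like this. The toy values are made up, and picking year columns by the substrings "Yr" or "Year" in their names is an assumption of this example, not necessarily how the original code did it.

```python
import pandas as pd

# Toy data (made up) with the four year columns of the dataset.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "YearRemodAdd": [2003, 1976, 2002, 1970, 2000],
    "GarageYrBlt": [2003, 1976, 2001, 1998, 2000],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Assumption: year features are the columns whose name contains "Yr" or "Year".
year_features = [c for c in df.columns if "Yr" in c or "Year" in c]

# Median sale price per year of sale (the first comparison in the text).
print(df.groupby("YrSold")["SalePrice"].median())

# Years elapsed between each year feature and the sale.
ages = pd.DataFrame({"SalePrice": df["SalePrice"]})
for feature in year_features:
    if feature != "YrSold":
        ages[f"YearsSince_{feature}"] = df["YrSold"] - df[feature]

print(ages)
```

Plotting each `YearsSince_...` column against `SalePrice` (for example with a scatter plot) then shows how price changes with the age of the house.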
Discrete numerical features
Discrete variables are variables that take a limited, countable set of distinct values.
I have set the threshold for the number of unique values in a feature at 25: a numerical feature with fewer unique values than that, and which is not a year feature, is treated as discrete. Now let us see if there is a relationship between the discrete features and the target variable.
We can see that a feature like OverallQual (overall quality) has a direct relation with the target variable: the sale price rises with the quality rating.
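A possible sketch of this selection rule follows. The function name and its parameters are hypothetical, and the toy data is made up. Because the toy data has only eight rows, the example call uses a smaller threshold than the 25 mentioned above.

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "OverallQual": [7, 6, 7, 5, 8, 5, 8, 6],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    "YrSold": [2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

def find_discrete_features(data, year_features, max_unique=25, target="SalePrice"):
    """Numeric columns with fewer than `max_unique` distinct values,
    excluding year features and the target itself."""
    numeric = data.select_dtypes(include="number").columns
    return [
        col for col in numeric
        if data[col].nunique() < max_unique
        and col not in year_features
        and col != target
    ]

# Threshold of 5 only because the toy data is tiny.
discrete = find_discrete_features(df, year_features=["YrSold"], max_unique=5)
print(discrete)

# Median sale price for each quality level.
medians = df.groupby("OverallQual")["SalePrice"].median()
print(medians)
```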
Continuous numerical features
These are features whose values can be any number within a range rather than a fixed set of values. Using histograms, we analyze their distribution across the data set.
We can see that the distributions we obtained are skewed. In regression problems, it often helps to transform a skewed distribution so that it is closer to a normal distribution, as this can improve the accuracy of models such as linear regression.
Logarithmic transformation is one technique for this: we take the log of all values of that feature and store the result as a new log feature. It only applies to strictly positive values.
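To make the effect of the log transform concrete, here is a small sketch on made-up prices with one very large value. Sample skewness is used as a rough measure of asymmetry.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (made up): one value far above the rest.
prices = pd.Series(
    [100000, 120000, 130000, 150000, 160000, 180000, 250000, 755000],
    name="SalePrice",
)

# The natural log is only defined for strictly positive values.
assert (prices > 0).all()
log_prices = np.log(prices)

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# The transform is reversible: exp() recovers the original prices.
recovered = np.exp(log_prices)
```

The skewness drops after the transform but does not become zero, so the log makes the distribution more symmetric and does not guarantee an exactly normal one.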
Outliers
An outlier is a data point that lies far outside the rest of the distribution of the data set.
The presence of outliers in the dataset can hamper the accuracy of the model. Algorithms like linear regression are very sensitive to outliers, so they need to be handled carefully.
The Standard Deviation method is a common way to identify and replace outliers: any data point that lies more than 3 standard deviations from the mean is considered an outlier. That threshold can be changed depending on the size of the data set.
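A minimal sketch of the standard deviation rule, using made-up living-area values. The helper name `std_outlier_mask` is hypothetical, and the sample standard deviation (the pandas default) is assumed.

```python
import pandas as pd

# Toy data (made up): 19 ordinary values and one extreme value.
areas = pd.Series(list(range(1000, 1950, 50)) + [6000], name="GrLivArea")

def std_outlier_mask(values, n_std=3.0):
    """True where a value lies more than `n_std` standard deviations from the mean."""
    z_scores = (values - values.mean()) / values.std()
    return z_scores.abs() > n_std

mask = std_outlier_mask(areas)
print(areas[mask])
```

With very small samples a single point cannot get far from the mean in standard-deviation units, because it inflates the standard deviation itself. This is one reason the threshold may need adjusting to the size of the data set.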
Here in EDA, let us analyze the outliers in the data set using boxplots.
The black dots denote the outliers, which lie away from the bulk of the distribution. The lower edge of the rectangular box is the 25th percentile and the upper edge is the 75th percentile.
So those black dots are the values that need to be removed or replaced, which we will see in feature engineering.
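The points a boxplot draws as separate dots can also be computed directly. The sketch below assumes the common default rule, in which whiskers extend 1.5 times the interquartile range beyond the box; the data is made up.

```python
import pandas as pd

# Toy data (made up): 19 ordinary values and one extreme value.
areas = pd.Series(list(range(1000, 1950, 50)) + [6000], name="GrLivArea")

q1 = areas.quantile(0.25)   # lower edge of the box
q3 = areas.quantile(0.75)   # upper edge of the box
iqr = q3 - q1               # height of the box

# Assumed whisker rule: 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = areas[(areas < lower_fence) | (areas > upper_fence)]
print(f"box: [{q1}, {q3}], fences: [{lower_fence}, {upper_fence}]")
print(outliers)
```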
Categorical features
The data type of a categorical feature is object, and we can check that with the dtypes attribute of pandas.
We generally convert the categorical values of a feature into dummy variables so that our algorithm can work with them. This is called one-hot encoding. If the cardinality of a feature (its number of distinct categories) is very high, we do not use one-hot encoding, as it adds one column per category and might lead to the curse of dimensionality.
The feature is MSZoning and number of categories are 5
The feature is Street and number of categories are 2
The feature is Alley and number of categories are 3
The feature is LotShape and number of categories are 4
The feature is LandContour and number of categories are 4
The feature is Utilities and number of categories are 2
The feature is LotConfig and number of categories are 5
The feature is LandSlope and number of categories are 3
The feature is Neighborhood and number of categories are 25
The feature is Condition1 and number of categories are 9
The feature is Condition2 and number of categories are 8
The feature is BldgType and number of categories are 5
The feature is HouseStyle and number of categories are 8
The feature is RoofStyle and number of categories are 6
The feature is RoofMatl and number of categories are 8
The feature is Exterior1st and number of categories are 15
The feature is Exterior2nd and number of categories are 16
The feature is MasVnrType and number of categories are 5
The feature is ExterCond and number of categories are 5
The feature is Foundation and number of categories are 6
The feature is BsmtQual and number of categories are 5
The feature is BsmtCond and number of categories are 5
The feature is BsmtExposure and number of categories are 5
The feature is BsmtFinType1 and number of categories are 7
The feature is BsmtFinType2 and number of categories are 7
The feature is Heating and number of categories are 6
The feature is HeatingQC and number of categories are 5
The feature is CentralAir and number of categories are 2
The feature is Electrical and number of categories are 6
The feature is KitchenQual and number of categories are 4
The feature is Functional and number of categories are 7
The feature is FireplaceQu and number of categories are 6
The feature is GarageType and number of categories are 7
The feature is GarageFinish and number of categories are 4
The feature is GarageQual and number of categories are 6
The feature is GarageCond and number of categories are 6
The feature is PavedDrive and number of categories are 3
The feature is PoolQC and number of categories are 4
The feature is Fence and number of categories are 5
The feature is SaleType and number of categories are 9
The feature is SaleCondition and number of categories are 6
The threshold I have chosen in this case is 10 categories: features with more categories than that are not one-hot encoded.
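The cardinality check and the encoding step could look roughly like this. The helper name `split_by_cardinality` is hypothetical, the toy data is made up, and treating the threshold as "at most N categories" is an assumption. The example call uses a threshold of 3 only because the toy data is tiny.

```python
import pandas as pd

# Toy categorical data (made up).
df = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge"],
})

def split_by_cardinality(data, max_categories=10):
    """Split columns into low- and high-cardinality lists.
    nunique() ignores missing values, so NaN is not counted as a category."""
    low, high = [], []
    for col in data.columns:
        if data[col].nunique() <= max_categories:
            low.append(col)
        else:
            high.append(col)
    return low, high

low, high = split_by_cardinality(df, max_categories=3)
print("one-hot encode:", low)
print("handle differently:", high)

# One-hot encode only the low-cardinality columns.
encoded = pd.get_dummies(df[low], columns=low)
print(encoded.columns.tolist())
```

Each encoded feature contributes one column per category, which is why a 25-category feature such as Neighborhood inflates the feature count quickly.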
Now let us check whether there is any relationship between the categorical features and the median of the target variable (sale price).
Multicollinearity
In any dataset, when the independent features are strongly correlated with each other, the individual contribution of each feature cannot be estimated reliably, which hampers the model. This is called multicollinearity.
This is a serious problem for algorithms like linear and logistic regression.
How to fix it?
We use a correlation matrix drawn as a heatmap to visualize how all the independent features relate to each other through their correlation coefficients.
Generally, 0.7 is taken as the threshold: if any 2 features have a correlation whose absolute value is above 0.7, one of the two can be dropped.
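The thresholding step can be sketched without any plotting. The helper name `highly_correlated_pairs` is hypothetical and the toy values are made up. Pearson correlation (the pandas default) is assumed.

```python
import pandas as pd

# Toy data (made up): garage size in cars and in area move together.
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 1, 2, 3, 0],
    "GarageArea": [280, 480, 520, 840, 300, 460, 800, 0],
    "LotArea": [9000, 8000, 12000, 9500, 11000, 7000, 10000, 9800],
})

def highly_correlated_pairs(data, threshold=0.7):
    """Feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = data.corr().abs()
    cols = corr.columns
    pairs = []
    # Only look above the diagonal so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

pairs = highly_correlated_pairs(df, threshold=0.7)
print(pairs)
```

The same `data.corr()` matrix is what a heatmap would display, so the listing and the plot can be produced from one computation. Which feature of a flagged pair to drop is a judgment call, for example keeping the one more strongly related to the target.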
Conclusion
These were some important steps to perform in Exploratory Data Analysis, and they also show the importance of EDA in real-life projects. I hope everyone uses this technique in their own projects.
Happy Learning! 🙂