“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” – Abraham Lincoln
What does this quote by the great Abraham Lincoln have to do with data exploration? Think about it – this quote stands true in most cases in real life, even in our field of data science.
We can’t build a machine learning model and hope to run a successful data science project without properly understanding and exploring the data. No matter how many fancy algorithms we deploy or how much computation we use, the model will give spurious results until we do the most important activity: data exploration.
Data exploration helps us to understand our data, its structure, strengths, and weaknesses.
We have countless techniques and tools available to perform data exploration. Here’s the kicker though – I’ve seen a lot of data science professionals skipping or skimming through the exploration stage. This is akin to going on a camping trip with a world-class Swiss Army knife but only using it to cut fruits and vegetables. That’s missing the entire point!
So in this article, I have put together 10 powerful data exploration hacks, tips, and tricks to help you save time and quickly analyze the data at hand.
This is part 2 of my Data Science hacks, tips, and tricks series. I highly recommend reading the first part here.
I have also converted my learning into a free course that you can check out:
Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository: Data Science hacks, tips and tricks on GitHub.
We are posting these hacks daily on social media platforms like LinkedIn, Twitter, Facebook. Make sure to follow #avhackoftheday to get your daily dose of freshly brewed data science hacks, tips, and tricks!
Table of Contents
We’ll cover these data exploration hacks, tips and tricks:
- Data Exploration Hack #1 – Pandas Profiling
- Data Exploration Hack #2 – Building Time Based Features
- Data Exploration Hack #3 – Heatmap over a Pandas DataFrame
- Data Exploration Hack #4 – Imputing missing values using KNNImputer
- Data Exploration Hack #5 – Plotting a Decision Tree
- Data Exploration Hack #6 – Binning Data
- Data Exploration Hack #7 – Funnel Charts
- Data Exploration Hack #8 – Pandas Crosstab
- Data Exploration Hack #9 – Interactive plots
- Data Exploration Hack #10 – Bar Plot over Pandas DataFrame
Data Exploration Hack #1 – Pandas Profiling
The Pandas library has won the hearts of the majority of data scientists out there. Pandas Profiling, a library built on top of it, gives you an instant overall report of your data: visualizations of each feature, the percentage of missing values, an indication of multicollinearity, and much more.
It’s truly a handy tool for everyone!
Data Exploration Hack #2 – Building Time Based Features
A lot of the data we collect these days contains date and time variables. There is a lot of information you can extract from them, such as the year, month, quarter, day of the week, and hour, and use in your analysis. These features will enhance your analysis as well as your predictive model.
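Here is a minimal sketch of how these features can be pulled out with the pandas `.dt` accessor. The `order_time` column and its three timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three made-up order timestamps.
df = pd.DataFrame({
    "order_time": pd.to_datetime([
        "2020-01-15 08:30:00",
        "2020-04-03 17:45:00",
        "2020-12-26 23:10:00",
    ])
})

# The .dt accessor exposes the individual components of a datetime column.
df["year"] = df["order_time"].dt.year
df["month"] = df["order_time"].dt.month
df["quarter"] = df["order_time"].dt.quarter
df["day_of_week"] = df["order_time"].dt.dayofweek  # Monday=0 ... Sunday=6
df["hour"] = df["order_time"].dt.hour

# A derived flag built from one of the extracted features.
df["is_weekend"] = df["day_of_week"] >= 5

print(df)
```

If your date column is stored as text, convert it with `pd.to_datetime` first, as done above; the `.dt` accessor only works on datetime columns.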
Data Exploration Hack #3 – Heatmap over a Pandas DataFrame
Another hack that’s going to impress your colleagues is plotting a heatmap over a Pandas dataframe. This helps you evaluate your results in just one glance and also provides you with a clean and elegant visualization that you can show to your manager and become a rockstar! We use Seaborn to accomplish this task.
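One common way to do this is to draw the correlation matrix of a dataframe with Seaborn's `heatmap`. The sketch below assumes that use case; the random data and the column names `a` to `d` are placeholders for your own features.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: random numbers standing in for real features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

# Pairwise correlations between the numeric columns.
corr = df.corr()

# annot=True writes the value inside each cell; vmin/vmax pin the colour scale to [-1, 1].
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
plt.tight_layout()
```

In a notebook the figure renders on its own; in a script you would drop the `matplotlib.use("Agg")` line and call `plt.show()`, or save the figure with `plt.savefig`.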
Data Exploration Hack #4 – Imputing missing values using KNNImputer
KNNImputer is another great class, added to scikit-learn in version 0.22. Usually, we tend to impute the missing values using univariate methods such as SimpleImputer.
Instead, we can use multivariate methods such as KNNImputer to complete this task. The KNNImputer imputes missing values using k-Nearest Neighbors. The missing values are imputed using the mean value from the nearest neighbors found in the training set.
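To make this concrete, here is a small sketch on a toy array. The numbers are made up, and `n_neighbors=2` is just a choice for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing entries (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows that have the feature present.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

For the first row, the two nearest neighbours have 3 and 5 in the last column, so the gap is filled with their mean, 4. Keep in mind that the distances depend on the scale of your features, so scaling before imputing is often worth considering.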
Data Exploration Hack #5 – Plotting a Decision Tree
Honestly, this is one of the best updates in sklearn. Decision trees are one of the most intuitive algorithms for finding the effects of independent variables. Using the plot_tree function from sklearn.tree, you can plot a fitted decision tree in just one line of code.
Go ahead and play around with the hyperparameters to get the optimum result!
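A minimal sketch of plotting a tree with `plot_tree` from `sklearn.tree` could look like this. The Iris dataset and `max_depth=3` are my own choices for the example, not something prescribed above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Assumed example data: the Iris dataset bundled with scikit-learn.
iris = load_iris()

# A shallow tree keeps the plot readable; max_depth is one hyperparameter to play with.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))

# The plotting itself is a single call on the fitted model.
annotations = plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)
```

Each box shows the split condition, the impurity, the sample count, and the majority class, and `filled=True` colours the boxes by class.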
Data Exploration Hack #6 – Binning Data
Binning can be really important in your data exploration activity. We typically use it to transform continuous variables into discrete ones.
Let’s take a look at an example from the Titanic dataset, where we convert the continuous variable ‘Age’ into a discrete variable ‘AgeGroup’. In this case, it’s more sensible to include AgeGroup, as groups are easier to compare than individual ages and tend to give more insightful results.
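Here is a small sketch of the Age to AgeGroup conversion using `pd.cut`. The six ages, the bin edges, and the group labels are all made up for illustration; they are not taken from the actual Titanic data.

```python
import pandas as pd

# Hypothetical ages standing in for the Titanic 'Age' column.
df = pd.DataFrame({"Age": [2, 15, 22, 38, 54, 71]})

# Illustrative bin edges and labels; intervals are right-inclusive,
# i.e. (0, 12], (12, 18], (18, 60], (60, 120].
bins = [0, 12, 18, 60, 120]
labels = ["Child", "Teen", "Adult", "Senior"]

df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df)
print(df["AgeGroup"].value_counts())
```

If you would rather have bins with roughly equal numbers of rows instead of fixed edges, `pd.qcut` is the quantile-based counterpart.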
Data Exploration Hack #7 – Funnel Charts
As a product growth analyst, I am always curious about the journey of users through different stages. The Plotly library provides a great tool to visualize and understand the user journey through the funnel chart.
These charts also help you spot inconsistencies along the user journey. The interactive funnel shows the number and percentage decline at every stage.
Data Exploration Hack #8 – Pandas Crosstab
Pandas Crosstab can be really beneficial to validate some basic hypotheses and form a more intuitive view of the data. It computes a simple cross-tabulation of two (or more) factors. By default, it’ll compute a frequency table if no aggregation function is provided.
Let’s deep dive into its code!
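Here is a short sketch of `pd.crosstab` in three flavours: plain counts, row-wise shares, and an aggregation of a third column. The tiny dataframe and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical passenger-style data.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "survived": [1, 0, 1, 1, 0, 0],
    "fare": [70.0, 8.0, 30.0, 50.0, 10.0, 7.0],
})

# Default behaviour: a frequency table of the two factors.
counts = pd.crosstab(df["gender"], df["survived"])

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(df["gender"], df["survived"], normalize="index")

# With values and aggfunc, each cell aggregates a third column instead of counting.
mean_fare = pd.crosstab(
    df["gender"], df["survived"], values=df["fare"], aggfunc="mean"
)

print(counts)
print(shares)
print(mean_fare)
```

The counts table is a quick check of a hypothesis such as "survival differs by gender", while the normalized version makes the rows directly comparable when group sizes differ.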
Data Exploration Hack #9 – Interactive plots
Plots are a great way to visualize your data but what if I tell you that there’s an even better way to do it – using Interactive Plots!
The Cufflinks library binds plotly directly to Pandas dataframes. Therefore, you can make interactive charts without any hassle or long code. You can hover over different plots and data points to see the exact numbers. This tip will definitely make you shine in front of your teammates!
Data Exploration Hack #10 – Bar Plot over Pandas DataFrame
A lot of people argue that Excel has far more options than Pandas in terms of exploring your data. Well, Pandas has some cool options too! You can plot bar charts over a Pandas dataframe which will help you understand and explore the data much more effectively.
You can explore a lot of options by tweaking the parameters of df.style.bar():
In this article, we covered 10 data exploration hacks, tips, and tricks across various tools and techniques to become a better and efficient data scientist. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time.
Let me know your Data Science hacks, tips and tricks in the comments section below!