10 Powerful and Time-Saving Data Exploration Hacks, Tips and Tricks!

ram_dewani 14 Apr, 2020

Introduction

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” – Abraham Lincoln

What does this quote by the great Abraham Lincoln have to do with data exploration? Think about it – this quote stands true in most cases in real life, even in our field of data science.

We can’t build a machine learning model and hope to deliver a successful data science project without properly understanding and exploring the data. No matter how many fancy algorithms we deploy or how much computation we use, the model will give spurious results unless we first do the most important activity: data exploration.

Data exploration helps us to understand our data, its structure, strengths, and weaknesses.

We have countless techniques and tools available to perform data exploration. Here’s the kicker though – I’ve seen a lot of data science professionals skipping or skimming through the exploration stage. This is akin to going on a camping trip with a world-class Swiss Army knife but only using it to cut fruits and vegetables. That’s missing the entire point!

So in this article, I have put together 10 powerful data exploration hacks, tips, and tricks to help you save time and quickly analyze the data at hand.

This is part 2 of my Data Science hacks, tips, and tricks series. I highly recommend reading the first part here.

I have also converted my learning into a free course that you can check out:

Data Science Hacks, Tips, and Tricks!

Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository: Data Science hacks, tips and tricks on GitHub.

We are posting these hacks daily on social media platforms like LinkedIn, Twitter, and Facebook. Make sure to follow #avhackoftheday to get your daily dose of freshly brewed data science hacks, tips, and tricks!

Data Exploration Hack #1 – Pandas Profiling
Data Exploration Hack #2 – Building Time Based Features
Data Exploration Hack #3 – Heatmap over a Pandas DataFrame
Data Exploration Hack #4 – Imputing missing values using KNNImputer
Data Exploration Hack #5 – Plotting a Decision Tree
Data Exploration Hack #6 – Binning Data
Data Exploration Hack #7 – Funnel Charts
Data Exploration Hack #8 – Pandas Crosstab
Data Exploration Hack #9 – Interactive plots
Data Exploration Hack #10 – Bar Plot over Pandas DataFrame

Data Exploration Hack #1 – Pandas Profiling

The Pandas library has won the hearts of the majority of data scientists out there. Pandas Profiling is an open-source library built on top of it that gives you an instant overall report of your data. The report includes visualizations of each feature, the percentage of missing values, an indication of multicollinearity, and much more.

It’s truly a handy tool for everyone!

Code for Pandas profiling

Data Exploration Hack #2 – Building Time Based Features

A lot of the data we collect these days contains date and time variables. You can extract a lot of information from them, such as the year, month, quarter, day of the week, and hour, and use it in your analysis. These features will enhance your analysis as well as your predictive model.

Code for Building Time Based Features
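Here’s a minimal sketch of the idea using the Pandas `.dt` accessor. The toy DataFrame and its `timestamp` column are assumptions made up for illustration; swap in your own datetime column.

```python
import pandas as pd

# Hypothetical toy data: a single datetime column named "timestamp"
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-04-14 09:30:00",
        "2019-12-31 23:15:00",
        "2020-07-04 14:00:00",
    ])
})

# The .dt accessor exposes the datetime components of the column
ts = df["timestamp"].dt
df["year"] = ts.year
df["month"] = ts.month
df["quarter"] = ts.quarter
df["day_of_week"] = ts.dayofweek   # Monday=0 ... Sunday=6
df["hour"] = ts.hour
df["is_weekend"] = ts.dayofweek >= 5

print(df)
```

If your column is stored as strings, convert it with `pd.to_datetime` first, as done above.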

Data Exploration Hack #3 – Heatmap over a Pandas DataFrame

Another hack that’s going to impress your colleagues is plotting a heatmap over a Pandas dataframe. This helps you evaluate your results in just one glance and also provides you with a clean and elegant visualization that you can show to your manager and become a rockstar! We use Seaborn to accomplish this task.

Code for plotting a heatmap over a Pandas dataframe
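As a rough sketch, here is one common way to do this with Seaborn: plot the correlation matrix of a DataFrame. The random data and the column names `a` to `d` are placeholders, and plotting correlations is just one assumed use case; any numeric table works the same way.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder data: 50 rows of random numbers in four hypothetical columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

# Pairwise correlations between the columns
corr = df.corr()

# annot=True writes the value inside every cell; vmin/vmax pin the color scale
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
```

In a notebook the figure renders inline once you drop the `matplotlib.use("Agg")` line.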

Data Exploration Hack #4 – Imputing missing values using KNNImputer

KNNImputer is another great addition to scikit-learn, introduced in version 0.22. Usually, we tend to impute the missing values using univariate methods such as SimpleImputer.

Instead, we can use multivariate methods such as KNNImputer to complete this task. The KNNImputer imputes missing values using k-Nearest Neighbors. Each missing value is imputed using the mean value of that feature from the nearest neighbors found in the training set.

Code for imputing missing values using KNNImputer
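A minimal sketch of KNNImputer on a tiny array follows. The numbers are made up for illustration, and `n_neighbors=2` is an arbitrary choice that you would tune on real data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with two missing entries (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that column
# over the 2 nearest rows that have the column observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Distances are computed only over the features that both rows have observed, so rows with gaps can still act as neighbors.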

Data Exploration Hack #5 – Plotting a Decision Tree

Honestly, this is one of the best updates in sklearn. Decision trees are one of the most intuitive algorithms to find the effects of independent variables. Using the plot_tree function from sklearn.tree, you can easily plot a fitted decision tree in just one line of code.

Go ahead and play around with the hyperparameters to get the optimum result!

Code for plotting a decision tree
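Here’s a small sketch using `plot_tree` from `sklearn.tree`. The iris dataset and `max_depth=3` are assumptions chosen to keep the picture readable, not settings from the original example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Fit a shallow tree on the built-in iris data (illustrative choice)
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# The actual plotting is a single call; the rest is setup
fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,  # color each node by its majority class
    ax=ax,
)
```

Changing `max_depth` or `min_samples_leaf` on the classifier and re-plotting is a quick way to see how the hyperparameters reshape the tree.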

Data Exploration Hack #6 – Binning Data

Binning can be really important in your data exploration activity. We typically use it to transform continuous variables into discrete ones.

Let’s take a look at an example from the Titanic dataset where we convert the continuous variable ‘Age’ into a discrete variable ‘AgeGroup’. In this case, it can be more sensible to use AgeGroup, since comparing a handful of groups is often easier to interpret than comparing individual ages. Let’s check out the example in this video:

https://youtu.be/WQagYXIFjns

Code for binning continuous features
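A minimal sketch of binning with `pd.cut` follows. The ages, bin edges, and group labels are hypothetical values picked for illustration, not the ones used in the video.

```python
import pandas as pd

# Hypothetical ages standing in for the Titanic 'Age' column
df = pd.DataFrame({"Age": [2, 15, 22, 37, 45, 61, 80]})

# Assumed bin edges and labels; each bin includes its right edge, e.g. (0, 12]
bins = [0, 12, 18, 40, 60, 120]
labels = ["Child", "Teen", "Adult", "Middle-aged", "Senior"]

df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df)
print(df["AgeGroup"].value_counts(sort=False))
```

If you would rather have bins with roughly equal numbers of rows instead of hand-picked edges, `pd.qcut` is the quantile-based counterpart.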

Data Exploration Hack #7 – Funnel Charts

As a product growth analyst, I am always curious about the journey of users through different stages. The Plotly library provides a great tool to visualize and understand the user journey through the funnel chart.

These charts also help reveal inconsistencies along the user journey. The interactive funnel shows the number and percentage decline at every stage.

Code for Funnel Charts

Data Exploration Hack #8 – Pandas Crosstab

Pandas Crosstab can be really beneficial to validate some basic hypotheses and form a more intuitive view of the data. It computes a simple cross-tabulation of two (or more) factors. By default, it computes a frequency table if no aggregation function is provided.

Let’s dive into the code!

Code for Pandas crosstab
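Here’s a minimal sketch of both modes of `pd.crosstab`: the default frequency table and an aggregated table. The toy rows and the column names `Sex`, `Survived`, and `Fare` are assumptions modeled loosely on the Titanic dataset.

```python
import pandas as pd

# Hypothetical Titanic-style rows
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 10.0],
})

# Default behavior: count how many rows fall into each (Sex, Survived) cell
counts = pd.crosstab(df["Sex"], df["Survived"])

# With values and aggfunc: aggregate another column within each cell
mean_fare = pd.crosstab(df["Sex"], df["Survived"], values=df["Fare"], aggfunc="mean")

print(counts)
print(mean_fare)
```

Passing `normalize="index"` to the first call turns the counts into row-wise proportions, which is handy for comparing rates between groups.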

Data Exploration Hack #9 – Interactive plots

Plots are a great way to visualize your data, but what if I told you that there’s an even better way to do it: interactive plots!

The Cufflinks library binds plotly directly to Pandas dataframes. Therefore, you can make interactive charts without any hassle or lengthy code. You can hover over different plots and data points to see the exact numbers. This tip will definitely make you shine in front of your teammates!

Code for creating interactive plots

Data Exploration Hack #10 – Bar Plot over Pandas DataFrame

A lot of people argue that Excel has far more options than Pandas in terms of exploring your data. Well, Pandas has some cool options too! You can plot bar charts over a Pandas dataframe which will help you understand and explore the data much more effectively.

You can explore a lot of options by tweaking the parameters of df.style.bar():

Code for styling bar chart over Pandas dataframe

End Notes

In this article, we covered 10 data exploration hacks, tips, and tricks across various tools and techniques to help you become a better and more efficient data scientist. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time.

Let me know your Data Science hacks, tips and tricks in the comments section below!

Responses From Readers

Sachchidanand Kumar 11 Apr, 2020

Hello Ram, really great insight on the data analysis part. I tried all the code and hacks given, but I would like to know: is there any edge in using Python for creating visualizations when good visualization tools like Power BI, Tableau, and Qlik are already available in the market? My second doubt: while applying Pandas_Profiling I was getting an error message, and some of the code output (especially the interactive graphs) did not appear. I am using Google Colab for executing this code.

Himanshu goyal 25 Apr, 2020

Amazing, really. I was horrified when there were so many features and you were plotting a heatmap to see the correlations; I am color blind too. Then I followed your post. I like Pandas Profiling too, but the style method you explained has made my life easier. Thanks for the great article.