10 Awesome Data Manipulation and Wrangling Hacks, Tips and Tricks

Ram Dewani 05 Apr, 2023


Introduction

“Efficiency is doing things right. Effectiveness is doing the right thing.” – Zig Ziglar

As data scientists, we are often taught to be effective and do whatever it takes to get the job done. But ask yourself this – are we efficient at what we do on a day-to-day basis in a data science project? Is there any way to quicken the code we run or the monotonous processes we go through each day?

When I entered the field of data science, I was really happy to have the sexiest job of the 21st century. I dreamed of building state-of-the-art machine learning models until I came face to face with a harsh reality – the majority of my time was going into cleaning and organizing data. I’m sure most of you have gone through this frustrating stage. It’s no secret – data preparation accounts for about 60-80% of a data scientist’s role.

This motivated me to take the path of becoming an efficient data scientist. I started searching for hacks to quickly build my code, tips to speed up the data preparation/cleaning process, and tricks to become far more efficient at data science tasks.

In this article, I will walk you through some of the data manipulation and data wrangling hacks, tips, and tricks that have served me well. I hope these help you in your journey and role as well!

I have also converted my learning into a free course that you can check out:

Data Science Hacks, Tips, and Tricks!

Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository – Data Science hacks, tips and tricks on GitHub.

We are posting these hacks daily on social media platforms like LinkedIn, Twitter, and Facebook. Make sure to follow #avhackoftheday to get your daily dose of freshly brewed data science hacks, tips, and tricks!

Data Science Hack #1 – Select Data Type using Pandas
Data Science Hack #2 – Extract E-mail from text
Data Science Hack #3 – Remove Emojis from Text
Data Science Hack #4 – Image Augmentation
Data Science Hack #5 – Resizing Images
Data Science Hack #6 – Apply Pandas Operations in Parallel
Data Science Hack #7 – Pandas Melt
Data Science Hack #8 – Divide equal proportion of classes (Classification)
Data Science Hack #9 – Reading Data from multiple files
Data Science Hack #10 – Splitting Column using str.split()

Data Science Hack #1 – Select Data Type using Pandas

At the start of my data science journey, I used to write an ‘if’ condition to separate out continuous and categorical variables for data analysis. This was a taxing task as it consumed a lot of unnecessary time and energy. Then I came across this simple Pandas hack which made my life so much simpler!

Code for selecting data type in 1 line of code
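A minimal sketch of this hack, assuming a small made-up dataframe (the column names and values are hypothetical). It uses the Pandas select_dtypes method to pull out the numeric and non-numeric columns without any ‘if’ conditions:

```python
import pandas as pd

# Hypothetical example data: two numeric and two text columns
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [50000.0, 64000.5, 81000.0],
    "city": ["Pune", "Delhi", "Mumbai"],
    "gender": ["M", "F", "F"],
})

# Continuous (numeric) variables in one line
numeric_df = df.select_dtypes(include="number")

# Everything that is not numeric (here: the text columns)
categorical_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))
print(list(categorical_df.columns))
```

select_dtypes also accepts a list, for example include=["int64", "float64"], if you need finer control over which types are selected.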

Data Science Hack #2 – Extract E-mail from text

One of the most important parts of digital marketing is getting the e-mail IDs of your customers. Is there any way to extract these IDs from text? Of course there is – RegEx to the rescue!

This hack provides a regular expression you can use to extract e-mail IDs from text!

Code for extracting E-mails from text
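A minimal sketch, assuming a deliberately simple pattern that covers common addresses (it is not a full validator, and the sample text and addresses are made up):

```python
import re

# Simple pattern: local part, "@", domain, and a top-level domain of 2+ letters.
# This is an assumption for illustration and will not match every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact alice@example.com or bob.smith@mail.example.org for details"

# findall returns every non-overlapping match as a list of strings
emails = EMAIL_PATTERN.findall(text)
print(emails)
```

Compiling the pattern once with re.compile is handy when you apply it to many documents, for example over a dataframe column.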

Data Science Hack #3 – Remove Emojis from Text

Preprocessing is one of the key steps for improving the performance of any machine learning model. One of the main reasons for text preprocessing is to remove unwanted characters from text like punctuation, emojis, links and so on which are not required for our problem statement.

This hack will help you get rid of these unnecessary emojis!

Code for Removing emojis from text
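A minimal sketch, assuming that matching a handful of common emoji Unicode blocks is good enough for your data. The list of ranges below is an assumption and is not exhaustive; the helper name remove_emojis is hypothetical:

```python
import re

# Character class built from a few common emoji code point ranges (not exhaustive)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U00002700-\U000027BF"  # dingbats
    "]+"
)


def remove_emojis(text):
    """Return text with every character in the ranges above removed."""
    return EMOJI_PATTERN.sub("", text)


# Escapes are used for the sample emojis (grinning face, thumbs up)
sample = "Great job \U0001F600\U0001F44D!"
print(remove_emojis(sample))
```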

Data Science Hack #4 – Image Augmentation

Deep Learning models usually require a lot of data for training. But acquiring massive amounts of data comes with its own challenges. Instead of spending days manually collecting data, you can make use of Image Augmentation techniques.

It is the process of generating new images. These new images are generated using the existing training images and hence we don’t have to collect them manually.

Code for image augmentation
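A minimal sketch of the idea using only NumPy, assuming a randomly generated array stands in for a real training image. Dedicated image libraries offer richer transforms (arbitrary rotation angles, shifts, zooms); the function name augment and the noise level are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a real training image: height 32, width 48, 3 colour channels
image = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)


def augment(image, rng):
    """Return a dict of new images derived from one existing image."""
    # Gaussian noise, clipped back into the valid 0-255 pixel range
    noise = rng.normal(loc=0.0, scale=10.0, size=image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return {
        "flip_left_right": np.fliplr(image),
        "flip_up_down": np.flipud(image),
        "rotate_90": np.rot90(image),  # swaps height and width
        "noisy": noisy,
    }


augmented = augment(image, rng)
for name, img in augmented.items():
    print(name, img.shape)
```

Each call turns one image into four additional training samples, so the labels of the original image can simply be reused for the new ones.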

Data Science Hack #5 – Resizing Images

While building an image classification model using deep learning, all the input images usually need to be the same size. However, as the data comes from different sources, images may have different shapes.

So, to convert them to the same shape, we can use the resize function from the OpenCV library. This hack will help you convert the images of any shape to a specified shape:

Code for Image Resizing

Data Science Hack #6 – Apply Pandas Operations in Parallel

Pandas runs most operations on a single CPU core, so it can be slow on large datasets. Pandarallel is a simple and efficient tool to parallelize Pandas operations across all your available CPUs! This trick can save a lot of your precious time.

Code for applying Pandas operations in parallel
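A minimal sketch, assuming the pandarallel package is installed. The dataframe and the row_ratio function are hypothetical, and the fallback to plain apply is only there so the example still runs when the package is missing:

```python
import pandas as pd

try:
    from pandarallel import pandarallel

    # Starts the worker setup; by default it uses all available CPUs
    pandarallel.initialize(progress_bar=False)
    HAS_PANDARALLEL = True
except ImportError:
    HAS_PANDARALLEL = False

# Hypothetical data: 1000 rows of two numeric columns
df = pd.DataFrame({"a": range(1, 1001), "b": range(1001, 2001)})


def row_ratio(row):
    """Row-wise function we want to apply to every row."""
    return row["b"] / row["a"]


if HAS_PANDARALLEL:
    # Same call as df.apply, but the rows are split across worker processes
    result = df.parallel_apply(row_ratio, axis=1)
else:
    result = df.apply(row_ratio, axis=1)

print(result.head())
```

On a dataframe this small the overhead of starting workers will likely outweigh the gain; the benefit shows up with large dataframes and expensive row-wise functions.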

Data Science Hack #7 – Pandas Melt

Pandas’ melt function helps you bring your dataframe into a tidy form by unpivoting it from wide to long format. In pd.melt(), one or more columns are kept as identifiers (id_vars), while the remaining columns are stacked into two columns: one holding the original column name and one holding the value. You can “unmelt” the data with the pivot() function:

Code for Pandas Melt()
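A minimal sketch of melting and “unmelting”, assuming a small made-up table of student scores (all names and values are hypothetical):

```python
import pandas as pd

# Wide format: one row per student, one column per subject
wide = pd.DataFrame({
    "name": ["Asha", "Ravi"],
    "math": [90, 75],
    "science": [85, 80],
})

# Wide -> long: "name" is the identifier, the subject columns are stacked
long = pd.melt(wide, id_vars="name", var_name="subject", value_name="score")
print(long)

# Long -> wide again with pivot()
restored = long.pivot(index="name", columns="subject", values="score").reset_index()
restored.columns.name = None  # drop the leftover "subject" label on the columns
print(restored)
```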

Data Science Hack #8 – Divide equal proportion of classes (Classification)

A very common beginner mistake in classification problems is splitting the data without preserving the class proportions in the train and test sets, which often leads to misleading results, especially with imbalanced classes. Sklearn provides an easy way to avoid this: the “stratify” parameter of the train_test_split function.

In this example, we pass stratify = y, and you can observe the difference of proportion in both cases – with stratify and without stratify.

Code for dividing equal proportion of classes
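A minimal sketch of the comparison, assuming a made-up imbalanced dataset with 80 samples of class 0 and 20 of class 1 (the data, test size, and random seed are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 20% of them in the positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# Without stratify: class proportions in each split are left to chance
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# With stratify=y: both splits keep the original 80/20 class ratio
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("positive share without stratify:", y_train_plain.mean(), y_test_plain.mean())
print("positive share with stratify:   ", y_train_strat.mean(), y_test_strat.mean())
```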

Data Science Hack #9 – Reading Data from multiple files

You will often need to read data spread across multiple files. For example, a retailer may keep sales data in separate files for each year. In this case, you can use glob, a module that finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, and then read each matching file. Let’s see it in this example:

Code for reading multiple data files
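A minimal sketch, assuming yearly CSV files named sales_<year>.csv. The file names, columns, and values are hypothetical, and the files are first written to a temporary folder so the example runs on its own:

```python
import glob
import os
import tempfile

import pandas as pd

# Create a few made-up yearly sales files in a temporary folder
tmp_dir = tempfile.mkdtemp()
for year, total in [(2018, 100), (2019, 150), (2020, 175)]:
    pd.DataFrame({"year": [year], "sales": [total]}).to_csv(
        os.path.join(tmp_dir, f"sales_{year}.csv"), index=False
    )

# glob returns every path matching the shell-style pattern;
# sorting makes the order predictable
paths = sorted(glob.glob(os.path.join(tmp_dir, "sales_*.csv")))

# Read each file and stack them into a single dataframe
sales = pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)
print(sales)
```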

Data Science Hack #10 – Splitting Column using str.split()

pandas.Series.str gives you vectorized string functions on a Pandas dataframe column, and its split() method splits every string in the column around a delimiter. Let’s say you want to split the names in a dataframe column into first name and last name. pandas.Series.str along with split() can be used to perform this task.

Code for splitting column
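A minimal sketch, assuming a made-up “name” column where the first and last names are separated by a single space (the names and new column names are hypothetical):

```python
import pandas as pd

# Hypothetical data: full names in one column
df = pd.DataFrame({"name": ["Elise Mccann", "Aiden Berger", "Elle Kelley"]})

# n=1 splits only at the first space; expand=True returns one column per part
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)

print(df)
```

Limiting the split with n=1 keeps multi-word last names together in the second column instead of spilling into extra columns.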

End Notes

In this article, we covered 10 data manipulation and data wrangling hacks, tips and tricks across various tools and techniques to help you become a better and more efficient data scientist. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time.

Let me know your Data Science hacks, tips and tricks in the comments section below!


Responses From Readers

Dr.D.K.Samuel 26 Mar, 2020

Please give a way to embed Jupyter animations in PowerPoint as HTML or HTML5 video, or even as MP4. Please.

Rakshit Sakhuja 26 Mar, 2020

It's Awesome. I was not aware that there is also a way to do parallel operation in pandas. For image augmentation, you can explain what does wrap mode in rotation does?