10 Awesome Data Manipulation and Wrangling Hacks, Tips and Tricks

Last Updated : 05 Apr, 2023

Introduction

“Efficiency is doing things right. Effectiveness is doing the right thing.” – Zig Ziglar

As data scientists, we are often taught to be effective and do whatever it takes to get the job done. But ask yourself this – are we efficient at what we do on a day-to-day basis in a data science project? Is there any way to speed up the code we run or the monotonous processes we go through each day?

When I entered the field of data science, I was really happy to have the sexiest job of the 21st century. I dreamed of building state-of-the-art machine learning models until I came face to face with a harsh reality – the majority of my time was going into cleaning and organizing data. I’m sure most of you have gone through this frustrating stage. It’s no secret – data preparation accounts for about 60-80% of a data scientist’s role.

This motivated me to take the path of becoming an efficient data scientist. I started searching for hacks to quickly build my code, tips to speed up the data preparation/cleaning process, and tricks to become far more efficient at data science tasks.

In this article, I will walk you through some of these data manipulation and data wrangling hacks, tips, and tricks that have served me well. I hope these help you in your journey and role as well!

I have also converted my learning into a free course that you can check out:

Data Science Hacks, Tips, and Tricks!

Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository – Data Science hacks, tips and tricks on GitHub.

We are posting these hacks daily on social media platforms such as LinkedIn, Twitter, and Facebook. Make sure to follow #avhackoftheday to get your daily dose of freshly brewed data science hacks, tips, and tricks!

We’ll cover these data manipulation and data wrangling hacks, tips and tricks:

Data Science Hack #1 – Select Data Type using Pandas
Data Science Hack #2 – Extract E-mail from text
Data Science Hack #3 – Remove Emojis from Text
Data Science Hack #4 – Image Augmentation
Data Science Hack #5 – Resizing Images
Data Science Hack #6 – Apply Pandas Operations in Parallel
Data Science Hack #7 – Pandas Melt
Data Science Hack #8 – Divide equal proportion of classes (Classification)
Data Science Hack #9 – Reading Data from multiple files
Data Science Hack #10 – Splitting Column using str.split()

Data Science Hack #1 – Select Data Type using Pandas

At the start of my data science journey, I used to write an ‘if’ condition to separate out continuous and categorical variables for data analysis. This was a taxing task as it consumed a lot of unnecessary time and energy. Then I came across this simple Pandas hack which made my life so much simpler!

Code for selecting data type in 1 line of code
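A minimal sketch of this hack, assuming a small made-up dataframe (the column names and values below are hypothetical), could use the Pandas select_dtypes method to separate numeric and non-numeric columns in one line each:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "age": [25, 32, 47],
    "salary": [50000.0, 64000.5, 81000.0],
    "city": ["Pune", "Delhi", "Mumbai"],
})

# Continuous (numeric) variables in one line
numeric_df = df.select_dtypes(include="number")

# Everything that is not numeric (here: the categorical/text column)
categorical_df = df.select_dtypes(exclude="number")

print(numeric_df.columns.tolist())
print(categorical_df.columns.tolist())
```

Using exclude="number" for the categorical side avoids depending on how a particular Pandas version labels string columns.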

Data Science Hack #2 – Extract E-mail from text

One of the most important parts of digital marketing is getting the e-mail IDs of your customers. Is there any way that I can extract these IDs? Of course there is – RegEx to the rescue!

This hack provides a regular expression you may use to extract e-mail IDs from text!

Code for extracting E-mails from text
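One possible version of this hack is sketched below. The pattern is a commonly used approximation rather than a full e-mail validator, and the sample text and addresses are made up for illustration:

```python
import re

# Hypothetical sample text containing two e-mail IDs
text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# Approximate pattern: local part, '@', domain, and a top-level domain
# of at least two letters. It does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

emails = EMAIL_PATTERN.findall(text)
print(emails)
```

For production use you would likely want stricter validation, but a pattern like this is usually enough for pulling candidate addresses out of free text.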

Data Science Hack #3 – Remove Emojis from Text

Preprocessing is one of the key steps for improving the performance of any machine learning model. One of the main reasons for text preprocessing is to remove unwanted characters from text like punctuation, emojis, links and so on which are not required for our problem statement.

This hack will help you get rid of these unnecessary emojis!

Code for Removing emojis from text
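A rough sketch of the idea, assuming that the emojis of interest fall in a handful of common Unicode blocks (the list of ranges below is illustrative, not exhaustive), might look like this:

```python
import re

# Character class built from a few common emoji blocks.
# This is an assumption for illustration; other emoji ranges exist.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002700-\U000027BF"  # dingbats
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "]+"
)


def remove_emojis(text):
    """Return text with characters from the ranges above stripped out."""
    return EMOJI_PATTERN.sub("", text)


# Sample string with a grinning face, a rocket, and a thumbs-up
sample = "Great job \U0001F600\U0001F680 keep going \U0001F44D"
cleaned = remove_emojis(sample)
print(cleaned)
```

Removing the emojis can leave extra spaces behind, so you may want to normalize whitespace afterwards, for example with " ".join(cleaned.split()).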

Data Science Hack #4 – Image Augmentation

Deep Learning models usually require a lot of data for training. But acquiring massive amounts of data comes with its own challenges. Instead of spending days manually collecting data, you can make use of Image Augmentation techniques.

Image augmentation is the process of generating new images from the existing training images, so we don’t have to collect them manually.

Code for image augmentation
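As a minimal sketch, the example below uses only NumPy to produce a few augmented variants (flips, a 90-degree rotation, and added noise) of a randomly generated stand-in image. Dedicated libraries such as scikit-image offer richer transforms; the NumPy-only approach here is a simplification chosen for illustration, not necessarily the method used in the original hack:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for a training image: 4 rows x 6 columns, RGB
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# Mirror the image left-to-right and top-to-bottom
flipped_lr = np.fliplr(image)
flipped_ud = np.flipud(image)

# Rotate by 90 degrees counterclockwise (height and width swap)
rotated_90 = np.rot90(image)

# Add Gaussian noise, then clip back to the valid 0-255 pixel range
noise = rng.normal(loc=0.0, scale=10.0, size=image.shape)
noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)

augmented = [flipped_lr, flipped_ud, rotated_90, noisy]
print([a.shape for a in augmented])
```

Each variant can be added to the training set alongside the original image, which multiplies the amount of data without any new collection.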

Data Science Hack #5 – Resizing Images

While building an image classification model using deep learning, it is required that all the images should be of the same size. However, as the data comes from different sources, images may have different shapes.

So, to convert them to the same shape, we can use the resize function from the OpenCV library. This hack will help you convert the images of any shape to a specified shape:

Code for Image Resizing

Data Science Hack #6 – Apply Pandas Operations in Parallel

Pandas runs most operations on a single CPU core, which can be slow on a large dataset. Pandarallel is a simple and efficient tool to parallelize Pandas operations across all your available CPUs! This trick is certainly going to save loads of your precious time.

Code for applying Pandas operations in parallel
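The sketch below shows one way this might look with pandarallel’s parallel_apply. The function and data are hypothetical, and the fallback to a plain apply when pandarallel is not installed is an assumption added only so the snippet runs either way:

```python
import pandas as pd

try:
    from pandarallel import pandarallel

    # Start the worker pool; nb_workers=2 is an arbitrary choice here
    pandarallel.initialize(nb_workers=2, progress_bar=False, verbose=0)
    HAS_PANDARALLEL = True
except ImportError:
    # pandarallel is a third-party package and may not be installed
    HAS_PANDARALLEL = False


def slow_square(value):
    """Hypothetical stand-in for an expensive per-row computation."""
    return value ** 2


df = pd.DataFrame({"value": range(1000)})

if HAS_PANDARALLEL:
    # Same call shape as apply, but the work is split across processes
    df["squared"] = df["value"].parallel_apply(slow_square)
else:
    df["squared"] = df["value"].apply(slow_square)

print(df.tail(3))
```

For a function this cheap, the overhead of starting worker processes can outweigh the gain; the benefit tends to show up with heavier per-row work on larger dataframes.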

Data Science Hack #7 – Pandas Melt

Pandas’ melt function helps you bring your dataframe into a tidy form by unpivoting it from wide to long format. In pd.melt(), one or more columns are used as identifiers (id_vars), and the remaining columns are unpivoted into variable/value rows. You can “unmelt” the data using the pivot() function:

Code for Pandas Melt()
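Here is a small sketch of melting and then “unmelting” a dataframe. The student scores are made-up data, and the column names passed to var_name and value_name are arbitrary choices:

```python
import pandas as pd

# Hypothetical wide-format data: one column per subject
wide = pd.DataFrame({
    "student": ["Asha", "Ben"],
    "math": [90, 75],
    "science": [85, 80],
})

# Wide -> long: 'student' is the identifier; the subject columns
# become rows in the 'subject'/'score' pair
long = pd.melt(wide, id_vars="student", var_name="subject", value_name="score")
print(long)

# Long -> wide again with pivot()
restored = long.pivot(index="student", columns="subject", values="score").reset_index()
print(restored)
```

The long format has one row per (student, subject) pair, which is the shape many plotting and grouping operations expect.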

Data Science Hack #8 – Divide equal proportion of classes (Classification)

A very common mistake beginners make in classification problems is not preserving the class proportions across the train and test sets, which often leads to spurious results. Sklearn provides an easy way to do this using the “stratify” parameter of the train_test_split function.

In this example, we pass stratify = y, and you can observe the difference of proportion in both cases – with stratify and without stratify.

Code for dividing equal proportion of classes
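A minimal sketch of the comparison, assuming a synthetic imbalanced dataset with 80 samples of class 0 and 20 of class 1 (made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80% class 0, 20% class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# With stratify=y the 80/20 class ratio is kept in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Without stratify the ratio in each split depends on the random draw
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print("stratified test share of class 1:", y_test.mean())
print("plain test share of class 1:", y_test_plain.mean())
```

With stratification, the test set of 25 samples contains exactly 5 samples of class 1, matching the 20% share in the full dataset.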

Data Science Hack #9 – Reading Data from multiple files

You will often need to read multiple data files. For example, a retailer may maintain sales data in files split by year. In this case, you can use glob, a module that finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, and then read each matching file. Let’s see it in this example:

Code for reading multiple data files
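The sketch below first writes three small yearly CSV files to a temporary directory so that it is self-contained, then reads them back with glob. The file names, columns, and values are hypothetical:

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create hypothetical yearly sales files: sales_2018.csv, ...
    for year, amount in [(2018, 100), (2019, 150), (2020, 175)]:
        path = os.path.join(tmp_dir, f"sales_{year}.csv")
        pd.DataFrame({"year": [year], "sales": [amount]}).to_csv(path, index=False)

    # Find every file matching the shell-style pattern
    paths = sorted(glob.glob(os.path.join(tmp_dir, "sales_*.csv")))

    # Read each file and stack them into a single dataframe
    combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

print(combined)
```

Sorting the paths keeps the row order predictable, since glob does not guarantee any particular ordering.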

Data Science Hack #10 – Splitting Column using str.split()

pandas.Series.str gives you vectorized string functions on a Pandas dataframe column, and str.split() splits each string in the column around a separator. Let’s say you want to split the names in a dataframe column into first name and last name. pandas.Series.str along with split() can be used to perform this task.

Code for splitting column
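A brief sketch of this, using made-up names and assuming each name consists of a first and last name separated by a single space:

```python
import pandas as pd

# Hypothetical data: full names in a single column
df = pd.DataFrame({"name": ["Elise Mertens", "Aiden Buckley", "Sandy Tanner"]})

# Split on the first space only (n=1); expand=True returns the pieces
# as separate columns instead of a list per row
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)

print(df)
```

Limiting the split with n=1 keeps multi-word last names together in the second column.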

End Notes

In this article, we covered 10 data manipulation and data wrangling hacks, tips and tricks across various tools and techniques to help you become a better and more efficient data scientist. I hope these hacks will help you with day-to-day niche tasks and save you a lot of time.

Let me know your Data Science hacks, tips and tricks in the comments section below!


