- Excel is a brilliant tool to perform data cleaning and data preprocessing in any analytics project
- Here, we showcase 5 useful Excel tricks to clean your data and become an efficient and better analyst
Any valuable gem – gold, silver, diamond – was once of no value, with dirt lurking on it. It is only after proper cleaning and nurturing that its true potential is realized. The same analogy applies to the data we collect for our analytics and data science projects.
Expecting to work with a perfectly clean dataset in the industry is as rare as experiencing Halley’s comet in our lifetime. It’s that rare! This is where the power of Microsoft Excel comes to the fore.
A lot of you must be wondering – why Excel? Why rely on this when you have Python or R for data cleaning? Let me share my own experience with you to answer this.
I started my journey in the analytics field by learning advanced languages including Python. While Python is very useful for working with complex datasets, I quickly realized that solving day-to-day problems didn’t require writing long scripts. A lot of these tasks can be performed in Excel with the click of a button. So why not rely on the Swiss Army Knife nature of Excel?
Here, I’ll be showcasing 5 supremely useful Excel tricks you can use to clean your data. I had a blast exploring these and I hope you find them useful for your own analysis!
This is the fourth article in my Excel for Analysts series. I highly recommend going through the previous articles to become a more efficient analyst:
- 5 Excel Tricks You’ll Love Working with as an Analyst
- 5 Handy Excel Tricks for Conditional Formatting Every Analyst Should Know
- 3 Classic Excel Tricks to Become an Efficient Analyst
I encourage you to check out the below resources if you’re a beginner in Excel and Business Analytics:
Table of Contents
We’ll be covering 5 data cleaning techniques in Excel in this article:
- Excel Trick #1 – Remove Duplicate Values from your Dataset
- Excel Trick #2 – Dealing with Empty Rows
- Excel Trick #3 – Applying the Clean Function
- Excel Trick #4 – Applying the Trim Function
- Excel Trick #5 – Lowercase all the Words
Excel Trick #1 – Remove Duplicate Values from your Dataset
One of the most common discrepancies in a dataset is the presence of duplicate values. These may be due to an error in data collection or simply because a transaction was repeated.
Let’s take an example to understand this. Let’s say you order a pizza from Domino’s but due to some technical issue, it gets cancelled. You order the same pizza again. Chances are that there’s a duplicate transaction entered in the system.
Dealing with duplicate values in Excel is one of the sweetest tasks, very simple and elegant. So let’s check it out.
I’m really interested in social media analytics so I have taken up a small subset of social media posts that were posted on Facebook by Analytics Vidhya. We’ll clean this data for further analysis.
Step 1: Check if you have duplicate rows in the table
We will first check whether we have duplicate values in our table or not.
Select the column, then go to Home -> Conditional Formatting -> Highlight Cells Rules -> Duplicate Values:
Here you can select any desired formatting and then press OK:
We notice that there are two duplicate values here which means that these two posts were re-posted on the platform. There is a high chance that these may have been posted again after a period of time due to their high popularity or due to an ongoing campaign.
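For readers who would like to cross-check this step outside Excel, here is a minimal Python sketch of the same idea. It is not part of the Excel workflow. The post titles are made up for illustration, and exact string matching is an assumption of this sketch.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    seen = set()
    duplicates = []
    for value in values:
        if counts[value] > 1 and value not in seen:
            seen.add(value)
            duplicates.append(value)
    return duplicates


# Hypothetical post titles, invented for this example
posts = [
    "Intro to Excel",
    "Pandas tips",
    "Intro to Excel",
    "SQL basics",
    "Pandas tips",
]

print(find_duplicates(posts))  # ['Intro to Excel', 'Pandas tips']
```

Like the conditional formatting rule, this only flags the repeated values. Nothing is removed yet.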
Step 2: Remove the duplicates
Once you have detected these duplicate values, it is now time to remove them. This, again, is very straightforward in Excel.
Go to Data in the Ribbon -> Select Remove Duplicates:
Excel asks which column (or columns) it should check for duplicates. Select the column and press OK. Excel keeps the first occurrence of each value and deletes the rows that repeat it:
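The removal step can be sketched in Python as well. This is a rough equivalent under stated assumptions: rows are plain tuples, the column to check is given by a hypothetical `key_index` parameter, and matching is on the exact string. The sample rows are invented.

```python
def remove_duplicates(rows, key_index):
    """Keep only the first row for each distinct value in column `key_index`."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical (post title, likes) rows, invented for this example
rows = [
    ("Intro to Excel", 120),
    ("Pandas tips", 85),
    ("Intro to Excel", 97),
]

print(remove_duplicates(rows, key_index=0))
# [('Intro to Excel', 120), ('Pandas tips', 85)]
```

The first occurrence wins here, so the order of the rows decides which copy survives.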
That is how simple Excel makes our tasks! Let’s move forward to another important data cleaning technique.
Excel Trick #2 – Dealing with Empty Rows
Empty cells are common in datasets, but they can create havoc during statistical modeling. So it is always better to deal with them at the beginning, during the data cleaning stage of your analytics project.
Step 1: Find the empty rows
We will begin by finding the empty rows (if there are any).
Go to Home -> Find & Select in the Ribbon -> Select Go To Special:
In this case, we will select the option – Blanks:
Step 2: Fill the empty cells
For our use case, we will simply input the value – “N/A”.
With the blank cells still selected, just type the text and press CTRL + ENTER to fill all of them in one go. Otherwise, you may enter every value individually by simply pressing ENTER (not ideal, of course):
Great! So far, we have removed duplicate values and treated the empty cells. Now, let’s work through some text cleaning functions in Excel.
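As a rough illustration of the same fill step in code, the Python sketch below replaces empty cells with a placeholder. It assumes a blank cell is represented as `None` or an empty string. The function name and the sample table are hypothetical.

```python
def fill_blanks(rows, placeholder="N/A"):
    """Return a copy of `rows` with None or empty-string cells replaced."""
    filled = []
    for row in rows:
        filled.append([
            placeholder if cell is None or cell == "" else cell
            for cell in row
        ])
    return filled


# Hypothetical table with two blank cells, invented for this example
table = [
    ["Intro to Excel", 120],
    ["", 85],
    ["SQL basics", None],
]

print(fill_blanks(table))
# [['Intro to Excel', 120], ['N/A', 85], ['SQL basics', 'N/A']]
```

Cells that contain only spaces are left alone in this sketch, because they are not truly empty.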
Excel Trick #3 – Applying the Clean Function
When we import data from external sources, it is certainly not in its most beautiful form because it may contain some unwanted characters. It is usually because the text is imported from an application having characters that are non-printable in our operating system:
We use the CLEAN() function to remove line breaks and other non-printable characters (the characters with ASCII codes 0 through 31). We pass the text that we want to clean to this function. To understand this function better, refer to the video or follow along with the steps:
Let’s check this out in our example where we want to remove line breaks from the social media dataset. We simply pass the cell that holds the post to the function, for example =CLEAN(A2) if the text sits in cell A2 (the cell reference here is only illustrative):
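To make it concrete which characters get stripped, here is a small Python approximation of CLEAN(). It is a sketch rather than Excel's implementation: it simply drops every character whose code is below 32, which is the range CLEAN() is documented to cover. The sample string is made up.

```python
def clean(text):
    """Drop control characters with codes 0-31 (line breaks, tabs, and so on)."""
    return "".join(ch for ch in text if ord(ch) >= 32)


# Hypothetical post text containing a line break and a tab
raw = "Learn Excel\nin 5 steps\t!"

print(clean(raw))  # Learn Excelin 5 steps!
```

Note that the line break is deleted rather than replaced by a space, so words on either side of it run together. The same happens with CLEAN() in Excel.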
Awesome! We are one step closer to having a clean dataset. But wait, notice that we have a lot of unwanted trailing and leading spaces. Don’t worry – Excel has a very simple solution for this as well.
Excel Trick #4 – Applying the Trim Function
As we saw in the above Excel trick, imported data may contain some undesirable characters, but it can also have multiple spaces (as you can see in the below image). These extra spaces may not be visible to the naked eye, but they can create a ruckus during analysis, so it is time we removed them:
Excel’s TRIM() function is used to remove all the extra spaces in the text except for the single spaces between words. To understand better, you can follow the video or the below steps:
We will input the text from which we want to remove the extra spaces. Let us check this example where we remove the undesirable spaces:
Pro Tip: Instead of applying the CLEAN and TRIM functions separately, you can nest them in a single formula, =TRIM(CLEAN(text)), where text is the cell you want to clean:
Great work! We are now at the last trick, another powerful yet simple data cleaning function in Excel.
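The order of the nesting matters: cleaning first and trimming second lets the trim step tidy up any spaces left around removed characters. The Python sketch below approximates both functions to show this. It assumes that only the ordinary space character is trimmed and that control codes 0 through 31 are removed. The function names are hypothetical.

```python
import re


def clean(text):
    """Drop control characters with codes 0-31."""
    return "".join(ch for ch in text if ord(ch) >= 32)


def trim(text):
    """Collapse runs of spaces to one space and strip spaces at both ends."""
    return re.sub(r" {2,}", " ", text).strip(" ")


def trim_clean(text):
    """Rough analogue of nesting CLEAN inside TRIM."""
    return trim(clean(text))


print(trim("  data   science  "))          # data science
print(trim_clean("  data\n   science  "))  # data science
```

Tabs and non-breaking spaces are deliberately left untouched by `trim` in this sketch, since it targets the plain space character only.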
Excel Trick #5 – Lowercase all the Words
If you look closely at the social media posts in the dataset, you will notice that their casing is inconsistent. Some of them are all lowercase and some are uppercase. To maintain uniformity and consistency, let us convert all of them to lowercase.
To do that, simply use the function LOWER() and input the string which you want to lowercase:
You may also use functions like UPPER() and PROPER(), whichever suits your use case. Do try these out for yourself.
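For comparison, the three case conversions have close counterparts in Python's built-in string methods. The sketch below is only an approximation: `str.title()` behaves much like PROPER() for simple text, but treating them as equivalent is an assumption. The `normalize_case` helper and the sample posts are hypothetical.

```python
def normalize_case(values, mode="lower"):
    """Apply one case conversion to every string in `values`."""
    converters = {
        "lower": str.lower,   # similar to LOWER()
        "upper": str.upper,   # similar to UPPER()
        "proper": str.title,  # roughly similar to PROPER()
    }
    convert = converters[mode]
    return [convert(value) for value in values]


# Hypothetical posts with mixed casing, invented for this example
posts = ["DATA Science", "machine LEARNING"]

print(normalize_case(posts))            # ['data science', 'machine learning']
print(normalize_case(posts, "upper"))   # ['DATA SCIENCE', 'MACHINE LEARNING']
print(normalize_case(posts, "proper"))  # ['Data Science', 'Machine Learning']
```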
In this article, we covered five industry-relevant techniques to clean data in Excel. I hope these techniques will help you with day-to-day niche tasks and save you a lot of time as a business analyst or a data science professional.
Do you know or use any other data cleaning functions in Excel? Let me know in the comments!