Data Cleansing: How To Clean Data With Python!

Siddharth Last Updated : 11 Jun, 2021

7 min read

This article was published as a part of the Data Science Blogathon

Introduction

Data Cleansing is the process of analyzing data for finding incorrect, corrupt, and missing values and abluting it to make it suitable for input to data analytics and various machine learning algorithms.

It is the premier and fundamental step performed before any analysis could be done on data. There are no set rules to be followed for data cleansing. It totally depends upon the quality of the dataset and the level of accuracy to be achieved.

Reasons for data corruption:

Data is collected from various structured and unstructured sources and then combined, leading to duplicated and mislabeled values.
Different data dictionary definitions for data stored at various locations.
Manual entry error/Typos.
Incorrect capitalization.
Mislabelled categories/classes.

Data Quality

Data Quality is of utmost importance for the analysis. There are several quality criteria that need to be checked upon:

Data Quality Attributes

Completeness: It is defined as the percentage of entries that are filled in the dataset. The percentage of missing values in the dataset is a good indicator of the quality of the dataset.
Accuracy: It is defined as the extent to which the entries in the dataset are close to their actual values.
Uniformity: It is defined as the extent to which data is specified using the same unit of measure.
Consistency: It is defined as the extent to which the data is consistent within the same dataset and across multiple datasets.
Validity: It is defined as the extent to which data conforms to the constraints applied by the business rules. There are various constraints:

Data Profiling Report

Data Profiling is the process of exploring our data and finding insights from it. Pandas profiling report is the quickest way to extract complete information about the dataset. The first step for data cleansing is to perform exploratory data analysis.

How to use pandas profiling:

Step 1: The first step is to install the pandas profiling package using the pip command:

pip install pandas-profiling

Step 2: Load the dataset using pandas:

import pandas as pd df = pd.read_csv(r"C:UsersDellDesktopDatasethousing.csv")

Step 3: Read the first five rows:

df.head()

Step 4: Generate the profiling report using the following commands:

from pandas_profiling import ProfileReport

prof = ProfileReport(df)prof.to_file(output_file='output.html')

Profiling Report:

The profiling report consists of five parts: overview, variables, interactions, correlation, and missing values.

1. Overview gives the general statistics about the number of variables, number of observations, missing values, duplicates, and number of categorical and numeric variables.

2. Variable information tells detailed information about the distinct values, missing values, mean, median, etc. Here statistics about a categorical variable and a numerical variable is shown:

3. Correlation is defined as the degree to which two variables are related to each other. The profiling report describes the correlation of different variables with each other in form of a heatmap.

4.Interactions: This part of the report shows the interactions of the variables with each other. You can select any variable on the respective axes.

5. Missing values: It depicts the number of missing values in each column.

Data Cleansing Techniques

Now we have a piece of detailed knowledge about the missing data, incorrect values, and mislabeled categories of the dataset. We will now see some of the techniques used for cleaning data. It totally depends upon the quality of the dataset, results to be obtained on how you deal with your data. Some of the techniques are as follows:

Handling missing values:

Handling missing values is the most important step of data cleansing. The first question you should ask yourself is that why is the data missing? Is it missing just because it was not recorded by the data entry operator or is it intentionally left empty? You can also go through the documentation to find the reason for the same.

There are different ways to handle these missing values:

1. Drop missing values: The easiest way to handle them is to simply drop all the rows that contain missing values. If you don’t want to figure out why the values are missing and just have a small percentage of missing values you can just drop them using the following command:

df.dropna()

It is not advisable although because every data is important and holds great significance to the overall results. Usually, the percentage of missing entries in a particular column is high. So dropping it is not a good option.

2. Imputation: Imputation is the process of replacing the null/missing values with some value. For numeric columns, one option is to replace each missing entry in the column with the mean value or median value. Another option could be generating random numbers between a range of values suitable for the column. The range could be between the mean and standard deviation of the column. You can simply import an imputer from the scikit-learn package and perform imputation as follows:

from sklearn.impute import SimpleImputer

#Imputation

my_imputer = SimpleImputer()

imputed_df = pd.DataFrame(my_imputer.fit_transform(df))

Handling Duplicates:

Duplicate rows occur usually when the data is combined from multiple sources. It gets replicated sometimes. A common problem is when users have the same identity number or the form has been submitted twice.

The solution to these duplicate tuples is to simply remove them. You can use the unique() function to find out the unique values present in the column and then decide which values need to be scraped.

Encoding:

Character encoding is defined as the set of rules defined for the one-to-one mapping from raw binary byte strings to human-readable text strings. There are several encoding available – ASCII, utf-8, US-ASCII, utf-16, utf-32, etc.

You might observe that some of the text character fields have irregular and unrecognizable patterns. This is because utf-8 is the default python encoding. All code is in utf-8. Therefore when the data is clubbed from multiple structured and unstructured sources and saved at a commonplace, irregular pattern in the text are observed.

The solution to the above problem is to first find out the character encoding of the file with the help of chardet module in python as follows:

import chardet with open("C:/Users/Desktop/Dataset/housing.csv",'rb') as rawdata:       result = chardet.detect(rawdata.read(10000))  # check what the character encoding might be  print(result)

After finding the type of encoding, if it is different from utf-8, save the file using “utf-8” encoding using the following command.

 df.to_csv("C:/Users/Desktop/Dataset/housing.csv")

Scaling and Normalization

Scaling refers to transforming the range of data and shifting it to some other value range. This is beneficial when we want to compare different attributes on the same footing. One useful example could be currency conversion.

For example, we will create random 100 points from exponential distribution and then plot them. Finally, we will convert them to a scaled version using the python mlxtend package.

# for min_max scaling from mlxtend.preprocessing import minmax_scaling # plotting packages import seaborn as sns import matplotlib.pyplot as plt

Now scaling the values:

random_data = np.random.exponential(size=100) # mix-max scale the data between 0 and 1 scaled_version = minmax_scaling(random_data, columns=[0])

Finally, plotting the two versions.

Normalization refers to changing the distribution of the data so that it can represent a bell curve where the values of the attribute are equally distributed across the mean. The value of mean and median is the same. This type of distribution is also termed Gaussian distribution. It is necessary for those machine learning algorithms which assume the data is normally distributed.

Now, we will normalize data using boxcox function:

from scipy import stats normalized_data = stats.boxcox(random_data) # plot both together to comparefig,  ax=plt.subplots(1,2)sns.distplot(random_data, ax=ax[0],color='pink') ax[0].set_title("Random Data") sns.distplot(normalized_data[0], ax=ax[1],color='purple') ax[1].set_title("Normalized data")

Handling Dates

The date field is an important attribute that needs to be handled during the cleansing of data. There are multiple different formats in which data can be entered into the dataset. Therefore, standardizing the date column is a critical task. Some people may have treated the date as a string column, some as a DateTime column. When the dataset gets combined from different sources then this might create a problem for analysis.

The solution is to first find the type of date column using the following command.

df['Date'].dtype

If the type of the column is other than DateTime, convert it to DateTime using the following command:

import datetime  df['Date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%y")

Handling inconsistent data entry issues

There are a large number of inconsistent entries that cannot be found manually or through direct computations. For example, if the same entry is written in upper case or lower case or a mixture of upper case and lower case. Then such an entry should be standardized throughout the column.

One solution is to convert all the entries of a column to lowercase and trim the extra space from each entry. This can later be reverted after the analysis is complete.

# convert to lower case df['ReginonName'] = df['ReginonName'].str.lower() # remove trailing white spaces  df['ReginonName'] = df['ReginonName'].str.strip()

Another solution is to use fuzzy matching to find which strings in the column are closest to each other and then replacing all those entries with a particular threshold with the main entry.

Firstly we will find out the unique region names:

region = df['Regionname'].unique()

Then we calculate the scores using fuzzy matching:

import fuzzywuzzy fromfuzzywuzzy import process regions=fuzzywuzzy.process.extract("WesternVictoria",region,limit=10,scorer=fuzzywuzy.fuzz.token_sort_ratio)

Validating the process.

Once you have finished the data cleansing process, it is important to verify and validate that the changes you have made have not hampered the constraints imposed on the dataset.

And finally, … it doesn’t go without saying,

Thank you for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Siddharth

Beginner Data Cleaning Data Exploration Data Visualization Machine Learning

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Data Cleansing: How To Clean Data With Python!

Introduction

Reasons for data corruption:

Data Quality

Data Quality Attributes

Data Profiling Report

How to use pandas profiling:

Profiling Report:

Data Cleansing Techniques

Handling missing values:

Handling Duplicates:

Encoding:

Scaling and Normalization

Handling Dates

Handling inconsistent data entry issues

Validating the process.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I