Effective Strategies for Handling Missing Values in Data Analysis

Nasima Tamboli 20 Jun, 2024
14 min read

Introduction

In data science and machine learning, dealing with missing values is a critical step to ensure accurate and reliable model predictions. This tutorial will guide you through the process of handling missing data, highlighting various imputation techniques to maintain data integrity. Additionally, we will explore how to do sentiment analysis in Python, showcasing practical examples and best practices. By the end of this tutorial, you will have a solid understanding of how to handle missing values and perform sentiment analysis using Python, ensuring robust data analysis and model performance.

handling missing values

Learning Objectives

  • Understand the critical importance of handling missing values in data science for ensuring robust and accurate model predictions.
  • Gain proficiency in various imputation techniques to effectively replace and manage missing data within datasets.Explore diverse strategies for handling missing values using advanced Python libraries and tools.
  • Enhance your knowledge and skills in addressing incomplete data within machine learning datasets to improve data quality.
  • Master practical methods and best practices for dealing with missing values in real-world datasets to ensure reliable analysis.
  • Learn how to do sentiment analysis in Python by leveraging advanced techniques for missing value imputation.
  • Implement sentiment analysis Python examples to demonstrate the impact of properly handling missing data on sentiment model accuracy.

This article was published as a part of the Data Science Blogathon.

What Is a Missing Value?

Missing data is defined as the values or data that is not stored (or not present) for some variable/s in the given dataset. Below is a sample of the missing data from the Titanic dataset. You can see the columns ‘Age’ and ‘Cabin’ have some missing values.

missing values

How is a Missing Value Represented in a Dataset?

Missing values in a dataset can be represented in various ways, depending on the source of the data and the conventions used. Here are some common representations:

  • NaN (Not a Number): In many programming languages and data analysis tools, missing values are represented as NaN. This is the default for libraries like Pandas in Python.
  • NULL or None: In databases and some programming languages, missing values are often represented as NULL or None. For instance, in SQL databases, a missing value is typically recorded as NULL.
  • Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is common in text-based data or CSV files where a field might be left blank.
  • Special Indicators: Datasets might use specific indicators like -999, 9999, or other unlikely values to signify missing data. This is often seen in older datasets or specific industries where such conventions were established.
  • Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values might be represented by spaces or blank fields.

Understanding the representation of missing values in your dataset is crucial for proper data cleaning and preprocessing. Identifying and handling these missing values accurately ensures that your data analysis and machine learning models perform optimally.

missing value in dataset

Why is Data Missing From the Dataset?

There can be multiple reasons why certain values are missing from the data. Reasons for the missing of data from the dataset affect the approach of handling missing data. So it’s necessary to understand why the data could be missing.

Some of the reasons are listed below:

  • Past data might get corrupted due to improper maintenance.
  • Observations are not recorded for certain fields due to some reasons. There might be a failure in recording the values due to human error.
  • The user has not provided the values intentionally
  • Item nonresponse: This means the participant refused to respond.

Types of Missing Values

Understanding the types of missing values in datasets is crucial for effectively handling missing data and ensuring accurate analyses:

Types of Missing Values

Missing Completely At Random (MCAR)

MCAR occurs when the probability of data being missing is uniform across all observations. There is no relationship between the missingness of data and any other observed or unobserved data within the dataset. This type of missing data is purely random and lacks any discernible pattern.

Example: In a survey about library books, some overdue book values in the dataset are missing due to human error in recording.

Missing At Random (MAR)

MAR data occurs when the probability of data being missing depends only on the observed data and not on the missing data itself. In other words, the missingness can be explained by variables for which you have complete information. There is a pattern in the missing values, but this pattern can be explained by other observed variables.

Example: In a survey, ‘Age’ values might be missing for those who did not disclose their ‘Gender’. Here, the missingness of ‘Age’ depends on ‘Gender’, but the missing ‘Age’ values are random among those who did not disclose their ‘Gender’.

Missing Not At Random (MNAR)

MNAR occurs when the missingness of data is related to the unobserved data itself, which is not included in the dataset. This type of missing data has a specific pattern that cannot be explained by observed variables.

Example: In a survey about library books, people with more overdue books might be less likely to respond to the survey. Thus, the number of overdue books is missing and depends on the number of books overdue.

Understanding the type of missing data is crucial because it determines the appropriate strategy for handling missing values and ensuring the integrity of statistical analyses. Techniques such as handling missing values, how to handle missing values, how to fill missing values in dataset, and missing value imputation are essential for mitigating biases and ensuring robust results in scenarios such as sentiment analysis python, python sentiment analysis, and how to do sentiment analysis in python.

How to Handle Missing Data?

Missing data is a common headache in any field that deals with datasets. It can arise for various reasons, from human error during data collection to limitations of data gathering methods. Luckily, there are strategies to address missing data and minimize its impact on your analysis. Here are two main approaches:

  • Deletion: This involves removing rows or columns with missing values. This is a straightforward method, but it can be problematic if a significant portion of your data is missing. Discarding too much data can affect the reliability of your conclusions.
  • Imputation: This replaces missing values with estimates. There are various imputation techniques, each with its strengths and weaknesses. Here are some common ones:
    • Mean/Median/Mode Imputation: Replace missing entries with the average (mean), middle value (median), or most frequent value (mode) of the corresponding column. This is a quick and easy approach, but it can introduce bias if the missing data is not randomly distributed.
    • K-Nearest Neighbors (KNN Imputation): This method finds the closest data points (neighbors) based on available features and uses their values to estimate the missing value. KNN is useful when you have a lot of data and the missing values are scattered.
    • Model-based Imputation: This involves creating a statistical model to predict the missing values based on other features in the data. This can be a powerful technique, but it requires more expertise and can be computationally expensive.pen_spark.

Why Do We Need to Care About Handling Missing Data?

It is important to handle the missing values appropriately.

  • Many machine learning algorithms fail if the dataset contains missing values. However, algorithms like K-nearest and Naive Bayes support data with missing values.
  • You may end up building a biased machine learning model, leading to incorrect results if the missing values are not handled properly.
  • Missing data can lead to a lack of precision in the statistical analysis.

Let’s take an example of the Loan Prediction Practice Problem from Analytics Vidhya. You can download the dataset from the following link.

Checking for Missing Values in Python

The first step in handling missing values is to carefully look at the complete data and find all the missing values. The following code shows the total number of missing values in each column. It also shows the total number of missing values in the entire data set.

import pandas as pd
train_df = pd.read_csv("train.csv")
#Find the missing values from each column
print(train_df.isnull().sum())

From the above output, we can see that there are 6 columns – Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History having missing values.

#Find the total number of missing values from the entire dataset
train_df.isnull().sum().sum() 
#Output
149

There are 149 missing values in total.

List of Methods to Handle Missing Values in a Dataset

Here is a list of popular strategies to handle missing values in a dataset

  • Deleting the Missing Values
  • Imputing the Missing Values
  • Imputing the Missing Values for Categorical Features
  • Imputing the Missing Values using Sci-kit Learn Library
  • Using “Missingness” as a Feature

Handling Missing Values

Analyze each column with missing values carefully to understand the reasons behind the missing of those values, as this information is crucial to choose the strategy for handling the missing values.

There are 2 primary ways of handling missing values:

  • Deleting the Missing values
  • Imputing the Missing Values

Deleting the Missing value

Generally, this approach is not recommended. It is one of the quick and dirty techniques one can use to deal with missing values. If the missing value is of the type Missing Not At Random (MNAR), then it should not be deleted.

If the missing value is of type Missing At Random (MAR) or Missing Completely At Random (MCAR) then it can be deleted (In the analysis, all cases with available data are utilized, while missing observations are assumed to be completely random (MCAR) and addressed through pairwise deletion.)

The disadvantage of this method is one might end up deleting some useful data from the dataset.

There are 2 ways one can delete the missing data values:

Deleting the entire row (listwise deletion)

If a row has many missing values, you can drop the entire row. If every row has some (column) value missing, you might end up deleting the whole data. The code to drop the entire row is as follows:

df = train_df.dropna(axis=0)
df.isnull().sum()

Output:

OUT:
Loan_ID  0
Gender  0
Married  0
Dependents  0
Education  0
Self_Employed 0
ApplicantIncome  0
CoapplicantIncome  0
LoanAmount  0
Loan_Amount_Term  0
Credit_History  0
Property_Area  0
Loan_Status  0
dtype: int64

Deleting the entire column

If a certain column has many missing values, then you can choose to drop the entire column. The code to drop the entire column is as follows:

df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()

Output:

Loan_ID  0
Gender  13
Married  3
Education  0
Self_Employed 32
ApplicantIncome  0
CoapplicantIncome  0
LoanAmount  22
Loan_Amount_Term  14
Credit_History  50
Property_Area  0
Loan_Status  0
dtype: int64

Imputing the Missing Value

There are many imputation methods for replacing the missing values. You can use different python libraries such as Pandas, and Sci-kit Learn to do this. Let’s go through some of the ways of replacing the missing values.

Replacing with an arbitrary value

If you can make an educated guess about the missing value, then you can replace it with some arbitrary value using the following code. E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with ‘0’.

#Replace the missing value with '0' using 'fiilna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df[‘Dependents'].isnull().sum()

Output:

0

Replacing with the mean

This is the most common method of imputing missing values of numeric columns. If there are outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first. You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and ‘Credit_History’ with the mean of the respective column values.


#Replace the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df[‘Credit_History'].fillna(train_df['Credit_History'].mean())

Output:

Loan_ID  
Gender  13
Married  3
Dependents  15
Education  0
Self_Employed 32
ApplicantIncome  0
CoapplicantIncome  0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area  0
Loan_Status  0
dtype: int64

Replacing with the mode

Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’

#Replace the missing values for categorical columns with mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df[‘Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()

Output:

OUT:
Loan_ID 0
Gender  0
Married 0
Dependents  0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount  0
Loan_Amount_Term  0
Credit_History  0
Property_Area 0
Loan_Status 0
dtype: int64

Replacing with the median

The median is the middlemost value. It’s better to use the median value for imputation in the case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’ with the median value.

train_df['Loan_Amount_Term']= train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())

Replacing with the previous value – forward fill

In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate. This is called forward fill. It is mostly used in time series data. You can use the ‘fillna’ function with the parameter ‘method = ffill’

IN:
import pandas as pd
import numpy as np
test = pd.Series(range(6))
test.loc[2:4] = np.nan
test

Output:

OUT:
0 0.0
1 1.0
2 Nan
3 Nan
4 Nan
5 5.0
dtype: float64
IN:
# Forward-Fill
test.fillna(method=‘ffill')

Output:

OUT:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64

Replacing with the next value – backward fill

In backward fill, the missing value is imputed using the next value.

IN:
# Backward-Fill
test.fillna(method=‘bfill')

Output:

OUT:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64

Interpolation

Missing values can also be imputed using interpolation. Pandas’ interpolate method can be used to replace the missing values with different interpolation methods like ‘polynomial,’ ‘linear,’ and ‘quadratic.’ The default method is ‘linear.’

IN:
test.interpolate()

Output:

OUT:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
dtype: float64

How to Impute Missing Values for Categorical Features?

There are two ways to impute missing values for categorical features as follows:

Impute the Most Frequent Value

We will use ‘SimpleImputer’ in this case, and as this is a non-numeric column, we can’t use mean or median, but we can use the most frequent value and constant.

import pandas as pd
import numpy as np
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle', np.nan]})
X
Shape

Output:

0 square
1 square
2 oval
3 circle
4 NaN

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)

Output:


array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['square']], dtype=object)

As you can see, the missing value is imputed with the most frequent value, ’square.’

Impute the Value “Missing”

We can impute the value “missing,” which treats it as a separate category.

imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)

Output:


array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['missing']], dtype=object)

In any of the above approaches, you will still need to OneHotEncode the data (or you can also use another encoder of your choice). After One Hot Encoding, in case 1, instead of the values ‘square,’ ‘oval,’ and’ circle,’ you will get three feature columns. And in case 2, you will get four feature columns (4th one for the ‘missing’ category). So it’s like adding the missing indicator column in the data. There is another way to add a missing indicator column, which we will discuss further.

How to Impute Missing Values Using Sci-kit Learn Library?

We can impute missing values using the sci-kit library by creating a model to predict the observed value of a variable based on another variable which is known as regression imputation.

Univariate Approach

In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.

Let’s see an example:

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

Output:

OUT: SimpleImputer()
IN:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))

Output:

OUT:
[[4.          2.        ]
 [6.          3.666...]
 [7.          6.        ]]

Multivariate Approach

In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values considering the multivariate approach. Using KNNImputer or IterativeImputer classes.

Let’s take an example of a titanic dataset.

Suppose the feature ‘age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also younger and people with higher fares are also older. In that case, it would make sense to impute low age for low fare values and high age for high fare values. So here, we are taking multiple features into account by following a multivariate approach.

import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
X

 
SibSpFareAge
017.250022.0
1171.283338.0
207.925026.0
3153.100035.0
408.050035.0
508.4583NaN
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
impute_it = IterativeImputer()
impute_it.fit_transform(X)

Output:

OUT:
array([[ 1.        ,  7.25      , 22.        ],
       [ 1.        , 71.2833    , 38.        ],
       [ 0.        ,  7.925     , 26.        ],
       [ 1.        , 53.1       , 35.        ],
       [ 0.        ,  8.05      , 35.        ],
       [ 0.        ,  8.4583    , 28.50639495]])

Let’s see how IterativeImputer works. For all rows in which ‘Age’ is not missing, sci-kit learn runs a regression model. It uses ‘Sib sp’ and ‘Fare’ as the features and ‘Age’ as the target. And then, for all rows for which ‘Age’ is missing, it makes predictions for ‘Age’ by passing ‘Sib sp’ and ‘Fare’ to the training model. So it actually builds a regression model with two features and one target and then makes predictions on any places where there are missing values. And those predictions are the imputed values.

Nearest Neighbors Imputations (KNNImputer)

Missing values are imputed using the k-Nearest Neighbors approach, where a Euclidean distance is used to find the nearest neighbors. Let’s take the above example of the titanic dataset to see how it works.

from sklearn.impute import KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)

Output:

OUT:
array([[ 1.    ,  7.25  , 22.    ],
       [ 1.    , 71.2833, 38.    ],
       [ 0.    ,  7.925 , 26.    ],
       [ 1.    , 53.1   , 35.    ],
       [ 0.    ,  8.05  , 35.    ],
       [ 0.    ,  8.4583, 30.5   ]])

In the above example, the n_neighbors=2. So sci-kit learn finds the two most similar rows measured by how close the ‘Sib sp’ and ‘Fare’ values are to the row which has missing values. In this case, the last row has a missing value. And the third row and the fifth row have the closest values for the other two features. So the average of the ‘Age’ feature from these two rows is taken as the imputed value.

How to Use “Missingness” as a Feature?

In some cases, while imputing missing values, you can preserve information about which values were missing and use that as a feature. This is because sometimes, there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict. In such cases, you can add a missing indicator to encode the “missingness” as a feature in the imputed data set.

Where can we use this?

Suppose you are predicting the presence of a disease. Now, imagine a scenario where a missing age is a good predictor of the disease because we don’t have records for people in poverty. The age values are not missing at random. They are missing for people in poverty, and poverty is a good predictor of disease. Thus, missing age or “missingness” is a good predictor of disease.


import pandas as pd
import numpy as np
X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
X
 
Age
020.0
130.0
210.0
3NaN
410.0

from sklearn.impute
import SimpleImputer
# impute the mean
imputer = SimpleImputer()
imputer.fit_transform(X)

Output:

OUT:
array([[20. ],
       [30. ],
       [10. ],
       [17.5],
       [10. ]])


imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)

Output:

OUT:
array([[20. ,  0. ],
       [30. ,  0. ],
       [10. ,  0. ],
       [17.5,  1. ],
       [10. ,  0. ]])

In the above example, the second column indicates whether the corresponding value in the first column was missing or not. ‘1’ indicates that the corresponding value was missing, and ‘0’ indicates that the corresponding value was not missing.

If you don’t want to impute missing values but only want to have the indicator matrix, then you can use the ‘MissingIndicator’ class from scikit learn.

How to Handle Missing Data in Dataset?

Handling missing values in your dataset is crucial for accurate analysis. Here are effective methods to address them:

  • Delete Rows with Missing Values: This approach is straightforward but be cautious as it may lead to loss of valuable data.
  • Impute with Averages or Midpoints: Fill missing values with mean, median, or mode. However, be mindful of potential bias introduced by this method.
  • Use Advanced Techniques like K-Nearest Neighbors (KNN): Estimate missing values by finding similar data points using KNN. This method can preserve data integrity.

The best method depends on your data and analysis. Understanding missing data patterns (MCAR, MAR, MNAR) can help you choose wisely and ensure the reliability of your results.

Conclusion

Missing data is a pervasive challenge in real-life datasets, impacting the quality and accuracy of results in data analysis. Understanding the different types of missing data values — MCAR, MAR, and MNAR. And their potential impact on the analysis is crucial for researchers to select an appropriate method for handling the missing data and fill missing values in dataset. Each method, whether it’s deleting the missing value, imputing the missing value, or using advanced techniques like KNN or regression imputation for missing value imputation, has its advantages and disadvantages and is appropriate for different types of missing data values.

Key Takeaways

  • Understand the three types of missing data — Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).
  • Recognize how missing data can affect the quality and accuracy of your analysis results.
  • Choose appropriate methods such as deletion, imputation (mean, median, regression, or machine learning-based methods), or advanced techniques (e.g., KNN).
  • Each method has its advantages and disadvantages and is suitable for different types of missing data values.
  • Consider the potential biases and implications of the chosen method on your analysis.
  • Implement best practices for handling missing data to minimize the impact on your research outcomes.

Frequently Asked Questions

Q1. What are the types of missing values in data?

A. The three types of missing data are Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

Q2. How do you handle missing values?

A. We can use different methods to handle missing data points, such as dropping missing values, imputing them using machine learning, or treating missing values as a separate category.

Q3. How does pairwise deletion handle missing data?

A. Pairwise deletion handles missing values by using only the observations with complete data in each pairwise correlation or regression analysis. This method assumes that the missing data is MCAR, and it is appropriate when the missing data is not too large.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Nasima Tamboli 20 Jun, 2024

Freelance Software Engineer Data Science Enthusiast and Content Writer Loves Coding E-mail: [email protected] LinkedIn:https://www.linkedin.com/in/nasima-tamboli

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers

Clear

mojde bay
mojde bay 24 May, 2022

Thanks. It helped a lot and I enjoyed reading it.

Anand Shaw
Anand Shaw 21 Feb, 2024

Can you make a video on this topic