If you aspire to work as a data scientist, understanding how to effectively address the issue of missing values is essential. Missing data is a prevalent challenge in numerous real-world datasets and can significantly skew the outcomes of machine learning models or compromise their accuracy. This article delves into the concept of missing data, elucidating how it is typically represented and the various factors contributing to its occurrence. Exploring the diverse categories of missing data, it also provides comprehensive guidance on strategies for handling missing values, supplemented by illustrative examples drawn from datasets.
Learning Objectives
Missing data is defined as the values or data that is not stored (or not present) for some variable/s in the given dataset. Below is a sample of the missing data from the Titanic dataset. You can see the columns ‘Age’ and ‘Cabin’ have some missing values.
Source: analyticsindiamag
In the dataset, the blank shows the missing values.
In Pandas, usually, missing values are represented by NaN. It stands for Not a Number.
Source: medium
The above image shows the first few records of the Titanic dataset extracted and displayed using Pandas.
There can be multiple reasons why certain values are missing from the data. The reason the data is missing affects how the missing values should be handled, so it’s necessary to understand why the data could be missing.
Some of the reasons are listed below:
Formally the missing values are categorized as follows:
Source: theblogmedia
In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no relationship between the missing data and any other values observed or unobserved (the data which is not recorded) within the given dataset. That is, missing values are completely independent of other data. There is no pattern.
In the case of MCAR data, the value could be missing due to human error, some system/equipment failure, loss of sample, or some unsatisfactory technicalities while recording the values. For example, suppose in a library there are some overdue books. Some values of overdue books in the computer system are missing. The reason might be a human error, like the librarian forgetting to type in the values. So, the missing values of overdue books are not related to any other variable/data in the system. MCAR should not be assumed by default, as it is a rare case in practice. The advantage of such data is that the statistical analysis remains unbiased.
MAR data means that the reason for missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations. It is missing only within sub-samples of the data, and there is some pattern in the missing values.
For example, if you check the survey data, you may find that all the people have answered their ‘Gender,’ but ‘Age’ values are mostly missing for people who have answered their ‘Gender’ as ‘female.’ (The reason being most of the females don’t want to reveal their age.)
So, the probability of data being missing depends only on the observed value or data. In this case, the variables ‘Gender’ and ‘Age’ are related. The reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you can not predict the missing value itself.
Suppose a poll is taken for overdue books in a library. Gender and the number of overdue books are asked in the poll. Assume that most of the females answer the poll and men are less likely to answer. So why the data is missing can be explained by another factor, that is gender. In this case, the statistical analysis might result in bias. Getting an unbiased estimate of the parameters can be done only by modeling the missing data.
Missing values depend on the unobserved data. If there is some structure/pattern in missing data and other observed data can not explain it, then it is considered to be Missing Not At Random (MNAR).
If the missing data does not fall under the MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of people to provide the required information. A specific group of respondents may not answer some questions in a survey.
For example, suppose the name and the number of overdue books are asked in the poll for a library. Most of the people having no overdue books are likely to answer the poll, while people having more overdue books are less likely to answer it. So, in this case, whether the number of overdue books is missing depends on that number itself, which is exactly the value that was not recorded.
Another example is that people having less income may refuse to share some information in a survey or questionnaire.
In the case of MNAR as well, the statistical analysis might result in bias.
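The difference between the three mechanisms can be made concrete with a small simulation. The sketch below uses made-up data (the variables `group` and `y`, the missingness probabilities, and the helper `observed_mean` are all assumptions for illustration) and compares the mean of the values that remain observed under each mechanism with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data: `group` is always observed, `y` is the variable that goes missing.
group = rng.integers(0, 2, size=n)            # 0 or 1
y = 2.0 + 3.0 * group + rng.normal(size=n)    # y is higher on average in group 1


def observed_mean(values, p_missing):
    """Mean of the values left after dropping each one with probability p_missing."""
    keep = rng.random(values.shape[0]) >= p_missing
    return values[keep].mean()


true_mean = y.mean()

# MCAR: every value has the same 30% chance of being missing.
mcar_mean = observed_mean(y, np.full(n, 0.3))

# MAR: missingness depends only on the observed variable `group`.
p_mar = np.where(group == 1, 0.6, 0.1)
mar_mean = observed_mean(y, p_mar)

# MNAR: missingness depends on the (unrecorded) value of y itself.
p_mnar = np.where(y > np.median(y), 0.6, 0.1)
mnar_mean = observed_mean(y, p_mnar)

# Under MAR, conditioning on the observed variable removes the bias.
keep_mar = rng.random(n) >= p_mar
mar_group1_mean = y[keep_mar & (group == 1)].mean()
true_group1_mean = y[group == 1].mean()

print(f"true {true_mean:.2f} | MCAR {mcar_mean:.2f} | MAR {mar_mean:.2f} | MNAR {mnar_mean:.2f}")
print(f"group 1: true {true_group1_mean:.2f} | MAR-observed {mar_group1_mean:.2f}")
```

In this toy setup the MCAR mean stays close to the true mean, while the naive MAR and MNAR means are pulled downward. The MAR bias disappears once the comparison is made within a group, which is why modeling the missingness with observed variables helps for MAR but not for MNAR.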
It is important to handle the missing values appropriately.
Let’s take an example of the Loan Prediction Practice Problem from Analytics Vidhya. You can download the dataset from the following link.
(https://courses.analyticsvidhya.com/courses/loan-prediction-practice-problem-using-python)
The first step in handling missing values is to carefully look at the complete data and find all the missing values. The following code shows the total number of missing values in each column. It also shows the total number of missing values in the entire data set.
From the above output, we can see that there are 7 columns – Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History – having missing values.
#Find the total number of missing values from the entire dataset
train_df.isnull().sum().sum()
149
There are 149 missing values in total.
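The per-column and overall counts described above can be tried out on a small stand-in frame. This is a minimal sketch: `toy_df` and its values are invented for illustration and are not the loan dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df; the real loan dataset has more rows and columns.
toy_df = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "LoanAmount": [128.0, 66.0, np.nan, np.nan],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

missing_per_column = toy_df.isnull().sum()      # one count per column
missing_total = int(missing_per_column.sum())   # count across the whole frame
missing_share = toy_df.isnull().mean()          # fraction of rows missing, per column

print(missing_per_column)
print("total:", missing_total)
print(missing_share)
```

Looking at the fraction of missing rows per column, rather than only the raw count, can make it easier to decide later whether a column is worth keeping.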
Here is a list of popular strategies to handle missing values in a dataset
Now that you have found the missing data, how do you handle the missing values?
Analyze each column with missing values carefully to understand why those values are missing, as this information is crucial for choosing a handling strategy.
There are 2 primary ways of handling missing values:
Generally, this approach is not recommended. It is one of the quick and dirty techniques one can use to deal with missing values. If the missing value is of the type Missing Not At Random (MNAR), then it should not be deleted.
If the missing value is of type Missing Completely At Random (MCAR), the affected rows can be deleted without biasing the analysis. If it is Missing At Random (MAR), deletion is sometimes used as well, but it can bias the results unless the analysis accounts for the variables that explain the missingness. (In pairwise deletion, each analysis uses all cases that have data for the variables involved, and the missing observations are assumed to be MCAR.)
The disadvantage of this method is one might end up deleting some useful data from the dataset.
There are 2 ways one can delete the missing data values:
Deleting the entire row (listwise deletion)
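If a row has many missing values, you can drop the entire row. Note that ‘dropna(axis=0)’ with its default settings drops every row that contains at least one missing value, so if every row has some (column) value missing, you might end up deleting the whole dataset. The code to drop the entire row is as follows: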
df = train_df.dropna(axis=0)
df.isnull().sum()
OUT:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Deleting the entire column
If a certain column has many missing values, then you can choose to drop the entire column. The code to drop the entire column is as follows:
df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()
Loan_ID 0
Gender 13
Married 3
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
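Dropping only the rows or columns that have "many" missing values, as described above, needs a threshold. The following is a minimal sketch on an invented frame; the cut-offs (at least 3 non-missing values per row, at most 40% missing per column) are arbitrary assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per row and column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [1.0, np.nan, 3.0, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# Keep only rows that have at least 3 non-missing values.
rows_kept = df.dropna(axis=0, thresh=3)

# Keep only columns in which at most 40% of the values are missing.
missing_fraction = df.isnull().mean()
cols_kept = df.loc[:, missing_fraction <= 0.4]

print(rows_kept.index.tolist())    # [0, 2, 3]
print(cols_kept.columns.tolist())  # ['c', 'd']
```

Compared with a plain `dropna()`, this keeps rows and columns that are only lightly affected, which limits how much useful data is thrown away.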
There are many imputation methods for replacing the missing values. You can use different Python libraries, such as Pandas and scikit-learn, to do this. Let’s go through some of the ways of replacing the missing values.
Replacing with an arbitrary value
If you can make an educated guess about the missing value, then you can replace it with some arbitrary value using the following code. E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with ‘0’.
#Replace the missing value with '0' using 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
0
Replacing with the mean
This is the most common method of imputing missing values of numeric columns. If there are outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first. You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and ‘Credit_History’ with the mean of the respective column values.
#Replace the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
train_df.isnull().sum()
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Replacing with the mode
Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’
#Replace the missing values for categorical columns with mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()
OUT:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Replacing with the median
The median is the middlemost value. It’s better to use the median value for imputation in the case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’ with the median value.
train_df['Loan_Amount_Term']= train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
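To see why the median is preferred when outliers are present, here is a small sketch on invented numbers (the series `amounts` is hypothetical and not taken from the loan dataset). A single extreme value drags the mean far away from the typical value, while the median is unaffected.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with one extreme outlier and one missing value.
amounts = pd.Series([100.0, 120.0, 130.0, 110.0, 5000.0, np.nan])

mean_value = amounts.mean()      # 1092.0, pulled up by the outlier
median_value = amounts.median()  # 120.0, close to the typical value

filled_with_mean = amounts.fillna(mean_value)
filled_with_median = amounts.fillna(median_value)

print(filled_with_mean.iloc[-1], filled_with_median.iloc[-1])
```

Both ‘mean’ and ‘median’ skip missing values by default, so they can be computed directly on the column that is about to be imputed.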
Replacing with the previous value – forward fill
In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate. This is called forward fill. It is mostly used in time series data. You can use the ‘fillna’ function with the parameter ‘method='ffill'’ or, in recent pandas versions where that parameter is deprecated, the dedicated ‘ffill’ method.
IN:
import pandas as pd
import numpy as np
test = pd.Series(range(6), dtype='float64')  # float dtype so the series can hold NaN
test.loc[2:4] = np.nan
test
OUT:
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
dtype: float64
IN:
# Forward-Fill (same result as the older test.fillna(method='ffill'))
test.ffill()
OUT:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64
Replacing with the next value – backward fill
In backward fill, the missing value is imputed using the next value.
IN:
# Backward-Fill (same result as the older test.fillna(method='bfill'))
test.bfill()
OUT:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64
Interpolation
Missing values can also be imputed using interpolation. Pandas’ interpolate method can be used to replace the missing values with different interpolation methods like ‘polynomial,’ ‘linear,’ and ‘quadratic.’ The default method is ‘linear.’
IN:
test.interpolate()
OUT:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
dtype: float64
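The default ‘linear’ method treats the rows as equally spaced. When the index holds timestamps with uneven gaps, pandas also offers ‘time’ interpolation, which weights by the actual time difference. The sketch below uses invented readings and dates to show how the two differ.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken on 1, 2 and 5 January; the middle one is missing.
stamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
readings = pd.Series([10.0, np.nan, 50.0], index=stamps)

# 'linear' ignores the index: the gap is filled halfway between 10 and 50.
by_position = readings.interpolate()

# 'time' uses the index: 2 January is one quarter of the way from 1 to 5 January.
by_time = readings.interpolate(method="time")

print(by_position.iloc[1])  # 30.0
print(by_time.iloc[1])      # 20.0
```

Methods such as ‘polynomial’ and ‘quadratic’ additionally require SciPy to be installed, and ‘polynomial’ needs an ‘order’ argument.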
There are two ways to impute missing values for categorical features as follows:
We will use ‘SimpleImputer’ in this case, and as this is a non-numeric column, we can’t use mean or median, but we can use the most frequent value and constant.
import pandas as pd
import numpy as np
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle', np.nan]})
X
Shape
0 square
1 square
2 oval
3 circle
4 NaN
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
OUT: array([['square'], ['square'], ['oval'], ['circle'], ['square']], dtype=object)
As you can see, the missing value is imputed with the most frequent value, ’square.’
We can impute the value “missing,” which treats it as a separate category.
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)
OUT: array([['square'], ['square'], ['oval'], ['circle'], ['missing']], dtype=object)
In any of the above approaches, you will still need to OneHotEncode the data (or you can also use another encoder of your choice). After One Hot Encoding, in case 1, instead of a single column with the values ‘square,’ ‘oval,’ and ‘circle,’ you will get three feature columns. And in case 2, you will get four feature columns (the 4th one for the ‘missing’ category). So it’s like adding the missing indicator column in the data. There is another way to add a missing indicator column, which we will discuss further.
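The two cases can be sketched by chaining each imputer with a ‘OneHotEncoder’ in a scikit-learn pipeline. The pipeline names below are made up for illustration, and the default encoder settings are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"Shape": ["square", "square", "oval", "circle", np.nan]})

# Case 1: fill with the most frequent value, then encode.
mode_then_encode = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())

# Case 2: fill with the constant 'missing', then encode.
const_then_encode = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"), OneHotEncoder()
)

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense for display.
encoded_mode = mode_then_encode.fit_transform(X).toarray()
encoded_const = const_then_encode.fit_transform(X).toarray()

print(encoded_mode.shape)   # (5, 3)
print(encoded_const.shape)  # (5, 4)
print(const_then_encode[-1].categories_)
```

Putting both steps in one pipeline means the fill value and the categories are learned from the training data only and then reused unchanged on new data.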
We can impute missing values using the scikit-learn library by building a model that predicts the missing values of one variable from other variables. This is known as regression imputation.
In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.
Let’s see an example:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
OUT: SimpleImputer()
IN:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
OUT:
[[4. 2. ]
[6. 3.666...]
[7. 6. ]]
In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values with the multivariate approach: the KNNImputer and IterativeImputer classes.
Let’s take an example of the Titanic dataset.
Suppose the feature ‘age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also younger and people with higher fares are also older. In that case, it would make sense to impute low age for low fare values and high age for high fare values. So here, we are taking multiple features into account by following a multivariate approach.
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
X
| | SibSp | Fare | Age |
|---|---|---|---|
| 0 | 1 | 7.2500 | 22.0 |
| 1 | 1 | 71.2833 | 38.0 |
| 2 | 0 | 7.9250 | 26.0 |
| 3 | 1 | 53.1000 | 35.0 |
| 4 | 0 | 8.0500 | 35.0 |
| 5 | 0 | 8.4583 | NaN |
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
impute_it = IterativeImputer()
impute_it.fit_transform(X)
OUT: array([[ 1. , 7.25 , 22. ], [ 1. , 71.2833 , 38. ], [ 0. , 7.925 , 26. ], [ 1. , 53.1 , 35. ], [ 0. , 8.05 , 35. ], [ 0. , 8.4583 , 28.50639495]])
Let’s see how IterativeImputer works. For all rows in which ‘Age’ is not missing, scikit-learn fits a regression model. It uses ‘SibSp’ and ‘Fare’ as the features and ‘Age’ as the target. Then, for all rows in which ‘Age’ is missing, it makes predictions for ‘Age’ by passing ‘SibSp’ and ‘Fare’ to the trained model. So it actually builds a regression model with two features and one target and then makes predictions wherever there are missing values. Those predictions are the imputed values.
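The idea described above can be reproduced by hand for a single column. This is a simplified sketch, not what IterativeImputer does internally: it swaps in a plain ‘LinearRegression’ (IterativeImputer uses a ‘BayesianRidge’ estimator by default and iterates over every feature with missing values), so the imputed number will not match the output above exactly. The six rows are typed in from the table above to avoid a download.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The six Titanic rows shown in the table, entered by hand.
X = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})

features = ["SibSp", "Fare"]
known = X["Age"].notna()   # rows where the target 'Age' is observed

# Fit on rows with a known age, using SibSp and Fare as features.
model = LinearRegression().fit(X.loc[known, features], X.loc[known, "Age"])

# Predict the age for rows where it is missing and use that as the imputed value.
X_filled = X.copy()
X_filled.loc[~known, "Age"] = model.predict(X.loc[~known, features])

print(X_filled)
```

With only five training rows the prediction is very rough. The point is the mechanics: observed rows train the model, and rows with a missing target receive its predictions.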
With KNNImputer, missing values are imputed using the k-Nearest Neighbors approach, where a Euclidean distance (computed only over the features that are present) is used to find the nearest neighbors. Let’s take the above example of the Titanic dataset to see how it works.
from sklearn.impute import KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
OUT: array([[ 1. , 7.25 , 22. ], [ 1. , 71.2833, 38. ], [ 0. , 7.925 , 26. ], [ 1. , 53.1 , 35. ], [ 0. , 8.05 , 35. ], [ 0. , 8.4583, 30.5 ]])
In the above example, n_neighbors=2. So scikit-learn finds the two most similar rows, measured by how close their ‘SibSp’ and ‘Fare’ values are to those of the row that has the missing value. In this case, the last row has a missing value, and the third row and the fifth row have the closest values for the other two features. So the average of the ‘Age’ feature from these two rows is taken as the imputed value.
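The neighbor search can be checked by hand with NumPy. This sketch re-enters the six rows from the table and uses a plain Euclidean distance over ‘SibSp’ and ‘Fare’. KNNImputer additionally rescales distances when coordinates are missing, which does not change the ordering of the neighbors in this example.

```python
import numpy as np

# The six Titanic rows from the table, entered by hand.
sibsp = np.array([1, 1, 0, 1, 0, 0], dtype=float)
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583])
age = np.array([22.0, 38.0, 26.0, 35.0, 35.0, np.nan])

target = 5                                   # row whose age is missing
candidates = np.flatnonzero(~np.isnan(age))  # rows that can donate an age

# Euclidean distance from the target row, using the two observed features.
distances = np.sqrt(
    (sibsp[candidates] - sibsp[target]) ** 2 + (fare[candidates] - fare[target]) ** 2
)

# Take the two closest rows and average their ages.
nearest = candidates[np.argsort(distances)[:2]]
imputed_age = age[nearest].mean()

print(sorted(nearest.tolist()))  # [2, 4], i.e. the third and fifth rows
print(imputed_age)               # 30.5
```

Because ‘Fare’ spans a much wider range than ‘SibSp’, it dominates the distance here. Scaling the features before using a distance-based imputer is often worth considering.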
In some cases, while imputing missing values, you can preserve information about which values were missing and use that as a feature. This is because sometimes, there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict. In such cases, you can add a missing indicator to encode the “missingness” as a feature in the imputed data set.
Where can we use this?
Suppose you are predicting the presence of a disease. Now, imagine a scenario where a missing age is a good predictor of the disease because we don’t have records for people in poverty. The age values are not missing at random. They are missing for people in poverty, and poverty is a good predictor of disease. Thus, missing age or “missingness” is a good predictor of disease.
import pandas as pd
import numpy as np
X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
X
| | Age |
|---|---|
| 0 | 20.0 |
| 1 | 30.0 |
| 2 | 10.0 |
| 3 | NaN |
| 4 | 10.0 |
from sklearn.impute import SimpleImputer
# impute the mean
imputer = SimpleImputer()
imputer.fit_transform(X)
OUT: array([[20. ], [30. ], [10. ], [17.5], [10. ]])
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
OUT: array([[20. , 0. ], [30. , 0. ], [10. , 0. ], [17.5, 1. ], [10. , 0. ]])
In the above example, the second column indicates whether the corresponding value in the first column was missing or not. ‘1’ indicates that the corresponding value was missing, and ‘0’ indicates that the corresponding value was not missing.
If you don’t want to impute missing values but only want to have the indicator matrix, then you can use the ‘MissingIndicator’ class from scikit-learn.
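A minimal sketch of ‘MissingIndicator’ on the same single-column ‘Age’ data, assuming its default settings (which only report features that contain missing values):

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

X = pd.DataFrame({"Age": [20, 30, 10, np.nan, 10]})

# fit_transform returns a boolean mask: True where the value was missing.
indicator = MissingIndicator()
mask = indicator.fit_transform(X)

print(mask.ravel())  # [False False False  True False]
```

The mask can be appended to the feature matrix as an extra column, which is what ‘add_indicator=True’ does for you in ‘SimpleImputer’.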
Missing data is a problem everyone faces while dealing with real-life data. It can impact the quality and accuracy of our results. Understanding the different types of missing data and their potential impact on the analysis is crucial for selecting an appropriate method for handling missing values. Each method has its advantages and disadvantages and is appropriate for different types of missing data.
Key Takeaways
A. The three types of missing data are Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
A. We can use different methods to handle missing data points, such as dropping missing values, imputing them using machine learning, or treating missing values as a separate category.
A. Pairwise deletion is a method of handling missing values where only the observations with complete data are used in each pairwise correlation or regression analysis. This method assumes that the missing data is MCAR, and it is appropriate when the missing data is not too large.
To handle missing values in data:
Delete: Remove rows with missing values, but this can lead to loss of data.
Impute: Fill in missing values with statistical measures like mean, median, or regression predictions.
Use advanced techniques like EM algorithm or deep learning for more accurate imputation. Choose based on data nature and analysis goals.