In data science and machine learning, dealing with missing values is a critical step toward accurate and reliable model predictions. This tutorial guides you through the process of handling missing data, highlighting various imputation techniques that help maintain data integrity. By the end of this article, you will have a solid understanding of how to find, understand, and treat missing values in a dataset using Python, so you can refine your data and ensure robust analysis and model performance.
This article was published as a part of the Data Science Blogathon
Missing data is defined as values or data that are not stored (or not present) for one or more variables in a given dataset. Below is a sample of the missing data from the Titanic dataset; you can see the columns ‘Age’ and ‘Cabin’ have some missing values.
Missing values in a dataset can be represented in various ways, depending on the source of the data and the conventions used. Here are some common representations:
- NaN (Not a Number): this is the default representation in libraries like Pandas in Python.
- NULL or None: in SQL databases, for instance, a missing value is typically recorded as NULL.
- Empty strings (""): this is common in text-based data or CSV files where a field might be left blank.

Understanding the representation of missing values in your dataset is crucial for proper data cleaning and preprocessing. Identifying and handling these missing values accurately ensures that your data analysis and machine learning models perform optimally.
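As a quick illustration, here is a minimal sketch (using made-up CSV data) showing that Pandas converts both blank fields and the string "NULL" into NaN by default:

import io
import pandas as pd

# A tiny made-up CSV: one score left blank, one recorded as "NULL"
csv_data = io.StringIO("name,score\nalice,90\nbob,\ncarol,NULL\n")

df = pd.read_csv(csv_data)  # blank fields and "NULL" both become NaN
print(df)
print(df['score'].isnull().sum())  # 2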
There can be multiple reasons why certain values are missing from the data, such as human error during data entry, non-response in surveys, or limitations of the data gathering methods. The reason behind the missingness affects the approach to handling it, so it’s necessary to understand why the data could be missing.
Understanding the types of missing values in datasets is crucial for effectively handling missing data and ensuring accurate analyses:
Missing Completely At Random (MCAR)
MCAR occurs when the probability of data being missing is uniform across all observations. There is no relationship between the missingness of the data and any other observed or unobserved data within the dataset. This type of missing data is purely random and lacks any discernible pattern.
Example: In a survey about library books, some overdue book values in the dataset are missing due to human error in recording.
Missing At Random (MAR)
MAR occurs when the probability of data being missing depends only on the observed data and not on the missing data itself. In other words, the missingness can be explained by variables for which you have complete information. There is a pattern in the missing values, but this pattern can be explained by other observed variables.
Example: In a survey, ‘Age’ values might be missing for those who did not disclose their ‘Gender’. Here, the missingness of ‘Age’ depends on ‘Gender’, but the missing ‘Age’ values are random among those who did not disclose their ‘Gender’.
Missing Not At Random (MNAR)
MNAR occurs when the missingness of data is related to the unobserved data itself, which is not included in the dataset. This type of missing data has a specific pattern that cannot be explained by observed variables.
Example: In a survey about library books, people with more overdue books might be less likely to respond to the survey. Thus, the count of overdue books is missing, and its missingness depends on the (unobserved) count itself.
Understanding the type of missing data is crucial because it determines the appropriate strategy for handling missing values and preserving the integrity of statistical analyses. Choosing the right technique for deleting, filling, or imputing missing values mitigates bias and leads to more robust results.
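To make the distinction concrete, here is a small illustrative sketch (using synthetic data, not the article’s dataset) that simulates MCAR and MAR missingness along the lines of the survey examples above:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
survey = pd.DataFrame({
    'Gender': rng.choice(['M', 'F', 'Undisclosed'], size=100),
    'Age': rng.integers(18, 80, size=100).astype(float),
})

# MCAR: every 'Age' value has the same 10% chance of being lost,
# independent of everything else in the data
mcar = survey.copy()
mcar.loc[rng.random(100) < 0.10, 'Age'] = np.nan

# MAR: 'Age' is missing only where 'Gender' was not disclosed, so the
# missingness is fully explained by an observed variable
mar = survey.copy()
mar.loc[mar['Gender'] == 'Undisclosed', 'Age'] = np.nan

# MNAR would arise if 'Age' went missing because of its own value,
# e.g. survey.loc[survey['Age'] > 70, 'Age'] = np.nan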
Missing data is a common headache in any field that deals with datasets. It can arise for various reasons, from human error during data collection to limitations of data gathering methods. Luckily, there are strategies to address missing data and minimize its impact on your analysis, and it is important to apply them appropriately. The two main approaches, deletion and imputation, are covered in detail below.
Let’s take an example of the Loan Prediction Practice Problem from Analytics Vidhya. You can download the dataset from the following link.
The first step in handling missing values is to carefully look at the complete data and find all the missing values. The following code shows the total number of missing values in each column. It also shows the total number of missing values in the entire data set.
import pandas as pd
train_df = pd.read_csv("train.csv")
#Find the missing values from each column
print(train_df.isnull().sum())
From the above output, we can see that there are 7 columns – Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History – that have missing values.
#Find the total number of missing values from the entire dataset
train_df.isnull().sum().sum()
#Output
149
There are 149 missing values in total.
Here is a list of popular strategies to handle missing values in a dataset.
Analyze each column with missing values carefully to understand the reasons behind the missing of those values, as this information is crucial to choose the strategy for handling the missing values.
There are 2 primary ways of handling missing values:
- Deleting the missing values
- Imputing the missing values
Deleting the Missing Values
Generally, this approach is not recommended; it is a quick-and-dirty technique for dealing with missing values. If the missing values are of the type Missing Not At Random (MNAR), they should not be deleted.
If the missing values are of type Missing At Random (MAR) or Missing Completely At Random (MCAR), they can be deleted. Deletion can be listwise (dropping every case with a missing value) or pairwise (each calculation uses all cases with data available for the variables involved, which assumes the missing observations are MCAR).
The disadvantage of this method is one might end up deleting some useful data from the dataset.
There are 2 ways one can delete the missing data values:
If a row has many missing values, you can drop the entire row. Be aware that if every row has at least one missing value, you might end up deleting the whole dataset. The code to drop rows with missing values is as follows:
df = train_df.dropna(axis=0)
df.isnull().sum()
Output:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
If a certain column has many missing values, then you can choose to drop the entire column. The code to drop the entire column is as follows:
df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()
Output:
Loan_ID 0
Gender 13
Married 3
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
Imputing the Missing Values
There are many imputation methods for replacing missing values. You can use Python libraries such as Pandas and scikit-learn to do this. Let’s go through some of the ways of replacing missing values.
If you can make an educated guess about the missing value, then you can replace it with some arbitrary value using the following code. E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with ‘0’.
#Replace the missing value with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
Output:
0
Replacing missing values with the mean is the most common method of imputing numeric columns. If there are outliers, the mean will not be appropriate; in such cases, the outliers need to be treated first. You can use the ‘fillna’ method to impute the columns ‘LoanAmount’ and ‘Credit_History’ with the mean of the respective column values.
#Replace the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
train_df.isnull().sum()
Output:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’
#Replace the missing values for categorical columns with mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()
Output:
Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
The median is the middlemost value. It’s better to use the median value for imputation in the case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’ with the median value.
train_df['Loan_Amount_Term']= train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
In some cases, imputing the missing values with the previous value instead of the mean, mode, or median is more appropriate. This is called forward fill. It is mostly used in time series data. You can use the ‘fillna’ method with the parameter method='ffill':
import pandas as pd
import numpy as np
test = pd.Series(range(6))
test.loc[2:4] = np.nan
test
Output:
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
dtype: float64
# Forward-Fill
test.fillna(method='ffill')
Output:
0 0.0
1 1.0
2 1.0
3 1.0
4 1.0
5 5.0
dtype: float64
In backward fill, the missing value is imputed using the next value.
# Backward-Fill
test.fillna(method='bfill')
Output:
0 0.0
1 1.0
2 5.0
3 5.0
4 5.0
5 5.0
dtype: float64
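Note that in recent pandas releases (2.1 and later), the method= parameter of ‘fillna’ is deprecated; the dedicated ‘ffill’ and ‘bfill’ methods are the drop-in replacements:

# Equivalent, non-deprecated spellings
test.ffill()  # same as test.fillna(method='ffill')
test.bfill()  # same as test.fillna(method='bfill')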
Interpolation
Missing values can also be imputed using interpolation. Pandas’ interpolate method can be used to replace the missing values with different interpolation methods like ‘polynomial,’ ‘linear,’ and ‘quadratic.’ The default method is ‘linear.’
test.interpolate()
Output:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
dtype: float64
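The other methods mentioned above are selected through the method parameter (polynomial interpolation also needs an order argument, and both of these require SciPy to be installed). For example:

# Interpolation variants (require SciPy)
test.interpolate(method='quadratic')
test.interpolate(method='polynomial', order=2)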
There are two ways to impute missing values for categorical features as follows:
First, we will use ‘SimpleImputer’ to impute the most frequent value. As this is a non-numeric column, we can’t use the mean or median, but we can use the most frequent value or a constant.
import pandas as pd
import numpy as np
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle', np.nan]})
X
Output:
    Shape
0  square
1  square
2    oval
3  circle
4     NaN
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
Output:
array([['square'],
['square'],
['oval'],
['circle'],
['square']], dtype=object)
As you can see, the missing value is imputed with the most frequent value, ’square.’
Alternatively, we can impute the constant value “missing,” which treats missingness as a separate category.
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)
Output:
array([['square'],
['square'],
['oval'],
['circle'],
['missing']], dtype=object)
In any of the above approaches, you will still need to OneHotEncode the data (or you can also use another encoder of your choice). After One Hot Encoding, in case 1, instead of the values ‘square,’ ‘oval,’ and’ circle,’ you will get three feature columns. And in case 2, you will get four feature columns (4th one for the ‘missing’ category). So it’s like adding the missing indicator column in the data. There is another way to add a missing indicator column, which we will discuss further.
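Here is a minimal sketch of both cases end to end (assuming scikit-learn 1.2+, where OneHotEncoder takes the sparse_output parameter; older versions use sparse=False instead):

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Case 1: impute the most frequent value, then one-hot encode
pipe1 = make_pipeline(SimpleImputer(strategy='most_frequent'),
                      OneHotEncoder(sparse_output=False))
print(pipe1.fit_transform(X).shape)  # (5, 3): circle, oval, square

# Case 2: impute the constant 'missing', then one-hot encode
pipe2 = make_pipeline(SimpleImputer(strategy='constant', fill_value='missing'),
                      OneHotEncoder(sparse_output=False))
print(pipe2.fit_transform(X).shape)  # (5, 4): the extra column is 'missing'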
We can also impute missing values using scikit-learn by building a model that predicts the missing values of a variable from the other variables; this is known as regression imputation.
In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.
Let’s see an example:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Output:
SimpleImputer()
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
Output:
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values with a multivariate approach: using the KNNImputer class or the IterativeImputer class.
Let’s take an example from the Titanic dataset.
Suppose the feature ‘age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also younger and people with higher fares are also older. In that case, it would make sense to impute low age for low fare values and high age for high fare values. So here, we are taking multiple features into account by following a multivariate approach.
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
X
   SibSp     Fare   Age
0      1   7.2500  22.0
1      1  71.2833  38.0
2      0   7.9250  26.0
3      1  53.1000  35.0
4      0   8.0500  35.0
5      0   8.4583   NaN
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
impute_it = IterativeImputer()
impute_it.fit_transform(X)
Output:
array([[ 1. , 7.25 , 22. ],
[ 1. , 71.2833 , 38. ],
[ 0. , 7.925 , 26. ],
[ 1. , 53.1 , 35. ],
[ 0. , 8.05 , 35. ],
[ 0. , 8.4583 , 28.50639495]])
Let’s see how IterativeImputer works. For all rows in which ‘Age’ is not missing, scikit-learn runs a regression model using ‘SibSp’ and ‘Fare’ as the features and ‘Age’ as the target. Then, for all rows in which ‘Age’ is missing, it predicts ‘Age’ by passing ‘SibSp’ and ‘Fare’ to the trained model. So it builds a regression model with two features and one target and then predicts wherever there are missing values; those predictions are the imputed values.
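To make the mechanism explicit, here is a simplified, hand-rolled sketch of a single regression pass (IterativeImputer itself defaults to a BayesianRidge estimator and repeats the process over several rounds, so its numbers differ slightly):

from sklearn.linear_model import LinearRegression

known = X[X['Age'].notna()]    # rows where 'Age' is observed: training data
unknown = X[X['Age'].isna()]   # rows where 'Age' is missing: to be predicted

reg = LinearRegression()
reg.fit(known[['SibSp', 'Fare']], known['Age'])
print(reg.predict(unknown[['SibSp', 'Fare']]))  # imputed 'Age' for the last row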
Missing values are imputed using the k-Nearest Neighbors approach, where a Euclidean distance is used to find the nearest neighbors. Let’s take the above example from the Titanic dataset to see how it works.
from sklearn.impute import KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
Output:
array([[ 1. , 7.25 , 22. ],
[ 1. , 71.2833, 38. ],
[ 0. , 7.925 , 26. ],
[ 1. , 53.1 , 35. ],
[ 0. , 8.05 , 35. ],
[ 0. , 8.4583, 30.5 ]])
In the above example, n_neighbors=2, so scikit-learn finds the two rows most similar to the row with the missing value, measured by how close their ‘SibSp’ and ‘Fare’ values are. Here, the last row has the missing value, and the third and fifth rows have the closest values for the other two features. So the average of the ‘Age’ values from those two rows is taken as the imputed value.
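The same neighbors can be found by hand. Below is a simplified sketch (KNNImputer actually uses a NaN-aware Euclidean distance, but that does not change the neighbor ranking in this example):

import numpy as np

observed = X[X['Age'].notna()]          # the five rows with a known 'Age'
target = np.array([0, 8.4583])          # 'SibSp' and 'Fare' of the last row
dist = np.sqrt(((observed[['SibSp', 'Fare']].to_numpy() - target) ** 2).sum(axis=1))

nearest = np.argsort(dist)[:2]          # indices of the two closest rows
print(observed['Age'].to_numpy()[nearest].mean())  # 30.5, matching KNNImputer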
In some cases, while imputing missing values, you can preserve information about which values were missing and use that as a feature. This is because sometimes, there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict. In such cases, you can add a missing indicator to encode the “missingness” as a feature in the imputed data set.
Where can we use this?
Suppose you are predicting the presence of a disease. Now, imagine a scenario where a missing age is a good predictor of the disease because we don’t have records for people in poverty. The age values are not missing at random. They are missing for people in poverty, and poverty is a good predictor of disease. Thus, missing age or “missingness” is a good predictor of disease.
import pandas as pd
import numpy as np
X = pd.DataFrame({'Age':[20, 30, 10, np.nan, 10]})
X
    Age
0  20.0
1  30.0
2  10.0
3   NaN
4  10.0
from sklearn.impute import SimpleImputer

# impute the mean
imputer = SimpleImputer()
imputer.fit_transform(X)
Output:
array([[20. ],
[30. ],
[10. ],
[17.5],
[10. ]])
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
Output:
array([[20. , 0. ],
[30. , 0. ],
[10. , 0. ],
[17.5, 1. ],
[10. , 0. ]])
In the above example, the second column indicates whether the corresponding value in the first column was missing or not. ‘1’ indicates that the corresponding value was missing, and ‘0’ indicates that the corresponding value was not missing.
If you don’t want to impute missing values but only want the indicator matrix, you can use the ‘MissingIndicator’ class from scikit-learn.
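A minimal example with the ‘Age’ frame from above:

from sklearn.impute import MissingIndicator

# Returns only the boolean indicator matrix, without imputing anything
indicator = MissingIndicator()
print(indicator.fit_transform(X))
#Output
# [[False]
#  [False]
#  [False]
#  [ True]
#  [False]]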
Handling missing values in your dataset is crucial for accurate analysis, and there is no single best method: the right choice depends on your data and your analysis. Understanding the missing-data pattern (MCAR, MAR, or MNAR) helps you choose wisely and ensures the reliability of your results.
Missing data is a pervasive challenge in real-life datasets, impacting the quality and accuracy of results in data analysis. Understanding the different types of missing data values (MCAR, MAR, and MNAR) and their potential impact on the analysis is crucial for selecting an appropriate method to handle them. Each method, whether it is deleting the missing values, imputing them with simple statistics, or using advanced techniques like KNN or regression imputation, has its advantages and disadvantages and is appropriate for different types of missing data.
By the end of this article, you should have a clear understanding of how to find missing values in a dataset and how to handle them with the technique best suited to your data.
Q1. What are the three types of missing data?
A. The three types of missing data are Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
Q2. How can we handle missing data points?
A. We can use different methods to handle missing data points, such as dropping missing values, imputing them using machine learning, or treating missing values as a separate category.
Q3. How does pairwise deletion handle missing values?
A. Pairwise deletion handles missing values by using only the observations with complete data in each pairwise correlation or regression analysis. This method assumes that the missing data is MCAR, and it is appropriate when the amount of missing data is not too large.
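For example, pandas performs pairwise deletion by default when computing correlations: each pair of columns uses only the rows where both values are present. A small sketch on the raw (un-imputed) Loan Prediction data:

# Each pairwise correlation silently skips rows with a NaN in either column
train_df[['ApplicantIncome', 'LoanAmount', 'Loan_Amount_Term']].corr()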
Identify missing data type: MCAR, MAR, or MNAR.
Delete: Remove rows (listwise) or cases for calculations (pairwise).
Impute: Replace missing values with mean, median, mode, last/next value, similar cases, or multiple estimates.
Analyze: Use maximum likelihood or model-based methods.