This article was published as a part of the Data Science Blogathon.

Missing data in machine learning is data that contains “None” or “NaN” values. Missing data must be handled before training machine learning algorithms. It can be filled using basic Python programming, the pandas library, or the scikit-learn class SimpleImputer. Handling missing values with scikit-learn’s SimpleImputer is the easiest and most convenient of these methods.

In simple words, SimpleImputer is a scikit-learn class used to fill in missing values in datasets. As the name suggests, the class performs simple imputations on the dataset: it replaces the missing data with another value based on a given strategy.

The basic syntax or structure of a SimpleImputer initialization is:

`SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)`

**missing_values:** indicates the placeholder for missing values in the dataset. By default it is set to np.nan, which means all values containing np.nan will be considered missing.

**strategy:** the method used to fill in the missing values. The value of the strategy can be “mean”, “median”, “most_frequent”, or “constant”.

**fill_value**: a parameter used only when the strategy is set to “constant”. In that case, the NaN values will be replaced by the value passed in fill_value.

**verbose**: controls the verbosity of the SimpleImputer. By default, verbose is set to 0.

**copy**: if this parameter is set to True, a copy of the dataset will be created; otherwise, imputation will be done in place.

**add_indicator**: if this parameter is set to True, a MissingIndicator transform will be stacked onto the output of the imputer’s transform, flagging which values were imputed.
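To see what add_indicator does in practice, here is a minimal sketch on hypothetical toy data: each column that contained missing values gets an extra binary indicator column appended to the output, marking where imputation happened.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: both columns contain one missing entry.
X = [[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]]

imp = SimpleImputer(strategy='mean', add_indicator=True)
out = imp.fit_transform(X)

# Two imputed columns followed by two binary indicator columns
# (one per feature that had missing values during fit).
print(out.shape)  # (3, 4)
print(out)
```

The indicator columns let a downstream model learn from the "missingness" pattern itself, not just the imputed values.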

To start with SimpleImputer, first, we must install scikit-learn and import the class from it.

To install scikit-learn, use the code below:

pip install scikit-learn

Once the library is installed on the machine, it should be imported into the Python IDE you are using. Use the code below to import it:

# importing numpy (needed for np.nan)
import numpy as np
# importing simpleimputer
from sklearn.impute import SimpleImputer

Using the strategy “mean” in SimpleImputer allows us to impute the missing value with the mean of that column. This strategy can only be used on numerical data.

Let’s suppose we have a numerical column named “Age” in our data set in which some of the values are missing. Then using the Mean strategy will allow us to fill in the missing values in the column by the mean of all age values.

**Code:**

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df['age'] = imputer.fit_transform(df[['age']])

**Example:**

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(age))

The **Output** of the particular code would be:

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

While working with mean-strategy imputation, the possibility of outliers should be considered: the mean is computed from all the values in the column, so a single extreme value can pull the mean toward one side and bias it, which results in inaccurate imputation.
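The outlier effect is easy to demonstrate. In this minimal sketch with hypothetical ages, one extreme value (200) drags the mean far above the typical age, while the median stays representative of the bulk of the data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages with one extreme outlier (200) and one missing value.
ages = np.array([[22.0], [25.0], [27.0], [200.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(ages)
median_filled = SimpleImputer(strategy='median').fit_transform(ages)

print(mean_filled[-1, 0])    # 68.5 -- skewed upward by the outlier
print(median_filled[-1, 0])  # 26.0 -- robust to the outlier
```

This is why the median strategy, discussed next, is generally preferred when a numerical column contains outliers.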

Using the strategy “median” in SimpleImputer allows us to impute the missing value with the median of that column. This strategy can only be used on numerical data.

Let’s suppose we have a column “age” in our dataset in which some values are missing. Using the median strategy will allow us to fill the missing values with the median of the values from the age column.

**Code:**

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
df['age'] = imputer.fit_transform(df[['age']])

**Example:**

imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imp_median.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_median.transform(age))

The **Output **of the particular code would be:

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

Most frequent imputation is a technique that is used for handling categorical missing data. This technique is used when we have missing values in a categorical column.

Using the most frequent imputation technique on a categorical column will allow us to fill the missing values with the most frequently occurring value in that column.

**Code:**

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df['category'] = imputer.fit_transform(df[['category']])

**Example:**

imp_mf = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mf.fit(np.array([['one', 'two', 'three'], ['two', np.nan, 'six'], ['two', 'two', 'two']], dtype=object))
category = np.array([[np.nan, 'two', 'two'], ['four', np.nan, 'six'], ['ten', np.nan, 'nine']], dtype=object)
print(imp_mf.transform(category))

The **Output **of the particular code would be:

[['two' 'two' 'two']
 ['four' 'two' 'six']
 ['ten' 'two' 'nine']]
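The same strategy can be applied directly to a pandas DataFrame column. Here is a minimal runnable sketch with a hypothetical `category` column; note that SimpleImputer expects 2D input, so the column is selected with double brackets and the result is flattened back with `.ravel()` before assignment:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical DataFrame with a missing category label.
df = pd.DataFrame({'category': ['red', 'blue', np.nan, 'red']})

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# df[['category']] keeps the input 2D; .ravel() flattens the (n, 1) output.
df['category'] = imputer.fit_transform(df[['category']]).ravel()

print(df['category'].tolist())  # ['red', 'blue', 'red', 'red']
```

Since 'red' occurs twice and 'blue' once, the missing entry is filled with 'red'.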

Constant imputation is a SimpleImputer technique with which we can fill the missing values with any desired value. It can be used on both string and numerical data.

By passing the desired value to the fill_value parameter, we can fill all the missing values present in the dataset with that value.

**Code:**

imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)
df['age'] = imputer.fit_transform(df[['age']])

**Example:**

imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=20)
imp_constant.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
age = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_constant.transform(age))

The **Output **of the particular code would be:

[[20.  2.  3.]
 [ 4. 20.  6.]
 [10. 20.  9.]]
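The constant strategy works on string data as well. In this minimal sketch, a hypothetical placeholder label 'unknown' is used to fill missing city names (the data and the fill value are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical string column with a missing entry; dtype=object keeps
# np.nan as a float NaN so SimpleImputer can detect it among strings.
city = np.array([['delhi'], [np.nan], ['mumbai']], dtype=object)

imp_str = SimpleImputer(missing_values=np.nan, strategy='constant',
                        fill_value='unknown')
filled = imp_str.fit_transform(city)
print(filled)  # the NaN is replaced by 'unknown'
```

A placeholder like this is handy when you want downstream encoders to treat "missing" as its own explicit category.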

In this article, the handling of missing data with the class SimpleImputer is discussed in detail. A total of 4 strategies, mean, median, most_frequent, and constant, can be used to fill in the missing values and are discussed in the code examples above.

Some **Key Takeaways **From this article are:

1. We should consider the outlier scenario while working with the mean strategy, as outliers can distort the imputed data and may result in a less accurate model with unexpected behavior (avoid using the mean strategy in case of outliers).

2. The mean and median strategies can only be used on numerical data, while the most_frequent strategy is the one to use for categorical data. They are among the simplest and computationally cheapest imputation methods.

3. The constant strategy can be used when we have a better understanding of the dataset and already know the impact of imputing the missing values with our desired number or string. It can be used on both string and numerical data.

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**
