Raghav Agrawal — June 1, 2021

This article was published as a part of the Data Science Blogathon

Introduction

Interpolation is a technique for estimating unknown data points that lie between two known data points. In Python it is mostly used to impute missing values in a pandas DataFrame or Series while preprocessing data.

Interpolation is also used in image processing: when you enlarge an image, you can estimate the value of each new pixel from its neighboring pixels.


Table of Contents

  • When to use Interpolation
  • Interpolation to fill missing values in series data
    • Linear Interpolation
    • Polynomial Interpolation
    • Interpolation through Padding
  • Interpolation to fill missing values in DataFrame
    • Linear Method
    • Backward Direction
    • Interpolation through Padding
  • Filling Missing Values in Time-Series Data
  • EndNote

When to use Interpolation

We can use interpolation to estimate a missing value from its neighbors. When imputing missing values with the average does not fit the data well, we need a different technique, and interpolation is the one most people turn to.

Interpolation is mostly used with time-series data, because there we prefer to fill a missing value from the one or two values around it. Take temperature as an example: we would rather fill in today's missing temperature from the last 2 days than from the mean of the whole month. We can also use interpolation when calculating moving averages.
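The difference between the two approaches can be sketched with a small example. The temperature readings below are hypothetical and chosen only to make the contrast visible.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures (degrees C) with one missing reading.
temps = pd.Series([30.0, 31.0, 32.0, np.nan, 34.0, 20.0, 19.0])

# Mean imputation: the gap gets the overall average of the known values.
mean_filled = temps.fillna(temps.mean())

# Linear interpolation: the gap is estimated from its two neighbours.
interp_filled = temps.interpolate()

print(mean_filled.loc[3])    # average of the six known readings, about 27.67
print(interp_filled.loc[3])  # 33.0, the midpoint of 32.0 and 34.0
```

Here the interpolated value stays close to the surrounding days, while the mean is pulled down by the cooler readings at the end of the series.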

Using Interpolation to fill Missing Values in Series Data

A pandas Series is a one-dimensional labeled array that, like a list, can store elements of various data types. We can easily create a Series from a list, tuple, or dictionary. To try out the interpolation methods, we will create a pandas Series with a NaN value and fill it using different methods.

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])

1) Linear Interpolation

Linear interpolation estimates a missing value by drawing a straight line between the known points on either side of it. In short, it assumes the values change at a constant rate between neighboring known points. Linear is the default method of interpolate(), so we do not need to specify it.

a.interpolate()

The output is:

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    7.0

Hence, linear interpolation fills the gap with 2.0, the value halfway between 1 and 3. Remember that it does not interpolate using the index; it treats the values as equally spaced and connects them with a straight line.
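To see what ignoring the index means in practice, here is a minimal sketch on a hypothetical series with an unevenly spaced index. It compares method="linear" with method="index", which uses the index values as the x-coordinates.

```python
import numpy as np
import pandas as pd

# Unevenly spaced index: the gap at 2 is much closer to 1 than to 10.
s = pd.Series([10.0, np.nan, 100.0], index=[1, 2, 10])

# method="linear" ignores the index and treats the points as equally spaced.
by_position = s.interpolate(method="linear")

# method="index" uses the index values as the x-coordinates instead.
by_index = s.interpolate(method="index")

print(by_position.loc[2])  # 55.0, halfway between 10 and 100
print(by_index.loc[2])     # 20.0 = 10 + (100 - 10) * (2 - 1) / (10 - 1)
```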

2) Polynomial Interpolation

In polynomial interpolation you need to specify an order. Polynomial interpolation fills missing values using a polynomial of the given degree fitted through the available data points, so the estimates follow a curve, such as a parabola for order 2, rather than a straight line. In pandas, this method requires SciPy to be installed.

a.interpolate(method="polynomial", order=2)

If you pass an order of 1, the output will be similar to linear interpolation, because a polynomial of order 1 is a straight line.
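The idea of following a curve instead of a straight line can be illustrated without SciPy. The sketch below is an assumption-laden illustration, not what pandas does internally: it fits a degree-2 polynomial with NumPy through three hand-picked known points of a hypothetical series whose values follow y = x**2.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose known values follow y = x**2; position 2 is missing.
s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0])

# Straight-line estimate from the two neighbours of the gap.
linear_estimate = s.interpolate().loc[2]

# Fit a degree-2 polynomial through three known points around the gap
# and evaluate it at the missing position.
known = s.loc[[1, 3, 4]]
coeffs = np.polyfit(known.index.to_numpy(), known.to_numpy(), deg=2)
quadratic_estimate = np.polyval(coeffs, 2)

print(linear_estimate)               # 5.0
print(round(quadratic_estimate, 6))  # 4.0
```

The straight line overshoots the true value of 4, while the quadratic fit recovers it because the data really is curved.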

3) Interpolation through Padding

Interpolation through padding simply means filling each missing value with the value that appears above it in the dataset. If the missing value is in the first row, this method cannot fill it, because there is no earlier value to copy. You can also specify a limit, which is the maximum number of consecutive NaN values to fill.

So, if you are working on a real-world project and want to fill every missing value with the previous value, leave the limit unset or set it to the number of rows in the dataset. Note that recent pandas releases deprecate method="pad" in interpolate() in favor of ffill().

a.interpolate(method="pad", limit=2)

The output is shown below.

0    0.0
1    1.0
2    1.0
3    3.0
4    4.0
5    5.0
6    7.0

The missing value is replaced by the value that comes just before it.
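The effect of the limit, and of a gap in the first row, can be sketched with ffill(), which performs the same forward fill. The series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a run of three missing values.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# No limit: every gap that has an earlier value is filled.
filled_all = s.ffill()

# limit=2: at most two consecutive NaNs are filled after each valid value.
filled_two = s.ffill(limit=2)

print(filled_all.tolist())  # [nan, 1.0, 1.0, 1.0, 1.0, 5.0]
print(filled_two.tolist())  # [nan, 1.0, 1.0, 1.0, nan, 5.0]
```

The first value stays NaN in both cases because nothing precedes it.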

 

Using Interpolation to fill Missing Values in Pandas DataFrame

A DataFrame is a widely used pandas data structure that stores data in rows and columns. When performing data analysis, we usually load the data into a table known as a dataframe. A dataframe can contain many missing values across many columns, so let us see how to use interpolation to fill them.

import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
                   "B":[None, 3, 57, 3, None],
                   "C":[20, 16, None, 3, 8],
                   "D":[14, 3, None, None, 6]})

1) Linear Interpolation in Forward Direction

The linear method ignores the index, treats the values as equally spaced, and estimates each missing value from the known points around it. With limit_direction='forward', gaps are filled from top to bottom, so a missing value in the first row has no earlier point and is left as NaN. Let's apply it to our dataframe.

df.interpolate(method='linear', limit_direction='forward')

The output is shown in the figure below.

Figure: Forward Interpolation

If you only want to interpolate a single column, that is also simple, as the code below shows.

df['C'].interpolate(method="linear")
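To check the forward result column by column, the dataframe from this section can be rebuilt and the filled values printed as lists. This is only a small verification sketch; the variable name filled is introduced here for illustration.

```python
import pandas as pd

# Same dataframe as in this section.
df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

filled = df.interpolate(method="linear", limit_direction="forward")

print(filled["A"].tolist())  # [12.0, 4.0, 7.0, 4.5, 2.0]
print(filled["B"].tolist())  # [nan, 3.0, 57.0, 3.0, 3.0]
print(filled["C"].tolist())  # [20.0, 16.0, 9.5, 3.0, 8.0]
print(filled["D"].tolist())  # [14.0, 3.0, 4.0, 5.0, 6.0]
```

Column B keeps its leading NaN, while its trailing NaN is filled with the last valid value.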

2) Linear Interpolation in Backward Direction

Now the method is the same; only the direction in which the gaps are filled changes. The method works from the end of the dataframe, so think of it as a bottom-to-top approach: a missing value in the first row can now be filled, while one in the last row is left as NaN.

df.interpolate(method='linear', limit_direction='backward')

The output is shown in the figure below.

Figure: Backward Interpolation
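The direction only matters for gaps at the edges, which the following sketch shows on column B of the dataframe, the one column with NaN at both ends. The limit_direction="both" call is added here for comparison.

```python
import numpy as np
import pandas as pd

# Column B from the dataframe above: NaN in the first and last rows.
b = pd.Series([np.nan, 3.0, 57.0, 3.0, np.nan])

forward = b.interpolate(method="linear", limit_direction="forward")
backward = b.interpolate(method="linear", limit_direction="backward")
both = b.interpolate(method="linear", limit_direction="both")

print(forward.tolist())   # [nan, 3.0, 57.0, 3.0, 3.0]
print(backward.tolist())  # [3.0, 3.0, 57.0, 3.0, nan]
print(both.tolist())      # [3.0, 3.0, 57.0, 3.0, 3.0]
```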

3) Interpolation with Padding

We have already seen that padding accepts a limit on the number of consecutive NaN values to fill. Our dataframe has at most 2 consecutive NaN values in a column (column D), so we set the limit to 2.

df.interpolate(method="pad", limit=2)

After running the above code, the missing values are filled with the previous valid values, giving the output shown in the figure below.

Figure: Interpolation with padding

 

Filling Missing Values in Time-Series Data

Time-series data is data recorded in time order, and it often shows a trend or seasonality. Analyzing time-series data is a little different from analyzing ordinary dataframes. When time-series data has missing values, mean imputation is usually a poor fit because it ignores the ordering of the observations. Interpolation is a powerful method for filling missing values in time-series data.

df = pd.DataFrame({'Date': pd.date_range(start='2021-07-01', periods=10, freq='H'),
                   'Value': np.arange(10, dtype=float)})  # float column so it can hold NaN
df.loc[2:3, 'Value'] = np.nan

Filling missing values with forward and backward methods

The simplest way to fill values with interpolate is the same one we applied to a single dataframe column.

df['Value'].interpolate(method="linear")

However, this approach ignores the date column. For time-series data it makes more sense to fill missing values according to the dates, so we set the date column as the index and use method="time", which takes the spacing between timestamps into account.

df.set_index('Date')['Value'].interpolate(method="time")
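When the timestamps are evenly spaced, as in the dataframe above, position-based and time-based interpolation give the same numbers. The sketch below uses a hypothetical series with irregular timestamps to show where they differ.

```python
import numpy as np
import pandas as pd

# Hypothetical readings at irregular timestamps; the middle one is missing.
idx = pd.to_datetime(["2021-07-01 00:00", "2021-07-01 01:00", "2021-07-01 04:00"])
ts = pd.Series([0.0, np.nan, 8.0], index=idx)

# method="linear" ignores the timestamps and returns the midpoint.
by_position = ts.interpolate(method="linear")

# method="time" weights the estimate by the elapsed time between readings.
by_time = ts.interpolate(method="time")

print(by_position.iloc[1])  # 4.0
print(by_time.iloc[1])      # 2.0, one hour into a four-hour gap
```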

With a small change, the same code can backfill missing values, copying the next valid value backward into each gap.

df.set_index('Date')['Value'].bfill()

EndNote

We have learned various ways to use the interpolate function in Python to fill missing values in a Series as well as in a DataFrame. In many cases, interpolation is a better technique for filling missing values than simple mean imputation. I hope you now appreciate the power of interpolation and understand how to use it. If you have any questions about the interpolate function, please post them in the comment section; I will be happy to help.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
