Interpolation Techniques Guide & Benefits | Data Analysis (Updated 2024)

Raghav Agrawal Last Updated : 15 Oct, 2024

Introduction

This beginner's tutorial covers interpolation, a technique for estimating unknown values that lie between known data points. In Python, interpolation is mostly used to impute missing values in a pandas DataFrame or Series during data preprocessing. You can apply the same approach wherever Python runs, for example in Power BI scripts or in a machine learning pipeline. Interpolation is also used in image processing: when an image is enlarged, each new pixel value is estimated from its neighboring pixels.

Learning Objectives

  • In this tutorial on data science and machine learning, we will learn how to handle missing data and preprocess it before feeding it to a machine learning model.
  • We will also learn how to fill missing data in Python with the pandas interpolate method, some of whose options rely on the SciPy library.

This article was published as a part of the Data Science Blogathon.

What is Interpolation?

Interpolation is like filling in the gaps. It’s a way to estimate missing values between known data points, like guessing the temperature at 3 pm when you only have readings for 2 pm and 4 pm.


Types of Interpolation

Nearest neighbor:

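It's similar to choosing your closest pal! The estimate is simply the value of the nearest known data point. To estimate the temperature at 3pm from readings at 2pm and 4pm, you copy the reading from the nearest time slot (here both are equally close, so a tie-breaking rule such as taking the earlier one, 2pm, is needed). This is straightforward, but the estimate jumps abruptly from one reading to the next, and it becomes unreliable when the data points are unevenly spaced or far apart.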

Linear Interpolation:

Linear interpolation connects neighboring points with a straight line. Picture temperatures recorded at 2pm and 4pm: to estimate the temperature at 3pm, you draw a line between the two points and read off its value at 3pm. This gives smoother estimates than nearest neighbor, but it cannot capture curvature between the points. In other words, it is a formula that predicts values within the range of the known data by assuming a linear relationship between adjacent points.

Spline Interpolation:

Picture linking the points with curved lines rather than straight ones. This is more complex than linear interpolation but can more accurately depict data with curves and bends. Spline interpolation uses piecewise polynomial functions between data points, creating a smooth curve that passes through each data point.
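To make the three approaches concrete, here is a minimal sketch that estimates a 3pm temperature each way. The readings are made-up values, and the use of NumPy together with SciPy's CubicSpline is an assumption of this sketch, not something the pandas examples later in the article depend on.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical readings: hour of day (24h clock) and temperature in degrees Celsius.
hours = np.array([12.0, 13.0, 14.0, 16.0, 17.0])
temps = np.array([20.0, 22.0, 25.0, 23.0, 21.0])
query_hour = 15.0  # 3pm, for which there is no reading

# Nearest neighbor: copy the reading whose hour is closest to the query.
# 2pm and 4pm are equally close; np.argmin returns the first match, so 2pm wins.
nearest_estimate = float(temps[np.argmin(np.abs(hours - query_hour))])

# Linear: the point on the straight line between the 2pm and 4pm readings.
linear_estimate = float(np.interp(query_hour, hours, temps))

# Cubic spline: a smooth curve through every reading, evaluated at 3pm.
spline = CubicSpline(hours, temps)
spline_estimate = float(spline(query_hour))

print("nearest:", nearest_estimate)  # 25.0
print("linear: ", linear_estimate)   # 24.0
print("spline: ", spline_estimate)
```

The spline estimate depends on all five readings, not just the two neighbors, which is why it can follow the bend in the data.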

When to Use Interpolation?

We can use interpolation to fill a missing (null) value with the help of its neighbors. When imputing missing values with the average is not a good fit, we need a different technique, and interpolation is the one most people reach for.

Interpolation is used mostly with time-series data, because there we want to fill a missing value from the observations closest to it. For example, with temperature data we would rather fill in today's missing reading from the values of the adjacent days than from the mean of the whole month. Interpolation is also useful as a preprocessing step before calculating moving averages, which are otherwise affected by the gaps.
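The difference between mean imputation and interpolation is easiest to see on a small series. The sketch below uses made-up temperature values; only fillna, mean, and interpolate from pandas are used.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with one missing day and a late spike.
temps = pd.Series([10.0, 12.0, np.nan, 16.0, 30.0])

# Mean imputation: the gap gets the average of all known values,
# so the spike on the last day pulls the estimate upward.
mean_filled = temps.fillna(temps.mean())

# Linear interpolation: the gap gets a value between its two neighbors.
interpolated = temps.interpolate()

print(mean_filled.loc[2])   # 17.0
print(interpolated.loc[2])  # 14.0
```

The interpolated value stays between the neighboring readings of 12 and 16, while the mean does not.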

What is the Interpolation Formula?

Interpolation Formula: Given two data points (x1, y1) and (x2, y2), where x1 < x < x2, the interpolated value of y at a point x is calculated as:

y = y1 + (x − x1) × (y2 − y1) / (x2 − x1)
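As a quick check of the formula, here is a small sketch that implements it directly. The function name linear_interpolate and the sample readings are hypothetical and chosen only for illustration.

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate y at x on the straight line through (x1, y1) and (x2, y2)."""
    if x1 == x2:
        raise ValueError("x1 and x2 must differ")
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

# Hypothetical readings: 25 degrees at 2pm (hour 14) and 23 degrees at 4pm (hour 16).
estimate_3pm = linear_interpolate(15, 14, 25.0, 16, 23.0)
print(estimate_3pm)  # 24.0
```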

Using Interpolation to Fill Missing Values in Series Data

A pandas Series is a one-dimensional labeled array that can hold elements of any data type. We can easily create a Series from a list, tuple, or dictionary. To try out the interpolation methods, we will create a pandas Series containing a NaN value and fill it by calling interpolate with different method options.

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])

Linear Interpolation

Linear interpolation estimates a missing value by drawing a straight line between the known values on either side of it. In short, the unknown value continues the trend set by its neighbors. Linear is the default method of interpolate, so we do not need to specify it.

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])
print(a.interpolate())

The output is:

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    7.0

Hence, the gap is filled with 2.0, the midpoint of its neighbors 1.0 and 3.0. Remember that the linear method does not use the index: it treats the values as equally spaced and connects the known points with straight lines.
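The point about the index matters when the index is unevenly spaced. The following sketch, with made-up values, compares the default linear method with method="index", which does take the index values into account.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken at positions 0, 1 and 10 (note the uneven gap).
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

by_position = s.interpolate(method="linear")  # ignores the index
by_index = s.interpolate(method="index")      # uses the index values

print(by_position.loc[1])  # 5.0, halfway between the neighboring values
print(by_index.loc[1])     # 1.0, one tenth of the way from index 0 to index 10
```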

Polynomial Interpolation

In polynomial interpolation, you need to specify an order, which is the degree of the polynomial fitted through the available data points. For example, order 2 fits a parabola-like curve, so the filled values can follow bends in the data that a straight line would miss. This method requires SciPy to be installed.

a.interpolate(method="polynomial", order=2)

If you pass order=1, the output matches linear interpolation on an evenly spaced index, because a polynomial of order 1 is a straight line.
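To see the effect of the order, the sketch below uses a made-up series that follows y = x**2 with one value removed. It assumes SciPy is installed, since pandas delegates polynomial interpolation to it.

```python
import numpy as np
import pandas as pd

# Hypothetical series that follows y = x**2, with the value at x = 3 missing.
squares = pd.Series([0.0, 1.0, 4.0, np.nan, 16.0, 25.0])

linear = squares.interpolate()  # straight line between 4 and 16
order_1 = squares.interpolate(method="polynomial", order=1)
order_2 = squares.interpolate(method="polynomial", order=2)

print(linear.loc[3])   # 10.0
print(order_1.loc[3])  # matches the linear result
print(order_2.loc[3])  # close to 9, the true value of 3**2
```

Because the data is curved, the order-2 estimate lands much nearer the true value than the straight-line estimate does.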

Interpolation Through Padding

Interpolation with padding simply means filling each missing value with the last valid value above it in the dataset. If the missing value is in the first row, this method cannot fill it, because there is no earlier value to copy. You can optionally pass limit, the maximum number of consecutive NaN values to fill.

So, if you are working on a real-world project and want every gap filled with the previous value, either omit limit or set it at least as large as the longest run of NaN values in the dataset.

# Forward-fill at most 2 consecutive NaN values; older pandas versions spell this a.interpolate(method="pad", limit=2)
a.ffill(limit=2)

The output is shown below.

0    0.0
1    1.0
2    1.0
3    3.0
4    4.0
5    5.0
6    7.0

The missing value at position 2 is replaced by 1.0, the value just before it.
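The effect of limit, and of a missing first row, shows up more clearly on a series with a longer gap. This sketch uses made-up values and the pandas ffill method, which performs the same padding.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a leading NaN, then a run of three NaN values.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most 2 consecutive NaN values with the last valid value.
padded = s.ffill(limit=2)

print(padded.tolist())  # [nan, 1.0, 1.0, 1.0, nan, 5.0]
```

The leading NaN stays because nothing precedes it, and the third NaN in the run stays because the limit of 2 was reached.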

Using Interpolation to Fill Missing Values in Pandas DataFrame

A DataFrame is a widely used pandas data structure that stores data in rows and columns. In data analysis we usually hold tabular data in a DataFrame. The dropna() function is commonly used to drop rows that contain null values, but a DataFrame can have many missing values spread across many columns, and dropping them all can discard a lot of data. So let us see how interpolation can fill in the missing values instead.

(Note: interpolate returns a new object. To keep the result, assign it back to a variable or pass inplace=True.)

import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
                   "B":[None, 3, 57, 3, None],
                   "C":[20, 16, None, 3, 8],
                   "D":[14, 3, None, None, 6]})
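Before filling anything, it helps to see how much data would be lost by dropping rows instead. The sketch below rebuilds the same data frame and compares dropna with a default interpolate call.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Missing values per column: A has 1, B has 2, C has 1, D has 2.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Dropping every row that contains a NaN keeps only one of the five rows.
rows_after_drop = len(df.dropna())
print(rows_after_drop)  # 1

# Default linear interpolation keeps all rows and leaves a single NaN
# (the first value of column B, which has nothing before it).
remaining_nans = int(df.interpolate().isna().sum().sum())
print(remaining_nans)  # 1
```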

Linear Interpolation in the Forwarding Direction

The linear method ignores the index, treats the values as equally spaced, and fills each missing value on the straight line between its neighboring known values. With limit_direction='forward', gaps are filled from top to bottom, so a missing value in the first row has nothing before it and is left as NaN. Let's apply DataFrame.interpolate to our data frame.

df.interpolate(method ='linear', limit_direction ='forward')

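In the output, every gap is filled except the first value of column B, which stays NaN because no earlier value exists. The trailing NaN in column B is filled with the last known value, 3.0.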

If you only want to interpolate a single column, select it first, as in the code below.

df['C'].interpolate(method="linear")
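Since interpolate returns a new object, the filled column has to be assigned back if you want to keep it. A minimal sketch on the same data frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Interpolate column C only and store the result back in the data frame.
df["C"] = df["C"].interpolate(method="linear")

print(df["C"].tolist())          # [20.0, 16.0, 9.5, 3.0, 8.0]
print(int(df["A"].isna().sum())) # 1, because the other columns are untouched
```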

Linear Interpolation in Backward Direction (bfill)

The method is the same here; only the direction changes. The fill now proceeds from the end of the data frame toward the start, a bottom-to-top approach.

df.interpolate(method ='linear', limit_direction ='backward')

In the output, the first value of column B is now filled with the next known value, 3.0, while the NaN in the last row of column B is left as NaN because nothing follows it.
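The two directions differ only in how they treat gaps at the edges, which column B illustrates well. The sketch below isolates that column and also shows limit_direction="both", an additional option of interpolate that fills gaps at either end.

```python
import numpy as np
import pandas as pd

# Column B from the data frame above: NaN in the first and last rows.
b = pd.Series([np.nan, 3.0, 57.0, 3.0, np.nan])

forward = b.interpolate(method="linear", limit_direction="forward")
backward = b.interpolate(method="linear", limit_direction="backward")
both = b.interpolate(method="linear", limit_direction="both")

print(forward.tolist())   # [nan, 3.0, 57.0, 3.0, 3.0]
print(backward.tolist())  # [3.0, 3.0, 57.0, 3.0, nan]
print(both.tolist())      # [3.0, 3.0, 57.0, 3.0, 3.0]
```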

Interpolation With Padding

As we saw above, padding accepts a limit on how many consecutive NaN values to fill. The longest run of NaN values in our data frame is 2 (in column D), so we set the limit to 2.

# Forward-fill at most 2 consecutive NaN values per column; older pandas versions spell this df.interpolate(method="pad", limit=2)
df.ffill(limit=2)

After running the above code, each missing value is replaced by the previous valid value in its column. The first value of column B stays NaN because nothing precedes it.

Filling Missing Values in Time-Series Data

Time-series (datetime) data is data recorded over time, and it often shows a trend or seasonality. For a missing value, it therefore makes sense to interpolate between the observations just before and just after its timestamp. Handling time-series data is a little different from handling ordinary data frames: mean imputation is usually a poor fit, because the overall mean ignores where the gap falls in time. Interpolation is a powerful way to fill in missing values in time-series data.

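# Ten hourly timestamps; float values so that NaN can be stored without changing the column dtype
df = pd.DataFrame({'Date': pd.date_range(start='2021-07-01', periods=10, freq='h'), 'Value': np.arange(10, dtype=float)})
# Blank out two consecutive readings (row labels 2 and 3, inclusive)
df.loc[2:3, 'Value'] = np.nan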

Syntax for Filling Missing Values in Forwarding and Backward Methods

The simplest way to fill the values is the same call we applied to a single column of the data frame.

df['Value'].interpolate(method="linear")

However, this approach ignores the Date column. With time-series data it makes more sense to fill missing values according to the timestamps, so we set the date column as the index and use method="time", which weights each estimate by the actual time gaps.

df.set_index('Date')['Value'].interpolate(method="time")

A similar call can backfill instead, which copies the next valid value backward into each gap.

df.set_index('Date')['Value'].bfill()
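With evenly spaced hourly data, position-based and time-based interpolation give the same answer, so the difference only appears when timestamps are irregular. The sketch below uses made-up readings at uneven times to compare the two, along with a backfill.

```python
import numpy as np
import pandas as pd

# Hypothetical readings at 00:00, 01:00 and 04:00; the 01:00 value is missing.
stamps = pd.to_datetime(["2021-07-01 00:00", "2021-07-01 01:00", "2021-07-01 04:00"])
readings = pd.Series([0.0, np.nan, 4.0], index=stamps)

by_position = readings.interpolate(method="linear")  # halfway between the neighbors
by_time = readings.interpolate(method="time")        # one hour into a four-hour gap
backfilled = readings.bfill()                        # copies the next valid value

print(by_position.iloc[1])  # 2.0
print(by_time.iloc[1])      # 1.0
print(backfilled.iloc[1])   # 4.0
```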

Conclusion

We have learned various ways to use the interpolate function in Python to fill in missing values in a Series as well as in a DataFrame. Data scientists and analysts benefit from knowing it well, as handling missing values is a routine part of their job. Interpolation is often a better way to fill in missing values than a simple mean, particularly for ordered data such as time series. I hope you now appreciate the power of interpolation and understand how to use it.

Key Takeaways

  • We can load data from Excel and CSV files into pandas and apply the interpolate function to it.
  • We can fill in missing values in both forward and backward directions.

Frequently Asked Questions

Q1. What is interpolation in AI?

Interpolation in AI helps fill in the gaps! It estimates missing data in images, sounds, or other information to make things smoother and more accurate for AI tasks.

Q2. What is the interpolation method for missing data?

There are multiple methods to interpolate missing data, like linear and polynomial interpolation.

Q3. What is the interpolation function in math?

Interpolation is like figuring out how much cookie dough you need between recipe amounts. It estimates values between known points.

Q4. What are the advantages of interpolation in Python?

Interpolation estimates unknown values that lie between known data points. In Python, pandas makes this a single method call that fills gaps without discarding rows, and it supports several methods, such as linear, polynomial, and time-based interpolation.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

I am a software engineer with a keen passion for data science. I love to learn and explore different data-related techniques and technologies. Writing articles gives me research skills and the ability to help others understand what I have learned. I aspire to grow as a prominent data architect through my profession, with technical content writing as a passion.

Responses From Readers

Christine Wathen

Hello, Can interpolation be used to fill missing values for Age, which is not time series data? I tried to fill the missing Age values with the mean, mode, and median, but when I do so, the distribution of the data changes.

Annabel

Hi, I have a column whose last 2 values are NaN. If I use interpolation, it just uses the third-to-last value to fill these 2. But I wanted to fill these two missing values as a continuing trend rather than with the same static value. How can I do it easily? Thank you.
