Master Polynomial Regression With Easy-to-Follow Tutorials
In this article, we will study the polynomial regression model and implement it in Python on sample data. I assume you are already familiar with simple linear regression and multiple linear regression. If not, please visit our previous article first to get a basic understanding of linear regression, because polynomial regression is derived from the same concept, with a few modifications that let it fit non-linear data more accurately.
- Explore the concept of polynomial regression in machine learning.
- Where and how to use polynomial regression.
- Comparison of polynomial and simple linear regression.
This article was published as a part of the Data Science Blogathon.
Table of contents
- Why Polynomial Regression?
- How Does Polynomial Regression Handle Non-Linear Data?
- Why Is Polynomial Regression Called Polynomial Linear Regression?
- Linear Regression Vs. Polynomial Regression
- Polynomial Regression With One Variable
- Playing With a Polynomial Degree
- Polynomial Regression With Multiple columns
- Frequently Asked Questions
Why Polynomial Regression?
A simple linear regression algorithm only works when the relationship in the data is linear. If the data is non-linear, linear regression cannot draw a best-fit line, and simple regression analysis fails. On data with a non-linear relationship, the line fitted by linear regression does not perform well, meaning it does not come close to reality. Hence, we introduce polynomial regression to overcome this problem; it helps identify the curvilinear relationship between independent and dependent variables.
How Does Polynomial Regression Handle Non-Linear Data?
Polynomial regression is a form of linear regression: when the relationship between the dependent and independent variables is non-linear, we add polynomial terms to the linear regression equation, which turns it into polynomial regression.
In polynomial regression, the relationship between the dependent variable and the independent variable is modeled as an nth-degree polynomial function. When the polynomial is of degree 2, it is called a quadratic model; when the degree of a polynomial is 3, it is called a cubic model, and so on.
Suppose we have a dataset where X is the independent variable and Y is the dependent variable. Before feeding the data to a model, in the preprocessing stage, we convert the input variables into polynomial terms of some degree.
For example, if the input value is 35 and the degree of the polynomial is 2, we compute 35 to the power 0, 35 to the power 1, and 35 to the power 2. These terms help the model capture the non-linear relationship in the data.
The equation of polynomials becomes something like this.
$y = a_0 + a_1 x_1 + a_2 x_1^2 + \dots + a_n x_1^n$
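The expansion of the input value 35 described above can be sketched with scikit-learn's `PolynomialFeatures`. The names `X_single` and `poly_demo` are illustrative only and do not appear elsewhere in this article.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# A single sample with one input feature whose value is 35.
X_single = np.array([[35.0]])

# degree=2 with include_bias=True produces the columns x^0, x^1 and x^2.
poly_demo = PolynomialFeatures(degree=2, include_bias=True)
X_single_poly = poly_demo.fit_transform(X_single)

# The three columns hold 1, 35 and 1225.
print(X_single_poly)
```

Each original input column is replaced by these polynomial terms before the linear model is fitted.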
The degree to use is a hyperparameter, and we need to choose it wisely. A high degree tends to overfit the data, while a degree that is too small tends to underfit, so we need to find the optimum degree. Polynomial regression models are usually fitted with the method of least squares. Under the conditions of the Gauss-Markov theorem, the least-squares estimator has the minimum variance among all linear unbiased estimators of the coefficients.
Why Is Polynomial Regression Called Polynomial Linear Regression?
If you look at the polynomial regression equation carefully, you can see that we are estimating the relationship between the coefficients and y. The values of x and y are already given; we only need to determine the coefficients, and each coefficient appears with degree 1. The model is therefore linear in its coefficients, just like linear regression. Hence, polynomial regression is also known as polynomial linear regression: the equation is polynomial in x but linear in the coefficients. That is the simple idea behind the name.
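To make the "linear in the coefficients" point concrete, here is a minimal sketch that fits a quadratic by solving an ordinary least-squares problem on the columns 1, x and x². The data, the seed and the names `A`, `coeffs` and `polyfit_coeffs` are assumptions made for illustration.

```python
import numpy as np

# Illustrative data: a quadratic with a little noise (values are assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 2.0 + 0.9 * x + 0.8 * x**2 + rng.normal(scale=0.1, size=50)

# Design matrix whose columns are x^0, x^1 and x^2.
A = np.vander(x, N=3, increasing=True)

# Ordinary linear least squares: the unknowns are the coefficients a0, a1, a2.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# np.polyfit solves the same problem; it returns the highest power first.
polyfit_coeffs = np.polyfit(x, y, deg=2)[::-1]

print(coeffs)
print(polyfit_coeffs)
```

Both calls should recover nearly the same coefficients, because fitting a polynomial is a linear least-squares problem on transformed inputs.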
Linear Regression Vs. Polynomial Regression
Now we know how polynomial regression works and helps to build a model over non-linear data. Let’s compare both algorithms practically and see the results.
First, we will generate the data using some equation ax^2 + bx + c, and then apply simple linear regression to it to form a linear equation. Then we will apply polynomial regression on top of it, which will make an easy comparison between the practical performance of both algorithms.
Initially, we will try it with only one input column and one output column. After having a brief understanding we will try it on high-dimensional data.
Polynomial Regression With One Variable
Let’s get our hands dirty with some practical implementation.
Step 1: Import all the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
Step 2: Create and visualize the data
We add some random noise to the data so that it resembles real-world data instead of following the equation exactly.
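A minimal sketch of this step is given here, since the data-generation code is not shown. It assumes 200 samples in the range -3 to 3 (the range used by the plots later in this article) and the equation 0.8x² + 0.9x + 2. The coefficient 0.9 and the intercept 2 come from the discussion in Step 5, while the 0.8, the sample count, the seed and the helper `plot_data` are assumptions.

```python
import numpy as np

np.random.seed(2)  # assumed seed, only for reproducibility

# 200 input values drawn uniformly from [-3, 3) as a single column.
X = 6 * np.random.rand(200, 1) - 3

# y = a*x^2 + b*x + c plus Gaussian noise; a = 0.8 is an assumption.
y = 0.8 * X**2 + 0.9 * X + 2 + np.random.randn(200, 1)


def plot_data(X, y):
    """Scatter-plot the generated points (hypothetical helper)."""
    import matplotlib.pyplot as plt

    plt.plot(X, y, "b.")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.show()
```

Calling `plot_data(X, y)` draws the noisy parabola that the following steps try to fit.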
Step 3: Split data in the train and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
Step 4: Apply simple linear regression
Now we will analyze the predictions after fitting simple linear regression. We can see how poorly the model performs; it is not capable of estimating the points.
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
print(r2_score(y_test, y_pred))
If you look at the score, it will be around 15 to 20 percent, which is too low. If you plot the prediction line, it will be the same as we saw above: a straight line that cannot follow the curve in the data.
plt.plot(x_train, lr.predict(x_train), color="r")
plt.plot(X, y, "b.")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
Step 5: Apply polynomial regression
Now we will convert the input to polynomial terms using degree 2, because the equation we used to generate the data is quadratic. When dealing with real-world problems, we choose the degree by the hit-and-trial method.
# Applying polynomial regression with degree 2
# include_bias=True adds the constant (x^0) column to the transformed features
poly = PolynomialFeatures(degree=2, include_bias=True)
x_train_trans = poly.fit_transform(x_train)
x_test_trans = poly.transform(x_test)

lr = LinearRegression()
lr.fit(x_train_trans, y_train)
y_pred = lr.predict(x_test_trans)
print(r2_score(y_test, y_pred))
After converting to polynomial terms, we fit the linear regression, which now works as polynomial regression. If you print x_train and its transformed version, you will see the 3 polynomial terms for each value. The model now performs decently well. If you look at the coefficients and the intercept, our coefficient was 0.9 and the model estimated 0.88, and the intercept was 2 and the model estimated 1.9, which is very close to the original, so the model can be said to generalize well.
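The inspection described above can be sketched as follows. The block is self-contained, so it regenerates data under the assumption that it follows 0.8x² + 0.9x + 2 plus noise (the 0.8 is an assumption), and the names ending in `_demo` are illustrative. The exact numbers printed depend on the random data and will not match the 0.88 and 1.9 quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed data: quadratic with Gaussian noise, 200 samples in [-3, 3).
rng = np.random.default_rng(2)
X_demo = 6 * rng.random((200, 1)) - 3
y_demo = 0.8 * X_demo**2 + 0.9 * X_demo + 2 + rng.normal(size=(200, 1))

poly_demo = PolynomialFeatures(degree=2, include_bias=True)
X_demo_trans = poly_demo.fit_transform(X_demo)

# One original value and its three polynomial terms: 1, x and x^2.
print(X_demo[0])
print(X_demo_trans[0])

lr_demo = LinearRegression().fit(X_demo_trans, y_demo)

# coef_ has one entry per polynomial term; intercept_ holds the constant.
print(lr_demo.coef_)
print(lr_demo.intercept_)
```

Because `LinearRegression` fits its own intercept, the constant column added by `include_bias=True` is redundant and typically ends up with a zero coefficient; the estimated constant is reported in `intercept_`.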
If we visualize the predicted line across the training data points, we can see how well it identifies the non-linear relationship in data.
X_new = np.linspace(-3, 3, 200).reshape(200, 1)
X_new_poly = poly.transform(X_new)
y_new = lr.predict(X_new_poly)
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.plot(x_train, y_train, "b.", label='Training points')
plt.plot(x_test, y_test, "g.", label='Testing points')
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
Playing With a Polynomial Degree
Now we will write a function that helps you find the right value for the degree. It applies all the preprocessing steps we performed above and plots the resulting prediction curve. All you need to do is pass the degree, and it will build a model and plot the graph for that degree. Here we create a pipeline of preprocessing steps, which streamlines the process.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def polynomial_regression(degree):
    # Points at which the prediction curve is drawn
    X_new = np.linspace(-3, 3, 100).reshape(100, 1)

    # Pipeline: polynomial expansion -> scaling -> linear regression
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    model = Pipeline([
        ("poly_features", polybig_features),
        ("std_scaler", std_scaler),
        ("lin_reg", lin_reg),
    ])
    model.fit(X, y)
    y_newbig = model.predict(X_new)

    # Plot the prediction curve over the training and testing points
    plt.plot(X_new, y_newbig, 'r', label="Degree " + str(degree), linewidth=2)
    plt.plot(x_train, y_train, "b.", linewidth=3)
    plt.plot(x_test, y_test, "g.", linewidth=3)
    plt.legend(loc="upper left")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.axis([-3, 3, 0, 10])
    plt.show()
When we run the function with high degrees such as 10, 15, and 20, the model starts to overfit the data: the prediction curve gradually loses its overall shape and instead bends to pass through individual training points, so any small change in the training points changes the curve.
This is the problem with a high-degree polynomial, which I wanted to show you practically, so it is necessary to choose an optimum value for the degree. I recommend you try different degrees and analyze the results.
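One way to compare degrees numerically, rather than only by eye, is to record the R² score on the training and testing sets for each degree. The sketch below is self-contained and regenerates data under the assumption that it follows 0.8x² + 0.9x + 2 plus noise; the list of degrees and the names `X_demo`, `y_demo` and `scores` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed data: quadratic with Gaussian noise, 200 samples in [-3, 3).
rng = np.random.default_rng(0)
X_demo = 6 * rng.random((200, 1)) - 3
y_demo = (0.8 * X_demo**2 + 0.9 * X_demo + 2 + rng.normal(size=(200, 1))).ravel()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

scores = {}
for degree in (1, 2, 5, 10, 20):
    # Same three steps as the pipeline above: expand, scale, fit.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    scores[degree] = (train_r2, test_r2)
    print(f"degree={degree:2d}  train R2={train_r2:.3f}  test R2={test_r2:.3f}")
```

A degree whose training score keeps rising while its testing score stalls or drops is a sign of overfitting; cross-validation would give a more stable comparison than a single split.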
Polynomial Regression With Multiple columns
We have seen polynomial regression with one variable. Most of the time, the input data will have many columns, so how do we apply polynomial regression and visualize the result in 3-dimensional space? This often feels like a hectic task for beginners, so let’s work through it and understand how to perform polynomial regression in 3-D space.
Step 1: Creating a dataset
I am taking 2 input columns and one output column. The approach with more columns is the same.
# 3D polynomial regression
x = 7 * np.random.rand(100, 1) - 2.8
y = 7 * np.random.rand(100, 1) - 2.8
z = x**2 + y**2 + 0.2*x + 0.2*y + 0.1*x*y + 2 + np.random.randn(100, 1)
Let’s visualize the data in 3-D space using a 3-D scatter plot (Plotly library).
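import plotly.express as px

df = px.data.iris()  # sample dataset loaded by the original code; this plot uses only x, y and z
# Pass the arrays directly: their length (100) differs from the number of rows in df
fig = px.scatter_3d(x=x.ravel(), y=y.ravel(), z=z.ravel())
fig.show()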
Step 2: Applying linear regression
First, let’s estimate the results with simple linear regression for better understanding and comparison.
- A NumPy mesh grid converts two coordinate vectors into a coordinate grid; here we use it to lay out the points at which the 3-D surface is drawn.
- NumPy vstack stacks arrays vertically (row-wise). For arrays with two or more dimensions, this is equivalent to concatenating along axis 0.
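The code for this step is not shown, so here is a minimal sketch of how the two helpers above could be combined. The names `lr`, `x_input`, `y_input`, `final` and `z_final` are taken from the snippets that follow in this article; how they are built here, the 10-point grid per axis, the use of `np.hstack` to pair the inputs and the seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # assumed seed, only for reproducibility

# Same synthetic data as in Step 1.
x = 7 * np.random.rand(100, 1) - 2.8
y = 7 * np.random.rand(100, 1) - 2.8
z = x**2 + y**2 + 0.2*x + 0.2*y + 0.1*x*y + 2 + np.random.randn(100, 1)

# Plain linear regression on the two input columns, one (x, y) pair per row.
lr = LinearRegression()
lr.fit(np.hstack((x, y)), z)

# Ten evenly spaced values per axis, expanded into a 10 x 10 coordinate grid.
x_input = np.linspace(x.min(), x.max(), 10)
y_input = np.linspace(y.min(), y.max(), 10)
xGrid, yGrid = np.meshgrid(x_input, y_input)

# vstack puts the two flattened grids on top of each other, shape (2, 100);
# transposing gives one grid point per row, shape (100, 2).
final = np.vstack((xGrid.ravel().reshape(1, 100),
                   yGrid.ravel().reshape(1, 100))).T

# Predicted height at every grid point, reshaped back to the grid layout.
z_final = lr.predict(final).reshape(10, 10)
```

With this layout, `z_final[i, j]` is the prediction at `x_input[j]` and `y_input[i]`, which is the orientation a Plotly surface trace expects.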
Let’s visualize the prediction of linear regression in 3-D space.
import plotly.graph_objects as go

# Scatter the data points, then overlay the predicted surface
fig = px.scatter_3d(x=x.ravel(), y=y.ravel(), z=z.ravel())
fig.add_trace(go.Surface(x=x_input, y=y_input, z=z_final))
fig.show()
Step 3: Estimating results using polynomial regression
Now we will transform the inputs to polynomial terms and look at the powers.
# Place the two (100, 1) columns side by side so that each row is one (x, y) pair
X_multi = np.hstack((x, y))
poly = PolynomialFeatures(degree=30)
X_multi_trans = poly.fit_transform(X_multi)
# n_features_in_ replaces n_input_features_, which recent scikit-learn versions removed
print("Input", poly.n_features_in_)
print("Output", poly.n_output_features_)
print("Powers\n", poly.powers_)
After running the above code, you will get the powers of both x and y. Each row is a pair of exponents: x power 0 and y power 0, x power 1 and y power 0, and so on. Let’s apply the regression to these polynomial terms.
lr = LinearRegression()
lr.fit(X_multi_trans, z)
# final is the (100, 2) array of grid points used for the surface in Step 2
X_test_multi = poly.transform(final)
# Predict on the transformed grid points, not on the training data
z_final = lr.predict(X_test_multi).reshape(10, 10)
Now when we visualize the results of Polynomial regression, we can see how well the contour has plotted.
The plot looks beautiful. We can see that in some places the surface goes up and down, meaning it is overfitting the data there. So it takes some time to find a degree that generalizes well, and you have to use the hit-and-trial method.
I hope you now understand the intuition and practical implementation behind the algorithm. In this tutorial, we learned that polynomial regression is a form of linear regression, and more precisely a special case of multiple linear regression, which estimates the relationship as an nth-degree polynomial. Polynomial regression is sensitive to outliers, so even one or two outliers can badly affect its performance.
- A polynomial regression model is a machine learning model that can capture non-linear relationships between variables by fitting a non-linear regression line, which may not be possible with simple linear regression.
- It is used when linear regression models may not adequately capture the complexity of the relationship.
- It can be useful in various fields, such as finance, physics, engineering, and social sciences, where there may be nonlinear relationships between variables.
Frequently Asked Questions
A. A polynomial model is a type of regression model in which the relationship between the dependent variable and the independent variable(s) is modeled as an nth-degree polynomial function. In other words, instead of fitting a straight line (as in linear regression), a curve fits the data.
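A. The degrees of freedom in a polynomial regression with degree ‘d’ are equal to the number of coefficients that need to be estimated minus one. With a single input variable there are d + 1 coefficients, including the intercept, so the model has d degrees of freedom.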
A. Although polynomial regression is good for modeling a non-linear relationship between the independent and dependent variables, it has some limitations. Here are some of them:
– Overfitting: Polynomial regression models can easily become overfit to the data, especially when using high-degree polynomials.
– Nonlinear relationships: Polynomial regression models are only appropriate for modeling nonlinear relationships that can be approximated by a polynomial curve.
A. R-squared in polynomial regression is a statistical measure that indicates how well a polynomial regression model fits the data. It represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Higher R-squared values indicate a better fit.
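As a small illustration of the R-squared definition above, the score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The four values below are made-up numbers, and the names `y_true` and `y_hat` are illustrative.

```python
import numpy as np

# Made-up observed values and model predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: variation the model fails to explain.
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: variation of the observations around their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.949
```

This is the same quantity that `r2_score` from scikit-learn reports in the steps above.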
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.