All you need to know about Polynomial Regression
This article was published as a part of the Data Science Blogathon
Hello, hope you are fine. In this article, we will study polynomial regression and implement it in Python on sample data. I assume you are already familiar with the simple linear regression algorithm; if not, please read our previous article first to get a basic understanding of linear regression, because polynomial regression is derived from the same concept with a few modifications to improve accuracy on non-linear data.
Table Of Contents
- Why Polynomial Regression?
- How does it overcome the problem of non-linear data?
- Why is it known as linear regression?
- Comparing polynomial and simple linear regression practically
- With one input variable
- With multiple input variables
- End Notes
Why Polynomial Regression?
Simple linear regression only works well when the relationship in the data is linear. If the data is non-linear, linear regression cannot draw a good best-fit line, and it fails in such conditions. Consider the diagram below, which shows a non-linear relationship together with the linear regression fit: the line performs poorly and does not come close to the actual pattern. Hence, we introduce polynomial regression to overcome this problem; it helps identify the curvilinear relationship between the independent and dependent variables.
How Polynomial Regression Overcomes the problem of Non-Linear data?
Polynomial regression is a form of linear regression. Because the relationship between the dependent and independent variables is non-linear, we add polynomial terms to the linear regression equation, and that turns it into polynomial regression.
Suppose we have X as the independent data and Y as the dependent data. In the preprocessing stage, before feeding the data to a model, we convert the input variables into polynomial terms of some chosen degree.
For example, if my input value is 35 and the degree of the polynomial is 2, I compute 35 to the power 0, 35 to the power 1, and 35 to the power 2, which gives 1, 35, and 1225. These extra terms let the model capture the non-linear relationship in the data.
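As a small sketch of the example above, here is how the value 35 expands into its degree-2 polynomial terms with scikit-learn's PolynomialFeatures. The variable names are only illustrative and are not used elsewhere in this article.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with a single input value of 35, shaped (n_samples, n_features)
X_example = np.array([[35.0]])

# degree=2 with the default include_bias=True yields x^0, x^1 and x^2
poly_example = PolynomialFeatures(degree=2)
X_example_poly = poly_example.fit_transform(X_example)

print(X_example_poly.tolist())  # [[1.0, 35.0, 1225.0]]
```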
The equation of polynomial becomes something like this.
$$y = a_0 + a_1 x_1 + a_2 x_1^2 + \dots + a_n x_1^n$$
The degree of the polynomial is a hyperparameter, and we need to choose it wisely. A high degree tends to overfit the data, while a low degree tends to underfit it, so we need to find the optimum value of the degree.
Why Polynomial Regression is called Polynomial Linear Regression?
If you look at the equation of polynomial regression carefully, you can see that the values of x and y are already given to us; the only unknowns we need to determine are the coefficients. Each coefficient appears with degree 1 only, so the equation is linear in the coefficients even though it is non-linear in x. The powers of x simply act as extra input columns of an ordinary linear regression. Hence, polynomial regression is also known as polynomial linear regression. That is the simple concept behind the name. I hope you got the point, right?
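To make this concrete, here is a minimal sketch (my own illustration, using only NumPy) that treats 1, x, and x^2 as three ordinary input columns and solves a plain least-squares problem. The result matches np.polyfit, which fits a polynomial directly. The seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.uniform(-3, 3, size=50)
y = 0.8 * x**2 + 0.9 * x + 2 + rng.normal(size=50)

# Design matrix with columns x^0, x^1, x^2: the model is linear in a0, a1, a2
A = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares, the same machinery as multiple linear regression
coef_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# np.polyfit returns the highest power first, so reverse it for comparison
coef_polyfit = np.polyfit(x, y, deg=2)[::-1]

print("least squares on [1, x, x^2]:", coef_lstsq)
print("np.polyfit (reversed):       ", coef_polyfit)
```

Both lines print the same three coefficients (up to floating-point rounding), because fitting a polynomial is a linear regression on the transformed columns.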
Comparing Polynomial and simple Linear Regression Practically
Now we know how polynomial regression works and how it helps to build a model over non-linear data. Let’s compare both algorithms practically and see the results.
First, I will generate the data using an equation of the form ax^2 + bx + c. Then we will apply simple linear regression to it, and after that polynomial regression, which makes it easy to compare the performance of both algorithms.
We will start with only one input column and one output column, and after getting a basic understanding we will try it on higher-dimensional data.
Polynomial Regression with One Variable
Let’s get our hands dirty with some practical implementation.
Step-1) Import all the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
Step-2) Create and visualize the data
X = 6 * np.random.rand(200, 1) - 3
y = 0.8 * X**2 + 0.9 * X + 2 + np.random.randn(200, 1)
# equation used -> y = 0.8x^2 + 0.9x + 2

# visualize the data
plt.plot(X, y, 'b.')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
We have added some random noise to the data so that the points do not lie exactly on the curve, which makes the example closer to real data.
Step-3) Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
Step-4) Apply simple linear regression
Now we will analyze the predictions after fitting simple linear regression. We can see how poorly the model performs; it is not capable of estimating the points.
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
print(r2_score(y_test, y_pred))
If you look at the score, it will be low, somewhere near 15 to 20 percent (the exact value varies between runs because the data is random), which is very poor. If you plot the prediction line, it will look the same as the one we saw above, which is not capable of capturing the shape of the data.
plt.plot(x_train, lr.predict(x_train), color="r")
plt.plot(X, y, "b.")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
Step-5) Apply Polynomial Regression
Now we will convert the input to polynomial terms using degree 2, because the equation we used to generate the data is quadratic (its intercept is 2). While dealing with real-world problems, we choose the degree by the hit-and-trial method.
# applying polynomial regression with degree 2
# include_bias=True adds the constant x^0 column to the features
poly = PolynomialFeatures(degree=2, include_bias=True)
x_train_trans = poly.fit_transform(x_train)
x_test_trans = poly.transform(x_test)

lr = LinearRegression()
lr.fit(x_train_trans, y_train)
y_pred = lr.predict(x_test_trans)
print(r2_score(y_test, y_pred))
After converting to polynomial terms, we fit the linear regression, which now works as polynomial regression. If you print the x_train values and the transformed train values, you will see the 3 polynomial terms. The model now performs decently well, and you can check this from the coefficients and the intercept (lr.coef_ and lr.intercept_). Our coefficient of x was 0.9 and the model estimated 0.88; the intercept was 2 and the model gave 1.9. Both are very close to the original values, so the model can be said to generalize well.
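Here is a minimal, self-contained sketch of that check. It regenerates data with the same equation, so the exact numbers will differ from the ones quoted above; the fixed seed and the `_demo` names are my own assumptions. I also set include_bias=False here so that the intercept is reported only once, by LinearRegression itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)  # fixed seed, an assumption for repeatability
X_demo = 6 * rng.random((200, 1)) - 3
y_demo = 0.8 * X_demo**2 + 0.9 * X_demo + 2 + rng.normal(size=(200, 1))

# include_bias=False: LinearRegression estimates the intercept on its own
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
model_demo = LinearRegression().fit(poly_demo.fit_transform(X_demo), y_demo)

# coef_ has shape (1, 2) because y_demo is a column vector
linear_coef, quadratic_coef = model_demo.coef_[0]
intercept = model_demo.intercept_[0]

print("coefficient of x   (true 0.9):", linear_coef)
print("coefficient of x^2 (true 0.8):", quadratic_coef)
print("intercept          (true 2.0):", intercept)
```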
If we visualize the predicted line across the training data points, you can see how well it identifies the non-linear relationship in the data.
X_new = np.linspace(-3, 3, 200).reshape(200, 1)
X_new_poly = poly.transform(X_new)
y_new = lr.predict(X_new_poly)

plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.plot(x_train, y_train, "b.", label='Training points')
plt.plot(x_test, y_test, "g.", label='Testing points')
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
Playing with Polynomial degree
Now we will design a function that helps you find the right value of the degree. It wraps all the preprocessing steps we have done above and draws the final prediction plot. All you need to pass is the degree, and it will build a model and plot the fitted curve for that degree. Inside the function we create a pipeline of the preprocessing steps, which keeps the process streamlined.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def polynomial_regression(degree):
    # evenly spaced inputs on which to draw the fitted curve
    X_new = np.linspace(-3, 3, 100).reshape(100, 1)

    # pipeline: polynomial terms -> scaling -> linear regression
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("std_scaler", StandardScaler()),
        ("lin_reg", LinearRegression()),
    ])
    model.fit(X, y)
    y_newbig = model.predict(X_new)

    # plotting the prediction line over the train and test points
    plt.plot(X_new, y_newbig, 'r', label="Degree " + str(degree), linewidth=2)
    plt.plot(x_train, y_train, "b.", linewidth=3)
    plt.plot(x_test, y_test, "g.", linewidth=3)
    plt.legend(loc="upper left")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.axis([-3, 3, 0, 10])
    plt.show()
When we run the function with high degrees like 10, 15, or 20, the model starts to overfit the data. The prediction line gradually loses its original shape and relies more and more on individual training points, so wherever the training points shift a little, the line bends to catch them.
This is the problem with a high-degree polynomial, which I wanted to show you practically, so it is necessary to choose an optimum value of the degree. I recommend that you try different degrees and analyze the results.
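One way to compare degrees without looking at plots is to print the train and test R² score for each degree. The sketch below is my own illustration and rests on a few assumptions: a fixed seed, the same generating equation as above, and an arbitrary list of degrees. Typically the train score keeps creeping up as the degree grows, while the test score stops improving once the degree passes the true one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
X_all = 6 * rng.random((200, 1)) - 3
y_all = (0.8 * X_all**2 + 0.9 * X_all + 2 + rng.normal(size=(200, 1))).ravel()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=2
)

scores = {}  # degree -> (train R^2, test R^2)
for degree in (1, 2, 5, 10, 20):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    scores[degree] = (
        r2_score(y_tr, model.predict(X_tr)),
        r2_score(y_te, model.predict(X_te)),
    )
    print(f"degree {degree:2d}: train R2 = {scores[degree][0]:.3f}, "
          f"test R2 = {scores[degree][1]:.3f}")
```

Picking the lowest degree after which the test score no longer improves is a reasonable starting point; cross-validation would give a more stable answer than a single split.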
Polynomial Regression with Multiple columns
We have seen polynomial regression with one variable. Most of the time, though, the input data has many columns, so how do we apply polynomial regression and visualize the result in 3-dimensional space? It can feel like a daunting task for beginners, so let’s work through it and understand how to perform polynomial regression in 3-D space.
Step-1) Create the data
I am taking 2 input columns and one output column. The approach with more columns is the same.
# 3D polynomial regression
x = 7 * np.random.rand(100, 1) - 2.8
y = 7 * np.random.rand(100, 1) - 2.8
z = x**2 + y**2 + 0.2*x + 0.2*y + 0.1*x*y + 2 + np.random.randn(100, 1)
Let’s visualize the data in 3-D space.
import plotly.express as px

df = px.data.iris()
fig = px.scatter_3d(df, x=x.ravel(), y=y.ravel(), z=z.ravel())
fig.show()
Step-2) Applying Linear Regression
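First, let’s estimate the results with simple linear regression for better understanding and comparison. Two NumPy helpers are useful for building the grid of points on which the fitted surface is evaluated:
- NumPy meshgrid converts two 1-D coordinate vectors into a 2-D coordinate grid, which lets us evaluate the model over a whole surface in 3-D instead of along a line in 2-D.
- NumPy vstack stacks arrays vertically (row-wise). This is equivalent to concatenating along axis 0, with 1-D arrays first treated as single rows.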
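The steps described above can be sketched as follows. This is my reconstruction under stated assumptions, not necessarily the exact original code: the names x_input, y_input, final, and z_final are taken from the plotting and prediction code that appears later in this article, the 10 x 10 grid size is inferred from the later reshape(10, 10), and X_lin, xGrid, and yGrid are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data recipe as in Step-1
x = 7 * np.random.rand(100, 1) - 2.8
y = 7 * np.random.rand(100, 1) - 2.8
z = x**2 + y**2 + 0.2*x + 0.2*y + 0.1*x*y + 2 + np.random.randn(100, 1)

# Put the two inputs side by side: shape (100, 2), one (x, y) pair per row
X_lin = np.hstack([x, y])
lr = LinearRegression()
lr.fit(X_lin, z)

# 10 evenly spaced values along each axis (grid size is an assumption)
x_input = np.linspace(x.min(), x.max(), 10)
y_input = np.linspace(y.min(), y.max(), 10)

# meshgrid turns the two 1-D vectors into two 10 x 10 coordinate grids
xGrid, yGrid = np.meshgrid(x_input, y_input)

# vstack puts the flattened grids in two rows; .T gives 100 (x, y) pairs
final = np.vstack((xGrid.ravel(), yGrid.ravel())).T

# Predict at every grid point and reshape back to the grid for a surface plot
z_final = lr.predict(final).reshape(10, 10)
print(final.shape, z_final.shape)  # (100, 2) (10, 10)
```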
Let’s visualize the prediction of linear regression in 3-D space.
import plotly.graph_objects as go

fig = px.scatter_3d(df, x=x.ravel(), y=y.ravel(), z=z.ravel())
fig.add_trace(go.Surface(x=x_input, y=y_input, z=z_final))
fig.show()
Step-3) Estimate results using Polynomial Regression
Now we will transform the inputs to polynomial terms and look at the powers.
# stack x and y side by side so that each row is one (x, y) sample
X_multi = np.hstack([x, y])
poly = PolynomialFeatures(degree=30)
X_multi_trans = poly.fit_transform(X_multi)

print("Input", poly.n_features_in_)
print("Output", poly.n_output_features_)
print("Powers\n", poly.powers_)
After running the above code, you will get the powers of both x and y for every generated term. Each row reads as one term: x power 0 and y power 0, then x power 1 and y power 0, and so on. Let’s apply the regression to these polynomial terms.
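To see how to read the powers, here is a small sketch with degree 2 and a single sample (x = 2, y = 3). The values and variable names are chosen only for illustration. Each row of powers_ gives the exponents of x and y for one output column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two input columns: x = 2 and y = 3
X_two = np.array([[2.0, 3.0]])

poly_two = PolynomialFeatures(degree=2)
X_two_trans = poly_two.fit_transform(X_two)

# Rows are [power of x, power of y] for the terms 1, x, y, x^2, x*y, y^2
print(poly_two.powers_.tolist())
# [[0, 0], [1, 0], [0, 1], [2, 0], [1, 1], [0, 2]]

print(X_two_trans.tolist())
# [[1.0, 2.0, 3.0, 4.0, 6.0, 9.0]]
```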
lr = LinearRegression()
lr.fit(X_multi_trans, z)

# `final` holds the 100 grid points from Step-2; transform them the same way
X_test_multi = poly.transform(final)
z_final = lr.predict(X_test_multi).reshape(10, 10)
Now when we visualize the results of polynomial regression, you can see how well the fitted surface follows the data.
The plot looks beautiful, but in some places the surface goes sharply up and down, which means the model is overfitting the data there. Finding a model that generalizes takes some time, and you have to use the hit-and-trial method on the degree.
End Notes
Polynomial regression is a form of linear regression, and a special case of multiple linear regression, that models the relationship as an nth-degree polynomial. Polynomial regression is sensitive to outliers, so the presence of even one or two outliers can badly affect its performance.
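The sensitivity to outliers can be illustrated with a minimal sketch. The setup is my own assumption: 50 noise-free points from the equation used earlier, fitted with np.polyfit, and then a single corrupted point at the right edge.

```python
import numpy as np

x_clean = np.linspace(-3, 3, 50)
y_clean = 0.8 * x_clean**2 + 0.9 * x_clean + 2  # noise-free on purpose

# Fit on clean data: recovers the true coefficients (highest power first)
coef_clean = np.polyfit(x_clean, y_clean, deg=2)

# Corrupt a single point at the right edge and fit again
y_outlier = y_clean.copy()
y_outlier[-1] += 50
coef_outlier = np.polyfit(x_clean, y_outlier, deg=2)

print("clean fit:       ", coef_clean)
print("with one outlier:", coef_outlier)
print("largest coefficient shift:", np.max(np.abs(coef_outlier - coef_clean)))
```

A single bad point out of 50 is enough to move the fitted coefficients noticeably, so it is worth inspecting and cleaning the data before fitting.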
I hope you got the intuition and the practical implementation behind the algorithm in a simple way. If you have any queries, please post them in the comments section below. If you like my article, then give my other articles a read too.
About the Author
I am pursuing my bachelor’s in computer science. I am very fond of data science and big data. I love to work with data and learn new technologies. Please feel free to connect with me on LinkedIn.