## Introduction

As a beginner in the world of data science, the first algorithm I was introduced to was Linear Regression. I applied it to different datasets and noticed both its advantages and limitations.

It assumed a linear relationship between the dependent and independent variables, which was rarely the case in reality. As an improvement over this model, I tried Polynomial Regression, which generated better results most of the time. But using Polynomial Regression on datasets with high variability is likely to result in over-fitting.

*Source: Pingax*

My model always became too flexible, which does not work well with unseen data. I then came across another non-linear approach known as **Regression Splines**. It uses a combination of linear/polynomial functions to fit the data.

**In this article, we will go through some basics of linear and polynomial regression and study in detail the meaning of splines and their implementation in Python**.

*Note: To fully understand the concepts covered in this article, knowledge of linear and polynomial regression is required*. You can learn more about them here.

Let’s get started!

## Table of Contents

- Understanding the Data
- Quick Review of Linear Regression
- Polynomial Regression: Improvement over Linear Regression
- Walk-through of Regression Splines along with its Implementation
  - Piecewise Step Functions
  - Basis Functions
  - Piecewise Polynomials
  - Constraints and Splines
  - Cubic and Natural Cubic Splines
  - Choosing the Number and Locations of the Knots
- Comparison of Regression Splines with Polynomial Regression

## Understanding the Data

To understand the concepts, we will work on the wage prediction dataset, which you can download here (it has been taken from the popular book “Introduction to Statistical Learning”).

Our dataset contains information like the ID, year, age, sex, marital status, race, education, region, job class, health, health insurance, log of wage and wage of various employees. **In order to focus on spline regression in detail, I will use only ‘age’ as the independent variable to predict the wage (dependent variable).**

Let’s start working on the data.

```
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

# read data_set
data = pd.read_csv("Wage.csv")
data.head()
```

```
data_x = data['age']
data_y = data['wage']

# Dividing data into train and validation datasets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y,
                                                      test_size=0.33, random_state=1)

# Visualize the relationship b/w age and wage
import matplotlib.pyplot as plt
plt.scatter(train_x, train_y, facecolor='None', edgecolor='k', alpha=0.3)
plt.show()
```

What are your thoughts on the above scatter plot? Is it positively, negatively or not correlated at all? Please share your thoughts in the comments section below.

## Introduction to Linear Regression

Linear regression is the simplest and most widely used statistical technique for predictive modelling. It is a supervised learning algorithm for solving regression-based tasks.

It is called a linear model as it establishes a linear relationship between the dependent and independent variables. It basically gives us a linear equation like the one below where we have our features as independent variables with coefficients:
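In standard notation, with p features, that equation is:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
```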

Here, Y is our dependent variable, the X’s are the independent variables, and the betas are the coefficients. Coefficients are the weights assigned to the features, and they signify the importance of each feature. For example, if the outcome depends far more on one feature (X1) than on any other, the coefficient of X1 will have a higher magnitude than the coefficients of the other features.

So, let’s try to understand linear regression with only one feature, i.e., only one independent variable. It is called Simple Linear Regression. Therefore, our equation becomes,
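With a single feature, the equation reduces to:

```latex
Y = \beta_0 + \beta_1 X + \epsilon
```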

As we are using only ‘age’ to predict the ‘wages’ of the employees, **we will implement simple linear regression on the training dataset and calculate the error (RMSE) on the validation dataset.**

```
from sklearn.linear_model import LinearRegression

# Fitting linear regression model
x = train_x.values.reshape(-1, 1)
model = LinearRegression()
model.fit(x, train_y)
print(model.coef_)       # -> array([0.72190831])
print(model.intercept_)  # -> 80.65287740759283
```

```
# Prediction on validation dataset
valid_x = valid_x.values.reshape(-1, 1)
pred = model.predict(valid_x)

# Visualisation: 70 evenly spaced points between min and max of valid_x
xp = np.linspace(valid_x.min(), valid_x.max(), 70)
xp = xp.reshape(-1, 1)
pred_plot = model.predict(xp)

plt.scatter(valid_x, valid_y, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(xp, pred_plot)
plt.show()
```

We can now calculate the RMSE on the predictions.

```
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(valid_y, pred))
print(rms)  # -> 40.436
```

We can infer from the above graph that linear regression is not capturing all of the available signal, and is not the best method for this wage prediction problem.

Although linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference, they **have significant limitations in terms of predictive power**. This is because they assume a linear relationship between the dependent and independent variables, which is almost always an approximation, and sometimes a poor one.

In the other methods we will see below, we will set aside the linearity assumption while still attempting to maintain as much interpretability as possible. We will do this by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines.

## Improvement over Linear Regression: Polynomial Regression

Consider these visualisations –

The plots above seem to capture far more of the signal between wage and age than the linear plot did. They are not linear in shape, so they use a non-linear equation instead of a linear one to establish the relationship between age and wage. **This type of regression technique, which uses a non-linear function, is called Polynomial regression.**

Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses the three variables X, X², and X³ as predictors. This approach provides a simple way to obtain a non-linear fit to the data.

**The standard method to extend linear regression to a non-linear relationship between the dependent and independent variables, has been to replace the linear model with a polynomial function.**
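In standard notation, a degree-d polynomial model is:

```latex
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i
```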

As we increase the power value, the curve obtained contains high oscillations which will lead to shapes that are over-flexible. **Such curves lead to over-fitting.**

# Generating weights for polynomial function with degree =2 weights = np.polyfit(train_x, train_y, 2) print(weights) -> array([ -0.05194765, 5.22868974, -10.03406116]) # Generating model with the given weights model = np.poly1d(weights) # Prediction on validation set pred = model(valid_x) # We will plot the graph for 70 observations only xp = np.linspace(valid_x.min(),valid_x.max(),70) pred_plot = model(xp) plt.scatter(valid_x, valid_y, facecolor='None', edgecolor='k', alpha=0.3) plt.plot(xp, pred_plot) plt.show()

Similarly, we can plot polynomial curves for different degree values.

Unfortunately, polynomial regression has a fair number of issues as well. As we increase the complexity of the formula, the number of features also increases which is sometimes difficult to handle. Also, polynomial regression has a tendency to drastically over-fit, even on this simple one dimensional data set.

There are other issues with polynomial regression. For example, it is inherently non-local: changing the value of Y at one point in the training set can affect the fit of the polynomial for data points that are very far away. Hence, instead of using a high-degree polynomial on the whole dataset, we can substitute several low-degree polynomial functions fit on different parts of it.
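A quick numpy sketch of this non-locality (synthetic data and an arbitrary degree of our own choosing, not the Wage dataset): perturbing a single training point at one end of the range shifts a degree-10 polynomial fit even at the opposite end.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)

# Fit a high-degree polynomial to the original data
coef_before = np.polyfit(x, y, 10)

# Perturb a single observation at the left end of the range...
y_perturbed = y.copy()
y_perturbed[0] += 5.0
coef_after = np.polyfit(x, y_perturbed, 10)

# ...and measure how much the fit moves at the opposite end (x = 1)
shift = abs(np.polyval(coef_after, 1.0) - np.polyval(coef_before, 1.0))
print(shift)
```

The shift at x = 1 is clearly non-zero even though only the point at x = −1 was changed.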

## Walk-through of Regression Splines along with its Implementation

In order to overcome the disadvantages of polynomial regression, we can use an improved regression technique which, instead of building one model for the entire dataset, divides the dataset into multiple bins and fits each bin with a separate model. **Such a technique is known as Regression spline.**

Regression splines are one of the most important non-linear regression techniques. In polynomial regression, we generated new features by applying various **polynomial functions to the existing features**, which imposed a global structure on the dataset.

**To overcome this, we can divide the distribution of the data into separate portions and fit linear or low degree polynomial functions on each of these portions.**

*Source: R-Bloggers*

**The points where the division occurs are called Knots**. Functions which we can use for modelling each piece/bin are known as Piecewise functions. There are various piecewise functions that we can use to fit these individual bins.

In the next few sub-sections, we will read about some of these piecewise functions.

### Piecewise Step Functions

One of the most common piecewise functions is a step function, i.e., a function which remains constant within each interval. We can fit individual step functions to each of the divided portions in order to avoid imposing a global structure: we break the range of X into bins, and fit a different constant in each bin.

In greater detail, we create cut points C1 , C2, . . . , Ck in the range of X, and then construct K + 1 new variables.
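Following the notation of “Introduction to Statistical Learning”, these K + 1 variables are:

```latex
\begin{aligned}
C_0(X) &= I(X < c_1),\\
C_1(X) &= I(c_1 \le X < c_2),\\
&\;\;\vdots\\
C_{K-1}(X) &= I(c_{K-1} \le X < c_K),\\
C_K(X) &= I(c_K \le X),
\end{aligned}
```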

where I( ) is an indicator function that returns 1 if the condition is true and 0 otherwise. For example, I(cK ≤ X) equals 1 if cK ≤ X, and 0 otherwise. For a given value of X, at most one of C1, C2, . . ., CK can be non-zero, as X can lie in only one of the bins.
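As a small illustration (with hypothetical cut points, not taken from the article’s code), these indicator variables can be built directly with numpy:

```python
import numpy as np

# Hypothetical cut points c1, c2, c3 and a few sample ages
cuts = np.array([25, 40, 60])
x = np.array([15, 28, 41, 55, 70])

# np.digitize returns the bin index: 0 for X < c1, ..., 3 for X >= c3
bin_idx = np.digitize(x, cuts)
print(bin_idx)  # -> [0 1 2 2 3]

# One-hot rows: column k is the indicator variable C_k(X)
indicators = np.eye(len(cuts) + 1)[bin_idx]
print(indicators.sum(axis=1))  # every X falls in exactly one bin
```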

```
# Dividing the data into 4 bins
df_cut, bins = pd.cut(train_x, 4, retbins=True, right=True)
df_cut.value_counts(sort=False)

# -> (17.938, 33.5]    504
#    (33.5, 49.0]      941
#    (49.0, 64.5]      511
#    (64.5, 80.0]       54
#    Name: age, dtype: int64
```

```
df_steps = pd.concat([train_x, df_cut, train_y], keys=['age','age_cuts','wage'], axis=1)

# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
df_steps_dummies.head()
df_steps_dummies.columns = ['17.938-33.5','33.5-49','49-64.5','64.5-80']

# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()

# Binning validation set into the same 4 bins
# (valid_x was reshaped to 2-D earlier, so flatten it first)
bin_mapping = np.digitize(valid_x.ravel(), bins)
X_valid = pd.get_dummies(bin_mapping)

# np.digitize assigns the maximum age (80) to a fifth bin of its own; drop that column
X_valid = X_valid.drop([5], axis=1)

# Prediction
pred2 = fit3.predict(X_valid)

# Calculating RMSE
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(valid_y, pred2))
print(rms)  # -> 39.9

# We will plot the fit for 70 points only
xp = np.linspace(valid_x.min(), valid_x.max() - 1, 70)
bin_mapping = np.digitize(xp, bins)
X_valid_2 = pd.get_dummies(bin_mapping)
pred2 = fit3.predict(X_valid_2)
```

```
# Visualisation
fig, (ax1) = plt.subplots(1, 1, figsize=(12, 5))
fig.suptitle('Piecewise Constant', fontsize=14)

# Scatter plot with the piecewise-constant fit
ax1.scatter(train_x, train_y, facecolor='None', edgecolor='k', alpha=0.3)
ax1.plot(xp, pred2, c='b')
ax1.set_xlabel('age')
ax1.set_ylabel('wage')
plt.show()
```

Binning has obvious conceptual issues. Most prominently, we expect most phenomena we study to vary continuously with their inputs. **Binned regression does not create continuous functions of the predictor, and within each bin it assumes no relationship at all between the input and the output.**

For example, in the above graph, we can see that the first bin clearly misses the increasing trend of wage with age.

### Basis Functions

To capture non-linearity in regression models, we need to transform some, or all of the predictors. To avoid having to treat every predictor as linear, we want to apply a very general *family* of transformations to our predictors. The family should be flexible enough to adapt (when the model is fit) to a wide variety of shapes, but not *too* flexible as to over-fit.

**This concept of a family of transformations that can fit together to capture general shapes is called a basis function**. In this case, our objects are functions: b1 (X ), b2 (X ), . . . , bK (X ).

Instead of fitting a linear model in X, we fit the below model:
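That is, with basis functions b1, . . . , bK the model becomes:

```latex
y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i
```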

Now we’ll look into a very common choice for a basis function: **Piecewise Polynomials.**

### Piecewise Polynomials

Instead of fitting a constant function over different bins across the range of X, piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of X. As we use lower degrees of polynomials, we don’t observe high oscillations of the curve around the data.

For example, a piecewise quadratic polynomial works by fitting a quadratic regression equation:
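In the usual notation, that equation is:

```latex
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i
```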

where the coefficients β0 , β1 and β2 differ in different parts of the range of X. A piecewise cubic polynomial, with a single knot at a point *c*, takes the below form:
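With the two sets of coefficients written out, this is:

```latex
y_i =
\begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c, \\
\beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c.
\end{cases}
```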

In other words, we fit two different polynomial functions to the data: one on the subset of the observations with *xi < c*, and one on the subset of the observations with *xi ≥ c*.

The first polynomial function has coefficients β01, β11, β21, β31 and the second has coefficients β02, β12, β22, β32. **Each of these polynomial functions can be fit using the least squares error metric.**

Remember that this family of polynomial functions has 8 degrees of freedom, 4 for each polynomial (one per coefficient).

Using more knots leads to a more flexible piecewise polynomial, as we use a different function for every bin, and these functions depend only on the distribution of the data in that particular bin. In general, **if we place K different knots throughout the range of X, we will end up fitting K+1 different cubic polynomials**. We can use any low-degree polynomial to fit these individual bins; for example, we can instead fit piecewise linear functions. In fact, **the step functions used above are actually piecewise polynomials of degree 0**.
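As a toy illustration on synthetic data (the knots below are arbitrary, not the article’s), K = 2 knots give K + 1 = 3 bins, each with its own separate linear fit:

```python
import numpy as np

rng = np.random.default_rng(1)
age = np.sort(rng.uniform(18, 80, 200))
wage = 0.7 * age + rng.normal(scale=5.0, size=age.size)

knots = [35, 50]                          # K = 2 hypothetical knots
edges = [age.min(), *knots, age.max() + 1]

# One low-degree (here linear) polynomial per bin: K knots -> K + 1 fits
fits = []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (age >= lo) & (age < hi)
    fits.append(np.polyfit(age[mask], wage[mask], 1))

print(len(fits))  # -> 3
```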

Now we will look at some necessary conditions and constraints that should be followed while forming piecewise polynomials.

### Constraints and Splines

We need to be cautious while using Piecewise polynomials as there are various constraints that we need to follow. Consider the image below:

*Source: Elements of Statistical Learning*

We might encounter certain situations where the polynomials at either end of a knot are not continuous at the knot. Such a condition should be avoided because the family of polynomials as a whole should generate a unique output for every input.

We can see from the above image that it outputs two different values at the first knot. Thus, to avoid this, we should **add an extra constraint/condition that the polynomials on either side of a knot should be continuous at the knot**.

*Source: Elements of Statistical Learning*

Now after adding that constraint, we get a continuous family of polynomials. But does it look perfect? Before reading further, take a moment to think about what’s missing here.

It looks like smoothness at the knots is still absent. So, to smoothen the polynomials at the knots, we **add an extra constraint: the first derivatives of both polynomials must be the same at the knot**. One thing we should note: *each constraint that we impose on the piecewise cubic polynomials effectively frees up one degree of freedom*, as we reduce the complexity of the resulting piecewise polynomial fit. Therefore, in the above plot, we are using only 10 degrees of freedom instead of 12 (one continuity constraint at each of the two knots).

*Source: Elements of Statistical Learning*

After imposing the constraint of equal first derivatives, we obtain the above plot. It uses 8 degrees of freedom instead of 12, as two constraints are now imposed at each of the two knots. Although this plot looks better, there is still some scope for improvement. Now **we will impose an extra constraint: the second derivatives of both polynomials at each knot must be the same.**

*Source: Elements of Statistical Learning*

This plot seems perfect for our study. It uses only 6 degrees of freedom instead of 12. **Such a piecewise polynomial of degree m with m−1 continuous derivatives is called a Spline**. Hence, we have constructed a cubic spline in the above plot, and we can similarly construct a spline of any degree m with m−1 continuous derivatives.

### Cubic and Natural Cubic Splines

A cubic spline is a piecewise cubic polynomial with a set of extra constraints (continuity, continuity of the first derivative, and continuity of the second derivative). In general, a cubic spline with K knots has a total of 4 + K degrees of freedom. There is seldom any good reason to go beyond cubic splines (unless one is interested in smooth higher derivatives).
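One standard way to write a cubic spline with knots ξ1, . . . , ξK (the truncated power basis, as in “Elements of Statistical Learning”) makes this 4 + K count explicit: four ordinary polynomial terms plus one truncated cubic term per knot.

```latex
f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^{K} \theta_k \,(x - \xi_k)_+^3,
\qquad
(x - \xi_k)_+^3 =
\begin{cases}
(x - \xi_k)^3 & \text{if } x > \xi_k, \\
0 & \text{otherwise.}
\end{cases}
```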

```
from patsy import dmatrix
import statsmodels.api as sm

# Generating cubic spline with 3 knots at 25, 40 and 60
transformed_x = dmatrix("bs(train, knots=(25,40,60), degree=3, include_intercept=False)",
                        {"train": train_x}, return_type='dataframe')

# Fitting Generalised linear model on transformed dataset
fit1 = sm.GLM(train_y, transformed_x).fit()

# Generating cubic spline with 4 knots
transformed_x2 = dmatrix("bs(train, knots=(25,40,50,65), degree=3, include_intercept=False)",
                         {"train": train_x}, return_type='dataframe')

# Fitting Generalised linear model on transformed dataset
fit2 = sm.GLM(train_y, transformed_x2).fit()

# Predictions on both splines
pred1 = fit1.predict(dmatrix("bs(valid, knots=(25,40,60), include_intercept=False)",
                             {"valid": valid_x}, return_type='dataframe'))
pred2 = fit2.predict(dmatrix("bs(valid, knots=(25,40,50,65), degree=3, include_intercept=False)",
                             {"valid": valid_x}, return_type='dataframe'))

# Calculating RMSE values
rms1 = sqrt(mean_squared_error(valid_y, pred1))
print(rms1)  # -> 39.4
rms2 = sqrt(mean_squared_error(valid_y, pred2))
print(rms2)  # -> 39.3

# We will plot the curves for 70 points only
xp = np.linspace(valid_x.min(), valid_x.max(), 70)

pred1 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)",
                             {"xp": xp}, return_type='dataframe'))
pred2 = fit2.predict(dmatrix("bs(xp, knots=(25,40,50,65), degree=3, include_intercept=False)",
                             {"xp": xp}, return_type='dataframe'))

# Plot the splines
plt.scatter(data.age, data.wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(xp, pred1, label='Specifying degree=3 with 3 knots')
plt.plot(xp, pred2, color='r', label='Specifying degree=3 with 4 knots')
plt.legend()
plt.xlim(15, 85)
plt.ylim(0, 350)
plt.xlabel('age')
plt.ylabel('wage')
plt.show()
```

We know that the behavior of polynomials fit to data tends to be erratic near the boundaries, and such variability can be dangerous. Splines suffer from the same problem: the polynomials fit beyond the boundary knots behave even more wildly than the corresponding global polynomials in that region. **To tame the behaviour of the spline beyond the boundary knots, we will use a special type of spline known as a Natural Spline.**

A natural cubic spline adds additional constraints, namely that the function is linear beyond the boundary knots. This constrains the cubic and quadratic parts there to 0, each reducing the degrees of freedom by 2. That’s 2 degrees of freedom at each of the two ends of the curve, reducing *K*+4 to *K*.

```
# Generating natural cubic spline
transformed_x3 = dmatrix("cr(train, df=3)", {"train": train_x}, return_type='dataframe')
fit3 = sm.GLM(train_y, transformed_x3).fit()

# Prediction on validation set
pred3 = fit3.predict(dmatrix("cr(valid, df=3)", {"valid": valid_x}, return_type='dataframe'))

# Calculating RMSE value
rms = sqrt(mean_squared_error(valid_y, pred3))
print(rms)  # -> 39.44

# We will plot the curve for 70 points only
xp = np.linspace(valid_x.min(), valid_x.max(), 70)
pred3 = fit3.predict(dmatrix("cr(xp, df=3)", {"xp": xp}, return_type='dataframe'))

# Plot the spline
plt.scatter(data.age, data.wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(xp, pred3, color='g', label='Natural spline')
plt.legend()
plt.xlim(15, 85)
plt.ylim(0, 350)
plt.xlabel('age')
plt.ylabel('wage')
plt.show()
```

### Choosing the Number and Locations of the Knots

When we fit a spline, where should we place the knots? One potential place would be the area of high variability, because in those regions the polynomial coefficients can change rapidly. Hence, one option is to place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable.

While this option can work well, in practice it is common to place knots in a uniform fashion. One way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.

Another option is to try out different numbers of knots and see which produces the best looking curve.

**A more objective approach is to use cross-validation**. With this method:

- we remove a portion of the data,
- fit a spline with a certain number of knots to the remaining data, and then,
- use the spline to make predictions for the held-out portion.

We repeat this process multiple times until each observation has been left out once, and then compute the overall cross-validated RMSE. This procedure can be repeated for different numbers of K knots. Then the value of K giving the smallest RMSE is chosen.
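A rough, numpy-only sketch of this procedure on synthetic data (with simple piecewise-constant fits standing in for splines, and function names of our own invention):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def cv_rmse(n_knots, n_folds=5):
    """k-fold cross-validated RMSE of a piecewise-constant fit with n_knots interior knots."""
    folds = np.array_split(rng.permutation(x.size), n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Knots placed at uniform quantiles of the training data
        knots = np.quantile(x[train], np.linspace(0, 1, n_knots + 2)[1:-1])
        train_bins = np.digitize(x[train], knots)
        bin_means = np.array([y[train][train_bins == b].mean()
                              for b in range(n_knots + 1)])
        pred = bin_means[np.digitize(x[test], knots)]
        errs.append(np.mean((y[test] - pred) ** 2))
    return np.sqrt(np.mean(errs))

# Try several values of K and keep the one with the smallest CV error
scores = {k: cv_rmse(k) for k in (1, 3, 5, 9)}
best = min(scores, key=scores.get)
print(best, scores[best])
```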

## Comparison of Regression Splines with Polynomial Regression

Regression splines often give better results than polynomial regression. This is because, **unlike polynomials, which must use a high degree polynomial to produce flexible fits, splines introduce flexibility by increasing the number of knots but keep the degree fixed**.

Generally, this approach produces more stable estimates. Splines also allow us to place more knots, and hence flexibility, over regions where the function seems to be changing rapidly, and fewer knots where the function appears more stable. The extra flexibility in the polynomial produces undesirable results at the boundaries, whereas the natural cubic spline still provides a reasonable fit to the data.


## End Notes

In this article, we learned about regression splines and their benefits over linear and polynomial regression. Another method of producing splines is called **smoothing splines**. It works similarly to Ridge/Lasso regularisation, in that it minimises a loss function plus a penalty on the roughness of the fit. You can read more about it in the book ‘Introduction to Statistical Learning’. Try implementing these methods on datasets with high variability and notice the difference.

Did you find this article helpful? Please share your opinions / thoughts in the comments section below.


Instead of fitting a line to this data set, I would rather define a range: one could fit a slightly increasing lower limit for the whole set, and two straight lines as an upper limit for the intervals 18–35 and 35–60 (or 65).

Hey klagyi

Yes, you can definitely do this for this dataset. Dividing the dataset into bins is the primary logic behind regression splines.


What is meant by “imposing global structure”? If this cannot be answered simply, please provide links.

Hey

Imposing a global structure means using a single function to represent all the data points. If your data has high variance, the fitted function will be complex everywhere, even in regions where the data are constant or linear: the same function fitted to the highly variable part is also applied to the constant or linearly distributed points. I hope that clears your query.


Hi,

Nice article. For the part:

```
# Binning validation set into same 4 bins
bin_mapping = np.digitize(valid_x, bins)
X_valid = pd.get_dummies(bin_mapping)
```

I am getting an exception: `Exception: Data must be 1-dimensional`

Any hints? Thanks in anticipation.

Hey

Add this extra line before generating the dummies:

```
bin_mapping = bin_mapping.ravel()
```

Regards


Hi Gurchetan

Indeed very good article! Thanks for sharing.

I have a question regarding the method, and in general regarding data of this kind. In effect, the additional thing the spline is capable of doing is capturing some non-linearity in the data set. However, I noticed that despite using regression splines, the RMSE hasn’t reduced much. Clearly, the high variance in the data contributes to the high RMSE, but I am curious to know what else could be done to reduce it. I can think of a few sides to this issue:

1. Underlying data issues (e.g., presence of groups, data collected for different populations/locations/times)? And is there a way to identify and capture those groups?

2. Presence of random effects (similar to idea above, and requiring multiple regressions)

3. Missing variables ?

Would really appreciate, If someone who has faced this issue could comment.

Thanks

Hey

This dataset is small, so it might not be the best example to show the benefits of regression splines. And yes, finding the groups present in the dataset is important: it is like finding different groups within the data and fitting separate regression lines to each.