The Game of Increasing R-squared in a Regression Model

CHIRAG Last Updated : 15 May, 2021


This article was published as a part of the Data Science Blogathon.

Introduction

After building a machine learning model, the next crucial step is to evaluate its performance on unseen (test) data and see how it compares against a benchmark model.

The evaluation metric to use depends on the type of problem you are trying to solve: whether it is supervised, unsupervised, or a mix of the two (such as semi-supervised), and whether it is a classification or a regression task.

In this article, we will discuss two important evaluation metrics for regression problems, R² and adjusted-R², find the key difference between them, and learn why they are often preferred over Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): unlike MSE and RMSE, they do not depend on the units of the target, which makes them easier to interpret.

The important questions we will try to answer in this article are:

👉 The Game of increasing R-squared (R²)

👉 Why do we go for adjusted-R²?

👉 When to use R² and when to use adjusted-R²?

Let’s first understand what exactly is R Squared?

R-squared, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable (target or response) that is explained by the independent variables (features or predictors).

Let us understand this with an example. Say the R² value for a regression model with Income as the independent variable (predictor) and Expenditure as the dependent variable (response) comes out to be 0.76.

In general terms, this means that 76% of the variation in the dependent variable is explained by the independent variables.

For our regression problem, this can be read in any of the following ways:

👉 76% of the variability in expenditure is accounted for by the regression equation, and 24% is due to other factors.

👉 76% of the variability in expenditure is explained by its linear relationship with income, while 24% is unaccounted for.

👉 76% of the variation in expenditure is associated with variation in income, while the model tells us nothing about the remaining 24%. God knows better about that part.

Fig. R-squared in linear regression

Important points about R Squared

👉 Ideally, we would want the independent variables to explain all of the variation in the target variable. In that scenario, the R² value would be equal to 1. So, other things being equal, a higher R² value indicates a better fit.

👉 In simple terms, the higher the R², the more variation is explained by your input variables. For a linear regression with an intercept, evaluated on the data it was fitted to, R² lies in the range [0, 1]; on unseen data it can be negative when the model predicts worse than the mean of the target. Here is the formula for calculating R²:

R² is calculated by dividing the sum of squared residuals of the regression model (SS_RES) by the total sum of squares, that is, the sum of squared errors of the average model that always predicts the mean (SS_TOT), and then subtracting the result from 1.

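$$R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Here y_i is an observed value, ŷ_i is the model's prediction for it, and ȳ is the mean of the observed values.

Fig. Formula for Calculating R²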
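The calculation described above (one minus SS_RES divided by SS_TOT) can be sketched in a few lines of plain Python. The function name `r_squared` and the expenditure figures are made up for illustration; they are not taken from any real dataset.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_RES / SS_TOT for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # SS_RES: squared errors of the regression model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # SS_TOT: squared errors of the "average model" that always predicts the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical expenditure values and model predictions (illustrative only).
expenditure = [20.0, 25.0, 30.0, 35.0, 40.0]
predicted = [21.0, 24.0, 31.0, 34.0, 40.0]

print(r_squared(expenditure, predicted))
```

Here SS_RES is 4 and SS_TOT is 250, so the result is about 0.984. Passing the mean (30.0) as every prediction gives 0, and predictions that are worse than the mean give a negative value.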

Drawbacks of using R Squared :

👉 Every time we add an independent variable X_i (predictor/explanatory variable) to a regression model, R² increases or stays the same, even if that variable is insignificant for the model.

👉 R² implicitly treats every independent variable in the model as if it helps to explain variation in the dependent variable. In practice, some independent variables contribute nothing to predicting the dependent variable.

👉 So, if we add new features to the data (which may or may not be useful), the R² value of the model on the training data would either increase or remain the same, but it would never decrease.
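The drawback listed above can be checked with a small experiment. The following is a minimal sketch in plain Python, assuming synthetic income/expenditure data and a hand-written least-squares solver (`fit_ols` is a hypothetical helper, not a library function). It fits one model on income alone and another on income plus a pure-noise column, then compares their training R².

```python
import random


def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def r_squared(X, y, coef):
    """Training R² of the linear model with the given coefficients."""
    preds = [sum(c * x for c, x in zip(coef, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic data (assumption): expenditure depends on income plus random error.
rng = random.Random(0)
n = 30
income = [rng.uniform(20, 100) for _ in range(n)]
expenditure = [5 + 0.6 * x + rng.gauss(0, 5) for x in income]
noise = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to expenditure

X_small = [[1.0, x] for x in income]                      # intercept + income
X_big = [[1.0, x, z] for x, z in zip(income, noise)]      # ... + noise column

r2_small = r_squared(X_small, expenditure, fit_ols(X_small, expenditure))
r2_big = r_squared(X_big, expenditure, fit_ols(X_big, expenditure))
print(f"income only:    R2 = {r2_small:.4f}")
print(f"income + noise: R2 = {r2_big:.4f}")
```

The second R² is at least as large as the first, even though the extra column carries no information about expenditure.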

To address these problems, we use adjusted-R², a slightly modified version of R².

Let’s understand what is Adjusted R²?

👉 Like R², adjusted-R² measures the proportion of variation explained by the model, but it adjusts that proportion for the number of independent variables used.

👉 Unlike R², adjusted-R² penalizes the addition of independent variables that don't help in predicting the dependent variable (target).

Let us see mathematically how this penalty is built into adjusted-R². Here is the formula for adjusted-R²:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

Here n is the number of observations and k is the number of independent variables in the model.

Fig. Formula for Calculating adjusted-R²
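As a minimal sketch, assuming the standard definition adjusted-R² = 1 − (1 − R²)(n − 1)/(n − k − 1), with n observations and k independent variables, the adjustment can be written as a small Python function. The function name and the R² values below are hypothetical and chosen only to show the penalty at work.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² for a model with k predictors fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: a second predictor nudges R² from 0.800 to 0.801.
one_predictor = adjusted_r_squared(0.800, n=50, k=1)
two_predictors = adjusted_r_squared(0.801, n=50, k=2)

print(f"k=1: adjusted R2 = {one_predictor:.4f}")
print(f"k=2: adjusted R2 = {two_predictors:.4f}")
```

In this made-up case R² goes up slightly, but the adjusted value goes down, because the tiny gain in fit does not make up for the extra degree of freedom.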

Let's take an example to see how the values of these metrics change as variables are added to a regression model.

For example,

Independent variable added	R² (%)	Adjusted-R² (%)
X₁	67.8	67.1
X₂	88.3	85.6
X₃	92.5	82.7

In this example, R² rises each time a variable is added. Adjusted-R² rises when X₂ is added but falls from 85.6 to 82.7 when X₃ is added, which indicates that X₃ is insignificant, that is, it doesn't contribute enough to explaining the variation in the dependent variable. Adjusted-R² decreases because the gain in fit does not make up for the degree of freedom that the extra variable uses up.

R² vs Adjusted-R²

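👉 Adjusted-R² is a modified version of R² that accounts for the number of independent variables.

👉 Adjusted-R² rewards an independent variable only on merit, that is, only if it improves the fit by more than would be expected by chance.

👉 Adjusted-R² ≤ R², and adjusted-R² can be negative even on the training data.

👉 R² counts any reduction in residual variation, including reductions caused by extraneous variables, whereas adjusted-R² discounts them.

👉 The only difference between R² and adjusted-R² is the adjustment for degrees of freedom.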

The Game of Increasing R²

Sometimes researchers try to increase R² in every possible way.

👉 One way is to include more and more explanatory (independent) variables in the model, because:

R² is a non-decreasing function of the number of independent variables, i.e., with the inclusion of one more independent variable, R² will usually increase and will never decrease.
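The game can be played on synthetic data. The sketch below assumes NumPy is available and uses made-up income/expenditure data with 17 extra predictors that are pure noise. It fits ordinary least squares with `numpy.linalg.lstsq` for a growing number of predictors and prints R² next to adjusted-R², computed as 1 − (1 − R²)(n − 1)/(n − k − 1). The helper name `r2_and_adjusted` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
# Synthetic data (assumption): expenditure depends on income plus random error.
income = rng.uniform(20, 100, size=n)
expenditure = 5 + 0.6 * income + rng.normal(0, 5, size=n)
noise = rng.normal(size=(n, 17))  # 17 predictors unrelated to expenditure


def r2_and_adjusted(X, y):
    """Fit OLS with an intercept; return (R², adjusted R²) on the training data."""
    n_obs, k = X.shape
    design = np.column_stack([np.ones(n_obs), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1)
    return r2, adj


# Column 0 is income; columns 1..17 are noise. Use the first k columns each time.
features = np.column_stack([income, noise])
results = [r2_and_adjusted(features[:, :k], expenditure) for k in range(1, 19)]

for k, (r2, adj) in enumerate(results, start=1):
    print(f"k={k:2d}  R2={r2:.4f}  adjusted R2={adj:.4f}")
```

R² never goes down as k grows, while adjusted-R² is free to fall, so the first column of the printout rewards the game and the second does not.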

When to use which?

Comparing models using R²

Comparing two models based on R² alone is dangerous because:

👉 Models with different numbers of independent variables may have equal values of R².

👉 The total sample size and the respective degrees of freedom are ignored.

Hence, there is a real chance of choosing the wrong model.

Problem solved by adjusted-R²

To compare two different models, or to choose the best model, adjusted-R² is used because:

👉 It is adjusted for the respective degrees of freedom.

👉 It takes into account the total sample size and the number of independent variables.

👉 It is not an increasing function of the number of independent variables.

👉 It increases only if a newly added independent variable improves the fit enough to outweigh the degree of freedom it costs.

CONCLUSION:

To conclude the discussion:

👉 R² can be used to assess the goodness of fit of a single model, whereas

👉 Adjusted-R² is used to compare models with different numbers of independent variables and to see the real impact of newly added independent variables.

👉 Adjusted-R² should be used when selecting important predictors for the regression model.

End Notes

Thanks for reading!

If you liked this and want to know more, visit my other articles on Data Science and Machine Learning.

Please feel free to contact me on LinkedIn or by email.

Something not mentioned, or want to share your thoughts? Feel free to comment below and I'll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the Indian Institute of Technology Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.
