Getting Started with Machine Learning - Implementing Linear Regression from Scratch

Yashowardhan Last Updated : 01 Dec, 2023

7 min read

This article was published as a part of the Data Science Blogathon

Introduction

Using the machine learning models in your projects is quite simple considering that we have pre-built modules and libraries like sklearn, but it is important that one knows how the actual algorithm works for a better understanding of the core concept. In this blog, we will be implementing one of the most basic algorithms in machine learning i.e Simple Linear Regression. The topics that will be covered in this blog are as follows:

Introduction
What is Linear Regression?
Calculating coefficient of the equation:
How to implement Linear Regression in Python?
How to visualize the regression line?
- Which metrics to use for model evaluation?
Results and Conclusion:
Frequently Asked Questions

What is Linear Regression?

Linear Regression is a supervised Machine Learning algorithm it is also considered to be the most simple type of predictive Machine Learning algorithm. There is some basic assumption that we make for linear regression to work, such as it is important that the relation between the independent and the target variable is linear in nature else our model will end up giving irrelevant results.

Image Link

The word Linear in Linear Regression suggests that the function used for the prediction is a linear function. This function can be represented as shown below:

Linear Regression equation — Equation for Simple Linear Regression

Now, you might be familiar with this equation, in fact, we all have used this equation this is the equation of a straight line. In terms of linear regression, y in this equation stands for the predicted value, x means the independent variable and m & b are the coefficients we need to optimize in order to fit the regression line to our data.

Calculating coefficient of the equation:

To calculate the coefficients we need the formula for Covariance and Variance, so the formula for these are:

Linear Regression covariance — Formula for Covariance

To calculate the coefficient m we will use the formula given below

m = cov(x, y) / var(x)
b = mean(y) — m * mean(x)

How to implement Linear Regression in Python?

Now that we know the formulas for calculating the coefficients of the equation let’s move onto the implementation. To implement this code we will be using standard libraries like Pandas and Numpy and later to visualize our result we will use Matplotlib and Seaborn.

The dataset used in this example is available on my Github repository here. The dataset has 2 columns Parameter 1 and Parameter 2. So, x will be Parameter 1 and y will be Parameter 2. We will first start by writing functions to calculate the Mean, Covariance, and Variance.

# mean 
def get_mean(arr):
    return np.sum(arr)/len(arr)

# variance
def get_variance(arr, mean):
    return np.sum((arr-mean)**2)

# covariance
def get_covariance(arr_x, mean_x, arr_y, mean_y):
    final_arr = (arr_x - mean_x)*(arr_y - mean_y)
    return np.sum(final_arr)

Now, the next step will be to calculate the coefficients and the Linear Regression Function so let’s see the code for those functions.

# Coefficients 
# m = cov(x, y) / var(x)
# b = y - m*x
def get_coefficients(x, y):
    x_mean = get_mean(x)
    y_mean = get_mean(y)
    m = get_covariance(x, x_mean, y, y_mean)/get_variance(x, x_mean)
    b = y_mean - x_mean*m
    return m, b

# Linear Regression 
# Train and Test
# Train Split 80 % Test Split 20 %
def linear_regression(x_train, y_train, x_test, y_test):
    prediction = []
    m, b = get_coefficients(x_train, y_train)
    for x in x_test:
        y = m*x + b
        prediction.append(y)
    
    r2 = r2_score(prediction, y_test)
    mse = mean_squared_error(prediction, y_test)
    print("The R2 score of the model is: ", r2)
    print("The MSE score of the model is: ", mse)
    return prediction
prediction = linear_regression(x[:80], y[:80], x[80:], y[80:])

Now that we have all the functions ready for implementing the regression, the only thing that remains is the visualization, so now we will write the code to visualize the regression line.

How to visualize the regression line?

To visualize the regression line we will use the matplotlib and seaborn libraries. The points will be plotted using the scatterplot and the regression line will be plotted using the lineplot functions. Seaborn also offers a separate plot for regression called the regplot (regression plot) but here we will be using the scatter and lineplot for visualizing the line.

def plot_reg_line(x, y):
    # Calculate predictions for x ranging from 1 to 100
    prediction = []
    m, c = get_coefficients(x, y)
    for x0 in range(1,100):
        yhat = m*x0 + c
        prediction.append(yhat)
    
    # Scatter plot without regression line
    fig = plt.figure(figsize=(20,7))
    plt.subplot(1,2,1)
    sns.scatterplot(x=x, y=y)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Scatter Plot between X and Y')
    
    # Scatter plot with regression line
    plt.subplot(1,2,2)
    sns.scatterplot(x=x, y=y, color = 'blue')
    sns.lineplot(x = [i for i in range(1, 100)], y = prediction, color='red')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Regression Plot')
    plt.show()

The above code is for understanding the way our visualization works but for regular use you need not write the above code you can simply use the regplot from seaborn, regplot is basically the combination of the scatter plot and the line plot, the code for regplot is as shown below.

# Regression plot
sns.regplot(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title("Regression Plot")
plt.show()

regression plot — Output for the above code

Which metrics to use for model evaluation?

For evaluation of a regression model we generally use the

r2_score
mean_squared_error
root_mean_squared_error
mean_absolute_error

You can learn about these metrics here. In this blog, we have made use of 2 metrics from the 3 listed above. The r2_score and mean_squared_error. The r2_score ranges from 0 to 1 where 1 signifies absolute precision and 0 signify poor performance, the mse or the mean squared error is basically the summation of all the squared errors and lesser the mse value better our model is.

evaluation metrics — Formulae for different metrics

Results and Conclusion:

So now we will combine all the codes we have written above and look at the results we obtain, we will make use of both, the algorithm coded by us as well as the built-in library sklearn. If results by both the methods are the same it means we have successfully coded the algorithm.

Let’s first look at our algorithm:

# Import necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score, mean_squared_error

# Read Data
df = pd.read_csv('SampleData.csv')
x = df['Parameter 1'].values
y = df['Parameter 2'].values

# mean
def get_mean(arr):
    return np.sum(arr)/len(arr)

# variance
def get_variance(arr, mean):
    return np.sum((arr-mean)**2)

# covariance
def get_covariance(arr_x, mean_x, arr_y, mean_y):
    final_arr = (arr_x - mean_x)*(arr_y - mean_y)
    return np.sum(final_arr)

# find coeff
def get_coefficients(x, y):
    x_mean = get_mean(x)
    y_mean = get_mean(y)
    m = get_covariance(x, x_mean, y, y_mean)/get_variance(x, x_mean)
    c = y_mean - x_mean*m
    return m, c

# Regression Function
def linear_regression(x_train, y_train, x_test, y_test):
    prediction = []
    m, c = get_coefficients(x_train, y_train)
    for x in x_test:
        y = m*x + c
        prediction.append(y)
    
    r2 = r2_score(prediction, y_test)
    mse = mean_squared_error(prediction, y_test)
    print("The R2 score of the model is: ", r2)
    print("The MSE score of the model is: ", mse)
    return prediction

# There are 100 sample out of which 80 are for training and 20 are for testing
linear_regression(x[:80], y[:80], x[80:], y[80:])

# Visualize
def plot_reg_line(x, y):
    prediction = []
    m, c = get_coefficients(x, y)
    for x0 in range(1,100):
        yhat = m*x0 + c
        prediction.append(yhat)
    
    fig = plt.figure(figsize=(20,7))
    plt.subplot(1,2,1)
    sns.scatterplot(x=x, y=y)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Scatter Plot between X and Y')
    
    plt.subplot(1,2,2)
    sns.scatterplot(x=x, y=y, color = 'blue')
    sns.lineplot(x = [i for i in range(1, 100)], y = prediction, color='red')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Regression Plot')
    plt.show()
    
plot_reg_line(x, y)

And the output generated by our algorithm is shown the figure below, the red line shows the regression line.

Now let’s see the built-in Linear Regression function:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

df = pd.read_csv('SampleData.csv')
x = df['Parameter 1'].values
y = df['Parameter 2'].values

reg = LinearRegression()
reg.fit(x[:80].reshape(-1, 1), y[:80])
prediction = reg.predict(x[80:].reshape(-1, 1))
r2 = r2_score(prediction, y[80:])
mse = mean_squared_error(prediction, y[80:])
print("The R2 score of the model is: ", r2)
print("The MSE score of the model is: ", mse)

prediction = reg.predict(np.array([i for i in range(1, 100)]).reshape(-1, 1))

fig = plt.figure(figsize=(20,7))
plt.subplot(1,2,1)
sns.scatterplot(x=x, y=y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot between X and Y')

plt.subplot(1,2,2)
sns.scatterplot(x=x, y=y, color = 'green')
sns.lineplot(x = [i for i in range(1, 100)], y = prediction, color='red')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Regression Plot')
plt.show()

The output of the code above:

We can clearly see from both the outputs that we have obtained the exact same result from both the algorithms i.e the one we coded and the one in sklearn, so that was all about writing your own code for implementing Linear Regression in python. If you enjoyed reading the blog do share it with your friends. If you have any suggestions or doubts you can comment them down below. in the future blogs, I will try to cover the implementations of other ML algorithms like KNN, Decision Tree, etc. Happy Learning!

Frequently Asked Questions

Q1.Why is linear regression so hard?

Learning linear regression might seem hard initially due to mathematical concepts and data interpretation. Practice and understanding can make it easier.

Q2.What is the SXX formula for linear regression?

The SXX formula calculates the sum of squared differences between each data point’s X value and the mean X value.

Q3.What does F mean in linear regression?

In linear regression, “F” usually refers to the F-statistic, a measure of model fit. It helps assess whether the overall model significantly predicts the outcome statistically.

Q4.What does SSx mean in statistics?

In statistics, SSx represents the sum of squares, representing the sum of squared deviations from the mean. It’s a measure of the variability within a set of values.

Connect with me on LinkedIn

Email: [email protected]

Check out my previous articles here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion

Yashowardhan

Free Courses

4.6

Exploratory Data Analysis with Python & GenAI

Learn EDA with Python: Transform data into insights using PandasAI & more.

4.5

Data Science Course

Build a powerful 2026-ready data science resume using AI tools.

4.5

No Code Predictive Analytics with Orange

No-code AI course for business pros with real-world ML use cases.

4.7

Adaptive Email Agents with DSPy

Build adaptive email agents with DSPy using context and smart learning.

4.9

Introduction to AI & ML

AI & ML are transforming industries. Learn their impacts in this course.

Reading list

Getting Started with Machine Learning - Implementing Linear Regression from Scratch

Introduction

Table of contents

What is Linear Regression?

Calculating coefficient of the equation:

How to implement Linear Regression in Python?

How to visualize the regression line?

Which metrics to use for model evaluation?

Results and Conclusion:

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Exploratory Data Analysis with Python & GenAI

Data Science Course

No Code Predictive Analytics with Orange

Adaptive Email Agents with DSPy

Introduction to AI & ML

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Getting Started with Machine Learning - Implementing Linear Regression from Scratch

Introduction

Table of contents

What is Linear Regression?

Calculating coefficient of the equation:

How to implement Linear Regression in Python?

How to visualize the regression line?

Which metrics to use for model evaluation?

Results and Conclusion:

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Exploratory Data Analysis with Python & GenAI

Data Science Course

No Code Predictive Analytics with Orange

Adaptive Email Agents with DSPy

Introduction to AI & ML

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques