*This article was published as a part of the Data Science Blogathon*

**Gradient Descent is an iterative algorithm used to optimize the parameters of an equation by minimizing the Loss (often called a Cost function).**

But before diving deep, we first need a basic idea of what a gradient means. Why do we calculate the gradient of a function, and what is our final goal in using this algorithm?

*The gradient of a function is the slope of the function at a given point*. We calculate the gradient so that we can move toward the global minimum of the function. Our goal is to find the set of parameters (coefficients) for which the function has the minimum loss value.
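As a quick illustration (the function here is hypothetical, chosen only for this sketch), the gradient of f(x) = x² at x = 3 is 6, and a central-difference approximation recovers this:

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate f'(x), the slope of f at x, with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2  # the analytic gradient (slope) is f'(x) = 2x
grad_at_3 = numerical_gradient(f, 3.0)
print(round(grad_at_3, 4))  # close to 6.0
```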

Now that we are aware of the task and the goal, let's explore the concept gradually.

Y = b + w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + … + w_{m}x_{m}

Take the example of a simple linear equation in m variables X_{1}, X_{2}, X_{3} … X_{m}, representing a population sample (fig(a)).

__Here in this equation, the number of parameters is m+1 (the coefficients w_{1}, w_{2} … w_{m} plus the intercept term b).__

Now, writing the Cost function for the above linear equation.

The **Cost function (Loss)** is calculated as the **average of the sum of squares of (predicted values minus actual values), as below:**
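As a sketch, this squared-error loss can be computed for a set of predictions like so (plain Python lists, made-up numbers for illustration):

```python
def squared_error_loss(predicted, actual):
    """Average of the summed squared differences between predictions and targets."""
    n = len(actual)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# (2.5-3)^2 + (0-(-0.5))^2 + (2-2)^2 = 0.5, averaged over 3 points
print(squared_error_loss([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # 0.5 / 3 ≈ 0.1667
```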

Since our objective is to find the global minimum of the loss function, we need to plot the Loss function with respect to all the parameters.

But we can see that the **total number of dimensions needed to plot the actual graph becomes m+2** (m+1 parameter axes plus one axis for the loss).

*Now let's look at the logic behind the scenes of gradient descent.*

Let’s say this is our loss function

*As seen above, our loss function is quadratic. Suppose, for convenience, this is our loss function; if we calculate the gradient at x = 1.5, we find it to be positive (approximately 1.66).*

*Now, the question is: in which direction from x = 1.5 do we have to move to reach the minimum? The trick is to move opposite to the direction of the slope, i.e. if the slope is positive, move in the negative direction, and if the slope is negative, move in the positive direction (relate this to the diagram above). So, since the slope at x = 1.5 is positive, we have to move in the negative direction to search for the minimum. But by how much?*

*So the second question is how far we have to move from the point (1.5, 2.5) to get close to the minimum of the function. The step size is controlled by the learning rate, denoted "α".*
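The "move opposite to the slope, scaled by the learning rate" rule can be sketched on a hypothetical quadratic loss (the exact function in the figure above is not reproduced here):

```python
def step(x, grad, lr):
    """One gradient-descent update: move opposite to the slope, scaled by lr."""
    return x - lr * grad

loss = lambda x: (x - 0.5) ** 2   # hypothetical convex loss with minimum at x = 0.5
dloss = lambda x: 2 * (x - 0.5)   # its derivative (the gradient)

x = 1.5                            # starting point, as in the text
for _ in range(50):
    x = step(x, dloss(x), lr=0.1)
print(round(x, 3))                 # converges toward 0.5
```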

*So, now that we are familiar with all the concepts needed to understand this iterative algorithm, let's see the steps involved. For simplicity, we shall take the example of a linear equation in one variable and then generalize it to m terms: y = b + w_{1}x_{1}.*

*Suppose the figure below represents a scatter plot of y vs the variable x_{1}.*

*We want to find the best set of parameters to minimize the loss as much as possible (i.e. the best values of w_{1} and b). For further simplicity, take the intercept b as zero.*

*In the figure above, the loss for every data point is (predicted_value − actual_value)^{2}.*

If we plot Loss vs the parameter w_{1}, we get a bowl-shaped graph like the one below.

*Here we can see that for each value of w_{1} there would be a loss associated with it. We are taking a simple model here for ease of understanding, but as the number of parameters (features) increases, the number of dimensions increases, and it becomes very difficult to judge the minimum visually.*

*Step_1: Initialize w_{1} to any value (say point C in the figure above).*

*Step_2: Calculate the Loss function J(w_{1}, b).*

*Step_3: Take the tangent at point C and calculate its gradient. The slope at point C is negative, so we move in the positive direction to reach the minimum.*

*Step_4: Decide the Learning Rate "α".*

*Step_5: Update the parameter: w_{1(new)} = w_{1(old)} − α·(gradient), then go back to Step_2 and repeat until convergence.*

(The gradient decreases with each iteration, so the length of each stride also gets shorter.)
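These steps for the one-variable case (intercept fixed at zero, as above) can be sketched as follows; the data points here are made up for illustration:

```python
# Hypothetical data generated from y = 2 * x, so the best w1 is 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w1 = 5.0        # Step 1: initialize w1 anywhere
alpha = 0.01    # Step 4: the learning rate

for _ in range(1000):
    # Steps 2-3: gradient of J(w1) = 1/2 * sum (w1*x - y)^2
    grad = sum((w1 * x - y) * x for x, y in zip(xs, ys))
    w1 = w1 - alpha * grad    # Step 5: move opposite to the gradient
print(round(w1, 3))           # approaches 2.0
```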

**Now, let's generalize the gradient descent algorithm for m+1 parameters.**

y = b + w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + … + w_{m}x_{m}

**(LOSS FUNCTION) J(b, w) = 1/2 Σ_{(i=1 to n)} [ h_{w,b}(x^{(i)}) − y^{(i)} ]^{2}, summed over the n training examples**

Step_1: First we shall randomly initialize b, w_{1}, w_{2}, w_{3} … w_{m}.

Step_2: Use all parameter values and predict h_{w,b}(x^{(i)}) for each data point in the training data set.

Step_3: Calculate Loss J(b, w)

Step_4: Calculate the gradient of J(b, w) w.r.t. each parameter.

Step_5: b_{(new)} = b_{(old)} − α·(gradient)_{b}

w_{1(new)} = w_{1(old)} − α·(gradient)_{w1}

w_{2(new)} = w_{2(old)} − α·(gradient)_{w2}

**Similarly, for the m^{th} term: w_{m(new)} = w_{m(old)} − α·(gradient)_{wm}**

Step_6: Update all parameters b , w_{1} , w_{2} ,w_{3 }, w_{4} , .. w_{m} simultaneously.

Step_7: Go to Step_3.

Step_8: Repeat till convergence.
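A minimal sketch of Steps 1-8 for the general case, in pure Python with made-up two-feature data (generated from y = 1 + 2·x1 + 3·x2):

```python
import random

# Hypothetical training data satisfying y = 1 + 2*x1 + 3*x2
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]]
Y = [9.0, 8.0, 16.0, 15.0]
m = len(X[0])                            # number of features

random.seed(0)
b = random.random()                      # Step 1: random initialization
w = [random.random() for _ in range(m)]
alpha = 0.02

for _ in range(5000):                    # Steps 3-8: iterate to convergence
    preds = [b + sum(wj * xj for wj, xj in zip(w, x)) for x in X]   # Step 2
    errors = [p - y for p, y in zip(preds, Y)]
    grad_b = sum(errors)                                            # Step 4
    grad_w = [sum(e * x[j] for e, x in zip(errors, X)) for j in range(m)]
    b = b - alpha * grad_b                                          # Steps 5-6
    w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
print(round(b, 2), [round(wj, 2) for wj in w])   # approaches 1.0 and [2.0, 3.0]
```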

Now, we have a serious question: until when should we keep updating our parameters and continuing the iterative process? The answer is: until the function converges. Let's see how to check for convergence:

**If the graph of Loss vs parameters is purely convex: this is the simple case, since a convex function converges when the slope (gradient) equals zero.**

So we observe the parameter values across subsequent iterations (or over the past few iterations). If the values are no longer changing significantly, we can stop the training.

**If the Loss function is not purely convex or some tricky kind of function, like the graph below:**

Here we can see that the graph goes almost flat after some time, and we definitely don't want to run the algorithm forever.

**So the trick here is to define a maximum number of steps.** This is necessary because *beyond a certain number of steps the parameter values don't change significantly*. This is how we deal with the convergence part.
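The two stopping rules above (parameter change below a tolerance, or a maximum number of steps) can be combined in a simple loop; the names and toy update here are illustrative:

```python
def train(update_fn, params, max_steps=10000, tol=1e-8):
    """Run update_fn until parameters stop changing or max_steps is reached."""
    for step in range(max_steps):
        new_params = update_fn(params)
        # Converged: no parameter moved by more than tol this iteration
        if all(abs(n - o) < tol for n, o in zip(new_params, params)):
            return new_params, step
        params = new_params
    return params, max_steps    # hit the step budget without converging

# Toy update: one gradient step on J(w) = (w - 3)^2 with learning rate 0.1
final, steps = train(lambda p: [p[0] - 0.1 * 2 * (p[0] - 3)], [0.0])
print(round(final[0], 4), steps)   # converges near 3.0 well before max_steps
```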

__Handling the Gradient Descent part of the calculation:__

The gradient is nothing but a derivative of the loss function with respect to parameter values.

**J(b, w) = 1/2 Σ_{(i=1 to n)} [ h_{w,b}(x^{(i)}) − y^{(i)} ]^{2} (the parameters are w_{1}, w_{2} … w_{m} and b; the sum runs over the n training examples)**

Now calculate the partial derivative of the loss function with respect to a parameter (say w_{1}):

∂J/∂w_{1} = Σ_{(i=1 to n)} [ h_{w,b}(x^{(i)}) − y^{(i)} ] · x_{1}^{(i)}

Hence, in the same way, we can write the general gradient for any j^{th} parameter:

∂J/∂w_{j} = Σ_{(i=1 to n)} [ h_{w,b}(x^{(i)}) − y^{(i)} ] · x_{j}^{(i)}

We also have to find the gradient with respect to the constant term:

∂J/∂b = Σ_{(i=1 to n)} [ h_{w,b}(x^{(i)}) − y^{(i)} ]

Now that we have the general representation of the gradient with respect to all parameters, including the constant, we are in a position to write the most general form of the update, i.e.
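As a sanity check (with made-up single-feature data), the analytic gradients above can be compared against finite-difference approximations of the loss:

```python
X = [[1.0], [2.0], [3.0]]   # hypothetical single-feature data
Y = [2.0, 4.0, 6.0]
b, w = 0.5, 1.0             # arbitrary current parameter values

def loss(b, w):
    return 0.5 * sum((b + w * x[0] - y) ** 2 for x, y in zip(X, Y))

# Analytic gradients from the formulas above: sum of residuals (times x for w)
grad_b = sum((b + w * x[0] - y) for x, y in zip(X, Y))
grad_w = sum((b + w * x[0] - y) * x[0] for x, y in zip(X, Y))

# Numerical check via central differences
h = 1e-6
num_b = (loss(b + h, w) - loss(b - h, w)) / (2 * h)
num_w = (loss(b, w + h) - loss(b, w - h)) / (2 * h)
print(abs(grad_b - num_b) < 1e-4, abs(grad_w - num_w) < 1e-4)  # True True
```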

The above equation is also called Batch Gradient Descent over the n data points, since the batch of all n data points is used together before we update the parameters. One complete gradient-descent update over all data points is called one Epoch.

As we have seen the calculation as well as the intuition of Gradient Descent and the concept of Batch Gradient Descent, we can now easily follow the concepts behind Mini-Batch Gradient Descent and Stochastic Gradient Descent.

__Mini-Batch Gradient Descent : (k data points)__ **Here we take a chunk of k data points and calculate the gradient update.** In each iteration, we use exactly k data points to compute the gradient-descent update.

(Usually, k is taken as a power of 2, such as 32, 64, 512, or 1024.)

Mini-Batch Gradient Descent often works better than Batch Gradient Descent because the parameters are updated after every mini-batch rather than after a full pass over the data.

Now, we shall see a special case of Gradient Descent in which batch size k=1. This is called Stochastic gradient descent.

__Stochastic Gradient Descent (SGD) : (k=1)__ **Here we update the parameters by looking at just a single data point: immediately after training on one data point, the gradient update is applied.**

Step_1: Draw a batch of k examples of data points (X^{(i)}, Y^{(i)}).

Step_2: Randomly Initialize parameters.

Step_3: Repeat Steps 4-6 until convergence.

Step_4: Obtain predictions from the model and calculate the Loss on the batch. (In the stochastic case, the batch consists of only one data point.)

Step_5: Calculate the gradient of the loss with respect to each parameter.

Step_6: Update the parameters simultaneously.
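The steps above can be sketched with a batch-size parameter k, where k = 1 gives stochastic gradient descent; the data and function names here are made up for illustration:

```python
import random

# Hypothetical data generated from y = 3 * x, so the best w is 3
data = [(float(i), 3.0 * i) for i in range(1, 9)]

def minibatch_gd(data, k, alpha=0.01, epochs=200, seed=0):
    """Mini-batch gradient descent on the model y = w * x; k = 1 is SGD."""
    rng = random.Random(seed)
    w = 0.0                                   # Step 2: initialize the parameter
    for _ in range(epochs):                   # Step 3: repeat until convergence
        rng.shuffle(data)
        for start in range(0, len(data), k):  # Step 1: draw k examples
            batch = data[start:start + k]
            # Steps 4-5: gradient of the squared-error loss on this batch
            grad = sum((w * x - y) * x for x, y in batch)
            w -= alpha * grad                 # Step 6: update the parameter
    return w

print(round(minibatch_gd(data, k=2), 3))  # mini-batch: near 3.0
print(round(minibatch_gd(data, k=1), 3))  # stochastic (k=1): also near 3.0
```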

Finally, we have attained our objective of learning gradient descent and understanding the logic behind batch and stochastic gradient descent.

But since all of these are optimization algorithms, i.e. used for finding the best set of parameters, you might be asking: how do we know that our model is actually learning?

Well, in that case, we can plot the graph of Loss vs Iterations. The loss should decrease as the number of iterations increases, just like the graph below. (Note: LR is the learning rate *"α"* we saw earlier.)
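To see this in practice, one can record the loss at every iteration and check that it falls as training proceeds (a toy one-parameter example, not the plot from the article):

```python
w, alpha = 10.0, 0.1
history = []
for _ in range(100):
    loss = (w - 4) ** 2          # hypothetical convex loss, minimum at w = 4
    history.append(loss)         # record the loss before each update
    w -= alpha * 2 * (w - 4)     # gradient-descent step

print(history[0] > history[10] > history[50])  # True: loss keeps decreasing
```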

*The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.*
