How does Backward Propagation Work in Neural Networks?

Neha 08 Jun, 2021


This article was published as a part of the Data Science Blogathon

Introduction

So far, we have dived deep into what a Neural Network is, its structure and components, Gradient Descent and its limitations, how neurons are estimated, and how forward propagation works.

Forward Propagation is the way to move from the Input layer (left) to the Output layer (right) in the neural network. The process of moving from right to left, i.e. backward from the Output to the Input layer, is called Backward Propagation.

Backward Propagation is the preferred method of adjusting or correcting the weights to minimize the loss function. In this article, we shall explore Backward Propagation in detail by understanding how it works mathematically and why it is the preferred method. A word of caution: the article is going to be mathematically heavy, but wait for the end to see how this method looks in action 🙂

Setting up the Base

Let’s say we want to use a neural network to predict house prices. For illustration, we will take a dummy subset of the data with four input variables and five observations, so the input has a dimension of 4*5:

The neural network for this subset data looks like below:

Source: researchgate.net

The architecture of the neural network is [4, 5, 1] with:

4 independent variables, Xs in the input layer
5 nodes in the hidden layer, and
1 node in the output layer, since we have a regression problem at hand.

A Neural Network operates by:

Initializing the weights with some random values, which are mostly between 0 and 1.
Computing the output to calculate the loss or the error term.
Adjusting the weights so as to minimize the loss.

We repeat these three steps until we have reached the optimum solution of the minimum loss function or exhausted the pre-defined epochs (i.e. the number of iterations).

Now, the computation graph after applying the sigmoid activation function is:

In case you are wondering how this equation was arrived at and why the matrices have these dimensions, I request you to read the previous article to understand the mechanism of how neural networks work and are estimated.

Building on this, the first step in Backward Propagation is to calculate the error. In our regression problem, we shall take the loss function = (Y-Y^)²/2, where Y is the actual value and Y^ is the predicted value. For simplicity, we replace Y^ with O, so the error E becomes (Y-O)²/2.

Our goal is to minimize the error, which clearly depends on Y, the actual observed values, and on the output, which in turn depends on the:
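As a small illustration, the error for a single observation can be computed as in the sketch below. The function name and the sample numbers are assumptions made only for this example; they are not taken from the house price dataset.

```python
def half_squared_error(y, o):
    """Return E = (Y - O)^2 / 2 for one observation."""
    return (y - o) ** 2 / 2


# Hypothetical values: actual Y = 1.0 and predicted output O = 0.8
E = half_squared_error(1.0, 0.8)
print(E)  # roughly 0.02
```

The division by 2 is only a convenience: it cancels the 2 that appears when the square is differentiated.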

input values
coefficients or betas of the input variables
biases, the activation function, and
Optimizers

Now, we can change neither the input variables nor the actual Y values; however, we can change the other factors. The activation function and the optimizers are the tuning parameters, and we can change these based on our requirement.

The other two factors: the coefficients or betas of the input variables (W_is) and the biases (b_ho, b_ih) are updated using the Gradient descent algorithm with the following equation:

W_new = W_old – (α * dE/dW)

where,

W_new = the new weight of X_i
W_old = the old weight of the X_i
α = learning rate
dE/dW is the partial derivative of the error with respect to the weight, i.e. the rate of change of the error with respect to a change in the weight.

In the backward propagation, we adjust these weights (or betas) and biases, starting from the output. The weights and biases between the respective input, hidden and output layers we have here are W_ih, b_ih, W_ho, and b_ho:
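The update rule itself is a one-liner. The sketch below applies it once; the function name and the numbers (weight 0.5, gradient 0.2, learning rate 0.1) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def gradient_descent_step(w_old, dE_dW, alpha):
    """Apply W_new = W_old - alpha * dE/dW."""
    return w_old - alpha * dE_dW


# Hypothetical numbers: current weight 0.5, gradient 0.2, learning rate 0.1
w_new = gradient_descent_step(0.5, 0.2, 0.1)
print(w_new)  # roughly 0.48
```

A positive gradient means the error grows as the weight grows, so the weight is moved down; a negative gradient moves it up.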

W_ih: weight between the input and the hidden layer
b_ih: bias between the input and the hidden layer
W_ho: weight between the hidden and the output layer
b_ho: bias between the hidden and the output layer

In the first iteration, we randomly initialize the weights. In the second iteration, we first change the weights of the hidden layer that is closest to the output layer. In this case, we go from the output layer to the hidden layer and then to the input layer.

Contribution of each Weight and Bias on the Error

Now, we have to calculate how much each of these weights (W_is) and biases (b_is) contributes to the error term. For this, we need to calculate the rate of change of the error with respect to the respective weight and bias parameters.

In other words, we need to compute the terms: dE/dW_ih, dE/db_ih, dE/dW_ho, and dE/db_ho. This is not a direct task. It is a series of steps involving the Chain Rule.

The weight, W_ho, between the hidden and the output layer:

From the above graph we can see that the error E is not directly dependent on W_ho:

The error term is dependent on the Output O
Output O is further dependent on Z₂, and
Z₂ is dependent on W_ho

Therefore we employ the chain rule to compute the rate of change in error to the change in weight W_ho and it becomes:

dE/dW_ho = dE/dO * dO/dZ₂ * dZ₂/dW_ho

Now, we take the partial derivatives of each of these individual terms:

E = (Y-O)²/2.

The partial derivative of error with respect to Output is: dE/dO = 2*(Y-O)*(-1)/2 = (O-Y)
The partial derivative of the Output with respect to Z₂, given that O = sigmoid(Z₂), follows from the derivative of the sigmoid:

dO/dZ₂ = sigmoid(Z₂) *(1-sigmoid(Z₂)) = O*(1-O)

The partial derivative of Z₂ with respect to W_ho is:

dZ₂/dW_ho = d(W_ho^T * h₁ + b_ho)/dW_ho

dZ₂/dW_ho = d(W_ho^T * h₁)/dW_ho + d(b_ho)/dW_ho = h₁ + 0 = h₁

Therefore, dE/dW_ho = dE/dO * dO/dZ₂ * dZ₂/dW_ho becomes:

dE/dW_ho = (O-Y) * O*(1-O) * h₁

Similarly, we will calculate the contribution for each of the other parameters in this manner.

For the bias, b_ho, between the hidden and the output layer:

dE/db_ho = dE/dO * dO/dZ₂ * dZ₂/db_ho

dE/db_ho = (O-Y) * O*(1-O) * 1
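One way to gain confidence in these two results is to compare them with a numerical derivative. The sketch below does this for a single hidden value and a single output node; the scalar values and the helper names (sigmoid, error) are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def error(w_ho, b_ho, h1, y):
    """Forward pass through the output node, then E = (Y - O)^2 / 2."""
    o = sigmoid(w_ho * h1 + b_ho)
    return (y - o) ** 2 / 2


# Hypothetical scalar values chosen only for illustration
w_ho, b_ho, h1, y = 0.4, 0.1, 0.7, 1.0

o = sigmoid(w_ho * h1 + b_ho)
dE_dW_ho = (o - y) * o * (1 - o) * h1  # chain-rule result for W_ho
dE_db_ho = (o - y) * o * (1 - o) * 1   # chain-rule result for b_ho

# Central finite differences as an independent check
eps = 1e-6
num_dW_ho = (error(w_ho + eps, b_ho, h1, y) - error(w_ho - eps, b_ho, h1, y)) / (2 * eps)
num_db_ho = (error(w_ho, b_ho + eps, h1, y) - error(w_ho, b_ho - eps, h1, y)) / (2 * eps)

print(abs(dE_dW_ho - num_dW_ho) < 1e-7)  # True
print(abs(dE_db_ho - num_db_ho) < 1e-7)  # True
```

Both gradients are negative here because the output is below the target, so the update will push W_ho and b_ho up.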

The weight, W_ih, between the input and the hidden layer:

From the above graph we can see that the terms are dependent as below:

Error term is dependent on the Output O
Output O is dependent on Z₂
Z₂ this time is dependent on h₁
h₁ is dependent on Z₁, and
Z₁ is dependent on W_ih

dE/dW_ih = dE/dO * dO/dZ₂ * dZ₂/dh₁ * dh₁/dZ₁ * dZ₁/dW_ih

So, this time, apart from dE/dO and dO/dZ₂ computed above, we have the partial derivatives as follows:

The partial derivative of Z₂ with respect to h₁ is:

dZ₂/dh₁ = d(W_ho^T * h₁ + b_ho)/dh₁

dZ₂/dh₁ = d(W_ho^T * h₁)/dh₁ + d(b_ho)/dh₁ = W_ho + 0 = W_ho

The partial derivative of h₁ with respect to Z₁, given that h₁ = sigmoid(Z₁), follows from the derivative of the sigmoid:

dh₁/dZ₁ = sigmoid(Z₁) *(1-sigmoid(Z₁)) = h₁* (1 – h₁)

The partial derivative of Z₁ with respect to W_ih is:

dZ₁/dW_ih = d(W_ih^T * X + b_ih)/dW_ih

dZ₁/dW_ih = d(W_ih^T * X)/dW_ih + d(b_ih)/dW_ih = X + 0 = X

Hence, the equation after plugging in the partial derivative values is:

dE/dW_ih = dE/dO * dO/dZ₂ * dZ₂/dh₁ * dh₁/dZ₁ * dZ₁/dW_ih

dE/dW_ih = (O-Y) * O*(1-O) * W_ho * h₁(1-h₁) * X

The bias, b_ih, between the input and the hidden layer:

dE/db_ih = dE/dO * dO/dZ₂ * dZ₂/dh₁ * dh₁/dZ₁ * dZ₁/db_ih

dE/db_ih = (O-Y) * O*(1-O) * W_ho * h₁(1-h₁) * 1
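The longer chain for the input-to-hidden parameters can be checked numerically in the same spirit. This is a minimal sketch for one input, one hidden node and one output node; all numbers and helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def error(w_ih, b_ih, w_ho, b_ho, x, y):
    """Forward pass through both layers, then E = (Y - O)^2 / 2."""
    h1 = sigmoid(w_ih * x + b_ih)
    o = sigmoid(w_ho * h1 + b_ho)
    return (y - o) ** 2 / 2


# Hypothetical scalar values chosen only for illustration
w_ih, b_ih, w_ho, b_ho, x, y = 0.3, -0.2, 0.4, 0.1, 1.5, 1.0

h1 = sigmoid(w_ih * x + b_ih)
o = sigmoid(w_ho * h1 + b_ho)

# Part of the chain shared by W_ih and b_ih
common = (o - y) * o * (1 - o) * w_ho * h1 * (1 - h1)
dE_dW_ih = common * x
dE_db_ih = common * 1

# Central finite differences as an independent check
eps = 1e-6
num_dW_ih = (error(w_ih + eps, b_ih, w_ho, b_ho, x, y)
             - error(w_ih - eps, b_ih, w_ho, b_ho, x, y)) / (2 * eps)
num_db_ih = (error(w_ih, b_ih + eps, w_ho, b_ho, x, y)
             - error(w_ih, b_ih - eps, w_ho, b_ho, x, y)) / (2 * eps)

print(abs(dE_dW_ih - num_dW_ih) < 1e-7)  # True
print(abs(dE_db_ih - num_db_ih) < 1e-7)  # True
```

Note that the two gradients differ only in their last factor (X versus 1), which is why the shared part is computed once.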

Now that we have computed these terms, we can update the parameters using the following respective update equations:

W_ih = W_ih – (α * dE/dW_ih)
b_ih = b_ih – (α * dE/db_ih)
W_ho = W_ho – (α * dE/dW_ho)
b_ho = b_ho – (α * dE/db_ho)
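Putting the four gradients and the four update equations together gives a complete training loop. The sketch below repeats the forward pass, the backward pass and the update for a single observation with one hidden node; the starting values, the learning rate of 0.5 and the 100 iterations are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical single observation and starting parameters
x, y = 1.5, 1.0
w_ih, b_ih, w_ho, b_ho = 0.3, -0.2, 0.4, 0.1
alpha = 0.5  # learning rate (assumed)

errors = []
for _ in range(100):
    # Forward propagation
    h1 = sigmoid(w_ih * x + b_ih)
    o = sigmoid(w_ho * h1 + b_ho)
    errors.append((y - o) ** 2 / 2)

    # Backward propagation: all gradients use the current (old) parameters
    delta_o = (o - y) * o * (1 - o)           # dE/dO * dO/dZ2
    delta_h = delta_o * w_ho * h1 * (1 - h1)  # ... * dZ2/dh1 * dh1/dZ1
    dE_dW_ho, dE_db_ho = delta_o * h1, delta_o
    dE_dW_ih, dE_db_ih = delta_h * x, delta_h

    # Gradient descent updates
    w_ih -= alpha * dE_dW_ih
    b_ih -= alpha * dE_db_ih
    w_ho -= alpha * dE_dW_ho
    b_ho -= alpha * dE_db_ho

print(errors[0], errors[-1])  # the last error is smaller than the first
```

All four gradients are computed before any parameter is changed, so that the hidden-layer gradient still sees the old value of W_ho.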

Now, moving to another method to perform backward propagation …

Matrix Form of the Backward Propagation

The backward propagation can also be solved in the matrix form. The computation graph for the structure along with the matrix dimensions is:

Z₁ = W_ih^T * X + b_ih

where,

W_ih is the weight matrix between the input and the hidden layer with the dimension of 4*5
W_ih^T is the transpose of W_ih, having shape 5*4
X is the input variables having dimension 4*5, and
b_ih is the bias term, which takes a single value here as we consider the same bias for all the neurons.

Z₂ = W_ho^T * h₁ + b_ho

where,

W_ho is the weight matrix between the hidden and the output layer with shape 5*1
W_ho^T is the transpose of W_ho, having a dimension of 1*5
h₁ is the result of applying the activation function to the outcome of the hidden layer, with a shape of 5*5, and
b_ho is the bias term, which takes a single value here as we consider the same bias for all the neurons.
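The two equations and their dimensions can be reproduced with NumPy. In this sketch the data and weights are random placeholders rather than the article's dataset, and the bias values of 0.0 are an assumption; only the shapes follow the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

X = rng.random((4, 5))     # 4 input variables x 5 observations (placeholder data)
W_ih = rng.random((4, 5))  # input -> hidden weights
W_ho = rng.random((5, 1))  # hidden -> output weights
b_ih, b_ho = 0.0, 0.0      # one shared bias per layer; starting value assumed

Z1 = W_ih.T @ X + b_ih     # (5x4) . (4x5) -> 5x5
h1 = sigmoid(Z1)           # 5x5
Z2 = W_ho.T @ h1 + b_ho    # (1x5) . (5x5) -> 1x5
O = sigmoid(Z2)            # 1x5, one prediction per observation

print(Z1.shape, h1.shape, Z2.shape, O.shape)  # (5, 5) (5, 5) (1, 5) (1, 5)
```

Because each bias is a single number, NumPy broadcasts it over every entry of the matrix it is added to.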

To summarize, the four equations of the rate of change of error with the different parameters are:

dE/dW_ho = dE/dO * dO/dZ₂ * dZ₂/dW_ho= (O-Y) * O*(1-O) * h₁

dE/db_ho = dE/dO * dO/dZ₂ * dZ₂/db_ho= (O-Y) * O*(1-O) * 1

dE/dW_ih = dE/dO * dO/dZ₂ * dZ₂/dh₁ * dh₁/dZ₁ * dZ₁/dW_ih = (O-Y) * O*(1-O) * W_ho * h₁(1-h₁) * X

dE/db_ih = dE/dO * dO/dZ₂ * dZ₂/dh₁ * dh₁/dZ₁ * dZ₁/db_ih = (O-Y) * O*(1-O) * W_ho * h₁(1-h₁) * 1

Now, let’s see how we can perform matrix multiplication on each of these equations, starting with the weight matrix between the hidden and the output layer, W_ho.

Let us understand why the shape of W_ho must be the same as the shape of dE/dW_ho, which is used to update the weight in the following equation:

W_ho = W_ho – (α * dE/dW_ho)

We saw above that dE/dW_ho is computed using the chain rule, with the result:

dE/dW_ho = dE/dO * dO/dZ₂ * dZ₂/dW_ho

dE/dW_ho = (O-Y) * O*(1-O) * h₁

Breaking this equation into its individual components, we see each part’s dimension:

dE/dO = (O-Y) as both O and Y have the same shape of 1*5. Hence, dE/dO is of dimension 1*5.

dO/dZ₂ = O*(1-O) having a shape of 1*5, and

dZ₂/dW_ho = h₁, which is of the shape 5*5

Now, we perform matrix multiplication on this equation. As we know, matrix multiplication is possible only when the number of columns of the first matrix equals the number of rows of the second matrix. Where this rule is violated, we take the transpose of one of the matrices to conduct the multiplication.

On applying this our equation takes the form of:

dE/dW_ho = dZ₂/dW_ho . [dE/dO * dO/dZ₂] ^T

dE/dW_ho = (5X5) . [(1X5) *(1X5)]^T

dE/dW_ho = (5X5) . (5X1) = 5X1

Therefore, the shape of dE/dW_ho 5*1 is the same as that of W_ho 5*1 which will be updated using the Gradient Descent update equation.
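The shape argument above can be verified directly. In this sketch h1, O and Y are filled with random placeholder values, since only their dimensions matter here.

```python
import numpy as np

rng = np.random.default_rng(1)

h1 = rng.random((5, 5))  # dZ2/dW_ho, shape 5x5 (placeholder values)
O = rng.random((1, 5))   # predictions, shape 1x5 (placeholder values)
Y = rng.random((1, 5))   # actual values, shape 1x5 (placeholder values)

dE_dO = O - Y            # 1x5
dO_dZ2 = O * (1 - O)     # 1x5, element-wise product

# (5x5) . [(1x5) * (1x5)]^T = (5x5) . (5x1) -> 5x1
dE_dW_ho = h1 @ (dE_dO * dO_dZ2).T

print(dE_dW_ho.shape)  # (5, 1)
```

Here `*` is NumPy's element-wise product and `@` is the matrix product, matching the two operators used in the equation.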

In the same manner, we can perform the backward propagation for the other parameters using matrix multiplication, and the respective equations will be:

dE/dW_ho = dZ₂/dW_ho . [dE/dO * dO/dZ₂] ^T

dE/db_ho = dZ₂/db_ho . [dE/dO * dO/dZ₂ ]^T

dE/dW_ih = dZ₁/dW_ih . [dh₁/dZ₁ * dZ₂/dh₁. (dE/dO * dO/dZ₂)] ^T

dE/db_ih = dZ₁/db_ih . [dh₁/dZ₁ * dZ₂/dh₁. (dE/dO * dO/dZ₂)] ^T

where (.) is the dot product and * is the element-wise product.
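The four matrix equations can be written out in NumPy and checked against a numerical derivative. This is a sketch under stated assumptions: the data and weights are random placeholders, the error is summed over the five observations, and, because each bias is a single shared value, its gradient is taken as the sum over all entries (which is what multiplying by a matrix of ones amounts to).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W_ih, b_ih, W_ho, b_ho):
    h1 = sigmoid(W_ih.T @ X + b_ih)
    O = sigmoid(W_ho.T @ h1 + b_ho)
    return h1, O


def total_error(X, Y, W_ih, b_ih, W_ho, b_ho):
    """Sum of (Y - O)^2 / 2 over all observations (an assumption of this sketch)."""
    _, O = forward(X, W_ih, b_ih, W_ho, b_ho)
    return np.sum((Y - O) ** 2) / 2


rng = np.random.default_rng(42)
X, Y = rng.random((4, 5)), rng.random((1, 5))         # placeholder data
W_ih, W_ho = rng.random((4, 5)), rng.random((5, 1))   # placeholder weights
b_ih, b_ho = 0.1, 0.1                                 # assumed shared biases

h1, O = forward(X, W_ih, b_ih, W_ho, b_ho)
delta_o = (O - Y) * O * (1 - O)             # dE/dO * dO/dZ2, shape 1x5
delta_h = h1 * (1 - h1) * (W_ho @ delta_o)  # dh1/dZ1 * (dZ2/dh1 . delta_o), 5x5

dE_dW_ho = h1 @ delta_o.T   # 5x1, same shape as W_ho
dE_db_ho = np.sum(delta_o)  # scalar, shared bias
dE_dW_ih = X @ delta_h.T    # 4x5, same shape as W_ih
dE_db_ih = np.sum(delta_h)  # scalar, shared bias

# Finite-difference check on one entry of W_ih
eps = 1e-6
W_plus, W_minus = W_ih.copy(), W_ih.copy()
W_plus[2, 3] += eps
W_minus[2, 3] -= eps
numeric = (total_error(X, Y, W_plus, b_ih, W_ho, b_ho)
           - total_error(X, Y, W_minus, b_ih, W_ho, b_ho)) / (2 * eps)

print(dE_dW_ho.shape, dE_dW_ih.shape)            # (5, 1) (4, 5)
print(abs(numeric - dE_dW_ih[2, 3]) < 1e-7)      # True
```

Each gradient has the same shape as the parameter it updates, which is the property the transposes in the equations above are there to guarantee.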

Endnotes

To summarize, as promised, below is a very cool gif that shows how backward propagation reaches the solution by minimizing the loss function or error:

Source: 7-hiddenlayers.com

Backward Propagation is the preferred method for adjusting the weights and biases since it converges faster as we move from the output to the hidden layer. Here, we change the weights of the hidden layer that is closest to the output layer, re-calculate the loss, and, if the error needs to be reduced further, repeat the entire process, moving in that order towards the input layer.

In forward propagation, by contrast, the pecking order is from the input layer to the hidden layer and then to the output layer, which takes more time to converge to the optimum solution of the minimum loss function.

I hope the article was helpful to show how backward propagation works. You may reach out to me on my LinkedIn: linkedin.com/in/neha-seth-69771111

Thank You. Happy Learning! 🙂

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Hi there! I am Neha Seth. I work as a Data Scientist in Larsen & Toubro Infotech (LTI). I hold a Postgraduate Program in Data Science & Engineering from the Great Lakes Institute of Management and a Bachelors in Statistics. I have been featured as Top 10 Most Popular Guest Authors in 2020 on Analytics Vidhya (AV). My area of interest lies in NLP and Deep Learning. I have also passed the CFA Program. You can reach out to me on LinkedIn and can read my other blogs for AV.
