Introduction
We have dived deep into what is a Neural Network, its structure and components, Gradient Descent, its limitations and how are neurons estimated, and the working of the forward propagation.
Forward Propagation is the way to move from the Input layer (left) to the Output layer (right) in the neural network. The process of moving from the right to left i.e backward from the Output to the Input layer is called the Backward Propagation.
Backward Propagation is the preferable method of adjusting or correcting the weights to reach the minimized loss function. In this article, we shall explore this second technique of Backward Propagation in detail by understanding how it works mathematically, why it is the preferred method. A caution, the article is going to be a mathematically heavy one but wait for the end to see how this method looks in action 🙂
Table of Contents
 Setting up the Base
 Contribution of each Weight and Bias on the Error
 Matrix Form of the Backward Propagation
Setting up the Base
Let’s say we want to use the neural network to predict house prices. For our understanding purpose here, we will take a subset dummy dataset having four input variables and six observations here with the input having a dimension of 4*5:
The neural network for this subset data looks like below:
Source: researchgate.net
The architecture of the neural network is [4, 5, 1] with:
 4 independent variables, Xs in the input layer
 5 nodes in the hidden layer, and
 Since we have a regression problem at hand, we will have one node in the output layer.
A Neural Network operates by:
 Initializing the weights with some random values, which are mostly between 0 and 1.
 Compute the output to calculate the loss or the error term.
 Then, adjust the weights so that to minimize the loss.
We repeat these three steps until have reached the optimum solution of the minimum loss function or exhausted the predefined epochs (i.e. the number of iterations).
Now, the computation graph after applying the sigmoid activation function is:
In case you are wondering how and from where this equation arrived and why there will be matrix dimensions then request you to read the previous article to understand the mechanism of how neural networks work and are estimated.
Building on this, the first step in Backward Propagation to calculate the error. In our regression problem, we shall take the loss function = (YY^)^{2}/2 where Y is actual values and Y^ is predicted values. For simplicity, replacing Y^ with O, so the error E becomes = (YO)^{2}/2.
Our goal is to minimize the error that is clearly dependent on Y, which is the actual observed values, and on the output, which is further is dependent on the:
 input values
 coefficients or betas of the input variables
 biases, the activation function, and
 Optimizers
Now, we can neither change the input variables nor the actual Y values however, we can change the other factors. The activation function and the optimizers are the tuning parameters – and we can change these based on our requirement.
The other two factors: the coefficients or betas of the input variables (W_{i}s) and the biases (b_{ho}, b_{ih}) are updated using the Gradient descent algorithm with the following equation:
W_{new} = W_{old} – (α * dE/dW)
where,
 W_{new} = the new weight of X_{i}
 W_{old} = the old weight of the X_{i}
 α = learning rate
 dE/dW is the partial derivative of the error for each of the Xs. It is the rate of change of the error to the change in weight.
In the backward propagation, we adjust these weights or the betas in the output. The weights and biases between the respective input, hidden and output layers we have here are W_{ih}, b_{ih}, W_{ho}, and b_{ho}:
 W_{ih}: weight between the input and the hidden layer
 b_{ih}: bias between the input and the hidden layer
 W_{ho}: weight between the hidden and the output layer
 b_{ho}: bias between the hidden and the output layer
In the first iteration, we randomly initialize the weights. In the second iteration, we change the weights of the hidden layer that is closest to the output layer. In this case, we go from the output layer, hidden layer, and then to the input layer.
Contribution of each Weight and Bias on the Error
Now, we have to calculate how much each of these weights (W_{i}s) and biases (b_{i}s) contribute to the error term. For this, we need to calculate the rate of change of error to the respective weights and bias parameters.
In other words, we need to compute the terms: dE/dW_{ih}, dE/db_{ih}, dE/dW_{ho}, and dE/db_{ho}. This is not a direct task. It is a series of steps involving the Chain Rule.
The weight, W_{ho}, between the hidden and the output layer:
From the above graph we can see that the error E is not directly dependent on the Who:
 The error term is dependent on the Output O
 Output O is further dependent on Z_{2}, and
 Z_{2} is dependent on W_{ho}
Therefore we employ the chain rule to compute the rate of change in error to the change in weight W_{ho} and it becomes:
dE/dW_{ho} = dE/dO * dO/dZ_{2} * dZ_{2}/dW_{ho}
Now, we take the partial derivatives of each of these individual terms:
E = (YO)^{2}/2.

The partial derivative of error with respect to Output is: dE/dO = 2*(YO)*(1)/2 = (OY)

The partial derivative of Output with respect to Z_{2}, as output O = Sigmoid of Z_{2} and the derivative of sigmoid is:
dO/dZ_{2} = sigmoid(Z_{2}) *(1sigmoid(Z_{2})) = O*(1O)

The partial derivative of Z_{2} with respect to W_{ho} is:
dZ_{2}/dW_{ho} = d(W_{ho}^{T} * h_{1} + b_{h0})/dW_{ho }
dZ_{2}/dW_{ho} = d(W_{ho}^{T} * h_{1})/dW_{ho} + d(b_{ho}/W_{ho}) = h_{1} + 0 = h_{1}
Therefore, dE/dW_{ho} = dE/dO * dO/dZ_{2} * dZ_{2}/dW_{ho} becomes:
dE/dW_{ho} = (OY) * O*(1O) * h_{1}
Similarly, we will calculate the contribution for each of the other parameters in this manner.
For the bias, b_{ho}, between the hidden and the output layer:
dE/db_{ho} = dE/dO * dO/dZ_{2} * dZ_{2}/db_{ho}
dE/db_{ho} = (OY) * O*(1O) * 1
The weight, W_{ih}, between the input and the hidden layer:
From the above graph we can see that the terms are dependent as below:
 Error term is dependent on the Output O
 Output O is dependent on Z_{2}
 Z_{2} this time is dependent on h_{1}
 h_{1} is dependent on Z_{1}, and
 Z_{1} is dependent on W_{ih}
dE/dW_{ih} = dE/dO * dO/dZ_{2} * dZ_{2}/dh_{1} * dh_{1}/dZ_{1} * dZ_{1}/dW_{ih}
So, this time, apart from the initial above dE/dO, dO/dZ_{2}, we have the partial derivatives as follow:

The partial derivative of Z_{2} with respect to h_{1} is:
dZ_{2}/dh_{1} = d(W_{ho}^{T} * h_{1} + b_{ho})/dh_{1}
dZ_{2}/dh_{1} = d(W_{ho}^{T} * h_{1})/dh_{1} + d(b_{ho}/h_{1}) = W_{ho} + 0 = W_{ho}

The partial derivative of h_{1} with respect to Z_{1}, as h_{1} = Sigmoid of Z_{1} and the derivative of sigmoid is:
dh_{1}/dZ_{1} = sigmoid(Z_{1}) *(1sigmoid(Z_{1})) = h_{1}* (1 – h_{1})

The partial derivative of Z_{1} with respect to W_{ih} is: X
dZ_{1}/dW_{ih} = d(W_{ih}^{T} * X + b_{ih})/dW_{ih }
dZ_{1}/dW_{ih} = d(W_{ih}^{T} * X)/dW_{ih} + d(b_{ih}/W_{ih}) = X + 0 = X
Hence, the equation after plugging the partial derivative value is:
dE/dW_{ih} = dE/dO * dO/dZ_{2} * dZ_{2}/dh_{1} * dh_{1}/dZ_{1} * dZ_{1}/dW_{ih}
dE/dW_{ih} = (OY) * O*(1O) * W_{ho} * h_{1}(1h_{1}) * X
The bias, b_{ih}, between the input and the hidden layer:
dE/db_{ih} = dE/dO * dO/dZ_{2} * dZ_{2}/dh_{1} * dh_{1}/dZ_{1} * dZ_{1}/db_{ih}
dE/db_{ih} = (OY) * O*(1O) * W_{ho} * h_{1}(1h_{1}) * 1
Now, that we have computed these terms we can update the parameters using the following respective update equations:
 W_{ih} = W_{ih} – (α * dE/dW_{ih})
 b_{ih} = b_{ih} – (α * dE/db_{ih})
 W_{ho} = W_{ho} – (α * dE/dW_{ho})
 b_{ho} = b_{ho} – (α * dE/db_{ho})
Now, moving to another method to perform backward propagation …
Matrix Form of the Backward Propagation
The backward propagation can also be solved in the matrix form. The computation graph for the structure along with the matrix dimensions is:
Z_{1} = W_{ih}^{T} * X + b_{ih}
where,
 W_{ih} is the weight matrix between the input and the hidden layer with the dimension of 4*5
 W_{ih}^{T}, is the transpose of W_{ih}, having shape 5*4
 X is the input variables having dimension 4*5, and
 b_{ih} is a bias term, has a single value here as considering the same for all the neurons.
Z_{2} = W_{ho}^{T} * h_{1} + b_{ho}
where,
 W_{ho} is the weight matrix between the hidden and the output layer with shape 5*1
 W_{ho}^{T}, is the transpose of Who having a dimension of 1*5
 h_{1} is the result after the applying activation function on the outcome from the hidden layer with a shape of 5*5, and
 b_{ho} is the bias term, has a single value here as considering the same for all the neurons.
To summarize, the four equations of the rate of change of error with the different parameters are:
dE/dW_{ho} = dE/dO * dO/dZ_{2} * dZ_{2}/dW_{ho }= (OY) * O*(1O) * h_{1}
dE/db_{ho} = dE/dO * dO/dZ_{2} * dZ_{2}/db_{ho }= (OY) * O*(1O) * 1
dE/dW_{ih} = dE/dO * dO/dZ_{2} * dZ_{2}/dh_{1} * dh_{1}/dZ_{1} * dZ_{1}/dW_{ih} = (OY) * O*(1O) * W_{ho} * h_{1}(1h_{1}) * X
dE/db_{ih} = dE/dO * dO/dZ_{2} * dZ_{2}/dh_{1} * dh_{1}/dZ_{1} * dZ_{1}/db_{ih} = (OY) * O*(1O) * W_{ho} * h_{1}(1h_{1}) * 1
Now, lets’ see how we can perform matrix multiplication on each of these equations. For the weight matrix between the hidden and the output layer, W_{ho}.
Let us understand how the shape of this W_{ho} must be similar to that of the shape of dE/dW_{ho}, which is to used to update the weight in the following equation:
W_{ho} = W_{ho} – (α * dE/dW_{ho})
We saw above that dE/dW_{ho} is computed using the chain rule and is of the result:
dE/dW_{ho} = dE/dO * dO/dZ_{2} * dZ_{2}/dW_{ho}
dE/dW_{ho} = (OY) * O*(1O) * h_{1}
Breaking the individual components of this above equation we see each part’s dimension:
dE/dO = (OY) as both O and Y have the same shape of 1*5. Hence, dE/dO is of dimension 1*5.
dO/dZ_{2} = O*(1O) having a shape of 1*5, and
dZ_{2}/dW_{ho} = h_{1}, which is of the shape 5*5
Now, performing matrix multiplication on this equation. As we know, matrix multiplication can be done when the number of columns of the first matrix must be equal to the number of rows of the second matrix. Where this matrix multiplication rule defies, we will take the transpose of one of the matrices to conduct the multiplication.
On applying this our equation takes the form of:
dE/dW_{ho} = dZ_{2}/dW_{ho} . [dE/dO * dO/dZ_{2}] ^{T }
dE/dW_{ho} = (5X5) . [(1X5) *(1X5)]^{T}
dE/dW_{ho} = (5X5) . (5X1) = 5X1
Therefore, the shape of dE/dW_{ho} 5*1 is the same as that of W_{ho} 5*1 which will be updated using the Gradient Descent update equation.
In the same manner, we can find perform the backward propagation for the other parameters using matrix multiplication and the respective equations will be:
dE/dW_{ho} = dZ_{2}/dW_{ho} . [dE/dO * dO/dZ_{2}] ^{T }
dE/db_{ho} = dZ_{2}/db_{ho} . [dE/dO * dO/dZ_{2} ]^{T }
dE/dW_{ih} = dZ_{1}/dW_{ih} . [dh_{1}/dZ_{1} * dZ_{2}/dh_{1}. (dE/dO * dO/dZ_{2})] ^{T }
dE/db_{ih} = dZ_{1}/db_{ih} . [dh_{1}/dZ_{1} * dZ_{2}/dh_{1}. (dE/dO * dO/dZ_{2})] ^{T }
Where, (.) dot is the dot product and * is the element wise product.
Endnotes
To summarize, as promised, below is a very cool gif that shows how backward propagation operates in reaching to the solution by minimizing the loss function or error:
Source: 7hiddenlayers.com
Backward Propagation is the preferred method for adjusting the weights and biases since it is faster to converge as we move from output to the hidden layer. Here, we change the weights of the hidden layer that is closest to the output layer, recalculate the loss and if further need to reduce the error then repeat the entire process and in that order move towards the input layer.
Whereas in the forward propagation, the pecking order is from the input layer, hidden, and then to the output layer which takes more time to converge to the optimum solution of the minimum loss function.
I hope the article was helpful to show how backward propagation works. You may reach out to me on my LinkedIn: linkedin.com/in/nehaseth69771111
Thank You. Happy Learning! 🙂
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
You can also read this article on our Mobile APP