Introduction
In the previous article, we looked at what are Neurons, what is a Neural Network, how it is structured, what are the Optimizers and their relevance in neural networks. In this article, we will dive deep into how the neurons or the nodes for the Artificial Neural Network are estimated.
Also, we will see what are the techniques available to update the weights related to these nodes to converge faster to the minimum loss function.
Table of Contents
 Estimation of Neurons or Nodes
 Squashing of Neural Net
 Forward Propagation
Estimation of Neurons or Nodes
Let’s start with a binary classification problem where we want to classify whether the customer will churn or not churn. We will use a small dummy data for our understanding purpose with four input variables and eight observations.
It has the following neural net with an architecture of [4, 5, 3, 2] and is depicted below:
 4 independent variables or the Xs in the input layer, L1
 5 neurons in the first hidden layer, L2
 3 neurons in the second hidden layer, L3, and
 2 in the output layer L4 with two nodes, Q_{1} and Q_{2}.
For our purpose here, I will refer to the neurons in Hidden Layer L2 as N_{1}, N_{2}, N_{3}, N_{4}, N_{5} and N_{6}, N_{7}, N_{8} in the Hidden Layer L3, respectively in the linear order of their occurrence. In a classification problem, the output layer can either have one node or can have nodes equivalent to the number of the classes or the categories.
This network here is called Fully Connected Network (FNN) or Dense Network since every neuron has a connection with the node of the previous layer output. It is also known as the Feedforward Neural Network or Sequential Neural Network.
Source: researchgate.net
The equation for the neural network is a linear combination of the independent variables and their respective weights and bias (or the intercept) term for each neuron. The neural network equation looks like this:
Z = Bias + W_{1}X_{1} + W_{2}X_{2} + …+ W_{n}X_{n}
where,
 Z is the symbol for denotation of the above graphical representation of ANN.
 Wis, are the weights or the beta coefficients
 Xis, are the independent variables or the inputs, and
 Bias or intercept = W_{0}
There are three steps to perform in any neural network:

We take the input variables and the above linear combination equation of Z = W_{0} + W_{1}X_{1} + W_{2}X_{2} + …+ W_{n}X_{n} to compute the output or the predicted Y values, called the Y_{pred}.

Calculate the loss or the error term. The error term is the deviation of the actual values from the predicted values.

Minimize the loss function or the error term.
Firstly, we will understand how to calculate the output for a neural network and then will see the approaches that can help to converge to the optimum solution of the minimum error term.
The output layer nodes are dependent on their immediately preceding hidden layer L3, which is coming from the hidden layer 2 and those nodes are further derived from the input variables. These middle hidden layers create the features that the network automatically creates and we don’t have to explicitly derive those features. In this manner, the features are generated in Deep Learning models and this is what makes them stand out from Machine Learning.
So, to compute the output, we will have to calculate for all the nodes in the previous layers. Let us understand what is the mathematical explanation behind any kind of neural nets.
Now, as from the above architecture, we can see that each neuron cannot have the same general equation for the output as the above one. We will have one such equation per neuron both for the hidden and the output layer.
The nodes in the hidden layer L2 are dependent on the Xs present in the input layer therefore, the equation will be the following:
 N_{1} = W_{11}*X_{1 }+ W_{12}*X_{2} + W_{13}*X_{3} + W_{14}*X_{4} + W_{10}
 N_{2} = W_{21}*X_{1}+ W_{22}*X_{2} + W_{23}*X_{3} + W_{24}*X_{4} + W_{20}
 N_{3} = W_{31}*X_{1}+ W_{32}*X_{2} + W_{33}*X_{3} + W_{34}*X_{4} + W_{30}
 N_{4} = W_{41}*X_{1}+ W_{42}*X_{2} + W_{43}*X_{3} + W_{44}*X_{4} + W_{40}
 N_{5} = W_{51}*X_{1}+ W_{52}*X_{2} + W_{53}*X_{3} + W_{54}*X_{4} + W_{50}
Similarly, the nodes in the hidden layer L3 are derived from the neurons in the previous hidden layer L2, hence their respective equations will be:
 N_{5} = W_{51} * N_{1} + W_{52} * N_{2} + W_{53} * N_{3} + W_{54} * N_{4} + W_{55} * N_{5} + W_{50}
 N_{6} = W_{61} * N_{1} + W_{62} * N_{2} + W_{63} * N_{3} + W_{64} * N_{4} + W_{65} * N_{5} + W_{60}
 N_{7} = W_{71} * N_{1} + W_{72} * N_{2} + W_{73} * N_{3} + W_{74} * N_{4} + W_{75} * N_{5} + W_{70}
The output layer nodes are coming from the hidden layer L3 which makes the equations as:
 O_{1} = WO_{11} * N_{5} + WO_{12} * N_{6} + WO_{13} * N_{7} + WO_{10}
 O_{2} = WO_{21} * N_{5} + WO_{22} * N_{6} + WO_{23} * N_{7} + WO_{20}
Now, how many weights or betas will be needed to estimate to reach the output? On counting all the weights Wis in the above equation will get 51. However, no real model will have only three input variables to start with!
Additionally, the neurons and the hidden layers themselves are the tuning parameters so in that case, how will we know how many weights to estimate to calculate the output? Is there an efficient way than the manual counting approach to know the number of weights needed? The weights here are referred to the beta coefficients of the input variables along with the bias term as well (and the same will be followed in the rest of the article).
The structure of the network is 4,5,3,2. The number of weights for the hidden layer L2 would be determined as = (4 + 1) * 5 = 25, where 5 is the number of neurons in L2 and there are 4 input variables in L1. Each of the input Xs will have a bias term which makes it 5 bias terms, which we can also say as (4 + 1) = 5.
Therefore, the number of weights for a particular layer is computed by taking the product of (number of nodes/variables + bias term of each node) of the previous layer and the number of neurons in the next layer
Similarly, the number of weight for the hidden layer L3 = (5 + 1) * 3 = 18 weights, and for the output layer the number of weights = (3 + 1) * 2 = 8.
The total number of weights for this neural network is the sum of the weights from each of the individual layers which is = 25 + 18 + 8 = 51
We now know how many weights will we have in each layer and these weights from the above neuron equations can be represented in the matrix form as well. Each of the weights of the layers will take the following form:
Hidden Layer L2 will have a 5 * 5 matrix as seen the number of weights is (4 + 1) * 5:
 N_{1} = W_{11}*X_{1 }+ W_{12}*X_{2} + W_{13}*X_{3} + W_{14}*X_{4} + W_{10}
 N_{2} = W_{21}*X_{1}+ W_{22}*X_{2} + W_{23}*X_{3} + W_{24}*X_{4} + W_{20}
 N_{3} = W_{31}*X_{1}+ W_{32}*X_{2} + W_{33}*X_{3} + W_{34}*X_{4} + W_{30}
 N_{4} = W_{41}*X_{1}+ W_{42}*X_{2} + W_{43}*X_{3} + W_{44}*X_{4} + W_{40}
 N_{5} = W_{51}*X_{1}+ W_{52}*X_{2} + W_{53}*X_{3} + W_{54}*X_{4} + W_{50}
A 3*6 matrix for the hidden layer L3 having the number of weights as (5 + 1) * 3 = 18
 N_{5} = W_{51} * N_{1} + W_{52} * N_{2} + W_{53} * N_{3} + W_{54} * N_{4} + W_{55} * N_{5} + W_{50}
 N_{6} = W_{61} * N_{1} + W_{62} * N_{2} + W_{63} * N_{3} + W_{64} * N_{4} + W_{65} * N_{5} + W_{60}
 N_{7} = W_{71} * N_{1} + W_{72} * N_{2} + W_{73} * N_{3} + W_{74} * N_{4} + W_{75} * N_{5} + W_{70}
Lastly, the output layer would be 4*2 matrix with (3 + 1) * 2 number of weights:
 O_{1} = WO_{11} * N_{5} + WO_{12} * N_{6} + WO_{13} * N_{7} + WO_{10}
 O_{2} = WO_{21} * N_{5} + WO_{22} * N_{6} + WO_{23} * N_{7} + WO_{20}
Okay, so now we know how many weights we need to compute for the output but then how do we calculate the weights? In the first iteration, we assign randomized values between 0 and 1 to the weights. In the following iterations, these weights are adjusted to converge at the optimal minimized error term.
We are so persistent about minimizing the error because the error tells how much our model deviates from the actual observed values. Therefore, to improve the predictions, we constantly update the weights so that loss or error is minimized.
This adjustment of weights is also called the correction of the weights. There are two methods: Forward Propagation and Backward Propagation to correct the betas or the weights to reach the convergence. We will go into the depth of each of these techniques; however, before that lets’ close the loop of what the neural net does after estimating the betas.
Squashing the Neural Net
The next step on the ladder of computation of output is to apply a transformation on these linear equations. As we have a neural net related to classification at hand, how will this linear equation be applied in the cases when the output is to be categorized in classes?
For a binary classification problem, we know that Sigmoid is needed to transform the linear equation to a nonlinear equation. In case you are not sure why we use Sigmoid to transform a linear equation to a nonlinear equation, then would suggest refreshing the logistic regression.
For a particular node, the transformation is as follows:
N_{1} = W_{11}*X_{1 }+ W_{12}*X_{2} + W_{13}*X_{3} + W_{14}*X_{4} + W_{10}
After implementing the Sigmoid transformation, it becomes:
h_{1} = sigmoid(N_{1})
where,
sigmoid(N_{1}) = exp^{(W11*X1 + W12*X2 + W13*X3 + W14*X4 + W10)}/(1+ exp^{(W11*X1 + W12*X2 + W13*X3 + W14*X4 + W10)})
This alteration is applied on the hidden layers and the output layers and is known as the Activation or the Squashing Function. The transformation is required to bring nonlinearity to the network as every business problem may not be solved linearly.
Source: mediaexp1.licdn.com
There are various types of activation functions available and each function has a different utilization. On the output layer, the activation function is dependent on the type of business problem. The squashing function for the output layer for binary classification is the Sigmoid. To know more about these functions and their applicability can refer to these blogs here and here.
Hence, to find the output we estimate the weights and perform the mathematical transformation. The output of a node is the outcome of this activation function.
Till this point, we have just completed step 1 of the neural network that is taking the input variables and finding the output. Then we calculate the error term. And mind you, right now this is only done for one record! We perform this entire cycle all over again for all the records!
Relax, we don’t have to do this manually. This is just the process, the network does these steps in its background. The idea here is to know how the network works, we don’t have to do it manually.
In the neural network, we can move from left to right and right to left as well. The right to left process of adjusting the weights from the Output to the Input layer is Backward Propagation (I will cover this in the next article).
Forward Propagation
The process of going from left to right i.e from the Input layer to the Output Layer is Forward Propagation. We move from left to right to adjust or correct the weights. We will understand how this mathematically works and update the weights to have the minimized loss function.
Our binary classification dataset had input X as 4 * 8 matrix with 4 input variables and 8 records and the Y variable is 2 * 8 matrix with two columns, for class 1 and 0, respectively with 8 records. It had some categorical variables post converting it to dummy variables, we have the set as below:
The idea here is that we start with the input layer of the 4*8 matrix and want to get the output of (2*8). The hidden layers and the neurons in each of the hidden layers are hyperparameters and so are defined by the user. How we achieve the output is via matrix multiplication between the input variables and the weights of each layer.
We have seen above that the weights will have a matrix for each of the respective layers. We perform matrix multiplication starting from the input matrix of 4 * 8 with the weight matrix between the L1 and L2 layer to get the next matrix of the next layer L3. Similarly, we will do this for each layer and repeat the process until we reach the output layer of the dimensions 2 * 8.
Please note the above explanation of the estimation of neurons was for a single observation, the network performs the entire above process for all the observations. The final result after performing for all the records in the dataset with the respective dimensions will be:
Now, let’s break down the steps to understand how the matrix multiplication in Forward propagation works:

First, the input matrix is 4 * 8, and the weight matrix between L1 and L2, referring to it as W_{h1} is 5 * 5 (we saw this above).

The W_{h1}= 5* 5 weight matrix, includes both for the betas or the coefficients and for the bias term.

For simplification, breaking the wh1 into beta weights and the bias (going forward will use this nomenclature). So the beta weights between L1 and L2 are of 4*5 dimension (as have 4 input variables in L1 and 5 neurons in the Hidden Layer L2).

For understanding purpose, will illustrate the multiplication for one layer:
We can multiply element by element but that result will be only for one observation or one record. To get the result for all the 8 observations in one go, we need to multiply the two matrices.
For matrix multiplication, the number of columns of the first matrix must be equal to the number of rows of the second matrix. Our first matrix of input has 8 columns and the second matrix of weights has 4 rows hence, we can’t multiply the two matrices.
So, what do we do? We take the transpose of one of the matrices to conduct the multiplication. Transposing the weight matrix to 5 * 4 will help us resolve this issue.
So, now after adding the bias term, the result between the input layer and the hidden layer L2, becomes Z_{1} = W_{h1}^{T} * X + b_{h1}.
5. The next step is to apply the activation function on Z_{1}. Note, the shape of Z_{1} does not change by applying the activation function so h_{1} = activation function(Z_{1}) is of shape 5*8.
6. In a similar manner to the above five steps, the network using the forward propagation gets the outcome of each layer:
Note that for the next layer between L2 and L3, the input this time will not be X but will be h_{1,} which results from L1 and L2.
Z_{2} = W_{h2}^{T} * h_{1} + b_{h2},
where ,
 W_{h2} is the weight matrix between L2 and L3 with a shape of 5*3
 W_{h2}^{T} , is the transpose of W_{h2}, having the dimension of 3*5
 h_{1} is the result of L1 and L2, with a shape of 5*8, and
 b_{h2} is the bias term.
So, Z_{2} = W_{h2}^{T} * h_{1} + b_{h2} with its matrix multiplication is:
Z_{2} = [(3*5) * (5*8)] + b_{h2} will result Z_{2} with dimension of 3*8 and post this again apply the activation function, which results in: h_{2} = activation function(Z_{2}) is of shape 3*8.
7. We repeat these steps for the computation of the last layer.
This time for the next layer between L3 and L4, the input will be h_{2}, resulting from L2 and L3.
Z_{3} = W_{h0}^{T} * h_{2} + b_{h0},
 Where W_{h0} is the weight matrix between L3 and L4 with a shape of 3*2
 W_{h0}^{T}, is the transpose of W_{h0}, having the dimension of 2*3
 h_{2} is the result of L2 and L3, with a shape of 3*8, and
 b_{h0} is the bias term.
So, Z_{3} = W_{h0}^{T} * h_{2} + b_{h0}, with its matrix multiplication is:
Z_{3} = [(2*3) * (3*8)] + b_{h0} will result in Z_{3} with the dimension of 2*8 and post this again apply the activation function, this time use Sigmoid to transform as need to get the output, which results in O = Sigmoid(Z_{3}) is of shape 2*8.
After estimating the output using Forward propagation, then we calculate the error using this output, and this process of finding the weights to minimize the error continues until the optimal solution is achieved.
The other method and the preferred method to correct the weights is Backward Propagation which we shall explore in the next article.
Endnotes
The way Neural Network ways is:
 First, we initialize the weights with some random values, which are mostly between 0 and 1.
 Calculate the output which is to predict the values and estimate the loss.
 Then, adjust the weights such that the loss will be minimized.
We repeat the above three steps until there is no change in the loss function that has reached its optimal value or based on the number of iterations (or epochs).
I hope this article was helpful to understand how the nodes are estimated, how a neural network operates, its parameters, and the working of the forward propagation method. The next article will go into the depths of Backward propagation. Stay Tuned!
You may reach out to me on my LinkedIn: linkedin.com/in/nehaseth69771111
Thank You. Happy Learning 🙂
The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.