Aman Singh — Published On June 13, 2021
Advanced Deep Learning Maths

This article was published as a part of the Data Science Blogathon

In this section, you will learn about:

         –   what is RNN?

         –   Forward propagation in RNN

         –   Backward propagation in RNN

         –   Types of RNN architectures

             Applications of RNN

What is a Recurrent Neural Network (RNN)?

  • Recurrent neural network is a type of neural network in which the output form the previous step is fed as input to the current step
  • In traditional neural networks, all the inputs and outputs are independent of each other, but this is not a good idea if we want to predict the next word in a sentence
  • We need to remember the previous word in order to generate the next word in a sentence, hence traditional neural networks are not efficient for NLP applications
  • RNNs also have a hidden stage which used to capture information about a sentence
  • RNNs have a ‘memory’, which is used to capture information about the calculations made so far
  • In theory, RNNs can use information in arbitrary long sequences, but practically they are limited to look back only a few steps

   Diagrammatic Representation


Recurrent Neural Networks

Here, xt: input at time t, st: hidden state at time t, and Ot: output at time t

Unfolding means writing the network for the complete sequence, for example, if a sequence has 4 words then the network will be unfolded into a 4 layered neural network

We can think of s as the  memory of the network as it captures information about what happened in all the previous steps

A traditional neural network uses different parameter at each layer while an RNN shares the same parameter across all the layers, in the diagram we could see that the same parameters (U, V, W) were being used across all the layers

Using the same parameters across all layers shown that we are performing the  same task with different inputs, thus reducing the total number of parameters to learn

The tree set of parameter ( U, V, and W) are used to apply linear transformation over their respective inputs

Parameter U transformation the input xt to the state st

Parameter W transforms the previous state st-1 to the current state st

And, parameter V maps the computed internal state st to the output O

Formula to calculate current state:

ht = f(ht-1,xt)

Here, ht is the current state, ht-1 is the previous state and xt is the current input

The equation applying after activation function (tanh) is:

ht=tanh(whhht-1 + wxhxt)

Here, whh : weight at recurrent neuron, Wxh : weight at input neuron

  • After calculating the final state, we can then produce the output
  • The output state can be calculated as:

 Ot = Why ht

Here, Ot is the output state, why: weight at output layer, ht: current state


Backward propagation in RNN

Backward phase :

To train an RNN, we need a loss function. We will make use of cross-entropy loss which is often paired with softmax, which can be calculated as:

 L = -ln(pc)

Here, pc is the RNN’s predicted probability for the correct class (positive or negative). For example, if a positive text is predicted to be 95% positive by the RNN, then the loss is:

L= -ln(0.95) = 0.051

After calculating loss, we will train the RNN using gradient descent to minimize loss

Steps for Back Propagation

  • We compute the cross-entropy error first using the current and actual output. The network is unfolding for each time step
  • Then for each time step in the network, the gradient descent is calculated with  respect to the weight of each of the parameter
  • Once the weight for all time step is the same we can combine together the gradients for all the time steps
  • Then we update the weights   for both the recurrent neurons as well as the dense layers

Vanishing and Exploding Gradient Problem


     Defining the problem

  • During the training of all a deep network, the gradients are propagated back in time all the way to the initial layer
  • Gradients that come from deeper layers go through multiple matrix multiplications according to the chain rule, and when they approach the earlier layers, if they have small values ( <1 ) they shrink exponentially till they vanish
  • Vanishing gradient make model learning difficult
  • While if they have large values (>1), then they eventually blow up and crash the model, this is the exploding gradient problem


Types of RNN Architectures

The common architectures which are used for sequence learning are:

  • One to one
  • One to many
  • Many to one
  • Many to many


  One to one

  • This model is similar to a single layer neural network as it only provides linear predictions
  • It is mostly used fixed-size input ‘x’ and fixed-size output ‘y’ (example: image classification)
Recurrent Neural Networks 2


One to many

  • This consist of a single input ‘x’, activation ‘a’, and multiple outputs ‘y’
  • Example: generating an audio stream. It takes a single audio stream as input and generates new tones or new music based on that stream
  • In some cases, it propagates the output ‘y’ to the next RNN units
Recurrent Neural Networks one to many


Many to one

  • This consist of multiple inputs ‘x’ (such as words or sentences), activation ‘a’ and produce a single output ‘y’ at the end
  • This type of architecture is mostly used to perform sentiment analysis as it processes the entire input (collection of words sentences) to produce a single output (positive, negative, or neutral sentiment)
Many to many

Many to many

  • In this, a single frame is taken as input for each RNN unit. A-frame represents multiple inputs ‘x’, activations ‘a’ which are propagated through the network to produce output ‘y’ which are the classification result for each frame
  • It used mostly in video classification, where we try to classify each frame of the video


Many to many


Bi- directional RNNs

  • In  this neural network, 2 hidden layers running in the opposite direction are connected to produce a single output
  • These layers allow the neural network to received information from both past as well as a future state
  • For example, given a word sequence: ‘I like programming’. The forward layer will input the sequence as it is while the backward layer will feed the sequence in the reverse order ‘programming like i’
  • The output for this will be calculated by concatenating the word sequence at each time step and generating the weight
Many to many 2


  • RNNs remember each and every piece of information through timestamp
  • The memory state which stores information of all the state is useful for tasks such as sentence generation and time series prediction
  • RNNs can handle inputs and outputs of arbitrary length
  • RNNs share the same parameters across different time steps which means fewer parameters to train and computation cost
  • RNNs can not process very long sequences while making use of tanh or ReLu as an activation function
  • RNNs face vanishing and exploding gradient problem


Application of RNN


Text summarization: Summarizing the text from any literature, for example, if a news website wants to display brief summary of important news from each and every news article on the website, then text summarization will be helpful

Text recommendation: Text autofill or sentence generation in data every work by making use of RNNs can help in automating the processes and make it less time consuming

Image recognition: RNNs can be combined with CNN in order to recognize an image and give its description

Music generation: RNNs can be used to generate new music or tunes, by feeding a single tune as an input we can generate new notes or tunes of music.

References and Links:

Understanding LSTM Networks — colah’s blog

Introduction to Recurrent Neural Network – GeeksforGeeks

Recurrent neural network – Wikipedia

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

About the Author

Aman Singh

Our Top Authors

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *