Recurrent Neural Networks: Introduction for Beginners
In this section, you will learn about:
– What is an RNN?
– Forward propagation in RNN
– Backward propagation in RNN
– Types of RNN architectures
– Applications of RNN
What is a Recurrent Neural Network (RNN)?
- A recurrent neural network is a type of neural network in which the output from the previous step is fed as input to the current step
- In traditional neural networks, all the inputs and outputs are independent of each other, but this assumption breaks down when we want to predict the next word in a sentence
- We need to remember the previous words in order to generate the next word in a sentence, so traditional neural networks are not well suited to NLP applications
- RNNs also have a hidden state which is used to capture information about a sentence
- RNNs have a ‘memory’, which is used to capture information about the calculations made so far
- In theory, RNNs can use information from arbitrarily long sequences, but in practice they are limited to looking back only a few steps
Diagrammatic Representation
[Image: an RNN and its unfolding through time, showing inputs xt, hidden states st, outputs Ot, and the shared parameters U, V, W]
Here, xt: input at time t, st: hidden state at time t, and Ot: output at time t
Unfolding means writing out the network for the complete sequence; for example, if a sequence has 4 words, then the network will be unfolded into a 4-layer neural network
We can think of st as the memory of the network, as it captures information about what happened in all the previous steps
A traditional neural network uses different parameters at each layer, while an RNN shares the same parameters across all the layers; in the diagram we can see that the same parameters (U, V, W) are used across all the layers
Using the same parameters across all layers shows that we are performing the same task with different inputs at each step, which reduces the total number of parameters to learn
The three sets of parameters (U, V, and W) are used to apply a linear transformation to their respective inputs
Parameter U transforms the input xt to the state st
Parameter W transforms the previous state st-1 to the current state st
And, parameter V maps the computed internal state st to the output Ot
Forward propagation in RNN
Formula to calculate the current state:
ht = f(ht-1,xt)
Here, ht is the current state, ht-1 is the previous state and xt is the current input
After applying the activation function (tanh), the equation becomes:
ht = tanh(Whh ht-1 + Wxh xt)
Here, Whh is the weight at the recurrent neuron and Wxh is the weight at the input neuron
- After calculating the current state, we can then produce the output
- The output can be calculated as:
Ot = Why ht
Here, Ot is the output, Why is the weight at the output layer, and ht is the current state (a small sketch of this forward pass is given below)
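To make the forward pass concrete, here is a minimal NumPy sketch of the two equations above. The layer sizes, random initialization, and variable names are illustrative assumptions rather than values from the article; W_xh, W_hh, and W_hy play the roles of U, W, and V.

```python
import numpy as np

np.random.seed(0)
input_size, hidden_size, output_size = 3, 4, 2            # illustrative sizes

W_xh = np.random.randn(hidden_size, input_size) * 0.01    # U: input  -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01   # W: hidden -> hidden
W_hy = np.random.randn(output_size, hidden_size) * 0.01   # V: hidden -> output

def forward(inputs):
    """Unfold the RNN over a sequence; `inputs` is a list of column vectors x_t."""
    h = np.zeros((hidden_size, 1))            # initial hidden state h_0
    hidden_states, outputs = [], []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # h_t = tanh(Whh h_{t-1} + Wxh x_t)
        outputs.append(W_hy @ h)              # O_t = Why h_t
        hidden_states.append(h)
    return hidden_states, outputs

# Example: a 4-step sequence (think of 4 word vectors in a sentence)
sequence = [np.random.randn(input_size, 1) for _ in range(4)]
states, outs = forward(sequence)
print(outs[-1].ravel())   # output after reading the whole sequence
```

Note how the same three weight matrices are reused at every time step, exactly as described for U, V, and W above.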
Backward propagation in RNN
Backward phase:
To train an RNN, we need a loss function. We will make use of the cross-entropy loss, which is often paired with softmax and can be calculated as:
L = -ln(pc)
Here, pc is the RNN’s predicted probability for the correct class (positive or negative). For example, if a positive text is predicted to be 95% positive by the RNN, then the loss is:
L = -ln(0.95) = 0.051
After calculating the loss, we will train the RNN using gradient descent to minimize the loss
Steps for Back Propagation
- We first compute the cross-entropy error using the predicted and actual outputs. The network is unfolded for each time step
- Then, for each time step in the unfolded network, the gradient of the loss is calculated with respect to each parameter's weights
- Since the weights are the same for all time steps, the gradients from all the time steps can be combined together
- Then we update the weights for both the recurrent neurons and the dense layers; a rough sketch of these steps is given below
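The following self-contained NumPy sketch puts these steps together for the many-to-one case (one softmax output after reading the whole sequence). The softmax head, layer sizes, and learning rate are assumptions for illustration; the article itself does not prescribe them.

```python
import numpy as np

np.random.seed(0)
input_size, hidden_size, output_size = 3, 4, 2   # illustrative sizes

W_xh = np.random.randn(hidden_size, input_size) * 0.01
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
W_hy = np.random.randn(output_size, hidden_size) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(inputs, target_class, lr=0.01):
    global W_xh, W_hh, W_hy
    # ---- forward pass: unfold the network over the sequence ----
    h = np.zeros((hidden_size, 1))
    hs = [h]
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
    probs = softmax(W_hy @ hs[-1])           # many-to-one: one output at the end
    loss = -np.log(probs[target_class, 0])   # L = -ln(p_c)

    # ---- backward pass: accumulate gradients over all time steps ----
    d_o = probs.copy()
    d_o[target_class] -= 1                   # softmax + cross-entropy gradient
    dW_hy = d_o @ hs[-1].T
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    d_h = W_hy.T @ d_o
    for t in reversed(range(len(inputs))):
        d_raw = (1 - hs[t + 1] ** 2) * d_h   # backprop through tanh
        dW_xh += d_raw @ inputs[t].T
        dW_hh += d_raw @ hs[t].T             # weights are shared across steps
        d_h = W_hh.T @ d_raw                 # pass gradient back to h_{t-1}

    # ---- gradient descent update (one update per shared weight matrix) ----
    W_xh -= lr * dW_xh
    W_hh -= lr * dW_hh
    W_hy -= lr * dW_hy
    return loss

sequence = [np.random.randn(input_size, 1) for _ in range(4)]
print(train_step(sequence, target_class=1))
```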
Vanishing and Exploding Gradient Problem
Defining the problem
- During the training of a deep network, the gradients are propagated back in time all the way to the initial layer
- Gradients coming from the deeper layers go through multiple matrix multiplications according to the chain rule, and as they approach the earlier layers, if they have small values (<1) they shrink exponentially until they vanish
- Vanishing gradients make it difficult for the model to learn; this is the vanishing gradient problem
- If instead the gradients have large values (>1), they eventually blow up and crash the model; this is the exploding gradient problem (a small numerical illustration is given below)
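A quick numerical illustration of this effect: repeatedly multiplying a gradient by the same recurrent weight matrix, as the chain rule does during backpropagation through time, either shrinks it towards zero or blows it up, depending on the scale of the weights. The matrix size, number of steps, and scales below are arbitrary choices for demonstration.

```python
import numpy as np

np.random.seed(0)
hidden_size, steps = 4, 50

for scale in (0.1, 1.5):
    W_hh = np.random.randn(hidden_size, hidden_size) * scale
    grad = np.ones((hidden_size, 1))   # stand-in for an upstream gradient
    for _ in range(steps):
        grad = W_hh.T @ grad           # one chain-rule multiplication per time step
    print(f"scale={scale}: gradient norm after {steps} steps = {np.linalg.norm(grad):.3e}")
```

With small weights the norm collapses towards zero (vanishing gradient), while with large weights it grows explosively (exploding gradient).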
Types of RNN Architectures
The common architectures which are used for sequence learning are:
- One to one
- One to many
- Many to one
- Many to many
One to one
- This model is similar to a plain single-layer neural network, as it makes a single prediction from a single input with no recurrence involved
- It is mostly used with a fixed-size input ‘x’ and a fixed-size output ‘y’ (example: image classification)
One to many
- This consists of a single input ‘x’, activations ‘a’, and multiple outputs ‘y’
- Example: music generation. It takes a single input (such as a seed tone or a short audio clip) and generates new tones or new music based on it
- In some cases, the output ‘y’ is also propagated as input to the next RNN unit

Many to one
- This consists of multiple inputs ‘x’ (such as words or sentences) and activations ‘a’, and produces a single output ‘y’ at the end
- This type of architecture is mostly used for sentiment analysis, as it processes the entire input (a collection of words or sentences) to produce a single output (positive, negative, or neutral sentiment)

Many to many
- In this architecture, each RNN unit takes a single frame as input. The frames represent multiple inputs ‘x’ and activations ‘a’, which are propagated through the network to produce outputs ‘y’ that are the classification results for each frame
- It is mostly used in video classification, where we try to classify each frame of the video; a short sketch contrasting the many-to-one and many-to-many patterns is given below
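As a hedged sketch (assuming TensorFlow/Keras, which the article does not use explicitly), the many-to-one and many-to-many patterns differ only in whether the RNN layer returns just its final hidden state or the full sequence of states:

```python
import tensorflow as tf

timesteps, features, num_classes = 10, 8, 3   # illustrative shapes

# Many to one: e.g. sentiment analysis -> a single label for the whole sequence
many_to_one = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, features)),
    tf.keras.layers.SimpleRNN(16),                          # returns only the last state
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Many to many: e.g. video classification -> one label per frame / time step
many_to_many = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, features)),
    tf.keras.layers.SimpleRNN(16, return_sequences=True),   # returns every h_t
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(num_classes, activation="softmax")),
])

print(many_to_one.output_shape)   # (None, 3)     -> one prediction per sequence
print(many_to_many.output_shape)  # (None, 10, 3) -> one prediction per time step
```

The printed output shapes show the difference directly: one prediction per sequence versus one prediction per frame.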

Bi-directional RNNs
- In this neural network, two hidden layers running in opposite directions are connected to produce a single output
- These layers allow the neural network to receive information from both past and future states
- For example, given the word sequence ‘I like programming’, the forward layer reads the sequence as it is, while the backward layer feeds the sequence in the reverse order: ‘programming like I’
- The output at each time step is then calculated by combining (for example, concatenating) the hidden states of the forward and backward layers at that time step, as sketched below
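Continuing the same assumed Keras setup (again, an illustrative sketch rather than the article's own code), a bidirectional layer can be written as a wrapper that runs one RNN forwards and one backwards and concatenates their hidden states at each time step:

```python
import tensorflow as tf

timesteps, features, num_classes = 10, 8, 3   # illustrative shapes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(timesteps, features)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.SimpleRNN(16, return_sequences=True),
        merge_mode="concat"),                  # concatenate forward + backward states
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

print(model.output_shape)  # (None, 10, 3); the Dense layer sees 2 * 16 hidden units
```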

Note:
- RNNs remember information at each time step
- The memory state, which stores information about all the previous states, is useful for tasks such as sentence generation and time series prediction
- RNNs can handle inputs and outputs of arbitrary length
- RNNs share the same parameters across different time steps, which means fewer parameters to train and a lower computation cost
- RNNs cannot process very long sequences well when using tanh or ReLU as the activation function
- RNNs face the vanishing and exploding gradient problems
Applications of RNN
Text summarization: summarizing text from any piece of writing; for example, if a news website wants to display a brief summary of the important news from each news article on the site, text summarization will be helpful
Text recommendation: text autofill or sentence generation in data entry work by making use of RNNs can help in automating the process and making it less time consuming
Image recognition: RNNs can be combined with CNNs in order to recognize an image and generate a description of it
Music generation: RNNs can be used to generate new music or tunes; by feeding a single tune as input, we can generate new notes or tunes of music.