Recurrent Neural Networks : Introduction for Beginners

A

Aman Singh 13 Jun, 2021

6 min read

This article was published as a part of the Data Science Blogathon

In this section, you will learn about:

– what is RNN?

– Forward propagation in RNN

– Backward propagation in RNN

– Types of RNN architectures

– Applications of RNN

What is a Recurrent Neural Network (RNN)?

Recurrent neural network is a type of neural network in which the output form the previous step is fed as input to the current step

In traditional neural networks, all the inputs and outputs are independent of each other, but this is not a good idea if we want to predict the next word in a sentence

We need to remember the previous word in order to generate the next word in a sentence, hence traditional neural networks are not efficient for NLP applications

RNNs also have a hidden stage which used to capture information about a sentence

RNNs have a ‘memory’, which is used to capture information about the calculations made so far

In theory, RNNs can use information in arbitrary long sequences, but practically they are limited to look back only a few steps

Diagrammatic Representation

Here, x_t: input at time t, s_t: hidden state at time t, and O_t: output at time t

Unfolding means writing the network for the complete sequence, for example, if a sequence has 4 words then the network will be unfolded into a 4 layered neural network

We can think of s _tas the memory of the network as it captures information about what happened in all the previous steps

A traditional neural network uses different parameter at each layer while an RNN shares the same parameter across all the layers, in the diagram we could see that the same parameters (U, V, W) were being used across all the layers

Using the same parameters across all layers shown that we are performing the same task with different inputs, thus reducing the total number of parameters to learn

The tree set of parameter ( U, V, and W) are used to apply linear transformation over their respective inputs

Parameter U transformation the input x_tto the state s_t

Parameter W transforms the previous state s_t-1to the current state s_t

And, parameter V maps the computed internal state s_tto the output O_t

Formula to calculate current state:

h_t= f(h_t-1,x_t)

Here, h_tis the current state, h_t-1is the previous state and x_tis the current input

The equation applying after activation function (tanh) is:

h_t=tanh(w_hhh_t-1+ w_xhx_t)

Here, whh : weight at recurrent neuron, Wxh : weight at input neuron

After calculating the final state, we can then produce the output
The output state can be calculated as:

O_t= W_hyh_t

Here, O_tis the output state, why: weight at output layer, h_t: current state

Backward propagation in RNN

Backward phase :

To train an RNN, we need a loss function. We will make use of cross-entropy loss which is often paired with softmax, which can be calculated as:

L = -ln(p_c)

Here, p_cis the RNN’s predicted probability for the correct class (positive or negative). For example, if a positive text is predicted to be 95% positive by the RNN, then the loss is:

L= -ln(0.95) = 0.051

After calculating loss, we will train the RNN using gradient descent to minimize loss

Steps for Back Propagation

We compute the cross-entropy error first using the current and actual output. The network is unfolding for each time step
Then for each time step in the network, the gradient descent is calculated with respect to the weight of each of the parameter
Once the weight for all time step is the same we can combine together the gradients for all the time steps
Then we update the weights for both the recurrent neurons as well as the dense layers

Vanishing and Exploding Gradient Problem

Defining the problem

During the training of all a deep network, the gradients are propagated back in time all the way to the initial layer
Gradients that come from deeper layers go through multiple matrix multiplications according to the chain rule, and when they approach the earlier layers, if they have small values ( <1 ) they shrink exponentially till they vanish
Vanishing gradient make model learning difficult
While if they have large values (>1), then they eventually blow up and crash the model, this is the exploding gradient problem

Types of RNN Architectures

The common architectures which are used for sequence learning are:

One to one
One to many
Many to one
Many to many

One to one

This model is similar to a single layer neural network as it only provides linear predictions
It is mostly used fixed-size input ‘x’ and fixed-size output ‘y’ (example: image classification)

One to many

This consist of a single input ‘x’, activation ‘a’, and multiple outputs ‘y’
Example: generating an audio stream. It takes a single audio stream as input and generates new tones or new music based on that stream
In some cases, it propagates the output ‘y’ to the next RNN units

Many to one

This consist of multiple inputs ‘x’ (such as words or sentences), activation ‘a’ and produce a single output ‘y’ at the end
This type of architecture is mostly used to perform sentiment analysis as it processes the entire input (collection of words sentences) to produce a single output (positive, negative, or neutral sentiment)

Many to many

In this, a single frame is taken as input for each RNN unit. A-frame represents multiple inputs ‘x’, activations ‘a’ which are propagated through the network to produce output ‘y’ which are the classification result for each frame
It used mostly in video classification, where we try to classify each frame of the video

Bi- directional RNNs

In this neural network, 2 hidden layers running in the opposite direction are connected to produce a single output
These layers allow the neural network to received information from both past as well as a future state
For example, given a word sequence: ‘I like programming’. The forward layer will input the sequence as it is while the backward layer will feed the sequence in the reverse order ‘programming like i’
The output for this will be calculated by concatenating the word sequence at each time step and generating the weight

Note:-

RNNs remember each and every piece of information through timestamp
The memory state which stores information of all the state is useful for tasks such as sentence generation and time series prediction
RNNs can handle inputs and outputs of arbitrary length
RNNs share the same parameters across different time steps which means fewer parameters to train and computation cost
RNNs can not process very long sequences while making use of tanh or ReLu as an activation function
RNNs face vanishing and exploding gradient problem

Application of RNN

Text summarization: Summarizing the text from any literature, for example, if a news website wants to display brief summary of important news from each and every news article on the website, then text summarization will be helpful

Text recommendation: Text autofill or sentence generation in data every work by making use of RNNs can help in automating the processes and make it less time consuming

Image recognition: RNNs can be combined with CNN in order to recognize an image and give its description

Music generation: RNNs can be used to generate new music or tunes, by feeding a single tune as an input we can generate new notes or tunes of music.

References and Links:

Understanding LSTM Networks — colah’s blog

Introduction to Recurrent Neural Network – GeeksforGeeks

Recurrent neural network – Wikipedia

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.