This article was published as a part of theÂ Data Science Blogathon

Â Â Â Â **Â –Â ** Â what is RNN?

Â Â Â Â Â **–Â Â **Forward propagation in RNN

Â Â Â Â Â **–Â ** Â Backward propagation in RNN

Â Â Â Â **Â –Â **Â Types of RNN architectures

Â Â Â Â Â **–**Â Â Applications of RNN

- Recurrent neural network is a type of neural network in which the output form the previous step is fed as input to the current step

- In traditional neural networks, all the inputs and outputs are independent of each other, but this is not a good idea if we want to predict the next word in a sentence
- We need to remember the previous word in order to generate the next word in a sentence, hence traditional neural networks are not efficient for NLP applications
- RNNs also have a hidden stage which used to capture information about a sentence
- RNNs have a ‘memory’, which is used to capture information about the calculations made so far
- In theory, RNNs can use information in arbitrary long sequences, but practically they are limited to look back only a few steps

**Here, x _{t}: input at time t, s_{t}: hidden state at time t, and O_{t}: output at time t**

Unfolding means writing the network for the complete sequence, for example, if a sequence has 4 words then the network will be unfolded into a 4 layered neural network

We can think of s _{tÂ }as theÂ memory of the network as it captures information about what happened in all the previous steps

A traditional neural network uses different parameter at each layer while an RNN shares the same parameter across all the layers, in the diagram we could see that the same parameters (U, V, W) were being used across all the layers

Using the same parameters across all layers shown that we are performing theÂ same task with different inputs, thus reducing the total number of parameters to learn

The tree set of parameter ( U, V, and W) are used to apply linear transformation over their respective inputs

Parameter U transformation the input x_{t }to the state s_{t}

Parameter W transforms the previous state s_{t-1 }to the current state s_{t}

And, parameter V maps the computed internal state s_{t }to the output O_{tÂ }

h_{t }= f(h_{t-1},x_{t})

Here, h_{t }is the current state, h_{t-1 }is the previous state and x_{t }is the current input

The equation applying after activation function (tanh) is:

h_{t}=tanh(w_{hh}h_{t-1 }+ w_{xh}x_{t})Here, whh : weight at recurrent neuron, Wxh : weight at input neuron

- After calculating the final state, we can then produce the output
- The output state can be calculated as:

## Â O

_{t }= W_{hy }h_{t}

Here, O

_{t }is the output state, why: weight at output layer, h_{t}: current state

**Backward phase :**

To train an RNN, we need a loss function. We will make use of cross-entropy loss which is often paired with softmax, which can be calculated as:

Â L = -ln(p_{c})

Here, p

_{c }is the RNN’s predicted probability for the correct class (positive or negative). For example, if a positive text is predicted to be 95% positive by the RNN, then the loss is:

L= -ln(0.95) = 0.051After calculating loss, we will train the RNN using gradient descent to minimize loss

- We compute the cross-entropy error first using the current and actual output. The network is unfolding for each time step
- Then for each time step in the network, the gradient descent is calculated withÂ respect to the weight of each of the parameter
- Once the weight for all time step is the same we can combine together the gradients for all the time steps
- Then we update the weightsÂ Â for both the recurrent neurons as well as the dense layers

__Â __

- During the training of all a deep network, the gradients are propagated back in time all the way to the initial layer
- Gradients that come from deeper layers go through multiple matrix multiplications according to the chain rule, and when they approach the earlier layers, if they have small values ( <1 ) they shrink exponentially till they vanish
- Vanishing gradient make model learning difficult
- While if they have large values (>1), then they eventually blow up and crash the model, this is the exploding gradient problem

The common architectures which are used for sequence learning are:

- One to one
- One to many
- Many to one
- Many to many

Â

- This model is similar to a single layer neural network as it only provides linear predictions
- It is mostly used fixed-size input ‘x’ and fixed-size output ‘y’ (example: image classification)

- This consist of a single input ‘x’, activation ‘a’, and multiple outputs ‘y’
- Example: generating an audio stream. It takes a single audio stream as input and generates new tones or new music based on that stream
- In some cases, it propagates the output ‘y’ to the next RNN units

- This consist of multiple inputs ‘x’ (such as words or sentences), activation ‘a’ and produce a single output ‘y’ at the end
- This type of architecture is mostly used to perform sentiment analysis as it processes the entire input (collection of words sentences) to produce a single output (positive, negative, or neutral sentiment)

- In this, a single frame is taken as input for each RNN unit. A-frame represents multiple inputs ‘x’, activations ‘a’ which are propagated through the network to produce output ‘y’ which are the classification result for each frame
- It used mostly in video classification, where we try to classify each frame of the video

- InÂ this neural network, 2 hidden layers running in the opposite direction are connected to produce a single output
- These layers allow the neural network to received information from both past as well as a future state
- For example, given a word sequence: ‘I like programming’. The forward layer will input the sequence as it is while the backward layer will feed the sequence in the reverse order ‘programming like i’
- The output for this will be calculated by concatenating the word sequence at each time step and generating the weight

**Note:-**

**RNNs remember each and every piece of information through timestamp****The memory state which stores information of all the state is useful for tasks such as sentence generation and time series prediction****RNNs can handle inputs and outputs of arbitrary length****RNNs share the same parameters across different time steps which means fewer parameters to train and computation cost****RNNs can not process very long sequences while making use of tanh or ReLu as an activation function****RNNs face vanishing and exploding gradient problem**

**Â **

** Text summarization: **Summarizing the text from any literature, for example, if a news website wants to display brief summary of important news from each and every news article on the website, then text summarization will be helpful

** Text recommendation: **Text autofill or sentence generation in data every work by making use of RNNs can help in automating the processes and make it less time consuming

** Image recognition: **RNNs can be combined with CNN in order to recognize an image and give its description

** Music generation: **RNNs can be used to generate new music or tunes, by feeding a single tune as an input we can generate new notes or tunes of music.

References and Links:

Understanding LSTM Networks — colah’s blog

Introduction to Recurrent Neural Network – GeeksforGeeks

Recurrent neural network – Wikipedia

*The media shown in this article are not owned by Analytics Vidhya and are used at the Authorâ€™s discretion.*

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Become a full stack data scientist##

##

##

##

##

##

##

##

##

##

Pretrained Language Models in NLP
Generative Pre-training (GPT) for Natural Language Understanding(NLU)
Finetuning GPT-2
Understanding BERT
Finetune Masked language Modeling in BERT
Implement Text Classification using BERT
Finetuning BERT for NER
Extensions of BERT: Roberta, Spanbert, ALBER
MobileBERT
GPT-3
Prompt Engineering in GPT-3
Bigbird
T5 and large language models