Understanding Architecture of LSTM

Gourav Singh 27 Feb, 2024 • 5 min read

Introduction

“Machine intelligence is the last invention that humanity will ever need to make.” (Nick Bostrom)

As we have already discussed RNNs in my previous post, it’s time to explore the LSTM architecture, which is designed for long-term memory. Since an LSTM takes previous knowledge into consideration, it would be good for you to also have a look at my previous article on RNNs (relatable, right?).

Let’s take an example. Suppose I show you an image and ask you about it two minutes later: you will probably remember its content. But if I ask about the same image some days later, the information may have faded or been lost entirely, right? The first situation is where RNNs are enough (shorter memories), while the second is where we need LSTMs, which have a longer memory capacity. This clears up some doubts, right?

For more clarity, let’s take another example. Suppose you are watching a movie without knowing its name (e.g. Justice League). In one frame you see Ben Affleck and think this might be a Batman movie; in another frame you see Gal Gadot and think it could be Wonder Woman, right? But after seeing a few more frames you can be sure that this is Justice League, because you are using knowledge acquired from past frames. This is exactly what LSTMs do, using the following mechanisms:

1. Forgetting Mechanism: Forget all scene-related information that is not worth remembering.

2. Saving Mechanism: Save information that is important and can help in the future.

Now that we know when to use LSTMs, let’s discuss the basics of the architecture.

This article was published as a part of the Data Science Blogathon.

The architecture of LSTM

LSTMs deal with both Long Term Memory (LTM) and Short Term Memory (STM), and to keep the calculations simple and effective they use the concept of gates.

Forget Gate: The LTM goes to the forget gate, which forgets information that is not useful.
Learn Gate: The Event (current input) and the STM are combined so that necessary information recently learned from the STM can be applied to the current input.
Remember Gate: The LTM information that we haven’t forgotten is combined with the STM and the Event in the remember gate, whose output works as the updated LTM.
Use Gate: This gate also uses the LTM, STM, and Event to predict the output of the current event, which works as the updated STM.

Figure: Remember Gate

Source: Udacity

The above figure shows the simplified architecture of LSTMs. The actual mathematical architecture of LSTM is represented using the following figure:

Figure: LSTM Architecture

Don’t be intimidated by this architecture; we will break it down into simpler steps that make it a piece of cake to grasp.

Breaking Down the Architecture of LSTM

1. Learn Gate: Takes the Event (E_t) and the Previous Short Term Memory (STM_t-1) as input and keeps only the information relevant for prediction.

Source: Udacity

Calculation:

Source: Udacity

The Previous Short Term Memory STM_t-1 and the Current Event vector E_t are joined together as [STM_t-1, E_t], multiplied by the weight matrix W_n, and offset by a bias; the result is passed through the tanh (hyperbolic tangent) function to introduce non-linearity, producing the matrix N_t.
To ignore insignificant information we calculate an Ignore Factor i_t: we join the Short Term Memory STM_t-1 and the Current Event vector E_t, multiply by the weight matrix W_i, add a bias, and pass the result through the Sigmoid activation function.
The Learn Matrix N_t and the Ignore Factor i_t are multiplied together to produce the learn gate result.
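The learn gate calculation above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (sigmoid, affine, learn_gate), the bias names b_n and b_i, and the toy numbers are assumptions made for the example.

```python
import math

def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix W and a vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def learn_gate(stm_prev, event, W_n, b_n, W_i, b_i):
    joined = stm_prev + event                               # [STM_t-1, E_t]
    n_t = [math.tanh(z) for z in affine(W_n, joined, b_n)]  # learn matrix N_t
    i_t = [sigmoid(z) for z in affine(W_i, joined, b_i)]    # ignore factor i_t
    return [n * i for n, i in zip(n_t, i_t)]                # N_t * i_t

# Toy sizes (assumed): 2 memory units and 1 input feature.
stm_prev = [0.5, -0.1]
event = [1.0]
W_n = [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5]]
b_n = [0.0, 0.1]
W_i = [[0.3, 0.1, -0.2], [0.2, 0.2, 0.2]]
b_i = [0.0, 0.0]

learned = learn_gate(stm_prev, event, W_n, b_n, W_i, b_i)
print(learned)
```

Because tanh lies in (-1, 1) and the sigmoid lies in (0, 1), every entry of the result also lies in (-1, 1).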

2. The Forget Gate: Takes the Previous Long Term Memory (LTM_t-1) as input and decides which information should be kept and which should be forgotten.

Figure: Forget Gate

Calculation:

Source: Udacity

The Previous Short Term Memory STM_t-1 and the Current Event vector E_t are joined together as [STM_t-1, E_t], multiplied by the weight matrix W_f, offset by a bias, and passed through the Sigmoid activation function to form the Forget Factor f_t.
The Forget Factor f_t is then multiplied by the Previous Long Term Memory (LTM_t-1) to produce the forget gate output.
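The forget gate can be sketched the same way. The function name forget_gate, the bias name b_f, and the toy values are assumptions for illustration; the two calls only show how the Forget Factor scales the old long-term memory.

```python
import math

def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forget_gate(ltm_prev, stm_prev, event, W_f, b_f):
    joined = stm_prev + event  # [STM_t-1, E_t]
    # Forget factor f_t: one value in (0, 1) per memory unit.
    f_t = [sigmoid(sum(w * v for w, v in zip(row, joined)) + b)
           for row, b in zip(W_f, b_f)]
    # Scale each entry of the old long-term memory by its forget factor.
    return [f * c for f, c in zip(f_t, ltm_prev)]

ltm_prev = [2.0, -4.0]
stm_prev = [0.5, -0.1]
event = [1.0]
W_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# With zero weights and biases, f_t = sigmoid(0) = 0.5 for every unit,
# so exactly half of each memory entry is kept.
half_kept = forget_gate(ltm_prev, stm_prev, event, W_zero, [0.0, 0.0])
print(half_kept)  # [1.0, -2.0]

# A large positive bias keeps almost everything; a large negative bias
# forgets almost everything.
skewed = forget_gate(ltm_prev, stm_prev, event, W_zero, [10.0, -10.0])
print(skewed)
```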

3. The Remember Gate: Combines the output of the Forget Gate (what is kept from LTM_t-1) with the output of the Learn Gate (what is learned from STM_t-1 and E_t) to produce the updated long-term memory.

Figure: Remember Gate

Calculation:

Source: Udacity

The outputs of the Forget Gate and the Learn Gate are added together to produce the output of the Remember Gate, which becomes the LTM for the next cell.
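As a small sketch, the remember gate is just an element-wise sum of the two gate outputs. The function name and the input values below are made up for the example.

```python
def remember_gate(forget_out, learn_out):
    """Element-wise sum of the forget and learn gate outputs (the new LTM)."""
    if len(forget_out) != len(learn_out):
        raise ValueError("gate outputs must have the same length")
    return [f + n for f, n in zip(forget_out, learn_out)]

forget_out = [1.0, -2.0]   # what survived from the old long-term memory
learn_out = [0.25, 0.5]    # what was learned from STM_t-1 and E_t
ltm_new = remember_gate(forget_out, learn_out)
print(ltm_new)  # [1.25, -1.5]
```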

4. The Use Gate

Combines important information from the Previous Long Term Memory and the Previous Short Term Memory to create the STM for the next cell and produce the output for the current event.

Calculation

The Previous Long Term Memory (LTM_t-1) is passed through the tanh (hyperbolic tangent) activation function with some bias to produce U_t.
The Previous Short Term Memory (STM_t-1) and the Current Event (E_t) are joined together and passed through the Sigmoid activation function with some bias to produce V_t.
The outputs U_t and V_t are then multiplied together to produce the output of the use gate, which also works as the STM for the next cell.
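A minimal sketch of the use gate as described above follows. The weight matrices W_u and W_v and the biases b_u and b_v are assumed names (the text only mentions "some bias"), and the toy values are made up. Note that in the commonly used LSTM formulation the tanh is applied to the updated long-term memory rather than the previous one; the sketch follows the description given here.

```python
import math

def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix W and a vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def use_gate(ltm_prev, stm_prev, event, W_u, b_u, W_v, b_v):
    u_t = [math.tanh(z) for z in affine(W_u, ltm_prev, b_u)]        # U_t
    v_t = [sigmoid(z) for z in affine(W_v, stm_prev + event, b_v)]  # V_t
    return [u * v for u, v in zip(u_t, v_t)]                        # new STM

ltm_prev = [0.0, 1.0]
stm_prev = [0.5, -0.1]
event = [1.0]
W_u = [[1.0, 0.0], [0.0, 1.0]]            # identity, so U_t = tanh(LTM_t-1)
b_u = [0.0, 0.0]
W_v = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # zero weights, so V_t = 0.5
b_v = [0.0, 0.0]

stm_new = use_gate(ltm_prev, stm_prev, event, W_u, b_u, W_v, b_v)
print(stm_new)
```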

Now scroll up to the architecture diagram and map each of these calculations onto it, and you will have your LSTM ready.
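Putting the four gates together, one step of the cell can be sketched as a single function. This follows the gate descriptions in this article and is not a production implementation; the parameter dictionary, its key names, and the all-zero toy weights are assumptions chosen so that the result is easy to check by hand.

```python
import math

def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    """Compute W @ x + b for a list-of-rows matrix W and a vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def lstm_step(ltm_prev, stm_prev, event, p):
    joined = stm_prev + event  # [STM_t-1, E_t]
    # Learn gate: candidate N_t scaled by the ignore factor i_t.
    n_t = [math.tanh(z) for z in affine(p["W_n"], joined, p["b_n"])]
    i_t = [sigmoid(z) for z in affine(p["W_i"], joined, p["b_i"])]
    learn_out = [n * i for n, i in zip(n_t, i_t)]
    # Forget gate: old LTM scaled by the forget factor f_t.
    f_t = [sigmoid(z) for z in affine(p["W_f"], joined, p["b_f"])]
    forget_out = [f * c for f, c in zip(f_t, ltm_prev)]
    # Remember gate: the sum becomes the new LTM.
    ltm_new = [a + b for a, b in zip(forget_out, learn_out)]
    # Use gate: U_t * V_t becomes the new STM and the output.
    u_t = [math.tanh(z) for z in affine(p["W_u"], ltm_prev, p["b_u"])]
    v_t = [sigmoid(z) for z in affine(p["W_v"], joined, p["b_v"])]
    stm_new = [u * v for u, v in zip(u_t, v_t)]
    return ltm_new, stm_new

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

hidden, inputs = 2, 1  # toy sizes (assumed)
params = {"W_u": zeros(hidden, hidden), "b_u": [0.0] * hidden}
for name in ("n", "i", "f", "v"):
    params["W_" + name] = zeros(hidden, hidden + inputs)
    params["b_" + name] = [0.0] * hidden

# With all-zero weights every sigmoid outputs 0.5 and every tanh outputs 0,
# so the LTM is simply halved at each step and the STM stays at zero.
ltm, stm = [1.0, -1.0], [0.0, 0.0]
for event in ([1.0], [0.5], [-0.3]):
    ltm, stm = lstm_step(ltm, stm, event, params)
print(ltm, stm)  # [0.125, -0.125] [0.0, 0.0]
```

In a real network the weights are learned during training; the zero weights here only make the data flow between the gates easy to follow.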

Usage of LSTMs

LSTMs largely mitigate the Vanishing Gradient problem (gradients shrink toward zero as they are propagated back through many time steps, so earlier steps stop learning), but they can still face the Exploding Gradient problem (gradients grow so large that training becomes unstable). LSTMs can be trained easily using Python frameworks like TensorFlow, PyTorch, Theano, etc., and the catch is the same as with RNNs: we need a GPU to train deeper LSTM networks.
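To get a rough feel for why gradients vanish or explode over long sequences, consider a simplified picture in which backpropagation through each time step multiplies the gradient by a constant factor. The function name and the factors 0.9 and 1.1 are illustrative assumptions, not measurements from a real network.

```python
def gradient_scale(factor, steps):
    """Scale of a gradient after being multiplied by `factor` at each step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

# A per-step factor slightly below 1 shrinks the gradient toward zero,
# while a factor slightly above 1 makes it blow up.
vanishing = gradient_scale(0.9, 50)
exploding = gradient_scale(1.1, 50)
print(f"{vanishing:.4f} {exploding:.1f}")
```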

Since LSTMs take care of long-term dependencies, they are widely used in tasks like Language Generation, Voice Recognition, Image OCR Models, etc. The technique is also getting noticed in Object Detection (mainly scene text detection).

Conclusion

In essence, LSTMs are one step toward the kind of machine intelligence that Nick Bostrom describes. Their architecture, governed by gates managing memory flow, underscores their capacity for long-term retention and utilization of information. Despite challenges like exploding gradients, LSTMs find crucial application in tasks such as language generation, voice recognition, and image OCR, and their role is expanding in domains like object detection.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Applied Machine Learning Engineer skilled in Computer Vision/Deep Learning Pipeline Development, creating machine learning models, retraining systems and transforming data science prototypes to production-grade solutions. Consistently optimizes and improves real-time systems by evaluating strategies and testing on real world scenarios.
