Rahul Dogra — Published On July 7, 2023


I’m going to explain transformer encoders in a very simple way. If you are having trouble learning transformers, read this blog post all the way through. If you are interested in working in the NLP field, you should at least be aware of transformers, as most of the industry uses these state-of-the-art models for various tasks. Transformers, introduced in the paper “Attention Is All You Need,” are state-of-the-art models for NLP tasks and have surpassed traditional RNNs and LSTMs. They overcome the challenge of capturing long-term dependencies by relying on self-attention rather than recurrence. They have revolutionised NLP and paved the way for architectures like BERT, GPT-3, and T5.

Learning Objectives

In this article, you will learn:

  • Why transformers became so popular.
  • The role of the self-attention mechanism in NLP.
  • How to create the key, query and value matrices from our own input data.
  • How to compute the attention matrix using the key, query and value matrices.
  • The importance of applying the softmax function in the mechanism.

This article was published as a part of the Data Science Blogathon.

What led to the outperformance of Transformers over RNN and LSTM models?

We encountered a significant obstacle while working with RNNs and LSTMs: these recurrent models struggled to capture long-term dependencies and became more computationally expensive as the data grew more complex. The paper “Attention Is All You Need” introduced a new architecture called the Transformer to overcome this limitation of conventional sequential networks, and transformers are now the state-of-the-art models for a number of NLP applications.

  • In RNNs and LSTMs, input tokens are fed in one at a time, whereas the transformer receives the complete sequence at once (parallel feeding of data).
  • The transformer model eliminates recurrence entirely and relies exclusively on the attention mechanism. It uses self-attention, a special kind of attention mechanism.
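
The difference between the two bullets above can be sketched in a few lines of NumPy. This is only an illustration: the sizes, the weight matrices W_x and W_h, and the tanh update are assumptions chosen for the example, not part of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 3                      # hypothetical sizes for illustration
X = rng.normal(size=(seq_len, d))      # one row per token
W_x = rng.normal(size=(d, d))          # hypothetical input weights
W_h = rng.normal(size=(d, d))          # hypothetical recurrent weights

# Recurrent style: each step needs the hidden state of the previous step,
# so the tokens have to be processed one after another.
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(x_t @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Transformer style: one matrix product handles every token at once,
# because no row depends on the result of another row.
parallel_out = X @ W_x

print(rnn_states.shape, parallel_out.shape)  # (4, 3) (4, 3)
```

The loop cannot be parallelised across tokens, while the single matrix product can.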

What does a Transformer consist of? How does it operate?

For many NLP tasks, the transformer is currently the state-of-the-art model. Its introduction led to a significant advancement in the field of NLP and paved the way for cutting-edge systems like BERT, GPT-3, T5, and others.

Let’s understand how transformers and self-attention work with a language translation task. The transformer consists of an encoder-decoder architecture. We feed the input sentence (source sentence) to the encoder. The encoder learns a representation of the input sentence and sends it to the decoder. The decoder receives the representation learned by the encoder as input and generates the output sentence (target sentence).

Let’s say we want to translate a phrase from English to French. We feed the English sentence as input to the encoder, as indicated in the following figure. The encoder learns the representation of the given English sentence and feeds it to the decoder. The decoder takes the encoder’s representation as input and generates the French sentence as output.

Transformers Encoder | NLP

All well, but what precisely is happening here? How do the transformer’s encoder and decoder translate an English sentence (the source sentence) into a French sentence (the target sentence)? What precisely occurs within the encoder and decoder? To keep this post brief, we’ll only look at the encoder network here; we’ll cover the decoder in a future article. Let’s find out in the sections that follow.

Understanding the Encoder of the Transformer

The encoder is simply a neural network designed to receive an input and transform it into a different representation that a machine can understand. The transformer consists of a stack of N encoders. The output of one encoder is sent as input to the encoder above it, as shown in the following figure. The final encoder returns the representation of the given source sentence as output. We feed the source sentence as input to the encoder and get the representation of the source sentence as output:

Transformers Encoder | NLP

The authors of the original paper, Attention Is All You Need, chose N = 6, which means that they stacked six encoders one on top of the other. Nevertheless, we can experiment with other values of N. Let’s use N = 2 for simplicity and better understanding.
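
The stacking idea can be sketched as a simple loop in which each encoder’s output becomes the next encoder’s input. In this sketch, encoder_block is a hypothetical placeholder (a single tanh layer) that stands in for the real contents of an encoder, and the weights are random; only the data flow through the stack is the point.

```python
import numpy as np

def encoder_block(x, w):
    # Hypothetical placeholder for one encoder: a single tanh layer,
    # used here only so the data flow between blocks is visible.
    return np.tanh(x @ w)

def encoder_stack(x, weights):
    # The output of each encoder becomes the input of the one above it.
    for w in weights:
        x = encoder_block(x, w)
    return x

rng = np.random.default_rng(0)
N = 2                                   # number of stacked encoders
d_model = 3                             # hypothetical embedding dimension
source = rng.normal(size=(3, d_model))  # 3 tokens, one row each
weights = [rng.normal(size=(d_model, d_model)) for _ in range(N)]

representation = encoder_stack(source, weights)
print(representation.shape)  # (3, 3)
```

Changing N only changes the length of the weights list; the shape of the representation stays the same.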

Okay, the question is: how exactly does the encoder work? How does it generate the representation for a given source sentence (input sentence)? Let’s see what is inside the encoder.

Components of Encoder

From the above figure, we can see that all the encoder blocks are identical. We can also observe that each encoder block consists of two components:

  1. Multi-head attention
  2. Feedforward network

Let’s get into the details and learn how exactly these two components work. To understand how multi-head attention works, we first need to understand the self-attention mechanism.

Self-attention Mechanism

Let’s understand the self-attention mechanism with an example. Consider the following sentence:

                 I swam across the river to get to the other bank


Example 1

In Example 1 above, suppose I ask you to tell me the meaning of “bank”. To answer this question, you have to understand the words that surround the word “bank”.

So is it:

Bank == financial institution?

Bank == the ground at the edge of a river?

By reading the sentence, you can easily say that the word “bank” means the ground at the edge of a river.

So Context Matters!

Let’s see another example:

              A dog ate the food because it was hungry


Example 2

How can a machine understand what an ambiguous word such as “it” refers to in a given sentence? This is where the self-attention mechanism helps the machine.

In the given sentence, A dog ate the food because it was hungry, our model first computes the representation of the word A, next the representation of the word dog, then the representation of the word ate, and so on. While computing the representation of each word, it relates that word to all the other words in the sentence to understand more about it.

For instance, while computing the representation of the word it, our model relates the word it to all the other words in the sentence to understand more about the word it.

In the image below, our model connects the word “it” to every word in the phrase to calculate its representation. By doing so, our model understands that “it” is associated with “dog” and not “food” in the given sentence. The thickness of the line connecting “it” and “dog” is greater, indicating a higher score and a stronger relationship. This enables the machine to make predictions based on the higher score.


All right, but exactly how does this operate? Let’s learn more about the self-attention process in detail now that we have a fundamental understanding of what it is.

Assume I have:

SourceSentence = I am good

Tokenized = [‘I’, ‘am’, ‘good’]

Here, the representation of each word is nothing but its word embedding.

Embedding Matrix of SourceSentence

Input Matrix (Embedding Matrix)

From the above input matrix (embedding matrix), we can see that the first row of the matrix is the embedding of the word I, the second row is the embedding of the word am, and the third row is the embedding of the word good. Thus the dimension of the input matrix is [sentence length x embedding dimension]. The number of words in our sentence (sentence length) is 3. Let the embedding dimension be 3 for the purpose of this explanation. Then our input matrix (input embedding) has dimension [3 x 3]. If you take an embedding dimension of 512 instead, the shape would be [3 x 512]. For ease, we use [3 x 3].

X Matrix(Embedding Matrix)
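
As a minimal sketch of how such an input matrix could be assembled, the snippet below stacks one embedding vector per token into X. The embedding_table here is hypothetical and filled with random numbers; a real model would use learned embeddings.

```python
import numpy as np

tokens = ["I", "am", "good"]
embedding_dim = 3                       # kept small for readability; could be 512

rng = np.random.default_rng(42)
# Hypothetical embedding table: random vectors, not trained embeddings.
embedding_table = {tok: rng.normal(size=embedding_dim) for tok in tokens}

# Row i of X is the embedding of token i.
X = np.stack([embedding_table[tok] for tok in tokens])
print(X.shape)  # (3, 3) -> [sentence length x embedding dimension]
```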

We now generate three new matrices from the aforementioned matrix, X: a query matrix, Q, a key matrix, K, and a value matrix, V. Wait. What exactly are these three matrices? And why do we require them? They are employed in the self-attention mechanism. In a moment, we’ll see how these three matrices are used.

Searching-Engine Wor

So let me offer you an example to help you grasp and visualize self-attention. Say I’m looking for good data science tutorials on YouTube to help me learn data science. Even though the YouTube database is huge, it lets me enter a query and returns matching results from among all that data. So if I supply the query Data Science Tutorial, it is compared against the other data sequences (keys) to compute a score for each, and the items most related to it (those with a higher score) are returned.

NOTE: The above explanation is just an example to make you visualize how my query is being compared with other words/sequences as keys here.

Let me return to the notions of key, query, and value. Now consider how we generate these three matrices for the self-attention mechanism. To do so, we introduce three new weight matrices: W[Q], W[K], and W[V]. By multiplying the input matrix, X, by W[Q], W[K], and W[V], we get the query, Q, key, K, and value, V, matrices.

NOTE: The W[Q], W[K], and W[V] weight matrices are randomly initialised, and their optimal values are learnt during training. As the weights improve, we obtain more accurate query, key, and value matrices.

As indicated in the diagram below, we multiply the input matrix (X) by the weight matrices, W[Q], W[K], and W[V], yielding the query, key, and value matrices. Note that the values shown are arbitrary rather than real embeddings; they are for illustration only.

Creating query, key and value matrices
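
The three multiplications can be sketched as follows. The weight matrices are randomly initialised here, just as they would be before training; the variable names W_Q, W_K, W_V and the choice of d_k = 3 are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embedding_dim, d_k = 3, 3, 3   # d_k = 3 is a hypothetical choice

X = rng.normal(size=(seq_len, embedding_dim))   # stand-in for the input matrix

# Randomly initialised weights; training would adjust these values.
W_Q = rng.normal(size=(embedding_dim, d_k))
W_K = rng.normal(size=(embedding_dim, d_k))
W_V = rng.normal(size=(embedding_dim, d_k))

Q = X @ W_Q   # query matrix, one row per word
K = X @ W_K   # key matrix, one row per word
V = X @ W_V   # value matrix, one row per word

print(Q.shape, K.shape, V.shape)  # (3, 3) (3, 3) (3, 3)
```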

Understanding  the Self-attention Mechanism

So why did we calculate the query, key, and value matrices? Let’s understand with 4 steps:

Step 1

  • The first step in the self-attention process is to compute the dot product of the query matrix, Q, and the transposed key matrix, K(Transpose).
Query and Key matrices
  • The following shows the result of the dot product between the query matrix, Q, and the key matrix, K(Transpose).
Dot Product between the query and key:
  • But what is the use of computing the dot product between the query and key matrices? What exactly does Q.K(Transpose) signify? Let’s understand this by looking at the result of Q.K(Transpose) in detail.
  • Let’s look at the first row of the Q.K(Transpose) matrix, as shown in the figure below. We can observe that we are computing the dot product between the query vector q1 (I) and all the key vectors: k1 (I), k2 (am), and k3 (good).

NOTE: The dot product indicates how similar two vectors are. The stronger the relationship, the higher the score.

  • So here the dot product simply measures the similarity between the query vectors and the key vectors to compute the attention scores.
  • In the same way, we calculate the dot products for the other rows as well.
Dot Product between query and key vectors
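
Step 1 can be sketched with small hand-picked matrices so that every entry is easy to check by hand. The numbers in Q and K below are made up for illustration and are not the values from the figures.

```python
import numpy as np

# Made-up query and key vectors, one row per word: I, am, good.
Q = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
K = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])

# Entry (i, j) is the dot product of query vector i with key vector j.
scores = Q @ K.T
print(scores)
# The first row holds q1.k1, q1.k2, q1.k3 -> [1, 1, 2]
```

In this toy example the query for “I” scores highest against the third key, simply because those two made-up vectors are identical.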


Step 2

  • The Q.K(Transpose) matrix is then divided by the square root of the key vector’s dimension in the self-attention process. But why do we need to do so?

And what may happen if we don’t apply this scaling?

Without scaling, the magnitudes of the dot products depend on the dimension of the key vectors: the larger the dimension, the larger the dot products tend to get. Large scores can push the softmax into regions where the gradients become very small, making the optimisation process unstable and hurting model training.

Dividing Dot product by square root of dk
Scaling of Dot product
  • Let dk be the key vector’s dimension. If the embedding size is 512, suppose the key vector dimension is 64. Taking the square root of that, we get 8.
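
The effect of the scaling can be checked numerically. The sketch below assumes that the components of the query and key vectors are independent with zero mean and unit variance (an assumption made for the demonstration, not something a trained model guarantees). Under that assumption the raw dot products have a spread of about sqrt(dk), and dividing by sqrt(dk) brings the spread back to about 1.

```python
import numpy as np

d_k = 64
scale = np.sqrt(d_k)                    # 8.0

rng = np.random.default_rng(0)
# 10,000 random query/key pairs with unit-variance components (assumption).
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw = np.sum(q * k, axis=1)             # one dot product per pair
scaled = raw / scale

print(scale)                            # 8.0
print(raw.std(), scaled.std())          # roughly 8 and roughly 1
```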


Step 3

  • Looking at the similarity scores above, we can tell that they are in unnormalised form. We therefore use the softmax function to normalise them. The softmax function brings each score into the range 0 to 1 and makes the scores in each row sum to 1, as seen in the image below:
Scaling of Dot Product
  • We refer to the previous matrix as the score matrix. It lets us understand how each word in the sentence is related to every other word by looking at the scores. Examining the first row of the score matrix, we observe that the word “I” is 90% related to itself, 7% to the word “am,” and 3% to the word “good.”
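
A row-wise softmax can be sketched as below. The scaled_scores values are made up for illustration, so the resulting percentages will not match the 90%, 7%, 3% of the figure; what matters is that every row ends up between 0 and 1 and sums to 1.

```python
import numpy as np

def softmax(scores):
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up scaled scores, one row per word: I, am, good.
scaled_scores = np.array([[4.0, 1.5, 0.5],
                          [1.0, 3.0, 0.2],
                          [0.3, 0.8, 2.5]])

score_matrix = softmax(scaled_scores)
print(score_matrix.round(2))
print(score_matrix.sum(axis=-1))        # each row sums to 1
```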


Step 4

  • So, what’s next? We computed the dot product of the query and key matrices, scaled the scores, and then normalised them using the softmax function. The final step in the self-attention mechanism is to compute the attention matrix, Z.
  • The attention matrix contains an attention value for each word in the sentence. The attention matrix, Z, is computed by multiplying the score matrix by the value matrix, V, as illustrated:
Computing attention matrix
  • As a result, our sequence will have the following attention matrix:
Result of attention Matrix
  • Each row of the attention matrix is a weighted sum of the value vectors. Let’s break this down row by row to better comprehend it. First, consider how the self-attention of the word I is calculated in the first row:
Self attention Vector
  • From the preceding image, we can see that the self-attention of the word “I” is computed by weighting the value vectors by the scores and summing them. As a result, the self-attention value for “I” comprises 90% of the value vector v1 (I), 7% of the value vector v2 (am), and 3% of the value vector v3 (good). The same applies to the other words.
Self-attention mechanism
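
The weighted sum for the word “I” can be written out directly. The scores 0.90, 0.07 and 0.03 come from the example above; the value vectors in V are made up for illustration.

```python
import numpy as np

# First row of the score matrix: how much "I" attends to I, am, good.
weights_I = np.array([0.90, 0.07, 0.03])

# Made-up value vectors v1 (I), v2 (am), v3 (good), one per row.
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Explicit weighted sum of the value vectors ...
z1 = 0.90 * V[0] + 0.07 * V[1] + 0.03 * V[2]

# ... which is the same as one row of (score matrix @ V).
z1_matmul = weights_I @ V

print(z1)   # approximately [1.39, 2.39, 3.39]
```

Because 90% of the weight sits on v1, the result stays close to the value vector of “I” itself.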

This is how the self-attention mechanism operates in transformer-based encoders.
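
Putting the four steps together, the whole computation is Z = softmax(Q.K(Transpose) / sqrt(dk)) . V. The function below is a minimal NumPy sketch of that formula for a single sentence, with random weights standing in for trained ones; the function and variable names are chosen for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T                    # step 1: query-key dot products
    scaled = scores / np.sqrt(d_k)      # step 2: scale by sqrt(dk)
    weights = softmax(scaled)           # step 3: normalise each row
    Z = weights @ V                     # step 4: weighted sum of values
    return Z, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))             # stand-in embeddings for I, am, good
W_Q, W_K, W_V = (rng.normal(size=(3, 3)) for _ in range(3))

Z, weights = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                          # (3, 3): one attention vector per word
```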


We have now seen how the transformer’s encoder and the self-attention mechanism operate. I believe that understanding the architecture of such models and integrating them effectively into NLP tasks is a crucial aspect of this line of work. In future articles, we will cover the decoder, BERT, large language models, and more. I also suggest that you understand an architecture like this before deploying it anywhere, so that you feel more knowledgeable and engaged in data science.

  • It is important to approach complex architectures with the mindset that nothing is inherently tough. With the right knowledge, dedication, and utilization of your talents, you can simplify and navigate through these architectures effectively, making them more manageable and empowering your work in data science.
  • Understanding the architecture of a framework, such as a transformer’s encoder and self-attention approach, is crucial for working effectively in NLP-based activities. It allows you to grasp the underlying principles and mechanisms that power these models.
  • Integrating the architecture of a framework correctly in any task is an essential skill. It enables you to leverage the capabilities of the framework effectively and achieve better results in NLP tasks.

Frequently Asked Questions

Q1. When was the self attention mechanism introduced?

A. The attention mechanism was first used in 2014 in computer vision, to try and understand what a neural network is looking at while making a prediction. This was one of the first steps to try and understand the outputs of Convolutional Neural Networks (CNNs).

Q2. Why do we use multi-head attention in transformers?

A. The idea behind multi-head attention is that, instead of a single attention head, we use multiple attention heads so that the model can attend to different parts of the input simultaneously. This lets it capture various types of information and maintain a richer representation. It also improves the model’s robustness and stability by reducing reliance on a single attention head and aggregating information from multiple perspectives.
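
As a rough sketch of the idea, each head below runs the same self-attention computation with its own weight matrices, and the head outputs are concatenated and passed through an output projection, as in the original paper. The sizes, the random weights, and the names heads and W_O are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

def multi_head_attention(X, heads, W_O):
    # Each head has its own (W_Q, W_K, W_V); outputs are joined side by side.
    head_outputs = [attention_head(X, *w) for w in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 4, 2   # hypothetical sizes
d_k = d_model // num_heads

X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
W_O = rng.normal(size=(num_heads * d_k, d_model))

out = multi_head_attention(X, heads, W_O)
print(out.shape)                        # (3, 4)
```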

Q3. Can the transformer encoder capture long-range dependencies effectively?

A. Yes, the transformer encoder can capture long-range dependencies effectively. It achieves this through the use of self-attention, which allows each position in the sequence to attend to all other positions, capturing relevant information regardless of distance. The parallel computation and multi-head attention mechanism further enhance the model’s ability to capture diverse relationships.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
