Introduction to Vision Transformers (ViT)

Mobarak Inuwa 14 Jun, 2023 • 6 min read


For years, computer vision (CV) and image processing techniques from artificial intelligence (AI) and pattern recognition have been used to derive information from images, videos, and other visual inputs. These methods work by manipulating digital images with computer algorithms.

Researchers found that conventional models had limitations in some applications, which prompted advances in deep learning and deep neural networks and led to the popularity of transformer models. Transformers rely on a mechanism known as “self-attention”, which gives them an edge over other model architectures, and researchers have applied them extensively in natural language processing and computer vision.

Vision Transformers (ViT)
Source: Freepik

Learning Objectives

  • What are vision transformers and transformers?
  • How do vision transformers work?
  • The idea of Multi-Head Attention
  • ViT versus Convolutional Neural Networks

This article was published as a part of the Data Science Blogathon.

What are Vision Transformers?

In simple terms, vision transformers are transformers used for visual tasks such as image processing. Transformers are used in many areas, including NLP, but ViT focuses specifically on image-related tasks. More recently, they have featured prominently in generative artificial intelligence and Stable Diffusion.

How do Vision Transformers Work?

ViT measures the relationships between parts of an input image using a technique called attention. Mimicking cognitive attention, it enhances some parts of the image and diminishes others. The goal is to learn which parts of the input are important, guided by whatever context and constraints accompany the input.


The Vision Transformer applies the transformer to image classification tasks with a model architecture similar to a regular transformer. It is adapted to handle images efficiently, much as other transformer models handle natural language processing tasks.

Key concepts of vision transformers include ‘attention’ and ‘multi-head attention’. Understanding these concepts is essential to understanding how vision transformers work. Attention is the key mechanism of transformers and the secret to their strength. Let’s look at the transformer architecture and see how it works.
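To make the idea of handling an image like a sequence more concrete: a ViT typically cuts the image into fixed-size patches and flattens each patch into a vector, so the transformer receives a sequence of patch tokens. The sketch below is a minimal NumPy illustration only. The function name `image_to_patches`, the 32×32 image and the 8-pixel patch size are assumptions chosen for the example, not values from this article.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch: (num_patches, patch * patch * C)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical 32x32 RGB image filled with increasing values
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = image_to_patches(image, patch_size=8)
print(patches.shape)  # (16, 192)
```

In a full model, each flattened patch would then usually be passed through a learned linear projection and combined with a position embedding before entering the transformer layers.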

Masked Multi-Head Attention is a central mechanism of the Transformer. Each attention block is wrapped in a skip connection similar to those in the ResNet50 architecture. This means that a shortcut carries the block's input around the layer and adds it to the block's output.

Source: Wikipedia

Let us look at these variables briefly. X is the matrix formed by concatenating the input embeddings (word embeddings in the original Transformer), from which the following matrices are derived:

Q: This stands for Query,

K: This stands for Key, and

V: This stands for Value.
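
For reference, the standard scaled dot-product attention of the original Transformer combines these matrices as follows. The projection matrices $W^Q$, $W^K$, $W^V$ and the key dimension $d_k$ are notation introduced here for illustration and do not appear in the text above.

```latex
Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the vector size, which would otherwise push the softmax toward very sharp distributions.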


Multi-head attention calculates the attention weight between a Query token, which could be an image patch or the prompt for an image, and each Key token. In other words, it calculates the relationship, or attention weight, between the Query and each Key and then uses that weight to scale the Value associated with the Key.

We can conclude that multi-head attention allows us to treat different parts of the input sequence differently. The model captures positional details better because each head attends separately to different input elements. This gives us a more robust representation.
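
The computation described above can be sketched for a single head in a few lines of NumPy. This is an illustration of the general technique under assumed shapes (2 queries, 5 keys, random values), not the article's own implementation. The function names `softmax` and `scaled_dot_product_attention` are chosen here for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d), k: (n_k, d), v: (n_k, d_v) -> (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ v, weights          # weighted average of the values

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 query tokens (assumed sizes)
k = rng.normal(size=(5, 4))   # 5 key tokens
v = rng.normal(size=(5, 3))   # one value vector per key
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)  # (2, 3) (2, 5)
```

Each row of `weights` says how strongly one query attends to each of the five keys, and the corresponding row of `output` is the resulting mixture of value vectors.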

Python Implementation of Multihead Attention

We have seen that multi-head attention uses weight matrices to transform the input into the feature vectors representing the Queries, Keys, and Values. Let us see an implementation module below.

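import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product(q, k, v, mask=None):
    # Helper called by forward(); it was not shown in the original listing,
    # so this version is an assumed, standard formulation.
    d_k = q.size(-1)
    attn_logits = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a very negative score, so softmax gives them ~0 weight
        attn_logits = attn_logits.masked_fill(mask == 0, -9e15)
    attention = F.softmax(attn_logits, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention


class MultiheadAttention(nn.Module):

    def __init__(self, input_dim, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "Embedding dimension must be 0 modulo number of heads."

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Stack all weight matrices 1...h together for efficiency
        # Note that in many implementations you see "bias=False" which is optional
        self.qkv_proj = nn.Linear(input_dim, 3*embed_dim)
        self.o_proj = nn.Linear(embed_dim, embed_dim)

        self._reset_parameters()

    def _reset_parameters(self):
        # Original Transformer initialization, see PyTorch documentation
        nn.init.xavier_uniform_(self.qkv_proj.weight)
        self.qkv_proj.bias.data.fill_(0)
        nn.init.xavier_uniform_(self.o_proj.weight)
        self.o_proj.bias.data.fill_(0)

    def forward(self, x, mask=None, return_attention=False):
        batch_size, seq_length, _ = x.size()
        qkv = self.qkv_proj(x)

        # Separate Q, K, V from linear output
        qkv = qkv.reshape(batch_size, seq_length, self.num_heads, 3*self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3) # [Batch, Head, SeqLen, Dims]
        q, k, v = qkv.chunk(3, dim=-1)

        # Determine value outputs
        values, attention = scaled_dot_product(q, k, v, mask=mask)
        values = values.permute(0, 2, 1, 3) # [Batch, SeqLen, Head, Dims]
        values = values.reshape(batch_size, seq_length, self.embed_dim)
        o = self.o_proj(values)

        # Optionally return the attention weights alongside the output
        if return_attention:
            return o, attention
        return o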

Visit here for more information.
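
The reshape, permute and chunk steps in `forward` are easiest to follow by tracing shapes alone. The sketch below repeats them in NumPy on placeholder arrays of zeros. The sizes (batch of 2, 5 tokens, embedding size 8, 4 heads) are assumptions picked for readability, and `transpose` and `np.split` stand in for PyTorch's `permute` and `chunk`.

```python
import numpy as np

batch_size, seq_length, embed_dim, num_heads = 2, 5, 8, 4
head_dim = embed_dim // num_heads  # 2 features per head

# Stand-in for the output of the stacked Q/K/V projection
qkv = np.zeros((batch_size, seq_length, 3 * embed_dim))

# Give every head its own slice of features, then move the head axis forward
qkv = qkv.reshape(batch_size, seq_length, num_heads, 3 * head_dim)
qkv = qkv.transpose(0, 2, 1, 3)        # [Batch, Head, SeqLen, 3 * head_dim]
q, k, v = np.split(qkv, 3, axis=-1)    # each [Batch, Head, SeqLen, head_dim]
print(q.shape)  # (2, 4, 5, 2)

# After attention, the heads are merged back into one embedding per token
values = np.zeros_like(v)              # stand-in for the attention output
values = values.transpose(0, 2, 1, 3)  # [Batch, SeqLen, Head, head_dim]
values = values.reshape(batch_size, seq_length, embed_dim)
print(values.shape)  # (2, 5, 8)
```

The output has the same leading shape as the input sequence, which is what allows attention layers to be stacked.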

Applications of Vision Transformers

Vision Transformers have revolutionized traditional computer vision tasks. Their main areas of application include the following:

  • Image Detection and Classification
  • Video Deepfake Detection and Anomaly Detection
  • Image segmentation and cluster analysis
  • Autonomous Driving

Vision Transformers versus Convolutional Neural Networks

Comparing the two also helps in understanding transformers. The differences are many, starting with their architectures.

  1. Major building blocks: A vision transformer's performance depends on three major choices, namely the optimizer, the dataset-specific hyperparameters that control the learning process, and the network depth. Convolutional neural networks are less complex to optimize.
  2. Both architectures learn better with more data, but they differ in how much they need. Vision transformers generally need large datasets, or pretraining on them, to reach their best accuracy, whereas CNNs perform satisfactorily on comparatively smaller datasets.
  3. CNNs tend to have inductive biases. Inductive bias, or learning bias, is the set of assumptions a model makes when making predictions. In CNNs, assumptions such as locality can limit how well the model captures global relations. Vision transformers have far weaker inductive biases, so what they learn is shaped mostly by their training data and training method.
  4. In practice, vision transformers have been reported to be more robust to input image distortions than CNNs.
  5. Transformers process all patches of an image in parallel, and attention lets every patch interact with every other patch within a single layer. CNNs build up context gradually, because each convolution sees only a local neighborhood and wider context emerges only after stacking layers.
  6. A huge difference is the presence of an attention mechanism in transformers. Attention lets transformers respond to prompts or context while still using learned information, whereas CNNs apply the same learned filters regardless of context.
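
One way to see the architectural difference in the list above is to compare what a single layer can see. The toy sketch below, in NumPy, changes only the last element of a short 1-D signal and checks whether the first output changes. The helper names, the signal values and the use of identity projections in the attention are all simplifying assumptions made for illustration.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution with zero padding; each output sees only its neighbours."""
    return np.convolve(x, kernel, mode="same")

def self_attention_1d(x):
    """Single-head self-attention over scalar tokens with identity projections."""
    scores = np.outer(x, x)                          # token-to-token similarity
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all tokens
    return weights @ x

x = np.array([0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.7])
x_changed = x.copy()
x_changed[-1] = 2.0   # perturb only the last token

kernel = np.array([0.25, 0.5, 0.25])
conv_same = np.isclose(conv1d_same(x, kernel)[0], conv1d_same(x_changed, kernel)[0])
attn_same = np.isclose(self_attention_1d(x)[0], self_attention_1d(x_changed)[0])
print(conv_same, attn_same)  # True False
```

The first convolution output is untouched by a change seven positions away, while the first attention output shifts immediately. A deep CNN does widen its view by stacking layers, so this shows a per-layer difference rather than a limit on what CNNs can learn.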

Vision Transformers for Dense Prediction

Intel Labs has played a vital role in researching and presenting work on vision transformers for making per-pixel predictions on images, which is known as dense prediction. Dense prediction learns a mapping from a simple input image to a complex output, as in semantic segmentation or image depth estimation.

Depth estimation works at the level of individual image pixels, so it is very handy for computer vision applications such as object tracking, augmented reality, and autonomous cars.


The vision transformer architecture processes data in a way that lets it gather information from different parts, or pixels, of the image. To focus on the relevant pixels, it uses self-attention to capture relationships across the overall image context. Finally, researchers have combined the CNN and ViT architectures into hybrid architectures and obtained excellent results.

Key Takeaways:

  • Self-attention gives transformers an edge over other model architectures, and researchers have applied them extensively in advanced applications.
  • Vision transformers are transformers designed specifically for visual tasks, such as image processing.
  • The key concept that forms the foundation of vision transformers is “multi-head attention.”
  • Intel Labs presented vital work on vision transformers for dense prediction, that is, making per-pixel predictions on images.


Frequently Asked Questions

Q1. What is a vision transformer in simple terms?

A. In simple terms, a vision transformer is a deep learning model that applies transformers, originally designed for natural language processing, to image recognition tasks. It breaks down an image into patches, processes them using transformers, and aggregates the information for classification or object detection.

Q2. How can I learn vision transformers?

A. Learning vision transformers involves studying the underlying concepts of transformers, attending online courses, reading research papers, experimenting with available code implementations, and working on image-based tasks to gain hands-on experience.

Q3. What size is a vision transformer?

A. The size of a vision transformer refers to its number of parameters. Different models may vary in size, typically ranging from a few million to several billion parameters, depending on the complexity and scale of the desired vision task.
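
As a rough illustration of where such parameter counts come from, the sketch below adds up the weights of a plain ViT backbone without a classification head. The function name `vit_backbone_params` is hypothetical, and the settings (embedding size 768, 12 blocks, MLP size 3072, 16-pixel patches on a 224×224 image) are assumed ViT-Base-style values that do not come from this article.

```python
def vit_backbone_params(embed_dim, depth, mlp_dim, patch_size, num_patches, channels=3):
    """Rough parameter count for a ViT backbone (no classification head)."""
    # Attention: stacked Q/K/V projection plus output projection, with biases
    attention = (embed_dim * 3 * embed_dim + 3 * embed_dim) \
        + (embed_dim * embed_dim + embed_dim)
    # Two-layer MLP, with biases
    mlp = (embed_dim * mlp_dim + mlp_dim) + (mlp_dim * embed_dim + embed_dim)
    # Two LayerNorms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * embed_dim)
    per_block = attention + mlp + layer_norms

    patch_embed = patch_size * patch_size * channels * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim   # +1 for the class token
    final_norm = 2 * embed_dim
    return depth * per_block + patch_embed + cls_token + pos_embed + final_norm

# Assumed settings: (224 / 16) ** 2 = 196 patches
total = vit_backbone_params(embed_dim=768, depth=12, mlp_dim=3072,
                            patch_size=16, num_patches=196)
print(f"{total:,}")  # 85,798,656
```

Under these assumptions the backbone comes to roughly 86 million parameters, almost all of them in the attention and MLP weights of the twelve blocks.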

Q4. How many layers does the vision transformer have?

A. The number of layers in a vision transformer can also vary. Commonly, vision transformers stack from about a dozen to a few dozen layers (for example, 12 in ViT-Base and 24 in ViT-Large), allowing the model to learn representations of the input image at different levels of abstraction.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
