Vision Transformers (ViT) in Image Captioning Using Pretrained ViT Models

Mobarak Inuwa 27 Jun, 2023 • 7 min read


Image captioning is the task of translating an image into a textual description, much like the written caption placed beneath a picture to describe its details. It works by connecting vision (the image) and language (the text). In this article, we do this with Vision Transformers (ViT) as the core technology, using the PyTorch backend. The goal is to show how transformers, and ViTs in particular, can generate image captions from already-trained models without retraining from scratch.

Source: Springer

With pictures now everywhere on social media and the wider web, this skill has many uses, including image description, citation, aiding the visually impaired, and even search engine optimization. That makes the technique handy for any project that involves images.

Learning Objectives

  • The idea of Image Captioning
  • Using ViTs for Image Captioning
  • Carrying out Image captioning with pre-trained Models
  • Utilizing Transformers using Python

You can find the entire code used in this GitHub repo.

This article was published as a part of the Data Science Blogathon.

What are Transformer Models?

Before we look into ViT, let’s start with an understanding of transformers. Since transformers were introduced in 2017 by researchers at Google Brain, their capabilities have attracted wide interest in NLP. A transformer is a deep learning model distinguished by its use of self-attention, which weighs the significance of each part of the input data differently. Transformers have been used primarily in natural language processing (NLP).

Transformers handle sequential input data, such as natural language, but unlike recurrent models they process the entire input at once rather than one element at a time. The attention mechanism provides context for every position in the input sequence. This allows far more parallelization, which in turn reduces training times.
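To make the attention idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is illustrative only: the function names and the toy vectors are made up for this example, and it skips the learned query, key, and value projections and the multiple heads that a real transformer uses.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Simplification: each token vector serves as its own query, key and value.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token with every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Weighted average of all token vectors
        mixed = [sum(w * tok[i] for w, tok in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "token" vectors (hypothetical values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every output vector is a mix of all input vectors, and each token's weights are computed independently of the others, which is why the whole sequence can be processed in parallel.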

Transformer Architecture

Now let us look at the architectural makeup of transformers. The Transformer architecture is primarily an encoder-decoder structure, presented in the well-known paper titled “Attention Is All You Need”.


The encoder is a stack of layers that process the input one after another, while the decoder layers receive the encoder output and generate the decoded output. Simply put, the encoder maps the input sequence to a sequence of internal representations, which is fed into the decoder. The decoder then generates the output sequence.

What are Vision Transformers?

Since this article shows a practical use of ViTs in image captioning, it helps to understand how ViTs work. Vision transformers are transformers applied to visual tasks. They split an image into fixed-size patches, treat each patch as a token, and use attention to find the relationships between the patches. In this use case, they connect our image with text tokens.
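A ViT treats an image as a sequence of patches. The sketch below shows the patch-splitting step on a toy grid in plain Python. The function name is hypothetical, and the 224x224 input with 16x16 patches is an assumed, commonly used setting rather than something read from the model in this article.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches,
    each flattened into a single list, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 single-channel "image" with pixel values 0..15
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_image, patch_size=2)

# Assumed ViT-style setting: 224x224 input, 16x16 patches
num_patches = (224 // 16) ** 2
```

Under the assumed setting the encoder would see 196 patch tokens per image. In a real ViT each flattened patch is also passed through a learned linear projection and combined with a position embedding before attention is applied.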

Source: Dosovitskiy et al., 2021

Implementing Image Captioning

With an understanding of what transformers are and how they work, let us implement our image captioning model. We will start by installing the transformers library, then load the model, and finally use it to generate captions for images.


Before we write the code, keep in mind that we are using vit-gpt2-image-captioning, a model already trained for image captioning and made available through the Hugging Face Hub. Its backbone is a vision transformer.

Importing Required Libraries

The first step is to install the transformers library, since it is not pre-installed in Colab.

# Installing Transformer Libraries

!pip install transformers

Now, we can import libraries.

# Web links Handler
import requests

# Backend
import torch

# Image Processing
from PIL import Image

# Transformer and pre-trained Model
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast

# Managing loading processing
from tqdm import tqdm

# Assign available GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Loading a fine-tuned image captioning Transformer Model

# ViT Encoder-Decoder Model
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning").to(device)

# Corresponding GPT-2 Tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Image processor
image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

We have loaded three pre-trained components using classes from the transformers library. Let us look at each briefly.

  • VisionEncoderDecoderModel: This carries out image-to-text generation by pairing any pre-trained Transformer-based vision model as the encoder, such as ViT (used here) or BEiT (which relies on self-supervised pre-training of Vision Transformers and is reported to outperform supervised pre-training), with any pre-trained language model as the decoder, such as GPT2 (also used here). In this approach, the encoder encodes the image and the language model then generates the caption.
  • GPT2TokenizerFast: This creates a GPT-2 tokenizer backed by the Hugging Face tokenizers library. The tokenizer has already been trained, so it can convert the generated token IDs back into caption text without further setup.
  • ViTImageProcessor: Lastly, this constructs a ViT image processor, which prepares images (for example by resizing and normalizing them) into the tensors the model expects.
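To give a feel for what an image processor does to pixel values, here is a minimal sketch of the rescale-and-normalize step in plain Python. The function name is hypothetical, and the mean, standard deviation, and scale used here are assumptions for illustration; the real values are stored with the checkpoint's preprocessing configuration.

```python
def rescale_and_normalize(pixels, mean=0.5, std=0.5, scale=1 / 255):
    """Map 8-bit pixel values (0..255) to normalized floats.

    mean, std and scale are assumed values for illustration; the actual
    ones come from the checkpoint's preprocessing configuration.
    """
    return [((p * scale) - mean) / std for p in pixels]

# Darkest and brightest possible 8-bit pixel values
normalized = rescale_and_normalize([0, 255])
```

With these assumed settings, pixel values end up roughly in the range -1 to 1. Printing image_processor in the notebook should show the settings that were actually loaded.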

Preparing Image for Captioning

Now we need functions for checking URLs and loading the images we wish to caption.

# Accessing images from the web
import urllib.parse as parse
import os

# Verify that a string is a URL with a scheme, host and path
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except ValueError:
        return False

# Load an image from a URL or a local file path
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)

So we just created two functions: one to verify a URL, and another that uses it to load the image for captioning, either from the web or from a local path.
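It is worth seeing which strings such a check accepts. The sketch below mirrors the URL test using only the standard library; the name looks_like_url and the example.com addresses are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlparse

def looks_like_url(text):
    """Return True when text has a scheme, a host and a path."""
    parts = urlparse(text)
    return all([parts.scheme, parts.netloc, parts.path])

candidates = [
    "https://example.com/images/horse.jpg",  # full URL with a path
    "https://example.com",                   # no path component
    "images/horse.jpg",                      # local relative path
]
results = {text: looks_like_url(text) for text in candidates}
```

Only the first candidate passes. A bare domain without a path is rejected, and a local path falls through to the file-system branch of the loader.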

Performing Inference on the Image

Inference is where the model produces a caption for an image based on its content. The image is first converted to PyTorch tensors of pixel values by the image processor. We then use the general generate() method, as shown below, to autoregressively generate the caption.

# Image inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocessing the Image
    img = image_processor(image, return_tensors="pt").to(device)
    # Generating captions
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

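We called generate() without extra arguments, so the model's default decoding strategy applies (greedy decoding, unless the checkpoint's generation configuration specifies otherwise). Other options include beam search and multinomial sampling. You can experiment with them and compare the captions.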
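The difference between greedy decoding and beam search is easiest to see on a toy example. The sketch below uses a made-up table of next-token probabilities instead of a real model; every name and number in it is hypothetical.

```python
# Hypothetical next-token probabilities, keyed by the tokens generated so far
NEXT_TOKEN_PROBS = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "horse": 0.45},
    ("the",): {"horse": 0.9, "dog": 0.1},
}

def greedy_decode(steps):
    """Pick the single most probable token at every step."""
    sequence, prob = (), 1.0
    for _ in range(steps):
        options = NEXT_TOKEN_PROBS[sequence]
        token = max(options, key=options.get)
        prob *= options[token]
        sequence += (token,)
    return sequence, prob

def beam_search(steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [((), 1.0)]
    for _ in range(steps):
        candidates = [
            (sequence + (token,), prob * token_prob)
            for sequence, prob in beams
            for token, token_prob in NEXT_TOKEN_PROBS[sequence].items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_sequence, greedy_prob = greedy_decode(steps=2)
beam_sequence, beam_prob = beam_search(steps=2, num_beams=2)
```

Greedy decoding commits to "a" because it is the best first token, and ends with a less probable sequence overall than beam search, which keeps "the" alive and finds "the horse". With the captioning model, the equivalent switch would be passing arguments such as num_beams=4 or do_sample=True to model.generate().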

Loading and Captioning Images

Finally, we can load and caption our images. We will load a number of images and see how the captioning performs. Note that these images are not from the COCO dataset but from sources across the web. Feel free to use your own images.

#  Image media display
from IPython.display import display

Example 1

# Loading URLs
url = ""

# Display Image
display(load_image(url))

# Display Caption
get_caption(model, image_processor, tokenizer, url)
Source: Pexels


a black horse running through a grassy field

Example 2

# Loading URLs
url = ""

# Display Image
display(load_image(url))

# Display Caption
get_caption(model, image_processor, tokenizer, url)
Source: Pexels


a man standing on top of a hill with a mountain

Example 3

# Loading URLs
url = ""

# Display Image
display(load_image(url))

# Display Caption
get_caption(model, image_processor, tokenizer, url)
Source: Pexels


a dog with a long nose 
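The three examples repeat the same call, so a small loop can caption many images in one go. The sketch below is one way to do that; caption_all and fake_caption are hypothetical helpers, and the stand-in caption function is there only so the example runs without downloading a model.

```python
def caption_all(sources, caption_fn):
    """Apply caption_fn to every source and collect the results.

    A failure on one source is recorded as None so the loop keeps going.
    """
    captions = {}
    for source in sources:
        try:
            captions[source] = caption_fn(source)
        except Exception as error:  # e.g. unreachable URL or unreadable file
            print(f"Could not caption {source!r}: {error}")
            captions[source] = None
    return captions

# Stand-in for the real captioning call, so this sketch runs without a model
def fake_caption(source):
    if not source:
        raise ValueError("empty image path")
    return f"caption for {source}"

demo = caption_all(["horse.jpg", ""], fake_caption)
```

In the notebook, caption_fn could be something like lambda src: get_caption(model, image_processor, tokenizer, src), with sources being a list of image URLs or local paths.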

Other Applications of Vision Transformers

Before we wrap up, let us look at a few other use cases of Vision Transformers besides image captioning:

  • Optical Character Recognition (OCR)
  • Image Detection/Classification
  • Deepfake Identification
  • Anomaly Detection/Segmentation
  • Image segmentation and analysis


We have carried out image captioning using Vision Transformers (ViT) with a PyTorch backend. ViTs apply the transformer's self-attention to images, and because the model used here was already trained, we did not have to train anything ourselves. Loading the pre-trained components with VisionEncoderDecoderModel, GPT2TokenizerFast, and ViTImageProcessor provided an easy way to build a captioning pipeline without starting from scratch.

Key Takeaways

  • We performed image captioning, translating an image into a textual description, using pre-trained Vision Transformer (ViT) models and the PyTorch backend.
  • Transformers process sequential input data using self-attention, which enables parallelization and reduces training times.
  • We demonstrated the practical use of ViTs in image captioning, utilizing attention mechanisms to connect images with texts.

Frequently Asked Questions (FAQs)

Q1. What does a Vision Transformer do?

A. Vision transformers are widely applied in image recognition, generative modeling, and multimodal tasks.

Q2. What is the difference between CNN and ViTs?

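A. CNNs build in inductive biases such as locality and translation equivariance through their convolutions, while vision transformers have far weaker built-in biases and instead use attention to relate all patches of an image to one another. As a result, ViTs generally need more training data or pre-training to match CNNs, and their performance depends heavily on the optimizer, dataset-specific hyperparameters, and network depth. Their attention mechanisms can also make them more robust to distortions in the input image.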

Q3. How is image captioning done?

A. An image captioning model uses an encoder-decoder structure: the encoder extracts features from the image, and the decoder, typically a language model, generates the text description from those features.

Q4. What is the importance of image captioning?

A. Image captioning helps to provide a textual representation of an image. The benefits include helping the visually impaired get the context of an image by using screen readers to read the text.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
