Vision Transformers (ViT) in Image Captioning Using Pretrained ViT Models
An image caption is a short written description, usually shown beneath an image, that conveys what the image contains. Image captioning is the task of translating an image into such a textual description, and it works by connecting vision (the image) with language (the text). In this article, we do this with Vision Transformers (ViT) as the core technology, using a PyTorch backend. The goal is to show how transformers, and ViTs in particular, can generate image captions from pre-trained models without training from scratch.
With images now central to social media and the wider web, this skill has many uses, including describing and citing images, assisting visually impaired users, and search engine optimization. That makes the technique handy for any project that involves images.
- The idea of Image Captioning
- Using ViTs for Image Captioning
- Carrying out Image captioning with pre-trained Models
- Utilizing Transformers using Python
You can find the entire code used in this GitHub repo.
This article was published as a part of the Data Science Blogathon.
What are Transformer Models?
Before we look into ViT, let’s start with an understanding of transformers. Since researchers at Google introduced the transformer in 2017, it has attracted strong interest for its capabilities in NLP. A transformer is a deep learning model distinguished by its use of self-attention, which weights the significance of each part of the input data differently. It has been used primarily in natural language processing (NLP).
Transformers handle sequential input data such as natural language, but they process the entire input at once rather than one element at a time. The attention mechanism provides context for any position in the input sequence. This allows for more parallelization, which reduces training times.
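To make the idea of self-attention more concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the queries, keys, and values are the input vectors themselves, whereas a real transformer applies learned projection matrices and several attention heads. The function names and the toy embeddings are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors
    outputs = [
        [sum(w * value[i] for w, value in zip(row, vectors)) for i in range(dim)]
        for row in weights
    ]
    return outputs, weights

# Three toy 2-dimensional token embeddings (made-up numbers)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every position gets its own row of weights over all positions, and each row can be computed independently of the others, which is what makes the computation easy to parallelize.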
Now let us look at the architecture of transformers. The Transformer is built primarily around an encoder-decoder structure, which was presented in the well-known paper “Attention Is All You Need”.
The encoder consists of layers that process the input iteratively, one layer after another, while the decoder layers receive the encoder output and generate a decoded output. Simply put, the encoder maps the input sequence to a sequence of representations, which is then fed into the decoder. The decoder then generates an output sequence.
What are Vision Transformers?
Since this article shows a practical use of ViTs in image captioning, it helps to understand how ViTs work. Vision transformers are transformers applied to visual tasks. A ViT splits an image into patches, treats the patches as a sequence, and uses attention to find the relationships between them. In this use case, the ViT connects our image with text tokens.
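The way a ViT turns an image into a sequence of patches can be sketched in a few lines of plain Python. The helper split_into_patches is hypothetical (it is not part of any library), and the 224x224 input with 16x16 patches is an assumption based on the common ViT-Base configuration; the checkpoint used later in this article may differ.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block, row by row
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A toy 4x4 single-channel "image" cut into 2x2 patches
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
print(len(patches), patches[0])

# Assumed ViT-Base setting: 224x224 input, 16x16 patches
num_patches = (224 // 16) ** 2
print(num_patches)
```

In a real ViT each flattened patch is then projected to an embedding vector and combined with a position embedding before entering the transformer encoder; those steps are left out here.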
Implementing Image Captioning
With an understanding of what transformers are and how they work, let us implement our image captioning model. We will start by installing the transformers library, then load the model, and finally use it to generate captions for images.
Before we write the code, note that we are using vit-gpt2-image-captioning, a model trained for image captioning and made available through Hugging Face. The backbone of this model is a vision transformer.
Importing Required Libraries
The first step is to install the Transformers library, since it is not yet pre-installed in Colab.
# Installing the Transformers library
!pip install transformers
Now, we can import libraries.
# Web links Handler
import requests
# Backend
import torch
# Image Processing
from PIL import Image
# Transformer and pre-trained Model
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast
# Managing loading processing
from tqdm import tqdm

# Assign available GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Loading a fine-tuned image captioning Transformer Model
# ViT encoder + GPT-2 decoder model
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning").to(device)
# Corresponding GPT-2 tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
# Image processor
image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
We have loaded three pre-trained components using classes from the transformers library. Let us look briefly at what each one does.
- VisionEncoderDecoderModel: This performs image-to-text generation by pairing any pre-trained Transformer-based vision model as the encoder, such as ViT (used here) or BEiT (which uses self-supervised pre-training of Vision Transformers to outperform supervised pre-training), with any pre-trained language model as the decoder, such as GPT-2 (also used here). In this approach, the VisionEncoderDecoderModel encodes the image and then uses the language model to generate the caption.
- GPT2TokenizerFast: This creates a GPT-2 tokenizer backed by the Hugging Face tokenizers library. The tokenizer is already trained, so it can convert the generated token IDs back into caption text without further setup.
- ViTImageProcessor: Lastly, this constructs a ViT image processor, which prepares raw images (for example by resizing and normalizing them) in the form the ViT encoder expects.
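As a rough illustration of what the normalization step of an image processor does, the sketch below rescales 8-bit pixel values to the range [0, 1] and then standardizes them. The function name and the mean and standard deviation of 0.5 are assumptions for illustration; the actual values live in the processor's configuration (its image_mean and image_std attributes) and should be checked there. Resizing is omitted.

```python
def rescale_and_normalize(pixel, mean=0.5, std=0.5):
    """Map an 8-bit pixel value (0-255) to a normalized float."""
    scaled = pixel / 255.0        # rescale to the range [0, 1]
    return (scaled - mean) / std  # center and scale (assumed mean and std)

# One made-up RGB pixel: pure red
rgb_pixel = [255, 0, 0]
normalized = [rescale_and_normalize(value) for value in rgb_pixel]
print(normalized)  # [1.0, -1.0, -1.0]
```

With these assumed settings every channel ends up in the range [-1, 1], which keeps the inputs on the scale the encoder saw during training.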
Preparing Images for Captioning
Now we need functions for validating URLs and loading the images we wish to caption.
# Accessing images from the web or from disk
import os
import urllib.parse as parse

# Verify that a string looks like a URL (scheme, host, and path all present)
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        return False

# Load an image from a URL or a local file path
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
    # Fail loudly instead of silently returning None
    raise FileNotFoundError(f"Could not load image from: {image_path}")
We have created two functions: one verifies a URL, and the other uses that check to load the image for captioning, either from the web or from a local path.
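To see how the URL check behaves on a few inputs, here is a small self-contained sketch that repeats check_url and tries it on sample strings. The example strings are made up for illustration. Note one consequence of requiring a path: a bare domain with no path component is rejected.

```python
import urllib.parse as parse

def check_url(string):
    """Return True only if the string has a scheme, a host, and a path."""
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        return False

examples = [
    "https://images.pexels.com/photos/101667/pexels-photo-101667.jpeg",
    "photos/horse.jpg",     # local relative path: no scheme or host
    "https://example.com",  # valid URL, but the path component is empty
]
results = {text: check_url(text) for text in examples}
for text, ok in results.items():
    print(ok, text)
```

Strings that fail the check fall through to the local-file branch of load_image, so a relative path such as photos/horse.jpg is only opened if the file actually exists on disk.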
Performing Inference on the Image
Inference is the step where the model draws a reasonable conclusion about the image, here a caption, based on its characteristics. The image is first converted to tensors using PyTorch (as done here) rather than handled as raw pixels. To perform inference, we use the generate method, as shown below, to autoregressively produce the caption.
# Image inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocessing the Image
    img = image_processor(image, return_tensors="pt").to(device)
    # Generating captions
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)
    return caption
We have used greedy decoding, which is the default. Other options include beam search and multinomial sampling. You can experiment with them and compare the resulting captions.
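To make the difference between greedy decoding and beam search concrete, here is a minimal, self-contained sketch on a toy next-token table. TOY_MODEL and its probabilities are made up for illustration and have nothing to do with the real captioning model; the two functions are simplified versions of the strategies, not the library's implementation.

```python
import math

# Hypothetical toy "language model": maps a prefix to next-token probabilities
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.55, "horse": 0.45},
    ("the",): {"dog": 0.9, "horse": 0.1},
}

def greedy_decode(model, steps):
    """At every step, keep only the single most probable next token."""
    prefix, score = (), 0.0
    for _ in range(steps):
        probs = model[prefix]
        token = max(probs, key=probs.get)
        score += math.log(probs[token])
        prefix += (token,)
    return prefix, score

def beam_search(model, steps, num_beams):
    """Keep the num_beams best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (token,), score + math.log(prob))
            for prefix, score in beams
            for token, prob in model[prefix].items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_score = greedy_decode(TOY_MODEL, steps=2)
beam_seq, beam_score = beam_search(TOY_MODEL, steps=2, num_beams=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy decoding commits to "a" because it is the best first token, while beam search keeps "the" alive and finds the sequence with the higher overall probability. In the transformers library, these strategies are typically selected by passing arguments such as num_beams or do_sample=True to generate.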
Loading and Captioning Images
Finally, we can load and caption our images. We will load a number of images and see how the captioning performs. Note that these images are not from the COCO dataset but from sources across the web. Feel free to use your own images.
# Image media display
from IPython.display import display

# Loading URLs
url = "https://images.pexels.com/photos/101667/pexels-photo-101667.jpeg?auto=compress&cs=tinysrgb&w=600"
# Display Image
display(load_image(url))
# Display Caption
get_caption(model, image_processor, tokenizer, url)
a black horse running through a grassy field
# Loading URLs
url = "https://images.pexels.com/photos/103123/pexels-photo-103123.jpeg?auto=compress&cs=tinysrgb&w=600"
# Display Image
display(load_image(url))
# Display Caption
get_caption(model, image_processor, tokenizer, url)
a man standing on top of a hill with a mountain
# Loading URLs
url = "https://images.pexels.com/photos/406014/pexels-photo-406014.jpeg?auto=compress&cs=tinysrgb&w=600"
# Display Image
display(load_image(url))
# Display Caption
get_caption(model, image_processor, tokenizer, url)
a dog with a long nose
Other Applications of Vision Transformers
Before we wrap up, let us look at a few other use cases of Vision Transformers besides image captioning:
- Optical Character Recognition (OCR)
- Image Detection/Classification
- Deepfake Identification
- Anomaly Detection/Segmentation
- Image segmentation and analysis
We have carried out image captioning using Vision Transformers (ViT) with a PyTorch backend. Transformers are deep learning models that process sequential input in parallel, which reduces training times, and ViTs apply the same idea to images. Using the pre-trained VisionEncoderDecoderModel, GPT2TokenizerFast, and ViTImageProcessor gave us an easy way to build a captioning pipeline without training from scratch. This encoder-decoder combination is well suited to image captioning.
- We performed image captioning, translating an image into a textual description, using pre-trained Vision Transformer (ViT) models and a PyTorch backend.
- Transformers are models that process sequential input data using self-attention, which allows parallelization and reduces training times.
- We demonstrated the practical use of ViTs in image captioning, utilizing attention mechanisms to connect images with texts.
Frequently Asked Questions (FAQs)
A. Vision transformers are widely applied in image recognition, generative modeling, and multimodal tasks.
A. The performance of a vision transformer depends on choices such as the optimizer, dataset-specific hyperparameters, and network depth. Compared with CNNs, ViTs have weaker built-in inductive biases, which generally means they need more training data, and their attention mechanism can make them more robust to input image distortions.
A. An image captioning model uses an encoder-decoder structure: the encoder extracts features from the image, and the decoder generates the caption text from those features.
A. Image captioning helps to provide a textual representation of an image. The benefits include helping the visually impaired get the context of an image by using screen readers to read the text.
- Project GitHub: https://github.com/inuwamobarak/Image-captioning-ViT
- Vision Transformer (ViT), huggingface.co
- OpenAI GPT2, huggingface.co
- Tokenizer, huggingface.co
- Vision Encoder Decoder Models, huggingface.co
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.