Zero-shot Object Detection with Owl ViT Base Patch32

Maigari David Last Updated : 15 Nov, 2024
7 min read

Owl ViT is a computer vision model that has become very popular and found applications across various industries. It takes an image and a text query as input. After processing the image, the model returns a confidence score and the location (bounding box) of the object described by the text query.

The model's vision transformer architecture lets it relate text to images, which is why it uses both an image encoder and a text encoder during processing. Owl ViT builds on CLIP, which is trained with a contrastive loss so that image-text similarity scores are reliable.

Learning Objectives

  • Learn about the zero-shot object detection capabilities of Owl ViT.
  • Study the model architecture and image processing phases of this model. 
  • Explore Owl ViT object detection by running inference. 
  • Gain insight into real-life applications of Owl ViT. 

This article was published as a part of the Data Science Blogathon.

What is Zero-shot Object Detection? 

Zero-shot object detection is a computer vision capability that lets a model identify objects of classes it was never explicitly trained on. The model takes an image as input along with a list of candidate labels and decides which candidates are most likely present in the image. It also returns bounding boxes that mark each detected object's position in the image.

Traditionally, a model like Owl ViT would need a lot of labeled training data to perform these tasks: large numbers of images of cars, cats, dogs, bikes, and so on. Zero-shot object detection sidesteps this by relying on text-image similarity, so you can simply provide text descriptions and let the model's language understanding do the rest. This concept is the base of this model's architecture, which brings us to the next section. 


Model Architecture of Owl ViT Base Patch32

Owl ViT is an open-source model that builds on CLIP for image-text matching. It can detect objects of arbitrary classes by matching images against text descriptions using computer vision technology. 

This model's foundation is its vision transformer architecture. The architecture splits an image into a sequence of patches, which a transformer encoder then processes. 

A text encoder handles the model's language understanding and processes the input text query, while the vision transformer encoder works on the image patches. With these two encoders, the model can find the relationship between text descriptions and images. 

The vision transformer architecture has become popular for many computer vision tasks. With the Owl ViT model, zero-shot object detection is the game changer: the model can detect objects described by words it has never seen during training, which streamlines the pre-training process. 
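To make this idea concrete, here is a minimal, illustrative sketch of CLIP-style contrastive matching. The embeddings below are random stand-ins (not real Owl ViT outputs) and the sizes are arbitrary; the point is only that image regions and text queries live in the same embedding space and are compared with cosine similarity.

import torch

torch.manual_seed(0)
num_regions, num_queries, dim = 4, 2, 512  # hypothetical sizes

region_embeds = torch.randn(num_regions, dim)  # stand-ins for image-region embeddings
text_embeds = torch.randn(num_queries, dim)    # stand-ins for text-query embeddings

# Normalize so that the dot product becomes cosine similarity
region_embeds = region_embeds / region_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Each image region gets one similarity score per text query; the
# highest-scoring query acts as that region's predicted label.
similarity = region_embeds @ text_embeds.T  # shape: (num_regions, num_queries)
print(similarity.argmax(dim=-1))  # best-matching query index per region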

How to Use the Owl ViT Base Patch32 Model?

To put this theory into practice, we need to meet a few requirements before running the model. We will use the Hugging Face Transformers library, which gives us access to open-source transformer models and related tooling. There are a few steps to running this model, starting with importing the needed libraries.  

Importing the Necessary Libraries 

First, we must import three essential libraries to run this model: requests, PIL.Image, and torch. Each of them plays a role in the object detection task. Here is a brief breakdown: 

The requests library is used for making HTTP requests and accessing APIs. It lets you interact with web servers, for example to download content such as images from a URL. PIL (Pillow) allows you to open, save, and modify images in different file formats. Torch (PyTorch) is a deep learning framework that provides tensor operations, GPU support, and the building blocks for machine learning tasks such as model training and inference. 

import requests
from PIL import Image
import torch
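If you prefer to pull an image from the web instead of a local file, requests and PIL can be combined as below. The URL here is only an example (a COCO sample image commonly used in the Transformers documentation); any publicly accessible image URL works.

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example URL
image = Image.open(requests.get(url, stream=True).raw)
print(image.size)  # (width, height)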

Loading the Owl ViT Model 

The next part of running this model is loading the processor, which prepares the data for Owl ViT, and the detection model itself.

from transformers import OwlViTProcessor, OwlViTForObjectDetection

This import gives us the processor, which handles the input formats (resizing images and tokenizing text descriptions), and the object detection model itself. Together they turn raw images and text into pre-processed data the model can work with. 

In our case we use Owl ViT for object detection, so we load the processor and the detection model from the same pretrained checkpoint.

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
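Optionally, you can move the model to a GPU and put it in evaluation mode before running inference. This is a small sketch rather than a required step; if you do move the model, remember to move the processor outputs to the same device later on.

# Optional: use a GPU if one is available and disable dropout for inference
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
# Later, before calling the model, move the inputs as well, e.g.:
# inputs = {k: v.to(device) for k, v in inputs.items()}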

Image Processing

# Load a local image and define the candidate text queries
image_path = "/content/five cats.jpg"
image = Image.open(image_path)
texts = [["a photo of a cat", "a photo of a dog"]]

# Pre-process the text and image, then run the model
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

The Owl ViT processor has to be compatible with the input you want to use. The call processor(text=texts, images=image, return_tensors="pt") not only pre-processes the image and the text descriptions, it also specifies that the pre-processed data should be returned as PyTorch tensors. 
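As a quick sanity check, you can inspect what the processor produced. This is just an illustrative snippet; the exact shapes depend on your image and text queries.

# Inspect the pre-processed inputs: input_ids and attention_mask for the
# text queries, and pixel_values for the resized image batch
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))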

Image Processing Parameters

Here, image_path points to a file on our computer, and PIL loads the image for the object detection task. This is an alternative to downloading the image from a URL with requests and then opening it with PIL. 

There are some common image processing parameters for the OWL-ViT model, and we will briefly look at a few of them here: 

  • pixel_values: This parameter holds the raw image data for one or more images. It comes as a torch tensor of shape (batch_size, num_channels, height, width), with values typically scaled to a fixed range (e.g., 0 to 1 or -1 to 1).
  • query_pixel_values: Instead of raw data for the target images, this parameter lets you provide pixel data for query images that the model should try to find within other target images (used for image-guided detection).
  • output_attentions: This parameter tells the model to return attention weights across tokens or image patches. The attention tensors help visualize which parts of the input the model focused on, which in this case corresponds to the detected object.  
  • return_dict: This parameter controls how the model returns its outputs. When set to True, the outputs come back as a structured object whose fields are easy to access (see the sketch after this list). 
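The sketch below shows how a couple of these parameters might be passed explicitly when calling the model. The printed attributes (logits and pred_boxes) follow the Hugging Face implementation, but treat this as an assumption rather than a guaranteed interface, since attribute names can vary between transformers versions.

# Illustrative only: pass optional parameters explicitly
outputs = model(**inputs, output_attentions=True, return_dict=True)

print(outputs.logits.shape)      # per-box class logits for each text query
print(outputs.pred_boxes.shape)  # predicted boxes in normalized coordinates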

Processing Text and Image Inputs for Object Detection

The texts variable holds the list of candidate classes: "a photo of a cat" and "a photo of a dog." The processor then prepares the text and the image so they are suitable as input for the model. The output contains information about the detected objects in the image, in this case a confidence score for each candidate, along with bounding boxes that identify each object's location in the image.

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

This code rescales the predicted bounding boxes to the original image dimensions and converts the raw model outputs to a COCO-style format. The result is a structured output of detected objects, each with its bounding box, confidence score, and class label, suitable for evaluation or further application use. 

Here is a simple breakdown: 

target_sizes = torch.Tensor: This code defines the target image sizes in (height, width) format. It reverses the original image’s (width, height) dimensions and stores them as a PyTorch tensor.

Additionally, the code uses the processor’s ‘post_process_object_detection’ method to convert the model’s raw output into bounding boxes and class labels. 
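As a small illustration, the snippet below shows why the image size is reversed and how the threshold affects the number of detections that survive post-processing. The threshold value here is just an example.

print(image.size)    # PIL reports (width, height)
print(target_sizes)  # tensor([[height, width]]), the order the post-processor expects

# A stricter threshold keeps only high-confidence detections (0.5 is an example value)
strict_results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.5, target_sizes=target_sizes
)
print(len(strict_results[0]["boxes"]), "boxes remain at threshold 0.5")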

Image-Text Match

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

Here, we retrieve the detection results for the first image by reading out the bounding boxes, confidence scores, and labels that correspond to the text queries. The full code for this walkthrough is available in this notebook.

Finally, we get a summary of the results after completing the object detection task. We can run this with the code shown below;

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")

Real-Life Application of Owl ViT Object Detection Model

Many tasks these days involve computer vision and object detection. Owl ViT can come in handy for each of the following applications: 

  • Image search is one of the most obvious ways to use this model. Because it can match text with images, users would only need to enter a text prompt to search for images. 
  • Object detection can also find useful applications in robotics to identify objects in their environment. 
  • Users with vision loss can also find this tool valuable as this model can describe image content based on their text queries. 

Conclusion

Computer vision models are remarkably versatile, and Owl ViT is no exception. Thanks to its zero-shot capabilities, you can use it without extensive task-specific training. The model's strength lies in combining CLIP with a vision transformer architecture for image-text matching, which makes it straightforward to explore.  


Key Takeaways

  • Zero-shot object detection is the game-changer in this model's architecture. It allows the model to work with images without previous knowledge of the image classes. Text queries identify the objects, avoiding the need for large amounts of class-specific pre-training data. 
  • This model’s ability to match text-image pairs lets it identify objects using textual descriptions and bounding boxes in real time.
  • Owl ViT’s capabilities extend to real-life applications like image search, robotics, and assistive technology for visually impaired users, highlighting the model’s versatile computer vision applications.

Frequently Asked Questions

Q1. What is zero-shot object detection in Owl ViT?

A. Zero-shot object detection allows Owl ViT to identify objects just by matching textual descriptions of the images, even if it has not been trained on that specific class. This concept enables the model to detect new objects based on text prompts alone.

Q2. How does Owl ViT use text-image matching?

A. Owl ViT leverages a vision transformer architecture with CLIP, which matches images to text descriptions using contrastive learning. This phenomenon allows it to recognize objects based on text queries without prior knowledge of specific object classes.

Q3. What are some real-world applications of Owl ViT? 

A. Owl ViT can find useful applications in image search, robotics, and assistive technology for users with impaired vision, since the model can describe or locate objects based on text input. 

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, web development, and the AI world. David is also an enthusiast of data science and AI innovations.
