OpenAI’s Future of Vision: Contrastive Language Image Pre-training (CLIP)


2021 has begun with a bang! OpenAI has released two major innovations in the field of Computer Vision: CLIP and DALL-E.

The CLIP network takes a really interesting and possibly game-changing approach to image classification: it uses contrastive pre-training to perform zero-shot learning, similar to GPT-3.

CLIP lets us design our own classifiers: it removes the need for task-specific training data, yet still achieves state-of-the-art results across a wide range of computer vision tasks.

Before understanding how CLIP works let’s see what OpenAI is aiming to solve.

We have seen major improvements in computer vision across a multitude of problems, but current approaches each come with their drawbacks, such as:

  • Many current vision models, such as ResNet and InceptionNet, achieve human-level performance on complex image classification datasets, but they rely on the availability of large labeled datasets, which are difficult and costly to create.
  • Even though current state-of-the-art models perform extremely well on benchmarks like ImageNet, their accuracy drops drastically on dataset variants or out-of-distribution data: they have been optimized for the benchmark and fail to perform in real-life scenarios.

OpenAI aims to solve both of these problems, large datasets and poor real-life performance, with CLIP. CLIP has proven to give state-of-the-art results not only on image classification but also on other vision tasks such as object classification, action recognition in videos, and OCR. This shows that a single algorithm like CLIP can work across a variety of tasks and datasets without the need to build huge labeled datasets, though it is computationally expensive to train.

CLIP also plays a vital role in the working of DALL-E, so make sure to read all about DALL-E in the upcoming blog!!


Overview of The Algorithm

The team at OpenAI has incorporated several state-of-the-art techniques, including zero-shot transfer, natural language supervision, and multimodal learning. Let's begin with a high-level overview of how CLIP works.

It starts with a batch of image and text pairs, which can be easily collected from the internet. The texts and images are passed into a text encoder and an image encoder respectively, and the model is trained so that each image's embedding is most similar to the embedding of its corresponding text out of the entire batch. This alignment of images and text is the contrastive pre-training approach; a similar approach was implemented in the ConVIRT paper in the field of medical imaging.
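The contrastive objective described above can be sketched in a few lines. Assuming we already have a batch of N image embeddings and N text embeddings whose rows are matched pairs (the encoders themselves are omitted), the loss is a symmetric cross-entropy over the N x N cosine-similarity matrix. This mirrors the pseudocode in the CLIP paper, though the function name and temperature value here are illustrative:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry [i, j] compares image i with text j
    logits = img @ txt.T / temperature

    # Matched pairs sit on the diagonal, so the target "class" of row i is i
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pulls each image embedding toward its own caption's embedding while pushing it away from the other N-1 captions in the batch.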

Once the encoders have been trained to match images and texts, zero-shot prediction can be performed. All the class names in a dataset are arranged in a template like 'a photo of a {classname}' and fed into the text encoder. As in contrastive pre-training, the image is passed to the image encoder and a similarity search determines which text best matches the image out of the entire batch, i.e., the text encoder produces embeddings for 'a photo of a dog', 'a photo of a car', etc., and CLIP estimates the best-matching text for a given image. For example, in OpenAI's demonstrations the class guacamole ranked 1 out of 101 classes (the Food101 dataset) and television ranked 1 out of 397 (the SUN397 scene dataset).
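To make the zero-shot step concrete, here is a runnable sketch. The real text and image encoders are replaced with a deterministic hash-based stand-in (`encode_text` below is a toy, not CLIP's actual API), but the prompt templating and similarity search work exactly as described:

```python
import zlib
import numpy as np

# Toy stand-in for CLIP's encoders. The real model uses a transformer
# text encoder and a ViT/ResNet image encoder mapping into a shared
# embedding space; a deterministic hash fakes that space here so the
# zero-shot mechanics run end to end.
def encode_text(s, dim=64):
    rng = np.random.default_rng(zlib.crc32(s.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def zero_shot_predict(img_emb, text_emb, classes):
    # Cosine similarity against every class prompt; highest wins
    sims = text_emb @ img_emb
    return classes[int(np.argmax(sims))]

classes = ["dog", "car", "guacamole"]
prompts = [f"a photo of a {c}" for c in classes]
text_emb = np.stack([encode_text(p) for p in prompts])

# Pretend the image encoder mapped a photo close to the guacamole prompt
img_emb = encode_text("a photo of a guacamole")
img_emb = img_emb + 0.05 * np.random.default_rng(1).normal(size=64)
img_emb = img_emb / np.linalg.norm(img_emb)
```

With the real model, `encode_text` and the image embedding would come from CLIP's two encoders; everything after that, building prompts, stacking their embeddings, and taking the argmax of the similarities, is the whole zero-shot classifier.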

We have already seen that zero-shot approaches like this (e.g., GPT-3) are computationally expensive. CLIP uses a few different techniques to tackle this. The first, already discussed, is the contrastive pre-training objective, which requires significantly less computation than predictive objectives. Second, it uses a Vision Transformer, which further improved efficiency over standard vision models like ResNet.

This zero-shot learning approach coupled with natural language supervision is what differentiates CLIP from other vision models. By training on a wide variety of data that is easily accessible on the internet, without directly optimizing for any benchmark, CLIP is much more generalized and representative.

In OpenAI's efficiency comparison, CLIP reached the language-model baseline's accuracy after processing roughly 33M training examples instead of 400M, making it about 12 times more efficient!
As a result of this methodology, CLIP can easily be applied to nearly any visual classification task and achieve great performance.

Now you can see that the team at OpenAI has solved a lot of the problems of current vision models. CLIP removes the need for the labor-intensive large labeled datasets that SOTA computer vision tasks usually require by learning from text-image pairs that are already publicly available, and it also reduces the need to focus on a limited number of visual concepts.

Did you know the ImageNet dataset required 25,000 workers to annotate 14 million images for 22,000 object categories? That’s a lot of work!

Imagine using a pre-trained ImageNet model on a specific dataset of your choice: you would need to build a labeled dataset from scratch and fine-tune the model. All CLIP requires is that you pass the names of your task's visual concepts into the text encoder, and it will output a linear classifier over the visual representations.

One thing to note is that zero-shot CLIP can match the performance of SOTA vision models on datasets like ImageNet. OpenAI also tested fitting a linear classifier on top of CLIP's features, which boosts its ImageNet accuracy by nearly 10%; however, that fitted classifier fails to generalize as well to other variants of ImageNet.
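The linear-classifier evaluation mentioned above is often called a "linear probe": a logistic-regression classifier fitted on frozen features. OpenAI used scikit-learn's LogisticRegression for this; the sketch below is a minimal gradient-descent version, run here on synthetic stand-in features rather than real CLIP embeddings:

```python
import numpy as np

def linear_probe(features, labels, n_classes, lr=0.5, steps=500):
    """Fit a softmax (multinomial logistic regression) classifier on
    frozen features. Returns a (D, n_classes) weight matrix."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss w.r.t. W
        W -= lr * features.T @ (probs - onehot) / n
    return W

# Synthetic "frozen features": two well-separated clusters standing in
# for the embeddings a frozen CLIP image encoder would produce
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, size=(50, 8)),
               rng.normal(+1.0, 0.3, size=(50, 8))])
y = np.array([0] * 50 + [1] * 50)
W = linear_probe(X, y, n_classes=2)
acc = (np.argmax(X @ W, axis=1) == y).mean()
```

The encoder stays frozen throughout; only the small weight matrix `W` is trained, which is why this probe is cheap to fit but, as noted above, inherits none of zero-shot CLIP's robustness to distribution shift.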



Limitations of CLIP

CLIP currently has many limitations. Significant work is still needed to improve its task learning and transfer capabilities. While scaling has so far steadily improved performance, a very large amount of compute would be required for zero-shot CLIP to reach overall state-of-the-art performance, which is infeasible with current hardware.

Compared to task-specific models, CLIP shows poor performance on several fine-grained classification problems, such as differentiating models of cars or species of flowers. Another limitation is that CLIP still generalizes poorly to data that is truly out-of-distribution for it. Taking MNIST as an example, a simple logistic regression baseline outperforms zero-shot CLIP on handwritten digits. Training on a large and varied dataset is meant to make most data effectively in-distribution, but as MNIST demonstrates, this assumption is easy to violate.

As we have seen, CLIP can flexibly generate zero-shot classifiers for a wide variety of datasets, but it is still limited to choosing among the concepts supplied to a given zero-shot classifier. Approaches like image captioning, by contrast, can generate entirely novel outputs.



Conclusion

Kudos to the research team at OpenAI! CLIP introduces a really interesting and flexible approach to tackling computer vision problems. Not only does it overcome problems faced by many of today's vision models and approaches, it does so with flying colors. Its ability to tackle almost any vision problem and still produce amazing results is no small feat. OpenAI has released both the research paper and the code, so feel free to check them out!

We have seen what CLIP is able to achieve, and it blew our minds. But that's not the end of it: the release of DALL-E further introduces us to a new era of artificial intelligence, and CLIP plays a vital role in it. Stay tuned for our blog post on the biggest breakthrough in computer vision in recent years, DALL-E!!

Let us know your thoughts on CLIP in the comment section below.
