OpenAI’s Future of Vision with DALL-E: Creating Images from Text

Last Updated : 13 Jan, 2021

6 min read

Introduction

Source: Official blog post by OpenAI on DALL-E

A major goal of Artificial Intelligence in recent years has been creating multimodal systems, i.e, systems that can learn concepts in multiple domains. We have seen one such system CLIP in the previous `blog post.

In this article, we will look at possibly one of the biggest breakthroughs in Computer Vision in recent years: the DALL-E named after the artist Salvador Dalí and Pixar’s WALL-E.

DALL-E is a neural network that can successfully turn text into an appropriate image for a wide range of concepts expressible in natural language. We have already seen the implementation and success of the GPT-3 which was a major breakthrough in NLP. A similar approach has been used for this network. You could almost call this a GPT-3 for images.

DALL-E has 12 billion parameters trained to generate images from textual descriptions. Similar to CLIP, it also uses text-image pairs to form its dataset. We have seen in GPT-3 which is a 175 billion parameter network capable of various text generation tasks. DALL-E has now shown to extend that knowledge to manipulation of vision tasks.

Simply put, you could pass any piece of text to the DALL-E model, real or imaginary and it will generate the images for us. We will take a look at some examples later on. DALL-E really does provides a lot of customization with the images it produces.

Be prepared to have your mind blown!!

Overview of The Algorithm

As of date, the real work behind DALL-E has not been published,i.e, the team at OpenAI has not released the research paper and code behind it. But I’ll walk you through what we know now. Also, Phil Wang (aka Lucidrains) has developed a replication of DALL-E which you can view here!

Taking its inspiration from GPT-3, DALL-E is also a transformer language model. Transformer models have been the state of the art networks for all things NLP in recent years and are now seeing much use in vision tasks. We have already seen a vision transformer being used in the CLIP network.

One thing to note is that DALL-E is a Decoder-only transformer, which means all the images and texts are passed together directly as a single stream of tokens into the decoder.

Source: Official blog post by OpenAI on DALL-E

What we know about DALL-E’s architecture is that it uses a similar approach to a VQVAE. Let’s understand how that works!

The above image is the working of a VQVAE, in which we can see that an image of a dog is passed into an encoder. This encoder creates a “latent space” which is simply a space of compressed image data in which similar data points are closer together.

Now the VQVAE has a batch of vectors which we can call vocabulary to simplify. The encoder must output a vector from its latent space which is present in our vocabulary. Meaning we provide the encoder with a set of choices. This vector is fed into the decoder which reconstructs the image of the dog.

So DALL-E uses a pre-trained VAE, meaning the encoder, decoder, and vocabulary are trained together. Let’s take an example to see DALL-E in action:-

Source: Official blog post by OpenAI on DALL-E

Now given a text prompt such as this, DALL-E is able to generate images. What DALL-E can do compared to other text to image synthesis networks is not only generate an image from scratch but also to regenerate any rectangular region of an existing image.

The team at OpenAI has mentioned that the network generates 512 images as output. Their latest network CLIP is used as a reranking algorithm to choose the best 32 for the given examples. What this means is that the CLIP uses its similarity search capabilities to select the best image-test pairs. We have seen how good CLIP was at that in the previous blog post.

WHAT IS DALL-E GOOD AND BAD AT?

DALL-E has shown a lot of capabilities with what it can do. Let’s analyze them!

We can see that DALL-E can create objects in polygonal shapes that may or may not occur in real life. DALL-E does not miss the mark on creating the correct object but does have a lower success rate for some shapes as seen below. We can see some images of the red lunch box are hexagons even though the text requests for pentagonal boxes.

Source: Official blog post by OpenAI on DALL-E

Counting is a major issue with DALL-E as of now. DALL-E is able to generate multiple objects but is unable to reliably count past two. While DALL·E does offer some level of controllability over the attributes and number of objects, the success rate can depend on how the caption is phrased. If the text requests nouns for which there are multiple meanings, such as “glasses,” “chips,” and “cups” it sometimes draws both interpretations, depending on the plural form that is used.

Source: Official blog post by OpenAI on DALL-E

DALL-E is robust to changes in the lighting, shadows, and environment based on the time of day or season. This enables it to have some capabilities of a 3D rendering engine. A major difference is that, while 3D engines require the inputs to be completely detailed, DALL-E can fill those blanks based on the caption provided. In fact, DALL-E is really good at Styles!

A few other capabilities of the DALL-E are:-

Visualizing internal and external capabilities, i.e, it is able to generate a cross-section of a mushroom for example. It also the ability to draw the fine-grained external details of several different kinds of objects as we see in the image below. These details are only apparent when the object is viewed up close.

Source: Official blog post by OpenAI on DALL-E

Combining unrelated elements, i.e DALL-E is able to describe both real and imaginary objects. It has the ability to take inspiration from an unrelated idea while respecting the form of the thing being designed, ideally producing an object that appears to be practically functional. It appears that when the text contains the phrases “in the shape of,” “in the form of,” and “in the style of” gives it the ability to do this.

Source: Official blog post by OpenAI on DALL-E

Geographic and Temporal knowledge, i.e, is sometimes capable of rendering semblances of certain locations and understanding simple geographical facts, such as country flags, cuisines, and local wildlife. Keep in mind most of these locations do not exist!
It also has learned about basic stereotypical trends in design and technology over the decades.

Source: Official blog post by OpenAI on DALL-E

ENDING NOTES

Can you believe that AlexNet was introduced 8 years ago and in such a short time we have had such revolutionary breakthroughs in AI?

With the recent breakthroughs in NLP with GPT-3, CLIP and DALL-E both introduce a whole new improvement in performance and generalization like never before. DALL-E is a prime example of extending NLP’s prowess to the visual domain.

The applications of such models could only have been thought of theoretically. We could see conversions of books into visual representations, have applications that generate photorealistic movies from a script, or new video games from a description. It’s only a matter of years at this point. One could also imagine animating or generating imagery of dreams using this technology.

From a social perspective, the team at OpenAI have plans to analyze how models like DALL-E could, in fact, relate to societal issues, given the long term ethical challenges of this technology. I truly believe the application of vision and language together is most definitely the future of deep learning.

Let us know what you think about the CLIP and DALL-E models in the comment sections below!

DALL-E

Advanced Computer Vision Deep Learning

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

OpenAI’s Future of Vision with DALL-E: Creating Images from Text

Introduction

Overview of The Algorithm

WHAT IS DALL-E GOOD AND BAD AT?

ENDING NOTES

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID

SAPISID

__Secure-#

APISID

SSID

HSID