Stable Diffusion 3: Guide to the Latest Text-to-Image Model by Stability AI

Shikha Sen 12 Jul, 2024

Introduction

Stability AI created the Stable Diffusion model, one of the most sophisticated text-to-image generation systems. It is built on diffusion models, a class of generative models that produce high-quality images from textual descriptions by iteratively refining noisy images.


Overview

  • Stable Diffusion 3 leverages an advanced Multimodal Diffusion Transformer (MMDiT) architecture for creating high-resolution images from textual prompts.
  • Featuring up to 8 billion parameters, Stable Diffusion 3 offers a 72% improvement in quality metrics and efficiently generates 2048×2048 resolution images.
  • Stable Diffusion 3 integrates text and image inputs and utilizes separate weights for text and image embeddings to enhance understanding and image clarity.
  • Built on the DiT framework, Stable Diffusion 3 employs modulated attention layers and MLPs to improve text-conditional image generation.
  • Accessible via Hugging Face Diffusers or local GPU setups, Stable Diffusion 3 supports diverse creative applications with customizable prompts and optimizations.

What is the Stable Diffusion Model?

Stable Diffusion is a deep learning model designed to produce images from textual descriptions. Guided by the input text, the model gradually converts random noise into a coherent image through a process known as diffusion. This approach allows it to generate highly detailed and diverse images that align closely with the provided text prompts.

Key Components and Architecture

Here are the components and architecture of the Stable Diffusion Model:

  • Diffusion Process: It starts with a noisy image and progressively denoises it to match the textual description. This ensures the final image is high-quality and faithful to the input text.
  • Forward and Reverse Diffusion Process:
    • In the forward diffusion process, Gaussian noise is progressively added to an image until it becomes completely random and unrecognizable. This noising is applied to all images during training. Outside training, forward diffusion is used only in tasks such as image-to-image conversion.
    • Reverse diffusion is a parameterized process that iteratively removes the noise added during forward diffusion. For instance, if trained on only two images, such as a cat and a dog, the reverse process would generate images resembling either a cat or a dog without intermediate forms. In practice, the model is trained on billions of images and utilizes prompts to generate unique images.
  • Autoencoder: Stable Diffusion 1 uses a downsampling-factor-8 autoencoder to compress and decompress image representations efficiently.
  • UNet: In the first version of the architecture, an 860-million-parameter UNet predicts the noise to remove at each step of the diffusion process, guided by the input text.
  • Text Encoder: The CLIP ViT-L/14 text encoder translates textual descriptions into a format usable by the image generation process.
  • OpenCLIP: This was introduced in Stable Diffusion 2 to enhance the model’s ability to interpret and generate images based on text.
  • Training and Datasets: It is trained on large, diverse datasets to generate various images.
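The forward process described above can be sketched in a few lines. This is a toy illustration, assuming the standard closed-form noising step from DDPM-style diffusion; `alpha_bar_t` (the fraction of signal kept at step t) and the four-value "image" are hypothetical, and real models operate on latent tensors rather than short lists.

```python
import math
import random

def forward_diffuse(x0, alpha_bar_t, rng):
    """Return a noised copy of x0: sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
pixels = [0.8, -0.2, 0.5, 0.1]                      # toy "image" with four values
slightly_noisy = forward_diffuse(pixels, 0.99, rng)  # early step: mostly signal
mostly_noise = forward_diffuse(pixels, 0.01, rng)    # late step: mostly noise
unchanged = forward_diffuse(pixels, 1.0, rng)        # alpha_bar = 1 adds no noise
```

Reverse diffusion is the learned counterpart: a network is trained to undo this corruption step by step, conditioned on the text prompt.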

Evolution of Stable Diffusion: Version Progression

Stable Diffusion 1 and 2

The progression from Stable Diffusion 1 to Stable Diffusion 2 brought significant enhancements in text-to-image generation. Stable Diffusion 1 used a downsampling-factor-8 autoencoder with an 860-million-parameter (860M) UNet and a CLIP ViT-L/14 text encoder. Pretrained on 256×256 images and later fine-tuned on 512×512 images, it reshaped open-source AI by inspiring hundreds of derivative models, and its rapid rise to over 33,000 GitHub stars underscores its impact.

Stable Diffusion 2.0 introduced text-to-image models trained with OpenCLIP on a filtered subset of the LAION-5B dataset, with default resolutions of 512×512 and 768×768 pixels. This version also included an Upscaler Diffusion model that increases image resolution by a factor of four, allowing outputs of up to 2048×2048 pixels.

Despite these advancements, Stable Diffusion 2 struggled with consistency, realistic human depictions, and accurate rendering of text within images. These limitations prompted the development of Stable Diffusion 3, which addresses them and is reported to outperform state-of-the-art systems like DALL·E 3, Midjourney v6, and Ideogram v1 in typography and prompt adherence.

Stable Diffusion 3

Stable Diffusion 3 is a significant upgrade over v2: it replaces the U-Net with a diffusion transformer architecture. This improves scalability, supporting models with up to 8 billion parameters and multimodal inputs. The maximum output side length grows from 768 pixels in v2 to 2048 pixels in v3 (about 167% longer per side, or roughly seven times as many pixels), and the parameter count quadruples from 2 billion to 8 billion. These changes are reported to yield an 81% reduction in image distortion and a 72% improvement in quality metrics, along with better object consistency and a 96% improvement in text clarity. Stable Diffusion 3 is also reported to outperform systems like DALL·E 3, Midjourney v6, and Ideogram v1 in typography, prompt adherence, and visual aesthetics. Its Multimodal Diffusion Transformer (MMDiT) architecture improves text understanding, enabling nuanced interpretation of complex prompts, and even the largest version generates high-resolution images quickly.

Key Features of Stable Diffusion 3

Stable Diffusion 3 employs the new Multimodal Diffusion Transformer (MMDiT) architecture, which uses separate weights for image and language representations to improve text understanding and spelling. In human preference evaluations, Stable Diffusion 3 matched or exceeded other models in prompt adherence, typography, and visual aesthetics. In early tests, the largest SD3 model, with 8 billion parameters, generated a 1024×1024 image in 34 seconds on an RTX 4090. The release includes models ranging from 800 million to 8 billion parameters, reducing hardware barriers and improving accessibility and performance.

How Does Stable Diffusion 3 Enhance Multimodal Generation of Text and Image?

Stable Diffusion 3 integrates textual and visual inputs for text-to-image generation, and the name of its new architecture, MMDiT, reflects this ability to handle multiple modalities. As in previous versions of Stable Diffusion, pretrained models extract suitable representations of both text and images. Specifically, the text is encoded with three different text embedders (two CLIP models and T5), and image tokens are encoded with an improved autoencoding model.

Because text and image embeddings differ fundamentally, the method uses a separate set of weights for each modality. This is similar to having two transformers, one for images and one for text. The sequences from both modalities are joined for the attention operation, so each representation can work in its own space while taking the other modality into account.

The Architecture of Stable Diffusion 3

Here is the architecture of Stable Diffusion 3:

Text-Conditional Sampling Architecture

The model combines text and image data for text-conditional image generation. It follows the LDM (latent diffusion model) framework, which trains text-to-image models in the latent space of a pretrained autoencoder, and it leverages pretrained models to create suitable representations. Just as images are encoded into latent representations, the text conditioning is encoded using pretrained, frozen text models.

The architecture builds on DiT (Diffusion Transformer), which originally targeted class-conditional image generation and uses a modulation mechanism to condition the network on the diffusion timestep and the class label. In Stable Diffusion 3, the modulation mechanism is instead fed embeddings of the timestep and a pooled text conditioning vector. Because the pooled text representation retains only coarse information about the input, the network also needs the full sequence of text representations.

Both text and image inputs are embedded to create a sequence. This entails flattening 2 × 2 patches of the latent pixel representation into a patch encoding sequence and adding positional encodings. Once the text encoding and this patch encoding are embedded in a common dimensionality, the two sequences are concatenated. A sequence of modulated attention layers and MLPs is used following the DiT methodology.
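The patching step can be illustrated with a small sketch. It assumes a latent stored as nested lists in [channel][row][column] order; the function name `patchify` and the toy 2-channel, 4×4 latent are hypothetical, and the projection to a common dimensionality and the positional encodings are left out.

```python
def patchify(latent, patch=2):
    """Flatten non-overlapping patch x patch blocks of a [C][H][W] latent into tokens."""
    channels = len(latent)
    height, width = len(latent[0]), len(latent[0][0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # One token gathers the patch values from every channel.
            token = [
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ]
            tokens.append(token)
    return tokens

# Toy latent: 2 channels on a 4x4 grid, filled with distinct values.
latent = [[[c * 100 + y * 4 + x for x in range(4)] for y in range(4)] for c in range(2)]
image_tokens = patchify(latent)   # 4 tokens, each with 2 * 2 * 2 = 8 values
```

Under the same scheme, and assuming a downsampling-factor-8 autoencoder, a 1024×1024 image becomes a 128×128 latent and therefore 64 × 64 = 4096 image tokens before the text tokens are concatenated.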

Because text and image embeddings are conceptually quite different, separate weights are used for the two modalities. This is equivalent to having two independent transformers, one per modality, whose sequences are joined for the attention operation. Both representations can therefore work in their own spaces while still taking each other into account.
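A minimal single-head sketch of this idea follows. It is not the MMDiT implementation: the helper names, the dictionary layout of the weights, and the tiny 2-dimensional tokens are assumptions made for illustration, and modulation, multiple heads, and the MLPs are omitted. Each modality is projected with its own weights, then attention runs over the concatenated sequence.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(text, image, text_w, image_w):
    """Attention over the joined text + image sequence with per-modality weights."""
    # Separate projections per modality; list "+" concatenates the two sequences.
    q = matmul(text, text_w["q"]) + matmul(image, image_w["q"])
    k = matmul(text, text_w["k"]) + matmul(image, image_w["k"])
    v = matmul(text, text_w["v"]) + matmul(image, image_w["v"])
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    mixed = matmul(weights, v)
    # Split the joint sequence back into the text and image streams.
    return mixed[: len(text)], mixed[len(text):]

identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text = [[1.0, 0.0]]                  # one text token
image = [[0.0, 1.0], [1.0, 1.0]]     # two image tokens
text_w = {"q": identity, "k": identity, "v": identity}
image_w = {"q": doubled, "k": doubled, "v": doubled}
text_out, image_out = joint_attention(text, image, text_w, image_w)
```

Every output token is a weighted mix of values from both modalities, which is how the text stream can influence the image stream and vice versa.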

For scaling, the authors parameterize the model size by its depth d, defined as the number of attention blocks. The hidden size is 64 × d, expanding to 4 × 64 × d in the MLP blocks, and the number of attention heads equals d.
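The scaling rule is simple arithmetic, shown here as a small helper. The function name `mmdit_dims` and the example depth are hypothetical; only the three relationships (hidden size, MLP size, head count) come from the description above.

```python
def mmdit_dims(depth):
    """Derive layer sizes from depth using the rule described above."""
    hidden_size = 64 * depth
    return {
        "depth": depth,
        "hidden_size": hidden_size,
        "mlp_size": 4 * hidden_size,
        "num_heads": depth,
        "head_dim": hidden_size // depth,   # always 64 under this rule
    }

example = mmdit_dims(24)   # an arbitrary example depth
```

Because heads and hidden size grow together, each attention head keeps a fixed width of 64 while the model gets deeper and wider at the same time.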

[Figure: Stable Diffusion 3 architecture]

The Research

The accompanying research paper, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, describes the features, components, and experimental results in depth.

The study focuses on improving generative diffusion models, which turn noise into perceptual data such as images and videos by reversing a data-to-noise path. A newer formulation, rectified flow, simplifies this process by connecting data and noise in a straight line. However, it has not yet been widely adopted because its effectiveness in practice was uncertain. The researchers improve the noise sampling techniques used to train rectified flow models by biasing them toward perceptually relevant scales. In a large-scale study, they show that this approach outperforms established diffusion formulations for high-resolution text-to-image synthesis.
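To make the straight-line idea concrete, here is a minimal sketch. It assumes the usual rectified flow convention in which t = 0 is data and t = 1 is noise, and it uses a logit-normal timestep sampler as one example of a density that favors intermediate noise levels; the function names, default parameters, and toy vectors are illustrative only.

```python
import math
import random

def rectified_flow_sample(x0, noise, t):
    """Point on the straight line from data (t = 0) to noise (t = 1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(x0, noise)]

def velocity_target(x0, noise):
    """Constant velocity along that line, which the network learns to predict."""
    return [n - d for d, n in zip(x0, noise)]

def logit_normal_timestep(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
data = [1.0, 2.0]
noise = [3.0, 5.0]
t = logit_normal_timestep(rng)
noisy = rectified_flow_sample(data, noise, t)   # training input at time t
target = velocity_target(data, noise)           # regression target
```

With mean 0, the sampled timesteps cluster around t = 0.5, so training spends more effort on the middle of the path than on the nearly clean or nearly pure-noise ends.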

Additionally, they introduce a transformer-based architecture tailored for text-to-image generation, optimizing bidirectional information flow between image and text representations. Their findings show consistent improvements in text comprehension, typography, and human preference ratings, with their largest models surpassing current benchmarks. They plan to release their experimental data, code, and model weights for public use.

You can interact with the Stable Diffusion 3 model through the user interface provided by Stability AI or programmatically via its API. This article also outlines the steps and includes code examples for using the model programmatically.

You can experiment with Stable Diffusion 3 prompts yourself. Below is an example of a picture generated from a prompt.

Examples of Picture Generated Using Prompt

Prompt: A lion holding a sign saying “we are burning”. Behind the lion, the forest is burning, and birds are burning halfway and trying to fly away while the elephant in the background is trying to spray water to cut the fire out. Snakes are burning, and helicopters are seen in the sky


In the advanced settings, you can also supply a negative prompt that describes what the image should avoid. Here, the negative prompt is: a blurred and low-resolution image.

Effect of Negative Prompting

Applying the negative prompt shifts the focus toward enhancing the image’s quality and resolution.
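One common way a negative prompt takes effect is classifier-free guidance: the model makes one prediction conditioned on the prompt and one conditioned on the negative prompt, then extrapolates away from the negative one. The sketch below assumes that mechanism (the hosted interface may work differently), and the function name and toy vectors are hypothetical.

```python
def guided_prediction(negative_pred, positive_pred, guidance_scale):
    """Push the prediction away from the negative prompt and toward the prompt."""
    return [
        n + guidance_scale * (p - n)
        for n, p in zip(negative_pred, positive_pred)
    ]

negative_pred = [0.0, 1.0]   # toy prediction under the negative prompt
positive_pred = [1.0, 1.0]   # toy prediction under the main prompt
guided = guided_prediction(negative_pred, positive_pred, guidance_scale=7.0)
```

Where the two predictions agree, nothing changes; where they differ, the difference is amplified by the guidance scale, which is why a higher scale follows the prompt more strictly.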


Here are other images generated using Stable Diffusion 3:

Prompt: A vividly colored, incredibly detailed HD picture of a Renaissance fair with a steampunk twist. In an ornate scene that combines contemporary technology with finely constructed medieval castles, Victorian-dressed people mix with knights in shining armor.


Prompt: A colorful, high-definition picture of a kitchen where cooking tools are animated and ingredients float in midair while they prepare food independently. The sight is warm and inviting with sunlight pouring through the windows and creating a golden glow over the colorful surroundings.


Prompt: A high-definition, vibrant image of a post-apocalyptic wasteland. Ruined buildings and abandoned vehicles are overrun by nature. A lone survivor, dressed in makeshift armor, stands in the foreground holding a hand-painted sign board that says ‘SURVIVOR.’ Nearby, a group of scavengers sifts through the debris. In the background, A child with a toy sits beside an older sibling near a small fire pit.


Prompt: A woman with an oval face and a wheatish complexion. Her lips are slightly smaller than her sharp, thin nose. She has pretty eyes with long lashes. She has a cheeky smile and freckles.


Now, let’s see how to use Stable Diffusion 3 from Python and how to run the model on a local system.

Getting Started with Stable Diffusion 3

There are two primary methods to utilize Stable Diffusion 3: through the Hugging Face Diffusers library or by setting it up locally with GPU support. Let’s explore both approaches.

Method 1: Using Hugging Face Diffusers

This method is straightforward and ideal for those who want to experiment with Stable Diffusion 3 quickly.

Step 1: Hugging Face Authentication

Before downloading the model, you need to authenticate with Hugging Face. Create a Hugging Face account, accept the license on the model page (the Stable Diffusion 3 Medium repository is gated), and generate an access token.

  1. Go to https://huggingface.co/ and create an account or log in.
  2. Navigate to your profile settings and create a new access token.
  3. Use the following code to log in with your token:
from huggingface_hub import login

login(token="your_huggingface_token_here")

Replace “your_huggingface_token_here” with your actual token.
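Hardcoding a token in a script makes it easy to leak. As a hedged alternative, the sketch below reads the token from an environment variable; the helper name `read_hf_token` is hypothetical, and `HF_TOKEN` is assumed to be the variable you export in your shell.

```python
import os

def read_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the access token from the environment, or fail with a clear message."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Hugging Face access token."
        )
    return token
```

You would then call `login(token=read_hf_token())` in place of the hardcoded string, so the token never appears in the source file or in version control.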

Step 2: Installation

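Install the necessary libraries. The leading ! is for notebook cells (omit it in a terminal), and StableDiffusion3Pipeline requires diffusers 0.29 or later, so upgrade if needed:

!pip install -U diffusers transformers torch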

Step 3: Implementing the Model

Use the following Python code to generate an image:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the model
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", 
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate an image
prompt = "A futuristic cityscape with flying cars and holographic billboards, bathed in neon lights"
image = pipe(prompt, num_inference_steps=28, height=1024, width=1024).images[0]

# Save the image
image.save("sd3_futuristic_city.png")

Method 2: Local Setup with GPU

For those with access to powerful GPUs, setting up Stable Diffusion 3 locally can offer more control and potentially faster generation times.

Step 1: Prerequisites

Ensure you have a compatible GPU with sufficient VRAM (24GB+ recommended for optimal performance).

Step 2: Installation

Install the required libraries:

pip install diffusers transformers torch accelerate

Step 3: Implementation

Use the following code to generate an image locally:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the model in half precision
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
)
# Enable model CPU offloading for better memory management
pipe.enable_model_cpu_offload()

# Generate an image
prompt = "An underwater scene of a bioluminescent coral reef teeming with exotic fish and sea creatures"
image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

# Save the image
image.save("sd3_underwater_scene.png")

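This implementation uses model CPU offloading, which keeps components in system memory and moves each one to the GPU only while it is needed. This is particularly helpful for GPUs with limited VRAM.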
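If the same script has to run on machines with different GPUs, the choice between full GPU placement and offloading can be made at runtime. The sketch below is one possible approach, not part of the Diffusers API: the helper names are hypothetical, and the 24 GB threshold simply reuses the recommendation from the prerequisites.

```python
GB = 1000 ** 3   # decimal gigabytes, matching how GPU memory is marketed

def choose_placement(total_vram_bytes, full_gpu_min_gb=24):
    """Return "cuda" when the whole pipeline should fit, else "cpu_offload"."""
    if total_vram_bytes >= full_gpu_min_gb * GB:
        return "cuda"
    return "cpu_offload"

def apply_placement(pipe, placement):
    """Place a Diffusers pipeline according to the chosen strategy."""
    if placement == "cuda":
        pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    return pipe
```

With PyTorch installed, the total memory of the first GPU is available as `torch.cuda.get_device_properties(0).total_memory`, so the call becomes `apply_placement(pipe, choose_placement(torch.cuda.get_device_properties(0).total_memory))`.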

Advanced Techniques and Optimizations

As you become more familiar with Stable Diffusion 3, you may want to explore advanced techniques to enhance performance and efficiency.

Memory Optimizations

Dropping the T5 Text Encoder

For scenarios where memory is at a premium, you can drop the memory-intensive T5-XXL text encoder, which may cost some prompt adherence:

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16
)

Quantized T5 Text Encoder

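Alternatively, load the T5 text encoder in 8-bit to balance quality and memory usage. This requires the bitsandbytes package (pip install bitsandbytes):

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig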

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

text_encoder = T5EncoderModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world-8bit-T5.png")
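A back-of-the-envelope estimate shows why the T5 encoder is the main target of both memory optimizations. The parameter count below (about 4.7 billion for the T5-XXL encoder) is an assumed, approximate figure, and the calculation covers weights only, ignoring activations and framework overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

T5_XXL_ENCODER_PARAMS = 4.7e9   # assumed approximate parameter count

fp16_gb = weight_memory_gb(T5_XXL_ENCODER_PARAMS, 16)   # half precision
int8_gb = weight_memory_gb(T5_XXL_ENCODER_PARAMS, 8)    # 8-bit quantized
```

Under these assumptions the encoder needs roughly 9.4 GB in float16 and about half of that in 8-bit, while dropping it removes the cost entirely.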

Performance Optimizations

Using torch.compile

Accelerate inference by compiling the Transformer and VAE components:

import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# Warm-up run: the first call triggers compilation and is much slower than later calls
_ = pipe("A warm-up prompt", generator=torch.manual_seed(0))
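To check whether compilation pays off on your hardware, it helps to time the first call separately from the calls that follow. The helpers below are a generic sketch using only the standard library; `time_calls` and `summarize` are hypothetical names, and `fn` stands for any zero-argument callable that runs the pipeline.

```python
import time

def time_calls(fn, runs=3):
    """Call fn `runs` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Separate the first (compile-heavy) call from the steady-state calls."""
    first, rest = durations[0], durations[1:]
    steady = sum(rest) / len(rest) if rest else None
    return {"first_call_s": first, "steady_state_s": steady}
```

For example, `summarize(time_calls(lambda: pipe("A warm-up prompt", generator=torch.manual_seed(0))))` reports the compilation-heavy first run next to the average of the later runs, which is the number to compare against an uncompiled pipeline.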

Tiny AutoEncoder (TAESD3)

For faster decoding, replace the VAE with the Tiny AutoEncoder:

import torch
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

Conclusion

Stable Diffusion 3 represents a significant advancement in AI-powered image generation. Whether you’re a developer, artist, or enthusiast, its improved capabilities in text understanding, image quality, and performance open up new possibilities for creative expression.

By leveraging the methods and optimizations discussed in this article, you can tailor Stable Diffusion 3 to your specific needs, whether working with cloud-based solutions or local GPU setups. As you experiment with different prompts and settings, you’ll discover the full potential of this powerful tool in bringing your imaginative concepts to life.

AI-generated imagery is evolving rapidly, and Stable Diffusion 3 stands at the forefront of this revolution. As we continue to push the boundaries of what’s possible, we can only imagine the creative horizons that future iterations will unveil. So, dive in, experiment, and let your imagination soar with Stable Diffusion 3!

Ready to transform your creative workflow? Start exploring Stable Diffusion 3 and unlock the next level of AI-generated imagery today!

Frequently Asked Questions

Q1. What is the Stable Diffusion model?

A. Stable Diffusion is a text-to-image generation system by Stability AI that produces high-quality images from text descriptions using diffusion.

Q2. How does the diffusion process work?

A. The diffusion process involves adding noise to an image (forward diffusion) and then iteratively removing this noise (reverse diffusion) guided by input text, to generate a clear and accurate image.

Q3. What are the key components of Stable Diffusion?

A. Here are the components of Stable Diffusion:
a. Autoencoder: Compresses and decompresses image representations.
b. UNet: Predicts the noise to remove at each step; it has 860 million parameters in Stable Diffusion 1.
c. Text Encoder: Translates text into a format usable for image generation, initially using CLIP ViT-L/14 and later OpenCLIP for better interpretation.

Q4. How can I use Stable Diffusion 3 to generate images?

A. You can use Stable Diffusion 3 through Stability AI’s interface or programmatically via the Hugging Face Diffusers library with Python, allowing for efficient text-to-image generation on cloud or local GPU setups.
