
Guide to Image-to-Image Diffusion: A Hugging Face Pipeline

Mobarak 18 Jul, 2024
9 min read

Introduction

Stable diffusion models apply modern state-of-the-art generative techniques to produce images and audio. Stable Diffusion works by progressively modifying input data, guided by a text prompt, to generate new, creative output. In this article, we will see how to generate new images from a given input image using the depth-to-image (depth2img) diffusion pipeline on the PyTorch backend with a Hugging Face pipeline. We use Hugging Face because it provides an easy-to-use, pre-built pipeline for image generation with Stable Diffusion.

Learn More: Hugging Face Transformers Pipeline Functions

Learning Objectives

  • Understand the concept of Stable Diffusion and its application in generating images and audio using modern state-of-the-art techniques.
  • Gain knowledge of the key components and techniques involved in Stable Diffusion, such as latent diffusion models, denoising autoencoders, variational autoencoders, U-Net blocks, and text encoders.
  • Explore common applications of diffusion models, including text-to-image, text-to-videos, and text-to-3D conversions.
  • Learn how to set up the environment for Stable Diffusion, including utilizing GPU and installing necessary libraries and dependencies.
  • Develop practical skills in applying Stable Diffusion by loading and diffusing images, creating text prompts to guide the output, adjusting diffusion levels, and understanding the limitations and challenges associated with image generation using stable diffusion models.

This article was published as a part of the Data Science Blogathon.

What is Stable Diffusion?

Stable Diffusion models are latent diffusion models: they learn the latent structure of the input by modeling how the data attributes diffuse through the latent space. They belong to the family of deep generative neural networks. The diffusion is considered stable because the results are guided by original images, text, and other conditioning; an unguided, unstable diffusion would be unpredictable.

The Concepts of Stable Diffusion

Stable Diffusion is built on the latent diffusion model (LDM), a probabilistic generative model. These models are trained much like other deep learning models, but the objective is to learn to remove Gaussian noise — noise whose probability density function follows the normal distribution — that has been applied to the training images in successive steps. This is achieved through a sequence of denoising autoencoders (DAEs), which change the standard autoencoder's reconstruction criterion: a noising process is applied to the input, and the network learns to reverse it.
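To make the noising idea concrete, here is a toy sketch (all values and shapes are made up purely for illustration) of the forward step that corrupts a clean latent into a noisier one, which the denoising network is then trained to reverse:

import torch

# Toy illustration of the forward (noising) step: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise,
# where a_t is the cumulative noise-schedule value at timestep t (made-up value here).
x0 = torch.randn(1, 4, 64, 64)       # a clean latent "image"
noise = torch.randn_like(x0)         # Gaussian noise of the same shape
a_t = torch.tensor(0.5)              # cumulative schedule value at some timestep t
xt = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise  # noisier latent the model learns to denoise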


In more detail, Stable Diffusion consists of three essential parts. First is the variational autoencoder (VAE), an artificial neural network that acts as a probabilistic graphical model. Next is the U-Net block, a convolutional neural network (CNN) originally developed for image segmentation. Last is the text encoder, handled by a trained CLIP ViT-L/14 text encoder, which transforms the text prompts into an embedding space.


The VAE encoder compresses the image from pixel space into a smaller latent space, and the diffusion is carried out in that latent space. This keeps the essential details of the image while making the computation tractable; the VAE decoder then maps the result back into pixel space.
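As a rough illustration of this compression (separate from the pipeline we build below), the sketch assumes the publicly available stabilityai/sd-vae-ft-mse VAE checkpoint, encodes a dummy 512×512 image, and prints the latent shape, which is eight times smaller per side:

import torch
from diffusers import AutoencoderKL

# Sketch: encode a dummy 512x512 RGB image into the latent space and check its size.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
pixels = torch.randn(1, 3, 512, 512)                   # fake image in pixel space
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()  # compressed latent representation
    decoded = vae.decode(latents).sample               # mapped back to pixel space
print(latents.shape, decoded.shape)                    # (1, 4, 64, 64) and (1, 3, 512, 512)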

Common Applications of Diffusion

Let us quickly look at three common areas where diffusion models can be applied:

  • Text-to-Image: This approach takes no input image, only a text “prompt”, and generates related images.
  • Text-to-Video: Diffusion models can generate videos from text prompts. Current research uses this in media for feats like creating online ad videos, explainer videos, short animations, music videos, and more.
  • Text-to-3D: This approach converts input text into 3D images.

Applying diffusers can help you generate original images free of plagiarism concerns, providing content for your projects, materials, and even marketing brands. Instead of hiring a painter or photographer, you can generate your own images; instead of a voice-over artist, you can create your own audio. Now let’s look at image-to-image generation.


Also Read: Bring Doodles to Life: Meta Open-Sources AI Model

Setting Up Environment

This task involves processing images and graphics, so it requires a GPU and a good development environment. Make sure you have a GPU available if you want to follow along with this project. Google Colab is a convenient option since it provides a suitable environment and a free GPU. Follow the steps below to enable the GPU:

  1. Go to the Runtime menu in the top menu bar.
  2. Select the Change runtime type option.
  3. Choose GPU as the hardware accelerator from the drop-down.
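Once the runtime has restarted with a GPU, a quick check confirms that PyTorch can see it:

# Confirm that a CUDA-capable GPU is visible to PyTorch before running the pipeline
import torch

print(torch.cuda.is_available())          # should print True on a GPU runtime
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on Colab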

You can find all the code on GitHub.

Importing Dependencies

The Hugging Face pipeline has several dependencies. We will start by installing and importing them into our project environment.

Installing Libraries

Some libraries are not preinstalled in Colab. We need to start by installing them before importing from them.

#  Installing required libraries
%pip install --quiet --upgrade diffusers transformers scipy ftfy
#  Installing required libraries
%pip install --quiet --upgrade accelerate

Let us explain the installations above. SciPy and ftfy are standard Python libraries used for everyday Python tasks. The two major new libraries, diffusers and transformers, are explained below.

Diffusers: Diffusers is a library made available by Hugging Face that provides access to well-trained, pre-trained diffusion models for generating images. We will use it to access our pipeline and related packages.

Transformers: Transformers provides pre-trained models, tools, and APIs, which saves us the cost of training models from scratch.

# Backend
import torch

# Internet access
import requests

# Pillow (PIL) for image processing
from PIL import Image

# Hugging Face pipeline
from diffusers import StableDiffusionDepth2ImgPipeline

StableDiffusionDepth2ImgPipeline is the pipeline class that keeps our code short. All we need to do is pass it an image and a prompt describing our expectations.

Instantiating the Pre-trained Diffusers

Next, we create an instance of the pre-trained pipeline we imported above and move it to our GPU, which here is CUDA.

#  Creating a variable instance of the pipeline
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
)

#  Assigning to GPU
pipe.to("cuda")
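The pipeline bundles the parts described earlier (the VAE, U-Net, and text encoder, plus a depth estimator specific to this pipeline). Assuming a reasonably recent diffusers version, you can optionally inspect them and, on smaller GPUs, enable attention slicing to lower peak memory use:

# Optional: inspect the bundled components and reduce memory pressure on small GPUs
print(pipe.components.keys())    # includes the vae, unet, text_encoder, and depth_estimator
pipe.enable_attention_slicing()  # trades a little speed for a lower peak memory footprint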

Preparing Image Data

Let’s define a function to help us fetch images from URLs. You can skip this step if you want to use an image you have locally; just mount your drive in Colab.

# Accessing images from the web
import urllib.parse as parse
import os
import requests

# Verify that a string is a valid URL
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except ValueError:
        return False

We can define another function that uses check_url to load an image from either a URL or a local path.

# Load an image
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)

Loading Image

Now, we need an image to diffuse into another image. You can use your own photo; in this example, we use an online image for convenience. Feel free to substitute your own URL or images.

# Loading an image URL
img = load_image("https://img.freepik.com/free-photo/stacked-tomatoes_1353-262.jpg?w=740&t=st=1683821147~exp=1683821747~hmac=708f16371d1e158d76c8ea5e8b9790fb68dc75009750b8328e17c21f16d36468")

# Displaying the Image
img

Creating Text Prompts

Now that we have a usable image, let’s demonstrate some image-to-image diffusion feats on it. To do this, we pair prompts with the picture: short pieces of text with keywords describing what we expect from the diffusion. Instead of generating a random new image, the prompt guides the model’s output.

Note that we set the strength to 0.7, a moderate value (strength ranges from 0 to 1 and controls how far the output deviates from the input). Also note that negative_prompt is set to None; we will look at this more later.

# Setting Image prompt
prompt = "Some sliced tomatoes mixed"

# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=None, strength=0.7).images[0]

Now we can repeat this process on new images. The method remains the same:

  • Load the image to be diffused, and
  • Create a text description of the target image.

You can create some examples on your own, as in the sketch below.
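For instance, a minimal end-to-end variation might look like the following (the prompt and output file name are just placeholders):

# Example: diffuse the loaded image with a new prompt and save the result locally
result = pipe(prompt="A plate of roasted tomatoes", image=img,
              negative_prompt=None, strength=0.7).images[0]
result.save("diffused_tomatoes.png")  # placeholder output file name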

Creating Negative Prompts

Another approach is to add a negative prompt describing what we do not want in the output. This makes the pipeline more flexible. We do this by passing the negative text to the negative_prompt argument.

# Loading an image URL
img = load_image("https://img.freepik.com/free-photo/stacked-tomatoes_1353-262.jpg?w=740&t=st=1683821147~exp=1683821747~hmac=708f16371d1e158d76c8ea5e8b9790fb68dc75009750b8328e17c21f16d36468")

# Displaying the Image
img
# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"

# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=0.7).images[0]

Adjusting Diffusion Level

You may wonder how to control how much the new image deviates from the original. We achieve this by changing the strength level. Let us observe the effect of different strength levels on the previous image.

At strength = 0.1

# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"

# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=0.1).images[0]

At strength = 0.4

# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"

# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=0.4).images[0]

At strength = 1.0

# Setting Image prompt
prompt = ""
n_prompt = "rot, bad, decayed, wrinkled"

# Assigning to pipeline
pipe(prompt=prompt, image=img, negative_prompt=n_prompt, strength=1.0).images[0]

The strength variable makes it possible to control how strongly the diffusion affects the newly generated image, which makes the pipeline more flexible and adjustable.
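If you want to compare several strength values in one go, a small loop like the sketch below does the job; it reuses the img, n_prompt, and pipe objects defined above, and the output file names are just examples:

# Sweep a few strength values and save each result for side-by-side comparison
for s in (0.1, 0.4, 0.7, 1.0):
    out = pipe(prompt="", image=img, negative_prompt=n_prompt, strength=s).images[0]
    out.save(f"tomatoes_strength_{s}.png")  # example file names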

Limitations of Diffusion Models

Before we wrap up Stable Diffusion, it is important to understand the limitations and challenges you can face with these pipelines. Every new technology has some issues at first.

  • The stable diffusion model was trained on images with 512×512 resolution, so image quality tends to degrade when we generate photos at higher dimensions. Newer versions of the Stable Diffusion model address this partially by natively generating images at 768×768 resolution, but as long as there is a maximum native resolution, use cases such as printing large banners and flyers remain limited.
  • The model was trained on the LAION dataset, provided by a non-profit organization that offers datasets, tools, and models for research purposes. In practice, the model has shown weakness in rendering human limbs and faces accurately.
  • Image-to-image stable diffusion can run on a CPU in a feasible time, ranging from a few seconds to a few minutes, which reduces the need for a high-end computing environment. Things only become demanding when the pipeline is heavily customized, which can require more RAM and processing power; the stock pipeline keeps the complexity low.
  • Lastly, there is the issue of legal rights. The practice can easily run into legal matters because the models require vast image datasets to learn and perform well. One instance is the January 2023 copyright infringement lawsuit brought by three artists against Stability AI, Midjourney, and DeviantArt. There can therefore be limitations on freely building with these images.

Conclusion

In conclusion, while the concept of diffusers is cutting-edge, the Hugging Face pipeline makes it easy to integrate into our projects with simple, direct code. Using prompts on the images lets us bring an imagined picture into the diffusion process. The strength parameter is another critical control: it sets the level of diffusion. We have seen how to generate new images from images.

Key Takeaways

  • By applying state-of-the-art techniques, stable diffusion models generate images and audio.
  • Typical applications of stable diffusion include text-to-image, text-to-video, and text-to-3D.
  • StableDiffusionDepth2ImgPipeline is the pipeline class that keeps our code short; we only need to pass an image and a prompt describing our expectations.

Learn More: Pytorch | Getting Started With Pytorch

Master image generation with our Stable Diffusion with Hugging Face course. Learn to create stunning images from text prompts and input images with ease.


The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What can I do with Stable Diffusion?

A. Stable Diffusion allows users to generate high-quality images by iteratively refining them through diffusion processes. This technique enhances image quality and realism over time, making it suitable for various creative and artistic applications.

Q2. Can I use Stable Diffusion for free?

A. Yes, Stable Diffusion is open-source and available for free. Users can access and utilize the model without any cost, facilitating experimentation and development in the field of image generation and enhancement.

Q3. Can you make NSFW with Stable Diffusion?

A. Yes, Stable Diffusion can generate NSFW (Not Safe For Work) content as it allows users to control and manipulate image generation processes. However, ethical considerations and guidelines should be followed when creating such content.

Q4. How to start working with Stable Diffusion?

A. To begin working with Stable Diffusion, install the necessary libraries and dependencies, such as PyTorch and the Hugging Face diffusers library. Then explore the tutorials and documentation available online to understand its functionality and start experimenting with image generation tasks.

Mobarak 18 Jul, 2024

I am an AI Engineer with a deep passion for research, and solving complex problems. I provide AI solutions leveraging Large Language Models (LLMs), GenAI, Transformer Models, and Stable Diffusion.