An Introduction to Diffusion Models and Stable Diffusion
Introduction
Imagine a world where creativity transcends the limitations of brushes, clay, and canvas. In 2022, at the Colorado State Fair's art competition, a groundbreaking entry defied the conventional boundaries of artistic creation. Jason M. Allen's piece, “Théâtre D’opéra Spatial,” won first prize, not through traditional means, but with the aid of an AI program called Midjourney, which uses a diffusion model to generate images. By turning a text prompt into a hyper-realistic image, Allen's creation not only captivated the audience and judges but also set off a fierce backlash from artists who accused him of, essentially, cheating.
“Théâtre D’opéra Spatial,” the winning entry at the Colorado State Fair.
However, the rise of Midjourney and other AI advancements merely scratches the surface of what is possible with diffusion models. These generative models have become a force to be reckoned with, attracting attention and pushing the boundaries of image synthesis, a field previously ruled by Generative Adversarial Networks (GANs) [1]. In fact, diffusion models have been shown to beat GANs at image synthesis [2].
In this article, we explore the theoretical foundations of diffusion models, uncovering their inner workings and understanding their fundamental components and remarkable effectiveness. Along the way, we’ll shine a spotlight on one of the most popular families of diffusion models: Stable Diffusion.
Join us as we uncover the secrets behind diffusion models' success and how they are revolutionizing image generation. By the end, you'll have a solid understanding of their transformative potential for AI-driven creativity.
Diffusion Models
Diffusion Models are generative models that learn from data during training and generate new examples similar to what they have learned. These models draw inspiration from non-equilibrium thermodynamics and have achieved state-of-the-art quality in generating various forms of data, including high-quality images and even audio (e.g., Audio Diffusion Models [3]). If you are interested in how diffusion models can be applied to audio, check out our blog post on voice cloning.
In a nutshell, Diffusion Models work by corrupting training data through the gradual addition of Gaussian noise (the forward diffusion process), and then learning to recover the original information by reversing this noising process step by step (the reverse diffusion process). Once trained, these models can generate new data by sampling random Gaussian noise and passing it through the learned denoising process, as the sketch below illustrates.
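To make this concrete, here is a minimal, runnable PyTorch sketch of the generation loop. The ToyDenoiser is a hypothetical placeholder for the trained denoising network (in practice, a U-Net), so the output here is meaningless noise; the point is the shape of the loop, which follows the DDPM sampling rule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule (see Schedulers below)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products, one per timestep

# Hypothetical stand-in for the trained denoising network; a real model
# is a U-Net that takes (x_t, t) and predicts the noise that was added.
class ToyDenoiser(torch.nn.Module):
    def forward(self, x, t):
        return torch.zeros_like(x)  # untrained placeholder prediction

model = ToyDenoiser()

# Reverse diffusion: start from pure Gaussian noise and denoise step by step.
x = torch.randn(1, 3, 64, 64)
for t in reversed(range(T)):
    eps = model(x, t)  # predicted noise at this step
    mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + betas[t].sqrt() * z  # DDPM sampling step
# x now holds a generated sample (garbage here, since the model is untrained)
```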
Diffusion Models go beyond just creating high-quality images. They have gained popularity by addressing the well-known challenges associated with adversarial training in GANs, offering advantages in training stability, efficiency, scalability, and parallelization.
In the following sections, we'll dive deeper into the fine details of Diffusion Models. We'll explore the forward diffusion process, the reverse diffusion process, and an overview of the steps involved in training. We'll also build an intuition for how the loss function is calculated. By examining these components, we'll acquire a comprehensive understanding of how Diffusion Models function and how they achieve their impressive results.
Forward Diffusion Process
The forward diffusion process consists of gradually adding Gaussian noise to an input image, step by step, for a total of T steps. At step 0 we have the original image; at step 1, a slightly corrupted version; and so on, step by step, until all information from the original image is lost.
Forward diffusion process [4]
To formalize this process, we can view it as a fixed Markov chain with T steps, where the image at timestep t maps to its subsequent state at timestep t+1. As such, each step depends only on the previous one, allowing us to derive a closed-form formula to obtain the corrupted image at any desired timestep, bypassing the need for iterative computation.
Forward diffusion formulation [4]
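Written out, each forward step of the DDPM formulation adds Gaussian noise according to

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$

and, defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the closed form for jumping straight from $x_0$ to $x_t$ is

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right), \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$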
Consequently, this closed-form formula enables direct sampling of xₜ at any timestep, significantly accelerating the forward diffusion process.
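In PyTorch, this one-shot sampling is just a couple of lines. Here is a minimal sketch, assuming the linear β schedule described in the next section (the q_sample name is ours, not from any library):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear β schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - β)

def q_sample(x0, t, noise=None):
    """Corrupt x0 to timestep t in one shot, using the closed-form formula."""
    if noise is None:
        noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)  # a dummy "image"
x_mid = q_sample(x0, t=499)    # timestep 500 of 1000, no iteration required
```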
Schedulers
In addition, the noise added at each step follows a deliberate pattern, determined by a scheduler. In the original Denoising Diffusion Probabilistic Models (DDPM) paper [4], the authors define a linear schedule in which βₜ ranges from 0.0001 at the first timestep to 0.02 at the final timestep T.
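As a sketch, this linear schedule and the signal decay it induces can be computed directly. The square root of the cumulative product (the coefficient on the original image in the closed-form formula) shows how quickly the image content fades:

```python
import torch

T = 1000
# Linear schedule from the DDPM paper: β grows from 1e-4 to 0.02 over T steps.
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# √(cumulative product) measures how much of the original signal survives
# at each timestep; it decays toward zero as t approaches T.
for t in [0, 249, 499, 749, 999]:
    print(f"t={t + 1:4d}  remaining signal ≈ {alpha_bars[t].sqrt().item():.3f}")
```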