Fun With Diffusion Models! - Project Spec
In the first part of the project, I played around with a pretrained text-to-image diffusion model, DeepFloyd IF.
To try out the diffusion model, I ran the following 3 prompts with two different num_inference_steps values. A higher number of steps resulted in higher-quality output. I used a random seed of 180.
20 steps:
| A man wearing a hat | A rocket ship | An oil painting of a snowy mountain village |
|---|---|---|
| ![]() | ![]() | ![]() |
100 steps:
| A man wearing a hat | A rocket ship | An oil painting of a snowy mountain village |
|---|---|---|
| ![]() | ![]() | ![]() |
First, I implemented a function forward which takes in an original image im and a time step t. A higher t means more noise is added. Mathematically, this is represented as follows:

$$x_t = \sqrt{\bar{\alpha_t}} x_0 + \sqrt{1 - \bar{\alpha_t}} \epsilon$$

where:
- $$x_t$$ is the image at time step $$t$$ ($$x_0$$ is the original image)
- $$\bar{\alpha_t}$$ is a noise coefficient chosen by the people who trained DeepFloyd
- $$\epsilon \sim \mathcal{N}(0, 1)$$ is Gaussian noise

| Original | t = 250 | t = 500 | t = 750 |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
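
As a minimal sketch of the forward process, assuming the standard DDPM form $$x_t = \sqrt{\bar{\alpha_t}} x_0 + \sqrt{1 - \bar{\alpha_t}} \epsilon$$: the linear schedule below is a hypothetical stand-in for DeepFloyd's actual coefficients, and the extra `alphas_cumprod` and `rng` arguments are my own additions for illustration.

```python
import numpy as np

def forward(im, t, alphas_cumprod, rng):
    """Noise a clean image `im` up to time step `t`; returns (x_t, eps)."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)          # eps ~ N(0, 1)
    x_t = np.sqrt(abar) * im + np.sqrt(1.0 - abar) * eps
    return x_t, eps

# Hypothetical schedule (NOT DeepFloyd's real coefficients)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(180)
im = np.zeros((3, 64, 64))                       # placeholder image
noisy_250, _ = forward(im, 250, alphas_cumprod, rng)
noisy_750, _ = forward(im, 750, alphas_cumprod, rng)
```

With this schedule, the t = 750 output carries visibly more noise than the t = 250 output, matching the trend in the table above.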
Since a diffusion model is essentially trying to denoise an image step by step, I first tried a classical denoising technique to recover the original image: Gaussian blur. As expected, it does not work very well. We get rid of the noise, but we also get rid of the "signal" (i.e. the actual original image).
| t | Noisy | Denoised |
|---|---|---|
| 250 | ![]() | ![]() |
| 500 | ![]() | ![]() |
| 750 | ![]() | ![]() |
Next, I used a pretrained diffusion model (a UNet trained on a large dataset of pairs of clean and noisy images) to denoise each noisy image in one step. For higher t values (such as t = 750), the denoised tower starts to look quite different from the original Campanile.
For reference, here is the original image again:
| t | Noisy | Denoised |
|---|---|---|
| 250 | ![]() | ![]() |
| 500 | ![]() | ![]() |
| 750 | ![]() | ![]() |
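
One-step denoising can be sketched by inverting the forward process: given the model's noise estimate, solve $$x_t = \sqrt{\bar{\alpha_t}} x_0 + \sqrt{1 - \bar{\alpha_t}} \epsilon$$ for $$x_0$$. The helper name and the value of `abar_t` below are hypothetical, and the true noise stands in for the UNet's prediction so the example stays self-contained.

```python
import numpy as np

def estimate_clean_image(x_t, eps_hat, abar_t):
    """Recover an estimate of x_0 from x_t and a noise estimate eps_hat."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy "clean" image
eps = rng.standard_normal(x0.shape)
abar_t = 0.3                                   # hypothetical coefficient
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# With a perfect noise estimate, the clean image is recovered exactly.
x0_hat = estimate_clean_image(x_t, eps, abar_t)
```

A real UNet's estimate is imperfect, and the error is amplified by $$1 / \sqrt{\bar{\alpha_t}}$$, which is one way to see why the result degrades at large t.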
To solve the issue mentioned in 1.3, we can use iterative denoising with a stride of 30 time steps, going from t = 990 down to t = 0. I implemented the function iterative_denoise which uses the following formula:

$$x_{t'} = \frac{\sqrt{\bar{\alpha_{t'}}} \beta_t}{1 - \bar{\alpha_t}} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha_{t'}})}{1 - \bar{\alpha_t}} x_t + v_\sigma$$

where:
- $$x_t$$ is the same as in 1.1
- $$x_{t'}$$ is the noisy image at time step $$t'$$ where $$t' < t$$ (i.e. a slightly less noisy image)
- $$\bar{\alpha_t}$$ is the same as in 1.1
- $$\alpha_t = \frac{\bar{\alpha_t}}{\bar{\alpha_{t'}}}$$
- $$\beta_t = 1 - \alpha_t$$
- $$x_0$$ is the same as in 1.1 (in practice, the model's current estimate of the clean image)
- $$v_\sigma$$ is random noise

See also: Denoising Diffusion Probabilistic Models by Jonathan Ho, Ajay Jain, and Pieter Abbeel.
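
A single update of this kind can be sketched as follows. This is only an illustration under assumptions: `x0_hat` is taken to be the current clean-image estimate, the random-noise term is passed in by the caller rather than derived from the model, and the function name and coefficient values are hypothetical.

```python
import numpy as np

def denoise_step(x_t, x0_hat, abar_t, abar_prev, noise=None):
    """Move from time step t to a less noisy time step t' < t."""
    alpha = abar_t / abar_prev          # alpha_t = abar_t / abar_t'
    beta = 1.0 - alpha                  # beta_t = 1 - alpha_t
    coef_x0 = np.sqrt(abar_prev) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    x_prev = coef_x0 * x0_hat + coef_xt * x_t
    if noise is not None:               # optional random-noise term
        x_prev = x_prev + noise
    return x_prev

rng = np.random.default_rng(0)
x0_hat = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
x_t = rng.standard_normal(x0_hat.shape)

# Sanity check: stepping all the way to a noise-free level (abar_t' = 1)
# returns the clean estimate itself.
final = denoise_step(x_t, x0_hat, abar_t=0.5, abar_prev=1.0)
```

The update is an interpolation between the clean estimate and the current noisy image, with the weights shifting toward the clean estimate as $$t'$$ approaches 0.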
The following shows the results of iterative denoising for every 5 loops:
| t = 690 | t = 540 | t = 390 | t = 240 | t = 90 |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
Here are the other outputs for reference:
| Original | Gaussian Blur Denoised | One-Step Denoised | Iterative Denoised |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
We can observe that iterative denoising produces the best result.
Diffusion models work by converting pure noise into completely denoised images. Thus, I can use my iterative_denoise function from 1.4 to generate images from completely noisy images (i.e. images whose pixels are drawn from a Gaussian distribution). Here are 5 generated images:
| Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
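Some of the images in 1.5 are nonsensical, so to improve image quality (at the expense of image diversity), I used a technique called Classifier-Free Diffusion Guidance (CFG), developed by Jonathan Ho and Tim Salimans.
In CFG, we compute a conditional noise estimate $$\epsilon_c$$ and an unconditional noise estimate $$\epsilon_u$$, and combine them into a new noise estimate:

$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$

where $$\gamma$$ controls the strength of CFG.
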
| Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
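
The CFG combination itself is a one-liner. The sketch below assumes the usual form, where the final estimate extrapolates from the unconditional estimate toward the conditional one; the function name and the guidance scale of 7 are illustrative choices, not values taken from this project.

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, gamma):
    """Classifier-free guidance: eps_u + gamma * (eps_c - eps_u)."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

eps_uncond = np.zeros((3, 4, 4))   # stand-in for the unconditional estimate
eps_cond = np.ones((3, 4, 4))      # stand-in for the conditional estimate

same_as_cond = cfg_noise(eps_cond, eps_uncond, gamma=1.0)   # plain conditional
same_as_uncond = cfg_noise(eps_cond, eps_uncond, gamma=0.0) # plain unconditional
guided = cfg_noise(eps_cond, eps_uncond, gamma=7.0)         # hypothetical scale
```

Setting gamma to 0 or 1 recovers the unconditional or conditional estimate, and values above 1 push past the conditional estimate, which is where the quality gain (and the loss of diversity) comes from.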
Below are the results of taking 3 original images, adding varying levels of noise, and then iteratively denoising the noisy images with CFG. i_start denotes where we begin the iterative denoising process: a lower value means we start from a noisier image and denoise for more iterations, so the result strays further from the original, while a higher value means we add less noise and denoise for fewer iterations, so the result stays closer to the original.
In addition to the given test image of the Campanile, I chose an image of the Golden Gate Bridge and the Eiffel Tower that were resized to
| i_start = 1 | i_start = 3 | i_start = 5 | i_start = 7 | i_start = 10 | i_start = 20 | Original |
|---|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
See also: SDEdit algorithm by Meng et al. (2022).
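
To make the role of i_start concrete, here is a small sketch that maps each value to a starting time step. It assumes the time steps run from 990 down to 0 in strides of 30 and that i_start is an index into that list; `start_info` is a hypothetical helper name.

```python
# Assumption: strided time steps 990, 960, ..., 30, 0 (stride of 30),
# with i_start used as an index into this list.
strided_timesteps = list(range(990, -1, -30))

def start_info(i_start):
    """Return (starting time step, number of denoising steps remaining)."""
    t_start = strided_timesteps[i_start]
    steps_left = len(strided_timesteps) - 1 - i_start
    return t_start, steps_left

schedule = {i: start_info(i) for i in [1, 3, 5, 7, 10, 20]}
```

Under these assumptions, i_start = 1 begins at t = 960 with 32 steps remaining, while i_start = 20 begins at t = 390 with only 13 steps remaining.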
In this subpart, I drew 2 images and found 1 image from the web and utilized the diffusion model to edit the images using the same method as above.
| i_start = 1 | i_start = 3 | i_start = 5 | i_start = 7 | i_start = 10 | i_start = 20 | Original |
|---|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
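Next, I implemented the inpainting part of this paper, RePaint: Inpainting using Denoising Diffusion Probabilistic Models by Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. (I also used this paper, Scene Completion Using Millions of Photographs by James Hays and Alexei Efros, for inspiration on how to "complete" the scene.)
Given an image $$x_{orig}$$ and a binary mask $$m$$, we can create a new image that keeps the original content where $$m$$ is 0 and has new content where $$m$$ is 1. To do this, after every denoising step we force everything outside the mask back to the original image with the right amount of noise for that time step:

$$x_t \leftarrow m x_t + (1 - m) \text{forward}(x_{orig}, t)$$
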
| Campanile with new top | Cat with dog head | Oski with dog head |
|---|---|---|
| ![]() | ![]() | ![]() |
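
The masking step behind these results can be sketched as below. This assumes a RePaint-style blend in which a mask value of 1 marks the region to regenerate and 0 marks the region to keep; the function name and toy arrays are hypothetical, and `noised_orig` stands in for the original image noised to the current time step.

```python
import numpy as np

def apply_inpaint_mask(x_t, noised_orig, mask):
    """Keep generated content inside the mask, original content outside."""
    return mask * x_t + (1.0 - mask) * noised_orig

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))          # current denoising state
noised_orig = rng.standard_normal((3, 8, 8))  # original noised to step t
mask = np.zeros((1, 8, 8))
mask[:, :4, :] = 1.0                          # regenerate the top half only

blended = apply_inpaint_mask(x_t, noised_orig, mask)
```

Applying this after every denoising step keeps the unmasked region pinned to the original image while the model fills in the masked region.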
In this part, I changed the text prompt from "a high quality image" to "a rocket ship", obtaining the results below with different levels of noise for the Campanile image and some images from the web.
| i_start = 1 | i_start = 3 | i_start = 5 | i_start = 7 | i_start = 10 | i_start = 20 | Original |
|---|---|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
In this part, I implemented Visual Anagrams from this paper, Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models by Daniel Geng, Inbum Park, and Andrew Owens. To do this, we denoise like before, but at each step we estimate the noise twice: once for the image with the first text prompt, and once for the upside-down image with the second text prompt. That way, the final image looks like the first prompt right-side up and like the second prompt upside-down.
The full algorithm is as follows:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$

$$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$$

$$\epsilon = (\epsilon_1 + \epsilon_2) / 2$$

where flip is a function that flips the image and p_i are the 2 text prompts. Our final noise estimate is $$\epsilon$$.
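
A rough sketch of the combined noise estimate, assuming the two per-prompt estimates are averaged after the second one is flipped back. The `unet` argument is a stand-in callable with a hypothetical `(image, t, prompt)` signature, not the real DeepFloyd interface.

```python
import numpy as np

def flip(x):
    """Flip a (C, H, W) image upside-down."""
    return x[..., ::-1, :]

def anagram_noise(x_t, t, p1, p2, unet):
    """Average the upright estimate for p1 and the un-flipped estimate for p2."""
    eps1 = unet(x_t, t, p1)
    eps2 = flip(unet(flip(x_t), t, p2))
    return (eps1 + eps2) / 2.0

# Stand-in "model" that just echoes its input, to check the plumbing.
identity_unet = lambda x, t, prompt: x

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
eps = anagram_noise(x_t, 500, "prompt one", "prompt two", identity_unet)
```

Flipping the second estimate back before averaging matters: both estimates must live in the same orientation as $$x_t$$ for the denoising update to make sense.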
*This should say "coast" but that's how the text embeddings were loaded in the starter code provided by staff :)
In this part, I implemented this paper, Factorized Diffusion: Perceptual Illusions by Noise Decomposition by Daniel Geng, Inbum Park, and Andrew Owens. Specifically, I used the diffusion model to create hybrid images like in Project 2.
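Similar to the visual anagrams, we just have to adjust our noise estimate as follows:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$

$$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$

$$\epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$

where $$f_\text{lowpass}$$ is a low-pass filter, $$f_\text{highpass}$$ is a high-pass filter, and p_i are the 2 text prompts.
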
| Prompt 1 | Prompt 2 | Hybrid Image |
|---|---|---|
| a lithograph of waterfalls | a lithograph of a skull | ![]() |
| an oil painting of people around a campfire | a lithograph of a skull | ![]() |
| a man wearing a hat | a lithograph of waterfalls | ![]() |
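
The frequency split can be sketched with a separable Gaussian blur as the low-pass filter and its residual as the high-pass filter. The kernel size and sigma are hypothetical, and which prompt supplies the low frequencies is an assumption made for illustration.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """1D Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def lowpass(x, size=9, sigma=2.0):
    """Separable Gaussian blur over the last two axes (H and W)."""
    k = gaussian_kernel(size, sigma)
    blur = lambda v: np.convolve(v, k, mode="same")
    x = np.apply_along_axis(blur, -1, x)
    return np.apply_along_axis(blur, -2, x)

def highpass(x, size=9, sigma=2.0):
    """Whatever the low-pass filter removes."""
    return x - lowpass(x, size, sigma)

def hybrid_noise(eps1, eps2):
    """Low frequencies from eps1 plus high frequencies from eps2."""
    return lowpass(eps1) + highpass(eps2)

rng = np.random.default_rng(0)
eps1 = rng.standard_normal((3, 16, 16))
eps2 = rng.standard_normal((3, 16, 16))
eps = hybrid_noise(eps1, eps2)
```

Because the high-pass filter is defined as the residual of the low-pass filter, feeding the same estimate into both branches returns it unchanged, which is a handy sanity check.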
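In this part of the project, I wrote a diffusion model from scratch and trained it on the MNIST dataset.
Given a noisy image $$z$$, the goal is to train a denoiser $$D_\theta$$ that maps it to a clean image $$x$$ by minimizing an L2 loss:

$$L = \mathbb{E}_{z, x} \|D_\theta(z) - x\|^2$$
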
I did this by implementing a UNet, from this paper: U-Net: Convolutional Networks for Biomedical Image Segmentation by Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
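To begin, I visualized how an image becomes more and more noisy by implementing a function add_noise which takes a clean image x and adds Gaussian noise to produce a noisy image $$z$$:

$$z = x + \sigma \epsilon$$

where $$\epsilon \sim \mathcal{N}(0, 1)$$ is Gaussian noise and $$\sigma$$ controls the noise level.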
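
A minimal sketch of add_noise, assuming the noisy image is the clean image plus scaled standard Gaussian noise. The `rng` argument and the list of noise levels are hypothetical choices for illustration.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Return x + sigma * eps with eps ~ N(0, 1), same shape as x."""
    eps = rng.standard_normal(x.shape)
    return x + sigma * eps

rng = np.random.default_rng(0)
x = np.zeros((1, 28, 28))            # stand-in for one MNIST digit
sigmas = [0.0, 0.25, 0.5, 1.0]       # hypothetical noise levels
noisy = [add_noise(x, s, rng) for s in sigmas]
```

With a noise level of 0 the image is returned unchanged, and the spread of pixel values grows with the noise level.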
Next, I began training my unconditioned UNet to be able to denoise an image in 1 step by training on pairs of clean and noisy images where
- Batch size: 256
- Epochs: 5
- Hidden dimension D: 128
- Optimizer: Adam
- Learning rate: 1e-4

Here is my training loss graph:
Here are the results of the denoiser at epoch 1 and epoch 5:
To see if the denoiser can generalize to other levels of noise, I kept the image the same and varied the noise level.
In this part, I implemented DDPM from this paper, Denoising Diffusion Probabilistic Models by Jonathan Ho, Ajay Jain, and Pieter Abbeel.
Instead of estimating the denoised image, we can train a UNet to estimate the noise that was added to the image. This changes the loss function we're trying to minimize accordingly:

$$L = \mathbb{E}_{\epsilon, x_0} \|\epsilon_\theta(x_t) - \epsilon\|^2$$

where $$\epsilon_\theta$$ is the UNet's noise estimate and the noisy image $$x_t$$ is

$$x_t = \sqrt{\bar{\alpha_t}} x_0 + \sqrt{1 - \bar{\alpha_t}} \epsilon$$

where:
- $$x_t$$ is the image at time step $$t$$ ($$x_0$$ is the original image)
- $$\bar{\alpha_t}$$ is a noise coefficient computed as follows:
  - $$\beta_t$$ is a list of 300 numbers (t = 299 to t = 0) such that $$\beta_0 = 0.0001$$ and $$\beta_T = 0.02$$, with all other elements evenly spaced between the two
  - $$\alpha_t = 1 - \beta_t$$
  - $$\bar{\alpha_t} = \prod_{s = 1}^{t} \alpha_s$$
- $$\epsilon \sim \mathcal{N}(0, 1)$$ is Gaussian noise

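
The schedule described above can be precomputed once before training. This sketch assumes 0-based array indexing, so `alpha_bars[t]` is the product of `alphas[0]` through `alphas[t]`; the variable names are my own.

```python
import numpy as np

T = 300                                   # number of time steps
betas = np.linspace(1e-4, 0.02, T)        # evenly spaced from 0.0001 to 0.02
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # running product of the alphas

# alpha_bars starts near 1 (almost no noise) and decays toward 0 (mostly noise)
first, last = alpha_bars[0], alpha_bars[-1]
```

Since every alpha is slightly below 1, the running product decreases monotonically, so later time steps always carry more noise than earlier ones.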
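We can add time conditioning to the UNet by modifying the loss function again, so that the UNet's noise estimate $$\epsilon_\theta$$ also takes the time step $$t$$ as input:

$$L = \mathbb{E}_{\epsilon, x_0, t} \|\epsilon_\theta(x_t, t) - \epsilon\|^2$$
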
and adding an FCBlock to the unconditioned UNet made up of some nn.Linear and nn.GELU elements (see the project spec for more details).
I implemented this algorithm:
to perform the forward pass of the UNet. I trained the network with the following hyperparameters, as suggested:
- Batch size: 128
- Epochs: 20
- Hidden dimension D: 64
- Optimizer: Adam
- Learning rate: 1e-3
- Learning rate decay scheduler: ExponentialLR with gamma = 0.1 ** (1.0 / num_epochs)

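
The choice of gamma has a simple interpretation, worked out below in plain Python: an exponential scheduler multiplies the learning rate by gamma once per step, so stepping it once per epoch (an assumption about the training loop) shrinks the learning rate by a factor of 10 over the full run.

```python
num_epochs = 20
initial_lr = 1e-3
gamma = 0.1 ** (1.0 / num_epochs)

# Learning rate at the start of each epoch, plus the value after the last one,
# assuming the scheduler is stepped exactly once per epoch.
lrs = [initial_lr * gamma ** epoch for epoch in range(num_epochs + 1)]

final_lr = lrs[-1]   # approximately 1e-4, i.e. 10x smaller than the start
```

Tying gamma to num_epochs this way keeps the total decay fixed at 10x no matter how many epochs are trained.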
Below is my training loss curve:
I then implemented this algorithm:
to generate/sample 10 digits from the UNet at epoch = 5:
and at epoch 20:
One thing I had to be very careful about was setting the seed used to generate the starting noise.
To be able to generate a particular digit (rather than any digit), I implemented class conditioning by adding 2 more FCBlocks in a similar way as before (see the spec for details). Additionally, I dropped the class label for 10% of the images in each batch (i.e. set that image's one-hot-encoded vector to all zeros). I also used Classifier-Free Guidance to estimate noise from both the unconditioned and conditioned UNet.
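
The label dropout can be sketched as a per-image mask applied to the one-hot vectors. The function name, the `rng` argument, and the batch size below are hypothetical; only the 10% drop rate comes from the text.

```python
import numpy as np

def one_hot_with_dropout(labels, num_classes, p_uncond, rng):
    """One-hot encode labels, zeroing each row with probability p_uncond."""
    one_hot = np.eye(num_classes)[labels]            # shape (N, num_classes)
    keep = rng.random(len(labels)) >= p_uncond       # True = keep the label
    return one_hot * keep[:, None], keep

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10000)             # toy batch of digit labels
c, keep = one_hot_with_dropout(labels, 10, p_uncond=0.1, rng=rng)
drop_fraction = 1.0 - keep.mean()
```

Rows with a dropped label are all zeros, so the same network learns both the conditional and the unconditional noise estimate, which is what Classifier-Free Guidance needs at sampling time.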
I implemented this algorithm:
And had this training loss curve with the same hyperparameters as before:
I then implemented this algorithm:
to generate/sample digits 0-9 in that order at epoch 5:
and at epoch 20:
Once again, I had to be very careful about setting the seed used to generate the starting noise (I used epoch + j, where epoch is the epoch index and j is an index in [0, 3], since we have to sample digits 0-9 at least 4 times) and the seed used to generate the noise added at each sampling step.
I created gifs of the sampling process from t = 299 to t = 0 for the time conditioned UNet at epoch 5 (first row) and epoch 20 (second row):
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
I did the same for the class conditioned UNet:
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |






























































































































































