Fun with Diffusion Models

Varun Bharadwaj - CS 180

Part A: The Power of Diffusion Models

Step 0: Setup

For this project we will be using the DeepFloyd IF diffusion model. DeepFloyd IF is a two-stage model: the first stage produces a 64x64 image, and the second stage uses the output of the first to generate a 256x256 image. Below are the model outputs using 20 and 200 inference steps.

Model Outputs (Low Inference Steps)

Here are the outputs of the stage 1 model using 20 inference steps.

An oil painting of a snowy mountain village
A man wearing a hat
A rocket ship

Here are the outputs of the stage 2 model using 20 inference steps.

An oil painting of a snowy mountain village
A man wearing a hat
A rocket ship

Model Outputs (High Inference Steps)

Here are the outputs of the stage 1 model using 200 inference steps.

An oil painting of a snowy mountain village
A man wearing a hat
A rocket ship

Here are the outputs of the stage 2 model using 200 inference steps.

An oil painting of a snowy mountain village
A man wearing a hat
A rocket ship

Quality of the Outputs

The stage 1 outputs, as expected, were lower resolution than the stage 2 outputs. However, there was also a substantial difference in quality between using 200 inference steps and only 20. A higher number of inference steps led to images with much more detail, and many small touches made them look far more realistic. The oil painting of a snowy mountain village had building facades with small spots of snow on them. The rocket ship had a more realistic exhaust with a mixture of colors rather than a simple red line, and there was even a faint reflection of some celestial body on its window.

The biggest difference, however, was with the man wearing a hat. With 200 inference steps the model generated a very realistic-looking man. In the 20-step output, the man's eyes were clearly misaligned, and his skin was idealized, without any imperfections, which gives an unrealistic look. His hat was also improperly aligned: his head is slightly tilted, but the hat sits perfectly flat. With more inference steps, the final image had far more detail. The man has a more realistic beard and skin, and the hat fits his head properly, with creases that mimic how a hat actually sits on a head. Smaller details such as wrinkles, freckles, and eye bags all make him look much more realistic.

I used the random seed 23 while running the model.

Part 1: Sampling Loops

1.1 Implementing the Forward Process

What is the Forward Process

The forward process takes a clean image and adds noise to it. The amount of noise is controlled by the parameter t: a low t means a less noisy image, and a higher t a noisier one. The specific way we add noise to the image is defined by the following equation. Here alpha_t is fixed by the developers of DeepFloyd IF and determines the ratio of original image to added noise.

Equation to calculate x_t (the noisy image) given alphas_cumprod[t]
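As a concrete sketch of this equation (using NumPy arrays in place of torch tensors, and a made-up linear schedule rather than DeepFloyd's actual `alphas_cumprod`):

```python
import numpy as np

def forward(x0, t, alphas_cumprod, rng=np.random.default_rng(23)):
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

# toy schedule: abar near 1 (mostly signal) at t=0, near 0 (mostly noise) at t=999
alphas_cumprod = np.linspace(0.9999, 0.0001, 1000)
campanile = np.zeros((64, 64))          # stand-in for the real image
x_250 = forward(campanile, 250, alphas_cumprod)
x_750 = forward(campanile, 750, alphas_cumprod)
```

As t grows, the signal coefficient shrinks and the noise coefficient grows, which is why the t=750 image below is far noisier than the t=250 one.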

Forward Process Results

Original Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750

1.2 Classical Denoising

Previously in the class, we used low-pass filters to remove noise from an image. However, due to the large amount of noise in these images (as is visible in the Campanile noised at t=750), the results are not very good.
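A minimal sketch of the classical approach, with a hand-rolled separable Gaussian filter (the actual project code presumably uses a library blur):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_blur(img, sigma=1.0):
    # separable low-pass filter: blur along rows, then along columns
    k = gaussian_kernel(sigma, radius=int(3 * sigma) + 1)
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

noisy = np.random.default_rng(23).standard_normal((64, 64))
smoothed = gaussian_blur(noisy, sigma=2.0)
```

Blurring suppresses the noise, but it suppresses the image's high frequencies just as much, which is exactly the failure visible in the results below.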

Noisy Images

Original Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750

Gaussian Blurring Results

Gaussian blurred Campanile at t=250
Gaussian blurred Campanile at t=500
Gaussian blurred Campanile at t=750

1.3 One-step Denoising

We will now use a pretrained diffusion model to estimate the Gaussian noise and clean up our Campanile image. Below are my results using one-step denoising with the pretrained diffusion model.
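One-step denoising just inverts the forward-process equation using the model's noise estimate. A sketch (NumPy; in the real pipeline `eps_hat` comes from the pretrained UNet):

```python
import numpy as np

def one_step_denoise(x_t, eps_hat, t, alphas_cumprod):
    # solve the forward equation for x0: x0 = (x_t - sqrt(1-abar)*eps) / sqrt(abar)
    abar = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - abar) * eps_hat) / np.sqrt(abar)

# sanity check: with the true noise, the clean image is recovered exactly
alphas_cumprod = np.linspace(0.9999, 0.0001, 1000)
rng = np.random.default_rng(23)
x0 = rng.standard_normal((64, 64))
eps = rng.standard_normal((64, 64))
x_t = np.sqrt(alphas_cumprod[500]) * x0 + np.sqrt(1 - alphas_cumprod[500]) * eps
recovered = one_step_denoise(x_t, eps, 500, alphas_cumprod)
```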

Noisy Images

Original Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750

One Step Denoising Results

One Step Denoised Campanile at t=250
One Step Denoised Campanile at t=500
One Step Denoised Campanile at t=750

1.4 Iterative Denoising

These diffusion models were not trained to denoise very noisy images back onto the natural image manifold in a single step; they are much better at denoising iteratively across multiple steps. We can use this to bring our noisy images back onto the natural manifold step by step. Below are the results of doing so.
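Each iterative step blends the current clean-image estimate with the still-noisy image to move to the next, less noisy timestep. A sketch of the update (the extra predicted-variance term the real sampler adds is omitted here):

```python
import numpy as np

def ddpm_step(x_t, x0_hat, abar_t, abar_prev):
    # move from timestep t to the less-noisy timestep t' by interpolating
    # between the noisy image x_t and the current clean estimate x0_hat
    alpha = abar_t / abar_prev            # per-step alpha
    beta = 1.0 - alpha
    x0_coef = np.sqrt(abar_prev) * beta / (1.0 - abar_t)
    xt_coef = np.sqrt(alpha) * (1.0 - abar_prev) / (1.0 - abar_t)
    return x0_coef * x0_hat + xt_coef * x_t
```

Note the limiting case: when the target timestep is fully clean (abar_prev = 1), the step returns `x0_hat` itself.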

Iteratively Denoised Images

Iteratively Denoised Campanile at t=690
Iteratively Denoised Campanile at t=540
Iteratively Denoised Campanile at t=390
Iteratively Denoised Campanile at t=240
Iteratively Denoised Campanile at t=90

Final Results

Original Campanile
Fully Denoised Campanile (Iterative Denoising)
Fully Denoised Campanile (One Step Denoising)
Gaussian Denoised

Below is an illustration showing the full iterative denoising process at work.

1.5 Sampling from the Diffusion Model

We can now sample random images from the diffusion model by passing in pure random noise instead of our noised Campanile. Below are 5 random outputs.

Sampled Images

Gifs Showing Sampling Process

1.6 Classifier Free Guidance

The quality of the images generated in the previous part was not very good. To improve this, we implement classifier-free guidance (CFG), which allows us to generate higher-quality images by computing both a conditional and an unconditional noise estimate and scaling our noise to be a linear extrapolation of the two. Below are my samples using CFG.
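The CFG combination itself is a single line. A sketch, where a `scale` greater than 1 extrapolates past the conditional estimate:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    # linear extrapolation from the unconditional estimate toward the
    # conditional one; scale=0 ignores the prompt, scale=1 follows it
    # exactly, scale>1 overshoots it for stronger guidance
    return eps_uncond + scale * (eps_cond - eps_uncond)
```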

Sampled Images

Gifs Showing Sampling Process

1.7 Image to Image Translation

We can use this sampling loop to edit images: we add some noise, then use the diffusion model to project the noisy image back onto the natural manifold of images. Below are my results.
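A sketch of this procedure (the SDEdit idea), assuming a strided `timesteps` schedule with index 0 the noisiest, and with the denoising loop passed in as a stub:

```python
import numpy as np

rng = np.random.default_rng(23)
alphas_cumprod = np.linspace(0.9999, 0.0001, 1000)  # toy schedule
timesteps = np.arange(990, -1, -30)                 # strided: noisiest first

def sdedit(x0, starting_index, denoise_from):
    # noise the input up to timesteps[starting_index], then hand it to the
    # usual iterative-denoising loop; a smaller starting_index means more
    # noise and therefore a larger edit
    t = timesteps[starting_index]
    abar = alphas_cumprod[t]
    x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * rng.standard_normal(x0.shape)
    return denoise_from(x_t, starting_index)

# with an identity "denoiser" stub, the residual noise level shrinks
# as starting_index grows
x0 = np.zeros((64, 64))
heavy = sdedit(x0, 1, lambda x, i: x)
light = sdedit(x0, 20, lambda x, i: x)
```

This matches the captions below: starting_index = 1 barely resembles the input, while starting_index = 20 is nearly faithful to it.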

Original Images

Original Campanile
Original JJ McCarthy
Original Frog

Image to Image translation of Campanile

Campanile at starting_index = 1
Campanile at starting_index = 3
Campanile at starting_index = 5
Campanile at starting_index = 7
Campanile at starting_index = 10
Campanile at starting_index = 20

Image to Image translation of JJ McCarthy

JJ McCarthy at starting_index = 1
JJ McCarthy at starting_index = 3
JJ McCarthy at starting_index = 5
JJ McCarthy at starting_index = 7
JJ McCarthy at starting_index = 10
JJ McCarthy at starting_index = 20

Image to Image translation of Mr. Frog

Frog at starting_index = 1
Frog at starting_index = 3
Frog at starting_index = 5
Frog at starting_index = 7
Frog at starting_index = 10
Frog at starting_index = 20

Hand Drawn Images

We can apply a similar process to the previous part to project hand-drawn images onto what the model thinks is the natural manifold of images.

Minecraft

Minecraft Original
SDEdit of Minecraft starting_index = 1
SDEdit of Minecraft starting_index = 3
SDEdit of Minecraft starting_index = 5
SDEdit of Minecraft starting_index = 7
SDEdit of Minecraft starting_index = 10
SDEdit of Minecraft starting_index = 20

Football

Football Original
SDEdit of Football starting_index = 1
SDEdit of Football starting_index = 3
SDEdit of Football starting_index = 5
SDEdit of Football starting_index = 7
SDEdit of Football starting_index = 10
SDEdit of Football starting_index = 20

Beach

Beach Original
SDEdit of Beach starting_index = 1
SDEdit of Beach starting_index = 3
SDEdit of Beach starting_index = 5
SDEdit of Beach starting_index = 7
SDEdit of Beach starting_index = 10
SDEdit of Beach starting_index = 20

Image Inpainting

This diffusion model sampling is very powerful. We can use it to fill in masked-out regions of an image. Below are some examples.
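A sketch of the key inpainting trick: after every denoising step, everything outside the mask is overwritten with the original image noised to the current timestep, so only the masked region is actually generated.

```python
import numpy as np

def force_known(x_t, x_orig, mask, abar_t, rng=np.random.default_rng(23)):
    # keep the model's output where mask == 1 (region to generate), and
    # overwrite the rest with the original image noised to timestep t so
    # the known pixels stay consistent with the current noise level
    noised_orig = (np.sqrt(abar_t) * x_orig
                   + np.sqrt(1 - abar_t) * rng.standard_normal(x_orig.shape))
    return mask * x_t + (1 - mask) * noised_orig
```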

Image Inpainting of Campanile

Original Campanile
Mask of Campanile
Part of Image to Replace
Final Result

Image Inpainting of Lebron's Block

Original Lebron Block
Mask of Lebron Block
Part of Image to Replace
Final Result

Image Inpainting of SF Skyline

Original Skyline
Mask of Skyline
Part of Image to Replace
Final Result

Text Conditioned Image to Image

We can also add text conditioning to change one image into another based on a text prompt.

Campanile onto "a rocket ship"

Original Campanile
Campanile at starting_index = 1
Campanile at starting_index = 3
Campanile at starting_index = 5
Campanile at starting_index = 7
Campanile at starting_index = 10
Campanile at starting_index = 20

Lebron's Block onto "an oil painting of a snowy mountain village"

Original Lebron Block
Lebron Block at starting_index = 1
Lebron Block at starting_index = 3
Lebron Block at starting_index = 5
Lebron Block at starting_index = 7
Lebron Block at starting_index = 10
Lebron Block at starting_index = 20

Just a Chill Guy onto "a man wearing a hat"

Original Chill Guy
Chill Guy at starting_index = 1
Chill Guy at starting_index = 3
Chill Guy at starting_index = 5
Chill Guy at starting_index = 7
Chill Guy at starting_index = 10
Chill Guy at starting_index = 20

1.8 Visual Anagrams

We can use diffusion models to do some other cool things as well. At each iterative denoising step, we combine the noise estimate for the image right side up under one prompt with the noise estimate for the image upside down under a second prompt (flipped back before averaging). The generated images look like one text prompt viewed upright and a different one viewed upside down.
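A sketch of the combined noise estimate, with a stub standing in for the diffusion model:

```python
import numpy as np

def anagram_eps(model, x_t, prompt_up, prompt_down):
    # estimate noise for the upright image under one prompt, and for the
    # flipped image under the other; flip the second estimate back and average
    eps_up = model(x_t, prompt_up)
    eps_down = np.flipud(model(np.flipud(x_t), prompt_down))
    return (eps_up + eps_down) / 2.0
```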

Visual Anagrams

An Oil Painting of People around a Campfire
An Oil Painting of an Old Man
A (rather disturbing) Photo of a man
A photo of a dog
An olympics medal ceremony
A photo of the swiss alps

1.9 Hybrid Images

Similar to the previous part, we can use the diffusion model to generate images that look like one prompt up close and like a different prompt from farther away. This is achieved by low-pass filtering one noise estimate and high-pass filtering the other to get a hybrid image, similar to what we did in Project 2.
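A sketch of the hybrid noise estimate (the Gaussian low-pass here is hand-rolled; the high-pass is just the residual after low-passing):

```python
import numpy as np

def lowpass(img, sigma=2.0):
    # separable Gaussian low-pass filter
    x = np.arange(-6, 7)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def hybrid_eps(eps_far, eps_near):
    # low frequencies (what you see from far away) from one prompt's
    # estimate, high frequencies (what you see up close) from the other's
    return lowpass(eps_far) + (eps_near - lowpass(eps_near))
```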

Hybrid Images

A Hybrid of a skull and a waterfall
A Hybrid of a rocket ship and a pencil
A Hybrid of a dog and the Amalfi coast

Part B: Diffusion Models from Scratch

Step 1: Training the Unconditional UNet

Noise Addition

Below are examples of images noised with varying values of sigma.

Noised with sigma = 0
Noised with sigma = 0.2
Noised with sigma = 0.4
Noised with sigma = 0.5
Noised with sigma = 0.6
Noised with sigma = 0.8
Noised with sigma = 1.0
Noised with sigma = 0
Noised with sigma = 0.2
Noised with sigma = 0.4
Noised with sigma = 0.5
Noised with sigma = 0.6
Noised with sigma = 0.8
Noised with sigma = 1.0
Noised with sigma = 0
Noised with sigma = 0.2
Noised with sigma = 0.4
Noised with sigma = 0.5
Noised with sigma = 0.6
Noised with sigma = 0.8
Noised with sigma = 1.0
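Unlike Part A's forward process, this noising does not scale the signal; it is simply z = x + sigma * eps. A sketch:

```python
import numpy as np

def add_noise(x, sigma, rng=np.random.default_rng(23)):
    # Part B noising: z = x + sigma * eps, with eps ~ N(0, I)
    return x + sigma * rng.standard_normal(x.shape)

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
digit = np.zeros((28, 28))                  # stand-in for an MNIST digit
noised = [add_noise(digit, s) for s in sigmas]
```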

Here is the loss curve from training the unconditional UNet.

Loss graph of training the model

First Epoch Output

Here are the results after the first epoch

Input image
Noised with sigma = 0.5
Denoised by model
Input image
Noised with sigma = 0.5
Denoised by model
Input image
Noised with sigma = 0.5
Denoised by model
Input image
Noised with sigma = 0.5
Denoised by model

Fifth Epoch Output

Here are the results after 5 epochs

Input image
Noised with sigma = 0.5
Denoised by model
Input image
Noised with sigma = 0.5
Denoised by model
Input image
Noised with sigma = 0.5
Denoised by model
Input image
Noised with sigma = 0.5
Denoised by model

Out-of-Distribution Outputs

Here are the results after 5 epochs on out-of-distribution noise levels (the denoiser was trained only on sigma = 0.5).

Set 1

Noised with sigma = 0
Noised with sigma = 0.2
Noised with sigma = 0.4
Noised with sigma = 0.5
Noised with sigma = 0.6
Noised with sigma = 0.8
Noised with sigma = 1.0
Denoised with sigma = 0.0
Denoised with sigma = 0.2
Denoised with sigma = 0.4
Denoised with sigma = 0.5
Denoised with sigma = 0.6
Denoised with sigma = 0.8
Denoised with sigma = 1.0

Set 2

Noised with sigma = 0
Noised with sigma = 0.2
Noised with sigma = 0.4
Noised with sigma = 0.5
Noised with sigma = 0.6
Noised with sigma = 0.8
Noised with sigma = 1.0
Denoised with sigma = 0.0
Denoised with sigma = 0.2
Denoised with sigma = 0.4
Denoised with sigma = 0.5
Denoised with sigma = 0.6
Denoised with sigma = 0.8
Denoised with sigma = 1.0

Set 3

Noised with sigma = 0
Noised with sigma = 0.2
Noised with sigma = 0.4
Noised with sigma = 0.5
Noised with sigma = 0.6
Noised with sigma = 0.8
Noised with sigma = 1.0
Denoised with sigma = 0.0
Denoised with sigma = 0.2
Denoised with sigma = 0.4
Denoised with sigma = 0.5
Denoised with sigma = 0.6
Denoised with sigma = 0.8
Denoised with sigma = 1.0

Step 2: Training the Time-Dependent UNet

What's new

Here we are going to train an iterative denoiser, i.e. a diffusion model whose UNet is conditioned on the timestep t. Below is my loss curve.
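A sketch of a single time-conditioned training step, with a stub in place of the UNet (the real model takes the normalized timestep as an extra input):

```python
import numpy as np

rng = np.random.default_rng(23)
alphas_cumprod = np.linspace(0.9999, 0.0001, 300)   # toy schedule, T = 300

def train_step(model, x0):
    # sample a random timestep, noise x0 to it, and score the model's
    # noise prediction against the true noise with MSE
    t = int(rng.integers(len(alphas_cumprod)))
    eps = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps
    eps_hat = model(x_t, t / len(alphas_cumprod))   # timestep normalized to [0, 1)
    return np.mean((eps_hat - eps) ** 2)

# stub "model" that always predicts zero noise
loss = train_step(lambda x, t: np.zeros_like(x), np.zeros((28, 28)))
```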

Loss Curve

Here are outputs after 5 epochs

Output 0
Output 1
Output 2
Output 3
Output 4
Output 5
Output 6
Output 7
Output 8
Output 9

Here are outputs after 20 epochs

Output 0
Output 1
Output 2
Output 3
Output 4
Output 5
Output 6
Output 7
Output 8
Output 9

Step 3: Training the Class-Dependent UNet

Class Dependent

Here we are going to train an iterative denoiser that is also conditioned on the digit class. Below is my loss curve.
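A sketch of how the class conditioning can be prepared: one-hot vectors with a fraction dropped to all-zeros so the model also learns the unconditional distribution, which is needed for CFG at sampling time (the 10% drop rate here is an assumption, not taken from my runs):

```python
import numpy as np

def class_vectors(labels, num_classes=10, p_uncond=0.1,
                  rng=np.random.default_rng(23)):
    # one-hot encode the digit labels, then zero out a fraction of them so
    # some training examples carry no class information at all
    c = np.eye(num_classes)[labels]
    keep = (rng.random(len(labels)) > p_uncond).astype(float)
    return c * keep[:, None]

c = class_vectors(np.array([3, 1, 4, 1, 5, 9, 2, 6]))
```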

Loss Curve

Here are outputs after 5 epochs

Here are outputs after 20 epochs

Bells and Whistles: GIFs