Varun Bharadwaj - CS 180
For this project we will be using the DeepFloyd IF diffusion model. DeepFloyd is a two-stage model: the first stage produces a 64x64 image, and the second stage uses the output of the first to generate a 256x256 image. Below are the outputs of both stages using 20 and 200 inference steps.
Here are the outputs of the stage 1 model using 20 inference steps.
Here are the outputs of the stage 2 model using 20 inference steps.
Here are the outputs of the stage 1 model using 200 inference steps.
Here are the outputs of the stage 2 model using 200 inference steps.
The stage 1 outputs, as expected, were lower resolution than the stage 2 outputs. However, there was a substantial difference in quality between using 200 inference steps and using only 20. A higher number of inference steps led to images with far more detail, and those small details made the images look much more realistic. The oil painting of a snowy mountain village had building facades with small spots of snow on them. The rocket ship had a more realistic exhaust with a mixture of colors rather than a simple red line, and the rocket even had a small reflection of a celestial body on its window. The biggest difference, however, was with the man wearing a hat. With 200 inference steps, the model generated a very realistic-looking man. With only 20 steps, the man's eyes were clearly misaligned, and his idealized, blemish-free skin gave him an unrealistic look. The hat on his head was also not properly aligned: his head is slightly tilted, but the hat sits perfectly flat. With more inference steps, there was far more detail in the final image. The man has a more realistic beard and skin, the hat fits his head properly and has creases that mimic how a hat actually sits on a head, and smaller details of his bust such as wrinkles, freckles, and eye bags make him look much more realistic.
I used the random seed 23 while running the model.
The forward process takes a clean image and adds noise to it. The amount of noise is controlled by the parameter t: a low t means a less noisy image, and a higher t means a more noisy one. The specific way we add noise to the image is defined by the following equation, where alpha_t is fixed by the developers of DeepFloyd IF and determines the ratio of original image to added noise.
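For reference, in the usual DDPM notation the forward process is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, with epsilon drawn from a standard Gaussian. Below is a minimal sketch of that noising step, assuming the cumulative alpha values come from the DeepFloyd scheduler; the helper name is my own, not the project's API.

```python
import torch

def forward_noise(im, t, alphas_cumprod):
    # im: clean image tensor; t: integer timestep; alphas_cumprod: per-timestep
    # alpha_bar values taken from the DeepFloyd scheduler (an assumption here).
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # Gaussian noise
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps   # noisy image
    return x_t, eps
```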
Previously in the class, we used low-pass filters to remove noise from an image. However, due to the large amount of noise in these images (as is visible in the Campanile noised at t=750), the results are not very good.
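As a point of comparison, the classical denoising here amounts to a Gaussian blur; a minimal sketch (the kernel size and sigma are illustrative values, not necessarily the ones I used):

```python
import torchvision.transforms.functional as TF

# Low-pass "denoising": blur away the high-frequency noise along with the detail.
blurred = TF.gaussian_blur(noisy_im, kernel_size=7, sigma=2.0)
```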
We will now use a pretrained diffusion model to estimate the Gaussian noise and clean up our Campanile image. Below are my results using one-step denoising with the pretrained diffusion model.
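A minimal sketch of the one-step estimate: the UNet predicts the noise, and we solve the forward equation for the clean image (the unet call is simplified relative to the real DeepFloyd API, which also takes text embeddings):

```python
import torch

# One-step denoising: predict the noise, then invert the forward equation.
with torch.no_grad():
    eps_hat = unet(x_t, t)                  # simplified call signature
alpha_bar = alphas_cumprod[t]
x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```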
These diffusion models were not trained to denoise extremely noisy images back to the natural image manifold in one step. They are much better at iteratively denoising images across multiple steps. We can use this to iteratively denoise our noisy images back onto the natural manifold. Below are the results of doing so.
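A minimal sketch of the iterative loop, assuming a strided list of timesteps and schedules taken from the DeepFloyd scheduler (the added-noise/variance term at each step is omitted for brevity):

```python
import torch

for i in range(len(timesteps) - 1):
    t, t_prev = timesteps[i], timesteps[i + 1]      # t > t_prev
    with torch.no_grad():
        eps_hat = unet(x, t)                        # predicted noise (simplified call)
    a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar / a_bar_prev
    beta = 1 - alpha
    # Clean-image estimate from the forward equation.
    x0_hat = (x - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
    # Step partway from the current noisy image toward the clean estimate.
    x = ((a_bar_prev.sqrt() * beta / (1 - a_bar)) * x0_hat
         + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x)
```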
Below is an illustration showing the full iterative denoising process at work.
We can now sample random images from the diffusion model by passing in random noise instead of our original noised Campanile. Below are 5 random outputs.
The quality of the images generated in the previous part was not very good. To improve this, we implement classifier-free guidance (CFG), which allows us to generate higher quality images by computing both a conditional and an unconditional noise estimate and scaling the final noise estimate to be a linear extrapolation of the two. Below are my samples using CFG.
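A minimal sketch of the CFG noise estimate (the unet calls are simplified, and the guidance scale is just an example value):

```python
# Classifier-free guidance: extrapolate past the conditional estimate.
eps_cond = unet(x, t, prompt_embeds)        # conditioned on the text prompt
eps_uncond = unet(x, t, null_embeds)        # conditioned on an empty prompt
cfg_scale = 7.0                             # example value; > 1 strengthens the prompt
eps_hat = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```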
We can use this sampling loop to edit our images by adding some noise and then using the diffusion model to cast the noisy image back onto the natural manifold of images. Below are my results.
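A minimal sketch of this image-to-image editing procedure, reusing the hypothetical forward_noise helper from above and a hypothetical iterative_denoise wrapper around the loop shown earlier:

```python
# Noise the original image to an intermediate timestep, then denoise from there.
# A smaller i_start (more remaining noise) lets the edit stray further from the original.
x = forward_noise(original_im, timesteps[i_start], alphas_cumprod)[0]
edited = iterative_denoise(x, start_index=i_start)      # hypothetical wrapper
```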
We can do a similar process to the previous part in order to project hand-drawn images onto what the model thinks is the natural manifold of images.
This diffusion model sampling is very powerful. We can use it to fill in masked regions of an image (inpainting). Below are some examples.
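A minimal sketch of the inpainting trick: after every denoising step, the pixels outside the mask are forced back to an appropriately noised copy of the original image, so only the masked region is actually generated (here mask is 1 where new content should appear):

```python
# Run inside the iterative denoising loop, after x has been updated for step t_prev.
x_orig_t = forward_noise(original_im, t_prev, alphas_cumprod)[0]
x = mask * x + (1 - mask) * x_orig_t
```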
We can also add text conditioning to change one image into another based on a text prompt.
We can use the diffusion models to do some other cool things as well. At each iterative denoising step, we can combine the noise estimate for the image right side up (using one prompt) with the noise estimate for the image upside down (using a different prompt) to get visual anagrams: images that look like one text prompt when viewed right side up and a different one when flipped upside down.
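A minimal sketch of the per-step combination (unet calls simplified; flipping along the height axis is the upside-down flip):

```python
import torch

# Average the upright estimate for prompt 1 with the flipped-back estimate
# for prompt 2, which is computed on the upside-down image.
eps1 = unet(x, t, prompt1_embeds)
eps2 = torch.flip(unet(torch.flip(x, dims=[-2]), t, prompt2_embeds), dims=[-2])
eps_hat = (eps1 + eps2) / 2
```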
Similarly, we can use the diffusion model to generate images that look like one prompt up close and a different prompt from farther away. This is achieved by low-pass filtering one noise estimate and high-pass filtering the other to build a hybrid image, similar to what we did in Project 2.
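A minimal sketch of the per-step frequency split (unet calls simplified; the Gaussian-blur kernel size and sigma are example values):

```python
import torchvision.transforms.functional as TF

# Keep the low frequencies of one prompt's noise estimate and the high
# frequencies of the other's.
eps1 = unet(x, t, far_prompt_embeds)    # what you see from far away
eps2 = unet(x, t, near_prompt_embeds)   # what you see up close
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
eps_hat = low + high
```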
Below are examples of the noising process at different noise levels.
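Unlike the DDPM forward process above, the noising here is plain additive Gaussian noise; a one-line sketch (assuming the z = x + sigma * eps setup used for the denoiser below):

```python
import torch

z = x + sigma * torch.randn_like(x)   # sigma controls how noisy the digit gets
```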
Here is the loss graph from training the unconditional UNet (a sketch of the training loop follows the results below).
Here are the results after the first epoch.
Here are the results after the first 5 epochs
Here are the results after the first 5 epochs on out-of-distribution data.
Set 1
Set 2
Set 3
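For reference, a minimal sketch of the training loop for this unconditional denoiser (the UNet, data loader, learning rate, and noise level here are assumptions, not necessarily my exact settings):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
for x, _ in loader:                       # MNIST digits; labels are unused here
    z = x + 0.5 * torch.randn_like(x)     # noised input (sigma = 0.5 as an example)
    loss = F.mse_loss(unet(z), x)         # L2 loss against the clean digit
    opt.zero_grad()
    loss.backward()
    opt.step()
```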
Here we are going to train an iterative denoising diffusion model. Below is my loss curve (a sketch of the training step follows the results).
Here are outputs after 5 epochs
Here are outputs after 20 epochs.
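For reference, a minimal sketch of the time-conditioned training step: sample a random timestep, noise the digit with the DDPM forward process, and train the UNet to predict the added noise (the names and the t/T normalization are assumptions, not my exact code):

```python
import torch
import torch.nn.functional as F

for x, _ in loader:
    t = torch.randint(0, T, (x.shape[0],))                # random timestep per digit
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps     # DDPM forward process
    loss = F.mse_loss(unet(x_t, t / T), eps)              # predict the noise
    opt.zero_grad(); loss.backward(); opt.step()
```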
Here we are going to train an iterative denoising diffusion model that is also conditioned on the digit class. Below is my loss curve (a sketch of the conditioning change follows the results).
Here are outputs after 5 epochs
Here are outputs after 20 epochs
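The class-conditioned variant only changes the loss computation inside the previous loop: a one-hot digit label is fed to the UNet and randomly dropped about 10% of the time so the model can still be run unconditionally, which is what makes classifier-free guidance possible at sampling time. A minimal sketch, with the names again being assumptions:

```python
import torch
import torch.nn.functional as F

c = F.one_hot(y, num_classes=10).float()            # digit label as a one-hot vector
drop = (torch.rand(c.shape[0], 1) < 0.1).float()    # drop the label ~10% of the time
c = c * (1 - drop)
loss = F.mse_loss(unet(x_t, t / T, c), eps)         # UNet now also takes the class
```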