CS 180: Computer Vision, Fall 2024

Project 5: Diffusion

Rohan Gulati

SID: 3037864000

In this project, we implement diffusion using deep learning in PyTorch. First, in Part A, we implement a variety of algorithms that sample from the distribution of real images using Stability AI's DeepFloyd IF model, including inpainting, image-to-image translation, visual anagrams, and generating new images altogether. In Part B, we build and train a UNet architecture from scratch to generate handwritten digits in the style of MNIST.

Part A: Denoising Abilities of DeepFloyd IF

Model outputs for each prompt, at varying numbers of inference steps:

Num Inference Steps = 10

Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat

Num Inference Steps = 20

Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat

Num Inference Steps = 40

Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat

When we vary the number of inference steps, we see a clear difference in output quality: more inference steps yield more detailed, realistic textures, while fewer steps yield smoother, more cartoon-like images.
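For reference, here is a minimal sketch of how this sweep might look with the DeepFloyd IF stage-1 pipeline through Hugging Face diffusers; the checkpoint name and call pattern follow the public diffusers documentation rather than the project's starter notebook, so treat the details as assumptions.

```python
import torch
from diffusers import DiffusionPipeline

# Load stage 1 of DeepFloyd IF (assumed checkpoint name from the model hub).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()

prompt = "an oil painting of a snowy mountain village"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# Sweep the number of inference steps; more steps means more denoising iterations.
for num_steps in (10, 20, 40):
    image = stage_1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_inference_steps=num_steps,
        generator=torch.manual_seed(180),
    ).images[0]
    image.save(f"village_{num_steps}_steps.png")
```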

Forward Process - Noising

Noise Level: 250
Noise Level: 500
Noise Level: 750

Here we noise the images by taking a weighted combination of the original image and random Gaussian noise, with the weights determined by the timestep. The results are shown above at different noise levels.
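Concretely, the forward process uses the cumulative noise schedule alpha_bar_t. Below is a minimal sketch, where alphas_cumprod stands in for whatever schedule tensor the model exposes.

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image im (x_0) to timestep t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
    return x_t, eps
```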

Classical Gaussian Denoising

Noise Level: 250
Noise Level: 250, Denoised
Noise Level: 500
Noise Level: 500, Denoised
Noise Level: 750
Noise Level: 750, Denoised

Here we applied Gaussian low-pass filtering to suppress the high-frequency noise, at the cost of also blurring away the image's own high-frequency detail.
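A minimal sketch of this classical baseline using torchvision's Gaussian blur; the kernel size and sigma shown are illustrative choices, not the exact values used.

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=7, sigma=2.0):
    # Low-pass filter the noisy image; this blurs away high-frequency noise,
    # but also removes the image's own fine detail.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```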

One Step Denoising

Here we used the DeepFloyd UNet to estimate the noise in the image, then inverted the forward equation to remove it and estimate the clean image in a single step.
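The formula simply rearranges the forward equation: given the noise estimate eps_hat, the clean image estimate is x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_bar_t). A sketch, assuming unet(x_t, t) returns the predicted noise (the real DeepFloyd UNet also takes text embeddings and returns extra variance channels that get split off):

```python
def one_step_denoise(unet, x_t, t, alphas_cumprod):
    """Estimate the clean image x_0 from the noisy image x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = unet(x_t, t)                                   # predicted noise
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```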

Iterative Denoising

Time step: 690
Time step: 540
Time step: 390
Time step: 240
Time step: 90

Here we applied the iterative DDPM denoising process, removing a small amount of noise at each of many timesteps, so that even pure noise can be turned into a sample from the distribution of real images.
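One denoising step moves from timestep t to a less noisy timestep t' < t by blending the current clean-image estimate with x_t. A sketch of the interpolation used (the full project code also adds a predicted-variance noise term, omitted here):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """One DDPM step from timestep t to a less-noisy timestep t_prime < t.
    x0_hat is the current clean-image estimate (e.g. from one_step_denoise)."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha = alpha_bar_t / alpha_bar_tp        # effective alpha for this stride
    beta = 1 - alpha
    x_tp = (alpha_bar_tp.sqrt() * beta / (1 - alpha_bar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t
    return x_tp
```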

Original Campanile
One Step & Gaussian Denoising Comparison

Diffusion Sampling

Above are example images sampled with the default DDPM process, starting from pure noise with no guidance. The results look more like what an alien would imagine our world to be, so there is clearly room to improve their quality.

Classifier Free Guidance

We applied Classifier-Free Guidance (CFG), which at each timestep amplifies the difference between the noise estimate conditioned on the prompt and the estimate for a baseline (empty) prompt. Exaggerating the conditional direction in this way leads to noticeably higher-quality images.
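In code, CFG runs the UNet once with the conditional embeddings and once with the unconditional ones, then extrapolates past the conditional estimate; gamma > 1 strengthens the prompt. A sketch, assuming a wrapper unet(x, t, embeds) that returns the predicted noise (gamma = 7 is an illustrative scale):

```python
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: amplify the conditional direction."""
    eps_cond = unet(x_t, t, cond_embeds)
    eps_uncond = unet(x_t, t, uncond_embeds)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```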

Image to Image

Campanile
Campanile Translated
Dog
Dog Translated
Golden Gate
Golden Gate Translated

By adding various levels of noise to an image (using the forward process above) and then denoising, we can have the model try to recover the original. Since the model is not perfect, we get translated images that only resemble the input. With more noise, the results drift further from the original, since less of its information survives.
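This is the SDEdit idea: noise the real image to some intermediate timestep, then run the iterative denoiser from there. A sketch reusing the helpers above; strided_timesteps is assumed to run from the noisiest timestep down toward 0, and a smaller i_start means more noise is added.

```python
def image_to_image(im, i_start, strided_timesteps, alphas_cumprod, unet):
    """SDEdit-style translation: noise a real image to an intermediate
    timestep, then iteratively denoise it back toward t = 0."""
    t = strided_timesteps[i_start]
    x_t, _ = forward(im, t, alphas_cumprod)          # corrupt the real image
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]
        x0_hat = one_step_denoise(unet, x_t, t, alphas_cumprod)
        x_t = iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod)
    return x_t
```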

Editing Hand Drawn

Spongebob Side
Spongebob Side Edited
Hand Drawn 2: Tree
Tree Edited
Hand Drawn 1: Car
Car Edited

Similarly, we can apply image-to-image translation to our own images. I tried to hand-draw a car... I think I did a good job on the tree. The diffusion model was able to recover the image and add a little bit of realism and detail.

Inpainting

Campanile
Mask
Inpaint
Cat
Mask
Inpaint
Penguin
Mask
Inpaint

By masking out part of the image, we can keep the unmasked region fixed while diffusing only the masked portion, which gives some interesting results.
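At every denoising step, the pixels outside the mask are reset to an appropriately noised copy of the original image, so only the masked region is actually generated. A sketch, where mask = 1 marks the region to regenerate:

```python
def inpaint_step(x_t, t, orig_im, mask, alphas_cumprod):
    """After each denoising update, force pixels outside the mask to match
    the original image noised to the current timestep."""
    noised_orig, _ = forward(orig_im, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * noised_orig
```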

Text-Conditional Image-to-Image Translation

Campanile
Campanile + "a rocket ship"
Telegraph
Telegraph + "an oil painting of a snowy mountain village"
Luigi
Luigi + "a photo of a hipster barista"

After adding noise, we can also guide the denoising with a new prompt to control what kind of similar image we get back. Above are examples of images that were noised and then diffused back under the associated prompt, giving results that resemble the original while following the text.

Visual Anagrams

Old Man / People around a Fire
Skull / Waterfall
Rocketship / Dog

While diffusing, we can manipulate the trajectory by simultaneously computing the noise estimate for the image right side up under one prompt and upside down under another. By averaging the former with the flipped version of the latter, we create effects like those above, where the image reads differently right side up and upside down.
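Concretely, at each step the two noise estimates are combined as follows (a sketch, assuming a noise-prediction helper like cfg_noise_estimate above and treating "upside down" as a vertical flip):

```python
import torch

def anagram_noise(x_t, t, embeds_upright, embeds_flipped, noise_fn):
    """Visual anagram: one prompt governs the upright view, another the
    upside-down view. noise_fn(x, t, embeds) returns a noise estimate."""
    eps_1 = noise_fn(x_t, t, embeds_upright)
    x_flipped = torch.flip(x_t, dims=[-2])                 # flip upside down
    eps_2 = torch.flip(noise_fn(x_flipped, t, embeds_flipped), dims=[-2])
    return 0.5 * (eps_1 + eps_2)
```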

Hybrid Images

Skull from far, Waterfall from close
Dog from Far, People around Fire from Close
Rocketship from Far, Village Rooftop from Close

We can also create hybrid images. By manipulating the trajectory to low-pass one prompt's noise estimate and high-pass the other's, we can generate samples whose low frequencies are controlled by one prompt and whose high frequencies are controlled by the other. As a result, the image looks different depending on how far away you view it from.
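Each step's noise estimate is assembled from the low frequencies of one prompt's estimate and the high frequencies of the other's. A sketch using a Gaussian blur as the low-pass filter; the kernel size and sigma are illustrative.

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, embeds_far, embeds_close, noise_fn,
                 kernel_size=33, sigma=2.0):
    """Hybrid image: low frequencies follow one prompt (visible from far away),
    high frequencies follow the other (visible up close)."""
    eps_far = noise_fn(x_t, t, embeds_far)
    eps_close = noise_fn(x_t, t, embeds_close)
    low = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    high = eps_close - TF.gaussian_blur(eps_close, kernel_size=kernel_size, sigma=sigma)
    return low + high
```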

Part B: Implementing Diffusion from Scratch

Unconditioned UNet

Noised MNIST

Above, we took a sample image from MNIST and applied various levels of noise to it. We generated a matrix of Gaussian noise, scaled it by sigma, and added it to the original to get the results above. This helps visualize the denoising process we want to implement.
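The noising here is simpler than the diffusion forward process: we just add scaled Gaussian noise. A minimal sketch over the noise levels visualized above:

```python
import torch

def add_noise(x, sigma):
    # z = x + sigma * eps, eps ~ N(0, I); larger sigma means heavier corruption
    return x + sigma * torch.randn_like(x)

# e.g. sweep noise levels on one MNIST image x with values in [0, 1]:
# for sigma in (0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0): show(add_noise(x, sigma))
```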

Training Loss

We create an unconditional UNet. The UNet has an autoencoder-like architecture that downsamples the input before upsampling it back, but it also passes feature maps from the downsampling path directly to the upsampling path via skip connections, preserving the high-frequency information needed to produce detailed images. Above is the training loss for the denoising objective.
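Training minimizes an L2 loss between the UNet's output and the clean image, with noisy inputs generated on the fly. A sketch of the loop; the UnconditionalUNet class is the architecture described above (definition omitted), and the hyperparameters shown are illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    data = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=256, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)           # noisy input
            loss = nn.functional.mse_loss(model(z), x)    # L2 to the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
```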

Denoising

After 1 Epoch
After 5 Epochs

Above is how the UNet performs at denoising after 1 epoch and after 5 epochs. The left column is the original, the middle is the noisy input, and the last column is the model's reconstruction.

Out of Distribution Testing

This model was trained to denoise images at sigma = 0.5. As a result, it can behave in unexpected ways with different levels of noise as shown above.

Time Conditioned UNet

Training Loss

We implement a time-conditioned UNet. By injecting the current timestep into the UNet, a single model can denoise at any noise level. To do this, we normalize the timestep to [0, 1], embed it, and broadcast it across the channels of the bottleneck, where it strongly influences the upsampling half of the network; we inject it again between the upsampling blocks.
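A sketch of the conditioning block: the normalized scalar t passes through a small fully connected network and is broadcast over the spatial dimensions of the feature map it modulates. Layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Embed a normalized scalar timestep and add it to a feature map."""
    def __init__(self, out_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(), nn.Linear(hidden, out_channels)
        )

    def forward(self, t, feature_map):
        # t: (B, 1) in [0, 1]; reshape to (B, C, 1, 1) so it broadcasts over H x W
        emb = self.net(t).unsqueeze(-1).unsqueeze(-1)
        return feature_map + emb   # injected at the bottleneck and after upsampling
```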

Sampling Results

After 5 Epochs
After 20 Epochs

Above are the results of sampling from the model after 5 epochs and after 20 epochs of training. We can see that the model only tries to generate MNIST-like samples, with no notion of the individual digit classes, which leads to some figures that do not look like any particular digit.

Class Conditioned UNet

Training Loss

Now we also inject the class label, the digit value 0-9, by one-hot encoding it and passing it through a linear layer to expand it across multiple channels. Using broadcasting, we multiply this conditioning vector into the unflattened feature map before injecting time, steering the trajectory of the model. Above is the training loss.
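A sketch of the class-conditioning block: the one-hot label goes through a linear network and multiplicatively gates the feature map, while the time embedding is added afterward. Sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConditionBlock(nn.Module):
    """Map a one-hot digit label to per-channel scales that gate a feature map."""
    def __init__(self, num_classes=10, out_channels=64, hidden=64):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.GELU(), nn.Linear(hidden, out_channels)
        )

    def forward(self, labels, feature_map):
        c = F.one_hot(labels, num_classes=self.num_classes).float()  # (B, 10)
        scale = self.net(c).unsqueeze(-1).unsqueeze(-1)              # (B, C, 1, 1)
        return scale * feature_map   # multiply in the class signal, then add time
```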

Sampling Results

After 5 Epochs
After 20 Epochs

Lastly, with this class-conditioned sampling, we get more realistic looking numbers.

Reflection

It was really interesting to use deep learning to sample across different images and create many interesting effects. I also got to understand the nature of solving problems where compute time is not as readily available.