CS 180: Computer Vision, Fall 2024
Project 5: Diffusion
Rohan Gulati
SID: 3037864000
In this project, we implement diffusion using deep learning in PyTorch. First, we implement a variety of algorithms that sample from the distribution of real images using Stability AI's DeepFloyd IF model.
This includes inpainting, image-to-image translation, visual anagrams, and the generation of entirely new images. In Part B,
we build and train a UNet architecture from scratch to generate handwritten digits in the style of MNIST.
Part A: Denoising Abilities of DeepFloyd IF
Prompts and model outputs:
Num Inference Steps = 10
Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat
Num Inference Steps = 20
Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat
Num Inference Steps = 40
Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat
When we vary the number of inference steps, we can see a difference in the quality of the outputs. More inference steps led to
more detailed, realistic textures, while fewer steps led to smoother, more cartoon-like features.
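As a rough illustration, the sweep can be reproduced with the diffusers library; the model id, dtype, and call arguments below are assumptions about that interface, not the exact notebook code used for this project.

    import torch
    from diffusers import DiffusionPipeline
    from torchvision.utils import save_image

    # Stage 1 (64x64 base model) of DeepFloyd IF; assumes the DeepFloyd license has been
    # accepted on Hugging Face and that a GPU is available.
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    ).to("cuda")

    prompt_embeds, negative_embeds = stage_1.encode_prompt("a photo of a rocket ship")

    for steps in (10, 20, 40):
        out = stage_1(
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_embeds,
            num_inference_steps=steps,  # fewer steps tend to give smoother, more cartoon-like samples
            output_type="pt",
        )
        save_image(out.images, f"rocket_{steps}_steps.png")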
Forward Process - Noising
Noise Level: 250
Noise Level: 500
Noise Level: 750
Here we noise each image by taking a weighted combination of the original image and random Gaussian noise; the results at several noise levels are shown above.
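Concretely, this is the standard DDPM forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with eps ~ N(0, I). A minimal sketch, where alphas_cumprod stands for the scheduler's cumulative alpha-bar values:

    import torch

    def forward(x0, t, alphas_cumprod):
        """Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x0)                       # fresh Gaussian noise
        return abar_t ** 0.5 * x0 + (1 - abar_t) ** 0.5 * eps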
Classical Gaussian Denoising
Noise Level: 250
Noise Level: 250, Denoised
Noise Level: 500
Noise Level: 500, Denoised
Noise Level: 750
Noise Level: 750, Denoised
Here we applied a Gaussian blur as a low-pass filter to isolate the lower frequencies of each noisy image; this suppresses some of the high-frequency noise but also blurs away detail.
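A minimal sketch of this baseline using torchvision's Gaussian blur; the kernel size and sigma below are illustrative rather than the exact values used for the figures:

    import torch
    import torchvision.transforms.functional as TF

    noisy_image = torch.rand(3, 64, 64)   # stand-in for one of the noisy Campanile images above
    denoised = TF.gaussian_blur(noisy_image, kernel_size=7, sigma=2.0)  # low-pass filter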
One Step Denoising
Here we used DeepFloyd to estimate the noise in a noisy image, then inverted the forward-process equation to remove that noise and estimate the clean image in a single step.
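In other words, given the model's noise estimate eps_hat at timestep t, we solve the forward equation for the clean image. A short sketch (variable names are my own):

    def one_step_denoise(x_t, eps_hat, abar_t):
        """Estimate the clean image by inverting the forward process at timestep t."""
        # x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)
        return (x_t - (1 - abar_t) ** 0.5 * eps_hat) / abar_t ** 0.5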
Iterative Denoising
Time step: 690
Time step: 540
Time step: 390
Time step: 240
Time step: 90
Here we applied the DDPM diffusion process, iteratively removing a little noise at each of many timesteps, which can turn pure noise into a sample from the distribution of real images.
Original Campanile
One Step & Gaussian Denoising Comparison
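Each iterative-denoising update interpolates between the current noisy image and the one-step clean estimate, then adds back a small amount of noise. A sketch of a single update from timestep t to a less noisy timestep t', with variable names of my own choosing:

    def ddpm_step(x_t, x0_hat, abar_t, abar_tp, v_sigma):
        """Move from timestep t to the less-noisy timestep t' (abar_tp = alpha-bar at t').
        x0_hat is the current clean-image estimate; v_sigma is noise scaled by the predicted variance."""
        alpha = abar_t / abar_tp              # effective alpha for this stride
        beta = 1 - alpha
        return (abar_tp ** 0.5 * beta / (1 - abar_t)) * x0_hat \
             + (alpha ** 0.5 * (1 - abar_tp) / (1 - abar_t)) * x_t \
             + v_sigma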
Diffusion Sampling
Above are example images sampled from the default DDPM process. The images look more like what an alien might imagine our world to be than like real photos, so these samples could be of better quality.
Classifier Free Guidance
We applied classifier-free guidance (CFG), which amplifies the update at each timestep for the prompted (conditional) noise estimate relative to an unconditional baseline estimate. Pushing further in the direction of the prompt leads to higher-quality images.
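Concretely, CFG forms both an unconditional and a conditional noise estimate at each step and extrapolates past the unconditional one. A sketch (the default guidance scale below is illustrative):

    def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
        """Classifier-free guidance: combine unconditional and conditional noise estimates."""
        # gamma = 1 recovers ordinary conditional sampling; gamma > 1 amplifies the prompt.
        return eps_uncond + gamma * (eps_cond - eps_uncond)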
Image to Image
Campanile
Campanile Translated
Dog
Dog Translated
Golden Gate
Golden Gate Translated
By applying various levels of noise to an image (using the forward process above), we can have the model try to recover the original. Since the model is not perfect,
we get translated images that follow our prompt while still resembling the original. With increased levels of noise, the outputs become more and more different from the original, since less information is recoverable.
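A hypothetical sketch of that loop, reusing the forward function from earlier and a stand-in iterative_denoise helper; the starting timesteps below are illustrative:

    def image_to_image(x_orig, alphas_cumprod, start_timesteps=(960, 840, 720, 600, 480, 240)):
        """Noise the original to each starting timestep, then denoise back to a clean image."""
        edits = []
        for t in start_timesteps:
            x_t = forward(x_orig, t, alphas_cumprod)   # more information is destroyed for larger t
            edits.append(iterative_denoise(x_t, t))    # stand-in for the iterative loop above
        return edits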
Editing Hand Drawn
Spongebob Side
Spongebob Side Edited
Hand Drawn 2: Tree
Tree Edited
Hand Drawn 1: Car
Car Edited
Similarly, we can apply this image-to-image translation to our own images. I tried to hand draw a car... I think I did a good job on the tree.
The diffusion model was able to recover each image and add a bit of realism and detail.
Inpainting
Campanile
Mask
Inpaint
Cat
Mask
Inpaint
Penguin
Mask
Inpaint
By applying a mask, we can keep part of the image static while diffusing only the remaining portion, which gives some interesting results.
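A hypothetical sketch of the constraint applied after each denoising step, reusing the forward function from earlier; here mask is 1 where new content may be generated and 0 where the original should be kept:

    def apply_inpaint_mask(x_t, x_orig, mask, t, alphas_cumprod):
        """Pin the unmasked region to the original image, noised to the current timestep."""
        return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)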
Text-Conditional Image-to-Image Translation
Campanile
Campanile + "a rocket ship"
Telegraph
Telegraph + "an oil painting of a snowy mountain village"
Luigi
Luigi + "a photo of a hipster barista"
After we add noise, we can also guide the denoising with a new prompt to control what kind of similar image we get. Above are examples of
images that were noised and then diffused back under the associated prompt, producing results that are similar to the original yet clearly shaped by the text.
Visual Anagrams
Old Man / People around a Fire
Skull / Waterfall
Rocketship / Dog
While diffusing, we can manipulate the trajectory by simultaneously computing the noise estimate for the image right side up under one prompt and upside down under another.
By appropriately weighting the former's noise estimate together with the flipped version of the latter's, we can create effects like those above, where
the image looks different right side up versus upside down.
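A hypothetical sketch of one such step, where noise_model stands in for the prompt-conditioned DeepFloyd noise estimator and the two estimates are simply averaged:

    import torch

    def anagram_noise(noise_model, x_t, t, prompt1_embeds, prompt2_embeds):
        """Average a right-side-up noise estimate with an upside-down one (flipped back)."""
        eps1 = noise_model(x_t, t, prompt1_embeds)                         # right side up, prompt 1
        eps2 = noise_model(torch.flip(x_t, dims=[-2]), t, prompt2_embeds)  # upside down, prompt 2
        return (eps1 + torch.flip(eps2, dims=[-2])) / 2                    # flip back, then average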
Hybrid Images
Skull from far, Waterfall from close
Dog from far, People around Fire from close
Rocketship from far, Village Rooftop from close
We can also create hybrid images. By manipulating the trajectory to low-pass filter one prompt's noise estimate and high-pass filter the other's, we can
generate samples where the low frequencies are controlled by one prompt and the high frequencies by the other. As a result,
the image looks different depending on how far away you view it from.
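A hypothetical sketch of one hybrid step, again with noise_model standing in for the prompt-conditioned noise estimator; the blur parameters are illustrative:

    import torchvision.transforms.functional as TF

    def hybrid_noise(noise_model, x_t, t, low_prompt_embeds, high_prompt_embeds):
        """Take the low frequencies of one prompt's noise estimate and the high frequencies of another's."""
        eps_low = noise_model(x_t, t, low_prompt_embeds)
        eps_high = noise_model(x_t, t, high_prompt_embeds)
        blur = lambda e: TF.gaussian_blur(e, kernel_size=33, sigma=2.0)   # low-pass filter
        return blur(eps_low) + (eps_high - blur(eps_high))                # low-pass + high-pass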
Part B: Implementing Diffusion from Scratch
Unconditioned UNet
Noised MNIST
Above, we took a sample image from MNIST and applied various levels of noise to it. We generated a matrix of Gaussian noise,
scaled it by sigma, and added it to the original to get the results above. This helps visualize the denoising task we want the network to solve.
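In code, the noising is a single line; a minimal sketch:

    import torch

    x = torch.rand(1, 28, 28)             # stand-in for a clean MNIST digit in [0, 1]
    sigma = 0.5                           # noise level; the figure above sweeps several values
    z = x + sigma * torch.randn_like(x)   # scaled Gaussian noise added to the original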
Training Loss
We create an unconditional UNet. The UNet has an autoencoder-like architecture that downsamples information before upsampling it back, but it also
sends high-frequency information from the downsampling path directly to the upsampling path via skip connections, preserving the detail needed to
create sharp images. Above is the training loss for the denoising objective.
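A minimal sketch of the training loop, assuming unet maps a noisy image directly back to a clean one and dataloader yields MNIST batches; the optimizer settings are illustrative:

    import torch
    import torch.nn.functional as F

    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
    sigma = 0.5                                   # training noise level used in this part
    for x, _ in dataloader:                       # labels are unused by the unconditional model
        z = x + sigma * torch.randn_like(x)       # noise the clean batch
        loss = F.mse_loss(unet(z), x)             # L2 loss between the reconstruction and the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()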
Denoising
After 1 Epoch
After 5 Epochs
Above is how the UNet performed at denoising after 1 epoch and after 5 epochs. The left column is the original image, the middle column has
noise added, and the right column is the model's reconstruction.
Out of Distribution Testing
This model was trained to denoise images at sigma = 0.5. As a result, it can behave in unexpected ways at other noise levels, as shown above.
Time Conditioned UNet
Training Loss
We implement a time-conditioned UNet. By injecting the current timestep of the denoising process into the UNet, a single model can
denoise at many different noise levels. To do this, we normalize the time to [0, 1], embed it, and broadcast it across the channels of the bottleneck,
so it strongly influences the upsampling half of the network; we inject it again partway through the upsampling path.
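A hypothetical sketch of the conditioning block and where it is applied; the exact layer sizes and combination rule are assumptions:

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        """Small MLP that embeds the normalized timestep t in [0, 1]."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                     nn.Linear(out_dim, out_dim))

        def forward(self, t):
            return self.net(t)

    # Inside the UNet forward pass (sketch): t has shape (B, 1), bottleneck has shape (B, C, H, W).
    #   t_emb = fc_time(t).view(-1, C, 1, 1)        # fc_time = FCBlock(1, C)
    #   bottleneck = bottleneck + t_emb             # first injection, before upsampling
    #   up1 = up1 + fc_time2(t).view(-1, C2, 1, 1)  # second injection, between upsampling blocks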
Sampling Results
After 5 Epochs
After 20 Epochs
Above are the results of the model denoising at random t values after 5 epochs and after 20 epochs.
We can see that the model only knows how to generate MNIST-like samples, with no understanding of the individual digit labels,
which leads to some figures that do not look like digits.
Class Conditioned UNet
Training Loss
Now, we also inject the class label, i.e. the digit value, by one-hot encoding the label (0-9) and applying a Linear layer to
expand it across multiple channels. Using broadcasting, we multiply this embedding with the unflattened bottleneck before injecting time,
steering the trajectory of the model toward the requested digit. Above is the training loss.
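A hypothetical sketch of that injection; fc_class, the channel count, and the example labels below are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    fc_class = nn.Linear(10, 64)                    # map the 10-way one-hot label to C = 64 channels
    labels = torch.tensor([3, 7])                   # example digit labels for a batch of 2
    c = F.one_hot(labels, num_classes=10).float()   # (B, 10) one-hot encoding
    c_emb = fc_class(c).view(-1, 64, 1, 1)          # broadcastable over the spatial dimensions

    # Inside the UNet forward pass (sketch): scale the unflattened bottleneck by the class
    # embedding, then add the time embedding as in the time-conditioned model.
    #   bottleneck = c_emb * bottleneck + t_emb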
Sampling Results
After 5 Epochs
After 20 Epochs
Lastly, with this class-conditioned sampling, we get more realistic-looking numbers.
Reflection
It was really interesting to use deep learning to sample images and create so many different effects. I also got a feel for
solving problems where compute time is not readily available.