CS 180: Computer Vision, Fall 2024
Project 5: Diffusion
Rohan Gulati
SID: 3037864000
In this project, we implement diffusion using deep learning in PyTorch. First, we implement a variety of algorithms that sample from the distribution of real images using Stability AI's DeepFloyd IF model.
This includes inpainting, image-to-image translation, visual anagrams, and the generation of entirely new images. In Part B,
we build and train a UNet architecture from scratch to generate handwritten digits in the style of MNIST.
Part A: Denoising Abilities of DeepFloyd IF
Prompts and model outputs:
Num Inference Steps = 10
Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat
Num Inference Steps = 20
Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat
Num Inference Steps = 40
Prompt 1: an oil painting of a snowy mountain village
Prompt 2: a photo of a rocket ship
Prompt 3: a man wearing a hat
When we vary the number of inference steps, we can see a difference in the quality of the outputs. More inference steps led to
more detailed, realistic textures, while fewer steps led to smoother, more cartoon-like features.
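As a rough illustration, the sweep can be reproduced with the diffusers library; the model id, dtype, and call arguments below are assumptions about that interface, not the exact notebook code used for this project.

    import torch
    from diffusers import DiffusionPipeline
    from torchvision.utils import save_image

    # Stage 1 (64x64 base model) of DeepFloyd IF; assumes the DeepFloyd license has been
    # accepted on Hugging Face and that a GPU is available.
    stage_1 = DiffusionPipeline.from_pretrained(
        "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
    ).to("cuda")

    prompt_embeds, negative_embeds = stage_1.encode_prompt("a photo of a rocket ship")

    for steps in (10, 20, 40):
        out = stage_1(
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_embeds,
            num_inference_steps=steps,  # fewer steps tend to give smoother, more cartoon-like samples
            output_type="pt",
        )
        save_image(out.images, f"rocket_{steps}_steps.png")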
Forward Process - Noising
Noise Level: 250
Noise Level: 500
Noise Level: 750
Here we noise each image by taking a weighted combination of the original image and random Gaussian noise; the results at several noise levels are shown above.
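Concretely, this is the standard DDPM forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with eps ~ N(0, I). A minimal sketch, where alphas_cumprod stands for the scheduler's cumulative alpha-bar values:

    import torch

    def forward(x0, t, alphas_cumprod):
        """Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x0)                       # fresh Gaussian noise
        return abar_t ** 0.5 * x0 + (1 - abar_t) ** 0.5 * eps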
Classical Gaussian Denoising
Noise Level: 250
Noise Level: 250, Denoised
Noise Level: 500
Noise Level: 500, Denoised
Noise Level: 750
Noise Level: 750, Denoised
Here we applied a Gaussian blur as a low-pass filter to isolate the lower frequencies of each noisy image; this suppresses some of the high-frequency noise but also blurs away detail.
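A minimal sketch of this baseline using torchvision's Gaussian blur; the kernel size and sigma below are illustrative rather than the exact values used for the figures:

    import torch
    import torchvision.transforms.functional as TF

    noisy_image = torch.rand(3, 64, 64)   # stand-in for one of the noisy Campanile images above
    denoised = TF.gaussian_blur(noisy_image, kernel_size=7, sigma=2.0)  # low-pass filter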
One Step Denoising
Here we used DeepFloyd to estimate the noise in a noisy image, then inverted the forward-process equation to remove that noise and estimate the clean image in a single step.
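In other words, given the model's noise estimate eps_hat at timestep t, we solve the forward equation for the clean image. A short sketch (variable names are my own):

    def one_step_denoise(x_t, eps_hat, abar_t):
        """Estimate the clean image by inverting the forward process at timestep t."""
        # x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)
        return (x_t - (1 - abar_t) ** 0.5 * eps_hat) / abar_t ** 0.5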
Iterative Denoising
Time step: 690
Time step: 540
Time step: 390
Time step: 240
Time step: 90
Here we applied the DDPM diffusion process, iteratively removing a little noise at each of many timesteps, which can turn pure noise into a sample from the distribution of real images.
Original Campanile
One Step & Gaussian Denoising Comparison
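Each iterative-denoising update interpolates between the current noisy image and the one-step clean estimate, then adds back a small amount of noise. A sketch of a single update from timestep t to a less noisy timestep t', with variable names of my own choosing:

    def ddpm_step(x_t, x0_hat, abar_t, abar_tp, v_sigma):
        """Move from timestep t to the less-noisy timestep t' (abar_tp = alpha-bar at t').
        x0_hat is the current clean-image estimate; v_sigma is noise scaled by the predicted variance."""
        alpha = abar_t / abar_tp              # effective alpha for this stride
        beta = 1 - alpha
        return (abar_tp ** 0.5 * beta / (1 - abar_t)) * x0_hat \
             + (alpha ** 0.5 * (1 - abar_tp) / (1 - abar_t)) * x_t \
             + v_sigma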
Diffusion Sampling
Above are example images sampled from the default DDPM process. The images look more like what an alien might imagine our world to be than like real photos, so these samples could be of better quality.
Classifier Free Guidance
We applied classifier-free guidance (CFG), which amplifies the update at each timestep for the prompted (conditional) noise estimate relative to an unconditional baseline estimate. Pushing further in the direction of the prompt leads to higher-quality images.
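Concretely, CFG forms both an unconditional and a conditional noise estimate at each step and extrapolates past the unconditional one. A sketch (the default guidance scale below is illustrative):

    def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
        """Classifier-free guidance: combine unconditional and conditional noise estimates."""
        # gamma = 1 recovers ordinary conditional sampling; gamma > 1 amplifies the prompt.
        return eps_uncond + gamma * (eps_cond - eps_uncond)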
Image to Image
Campanile
Campanile Translated
Dog
Dog Translated
Golden Gate
Golden Gate Translated
By applying various levels of noise to an image (using the forward process above), we can have the model try to recover the original. Since the model is not perfect,
we get translated images that follow our prompt while still resembling the original. With increased levels of noise, the outputs become more and more different from the original, since less information is recoverable.
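A hypothetical sketch of that loop, reusing the forward function from earlier and a stand-in iterative_denoise helper; the starting timesteps below are illustrative:

    def image_to_image(x_orig, alphas_cumprod, start_timesteps=(960, 840, 720, 600, 480, 240)):
        """Noise the original to each starting timestep, then denoise back to a clean image."""
        edits = []
        for t in start_timesteps:
            x_t = forward(x_orig, t, alphas_cumprod)   # more information is destroyed for larger t
            edits.append(iterative_denoise(x_t, t))    # stand-in for the iterative loop above
        return edits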
Editing Hand Drawn
Spongebob Side
Spongebob Side Edited
Hand Drawn 2: Tree
Tree Edited
Hand Drawn 1: Car
Car Edited
Similarly, we can apply this image-to-image translation to our own images. I tried to hand draw a car... I think I did a good job on the tree.
The diffusion model was able to recover each image and add a bit of realism and detail.
Inpainting
Campanile
Mask
Inpaint
Cat
Mask
Inpaint
Penguin
Mask
Inpaint
By applying a mask, we can keep part of the image static while diffusing only the remaining portion, which gives some interesting results.
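A hypothetical sketch of the constraint applied after each denoising step, reusing the forward function from earlier; here mask is 1 where new content may be generated and 0 where the original should be kept:

    def apply_inpaint_mask(x_t, x_orig, mask, t, alphas_cumprod):
        """Pin the unmasked region to the original image, noised to the current timestep."""
        return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)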
Text-Conditional Image-to-Image Translation
Campanile
Campanile + "a rocket ship"
Telegraph
Telegraph + "an oil painting of a snowy mountain village"
Luigi
Luigi + "a photo of a hipster barista"
After we add noise, we can also guide the denoising with a new prompt to control what kind of similar image we get. Above are examples of
images that were noised and then diffused back under the associated prompt, producing results that are similar to the original yet clearly shaped by the text.
Visual Anagrams
Old Man / People around a Fire
Skull / Waterfall
Rocketship / Dog
While diffusing, we can manipulate the trajectory by simultaneously computing the noise estimate for the image right side up under one prompt and upside down under another.
By appropriately weighting the former's noise estimate together with the flipped version of the latter's, we can create effects like those above, where
the image looks different right side up versus upside down.
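A hypothetical sketch of one such step, where noise_model stands in for the prompt-conditioned DeepFloyd noise estimator and the two estimates are simply averaged:

    import torch

    def anagram_noise(noise_model, x_t, t, prompt1_embeds, prompt2_embeds):
        """Average a right-side-up noise estimate with an upside-down one (flipped back)."""
        eps1 = noise_model(x_t, t, prompt1_embeds)                         # right side up, prompt 1
        eps2 = noise_model(torch.flip(x_t, dims=[-2]), t, prompt2_embeds)  # upside down, prompt 2
        return (eps1 + torch.flip(eps2, dims=[-2])) / 2                    # flip back, then average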
Hybrid Images
Skull from far, Waterfall from close
Dog from far, People around Fire from close
Rocketship from far, Village Rooftop from close
We can also create hybrid images. By manipulating the trajectory to low-pass filter one prompt's noise estimate and high-pass filter the other's, we can
generate samples where the low frequencies are controlled by one prompt and the high frequencies by the other. As a result,
the image looks different depending on how far away you view it from.
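A hypothetical sketch of one hybrid step, again with noise_model standing in for the prompt-conditioned noise estimator; the blur parameters are illustrative:

    import torchvision.transforms.functional as TF

    def hybrid_noise(noise_model, x_t, t, low_prompt_embeds, high_prompt_embeds):
        """Take the low frequencies of one prompt's noise estimate and the high frequencies of another's."""
        eps_low = noise_model(x_t, t, low_prompt_embeds)
        eps_high = noise_model(x_t, t, high_prompt_embeds)
        blur = lambda e: TF.gaussian_blur(e, kernel_size=33, sigma=2.0)   # low-pass filter
        return blur(eps_low) + (eps_high - blur(eps_high))                # low-pass + high-pass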
Part B: Implementing Diffusion from Scratch
Unconditioned UNet
Noised MNIST
Above, we took a sample image from MNIST and applied various levels of noise to it. We generated a matrix of Gaussian noise,
scaled it by sigma, and added it to the original to get the results above. This helps visualize the denoising task we want the network to solve.
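In code, the noising is a single line; a minimal sketch:

    import torch

    x = torch.rand(1, 28, 28)             # stand-in for a clean MNIST digit in [0, 1]
    sigma = 0.5                           # noise level; the figure above sweeps several values
    z = x + sigma * torch.randn_like(x)   # scaled Gaussian noise added to the original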
Training Loss
We create an unconditional UNet. The UNet has an autoencoder-like architecture that downsamples information before upsampling it back, but it also
sends high-frequency information from the downsampling path directly to the upsampling path via skip connections, preserving the detail needed to
create sharp images. Above is the training loss for the denoising objective.
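A minimal sketch of the training loop, assuming unet maps a noisy image directly back to a clean one and dataloader yields MNIST batches; the optimizer settings are illustrative:

    import torch
    import torch.nn.functional as F

    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
    sigma = 0.5                                   # training noise level used in this part
    for x, _ in dataloader:                       # labels are unused by the unconditional model
        z = x + sigma * torch.randn_like(x)       # noise the clean batch
        loss = F.mse_loss(unet(z), x)             # L2 loss between the reconstruction and the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()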
Denoising
After 1 Epoch
After 5 Epochs
Above is how the UNet performed at denoising after 1 epoch and after 5 epochs. The left column is the original image, the middle column has
noise added, and the right column is the model's reconstruction.
Out of Distribution Testing
This model was trained to denoise images at sigma = 0.5. As a result, it can behave in unexpected ways at other noise levels, as shown above.
Time Conditioned UNet
Training Loss
We implement a time-conditioned UNet. By injecting the current timestep of the denoising process into the UNet, a single model can
denoise at many different noise levels. To do this, we normalize the time to [0, 1], embed it, and broadcast it across the channels of the bottleneck,
so it strongly influences the upsampling half of the network; we inject it again partway through the upsampling path.
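A hypothetical sketch of the conditioning block and where it is applied; the exact layer sizes and combination rule are assumptions:

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        """Small MLP that embeds the normalized timestep t in [0, 1]."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                     nn.Linear(out_dim, out_dim))

        def forward(self, t):
            return self.net(t)

    # Inside the UNet forward pass (sketch): t has shape (B, 1), bottleneck has shape (B, C, H, W).
    #   t_emb = fc_time(t).view(-1, C, 1, 1)        # fc_time = FCBlock(1, C)
    #   bottleneck = bottleneck + t_emb             # first injection, before upsampling
    #   up1 = up1 + fc_time2(t).view(-1, C2, 1, 1)  # second injection, between upsampling blocks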
Sampling Results
After 5 Epochs
After 20 Epochs
Above are the results of the model denoising at random t values after 5 epochs and after 20 epochs.
We can see that the model only knows how to generate MNIST-like samples, with no understanding of the individual digit labels,
which leads to some figures that do not look like digits.
Class Conditioned UNet
Training Loss
Now, we also inject the class label, i.e. the digit value, by one-hot encoding the label (0-9) and applying a Linear layer to
expand it across multiple channels. Using broadcasting, we multiply this embedding with the unflattened bottleneck before injecting time,
steering the trajectory of the model toward the requested digit. Above is the training loss.
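A hypothetical sketch of that injection; fc_class, the channel count, and the example labels below are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    fc_class = nn.Linear(10, 64)                    # map the 10-way one-hot label to C = 64 channels
    labels = torch.tensor([3, 7])                   # example digit labels for a batch of 2
    c = F.one_hot(labels, num_classes=10).float()   # (B, 10) one-hot encoding
    c_emb = fc_class(c).view(-1, 64, 1, 1)          # broadcastable over the spatial dimensions

    # Inside the UNet forward pass (sketch): scale the unflattened bottleneck by the class
    # embedding, then add the time embedding as in the time-conditioned model.
    #   bottleneck = c_emb * bottleneck + t_emb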
Sampling Results
After 5 Epochs
After 20 Epochs
Lastly, with this class-conditioned sampling, we get more realistic-looking numbers.
Reflection
It was really interesting to use deep learning to sample images and create so many different effects. I also got a feel for
solving problems where compute time is not readily available.