In this project, we implement diffusion models using deep learning in PyTorch. In Part A, we use Stability AI's DeepFloyd IF model to implement a variety of algorithms for sampling from the distribution of real images, including inpainting, image-to-image translation, visual anagrams, and the generation of entirely new images. In Part B, we build and train a UNet architecture from scratch to generate handwritten digits in the style of MNIST.
When we vary the number of inference steps, we can see a clear difference in output quality. More inference steps produced more detailed, realistic textures, while fewer steps produced smoother, more cartoon-like results.
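As a rough illustration, the comparison can be reproduced with a sketch like the following, assuming the Hugging Face `diffusers` DeepFloyd IF stage-1 pipeline; the model variant, prompt, and step counts are placeholders, not the exact settings used above.

```python
import torch
from diffusers import DiffusionPipeline

# Load the DeepFloyd IF stage-1 pipeline (model variant is an assumption).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt = "an oil painting of a snowy mountain village"  # placeholder prompt
for steps in (5, 20, 100):  # fewer steps -> smoother, more steps -> more detail
    image = stage_1(prompt=prompt, num_inference_steps=steps).images[0]
    image.save(f"sample_{steps}_steps.png")
```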
Here we noise the images by taking a weighted combination of the original image and random Gaussian noise. Above, the results are shown at several noise levels.
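Concretely, this is the standard DDPM forward process, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε with ε ~ N(0, I). A minimal sketch, assuming `alphas_cumprod` holds the scheduler's cumulative products ᾱ_t:

```python
import torch

def forward(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Noise a clean image x0 to timestep t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(x0)              # random Gaussian noise
    return abar.sqrt() * x0 + (1 - abar).sqrt() * eps
```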
Here we applied low-pass filtering to the noisy image to isolate its lower frequencies and suppress the high-frequency noise.
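A minimal sketch of this classical baseline, assuming a Gaussian blur as the low-pass filter (kernel size and sigma are placeholder values):

```python
import torch
import torchvision.transforms.functional as TF

def lowpass_denoise(noisy_image: torch.Tensor, kernel_size: int = 7, sigma: float = 2.0) -> torch.Tensor:
    """Suppress high-frequency noise with a Gaussian blur (low-pass filter).
    Larger kernel_size / sigma remove more noise but also blur away detail."""
    return TF.gaussian_blur(noisy_image, kernel_size=kernel_size, sigma=sigma)
```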
Here we used DeepFloyd to estimate the noise in the noisy image, then used a formula derived from the forward process to remove it and estimate the clean image in a single step.
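The formula comes from rearranging the forward process above: given the model's noise estimate ε̂, the clean-image estimate is x̂_0 = (x_t − sqrt(1 − ᾱ_t) ε̂) / sqrt(ᾱ_t). A sketch:

```python
import torch

def one_step_denoise(xt: torch.Tensor, t: int, eps_hat: torch.Tensor,
                     alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Estimate the clean image from x_t and the model's predicted noise eps_hat."""
    abar = alphas_cumprod[t]
    return (xt - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```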
Here we applied the DDPM diffusion process, iteratively removing a small amount of noise over many timesteps to turn pure noise into a sample from the distribution of real images.
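A minimal sketch of the reverse process, using the standard DDPM update; `model` is assumed to predict the added noise, and `betas`, `alphas`, `alphas_cumprod` come from the noise schedule. The project's own sampler may differ in details such as the timestep schedule.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, alphas, alphas_cumprod, shape, device="cuda"):
    """Iteratively denoise pure noise into an image sample (standard DDPM)."""
    x = torch.randn(shape, device=device)                 # start from pure noise
    for t in reversed(range(len(betas))):
        eps_hat = model(x, torch.tensor([t], device=device))     # predicted noise
        coef = (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps_hat) / alphas[t].sqrt()               # posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)         # add sampling noise
    return x
```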
Above are example images sampled with the default DDPM procedure. The samples look more like what an alien might imagine our world to be, so there is clearly room for better quality.
We then applied Classifier-Free Guidance (CFG), which at each timestep amplifies how far the prompted noise estimate moves relative to a baseline (unconditional) prompt. This stronger influence of the prompt leads to higher-quality images.
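The core of CFG is a single line: the two noise estimates are combined as ε = ε_uncond + γ (ε_cond − ε_uncond) with γ > 1 (γ around 7 is a common choice; the exact value used here is an assumption).

```python
import torch

def cfg_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, gamma: float = 7.0) -> torch.Tensor:
    """Classifier-free guidance: extrapolate the conditional noise estimate
    away from the unconditional (empty-prompt) estimate by a factor gamma."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```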
By applying various levels of noise to an image (using the forward process above), we can have the model try to recover the original. Since the model is not perfect, we get translated images that merely resemble it. With increasing levels of noise, the results drift further and further from the original, since less information remains recoverable.
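A sketch of this translation, reusing the `forward` noising function from above; `denoise_loop` is a placeholder for the iterative denoiser, started partway through the schedule:

```python
def translate(x0, start_t, forward, denoise_loop, alphas_cumprod):
    """Image-to-image translation: partially noise the input, then denoise it back."""
    x_t = forward(x0, start_t, alphas_cumprod)   # add noise up to timestep start_t
    return denoise_loop(x_t, start_t)            # iteratively denoise from start_t to 0
```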
Similarly, we can apply the image-to-image translation to our own images. I tried to hand-draw a car... I think I did a good job on the tree. The diffusion model was able to recover the image and add a little bit of realism and detail.
By masking out part of the image, we can keep some regions fixed while diffusing only the masked portion, which yields some interesting results.
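A sketch of the inpainting trick: after every denoising step, the region outside the mask is forced back to an appropriately noised copy of the original, so only the masked region is actually generated (here `mask` is assumed to be 1 where new content should appear).

```python
import torch

def inpaint_step(x_t, x_orig, mask, t, forward, alphas_cumprod):
    """Keep the unmasked region pinned to the (noised) original image."""
    return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```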
After we add noise, we can also guide the result with a new text prompt to control what kind of similar image we get. Above are examples of images that were noised and then denoised under the associated prompt, producing results that stay similar to the original while taking on the character of the prompt.
While diffusing, we can manipulate the trajectory by computing the noise estimate for the right-side-up image and for the upside-down image simultaneously, under two different prompts. By combining the former's noise with the flipped version of the latter's, we can create effects like the ones above, where the image looks different right side up and upside down.
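A sketch of the combination step, assuming a simple average of the two estimates (the exact weighting used above may differ); the second prediction is made on the flipped image, so it is flipped back before averaging.

```python
import torch

def anagram_noise(eps_upright: torch.Tensor, eps_on_flipped: torch.Tensor) -> torch.Tensor:
    """Combine the upright prompt's noise estimate with the flipped-back estimate
    computed on the vertically flipped image under the second prompt."""
    return (eps_upright + torch.flip(eps_on_flipped, dims=(-2,))) / 2
```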
We can also create hybrid images. By manipulating the trajectory to low-pass one prompt's noise estimate and high-pass the other's, we can generate samples in which the low frequencies are controlled by one prompt and the high frequencies by the other. As a result, the image looks different depending on how far away you view it from.
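A sketch of that frequency split, assuming a Gaussian blur as the low-pass filter (kernel size and sigma are placeholder values):

```python
import torch
import torchvision.transforms.functional as TF

def hybrid_noise(eps_low_prompt: torch.Tensor, eps_high_prompt: torch.Tensor,
                 kernel_size: int = 33, sigma: float = 2.0) -> torch.Tensor:
    """Low frequencies of one prompt's noise estimate + high frequencies of the other's."""
    low = TF.gaussian_blur(eps_low_prompt, kernel_size=kernel_size, sigma=sigma)
    high = eps_high_prompt - TF.gaussian_blur(eps_high_prompt, kernel_size=kernel_size, sigma=sigma)
    return low + high
```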
Above, we took a sample image from MNIST and applied various levels of noise to it. We generated a tensor of Gaussian noise, scaled it by sigma, and added it to the original image to get the results above. This helps visualize the denoising process we want to implement.
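The noising operation here is simply z = x + σε with ε ~ N(0, I); a one-line sketch:

```python
import torch

def add_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """Noise an MNIST image: z = x + sigma * eps, eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```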
We create an unconditional UNet. The UNet has an autoencoder-like architecture that downsamples the input before upsampling it again, but it also passes high-frequency information from the downsampling path directly to the upsampling path via skip connections, preserving detail that helps create sharp images. Above is the training loss for predicting the noise.
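A much-simplified sketch of the idea (one downsampling level and one skip connection; the actual network is deeper and uses its own block definitions):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal UNet-style denoiser: downsample, process, upsample, and concatenate
    a skip connection that carries high-frequency detail across the bottleneck."""
    def __init__(self, in_ch: int = 1, hidden: int = 64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.GELU())
        self.pool = nn.Conv2d(hidden, hidden, 3, stride=2, padding=1)          # downsample
        self.mid = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU())
        self.up = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)   # upsample
        self.out = nn.Conv2d(2 * hidden, in_ch, 3, padding=1)                  # after skip concat

    def forward(self, x):
        skip = self.down(x)                           # high-frequency features
        h = self.up(self.mid(self.pool(skip)))
        return self.out(torch.cat([h, skip], dim=1))  # reuse the skip connection
```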
Above is how the UNet performed at denoising after 1 epoch and after 5 epochs of training. The left column is the original image, the middle column is the image with noise added, and the last column is the model's reconstruction.
This model was trained to denoise images at sigma = 0.5. As a result, it can behave in unexpected ways with different levels of noise as shown above.
We implement a time-conditioned UNet. By injecting the timestep of the denoising process into the UNet, a single model can denoise at various noise levels. To do this, we normalize the timestep to [0, 1] and broadcast it across the channels of the bottleneck, where it has a strong influence on the upsampling half of the network. We also inject it again between the upsampling layers.
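A sketch of the conditioning mechanism, assuming a small fully connected block maps the normalized timestep to one value per channel, which is then broadcast over the feature map (the layer names in the comments are placeholders for the project's own layers):

```python
import torch
import torch.nn as nn

class TimeEmbed(nn.Module):
    """Map a normalized timestep t in [0, 1] to a per-channel vector for broadcasting."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) -> (batch, channels, 1, 1) so it broadcasts over H and W
        return self.net(t[:, None])[:, :, None, None]

# Inside the UNet forward pass (placeholder layer names):
#   h = self.unflatten(h) + self.time_embed_bottleneck(t)   # inject at the bottleneck
#   ...
#   h = self.up_block(h) + self.time_embed_up(t)            # inject again while upsampling
```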
Above are the results of the model denoising images at random t values after 5 epochs and after 20 epochs. We can see that the model simply tries to produce MNIST-like samples, with no understanding of the individual labels or of what distinguishes each digit, which leads to some figures that do not look like any digit.
Now we inject the class label, i.e. the digit value (0-9), by one-hot encoding it and applying a Linear layer to expand it across multiple channels. Using broadcasting, we multiply this value into the unflattened bottleneck features before injecting time, to influence the trajectory of the model. Above is the training loss.
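A sketch of the class conditioning, assuming the one-hot label is expanded by a small fully connected block into a per-channel vector that multiplicatively scales the bottleneck features (placeholder names again):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassEmbed(nn.Module):
    """Map a digit label (0-9) to a per-channel vector via one-hot encoding + Linear layers."""
    def __init__(self, num_classes: int, channels: int):
        super().__init__()
        self.num_classes = num_classes
        self.fc = nn.Sequential(nn.Linear(num_classes, channels), nn.GELU(),
                                nn.Linear(channels, channels))

    def forward(self, labels: torch.Tensor) -> torch.Tensor:
        one_hot = F.one_hot(labels, num_classes=self.num_classes).float()  # (batch, 10)
        return self.fc(one_hot)[:, :, None, None]                          # broadcast over H, W

# Inside the UNet forward pass (placeholder names):
#   c1 = self.class_embed(labels)                       # multiplicative class conditioning
#   h = c1 * self.unflatten(h) + self.time_embed(t)     # then add the time conditioning
```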
Lastly, with this class-conditioned sampling, we get more realistic-looking digits.
It was really interesting to use deep learning to sample across different images and create many interesting effects. I also got to understand the nature of solving problems where compute time is not as readily available.