Project 5: Fun with Diffusion Models

Part A: The Power of Diffusion Models

Part 0: Setup

Using the pretrained diffusion model, I sampled using 3 different text prompts at 2 different numbers of inference steps. I used a random seed of 180.

Samples with num_inference_steps=20
Samples with num_inference_steps=100

Increasing the number of inference steps increases computation time, but the generated images appear noticeably more detailed.
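For reference, here is a minimal sketch of the sampling setup, assuming a HuggingFace diffusers text-to-image pipeline; the model id and prompt below are placeholders rather than the exact ones used for these samples.

import torch
from diffusers import DiffusionPipeline

torch.manual_seed(180)  # fixed random seed used for all samples

pipe = DiffusionPipeline.from_pretrained("some/pretrained-model").to("cuda")  # placeholder model id

for steps in (20, 100):
    image = pipe("a placeholder text prompt", num_inference_steps=steps).images[0]
    image.save(f"sample_{steps}_steps.png")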

Part 1: Sampling Loops

1.1: Implementing the Forward Process

To generate a noisy image at timestep t, I scale the clean image by sqrt(alpha_bar_t) and add Gaussian noise scaled by sqrt(1 - alpha_bar_t), so the amount of noise grows with the timestep and the noise-schedule coefficients.

Noisy images at t=0 (clean), t=250, t=500, and t=750.
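A minimal sketch of this forward process, assuming alphas_cumprod holds the cumulative noise-schedule products alpha_bar_t (the variable names are mine):

import torch

def forward(x0, t, alphas_cumprod):
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps
    return x_t, eps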

1.2 Classical Denoising

Here, I used a classical denoising method: Gaussian blur filtering. Considerable artifacts remain in the blurred images, especially at higher noise levels.

Noisy images at t=250, t=500, and t=750, and their Gaussian-blur-denoised versions.
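A minimal sketch of this baseline; the kernel size and sigma below are illustrative choices, not necessarily the ones used for the images above.

import torchvision.transforms.functional as TF

def blur_denoise(noisy, kernel_size=7, sigma=2.0):
    # Gaussian blur suppresses high-frequency noise but also smears image detail.
    return TF.gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)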

1.3 One-step Denoising

Here, I used the pretrained diffusion model's noise estimate to denoise images in a single step. This works much better than classical denoising, but higher noise levels still pose a challenge.

Noisy images at t=250, t=500, and t=750, and their one-step denoised versions.
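A minimal sketch of one-step denoising, assuming unet(x_t, t) returns the model's noise estimate and alphas_cumprod comes from the noise schedule:

import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, alphas_cumprod):
    abar_t = alphas_cumprod[t]
    eps_hat = unet(x_t, t)  # predicted noise
    # Invert the forward process: x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)
    x0_hat = (x_t - (1.0 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
    return x0_hat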

1.4 Iterative Denoising

Here, I denoised iteratively with the pretrained model over strided timesteps instead of trying to denoise in one step. These are by far the best results so far.

Intermediate estimates at t=90, 240, 390, 540, and 690, followed by a comparison of the original image, the iteratively denoised result, the one-step denoised result, and the Gaussian-blurred result.
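A minimal sketch of a single strided denoising step, again assuming unet(x_t, t) returns the predicted noise; the added-noise (variance) term is omitted for brevity.

import torch

@torch.no_grad()
def iterative_denoise_step(x_t, t, t_prev, unet, alphas_cumprod):
    # Step from timestep t to the next, smaller timestep t_prev in the strided schedule.
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t

    eps_hat = unet(x_t, t)
    x0_hat = (x_t - (1.0 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()

    # DDPM-style posterior mean: blend the clean-image estimate with the current noisy image.
    x_prev = (abar_prev.sqrt() * beta_t / (1.0 - abar_t)) * x0_hat \
           + (alpha_t.sqrt() * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t
    return x_prev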

1.5 Diffusion Model Sampling

By using pure noise as an input to our iterative denoising function, we can generate new images from scratch. Using "a high quality image" as our prompt, here are 5 sampled images.

Five samples from the prompt "a high quality image".
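A minimal sketch of sampling from scratch, reusing iterative_denoise_step from above; the image shape and the stride of the timestep schedule are illustrative assumptions, and unet and alphas_cumprod are assumed to be already defined.

import torch

strided_ts = list(range(990, -1, -30))  # e.g. 990, 960, ..., 0 (the stride is an assumption)
x = torch.randn(1, 3, 64, 64)           # start from pure Gaussian noise

for t, t_prev in zip(strided_ts[:-1], strided_ts[1:]):
    x = iterative_denoise_step(x, t, t_prev, unet, alphas_cumprod)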

1.6 Classifier-Free Guidance

We then use Classifier-Free Guidance (CFG) to improve image quality, at the expense of image diversity. This involves running the denoiser twice for each time step, generating a conditional and unconditional noise estimate. The final noise estimate for each time step combines both of these estimates using a hyperparameter scale.

Five samples generated with CFG.
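A minimal sketch of the CFG noise estimate; the guidance scale gamma = 7 is a typical value rather than necessarily the exact one used here, and the embedding arguments are assumptions.

def cfg_noise_estimate(x_t, t, unet, cond_emb, uncond_emb, gamma=7.0):
    eps_cond = unet(x_t, t, cond_emb)      # conditioned on the text prompt
    eps_uncond = unet(x_t, t, uncond_emb)  # conditioned on the empty prompt ""
    # gamma = 1 recovers the conditional estimate; gamma > 1 strengthens the guidance.
    return eps_uncond + gamma * (eps_cond - eps_uncond)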

1.7 Image-to-Image Translation

Here, I follow the SDEdit algorithm to generate images that are progressively more similar to a starting image. This involves taking an original test image, noising it, and forcing it back onto the image manifold unconditionally. This worked better for the Campanile and dog images than for the cat image; I suspect this is because very few images in the training set resemble the cat image. Interestingly, using i_start=20 for the cat image changed the cat into a pair of shoes, which makes sense since she's on a shoe rack!

Original Campanile, dog, and cat photos, and their SDEdit results at the various i_start values.
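A minimal sketch of the SDEdit loop, reusing the helpers above; here i_start indexes into the strided timestep schedule, so a smaller i_start means more noise and a larger edit.

def sdedit(x_orig, i_start, strided_ts, unet, alphas_cumprod):
    t_start = strided_ts[i_start]
    x, _ = forward(x_orig, t_start, alphas_cumprod)  # noise the original to t_start
    for t, t_prev in zip(strided_ts[i_start:-1], strided_ts[i_start + 1:]):
        x = iterative_denoise_step(x, t, t_prev, unet, alphas_cumprod)
    return x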

1.7.1 Editing Hand-drawn and Web Images

Here I used the same process as in the previous section, but with hand-drawn and other nonrealistic images, projecting them onto the natural image manifold.

Original Gojo image, buff guy drawing, and desert drawing, each with its edited results.

1.7.2 Inpainting

I then followed the RePaint paper to implement inpainting. At every step of the denoising loop, pixels outside the mask are forced to match the original image, noised to the current timestep. I used this to add a new top to the Campanile, remove the audience from a concert picture, and remove a fence from a field picture.

Original image, mask, and inpainted result for the Campanile, concert, and field images.
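A minimal sketch of the per-step inpainting constraint, with mask = 1 marking the region to regenerate (a convention I'm assuming here):

def inpaint_step(x, x_orig, mask, t, t_prev, unet, alphas_cumprod):
    x = iterative_denoise_step(x, t, t_prev, unet, alphas_cumprod)
    x_known, _ = forward(x_orig, t_prev, alphas_cumprod)  # original, noised to the current timestep
    # Keep generated pixels inside the mask; force known pixels outside it.
    return mask * x + (1.0 - mask) * x_known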

1.7.3 Text-conditioned Image-to-Image Translation

I then perform image-to-image translation again but with a text prompt.

Prompt "a rocket ship" applied to the original Campanile photo.
Prompt "a photo of a dog" applied to the original dog photo.
Prompt "a photo of a hipster barista" applied to the original portrait photo.

1.8 Visual Anagrams

Here, I implement visual anagrams to create optical illusions. I use 2 prompts to generate an image that looks like one prompt right side up and like the other prompt when flipped upside down.

"an oil painting of people around a campfire" (upright) / "an oil painting of an old man" (flipped)
"a pencil" (upright) / "a rocket" (flipped)
a dog (upright) / a hipster barista (flipped)
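A minimal sketch of the anagram noise estimate (each per-prompt estimate could itself be a CFG estimate; that detail is omitted here):

import torch

def anagram_noise_estimate(x_t, t, unet, emb1, emb2):
    eps1 = unet(x_t, t, emb1)  # noise estimate for prompt 1 on the upright image
    # Noise estimate for prompt 2 on the upside-down image, flipped back upright.
    eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb2), dims=[-2])
    return 0.5 * (eps1 + eps2)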

1.9 Hybrid Images

Here, I implement another kind of optical illusion: hybrid images. The low frequencies of the image come from one prompt, and the high frequencies come from another. This is done by low-pass filtering one prompt's noise estimate, high-pass filtering the other's, and summing the two.

Close: waterfalls / far: skull
Close: waterfalls / far: coast
Close: dog / far: man
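A minimal sketch of the hybrid noise estimate; the Gaussian blur with kernel size 33 and sigma 2 is one reasonable low-pass choice, not necessarily the exact filter used here.

import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, unet, emb_low, emb_high, kernel_size=33, sigma=2.0):
    eps_low = unet(x_t, t, emb_low)    # prompt visible from far away (low frequencies)
    eps_high = unet(x_t, t, emb_high)  # prompt visible up close (high frequencies)
    lowpass = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return lowpass + highpass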

Part B: Diffusion Models from Scratch

In this part, I train a diffusion model from scratch on the MNIST dataset.

Part 1: Training a Single-Step Denoising UNet

First, I need to write a forward function that adds noise to images: z = x + sigma * eps, with eps drawn from a standard Gaussian. Here are noised images at different sigma levels:

Noised MNIST digits at increasing sigma levels.
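A minimal sketch of this noising function:

import torch

def add_noise(x, sigma):
    # z = x + sigma * eps, eps ~ N(0, I); x is assumed to be in [0, 1].
    return x + sigma * torch.randn_like(x)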

Then, I train a UNet for single-step denoising. I followed the given structure and built the network in PyTorch. The network is trained to denoise images at noise level sigma=0.5, using an L2 loss and the Adam optimizer with a learning rate of 1e-4. I display results after epochs 1 and 5.

Denoising results after epoch 1 and epoch 5, and the training loss curve.
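A minimal training-loop sketch matching the description above (L2 loss, Adam, lr = 1e-4, sigma = 0.5); the unet definition and the MNIST train_loader are assumed to exist already.

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
criterion = nn.MSELoss()  # L2 loss

for epoch in range(5):
    for x, _ in train_loader:         # MNIST images; labels are unused here
        z = add_noise(x, sigma=0.5)   # noisy input
        loss = criterion(unet(z), x)  # predict the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()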

1.2.2 Out of Distribution Testing

Although the model was trained on images noised with sigma=0.5, I can still test how well it does at other noise levels:

Denoising results at out-of-distribution noise levels.

Part 2: Training a Diffusion Model

2.1 Adding Time Conditioning to UNet

To make the model account for varying levels of noise, I add time conditioning. This involves adding 2 more fully-connected blocks that take in the timestep. Training also becomes more complex, since each training image is noised to a randomly sampled timestep. I trained for 20 epochs using a learning-rate schedule.

Samples after epoch 5 and epoch 20, and the training loss curve.
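A minimal sketch of one time-conditioned training step; T = 300 and normalizing t to [0, 1] before feeding it to the fully-connected blocks are assumptions on my part, as is the unet(x_t, t) interface.

import torch
import torch.nn.functional as F

def train_step(unet, x0, alphas_cumprod, optimizer, T=300):
    t = torch.randint(0, T, (x0.shape[0],))             # a random timestep per image
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps  # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)    # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()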

2.4 Adding Class Conditioning

Looking at the results of time-conditioned generation, there are still some malformed digits among the samples. Adding class conditioning largely fixes this. To implement class conditioning, I add two more fully-connected blocks that take a one-hot vector encoding the digit class (0-9). I also randomly drop the class conditioning during training so the UNet still works unconditionally, which lets me use CFG to generate much cleaner results:

Class-conditioned samples after epoch 5 and epoch 20, and the training loss curve.
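A minimal sketch of the class-conditioning details: a one-hot class vector that is zeroed out with some probability during training (so the model also learns the unconditional task), plus CFG at sampling time. The drop probability of 0.1 and guidance scale of 5 are typical values, not necessarily the exact ones used here.

import torch
import torch.nn.functional as F

def make_class_vector(labels, num_classes=10, p_uncond=0.1):
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
    return c * (1.0 - drop)  # an all-zero vector means "unconditional"

def cfg_eps(unet, x_t, t_norm, c, gamma=5.0):
    eps_cond = unet(x_t, t_norm, c)
    eps_uncond = unet(x_t, t_norm, torch.zeros_like(c))
    return eps_uncond + gamma * (eps_cond - eps_uncond)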