Commit 0f0b55d

Author: Mishig
Merge pull request #81 from huggingface/move_imgs
Fix broken imgs link on doc pages
2 parents 14c86b0 + 835b763 commit 0f0b55d

File tree: 3 files changed (+9, -4 lines)

_toctree.yml (+5)

@@ -5,6 +5,7 @@
     title: Introduction

 - title: 1. Introduction to diffusion models
+  isExpanded: true
   sections:
   - local: unit1/README
     newlocal: unit1/1
@@ -17,6 +18,7 @@
     title: Diffusion Models from Scratch

 - title: 2. Fine-Tuning, Guidance and Conditioning
+  isExpanded: true
   sections:
   - local: unit2/README
     newlocal: unit2/1
@@ -29,6 +31,7 @@
     title: Making a Class-Conditioned Diffusion Model

 - title: 3. Stable Diffusion
+  isExpanded: true
   sections:
   - local: unit3/README
     newlocal: unit3/1
@@ -38,6 +41,7 @@
     title: Stable Diffusion Introduction

 - title: 4. Going Further with Diffusion Models
+  isExpanded: true
   sections:
   - local: unit4/README
     newlocal: unit4/1
@@ -50,6 +54,7 @@
     title: Diffusion for Audio

 - title: Events related to the course
+  isExpanded: true
   sections:
   - local: hackathon/README
     newlocal: hackathon/introduction

unit2/README.md (+2, -2)

@@ -28,15 +28,15 @@ Fine-tuning typically works best if the new data somewhat resembles the base mod

 Unconditional models don't give much control over what is generated. We can train a conditional model (more on that in the next section) that takes additional inputs to help steer the generation process, but what if we already have a trained unconditional model we'd like to use? Enter guidance, a process by which the model predictions at each step in the generation process are evaluated against some guidance function and modified such that the final generated image is more to our liking.

-![guidance example image](guidance_eg.png)
+![guidance example image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/guidance_eg.png)

 This guidance function can be almost anything, making this a powerful technique! In the notebook, we build up from a simple example (controlling the color, as illustrated in the example output above) to one utilizing a powerful pre-trained model called CLIP which lets us guide generation based on a text description.

 ## Conditioning

 Guidance is a great way to get some additional mileage from an unconditional diffusion model, but if we have additional information (such as a class label or an image caption) available during training then we can also feed this to the model for it to use as it makes its predictions. In doing so, we create a **conditional** model, which we can control at inference time by controlling what is fed in as conditioning. The notebook shows an example of a class-conditioned model which learns to generate images according to a class label.

-![conditioning example](conditional_digit_generation.png)
+![conditioning example](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/conditional_digit_generation.png)

 There are a number of ways to pass in this conditioning information, such as
 - Feeding it in as additional channels in the input to the UNet. This is often used when the conditioning information is the same shape as the image, such as a segmentation mask, a depth map or a blurry version of the image (in the case of a restoration/superresolution model). It does work for other types of conditioning too. For example, in the notebook, the class label is mapped to an embedding and then expanded to be the same width and height as the input image so that it can be fed in as additional channels.

unit3/README.md (+2, -2)

@@ -37,12 +37,12 @@ By applying the diffusion process on these **latent representations** rather tha

 In Unit 2 we showed how feeding additional information to the UNet allows us to have some additional control over the types of images generated. We call this conditioning. Given a noisy version of an image, the model is tasked with predicting the denoised version **based on additional clues** such as a class label or, in the case of Stable Diffusion, a text description of the image. At inference time, we can feed in the description of an image we'd like to see and some pure noise as a starting point, and the model does its best to 'denoise' the random input into something that matches the caption.

-![text encoder diagram](text_encoder_noborder.png)<br>
+![text encoder diagram](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/text_encoder_noborder.png)<br>
 _Diagram showing the text encoding process which transforms the input prompt into a set of text embeddings (the encoder_hidden_states) which can then be fed in as conditioning to the UNet._

 For this to work, we need to create a numeric representation of the text that captures relevant information about what it describes. To do this, SD leverages a pre-trained transformer model based on something called CLIP. CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt.

-![conditioning diagram](sd_unet_color.png)
+![conditioning diagram](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/sd_unet_color.png)

 OK, so how do we actually feed this conditioning information into the UNet for it to use as it makes predictions? The answer is something called cross-attention. Scattered throughout the UNet are cross-attention layers. Each spatial location in the UNet can 'attend' to different tokens in the text conditioning, bringing in relevant information from the prompt. The diagram above shows how this text conditioning (as well as timestep-based conditioning) is fed in at different points. As you can see, at every level the UNet has ample opportunity to make use of this conditioning!
Comments (0)