Commit 0f0b55d

Author: Mishig
Merge pull request #81 from huggingface/move_imgs
Fix broken imgs link on doc pages
2 parents 14c86b0 + 835b763 commit 0f0b55d

File tree: 3 files changed (+9, -4 lines)

_toctree.yml (+5)

@@ -5,6 +5,7 @@
     title: Introduction

 - title: 1. Introduction to diffusion models
+  isExpanded: true
   sections:
   - local: unit1/README
     newlocal: unit1/1
@@ -17,6 +18,7 @@
     title: Diffusion Models from Scratch

 - title: 2. Fine-Tuning, Guidance and Conditioning
+  isExpanded: true
   sections:
   - local: unit2/README
     newlocal: unit2/1
@@ -29,6 +31,7 @@
     title: Making a Class-Conditioned Diffusion Model

 - title: 3. Stable Diffusion
+  isExpanded: true
   sections:
   - local: unit3/README
     newlocal: unit3/1
@@ -38,6 +41,7 @@
     title: Stable Diffusion Introduction

 - title: 4. Going Further with Diffusion Models
+  isExpanded: true
   sections:
   - local: unit4/README
     newlocal: unit4/1
@@ -50,6 +54,7 @@
     title: Diffusion for Audio

 - title: Events related to the course
+  isExpanded: true
   sections:
   - local: hackathon/README
     newlocal: hackathon/introduction

unit2/README.md (+2, -2)

@@ -28,15 +28,15 @@ Fine-tuning typically works best if the new data somewhat resembles the base mod

 Unconditional models don't give much control over what is generated. We can train a conditional model (more on that in the next section) that takes additional inputs to help steer the generation process, but what if we already have a trained unconditional model we'd like to use? Enter guidance, a process by which the model predictions at each step in the generation process are evaluated against some guidance function and modified such that the final generated image is more to our liking.

-![guidance example image](guidance_eg.png)
+![guidance example image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/guidance_eg.png)

 This guidance function can be almost anything, making this a powerful technique! In the notebook, we build up from a simple example (controlling the color, as illustrated in the example output above) to one utilizing a powerful pre-trained model called CLIP which lets us guide generation based on a text description.

 ## Conditioning

 Guidance is a great way to get some additional mileage from an unconditional diffusion model, but if we have additional information (such as a class label or an image caption) available during training then we can also feed this to the model for it to use as it makes its predictions. In doing so, we create a **conditional** model, which we can control at inference time by controlling what is fed in as conditioning. The notebook shows an example of a class-conditioned model which learns to generate images according to a class label.

-![conditioning example](conditional_digit_generation.png)
+![conditioning example](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/conditional_digit_generation.png)

 There are a number of ways to pass in this conditioning information, such as
 - Feeding it in as additional channels in the input to the UNet. This is often used when the conditioning information is the same shape as the image, such as a segmentation mask, a depth map or a blurry version of the image (in the case of a restoration/superresolution model). It does work for other types of conditioning too. For example, in the notebook, the class label is mapped to an embedding and then expanded to be the same width and height as the input image so that it can be fed in as additional channels.

unit3/README.md (+2, -2)

@@ -37,12 +37,12 @@ By applying the diffusion process on these **latent representations** rather tha

 In Unit 2 we showed how feeding additional information to the UNet allows us to have some additional control over the types of images generated. We call this conditioning. Given a noisy version of an image, the model is tasked with predicting the denoised version **based on additional clues** such as a class label or, in the case of Stable Diffusion, a text description of the image. At inference time, we can feed in the description of an image we'd like to see and some pure noise as a starting point, and the model does its best to 'denoise' the random input into something that matches the caption.

-![text encoder diagram](text_encoder_noborder.png)<br>
+![text encoder diagram](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/text_encoder_noborder.png)<br>
 _Diagram showing the text encoding process which transforms the input prompt into a set of text embeddings (the encoder_hidden_states) which can then be fed in as conditioning to the UNet._

 For this to work, we need to create a numeric representation of the text that captures relevant information about what it describes. To do this, SD leverages a pre-trained transformer model based on something called CLIP. CLIP's text encoder was designed to process image captions into a form that could be used to compare images and text, so it is well suited to the task of creating useful representations from image descriptions. An input prompt is first tokenized (based on a large vocabulary where each word or sub-word is assigned a specific token) and then fed through the CLIP text encoder, producing a 768-dimensional (in the case of SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. To keep things consistent prompts are always padded/truncated to be 77 tokens long, and so the final representation which we use as conditioning is a tensor of shape 77x1024 per prompt.

-![conditioning diagram](sd_unet_color.png)
+![conditioning diagram](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/diffusion-course/sd_unet_color.png)

 OK, so how do we actually feed this conditioning information into the UNet for it to use as it makes predictions? The answer is something called cross-attention. Scattered throughout the UNet are cross-attention layers. Each spatial location in the UNet can 'attend' to different tokens in the text conditioning, bringing in relevant information from the prompt. The diagram above shows how this text conditioning (as well as timestep-based conditioning) is fed in at different points. As you can see, at every level the UNet has ample opportunity to make use of this conditioning!
Comments (0)