Commit afdf620

committed
lecture
1 parent 42aeee0 commit afdf620

5 files changed

Lines changed: 50 additions & 51 deletions

File tree

docs/images/b1.png

191 KB

docs/images/b2.png

170 KB

docs/images/b3.png

155 KB

docs/images/ditblock.png

1020 KB

docs/index.html

Lines changed: 50 additions & 51 deletions
@@ -5890,7 +5890,7 @@ <h2>Key Takeaways</h2>
58905890
injects conditioning using cross-attention, and slowly transforms noisy latent tokens into something meaningful.
58915891
</p>
58925892
5893-
<p> Before we start all this, we need to do some preprocessing steps and a new concept. </p>
5893+
<p> Before we start all this, we need to do some preprocessing steps and learn a new concept. </p>
58945894
58955895
<h2>Convert the timestep into a vector form</h2>
58965896
@@ -5910,7 +5910,7 @@ <h2>Convert the timestep into a vector form</h2>
59105910
</pre>
59115911
59125912
<p>
5913-
But a single scalar can contain enough information about "how much noise there is".So we need to convert this one number into a rich embedding which can contain information about the noise level
5913+
But a single scalar cannot contain enough information about "how much noise there is". So we need to convert this one number into a rich embedding that can carry information about the noise level.
59145914
</p>
59155915
59165916
<p>
@@ -6006,52 +6006,18 @@ <h2>TimestepEmbedder</h2>
60066006
60076007
60086008
<div class="callout">
6009-
<strong>Note:</strong> We use a MLP layer here too because this timestep vector is not for just providing the position or the time step, it acts as a conditioning agent too, it needs to tell the noise information to the video latent at every step
6009+
<strong>Note:</strong> We use an MLP layer here too because this timestep vector does more than mark the position or time step: it also acts as a conditioning signal that tells the video latent how much noise is present at every step. The MLP makes this vector more information-rich.
60106010
</div>
60116011
60126012
60136013
<p>
60146014
The sinusoidal part gives structure.
60156015
The MLP makes it more information rich.
6016-
</p>
6017-
6018-
<p>
60196016
This final vector will be used as conditioning in the DiT block.
60206017
</p>
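The "sinusoidal part" described above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the lecture's `TimestepEmbedder`: the function name, the `max_period` default of 10000, and the cosine-then-sine layout are assumptions borrowed from the common transformer convention. The MLP that follows it in the lecture is left out.

```python
import math

def sinusoidal_timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep t to a vector of length dim (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies fall geometrically from 1 towards 1/max_period (assumed convention).
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    # First half holds cosines, second half holds sines.
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

# One scalar timestep becomes an 8-dimensional vector.
emb = sinusoidal_timestep_embedding(t=500, dim=8)
```

Each timestep gets a different mix of slow and fast oscillations, which is what gives the vector its structure before the MLP reshapes it.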
60216018
6022-
<h2>AdaLN (Adaptive Layer Normalization)</h2>
6023-
6024-
<p>
6025-
Now that we have a timestep vector, the next question is:
6026-
how do we inject it into the transformer block?
6027-
</p>
6028-
6029-
<p>
6030-
We could just add it somewhere.
6031-
But that would be too weak.
6032-
</p>
6033-
6034-
<p>
6035-
In a DiT block, conditioning must influence <strong>every major sub-block</strong>:
6036-
</p>
6037-
6038-
<ul>
6039-
<li>self-attention</li>
6040-
<li>cross-attention</li>
6041-
<li>feed-forward network</li>
6042-
</ul>
60436019
6044-
<p>
6045-
This is exactly what <code>adaLN</code> does.
6046-
</p>
6047-
6048-
<p>
6049-
<code>adaLN</code> stands for <strong>adaptive LayerNorm</strong>.
6050-
It predicts scale, shift, and gate values from the conditioning signal,
6051-
then uses them to do adaptive conditioning to the hidden states.
6052-
</p>
6053-
6054-
<h2>AdaLN Modulation</h2>
6020+
<h2>AdaLN (Adaptive Layer Normalization)</h2>
60556021
60566022
<pre><span class="c-kw">class</span> <span class="c-fn">AdaLNModulation</span>(nn.Module):
60576023
</pre>
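The class body is not visible in this diff, so here is a hedged NumPy sketch of what an adaLN modulation step might compute. It predicts scale, shift, and gate values from the conditioning vector and applies the first two to the hidden states. The class name `AdaLNModulationSketch`, the SiLU activation, the single linear projection, and the `x * (1 + scale) + shift` form are assumptions based on the usual DiT convention, not the lecture's actual code.

```python
import numpy as np

def silu(v):
    # SiLU activation: v * sigmoid(v).
    return v / (1.0 + np.exp(-v))

def modulate(x, scale, shift):
    # x: (tokens, dim); scale and shift: (dim,), broadcast over tokens.
    # Assumed form: zero scale and zero shift leave x unchanged.
    return x * (1.0 + scale) + shift

class AdaLNModulationSketch:
    """Hypothetical stand-in: one linear layer that emits scale, shift, gate."""

    def __init__(self, cond_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(0.0, 0.02, size=(cond_dim, 3 * hidden_dim))
        self.bias = np.zeros(3 * hidden_dim)

    def __call__(self, c):
        out = silu(c) @ self.weight + self.bias
        # Split the projection into three equal chunks.
        scale, shift, gate = np.split(out, 3, axis=-1)
        return scale, shift, gate

ada = AdaLNModulationSketch(cond_dim=4, hidden_dim=6)
c = np.ones(4)          # conditioning vector (e.g. the timestep embedding)
x = np.ones((3, 6))     # 3 tokens of width 6
gamma, beta, alpha = ada(c)
y = modulate(x, gamma, beta)
```

The return order mirrors the `γ, β, α = self.ada1(c)` unpacking used in the block code: scale, then shift, then gate.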
@@ -6062,7 +6028,7 @@ <h2>AdaLN Modulation</h2>
60626028
</p>
60636029
60646030
<p>
6065-
How and where do we inject this information into the transformer block?
6031+
How and where do we inject this conditioning into our block?
60666032
</p>
60676033
60686034
<p>
@@ -6231,6 +6197,20 @@ <h2>The Modulation Formula</h2>
62316197
62326198
<h2>Now let us look at the full DiT block</h2>
62336199
6200+
<div class="hero-image">
6201+
<img
6202+
src="images/ditblock.png"
6203+
alt="DiT block Architecture"
6204+
/>
6205+
<p class="caption">
6206+
DiT block Architecture
6207+
</p>
6208+
</div>
6209+
6210+
<p>
6211+
Believe me, it's not as complicated as it looks. The first row is our inputs, and the second row is where we preprocess those inputs, which we will cover in the conditioning lecture. Now we begin with the DiT block.
6212+
</p>
6213+
62346214
<p>
62356215
The DiT block has three main sub-blocks:
62366216
</p>
@@ -6244,23 +6224,22 @@ <h2>Now let us look at the full DiT block</h2>
62446224
<p>
62456225
Each one gets its own adaLN modulation.
62466226
So the conditioning is not injected once.
6247-
It is injected separately where it matters.
6227+
It is injected separately and multiple times.
62486228
</p>
62496229
6250-
<h2>Block Overview</h2>
62516230
6252-
<pre>
6253-
x → adaLN → Self-Attention → residual
6254-
x → adaLN → Cross-Attention → residual
6255-
x → adaLN → FFN → residual
6256-
</pre>
6231+
<h2>Block 1: Self-Attention</h2>
62576232
6258-
<p>
6259-
This is the basic flow.
6260-
Now let us break it down.
6261-
</p>
6233+
<div class="hero-image">
6234+
<img
6235+
src="images/b1.png"
6236+
alt="DiT block Architecture"
6237+
/>
6238+
<p class="caption">
6239+
6240+
</p>
6241+
</div>
62626242
6263-
<h2>Block 1: Self-Attention</h2>
62646243
62656244
<pre>
62666245
γ1, β1, α1 = self.ada1(c)
@@ -6279,6 +6258,16 @@ <h2>Block 1: Self-Attention</h2>
62796258
62806259
<h2>Block 2 Cross-Attention</h2>
62816260
6261+
<div class="hero-image">
6262+
<img
6263+
src="images/b2.png"
6264+
alt="DiT block Architecture"
6265+
/>
6266+
<p class="caption">
6267+
6268+
</p>
6269+
</div>
6270+
62826271
<pre>
62836272
γ2, β2, α2 = self.ada2(c)
62846273
x = x + α2 * self.cross_attn(modulate(self.norm2(x), γ2, β2), cond_tokens)
@@ -6306,6 +6295,16 @@ <h2>Block 2 Cross-Attention</h2>
63066295
63076296
<h2>Block 3 Feed-Forward Network</h2>
63086297
6298+
<div class="hero-image">
6299+
<img
6300+
src="images/b3.png"
6301+
alt="DiT block Architecture"
6302+
/>
6303+
<p class="caption">
6304+
6305+
</p>
6306+
</div>
6307+
63096308
<pre>
63106309
γ3, β3, α3 = self.ada3(c)
63116310
x = x + α3 * self.ffn(modulate(self.norm3(x), γ3, β3))
