docs/index.html
50 additions & 51 deletions
@@ -5890,7 +5890,7 @@ <h2>Key Takeaways</h2>
5890
5890
injects conditioning using cross-attention, and slowly transforms noisy latent tokens into something meaningful.
5891
5891
</p>
5892
5892
5893
-
<p> Before we start all this, we need to do some preprocessing steps and a new concept. </p>
5893
+
<p> Before we start all this, we need to do some preprocessing steps and learn a new concept. </p>
5894
5894
5895
5895
<h2>Convert the timestep into a vector form</h2>
5896
5896
@@ -5910,7 +5910,7 @@ <h2>Convert the timestep into a vector form</h2>
5910
5910
</pre>
5911
5911
5912
5912
<p>
5913
-
But a single scalar can contain enough information about "how much noise there is".So we need to convert this one number into a rich embedding which can contain information about the noise level
5913
+
But a single scalar cannot carry enough information about "how much noise there is". So we need to convert this one number into a rich embedding that can encode the noise level.
5914
5914
</p>
5915
5915
5916
5916
<p>
@@ -6006,52 +6006,18 @@ <h2>TimestepEmbedder</h2>
6006
6006
6007
6007
6008
6008
<div class="callout">
6009
-
<strong>Note:</strong> We use a MLP layer here too because this timestep vector is not for just providing the position or the time step, it acts as a conditioning agent too, it needs to tell the noise information to the video latent at every step
6009
+
<strong>Note:</strong> We use an MLP here too because the timestep vector does more than mark the position or the time step: it also acts as a conditioning signal that tells the video latent how much noise is present at every step. The MLP makes this vector more information-rich.
6010
6010
</div>
6011
6011
6012
6012
6013
6013
<p>
6014
6014
The sinusoidal part gives structure.
6015
6015
The MLP makes it more information rich.
6016
-
</p>
6017
-
6018
-
<p>
6019
6016
This final vector will be used as conditioning in the DiT block.
6020
6017
</p>
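<p>
The two stages above can be sketched in plain Python. This is only a minimal illustration, not the tutorial's actual <code>TimestepEmbedder</code>: the function names, the dimensions, the <code>max_period</code> value, the SiLU activation, and the random weights are all assumptions chosen to keep the example small.
</p>

```python
import math
import random

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep t to `dim` sin/cos features (dim assumed even)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def linear(vec, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def init_linear(rng, in_dim, out_dim):
    """Hypothetical random initialisation, for illustration only."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def timestep_embedder(t, freq_dim, params):
    """Sinusoidal features give structure; a two-layer MLP enriches them."""
    (w1, b1), (w2, b2) = params
    x = sinusoidal_embedding(t, freq_dim)
    h = [silu(v) for v in linear(x, w1, b1)]
    return linear(h, w2, b2)

FREQ_DIM, HIDDEN_DIM = 8, 16  # tiny sizes, assumed for readability
rng = random.Random(0)
params = (init_linear(rng, FREQ_DIM, HIDDEN_DIM),
          init_linear(rng, HIDDEN_DIM, HIDDEN_DIM))

emb_early = timestep_embedder(10, FREQ_DIM, params)   # one timestep
emb_late = timestep_embedder(900, FREQ_DIM, params)   # a very different timestep
print(len(emb_early), emb_early != emb_late)
```

<p>
Different timesteps produce different conditioning vectors of the same size, which is the property the DiT block relies on. In a trained model the MLP weights are learned rather than sampled at random.
</p>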
6021
6018
6022
-
<h2>AdaLN (Adaptive Layer Normalization)</h2>
6023
-
6024
-
<p>
6025
-
Now that we have a timestep vector, the next question is:
6026
-
how do we inject it into the transformer block?
6027
-
</p>
6028
-
6029
-
<p>
6030
-
We could just add it somewhere.
6031
-
But that would be too weak.
6032
-
</p>
6033
-
6034
-
<p>
6035
-
In a DiT block, conditioning must influence <strong>every major sub-block</strong>:
6036
-
</p>
6037
-
6038
-
<ul>
6039
-
<li>self-attention</li>
6040
-
<li>cross-attention</li>
6041
-
<li>feed-forward network</li>
6042
-
</ul>
6043
6019
6044
-
<p>
6045
-
This is exactly what <code>adaLN</code> does.
6046
-
</p>
6047
-
6048
-
<p>
6049
-
<code>adaLN</code> stands for <strong>adaptive LayerNorm</strong>.
6050
-
It predicts scale, shift, and gate values from the conditioning signal,
6051
-
then uses them to do adaptive conditioning to the hidden states.
Believe me, it's not as complicated as it looks. The first row is our inputs; the second row is where we preprocess them, which we will cover in the conditioning lecture. Now we begin with the DiT block.
6212
+
</p>
6213
+
6234
6214
<p>
6235
6215
The DiT block has three main sub-blocks:
6236
6216
</p>
@@ -6244,23 +6224,22 @@ <h2>Now let us look at the full DiT block</h2>