# Zero-out adaLN modulation layers in DDT blocks (adaLN-Zero):
for block in self.blocks:
    nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
    nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
I noticed that unlike the standard DiT architecture, which uses zero-initialization in every transformer block (e.g., via the adaLN-Zero mechanism), your implementation only applies zero-initialization to the final projection head. Could you share the reasoning behind this design choice?
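For concreteness, here is a minimal sketch of the two initialization schemes being contrasted; the `final_layer` / `linear` attribute names are assumptions following the public DiT reference code, not necessarily your implementation:

import torch.nn as nn

def init_adaln_zero(model):
    # DiT-style adaLN-Zero: zero the last Linear of every block's
    # modulation MLP so each residual branch starts as an identity map.
    for block in model.blocks:
        nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
        nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
    # DiT also zeroes the final layer's modulation and output projection.
    nn.init.constant_(model.final_layer.adaLN_modulation[-1].weight, 0)
    nn.init.constant_(model.final_layer.adaLN_modulation[-1].bias, 0)
    nn.init.constant_(model.final_layer.linear.weight, 0)
    nn.init.constant_(model.final_layer.linear.bias, 0)

def init_head_only(model):
    # Head-only variant described above: per-block modulation layers keep
    # their default init; only the final projection head is zeroed.
    nn.init.constant_(model.final_layer.linear.weight, 0)
    nn.init.constant_(model.final_layer.linear.bias, 0)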