Hi,
Congratulations on the amazing work.
I had a small curiosity question: for the velocity decoder, in AdaLN modulation, instead of a single conditioning token you now have K tokens (K = 256 for a 256x256 image, i.e., a 32x32 latent with patch size 2). As such, the `adaLN_modulation` linear layer, which previously computed the scale and shift for just one token, now has to compute a scale and shift for each of the K tokens. I assume this grows the FLOPs of that layer by a factor of K, so for a 256x256 image the cost would grow 256x for the layers that belong to the velocity decoder (see the sketch below). I was therefore wondering whether you have any numbers showing FLOPs vs. FID with SiT as a baseline.
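To make the FLOPs comparison concrete, here is a minimal sketch of what I have in mind (my own naming and shapes, assuming a standard DiT/SiT-style `adaLN_modulation` that regresses shift/scale/gate parameters; not taken from your code):

```python
import torch
import torch.nn as nn

B, K, D = 4, 256, 1152  # batch, tokens (32x32 latent / patch 2), hidden size (assumed, e.g. SiT-XL)

# Standard DiT/SiT AdaLN: one conditioning vector per sample; the
# shift/scale/gate parameters are computed once and broadcast to all K tokens.
adaLN_global = nn.Sequential(nn.SiLU(), nn.Linear(D, 6 * D))
c = torch.randn(B, D)
mod_global = adaLN_global(c)            # (B, 6*D): one D -> 6D projection per sample

# Per-token variant (my understanding of the velocity decoder): each of the
# K tokens carries its own conditioning, so the same projection runs K times.
adaLN_per_token = nn.Sequential(nn.SiLU(), nn.Linear(D, 6 * D))
c_tokens = torch.randn(B, K, D)
mod_tokens = adaLN_per_token(c_tokens)  # (B, K, 6*D): K projections per sample

# The modulation layer's FLOPs thus scale from ~6*D^2 to ~K * 6*D^2
# multiply-adds per sample, i.e. a factor of K (= 256 here).
```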
Thanks a lot for the cool work; I'm really eager to hear your thoughts.