Hello
The Vocos implementation uses RoPE from torchtune, whose RotaryPositionalEmbeddings expects tensors of shape [b, s, n_h, h_d] (https://docs.pytorch.org/torchtune/0.3/generated/torchtune.modules.RotaryPositionalEmbeddings.html).
However, the code passes tensors of shape [b, n_h, s, h_d], so RoPE treats the attention-head count as the sequence length:
```python
q, k, v = rearrange(self.c_attn(x), 'b t (r h d) -> r b h t d', r=3, h=self.n_heads)
# q, k, v: (b, h, t, d)

q = self.rotary_embed(q)
k = self.rotary_embed(k)
```
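A minimal sketch of the problem and one possible fix, using a simplified stand-in RoPE (not torchtune's actual implementation, which interleaves pairs) that follows the same [b, s, n_h, h_d] shape contract. The idea: transpose to put the sequence dimension second before applying RoPE, then transpose back. Names here are hypothetical, for illustration only:

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Simplified RoPE that, like torchtune's RotaryPositionalEmbeddings,
    expects input of shape [b, s, n_h, h_d] (sequence dim second)."""
    b, s, n_h, h_d = x.shape
    half = h_d // 2
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    ang = torch.arange(s, dtype=x.dtype)[:, None] * inv_freq[None, :]  # [s, half]
    cos = ang.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = ang.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Buggy call site shape: q is [b, n_h, t, h_d], so RoPE would rotate
# along the head axis instead of along positions.
b, n_h, t, h_d = 2, 8, 16, 64
q = torch.randn(b, n_h, t, h_d)

# Possible fix: transpose to [b, t, n_h, h_d] before RoPE, then back.
q_rot = rope(q.transpose(1, 2)).transpose(1, 2)
assert q_rot.shape == (b, n_h, t, h_d)
```

With the correct layout, position 0 is left unrotated (angle 0), which is an easy sanity check that the rotation is applied along the time axis.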