
RoPE uses wrong dimension as a sequence length in Vocos implementation #36

@Uncomfy

Description


Hello,
The Vocos implementation uses RoPE from torchtune, which expects tensors of shape [b, s, n_h, h_d] (https://docs.pytorch.org/torchtune/0.3/generated/torchtune.modules.RotaryPositionalEmbeddings.html).
The code, however, passes tensors of shape [b, n_h, s, h_d], which means RoPE treats the attention head count as the sequence length:

q, k, v = rearrange(self.c_attn(x), 'b t (r h d) -> r b h t d', r=3, h=self.n_heads)
# q, k, v: (b, h, t, d)
q = self.rotary_embed(q)
k = self.rotary_embed(k)
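
Below is a minimal standalone sketch (my own illustration, not a patch against the Vocos code) that contrasts the two layouts, assuming torchtune and einops are installed; the tensor sizes are arbitrary and chosen only for the demo.

import torch
from einops import rearrange
from torchtune.modules import RotaryPositionalEmbeddings

b, t, n_heads, head_dim = 2, 50, 8, 64
rope = RotaryPositionalEmbeddings(dim=head_dim, max_seq_len=4096)
x = torch.randn(b, t, n_heads * head_dim)

# Layout currently fed to RoPE in the attention block: (b, h, t, d).
# torchtune reads the sequence length from dim 1, so every timestep inside a
# head gets the rotation for its head index rather than for its position.
q_wrong = rope(rearrange(x, 'b t (h d) -> b h t d', h=n_heads))

# Documented layout: (b, s, n_h, h_d). Rotations now vary along the time axis.
q_right = rope(rearrange(x, 'b t (h d) -> b t h d', h=n_heads))

# After RoPE, move heads back in front of time for the attention computation.
q_for_attn = rearrange(q_right, 'b t h d -> b h t d')

One possible fix would be to emit 'r b t h d' from the rearrange, apply self.rotary_embed to q and k in that layout, and only then transpose q, k, v to (b, h, t, d) for the attention call.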
