Hello
The Vocos implementation uses RoPE from torchtune, whose RotaryPositionalEmbeddings expects tensors of shape [b, s, n_h, h_d] (https://docs.pytorch.org/torchtune/0.3/generated/torchtune.modules.RotaryPositionalEmbeddings.html).
However, the code passes tensors of shape [b, n_h, s, h_d], so RoPE treats the attention-head count as the sequence length:
```python
q, k, v = rearrange(self.c_attn(x), 'b t (r h d) -> r b h t d', r=3, h=self.n_heads)
# q, k, v: (b, h, t, d)

q = self.rotary_embed(q)
k = self.rotary_embed(k)
```
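A minimal sketch of the problem and one possible fix, using a simplified stand-in RoPE (not torchtune's actual implementation, which interleaves pairs) that follows the same [b, s, n_h, h_d] shape contract. The idea: transpose to put the sequence dimension second before applying RoPE, then transpose back. Names here are hypothetical, for illustration only:

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Simplified RoPE that, like torchtune's RotaryPositionalEmbeddings,
    expects input of shape [b, s, n_h, h_d] (sequence dim second)."""
    b, s, n_h, h_d = x.shape
    half = h_d // 2
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    ang = torch.arange(s, dtype=x.dtype)[:, None] * inv_freq[None, :]  # [s, half]
    cos = ang.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = ang.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Buggy call site shape: q is [b, n_h, t, h_d], so RoPE would rotate
# along the head axis instead of along positions.
b, n_h, t, h_d = 2, 8, 16, 64
q = torch.randn(b, n_h, t, h_d)

# Possible fix: transpose to [b, t, n_h, h_d] before RoPE, then back.
q_rot = rope(q.transpose(1, 2)).transpose(1, 2)
assert q_rot.shape == (b, n_h, t, h_d)
```

With the correct layout, position 0 is left unrotated (angle 0), which is an easy sanity check that the rotation is applied along the time axis.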