Torch's SDPA doesn't require V to have the same head dimension as Q and K; the docs even note this by using distinct dimensions E and Ev. This works because, by the time V is multiplied in, the head dimension has already been contracted away and only an L x L attention matrix per head remains.
In [22]: import torch, torch.nn.functional as F, xformers.ops as xops, flash_attn
In [23]: qk = torch.randn(4, 4, 4, 8).bfloat16().cuda()
In [24]: v = torch.randn(4, 4, 4, 16).bfloat16().cuda()
In [25]: F.scaled_dot_product_attention(qk, qk, v).shape
Out[25]: torch.Size([4, 4, 4, 16])
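For reference, a minimal hand-rolled sketch (plain PyTorch, purely illustrative shapes) that makes the shape argument explicit: the head dimension E of Q and K is contracted away in QK^T, leaving an L x L matrix per head, so V only has to agree on the sequence length and is free to carry its own head dimension Ev.

```python
import torch
import torch.nn.functional as F

B, H, L, E, Ev = 4, 4, 4, 8, 16          # Ev deliberately different from E
q = torch.randn(B, H, L, E)
k = torch.randn(B, H, L, E)
v = torch.randn(B, H, L, Ev)

# (B, H, L, L): E is contracted away here, so V's head dim never meets Q/K's
attn = ((q @ k.transpose(-2, -1)) / E**0.5).softmax(dim=-1)

# (B, H, L, Ev): only V's head dimension appears in the output
out = attn @ v
assert out.shape == (B, H, L, Ev)

# SDPA with distinct E and Ev gives the same result
ref = F.scaled_dot_product_attention(q, k, v)
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```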
The same holds for xformers; their docs use K and Kv for the two head dimensions.
In [26]: xops.memory_efficient_attention(qk, qk,v).shape
Out[26]: torch.Size([4, 4, 4, 16])
However, flash-attention 2 [2.4.2] requires the head dimension of V to match that of Q and K.
In [27]: flash_attn.flash_attn_func(qk,qk,v)....
RuntimeError: v must have shape (batch_size, seqlen_k, num_heads_k, head_size_og)
(As documented, it requires all tensors to have the same headdim per head; the error message just uses a different name for it than the documentation.)
Can this be relaxed to allow a different head_size for V, or does the implementation depend on the head dimensions matching?
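In the meantime, a possible workaround (my own sketch, not something the flash-attn docs prescribe): zero-pad the head dimension of Q and K up to V's head dimension and pass the original softmax scale explicitly; the extra zero columns leave QK^T unchanged, so the attention weights are identical. This assumes flash-attn 2.x's flash_attn_func(q, k, v, softmax_scale=...) signature and its (batch_size, seqlen, num_heads, head_size) layout.

```python
import math
import torch
import torch.nn.functional as F
import flash_attn

B, L, H, E, Ev = 4, 4, 4, 8, 16           # illustrative sizes, Ev > E
q = torch.randn(B, L, H, E, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, L, H, E, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, L, H, Ev, dtype=torch.bfloat16, device="cuda")

# Zero-pad Q and K's head dim to Ev; zero columns do not change Q @ K^T.
q_pad = F.pad(q, (0, Ev - E))
k_pad = F.pad(k, (0, Ev - E))

# Keep the scale of the original head dim E; the default would use 1/sqrt(Ev).
out = flash_attn.flash_attn_func(q_pad, k_pad, v, softmax_scale=1.0 / math.sqrt(E))
print(out.shape)  # torch.Size([4, 4, 4, 16])
```

This trades extra compute and memory on the padded head dimension for compatibility, which is why native support for a separate V head size would still be useful.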