Multi-token attention #857

@Golovneva

Description

🚀 The feature, motivation and pitch

Hi! Thank you for adding support for MTA (#689)! Do I understand correctly that this implementation only covers the post-softmax key-query convolution? There is also the pre-softmax Q-K convolution, the head convolution, and the gated group norm (the last of which should probably not be part of the kernel). We have released reference code here: https://github.com/facebookresearch/RAM/blob/main/projects/mta/mta_transformer.py#L337
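
For context, here is a minimal sketch of the pre- vs. post-softmax distinction being asked about. It assumes odd kernel sizes and omits the causal re-masking that the reference code applies around the convolution; the function name `mta_key_query_conv` and its signature are illustrative, not the RAM API:

```python
import torch
import torch.nn.functional as F

def mta_key_query_conv(scores, weight, pre_softmax=True):
    # scores: (batch, heads, q_len, k_len) attention logits
    # weight: (heads, 1, c_q, c_k) per-head 2D kernel, c_q and c_k odd
    b, h, q_len, k_len = scores.shape
    # Pre-softmax variant convolves the raw logits; post-softmax
    # variant convolves the normalized attention weights instead.
    x = scores if pre_softmax else F.softmax(scores, dim=-1)
    c_q, c_k = weight.shape[-2:]
    # Grouped conv2d: one kernel per head; 'same' padding keeps the
    # (q_len, k_len) shape.
    x = F.conv2d(x, weight, padding=(c_q // 2, c_k // 2), groups=h)
    return F.softmax(x, dim=-1) if pre_softmax else x

# Illustrative usage with random logits and kernels
scores = torch.randn(2, 8, 16, 16)
kernel = torch.randn(8, 1, 3, 5)
pre = mta_key_query_conv(scores, kernel, pre_softmax=True)
post = mta_key_query_conv(scores, kernel, pre_softmax=False)
```

The only difference between the two variants is whether the grouped convolution runs over the raw logits or over the normalized attention weights, which is why the pre-softmax case needs kernel support of its own.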

Alternatives

No response

Additional context

No response
