Differential Transformer

Approach

Differential attention computes attention scores as the difference between two separate softmax attention maps; the subtraction cancels common-mode attention noise and encourages sparser attention patterns.

Contents

multihead_diffattn.py contains a naive implementation of multi-head differential attention; a minimal single-head sketch of the core computation is given after these file descriptions.

multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers). Recommended for faster training and inference.

multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).

multihead_attention.py contains an implementation of conventional multi-head attention.

example.py contains paired instantiations of differential attention and conventional attention, so the two can be compared against each other.
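
For orientation, here is a minimal, self-contained sketch of the computation that multihead_diffattn.py implements in full. It is an illustration rather than the repository code: it covers a single differential head and replaces the learnable, reparameterized λ from the paper with a plain scalar argument.

```python
import math

import torch
import torch.nn.functional as F


def diff_attention(q1, k1, q2, k2, v, lam: float):
    """Single-head differential attention on tensors of shape (batch, seq, d).

    Two independent query/key projections yield two softmax attention maps;
    their difference, weighted by the scalar lam, is applied to the values.
    """
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), dim=-1)
    return (a1 - lam * a2) @ v


# Toy usage with random tensors standing in for the learned projections.
batch, seq, d = 2, 16, 64
q1, k1, q2, k2, v = (torch.randn(batch, seq, d) for _ in range(5))
out = diff_attention(q1, k1, q2, k2, v, lam=0.8)
print(out.shape)  # torch.Size([2, 16, 64])
```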

Also refer to the PR for another implementation.

We recommend using models with a sufficiently large number of heads, since differential attention halves the head count relative to a conventional Transformer of the same size. For instance, using Diff Transformer with at least 8 heads (the minimum used in the paper, which matches the parameter count of a Transformer with 16 heads) is advisable.
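
To make the head-count bookkeeping concrete, here is a small illustrative calculation (the variable names are ours, not the repository's exact constructor arguments): each differential head uses two query/key sub-heads, so a differential layer needs only half as many heads to span the same embedding dimension.

```python
embed_dim = 2048

# Conventional multi-head attention: 16 heads of dimension 128 each.
conventional_heads = 16
conventional_head_dim = embed_dim // conventional_heads  # 128

# Differential attention: each head holds two q/k sub-heads of dimension 128,
# so 8 differential heads cover the same embed_dim with a comparable
# parameter count.
diff_heads = conventional_heads // 2                      # 8
diff_head_dim = embed_dim // diff_heads // 2              # 128
```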

Core Code