Differential Transformer

Approach

Differential attention computes attention scores as the difference between two separate softmax attention maps; the subtraction cancels common-mode attention noise and encourages sparser attention patterns.

Contents

multihead_diffattn.py contains a naive implementation of multi-head differential attention; a minimal single-head sketch of the core computation is given after these file descriptions.

multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers). Recommended for faster training and inference.

multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).

multihead_attention.py contains an implementation of conventional multi-head attention.

example.py contains paired instantiations of differential attention and conventional attention, so the two can be compared against each other.
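
For orientation, here is a minimal, self-contained sketch of the computation that multihead_diffattn.py implements in full. It is an illustration rather than the repository code: it covers a single differential head and replaces the learnable, reparameterized λ from the paper with a plain scalar argument.

```python
import math

import torch
import torch.nn.functional as F


def diff_attention(q1, k1, q2, k2, v, lam: float):
    """Single-head differential attention on tensors of shape (batch, seq, d).

    Two independent query/key projections yield two softmax attention maps;
    their difference, weighted by the scalar lam, is applied to the values.
    """
    d = q1.size(-1)
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), dim=-1)
    return (a1 - lam * a2) @ v


# Toy usage with random tensors standing in for the learned projections.
batch, seq, d = 2, 16, 64
q1, k1, q2, k2, v = (torch.randn(batch, seq, d) for _ in range(5))
out = diff_attention(q1, k1, q2, k2, v, lam=0.8)
print(out.shape)  # torch.Size([2, 16, 64])
```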

Also refer to the PR for another implementation.

We recommend using models with a sufficiently large number of heads, since differential attention halves the head count relative to a conventional Transformer of the same size. For instance, using Diff Transformer with at least 8 heads (the minimum used in the paper, which matches the parameter count of a Transformer with 16 heads) is advisable.
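
To make the head-count bookkeeping concrete, here is a small illustrative calculation (the variable names are ours, not the repository's exact constructor arguments): each differential head uses two query/key sub-heads, so a differential layer needs only half as many heads to span the same embedding dimension.

```python
embed_dim = 2048

# Conventional multi-head attention: 16 heads of dimension 128 each.
conventional_heads = 16
conventional_head_dim = embed_dim // conventional_heads  # 128

# Differential attention: each head holds two q/k sub-heads of dimension 128,
# so 8 differential heads cover the same embed_dim with a comparable
# parameter count.
diff_heads = conventional_heads // 2                      # 8
diff_head_dim = embed_dim // diff_heads // 2              # 128
```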

Core Code