🚀 Feature Request
TransformerEngine provides advanced attention kernels, including support for FlashAttention-3 and low-precision (FP8) attention.
Motivation
Having TransformerEngine's attention available as an attn_impl option would be a great addition for H100 users, who could take advantage of those extra features.
[Optional] Implementation
This would require some changes to the MPT configuration and adding the new attention layer implementation; a rough sketch follows below.
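As a very rough illustration (not a proposal for the final API), the new layer could be a thin wrapper around TransformerEngine's `DotProductAttention`. All names here (`TEDotProductAttention`, an `attn_impl: te` config value) are hypothetical, and the exact TE constructor/forward arguments vary by TransformerEngine version, so this is just a sketch of the shape of the change:

```python
# Hypothetical sketch of a TransformerEngine-backed attention layer.
# Assumes NVIDIA TransformerEngine is installed; argument names may differ
# between TE versions, so treat this as illustrative only.
import torch

try:
    import transformer_engine.pytorch as te
except ImportError:
    te = None


class TEDotProductAttention(torch.nn.Module):
    """Thin wrapper around TE's fused attention (hypothetical attn_impl='te')."""

    def __init__(self, d_model: int, n_heads: int, attn_dropout: float = 0.0):
        super().__init__()
        if te is None:
            raise ImportError("transformer_engine is required for attn_impl='te'")
        self.inner = te.DotProductAttention(
            num_attention_heads=n_heads,
            kv_channels=d_model // n_heads,  # per-head dimension
            attention_dropout=attn_dropout,
            qkv_format='bshd',               # batch, seq, heads, head_dim
            attn_mask_type='causal',
        )

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q/k/v: (batch, seq, n_heads, head_dim). TE dispatches to its fused
        # kernels (FlashAttention-3 where available) internally.
        return self.inner(q, k, v)
```

On the config side this could then be selected via something like `attn_config: {attn_impl: te}` in the MPT model YAML (again, the value name is purely illustrative), falling back to the existing implementations when TransformerEngine is not installed.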
Additional context
I'm not yet sure whether I'll have time to work on the implementation myself, but I wanted to get the request and discussion out there for now. :)
There was a previous PR with a similar proposal here: #803