v0.5.0
✨ Highlights
The main new model in this release is MoBA / FlashMoBA (#840, #845), which brings Moonshot's Mixture of Block Attention into the fla family.
Beyond new models, fla is also moving from a Triton-only stack to a multi-backend one: FlashKDA is added as a new backend for KDA (#852), and TileLang is introduced for GDN, KDA, and parallel attention kernels (#827, #846, #854), with more backends to come.
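For readers new to MoBA, the idea can be sketched in a few lines of plain Python. This is a simplified illustration only, not the fla or FlashMoBA implementation: it assumes a single query at the last position, block scoring by the dot product with each block's mean key, and a current block that is always kept. The names `select_blocks` and `block_sparse_attention` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(q, keys, block_size, top_k):
    """Pick the key blocks a query attends to (query assumed at the last position)."""
    t = len(keys) - 1
    current = t // block_size
    scores = []
    for b in range(current):  # only fully past blocks are candidates
        block = keys[b * block_size:(b + 1) * block_size]
        dim = len(block[0])
        # Score a block by the query's dot product with its mean-pooled key.
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append((dot(q, mean_key), b))
    # Keep the best-scoring past blocks, then always add the current block.
    best = sorted(scores, reverse=True)[:max(top_k - 1, 0)]
    return sorted([b for _, b in best] + [current])

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Softmax attention restricted to the selected blocks, causal up to position t."""
    t = len(keys) - 1
    blocks = select_blocks(q, keys, block_size, top_k)
    positions = [
        p
        for b in blocks
        for p in range(b * block_size, min((b + 1) * block_size, t + 1))
    ]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[p]) * scale for p in positions]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]
```

With `top_k` at least the number of blocks, this reduces to ordinary causal attention for that query. Smaller values trade coverage for less work per query.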
What's Changed
- [CP] Fix missing bos and i_h offsets in backward gk loads by @zhiyuan1i in #781
- [KDA] Clarify gate input tracking in chunk backward by @zhiyuan1i in #785
- [GDN] Fuse kkt + solve_tril kernel & unified benchmark infrastructure by @yzhangcs in #789
- [Conv] Fix int32 overflow in conv kernel pointer arithmetic for large tensors by @tmct in #783
- [GDN] Add exp2 support across chunk kernels for improved performance by @yzhangcs in #791
- Fix parameter initialization for FSDP meta device compatibility by @yzhangcs in #793
- [GDN] Fix missing mask on off-diagonal blocks in fused kkt+s… by @yzhangcs in #794
- Fix layer_norm_bwd_kernel OOB access on high-SM GPUs by @mpurland in #795
- [Misc] Upgrade minimum PyTorch requirement to 2.7.0 by @zhiyuan1i in #801
- [Conv] Fix int32 overflow in varlen conv kernel pointer arithmetic by @tmct in #803
- [GDN] Add GVA support by @zhiyuan1i in #799
- [BugFix] Fix illegal memory access in KDA backward by dropping buggy autotune configs on Hopper by @zhiyuan1i in #807
- [CE] Add logit softcapping support to fused cross entropy by @yzhangcs in #810
- [Mamba] Remove unused arguments and update to align with mamba_ssm by @PuR3Luck in #782
- [GDN] Native GVA support: remove redundant Q/K repeat and unify head naming by @yzhangcs in #812
- [KDA] Add safe_gate/lower_bound support and improve docstrings by @yzhangcs in #814
- [GDN] Add fused gate kernel with use_gate_in_kernel support by @yzhangcs in #813
- chore: add AUTHORS, unify copyright headers, and add CI workflows by @yzhangcs in #816
- [CI] Improve benchmark outputs by @yzhangcs in #817
- [CI] Fix skip-test check failing on fork PRs by @zhiyuan1i in #821
- [CP] Enable KCP for DPLR by @zhiyuan1i in #822
- [Fix] Guard checkpoint weight re-initialization in RWKV-7, Mamba, Mamba2, and LogLinearMamba2 by @puigde in #820
- fix: register default global_scratch allocator on Blackwell GPUs by @ssubbotin in #825
- [Attn] Add sliding window attention support by @yzhangcs in #824
- [Docs] Add CONTRIBUTING.md by @yzhangcs in #830
- allow neg eigvals for delta-net by @hoedt in #832
- chore: add standalone isort config by @yzhangcs in #834
- Add autotune for causal conv update by @MARD1NO in #828
- [GDN] Add TileLang backend for chunk_bwd_dqkwg kernel by @zhiyuan1i in #827
- [Refactor] Simplify TileLang backend directory structure by @yzhangcs in #835
- [CI] Post benchmark comment via workflow_run for fork-safe PRs by @yzhangcs in #841
- [KDA] Add Grouped Value Attention (GVA) support by @yzhangcs in #833
- [Attn] Add GPT-OSS-style attention sink support by @Shomvel in #831
- [Fix] respect user-provided cu_seqlens when attention_mask is present by @yzhangcs in #842
- [Fix] flatten batched qkv in varlen cu_seqlens path by @lxr-tech in #839
- [Fix] Enforce batch-size check for varlen mode across multiple ops by @zhiyuan1i in #844
- [MoBA] Integrate MOBA and FlashMOBA by @ReyJerry in #840
- [MoBA] Follow-up: fix broken import, rename layer, add modeling, tests & docs by @yzhangcs in #845
- [GDN/KDA] Fuse gate activation into fused_recurrent kernels by @yzhangcs in #848
- [TileLang] Add fwd/bwd kernel for parallel attention by @zhiyuan1i in #846
- [CI] Add issue/PR pytest command workflow by @zhiyuan1i in #843
- [GDN] Optimize b_dg computation in chunk_bwd_kernel_dqkwg#USE_G by @MzeroMiko in #823
- [GDN][Tilelang] Optimize b_dg computation in chunk_bwd_kernel_dqkwg by @zhiyuan1i in #849
- [Linear Attention] Update fused_recurrent.py for inference with normalization by @yiyousong in #268
- [Fix] Fix incorrect cumsum dim for naive_chunk_linear_attn normalize by @zhiyuan1i in #851
- [Cleanup] Remove deprecated head_first parameter from public ops by @yzhangcs in #853
- [KDA] Support FLASHKDA backend by @zhiyuan1i in #852
- [KDA][TileLang] Add TileLang backend for chunk_kda_bwd_wy_dqkg_fused by @zhiyuan1i in #854
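As a rough illustration of one item in the list above, logit softcapping (#810) is commonly taken to mean squashing logits through a scaled tanh before the loss is computed. The sketch below assumes that form, `cap * tanh(z / cap)`, and uses plain Python rather than the fused kernel. The function names and the `cap` parameter are hypothetical and do not reflect fla's API.

```python
import math

def softcap(logits, cap):
    # Assumed form: bound every logit to [-cap, cap] with a scaled tanh.
    return [cap * math.tanh(z / cap) for z in logits]

def cross_entropy(logits, target, cap=None):
    """Unfused reference cross entropy for one example, with optional softcapping."""
    if cap is not None:
        logits = softcap(logits, cap)
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

For logits much smaller than `cap` the capped and uncapped losses are nearly identical, while very large logits are bounded, which limits how extreme the loss and its gradients can become.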
New Contributors
- @tmct made their first contribution in #783
- @mpurland made their first contribution in #795
- @PuR3Luck made their first contribution in #782
- @puigde made their first contribution in #820
- @ssubbotin made their first contribution in #825
- @hoedt made their first contribution in #832
- @MARD1NO made their first contribution in #828
- @Shomvel made their first contribution in #831
- @lxr-tech made their first contribution in #839
- @MzeroMiko made their first contribution in #823
- @yiyousong made their first contribution in #268
Full Changelog: v0.4.2...v0.5.0