v0.5.0

@yzhangcs yzhangcs released this 21 Apr 20:25
· 93 commits to main since this release
3a9ce1c

✨ Highlights

The main new model in this release is MoBA / FlashMoBA (#840, #845), which brings Moonshot's Mixture of Block Attention into the fla family.

Beyond new models, fla is also moving from a Triton-only stack to a multi-backend one: FlashKDA is added as a new backend for KDA (#852), and TileLang is introduced for GDN, KDA, and parallel attention kernels (#827, #846, #854), with more backends to come.

What's Changed

  • [CP] Fix missing bos and i_h offsets in backward gk loads by @zhiyuan1i in #781
  • [KDA] Clarify gate input tracking in chunk backward by @zhiyuan1i in #785
  • [GDN] Fuse kkt + solve_tril kernel & unified benchmark infrastructure by @yzhangcs in #789
  • [Conv] Fix int32 overflow in conv kernel pointer arithmetic for large tensors by @tmct in #783
  • [GDN] Add exp2 support across chunk kernels for improved performance by @yzhangcs in #791
  • Fix parameter initialization for FSDP meta device compatibility by @yzhangcs in #793
  • [GDN] Fix missing mask on off-diagonal blocks in fused kkt+s… by @yzhangcs in #794
  • Fix layer_norm_bwd_kernel OOB access on high-SM GPUs by @mpurland in #795
  • [Misc] Upgrade minimum PyTorch requirement to 2.7.0 by @zhiyuan1i in #801
  • [Conv] Fix int32 overflow in varlen conv kernel pointer arithmetic by @tmct in #803
  • [GDN] Add GVA support by @zhiyuan1i in #799
  • [BugFix] Fix illegal memory access in KDA backward by dropping buggy autotune configs on Hopper by @zhiyuan1i in #807
  • [CE] Add logit softcapping support to fused cross entropy by @yzhangcs in #810
  • [Mamba] Remove unused arguments and update to align with mamba_ssm by @PuR3Luck in #782
  • [GDN] Native GVA support: remove redundant Q/K repeat and unify head naming by @yzhangcs in #812
  • [KDA] Add safe_gate/lower_bound support and improve docstrings by @yzhangcs in #814
  • [GDN] Add fused gate kernel with use_gate_in_kernel support by @yzhangcs in #813
  • chore: add AUTHORS, unify copyright headers, and add CI workflows by @yzhangcs in #816
  • [CI] Improve benchmark outputs by @yzhangcs in #817
  • [CI] Fix skip-test check failing on fork PRs by @zhiyuan1i in #821
  • [CP] Enable KCP for DPLR by @zhiyuan1i in #822
  • [Fix] Guard checkpoint weight re-initialization in RWKV-7, Mamba, Mamba2, and LogLinearMamba2 by @puigde in #820
  • fix: register default global_scratch allocator on Blackwell GPUs by @ssubbotin in #825
  • [Attn] Add sliding window attention support by @yzhangcs in #824
  • [Docs] Add CONTRIBUTING.md by @yzhangcs in #830
  • allow neg eigvals for delta-net by @hoedt in #832
  • chore: add standalone isort config by @yzhangcs in #834
  • Add autotune for causal conv update by @MARD1NO in #828
  • [GDN] Add TileLang backend for chunk_bwd_dqkwg kernel by @zhiyuan1i in #827
  • [Refactor] Simplify TileLang backend directory structure by @yzhangcs in #835
  • [CI] Post benchmark comment via workflow_run for fork-safe PRs by @yzhangcs in #841
  • [KDA] Add Grouped Value Attention (GVA) support by @yzhangcs in #833
  • [Attn] Add GPT-OSS-style attention sink support by @Shomvel in #831
  • [Fix] respect user-provided cu_seqlens when attention_mask is present by @yzhangcs in #842
  • [Fix] flatten batched qkv in varlen cu_seqlens path by @lxr-tech in #839
  • [Fix] Enforce batch-size check for varlen mode across multiple ops by @zhiyuan1i in #844
  • [MoBA] Integrate MOBA and FlashMOBA by @ReyJerry in #840
  • [MoBA] Follow-up: fix broken import, rename layer, add modeling, tests & docs by @yzhangcs in #845
  • [GDN/KDA] Fuse gate activation into fused_recurrent kernels by @yzhangcs in #848
  • [TileLang] Add fwd/bwd kernel for parallel attention by @zhiyuan1i in #846
  • [CI] Add issue/PR pytest command workflow by @zhiyuan1i in #843
  • [GDN] Optimize b_dg computation in chunk_bwd_kernel_dqkwg#USE_G by @MzeroMiko in #823
  • [GDN][Tilelang] Optimize b_dg computation in chunk_bwd_kernel_dqkwg by @zhiyuan1i in #849
  • [Linear Attention] Update fused_recurrent.py for inference with normalization by @yiyousong in #268
  • [Fix] Fix incorrect cumsum dim for naive_chunk_linear_attn normalize by @zhiyuan1i in #851
  • [Cleanup] Remove deprecated head_first parameter from public ops by @yzhangcs in #853
  • [KDA] Support FLASHKDA backend by @zhiyuan1i in #852
  • [KDA][TileLang] Add TileLang backend for chunk_kda_bwd_wy_dqkg_fused by @zhiyuan1i in #854
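
For readers unfamiliar with the logit softcapping mentioned in #810, the idea can be sketched in plain Python. This is only an illustration and not fla's fused kernel: it assumes the commonly used tanh form `cap * tanh(logit / cap)`, and the names `softcap`, `cross_entropy`, and `cap` are hypothetical, chosen for this example rather than taken from the library.

```python
import math

def softcap(logits, cap):
    """Squash each logit smoothly into (-cap, cap) via cap * tanh(x / cap).

    Assumption: this is the widely used tanh form of softcapping; the
    fused kernel added in #810 may differ in detail.
    """
    return [cap * math.tanh(x / cap) for x in logits]

def cross_entropy(logits, target, cap=None):
    """Cross entropy for a single example, optionally softcapping first."""
    if cap is not None:
        logits = softcap(logits, cap)
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

if __name__ == "__main__":
    raw = [100.0, 0.0, -5.0]
    print(softcap(raw, 30.0))                # every value lies strictly inside (-30, 30)
    print(cross_entropy(raw, 1))             # loss on the uncapped logits
    print(cross_entropy(raw, 1, cap=30.0))   # loss after capping
```

Capping bounds how confident any single logit can be, so a badly wrong prediction incurs a smaller loss than it would uncapped; small logits pass through almost unchanged because tanh is close to the identity near zero.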

New Contributors

Full Changelog: v0.4.2...v0.5.0