Releases: Dao-AILab/flash-attention

fa4-v4.0.0.beta13

13 May 09:10
9bad4be

fa4-v4.0.0.beta13 Pre-release

What's Changed

Full Changelog: fa4-v4.0.0.beta12...fa4-v4.0.0.beta13

fa4-v4.0.0.beta12

06 May 08:57
2e53092

fa4-v4.0.0.beta12 Pre-release

What's Changed

  • Fix long MSVC linker commands on Windows by @jammm in #2517
  • Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
  • [Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach in by @v0i0 in #2510
  • benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
  • [FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
  • [hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
  • Deterministic backward for blocksparse impl by @drisspg in #2253

Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12

fa4-v4.0.0.beta11

29 Apr 08:53
ba59def

fa4-v4.0.0.beta11 Pre-release

What's Changed

Full Changelog: fa4-v4.0.0.beta10...fa4-v4.0.0.beta11

fa4-v4.0.0.beta10

22 Apr 08:43
3a7694c

fa4-v4.0.0.beta10 Pre-release

What's Changed

  • Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
  • Add CLC scheduler heuristic by @drisspg in #2455
  • expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
  • [Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
  • Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
  • Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
  • [Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
  • fix causal calcs by @drisspg in #2463
  • [cute,bwd] fix PDL race in bwd_preprocess, which was corrupting dpsum on SM90+ by @geruome in #2481

Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10

fa4-v4.0.0.beta9

15 Apr 08:41
628452c

fa4-v4.0.0.beta9 Pre-release

What's Changed

  • feat(cute): implement softcap backward pass, correct math formula, and resolve JIT cache bug by @CaesarG in #2402
  • [Cute,Sm100,Fwd] add MLA 64/512 with topk sparsity for MQA 128 heads by @jayhshah in #2441
  • Handle linter for flash mla file by @jayhshah in #2459

Full Changelog: fa4-v4.0.0.beta8...fa4-v4.0.0.beta9

fa4-v4.0.0.beta8

08 Apr 08:32
15270e6

fa4-v4.0.0.beta8 Pre-release

What's Changed

  • fix noisy logger by @drisspg in #2414
  • [AMD ROCm] Fix NaN in FMHA BWD when seq_q=0 by @rocking5566 in #2421
  • Add FA4 CI: GitHub Actions workflow with Apptainer on B200 runner by @Johnsonms in #2393
  • Fix some bugs of CI by @Johnsonms in #2423
  • [ROCM] Fix windows issues by @micmelesse in #2385
  • fix: add [cu13] extra to dev install instructions for CUDA 13 / B200 systems by @Johnsonms in #2430
  • Fix: disable 2-CTA backward mode when block_sparse_tensors is used by @jduprat in #2433
  • CI: extend FA4 test matrix with causal/non-causal correctness and fwd+bwd benchmark seqlen 1K-32K by @Johnsonms in #2428

Full Changelog: fa4-v4.0.0.beta7...fa4-v4.0.0.beta8

fa4-v4.0.0.beta7

01 Apr 08:35
f6a16e1

fa4-v4.0.0.beta7 Pre-release

What's Changed

Full Changelog: fa4-v4.0.0.beta5...fa4-v4.0.0.beta7

fa4-v4.0.0.beta6

25 Mar 08:21
6362bd3

fa4-v4.0.0.beta6 Pre-release

What's Changed

Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta6

fa4-v4.0.0.beta5

23 Mar 16:50
6362bd3

fa4-v4.0.0.beta5 Pre-release

What's Changed

Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta5

fa4-v4.0.0.beta4

05 Mar 18:02
