Releases: Dao-AILab/flash-attention
fa4-v4.0.0.beta13
What's Changed
- [ROCm Windows] fix build failed by @Apophis3158 in #2519
- [CuTe,Bwd,Sm100] don't disable 2cta due to cuda 12 in bwd by @reubenconducts in #2543
- [CuTe,Bwd] guard softcap for varlen backward by @reubenconducts in #2544
- [CuTe,Flex] varlen blocksparsity by @reubenconducts in #2224
- [FA4][hd256] Fix layout of non-contiguous qkv in backward kernel by @wangsiyu in #2545
- [Cute,Bwd,Sm100] fix incorrect calculation of n_block global max for bwd deterministic by @jayhshah in #2549
- fix varlen w/ paging split kv bug by @liangel-02 in #2550
New Contributors
- @Apophis3158 made their first contribution in #2519
Full Changelog: fa4-v4.0.0.beta12...fa4-v4.0.0.beta13
fa4-v4.0.0.beta12
What's Changed
- Fix long MSVC linker commands on Windows by @jammm in #2517
- Fix test_flash_attn_fast varlen call after qv positional insert by @henrylhtsang in #2527
- [Cute,Bwd,Sm90] Fix determinism for GQA, port Sm100 approach in by @v0i0 in #2510
- benchmarks/tune_ex2_emu: hd256 sweep support and clock lock/unlock by @Johnsonms in #2495
- [FA4][hd256] Backward TMA bulk-store epilogue + LSE/dpsum coalesce by @Johnsonms in #2497
- [hd256] Add TMA paged KV support to SM100 2CTA forward kernel by @Johnsonms in #2489
- Deterministic backward for blocksparse impl by @drisspg in #2253
Full Changelog: fa4-v4.0.0.beta11...fa4-v4.0.0.beta12
fa4-v4.0.0.beta11
What's Changed
- Feat([FA4][CUTE DSL]) Add head_dim=256 support (forward + backward) by @wangsiyu in #2412
- [Cute,hd256] Post-merge cleanup: dead code, duplicate imports by @Johnsonms in #2487
- [CuTe,Flex] Wire up interface for flex autograd support by @reubenconducts in #2485
- [CuTe,Flex] Add score_mod_bwd param to flash_attn_varlen_func by @reubenconducts in #2496
- fix: typos and missing comments in FA4 cute kernel files by @dxasu in #2502
- [SM100] Guard gO None in empty-tile correction by @geruome in #2504
- [CuTe, Flex] simplify blocksparse interface in flash_attn_func by @reubenconducts in #2506
- Fix: pass stream to SM100 MLA kernel by @MatthewBonanni in #2505
- Fix clc scheduling request bug by @drisspg in #2508
- [Tests,MLA] Close coverage gaps in test_flash_attn_mla_absorbed by @Johnsonms in #2483
- Add cache utils logging test by @drisspg in #2509
- [hd256] Improve forward kernel with exp2 FMA emulation (3% to 9% performance gain) by @Johnsonms in #2488
- SM90 FA4 QuACK 0.4 Compatibility by @EduardDurech in #2513
- ci: use /tmp for apptainer tmpdir to fix xattrerror on VAST by @Johnsonms in #2511
New Contributors
- @wangsiyu made their first contribution in #2412
- @dxasu made their first contribution in #2502
- @EduardDurech made their first contribution in #2513
Full Changelog: fa4-v4.0.0.beta10...fa4-v4.0.0.beta11
fa4-v4.0.0.beta10
What's Changed
- Disable 2CTA fwd non-causal on CUDA 12 to work around codegen regression by @Johnsonms in #2461
- Add CLC scheduler heuristic by @drisspg in #2455
- expose num_splits for FA2 and add option for kernel blocksize alignment by @liangel-02 in #2448
- [Cute,Fwd,Sm100] fp8 e4m3 and e5m2 support by @dcw02 in #2109
- Expose --pack-gqa and --num-splits in benchmark_attn.py by @Johnsonms in #2473
- Fix: pass num_splits through varlen_fwd Python wrapper (fixes #2448 regression) by @hsyysy in #2476
- [Cute,Fwd,Sm100] Fix the crash when seqlen_k=0 by @Johnsonms in #2470
- fix causal calcs by @drisspg in #2463
- [cute,bwd] fix PDL race in bwd_preprocess, which corrupts dpsum on SM90+ by @geruome in #2481
New Contributors
- @dcw02 made their first contribution in #2109
- @hsyysy made their first contribution in #2476
- @geruome made their first contribution in #2481
Full Changelog: fa4-v4.0.0.beta9...fa4-v4.0.0.beta10
fa4-v4.0.0.beta9
What's Changed
- feat(cute): implement softcap backward pass, correct math formula, and resolve JIT cache bug by @CaesarG in #2402
- [Cute,Sm100,Fwd] add MLA 64/512 with topk sparsity for MQA 128 heads by @jayhshah in #2441
- Handle linter for flash mla file by @jayhshah in #2459
Full Changelog: fa4-v4.0.0.beta8...fa4-v4.0.0.beta9
fa4-v4.0.0.beta8
What's Changed
- fix noisy logger by @drisspg in #2414
- [AMD ROCm] Fix NaN in FMHA BWD when seq_q=0 by @rocking5566 in #2421
- Add FA4 CI: GitHub Actions workflow with Apptainer on B200 runner by @Johnsonms in #2393
- Fix some bugs of CI by @Johnsonms in #2423
- [ROCM] Fix windows issues by @micmelesse in #2385
- fix: add [cu13] extra to dev install instructions for CUDA 13 / B200 systems by @Johnsonms in #2430
- Fix: disable 2-CTA backward mode when block_sparse_tensors is used by @jduprat in #2433
- CI: extend FA4 test matrix with causal/non-causal correctness and fwd+bwd benchmark seqlen 1K-32K by @Johnsonms in #2428
Full Changelog: fa4-v4.0.0.beta7...fa4-v4.0.0.beta8
fa4-v4.0.0.beta7
What's Changed
- fix: use LSE accum strides from params instead of hardcoded ones by @ZeronSix in #2388
- [Sm75] Add README link for initial Turing support by @ssiu in #2379
- [Cute,Sm100,Bwd] refine bwd swizzle for deterministic by @jayhshah in #2390
- Fix edge case when tag has no delta from previous by @drisspg in #2394
- [AMD ROCm] Update CK and add RDNA 3/4 support by @rocking5566 in #2400
- [Ai-assisted] CLC work stealing by @drisspg in #2218
- Various bug fixes / enable subtile > 2 by @drisspg in #2411
- Add to varlen by @drisspg in #2346
- Allow compact block sparse index tensors by @jduprat in #2417
Full Changelog: fa4-v4.0.0.beta5...fa4-v4.0.0.beta7
fa4-v4.0.0.beta6
What's Changed
- [Cute][Testing] Minor improvements on pytest-xdist workflow by @Alkaid-Benetnash in #2311
- Nicer headdim error message by @drisspg in #2227
- [Fwd,Sm100] Extract named barriers by @drisspg in #2309
- Change 2cta opt in to have min seqlen > 2*m_block_size by @drisspg in #2320
- [CuteDSL][SM90] varlen bwd works by @KareemMusleh in #2275
- Add Logging helper by @drisspg in #2327
- [CuTeDSL][Sm80] basic fix for new api by @zhuochenKIDD in #2297
- fix: duplicate softmax_scale param by @NanoCode012 in #2328
- Fix FA2 + FA4 co-existence by @drisspg in #2331
- [Cute,Sm100] Introduce a flexible lambda-based R2P masking by @Alkaid-Benetnash in #2313
- [Cute, SM90, bwd] Wire seqused_q/k through backward pass by @NJX-njx in #2315
- SM120 forward pass (Blackwell GeForce / DGX Spark) by @blake-snc in #2329
- [cutlass] Allow compilation of cutlass FA3 for sm100 via enable_sm90 by @henrylhtsang in #2332
- [Cute] fix: rename logging module to avoid circular import at building by @Luosuu in #2335
- BUG: SeqlenInfo.create has a tile parameter that defaults to 128 by @risan-raja in #2337
- [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by @MatthewBonanni in #2338
- [AMD] Migrate to Triton Backend to Aiter by @micmelesse in #2230
- [Bwd,Sm120] Add SM120 backward pass support by @blake-snc in #2330
- [Bwd, SM80] Fix tdKrdS typo by @henrylhtsang in #2341
- Add SM120 varlen attention support by @blake-snc in #2333
- fix the create_ragged_tensor_for_tma issue by @rainj-me in #2345
- [Sm90] Fix test_mask_mod and bwd block-sparse kwarg mismatch by @henrylhtsang in #2365
- [Cute, Testing] Fix aot + tvm-ffi EnvStream related parameter mismatch by @Alkaid-Benetnash in #2369
- [Cute, Testing] Bump cutedsl to 4.4.2 and remove prior aot cache management workarounds by @Alkaid-Benetnash in #2370
- [Cute] fix: FA4 paged attention kv load for DeepSeek (192,128) on SM100 by @Luosuu in #2368
- [AMD ROCm] Update ROCm/CK backend to align with latest ComposableKernel API changes by @rocking5566 in #2363
- [ROCm] Auto-detect Triton backend if C++ extension is missing by @Soddentrough in #2343
- [Fwd,Sm90] Add paged KV attention support (tma and cp.async) by @henrylhtsang in #2360
- [CuTe,Flex] limit vec_size to 2 for score mod when not on Sm100 by @reubenconducts in #2371
- Support 2CTA for sliding window hdim 192 by @Inodayy in #2347
- [Cute,Fwd,Sm100] support irregular qhead / kvhead ratios by @timmy-feng in #2186
- benchmarks: add MFU% column to benchmark output by @Johnsonms in #2377
- Update flow to enable beta weekly releases by @drisspg in #2378
New Contributors
- @NJX-njx made their first contribution in #2315
- @blake-snc made their first contribution in #2329
- @Luosuu made their first contribution in #2335
- @risan-raja made their first contribution in #2337
- @MatthewBonanni made their first contribution in #2338
- @rainj-me made their first contribution in #2345
- @Soddentrough made their first contribution in #2343
- @Inodayy made their first contribution in #2347
- @Johnsonms made their first contribution in #2377
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta6
fa4-v4.0.0.beta5
What's Changed
- [Cute][Testing] Minor improvements on pytest-xdist workflow by @Alkaid-Benetnash in #2311
- Nicer headdim error message by @drisspg in #2227
- [Fwd,Sm100] Extract named barriers by @drisspg in #2309
- Change 2cta opt in to have min seqlen > 2*m_block_size by @drisspg in #2320
- [CuteDSL][SM90] varlen bwd works by @KareemMusleh in #2275
- Add Logging helper by @drisspg in #2327
- [CuTeDSL][Sm80] basic fix for new api by @zhuochenKIDD in #2297
- fix: duplicate softmax_scale param by @NanoCode012 in #2328
- Fix FA2 + FA4 co-existence by @drisspg in #2331
- [Cute,Sm100] Introduce a flexible lambda-based R2P masking by @Alkaid-Benetnash in #2313
- [Cute, SM90, bwd] Wire seqused_q/k through backward pass by @NJX-njx in #2315
- SM120 forward pass (Blackwell GeForce / DGX Spark) by @blake-snc in #2329
- [cutlass] Allow compilation of cutlass FA3 for sm100 via enable_sm90 by @henrylhtsang in #2332
- [Cute] fix: rename logging module to avoid circular import at building by @Luosuu in #2335
- BUG: SeqlenInfo.create has a tile parameter that defaults to 128 by @risan-raja in #2337
- [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by @MatthewBonanni in #2338
- [AMD] Migrate to Triton Backend to Aiter by @micmelesse in #2230
- [Bwd,Sm120] Add SM120 backward pass support by @blake-snc in #2330
- [Bwd, SM80] Fix tdKrdS typo by @henrylhtsang in #2341
- Add SM120 varlen attention support by @blake-snc in #2333
- fix the create_ragged_tensor_for_tma issue by @rainj-me in #2345
- [Sm90] Fix test_mask_mod and bwd block-sparse kwarg mismatch by @henrylhtsang in #2365
- [Cute, Testing] Fix aot + tvm-ffi EnvStream related parameter mismatch by @Alkaid-Benetnash in #2369
- [Cute, Testing] Bump cutedsl to 4.4.2 and remove prior aot cache management workarounds by @Alkaid-Benetnash in #2370
- [Cute] fix: FA4 paged attention kv load for DeepSeek (192,128) on SM100 by @Luosuu in #2368
- [AMD ROCm] Update ROCm/CK backend to align with latest ComposableKernel API changes by @rocking5566 in #2363
- [ROCm] Auto-detect Triton backend if C++ extension is missing by @Soddentrough in #2343
- [Fwd,Sm90] Add paged KV attention support (tma and cp.async) by @henrylhtsang in #2360
- [CuTe,Flex] limit vec_size to 2 for score mod when not on Sm100 by @reubenconducts in #2371
- Support 2CTA for sliding window hdim 192 by @Inodayy in #2347
- [Cute,Fwd,Sm100] support irregular qhead / kvhead ratios by @timmy-feng in #2186
- benchmarks: add MFU% column to benchmark output by @Johnsonms in #2377
- Update flow to enable beta weekly releases by @drisspg in #2378
New Contributors
- @NJX-njx made their first contribution in #2315
- @blake-snc made their first contribution in #2329
- @Luosuu made their first contribution in #2335
- @risan-raja made their first contribution in #2337
- @MatthewBonanni made their first contribution in #2338
- @rainj-me made their first contribution in #2345
- @Soddentrough made their first contribution in #2343
- @Inodayy made their first contribution in #2347
- @Johnsonms made their first contribution in #2377
Full Changelog: fa4-v4.0.0.beta4...fa4-v4.0.0.beta5
fa4-v4.0.0.beta4
Full Changelog: fa4-v4.0.0.beta2...fa4-v4.0.0.beta4