Releases · ROCm/aiter
v0.1.7
What's Changed
- add concat_and_cache_mla kernel by @yzhou103 in #1194
- Add triton_metadata_redirect module with with_metadata_path decorator by @jwu10003 in #1172
- [bug] fix qr when variable input by @lihaoyang-amd in #1191
- fix test_concat_cache_mla by @yzhou103 in #1198
- Update vllm_benchmark.yaml to use TW cluster to build vllm image by @gyohuangxin in #1200
- add hipblaslt swizzle feature by @yixionghuo in #1169
- mha fwd v3 gfx950 support dim_q=192 dim_v=128 by @minmengdie in #1188
- Fix FA cpp api multi target build by @slippedJim in #1196
- CI: parallel build Aiter whl packages for Python 3.10 and 3.12 by @gyohuangxin in #1204
- CI: Move some tests back to TW cluster by @gyohuangxin in #1199
- CI: Add timeout and retry when installing the python deps by @gyohuangxin in #1210
- add aiter namespace to rope by @valarLip in #1202
- update test_gemm_a16w16 by @amd-ruitang3 in #1205
- [fea]: custom all gather by @TennyWang1223 in #1207
- [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels by @brunomazzottiamd in #1184
- Tune gemm op bf16 by @yzhou103 in #1190
- Add mha varlen fake for different from mha by @ZhangLirong-amd in #1214
- CI: Use the pre-built sglang image instead of building the sglang image from source. by @gyohuangxin in #1224
- [TRITON] Fix fp8 bmm op unit test bug on MI350 by @lucas-santos-amd in #1219
- CI: Change the image when building the Aiter release python3.12 package by @gyohuangxin in #1225
- CI: Add runner selection to workflow for flexible build host choice in Aiter release CI by @gyohuangxin in #1229
- [CK_TILE] FMHA BWD Optimizations for D48 for GFX950 by @DDEle in #1180
- Remove FA cpp api dependence of pandas by @slippedJim in #1230
- CI: Add unique tag after the names of uploaded packages by @gyohuangxin in #1236
- A8w8 asm codegen and tune by @yzhou103 in #1161
- Opt concat cache mla by @yzhou103 in #1239
- avoid copy ck by @valarLip in #1247
- CI: Add GPU_ARCH options in Aiter release CI by @gyohuangxin in #1253
- [MI35x] fix gfx950 build error by @yzhou103 in #1250
- update bf16 gemm asm by @amd-ruitang3 in #1238
- CI: Fix issues in Aiter release CI by @gyohuangxin in #1255
- add ninja into requirements by @amd-ruitang3 in #1254
- Wrapper gemm to fix get_config lru cache break by @ZhangLirong-amd in #1249
- Fix build bdist wheel error by @yzhou103 in #1256
- Added in GQA and 64-bit indexing by @kesavanramakrishnan in #1226
- Refine ca by @valarLip in #1263
- update mi308 fmoe fp16 asm, MI35x by @amd-ruitang3 in #1201
- Introducing opus by @carlushuang in #1264
- ca_refactor_fix by @valarLip in #1268
- fix_fp4_quant_dtype by @lalala-sh in #1271
- add sample use outer exponential by @junhaha666 in #1267
- Fix rowwise a8w8 gemm in swizzled hipb_mm by @kliuae-amd in #1258
- CI: Use TW cluster to run sglang tests by @gyohuangxin in #1273
- Catchall PR for all 355_wip related changes by @azaidy in #1148
- [MI35X] fix core check by @amd-ruitang3 in #1276
- Refactor gemm bf16 tuner by @yzhou103 in #1275
- CI: Operators tuning pipelines by @gyohuangxin in #1163
- fix the problem that v3's performance is worse than ck's by @minmengdie in #1237
- fix the python mha test run_perftest error by @minmengdie in #1277
- Reuse custom decorator in core and torch guard by @ZhangLirong-amd in #1278
- integrate deep gemm by @lalala-sh in #1265
- add a tuned fp4 gemm ds config and insert entries in untuned config by @hongxiayang in #1243
- Enable large batch size and optimization of non-Ragged batching by @valechen in #1269
- added a few more fw ds f4 untuned and tuned shapes in order to use asm module or kernel by @hongxiayang in #1298
- CI: Optimize autotuning pipeline and initialize the docs by @gyohuangxin in #1286
- topk per row kernel by @ukannika in #1262
- fix aot by @fsx950223 in #1279
- Fix ATOM fp8 model quant fail issue in torch compile by @ZhangLirong-amd in #1299
- feat - pa_fwd support block map with stride in num_kv_heads_dim by @alibaba-miji in #1301
- Fix how to update accumulator for dot_scaled by @zhanglx13 in #1297
- CI: Optimize autotuning pipeline docs by @gyohuangxin in #1300
- Fix the lint issue by @gyohuangxin in #1307
- fix fwd perf calc error by @minmengdie in #1305
- add the asm kernel performance of fwd and bwd by @minmengdie in #1270
- Fused TopK and Sigmoid kernel by @samremes in #1251
- Ar rms by @TennyWang1223 in #1290
- Dsv32 cache by @junhaha666 in #1314
- Fix displaying supported architectures by @HollowMan6 in #1316
- using standalone pybind by @valarLip in #1317
- Enable mha bwd hd192_hd128 by @slippedJim in #1308
- CI: Add pre-check status check by @gyohuangxin in #1252
- [CK_TILE] fmha: Add backward pass support for padded inputs by @Jeff-Huang in #1212
- Mla splitkv enhance split alg inte by @valarLip in #1233
- Fix gemm tuner error mi350 by @yzhou103 in #1313
- CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton by @gyohuangxin in #1325
- Fix global variable torch_fp8 initialization caused issue by @huizhougit in #1322
- [FEAT] [Triton] Add transpose scale to the triton fused_rms_fp8_group_quant by @tjtanaa in #1291
- [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning by @k50112113 in #1315
- Kernel naming: add reusable constexpr repr helper and test it with gemm_a16w16 by @Boss2002n in #1260
- Merge tuned file by @yzhou103 in #1327
- fix graph_breaks by return tensor for bool op by @ZhangLirong-amd in #1333
- fix_bf16gemm_asm by @amd-ruitang3 in #1329
- Improve Memory Usage in MLA by @ruanjm in #1338
- fix tune error caused by merge tuned_file by @yzhou103 in #1342
- rm rocblas op in aiter by @yzhou103 in #1337
- [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant by @k50112113 in #1328
- Add block_m=16 for a8w8_ck_moe_blockscale by @huaiguxu in #1081
- Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel by @farlukas in #1330
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in #1283
- Temporarily run aiter standard and multigpu tests on the TW cluster; will switch back once the mirror registry is ready. by @gyohuangxin in #1359
- [Triton] Disable failing lean attention tests by @cagrikymk in #1357
- [Fix] update ck to fix fp4 gemm issue by @gino-lu in #1361
- add config by @valarLip in #1355
- add how_v3_bf16_cvt control to the Python API by @minmengdie in #1351
- [fix]: car 6 rank coredump by @TennyWang1223 in #1335
- Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic by @ZhangLirong-amd in #1348
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in https://github.com/ROCm/aiter/pull...
v0.1.5 release
What's Changed
- Update gfx942 FA fwd kernel by @slippedJim in #648
- Fix Precision Issue in RoPE Tests by @ruanjm in #627
- [TRITON]: add json config and refactor by @rahulbatra85 in #595
- [TRITON] Refactor Triton RMSNorm and LayerNorm unit tests by @lucas-santos-amd in #598
- [TRITON]: Add Triton PodAttention by @valechen in #651
- Update MI300 FA fwd kernel by @slippedJim in #655
- update moe sorting and CK by @junhaha666 in #660
- refactor by @fsx950223 in #664
- [Triton] DS fused custom ops by @k50112113 in #607
- Fix ck_gemm_a4w4_blockscale tune with splitK by @ukannika-amd in #653
- add fmoe_int8_g1u1_smf_subGU_256 by @valarLip in #667
- Add option to choose between CK RMSNorm pipelines by @ClementLinCF in #647
- Update CK by @poyenc in #669
- edit gemm_a4w8_asm api by @junhaha666 in #672
- Optimize the topK Softmax kernel to reduce one round of topK reduce (idea by Cui Cu) by @junhaha666 in #673
- Remove dpad==dvpad limit in CK FA bwd codegen by @slippedJim in #677
- [TRITON]: Benchmarking scripts updates by @willzhou-amd in #650
- [TRITON]: Adding Lean + Paged Attention, for decode by @alexdutu in #376
- [TRITON] Tune fp4xfp4 GEMM by @willzhou-amd in #641
- slice acc into two parts to reduce vgpr usage by @xiaohuguo2023 in #659
- fix gemm a4w4 compile issue by @rocking5566 in #681
- FA bwd asm kernel update by @slippedJim in #679
- Gemm a8w8 bpreshuffle api fix by @junhaha666 in #682
- Refine FA impl by @slippedJim in #683
- fix fmoe a8w8 ck stage2 not supporting inter_dim % 256 = 0 by @junhaha666 in #684
- [TRITON]: add hstu attn op to aiter by @scxiao in #629
- add support for load json.gz by @valarLip in #687
- add blockscale ps asm moe by @junhaha666 in #624
- Pa fp8 mfma by @fsx950223 in #694
- [fea]: new kernel for allreduce optimize by @TennyWang1223 in #699
- fmoe_codegen_asm by @amd-ruitang3 in #690
- add moe_fuse_gate_topK from sglang by @junhaha666 in #700
- fix prebuild file path by @fsx950223 in #692
- [TRITON]: Add benchmark test for leanAttention by @valechen in #688
- [TRITON] Add LayerNorm Backward Triton Kernels by @lucas-santos-amd in #546
- [TRITON] Add Torch unit test reference to PA Prefill Triton Kernels by @lucas-santos-amd in #676
- [TRITON]: Add missing GEMM benchmarks by @willzhou-amd in #680
- A4w4_asm_pro by @zufayu in #649
- fix topk bug by @junhaha666 in #708
- Fix swa condition in FA bwd v3 api by @slippedJim in #707
- use ck_tile::get_warp_size() by @junhaha666 in #710
- fix bug in splitK select by @zufayu in #717
- enable gemm_a4w4 asm kernel to tune splitk by @yzhou103 in #662
- refine moe by @valarLip in #701
- [TRITON]: extend attention bf16 text fix by @Chi-Chu319 in #705
- [Bugfix] Skinny GEMM in tuned gemm.py: add output conversion to tuned_gemm.mm by @vllmellm in #665
- [TRITON]: Add logging to GEMM ops by @rahulbatra85 in #722
- [TRITON] Shaoclee/ds mxfp4 gemm tune by @k50112113 in #693
- [TRITON] shaoclee/triton gemm a8w8 dev by @k50112113 in #709
- [TRITON]: enable buffer ops for lean attention by @xiaohuguo2023 in #725
- update ptpc bpreshuffle gemm tune by @valarLip in #719
- Try to get cu num from env first by @slippedJim in #739
- [fea]: new ar interface by @TennyWang1223 in #750
- A4w4_asm_pro_max_v2 by @zufayu in #741
- asm_fmoe_codegen by @amd-ruitang3 in #702
- Fix fmha codegen when pip install aiter by @slippedJim in #734
- Add sglang ci tests by @gyohuangxin in #735
- [TRITON]: LeanAttention implement loop unrolling to reduce VGPR usage by @valechen in #744
- increase build core num by @valarLip in #730
- [TRITON] mha benchmark fix by @Chi-Chu319 in #748
- fix conflict between AITER_REBUILD and gen_func by @valarLip in #761
- add more bpreshuffle instances by @solinzby1 in #747
- fix random precision issues: 192/224x256 tile asm .so files by @zufayu in #751
- [TRITON]: MLA and Lean Attention updates by @willzhou-amd in #720
- [TRITON]: Add fused GEMMs to optimize FF block by @willzhou-amd in #736
- [TRITON]: Clear cache allocator in Triton tests by @rahulbatra85 in #743
- mdf_UT_args by @amd-ruitang3 in #752
- Enable custom op and avoid graph breaks by @ZhangLirong-amd in #740
- Create docs folder and the doc 'Build and Run the Aiter Container as a Non-root User' by @gyohuangxin in #760
- fix quant_type=1x128 (128x128) can't use tuned_fmoe cfg by @junhaha666 in #758
- add prebuild options in ck_moe by @lalala-sh in #732
- optimize test args by @amd-ruitang3 in #768
- [TRITON]: Add logging info to Triton Kernels by @rahulbatra85 in #729
- fix multiprocess tuning problem by @yzhou103 in #733
- add layout limitation for FA fwd v3 by @slippedJim in #764
- Sampling by @fsx950223 in #727
- Fix issues in sglang ci test when it's from a forked repo. by @gyohuangxin in #769
- Support torch.library.infer_schema for torch < 2.5 by @ZhangLirong-amd in #773
- Fix FA fwd asm limitation by @slippedJim in #782
- LeanAttention code modularization by @valechen in #765
- fix arg parser in pa_v1.py main entry by @842974287 in #772
- fix missing-braces warning during compilation by @842974287 in #770
- Fix MHA build failed by @ZhangLirong-amd in #787
- Wrapper import torch to avoid build issue by @ZhangLirong-amd in #780
- Add assert to prevent users from forgetting to return lse for training by @rocking5566 in #776
- fix test_rmsnorm2dFusedAddQuant.py --mode 3 by @valarLip in #794
- Make Gemm and other ops return Tensor and fix graph breaks by @ZhangLirong-amd in #783
- Batch gemm tuning in parallel by @yzhou103 in #711
- fix typehint for rmsnorm2d_fwd_with_add_smoothquant by @valarLip in #796
- Fix issues in sglang test by @gyohuangxin in #800
- Add receipt for pytorch by @alugorey in #791
- [TRITON]: Benchmarking changes for performance CI by @willzhou-amd in #762
- fix ep test by @valarLip in #799
- [TRITON] Add Chunked PA Prefill Triton Kernel by @lucas-santos-amd in #745
- update ck and compiler to c++20 by @rocking5566 in #803
- update aiter param supported arguments configuration in readme by @minmengdie in #789
- Enable FA multi target build by @slippedJim in #774
- Optimize topksoftmax: top-K-only softmax + 32B vector loads by @CuiCu-618 in #804
- Fix get_num, gfx, get_padded_m and other breaks in dynamo by @ZhangLirong-amd in #797
- [fix]: fix ar 1stage sync error by @TennyWang1223 in #807
- update CK to fix fa fwd build error by @slippedJim in #810
- Fix issues in Triton Test by @gyohuangxin in #813
- LeanAttention optimization by @valechen in https://github.com/R...
v0.1.4 July release
- mxfp4 enabled for gfx950, including GEMM, MoE, and per-1x32 quant (see the sketch after this list)
- multi-GPU tuning enabled for most kinds of GEMMs
- fp8 all-reduce
- a number of Triton kernels
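
As a rough illustration of the per-1x32 layout mentioned above: MXFP4 (OCP microscaling) stores blocks of 32 values that share one power-of-two scale, with each value rounded onto the FP4 E2M1 grid. The PyTorch sketch below is conceptual only and is not the aiter kernel or its API; `quant_mxfp4_1x32` and `FP4_E2M1_GRID` are hypothetical names for this example, and it assumes the input's element count is a multiple of 32.

```python
# Conceptual sketch of per-1x32 MXFP4 (OCP microscaling) fake-quantization.
# NOT the aiter kernel/API: hypothetical names, illustration only.
# Each block of 32 contiguous values shares one power-of-two (E8M0-style) scale;
# the scaled values are rounded to the FP4 E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
import torch

FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_mxfp4_1x32(x: torch.Tensor):
    """Fake-quantize x (numel divisible by 32) with 1x32 block scales."""
    grid = FP4_E2M1_GRID.to(device=x.device)
    blocks = x.float().reshape(-1, 32)
    # Power-of-two scale per block so the block max lands at 6.0 (E2M1 max).
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 6.0)))
    scaled = blocks / scale
    # Round each magnitude to the nearest representable E2M1 value, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * torch.sign(scaled)
    # Return dequantized values and the per-block scales.
    return (q * scale).reshape_as(x), scale

# Example: x = torch.randn(4, 64); x_dq, scales = quant_mxfp4_1x32(x)
```

The real kernels pack the 4-bit values and store the block exponents separately; the sketch only fake-quantizes in float to show where the 1x32 grouping and shared scale come in.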
What's Changed
- [TRITON] Add Triton Topk Kernel by @hubertlu-tw in #458
- Find executable in rocm home when not found in PATH by @xli in #549
- [TRITON]: Disable int4 moe UT by @rahulbatra85 in #563
- add a4w4 asm_moe by @valarLip in #482
- Improved detection of setup.py install by @ekuznetsov139 in #534
- Disable mha related modules in prebuild by @slippedJim in #567
- Fix format error in .clang-format by @poyenc in #568
- update pa asm by @amd-ruitang3 in #553
- [TRITON]: Reorg mha code and use common fp8 type by @rahulbatra85 in #561
- [TRITON]: Gemm refactor by @rahulbatra85 in #558
- [Triton]: Add has_attr check in get_config by @rahulbatra85 in #572
- [TRITON]: GEMM updates for DS by @rahulbatra85 in #573
- update_codegen by @amd-ruitang3 in #581
- mi350_pa by @amd-ruitang3 in #579
- Change input tensor format to [B,S,H,d] and add batch support for causal by @valechen in #578
- update tune config file by @solinzby1 in #569
- [TRITON] Add RMSNorm bwd Triton Kernels by @lucas-santos-amd in #576
- fix prebuild by @junhaha666 in #592
- [TRITON]: Quantization updates(add int8 and use common fp8 dtypes) by @rahulbatra85 in #588
- Dispatch combine by @junhaha666 in #571
- update args by @amd-ruitang3 in #590
- Pa rocm refresh4 by @fsx950223 in #591
- [update]: update all-reduce by @TennyWang1223 in #552
- Fix compile error in MI350 with ROCm7 by @rocking5566 in #599
- new codegen for elementwise by @TennyWang1223 in #585
- [fix]: elementwise prebuild slow by @TennyWang1223 in #609
- [TRITON]: Fp4gemm m=256 tuning by @Chi-Chu319 in #533
- add MI350 support for skinny_gemm by @yanguahe in #602
- Fix prebuild 350 by @junhaha666 in #608
- [fix]: change ar namespace by @TennyWang1223 in #611
- compile flag clean up by @valarLip in #615
- DIY_args by @amd-ruitang3 in #596
- fix NUM_Q_HEADS - 1 in remap_xcd in _attn_fwd by @juuso-oskari in #612
- add ck gemm a4w4 blockscale with splitK support by @ukannika-amd in #603
- [TRITON]: pid grid fix by @Chi-Chu319 in #618
- Refine ck instance and update a8w8_bpreshuffle_tuned_gemm.csv by @solinzby1 in #621
- merge moe from 350 launch by @lalala-sh in #580
- Remove seqlen limit on FA fwd kernel by @slippedJim in #622
- [Triton] RoPE dev by @k50112113 in #606
- [TRITON]: Fix num_warps typo which was causing performance issues by @valechen in #604
- Topksoftmax_opt by @junhaha666 in #626
- update hip quant for corner case by @valarLip in #633
- [TRITON]: use int64 strides by default for MHA by @rahulbatra85 in #634
- [TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) by @willzhou-amd in #597
- [TRITON] Add Softmax Triton Kernel by @lucas-santos-amd in #605
- Enable gfx942 FA fwd asm kernels by @slippedJim in #619
- Update CK by @poyenc in #635
- Fix error message for rocminfo by @Rohan138 in #636
- [TRITON]: Moe tuning mi350 by @Chi-Chu319 in #610
- Fix test_pa_ragged.py use_alibi=True test cases by @poyenc in #639
- Fix FA fwd nan issue by @slippedJim in #646
- fix for fp8 e4m3fn by @valarLip in #640
- [TRITON]: Kernel benchmarking improvements (for op_benchmarks/triton) by @willzhou-amd in #594
- [Triton]: Disable fused+causal for MHA bkwd by @rahulbatra85 in #642
- enable parallel tuning on CK kernels by @yzhou103 in #625
- Pa fix2 by @fsx950223 in #645
- Update dependencies and add backup for unknown hw by @kunaltyagi in #623
- Optimize topksoftmax WARPS_PER_TB for higher occupancy and remove redundant precision conversion by @CuiCu-618 in #652
New Contributors
- @hubertlu-tw made their first contribution in #458
- @xli made their first contribution in #549
- @ekuznetsov139 made their first contribution in #534
- @valechen made their first contribution in #578
- @willzhou-amd made their first contribution in #597
- @Rohan138 made their first contribution in #636
- @yzhou103 made their first contribution in #625
- @kunaltyagi made their first contribution in #623
- @CuiCu-618 made their first contribution in #652
Full Changelog: v0.1.3...v0.1.4