Releases · ROCm/aiter
v0.1.7
What's Changed
- add concat_and_cache_mla kernel by @yzhou103 in #1194
- Add triton_metadata_redirect module with with_metadata_path decorator by @jwu10003 in #1172
- [bug] fix qr when variable input by @lihaoyang-amd in #1191
- fix test_concat_cache_mla by @yzhou103 in #1198
- Update vllm_benchmark.yaml to use TW cluster to build vllm image by @gyohuangxin in #1200
- add hipblaslt swizzle feature by @yixionghuo in #1169
- mha fwd v3 gfx950 support dim_q=192 dim_v=128 by @minmengdie in #1188
- Fix FA cpp api multi target build by @slippedJim in #1196
- CI: parallel build Aiter whl packages for Python 3.10 and 3.12 by @gyohuangxin in #1204
- CI: Move some tests back to TW cluster by @gyohuangxin in #1199
- CI: Add timeout and retry when installing the python deps by @gyohuangxin in #1210
- add aiter namespace to rope by @valarLip in #1202
- update test_gemm_a16w16 by @amd-ruitang3 in #1205
- [fea]: custom all gather by @TennyWang1223 in #1207
- [TRITON] Add Positional Encoding (PE) support to Triton MHA kernels by @brunomazzottiamd in #1184
- Tune gemm op bf16 by @yzhou103 in #1190
- Add mha varlen fake for different from mha by @ZhangLirong-amd in #1214
- CI: Use the pre-built sglang image instead of building the sglang image from source. by @gyohuangxin in #1224
- [TRITON] Fix fp8 bmm op unit test bug on MI350 by @lucas-santos-amd in #1219
- CI: Change the image when building the Aiter release python3.12 package by @gyohuangxin in #1225
- CI: Add runner selection to workflow for flexible build host choice in Aiter release CI by @gyohuangxin in #1229
- [CK_TILE] FMHA BWD Optimizations for D48 for GFX950 by @DDEle in #1180
- Remove FA cpp api dependence of pandas by @slippedJim in #1230
- CI: Add unique tag after the names of uploaded packages by @gyohuangxin in #1236
- A8w8 asm codegen and tune by @yzhou103 in #1161
- Opt concat cache mla by @yzhou103 in #1239
- avoid copy ck by @valarLip in #1247
- CI: Add GPU_ARCH options in Aiter release CI by @gyohuangxin in #1253
- [MI35x] fix gfx950 build error by @yzhou103 in #1250
- update bf16 gemm asm by @amd-ruitang3 in #1238
- CI: Fix issues in Aiter release CI by @gyohuangxin in #1255
- add ninja into requirements by @amd-ruitang3 in #1254
- Wrapper gemm to fix get_config lru cache break by @ZhangLirong-amd in #1249
- Fix build bdist wheel error by @yzhou103 in #1256
- Added in GQA and 64-bit indexing by @kesavanramakrishnan in #1226
- Refine ca by @valarLip in #1263
- update mi308 fmoe fp16 asm, MI35x by @amd-ruitang3 in #1201
- Introducing opus by @carlushuang in #1264
- ca_refactor_fix by @valarLip in #1268
- fix_fp4_quant_dtype by @lalala-sh in #1271
- add sample use outer exponential by @junhaha666 in #1267
- Fix rowwise a8w8 gemm in swizzled hipb_mm by @kliuae-amd in #1258
- CI: Use TW cluster to run sglang tests by @gyohuangxin in #1273
- Catchall PR for all 355_wip related changes by @azaidy in #1148
- [MI35X] fix core check by @amd-ruitang3 in #1276
- Refactor gemm bf16 tuner by @yzhou103 in #1275
- CI: Operators tuning pipelines by @gyohuangxin in #1163
- fix the problem that v3's performance is worse than ck's by @minmengdie in #1237
- fix the python mha test run_perftest error by @minmengdie in #1277
- Reuse custom decorator in core and torch guard by @ZhangLirong-amd in #1278
- integrate deep gemm by @lalala-sh in #1265
- add a tuned fp4 gemm ds config and insert entries in untuned config by @hongxiayang in #1243
- Enable large batch size and optimization of non-Ragged batching by @valechen in #1269
- added a few more fw ds f4 untuned and tuned shapes in order to use asm module or kernel by @hongxiayang in #1298
- CI: Optimize autotuning pipeline and initialize the docs by @gyohuangxin in #1286
- topk per row kernel by @ukannika in #1262
- fix aot by @fsx950223 in #1279
- Fix ATOM fp8 model quant fail issue in torch compile by @ZhangLirong-amd in #1299
- feat - pa_fwd support block map with stride in num_kv_heads_dim by @alibaba-miji in #1301
- Fix how to update accumulator for dot_scaled by @zhanglx13 in #1297
- CI: Optimize autotuning pipeline docs by @gyohuangxin in #1300
- Fix the lint issue by @gyohuangxin in #1307
- fix fwd perf calc error by @minmengdie in #1305
- add the asm kernel performance of fwd and bwd by @minmengdie in #1270
- Fused TopK and Sigmoid kernel by @samremes in #1251
- Ar rms by @TennyWang1223 in #1290
- Dsv32 cache by @junhaha666 in #1314
- Fix displaying supported architectures by @HollowMan6 in #1316
- using standalone pybind by @valarLip in #1317
- Enable mha bwd hd192_hd128 by @slippedJim in #1308
- CI: Add pre-check status check by @gyohuangxin in #1252
- [CK_TILE] fmha: Add backward pass support for padded inputs by @Jeff-Huang in #1212
- Mla splitkv enhance split alg inte by @valarLip in #1233
- Fix gemm tuner error mi350 by @yzhou103 in #1313
- CI: Skip triton setup in Aiter standard/multigpu tests and add retries when setting up triton by @gyohuangxin in #1325
- Fix global variable torch_fp8 initialization caused issue by @huizhougit in #1322
- [FEAT] [Triton] Add transpose scale to the triton fused_rms_fp8_group_quant by @tjtanaa in #1291
- [Triton] 355 wip Llama FP4 triton fusion + TP8 triton decode shape tunning by @k50112113 in #1315
- Kernel naming: add reusable constexpr repr helper and test it with gemm_a16w16 by @Boss2002n in #1260
- Merge tuned file by @yzhou103 in #1327
- fix graph_breaks by return tensor for bool op by @ZhangLirong-amd in #1333
- fix_bf16gemm_asm by @amd-ruitang3 in #1329
- Improve Memory Usage in MLA by @ruanjm in #1338
- fix tune error caused by merge tuned_file by @yzhou103 in #1342
- rm rocblas op in aiter by @yzhou103 in #1337
- [Triton] DS a16w8 GEMM and fused reduce_rms_fp8_group_quant by @k50112113 in #1328
- Add block_m=16 for a8w8_ck_moe_blockscale by @huaiguxu in #1081
- Add Fused RMSNorm + FP8 Per-tensor Static Quantization Triton Kernel by @farlukas in #1330
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in #1283
- Temporarily run aiter standard and multigpu tests on the TW cluster; will switch back once the mirror registry is ready. by @gyohuangxin in #1359
- [Triton] Disable failing lean attention tests by @cagrikymk in #1357
- [Fix] update ck to fix fp4 gemm issue by @gino-lu in #1361
- add config by @valarLip in #1355
- add how_v3_bf16_cvt control to the Python API by @minmengdie in #1351
- [fix]: car 6 rank coredump by @TennyWang1223 in #1335
- Wrapper_flash_attn_backward custom op to avoid functionalize fallback and fix guard logic by @ZhangLirong-amd in #1348
- [TRITON] GEMM kernels nomenclature changes by @Boss2002n in https://github.com/ROCm/aiter/pull...
v0.1.5 release
What's Changed
- Update gfx942 FA fwd kernel by @slippedJim in #648
- Fix Precision Issue in RoPE Tests by @ruanjm in #627
- [TRITON]: add json config and refactor by @rahulbatra85 in #595
- [TRITON] Refactor Triton RMSNorm and LayerNorm unit tests by @lucas-santos-amd in #598
- [TRITON]: Add Triton PodAttention by @valechen in #651
- Update MI300 FA fwd kernel by @slippedJim in #655
- update moe sorting and CK by @junhaha666 in #660
- refactor by @fsx950223 in #664
- [Triton] DS fused custom ops by @k50112113 in #607
- Fix ck_gemm_a4w4_blockscale tune with splitK by @ukannika-amd in #653
- add fmoe_int8_g1u1_smf_subGU_256 by @valarLip in #667
- Add option to choose between CK RMSNorm pipelines by @ClementLinCF in #647
- Update CK by @poyenc in #669
- edit gemm_a4w8_asm api by @junhaha666 in #672
- Optimize the topK Softmax kernel to reduce one round of topK reduce (idea by Cui Cu) by @junhaha666 in #673
- Remove dpad==dvpad limit in CK FA bwd codegen by @slippedJim in #677
- [TRITON]: Benchmarking scripts updates by @willzhou-amd in #650
- [TRITON]: Adding Lean + Paged Attention, for decode by @alexdutu in #376
- [TRITON] Tune fp4xfp4 GEMM by @willzhou-amd in #641
- slice acc into two parts to reduce vgpr usage by @xiaohuguo2023 in #659
- fix gemm a4w4 compile issue by @rocking5566 in #681
- FA bwd asm kernel update by @slippedJim in #679
- Gemm a8w8 bpreshuffle api fix by @junhaha666 in #682
- Refine FA impl by @slippedJim in #683
- fix fmoe a8w8 ck stage2 not supporting inter_dim % 256 = 0 by @junhaha666 in #684
- [TRITON]: add hstu attn op to aiter by @scxiao in #629
- add support for load json.gz by @valarLip in #687
- add blockscale ps asm moe by @junhaha666 in #624
- Pa fp8 mfma by @fsx950223 in #694
- [fea]: new kernel for allreduce optimize by @TennyWang1223 in #699
- fmoe_codegen_asm by @amd-ruitang3 in #690
- add moe_fuse_gate_topK from sglang by @junhaha666 in #700
- fix prebuild file path by @fsx950223 in #692
- [TRITON]: Add benchmark test for leanAttention by @valechen in #688
- [TRITON] Add LayerNorm Backward Triton Kernels by @lucas-santos-amd in #546
- [TRITON] Add Torch unit test reference to PA Prefill Triton Kernels by @lucas-santos-amd in #676
- [TRITON]: Add missing GEMM benchmarks by @willzhou-amd in #680
- A4w4_asm_pro by @zufayu in #649
- fix topk bug by @junhaha666 in #708
- Fix swa condition in FA bwd v3 api by @slippedJim in #707
- use ck_tile::get_warp_size() by @junhaha666 in #710
- fix bug in splitK select by @zufayu in #717
- enable gemm_a4w4 asm kernel to tune splitk by @yzhou103 in #662
- refine moe by @valarLip in #701
- [TRITON]: extend attention bf16 text fix by @Chi-Chu319 in #705
- [Bugfix] Skinny GEMM in tuned gemm.py: add output conversion to tuned_gemm.mm by @vllmellm in #665
- [TRITON]: Add logging to GEMM ops by @rahulbatra85 in #722
- [TRITON] Shaoclee/ds mxfp4 gemm tune by @k50112113 in #693
- [TRITON] shaoclee/triton gemm a8w8 dev by @k50112113 in #709
- [TRITON]: enable buffer ops for lean attention by @xiaohuguo2023 in #725
- update ptpc bpreshuffle gemm tune by @valarLip in #719
- Try to get cu num from env first by @slippedJim in #739
- [fea]: new ar interface by @TennyWang1223 in #750
- A4w4_asm_pro_max_v2 by @zufayu in #741
- asm_fmoe_codegen by @amd-ruitang3 in #702
- Fix fmha codegen when pip install aiter by @slippedJim in #734
- Add sglang ci tests by @gyohuangxin in #735
- [TRITON]: LeanAttention implement loop unrolling to reduce VGPR usage by @valechen in #744
- increase build core num by @valarLip in #730
- [TRITON] mha benchmark fix by @Chi-Chu319 in #748
- fix conflict between AITER_REBUILD and gen_func by @valarLip in #761
- add more bpreshuffle instances by @solinzby1 in #747
- fix random precision issues: 192/224x256 tile asm .so files by @zufayu in #751
- [TRITON]: MLA and Lean Attention updates by @willzhou-amd in #720
- [TRITON]: Add fused GEMMs to optimize FF block by @willzhou-amd in #736
- [TRITON]: Clear cache allocator in Triton tests by @rahulbatra85 in #743
- mdf_UT_args by @amd-ruitang3 in #752
- Enable custom op and avoid graph breaks by @ZhangLirong-amd in #740
- Create docs folder and the doc 'Build and Run the Aiter Container as a Non-root User' by @gyohuangxin in #760
- fix quant_type=1x128 (128x128) can't use tuned_fmoe cfg by @junhaha666 in #758
- add prebuild options in ck_moe by @lalala-sh in #732
- optimize test args by @amd-ruitang3 in #768
- [TRITON]: Add logging info to Triton Kernels by @rahulbatra85 in #729
- fix multiprocess tuning problem by @yzhou103 in #733
- add layout limitation for FA fwd v3 by @slippedJim in #764
- Sampling by @fsx950223 in #727
- Fix issues in sglang ci test when it's from a forked repo. by @gyohuangxin in #769
- Support torch.library.infer_schema for torch < 2.5 by @ZhangLirong-amd in #773
- Fix FA fwd asm limitation by @slippedJim in #782
- LeanAttention code modularization by @valechen in #765
- fix arg parser in pa_v1.py main entry by @842974287 in #772
- fix missing-braces warning during compilation by @842974287 in #770
- Fix MHA build failed by @ZhangLirong-amd in #787
- Wrapper import torch to avoid build issue by @ZhangLirong-amd in #780
- Add assert to prevent users from forgetting to return lse for training by @rocking5566 in #776
- fix test_rmsnorm2dFusedAddQuant.py --mode 3 by @valarLip in #794
- Make Gemm and other ops return Tensor and fix graph breaks by @ZhangLirong-amd in #783
- Batch gemm tuning in parallel by @yzhou103 in #711
- fix typehint for rmsnorm2d_fwd_with_add_smoothquant by @valarLip in #796
- Fix issues in sglang test by @gyohuangxin in #800
- Add receipt for pytorch by @alugorey in #791
- [TRITON]: Benchmarking changes for performance CI by @willzhou-amd in #762
- fix ep test by @valarLip in #799
- [TRITON] Add Chunked PA Prefill Triton Kernel by @lucas-santos-amd in #745
- update ck and compiler to c++20 by @rocking5566 in #803
- update aiter param supported arguments configuration in readme by @minmengdie in #789
- Enable FA multi target build by @slippedJim in #774
- Optimize topksoftmax: top-K-only softmax + 32B vector loads by @CuiCu-618 in #804
- Fix get_num, gfx, get_padded_m and other breaks in dynamo by @ZhangLirong-amd in #797
- [fix]: fix ar 1stage sync error by @TennyWang1223 in #807
- update CK to fix fa fwd build error by @slippedJim in #810
- Fix issues in Triton Test by @gyohuangxin in #813
- LeanAttention optimization by @valechen in https://github.com/R...
v0.1.4 July release
- mxfp4 enabled for gfx950, including GEMM, MoE, and per-1x32 quant (see the sketch after this list)
- multi-GPU tuning enabled for most kinds of GEMMs
- fp8 all-reduce
- a number of Triton kernels
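
As a rough illustration of the per-1x32 layout mentioned above: MXFP4 (OCP microscaling) stores blocks of 32 values that share one power-of-two scale, with each value rounded onto the FP4 E2M1 grid. The PyTorch sketch below is conceptual only and is not the aiter kernel or its API; `quant_mxfp4_1x32` and `FP4_E2M1_GRID` are hypothetical names for this example, and it assumes the input's element count is a multiple of 32.

```python
# Conceptual sketch of per-1x32 MXFP4 (OCP microscaling) fake-quantization.
# NOT the aiter kernel/API: hypothetical names, illustration only.
# Each block of 32 contiguous values shares one power-of-two (E8M0-style) scale;
# the scaled values are rounded to the FP4 E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
import torch

FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_mxfp4_1x32(x: torch.Tensor):
    """Fake-quantize x (numel divisible by 32) with 1x32 block scales."""
    grid = FP4_E2M1_GRID.to(device=x.device)
    blocks = x.float().reshape(-1, 32)
    # Power-of-two scale per block so the block max lands at 6.0 (E2M1 max).
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 6.0)))
    scaled = blocks / scale
    # Round each magnitude to the nearest representable E2M1 value, keep the sign.
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * torch.sign(scaled)
    # Return dequantized values and the per-block scales.
    return (q * scale).reshape_as(x), scale

# Example: x = torch.randn(4, 64); x_dq, scales = quant_mxfp4_1x32(x)
```

The real kernels pack the 4-bit values and store the block exponents separately; the sketch only fake-quantizes in float to show where the 1x32 grouping and shared scale come in.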
What's Changed
- [TRITON] Add Triton Topk Kernel by @hubertlu-tw in #458
- Find executable in rocm home when not found in PATH by @xli in #549
- [TRITON]: Disable int4 moe UT by @rahulbatra85 in #563
- add a4w4 asm_moe by @valarLip in #482
- Improved detection of setup.py install by @ekuznetsov139 in #534
- Disable mha related modules in prebuild by @slippedJim in #567
- Fix format error in .clang-format by @poyenc in #568
- update pa asm by @amd-ruitang3 in #553
- [TRITON]: Reorg mha code and use common fp8 type by @rahulbatra85 in #561
- [TRITON]: Gemm refactor by @rahulbatra85 in #558
- [Triton]: Add has_attr check in get_config by @rahulbatra85 in #572
- [TRITON]: GEMM updates for DS by @rahulbatra85 in #573
- update_codegen by @amd-ruitang3 in #581
- mi350_pa by @amd-ruitang3 in #579
- Change input tensor format to [B,S,H,d] and add batch support for causal by @valechen in #578
- update tune config file by @solinzby1 in #569
- [TRITON] Add RMSNorm bwd Triton Kernels by @lucas-santos-amd in #576
- fix prebuild by @junhaha666 in #592
- [TRITON]: Quantization updates(add int8 and use common fp8 dtypes) by @rahulbatra85 in #588
- Dispatch combine by @junhaha666 in #571
- update args by @amd-ruitang3 in #590
- Pa rocm refresh4 by @fsx950223 in #591
- [update]: update all-reduce by @TennyWang1223 in #552
- Fix compile error in MI350 with ROCm7 by @rocking5566 in #599
- new codegen for elementwise by @TennyWang1223 in #585
- [fix]: elementwise prebuild slow by @TennyWang1223 in #609
- [TRITON]: Fp4gemm m=256 tuning by @Chi-Chu319 in #533
- add MI350 support for skinny_gemm by @yanguahe in #602
- Fix prebuild 350 by @junhaha666 in #608
- [fix]: change ar namespace by @TennyWang1223 in #611
- compile flag clean up by @valarLip in #615
- DIY_args by @amd-ruitang3 in #596
- fix NUM_Q_HEADS - 1 in remap_xcd in _attn_fwd by @juuso-oskari in #612
- add ck gemm a4w4 blockscale with splitK support by @ukannika-amd in #603
- [TRITON]: pid grid fix by @Chi-Chu319 in #618
- Refine ck instance and update a8w8_bpreshuffle_tuned_gemm.csv by @solinzby1 in #621
- merge moe from 350 launch by @lalala-sh in #580
- Remove seqlen limit on FA fwd kernel by @slippedJim in #622
- [Triton] RoPE dev by @k50112113 in #606
- [TRITON]: Fix num_warps typo which was causing performance issues by @valechen in #604
- Topksoftmax_opt by @junhaha666 in #626
- update hip quant for corner case by @valarLip in #633
- [TRITON]: use int64 strides by default for MHA by @rahulbatra85 in #634
- [TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) by @willzhou-amd in #597
- [TRITON] Add Softmax Triton Kernel by @lucas-santos-amd in #605
- Enable gfx942 FA fwd asm kernels by @slippedJim in #619
- Update CK by @poyenc in #635
- Fix error message for rocminfo by @Rohan138 in #636
- [TRITON]: Moe tuning mi350 by @Chi-Chu319 in #610
- Fix test_pa_ragged.py use_alibi=True test cases by @poyenc in #639
- Fix FA fwd nan issue by @slippedJim in #646
- fix for fp8 e4m3fn by @valarLip in #640
- [TRITON]: Kernel benchmarking improvements (for op_benchmarks/triton) by @willzhou-amd in #594
- [Triton]: Disable fused+causal for MHA bkwd by @rahulbatra85 in #642
- enable parallel tuning on CK kernels by @yzhou103 in #625
- Pa fix2 by @fsx950223 in #645
- Update dependencies and add backup for unknown hw by @kunaltyagi in #623
- Optimize topksoftmax WARPS_PER_TB for higher occupancy and remove redundant precision conversion by @CuiCu-618 in #652
New Contributors
- @hubertlu-tw made their first contribution in #458
- @xli made their first contribution in #549
- @ekuznetsov139 made their first contribution in #534
- @valechen made their first contribution in #578
- @willzhou-amd made their first contribution in #597
- @Rohan138 made their first contribution in #636
- @yzhou103 made their first contribution in #625
- @kunaltyagi made their first contribution in #623
- @CuiCu-618 made their first contribution in #652
Full Changelog: v0.1.3...v0.1.4