perf-changelog.yaml
1017 lines (898 loc) · 41.2 KB
- config-keys:
- dsr1-fp8-b200-dynamo-trt
- dsr1-fp8-h200-dynamo-trt
- dsr1-fp4-gb200-dynamo-trt
description:
- "Fix metadata inconsistencies in nvidia-master.yaml - TP/EP/DP-attn values now match actual recipe files"
- "B200 FP8 TRT 8K/1K: prefill_ep 8→1 (15 entries), prefill_dp_attn true→false (1 entry)"
- "H200 FP8 TRT 1K/1K: prefill_dp_attn false→true (9 entries)"
- "H200 FP8 TRT 8K/1K: prefill_dp_attn true→false (8 entries)"
- "GB200 FP4 TRT 8K/1K: decode_dp_attn true→false (2 entries)"
- "All fixes are metadata-only; no recipe files were modified"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/919
- config-keys:
- kimik2.5-int4-mi325x-vllm
description:
- "Add Kimi K2.5 INT4 single-node MI325X vLLM benchmark (TP8)"
- "Uses vLLM ROCm v0.16.0 image following AMD Andy Luo's recipe"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/857
- config-keys:
- 70b-fp8-*-vllm
description:
- 'Add compilation-config ''{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'' as extra config to all benchmarks/70b_fp8_mi*.sh scripts'
- "6-7% uplift for llama for 6/8 configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/95
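# Illustrative sketch (not the exact benchmark script; <model> is a placeholder):
# the compilation-config above is passed to vLLM as a JSON string, e.g.
#   vllm serve <model> \
#     --compilation-config '{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'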
- config-keys:
- gptoss-fp4-*-trt
description:
- "Upgrade GPT-OSS TRT images from 'release:1.1.0rc2.post2' to '1.2.0rc0.post1'"
- "Add NCCL_GRAPH_REGISTER=0 to benchmarks/gptoss_fp4_b200_trt_slurm.sh"
- "Change kv_cache_config.dtype from 'auto' to 'fp8' in benchmarks/gptoss_fp4_b200_trt_slurm.sh"
- "Remove MOE_BACKEND=CUTLASS, now just defaults to TRTLLM"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/110
- config-keys:
- gptoss*
- dsr1*
description:
- "Remove Llama 70B runs to make room for multi-node disagg prefill+wideEP on h100/h200/b200/mi300/mi325/mi355"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/149
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Upgrade vLLM from 0.10.2 to 0.11.0 for GPT-OSS NVIDIA single-node configs"
- 'Add compilation-config ''{"cudagraph_mode":"PIECEWISE"}'' since vLLM 0.11.0 now defaults to FULL_AND_PIECEWISE'
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/159
- config-keys:
- dsr1*
description:
- "Fix bug where 1k8k and 8k1k full sweeps had incorrect max-model-len for DeepSeek"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/163
- config-keys:
- dsr1-fp4-b200-sglang
- dsr1-fp8-b200-sglang
- dsr1-fp8-h200-sglang
description:
- "Consolidate H200 and B200 SGLang configurations to use unified v0.5.5-cu129-amd64 image tag"
- "Update deprecated SGLang server arguments to current equivalents"
- "Replace --enable-ep-moe with --ep-size $EP_SIZE"
- "Replace --enable-flashinfer-trtllm-moe with --moe-runner-backend flashinfer_trtllm"
- "Add -e EP_SIZE to Docker run commands in launch scripts"
- "Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/204
- config-keys:
- gptoss-fp4-mi355x-vllm
- gptoss-fp4-b200-vllm
description:
- "Extend concurrency to 128 for gptoss mi355x/b200 vllm configurations"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/209
- config-keys:
- gptoss-fp4-b200-trt
description:
- "Extend concurrency to 128 for gptoss b200 TRT configurations"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/233
- config-keys:
- "*gb200-dynamo-sglang"
description:
- "Introduce performance improvements in the GB200 SGLang DSR1 submission"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/257
- config-keys:
- dsr1-fp8-h200-trt
description:
- "Update TRT image from nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc0.post1 to nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2"
- "Increase concurrency for some configurations"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/266
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Update vLLM image for NVIDIA configs from vLLM 0.11.0 to vLLM 0.11.2"
- "Add kv-cache-dtype: fp8 to benchmarks/gptoss_fp4_b200_docker.sh"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/273
- config-keys:
- gptoss-fp4-b200-trt
description:
- "Add benchmark script for GPTOSS FP4 B200 TRT-LLM"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/256
- config-keys:
- dsr1-fp4-gb200-dynamo-trt
- dsr1-fp4-gb200-dynamo-sglang
- dsr1-fp8-gb200-dynamo-sglang
description:
- "Add more configurations for GB200 SGLang DSR1"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/335
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update MI355x DeepSeek-R1 FP4 SGLang image to upstream v0.5.6.post1"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/330
- config-keys:
- dsr1-fp4-gb200-dynamo-sglang
- dsr1-fp8-gb200-dynamo-sglang
description:
- "Prune unnecessary concurrency points"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/358
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update MI355x DeepSeek-R1 FP4 SGLang image to upstream v0.5.6.post2"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/369
- config-keys:
- dsr1-fp4-b200-sglang
- dsr1-fp8-b200-sglang
- dsr1-fp8-h200-sglang
description:
- "Update NVIDIA DeepSeek sglang Docker image from v0.5.5 to v0.5.6"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/276
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Update vLLM image from v0.11.2 to v0.13.0"
- "Add VLLM_MXFP4_USE_MARLIN=1 to H100 and H200 benchmark scripts"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/327
- config-keys:
- dsr1-fp8-mi300x-sglang
- dsr1-fp8-mi325x-sglang
- dsr1-fp8-mi355x-sglang
description:
- Use upstream SGLang images on MI300, MI325, and MI355 for DSR1 FP8
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/332
- config-keys:
- gptoss-fp4-gb200-dynamo-trt
- gptoss-fp4-b200-trt
description:
- Explicitly set EP=TP for DP-attention configs in the B200 AGG nvidia-master file; the multinode refactor inadvertently changed the default to EP=1
- Add GPT-OSS DISAGG configurations for GB200 1k1k and 8k1k
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/387
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add PD disaggregation (1P2D) for MI355X"
- "Includes with and without speculative decoding"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/348
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update MI355x DeepSeek-R1 FP4 SGLang image to upstream v0.5.7"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/395
- config-keys:
- dsr1-fp8-b200-sglang
description:
- "Add TP4 configurations to DSR1-FP8 B200 SGLang deployment experiments"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/411
- config-keys:
- dsr1-fp4-b200-trt-mtp
- dsr1-fp8-b200-trt-mtp
- dsr1-fp8-h200-trt-mtp
description:
- Add MTP (Multi-Token Prediction) support for single-node TRT configs
- Add spec-decoding field to config entries and update launch scripts to select MTP benchmark scripts
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/392
- config-keys:
- dsr1-fp8-mi355x-atom
- dsr1-fp4-mi355x-atom
- gptoss-fp4-mi355x-atom
description:
- Add internal AMD ATOM inference engine for DeepSeek R1 FP8, FP4 and GPT-OSS FP4 MI355X
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/419
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
description:
- "Update AMD MI300X and MI325X GPT-OSS 120B vLLM to use upstream ROCm image vllm/vllm-openai-rocm:v0.14.0"
- "Remove deprecated --async-scheduling flag (now enabled by default in vLLM v0.14.0)"
- "Remove deprecated --max-seq-len-to-capture flag"
- "Add HIP_VISIBLE_DEVICES env var for Ray compatibility in vLLM 0.14+"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/496
- config-keys:
- dsr1-fp8-h200-sglang
description:
- "Update H200 DeepSeek R1 FP8 SGLang image from v0.5.6 to v0.5.7"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/538
- config-keys:
- dsr1-fp8-mi300x-sglang
description:
- "Update MI300X DeepSeek R1 FP8 SGLang image from v0.5.5.post3 to v0.5.7"
- "Add SGLANG_AITER_MLA_PERSIST=1 for persistent MLA kernel optimization"
- "Set --kv-cache-dtype fp8_e4m3 for fp8 KV cache"
- "Set --attention-backend aiter for AMD aiter attention backend"
- "Update chunked-prefill-size and max-prefill-tokens from 196608 to 131072"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/544
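# Illustrative sketch (not the exact launch script; <model> is a placeholder):
# the flags above map to an SGLang server invocation along the lines of
#   python -m sglang.launch_server --model-path <model> --tp 8 \
#     --kv-cache-dtype fp8_e4m3 --attention-backend aiter \
#     --chunked-prefill-size 131072 --max-prefill-tokens 131072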
- config-keys:
- dsr1-fp8-mi325x-sglang
description:
- "Update MI325X DeepSeek R1 FP8 SGLang image from v0.5.5.post3 to v0.5.7"
- "Add SGLANG_AITER_MLA_PERSIST=1 for persistent MLA kernel with fp8 KV cache"
- "Add --kv-cache-dtype fp8_e4m3 for explicit FP8 KV cache"
- "Add --attention-backend aiter for AMD aiter attention backend"
- "Reduce chunked-prefill-size from 196608 to 131072"
- "Reduce max-prefill-tokens from 196608 to 131072"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/545
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
description:
- "Fix AITER env vars for vLLM v0.14.0 on AMD MI300X and MI325X"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/535
- config-keys:
- dsr1-fp8-mi355x-sglang
description:
- "Update MI355X DeepSeek R1 FP8 SGLang image from v0.5.5.post3 to v0.5.8"
- "Key fix: Disables mla persistent kernel when not using fp8 kv_cache (https://github.com/sgl-project/sglang/pull/17327)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/572
- config-keys:
# NVIDIA single-node
- dsr1-fp4-b200-sglang
- dsr1-fp4-b200-trt
- dsr1-fp4-b200-trt-mtp
- dsr1-fp8-b200-sglang
- dsr1-fp8-b200-trt
- dsr1-fp8-b200-trt-mtp
- dsr1-fp8-h200-sglang
- dsr1-fp8-h200-trt
- dsr1-fp8-h200-trt-mtp
- gptoss-fp4-b200-trt
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-trt
- gptoss-fp4-h200-vllm
# AMD single-node
- dsr1-fp4-mi355x-sglang
- dsr1-fp4-mi355x-atom
- dsr1-fp8-mi300x-sglang
- dsr1-fp8-mi325x-sglang
- dsr1-fp8-mi355x-sglang
- dsr1-fp8-mi355x-atom
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
- gptoss-fp4-mi355x-vllm
- gptoss-fp4-mi355x-atom
description:
- Add official GSM8k eval results to GPT-OSS and DeepSeek R1 scenarios
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/558
evals-only: true
- config-keys:
- dsr1-fp8-h200-sglang
description:
- "Update H200 DeepSeek R1 FP8 SGLang image from v0.5.7 to v0.5.9"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys:
- dsr1-fp4-b300-dynamo-trt
description:
- "Add DSR1 FP4 B300 Dynamo TRT configurations"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/585
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update SGLang image from v0.5.7 to v0.5.8 for DeepSeek-R1 FP4 on MI355x"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/595
- config-keys:
- dsr1-fp8-b200-trt
description:
- "Update TensorRT-LLM container from release:1.1.0rc2.post2 to release:1.2.0rc6.post2"
- "Change default MOE backend from DEEPGEMM to TRTLLM"
- "Add dynamic piecewise CUDA graphs for 1k1k (CONC≥64) and 8k1k (CONC≥64) workloads"
- "Add delay batching (batch_wait_timeout_iters/batch_wait_max_tokens_ratio) for 1k1k high-concurrency"
- "Add dynamic KV cache memory fraction tuning (0.7-0.8) based on ISL/OSL/TP configuration"
- "Update search space: remove EP=TP constraint, add TP=4 configurations, extend concurrency ranges"
- "Add TLLM_OVERRIDE_LAYER_NUM=61 to avoid OOM errors"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/594
- config-keys:
- dsr1-fp4-b200-dynamo-trt
description:
- "Update DSR1 FP4 B200 Dynamo TRT configurations"
- "Update TRTLLM version to 1.2.0rc6.post2"
- "Transform to use srt-slurm recipes"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/588
- config-keys:
- dsr1-fp8-h200-dynamo-trt
description:
- "Add DSR1 FP8 H200 Dynamo TRT-LLM disaggregated multinode configuration"
- "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
- "Runner: h200-dgxc with multinode and disagg enabled"
- "Includes MTP and STP configurations for 1k1k and 8k1k sequence lengths"
- "Concurrency levels: 4, 8, 16, 32, 64, 128, 256, 512"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/570
- config-keys:
- dsr1-fp4-gb200-dynamo-trt
description:
- "Update Dynamo TRT image from 0.5.1-rc0.pre3 to 0.8.1.post2"
- "Update TRT configurations"
- "Refactor configurations to use CONFIG_FILE-based recipes instead of inline parameter settings"
- "Introduce srt-slurm workflow for launching Dynamo jobs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/510
- config-keys:
- dsr1-fp8-mi355x-sglang
description:
- "Disable torch.compile for MI355X DeepSeek-R1 FP8 SGLang"
- "Set cuda-graph-max-bs to CONC"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/613
- config-keys:
- dsr1-fp8-h200-dynamo-sglang
description:
- "Add DSR1 FP8 H200 Dynamo SGLang disaggregated multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8-cu130-runtime"
- "Runner: h200-multinode-slurm with multinode and disagg enabled"
- "Recipes sourced from srtslurm repo (recipes/h200/)"
- "1k1k configs: aggregated, low-latency (1P9D), high-throughput TEP (1P6D), DEP (1P6D)"
- "8k1k configs: aggregated, TEP configs (1P7D, 1P6D, 1P3D, 2P3D), DEP (1P1D)"
- "Concurrency levels range from 1 to 2048 depending on configuration"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/582
- config-keys:
- dsr1-fp4-b200-trt
description:
- "Update TensorRT-LLM container from release:1.1.0rc2.post2 to release:1.2.0rc6.post2"
- "Change default MOE backend from DEEPGEMM to TRTLLM"
- "Add dynamic piecewise CUDA graphs for 1k1k (TEP8 and CONC64)"
- "Update search space: remove EP=TP constraint, add TP=4 configurations, extend concurrency ranges"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/620
- config-keys:
- dsr1-fp4-gb300-dynamo-trt
description:
- "Add DeepSeek-R1 FP4 GB300 Dynamo TRT disaggregated multinode configurations"
- "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post2"
- "Includes MTP and STP configs for 1k1k, 1k8k, and 8k1k sequence lengths"
- "Add gb300-nv runner and launch script for srt-slurm integration"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/618
- config-keys:
- dsr1-fp4-mi355x-sglang-disagg
description:
- "Enable PD/D for both MTP and non-MTP MI355X DeepSeek-R1 FP4 SGLang"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/622
- config-keys:
- dsr1-fp8-gb200-dynamo-trt
description:
- "Add DeepSeek R1 FP8 GB200 Dynamo TRT-LLM disaggregated multinode configurations"
- "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post2"
- "1k1k: 14 scenarios (7 MTP, 7 STP) with varying DP attention/TEP modes"
- "1k8k: 10 scenarios (5 MTP, 5 STP) for long output generation"
- "8k1k: 14 scenarios (7 MTP, 7 STP) for long context workloads"
- "Prefill workers: 1-5P, Decode workers: 1-4D, TP/EP: 8/16/32"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/617
- config-keys:
- dsr1-fp8-gb200-dynamo-trt
description:
- "Fix model_prefix argument in yaml configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/646
- config-keys:
- dsr1-fp8-b200-trt-mtp
description:
- Update to the latest TRTLLM 1.2 release container version
- Fine-tune the choice of parallelism in nvidia-master file, mainly going to TP for most points
- Enable piecewise CUDA graphs under most conditions
- Fine-tune max batch sizes and other optimizations
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/632
- config-keys:
- dsr1-fp8-gb300-dynamo-trt
description:
- "Add DeepSeek R1 FP8 GB300 Dynamo TRT-LLM disaggregated multinode configurations for 8k1k and 1k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/627
- config-keys:
- gptoss-fp4-b200-trt
description:
- "Update GPT-OSS FP4 B200 TRT pareto configurations and new container image"
- "Extend maximum concurrency to 256 across all sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/639
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add --use-chat-template argument to benchmark_serving script"
- "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/647
- config-keys:
- dsr1-fp8-b200-sglang-mtp
description:
- "Add MTP (Multi-Token Prediction) support for DeepSeek R1 FP8 B200 SGLang using EAGLE speculative decoding"
- "Image: lmsysorg/sglang:v0.5.8-cu130-amd64"
- "Add benchmark script dsr1_fp8_b200_mtp.sh with EAGLE speculative decoding (num-steps=2, draft-tokens=3, topk=1)"
- "Update launch_b200-dgxc.sh to support SPEC_SUFFIX for MTP script selection"
- "Configurations: TP=8, EP=1, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/626
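# Illustrative sketch of the EAGLE settings above (assumed flag spelling; see the
# actual dsr1_fp8_b200_mtp.sh for the real invocation; <model> is a placeholder):
#   python -m sglang.launch_server --model-path <model> --tp 8 \
#     --speculative-algorithm EAGLE --speculative-num-steps 2 \
#     --speculative-num-draft-tokens 3 --speculative-eagle-topk 1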
- config-keys:
- dsr1-fp4-b200-trt-mtp
description:
- "Upgrade TensorRT-LLM container from release:1.1.0rc2.post2 to release:1.2.0rc6.post3"
- "Enable dynamic piecewise CUDA graphs for several conditions"
- "Adjust TP8/TP4 search space to reduce overlapping points"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/642
- config-keys:
- dsr1-fp8-b200-dynamo-sglang
description:
- "Add DSR1 FP8 B200 disaggregated SGLang multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
- "9 recipes: 4x 1k1k + 5x 8k1k, low-latency and max-throughput profiles"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/658
- config-keys:
- dsr1-fp8-gb300-dynamo-trt
description:
- "Add DeepSeek R1 FP8 GB300 Dynamo TRT-LLM disaggregated multinode configurations for 1k8k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/654
- config-keys:
- dsr1-fp8-h100-dynamo-trt
description:
- "Add DeepSeek R1 FP8 H100 Dynamo TRT-LLM disaggregated multinode configurations"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/651
- config-keys:
- dsr1-fp8-h200-dynamo-sglang
description:
- "Add MTP (EAGLE speculative decoding) configs alongside STP"
- "Update container to lmsysorg/sglang:v0.5.8.post1-cu130"
- "Remove aggregated configs, keep only disaggregated"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/640
- config-keys:
- dsr1-fp8-b300-dynamo-trt
description:
- "New B300 FP8 Dynamo TRT configurations"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/638
- config-keys:
- dsr1-fp8-h100-dynamo-trt
description:
- "Add DeepSeek R1 FP8 H100 Dynamo TRT-LLM disaggregated multinode configurations"
- "Fix model_prefix bug from https://github.com/SemiAnalysisAI/InferenceX/pull/651"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/663
- config-keys:
- dsr1-fp4-gb200-dynamo-sglang
description:
- "Update SGLang image from v0.5.5.post2 to v0.5.8-cu130"
- "Add FP4 model path separation via SRT_SLURM_MODEL_PREFIX in launch script"
- "Refactor to use CONFIG_FILE-based srt-slurm recipes instead of inline parameters"
- "Add 1k1k configurations: low-latency (1P2D), mid-curve (4P8D), max-tpt (4P12D)"
- "Add 8k1k configurations: low-latency (1P4D), mid-curve (6P12D), max-tpt (10P8D)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/633
- config-keys:
- dsr1-fp8-gb200-dynamo-sglang
- dsr1-fp8-gb300-dynamo-sglang
description:
- "Update GB200 and GB300 configs for DSR1 FP8 SGLANG STP mode"
- "Image: lmsysorg/sglang:v0.5.8-cu130"
- "Update prefill/decode worker counts, TP/EP parallelism, and dp-attn settings for 1k1k and 8k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/635
- config-keys:
- dsr1-fp8-b200-dynamo-sglang-mtp
description:
- "Add DSR1 FP8 B200 disaggregated SGLang MTP multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
- "9 recipes: 4x 1k1k + 5x 8k1k, low-latency and max-throughput with EAGLE speculative decoding"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/667
- config-keys:
- dsr1-fp8-h100-dynamo-sglang
description:
- "Add DeepSeek-R1 FP8 H100 Dynamo SGLang STP disaggregated multinode configurations"
- "Image: lmsysorg/sglang:v0.5.8-cu130"
- "1k1k, 1k8k, 8k1k sequence lengths"
- "Two modes per seq-len: Max throughput TEP (1P2D) and Max throughput DEP (1P1D with dp-attention)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/643
- config-keys:
- dsr1-fp8-h100-dynamo-sglang
description:
- "Add DeepSeek-R1 FP8 H100 Dynamo SGLang MTP disaggregated multinode configurations"
- "Image: lmsysorg/sglang:v0.5.8-cu130"
- "1k1k, 1k8k, 8k1k sequence lengths with MTP speculative decoding"
- "Two modes per seq-len: Max throughput TEP (1P2D) and Max throughput DEP (1P1D with dp-attention)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/644
- config-keys:
- dsr1-fp8-mi355x-atom-mtp
- dsr1-fp4-mi355x-atom-mtp
description:
- "Add DSR1 FP8/FP4 MI355X ATOM with MTP configuration"
- "Image: rocm/atom:rocm7.2.0-ubuntu24.04-pytorch2.9-atom0.1.1"
- "Deepseek R1 with speculative decoding: 1k1k, 1k8k, 8k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/673
- config-keys:
- dsr1-fp4-b200-dynamo-sglang
description:
- "Add DSR1 FP4 B200 Dynamo SGLang STP mode"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-runtime"
- "1k1k configs: low-latency DEP (1P5D, 1P6D), max-throughput DEP (1P1D, 1P2D)"
- "8k1k configs: low-latency DEP/TEP (1P1D, 1P5D, 2P5D), TEP (1P1D), max-throughput DEP (7P2D)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/672
- config-keys:
- dsr1-fp8-b200-dynamo-trt
description:
- "Introduce new DSR1 FP8 B200 Dynamo TRT configurations for 8k1k and 1k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/616
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
- dsr1-fp4-mi355x-sglang-disagg
description:
- "Bump MI355X MORI FP8/FP4 image to latest rocm/sgl-dev:sglang-0.5.8-rocm700-mi35x-mori-0210"
- "Bump mi355x sglang disagg recipe to sa-260211"
- "Add conc 4/8/16"
- "Use pure TP with MTP=2 for 1k1k at conc below 128, and reduce MTP to 1 for DEP configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/674
- config-keys:
- dsr1-fp4-b200-dynamo-sglang-mtp
description:
- "Add B200 configs for DSR1 FP4 SGLANG MTP mode for 1k1k and 8k1k"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/683
- config-keys:
- dsr1-fp4-gb300-dynamo-sglang
description:
- "Add GB300 FP4 Dynamo SGLang disaggregated multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-runtime"
- "Recipes sourced from srt-slurm repo (recipes/gb300-fp4/ folder)"
- "Add 1k1k configurations: low-latency (1P2D), mid-curve (4P8D), max-tpt (4P12D)"
- "Add 8k1k configurations: low-latency (1P4D), mid-curve (6P12D), max-tpt (10P8D)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/636
- config-keys:
- dsr1-fp8-b200-dynamo-sglang-mtp
description:
- "Patch a missing concurrency point in the DSR1 FP8 B200 disaggregated SGLang MTP multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/691
- config-keys:
- dsr1-fp8-b300-dynamo-trt
description:
- "Update max_num_tokens and max_batch_size for min-latency decode workers"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/690
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add more sweep points for DSR1 FP8 both MTP and non-MTP 1k1k, 8k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/689
- config-keys:
- dsr1-fp8-b200-dynamo-trt
description:
- "Update max_num_tokens and max_batch_size for min-latency decode workers"
- "See srt-slurm recipe changes: https://github.com/ishandhanani/srt-slurm/pull/173"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/686
- config-keys:
- dsr1-fp8-mi325x-sglang
description:
- "Update MI325X DeepSeek R1 FP8 SGLang image from v0.5.7 to v0.5.8"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/692
- config-keys:
- dsr1-fp8-mi300x-sglang
description:
- "Update MI300X DeepSeek R1 FP8 SGLang image from v0.5.7 to v0.5.8"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/696
- config-keys:
- dsr1-fp4-mi355x-sglang-disagg
description:
- "Add more sweep points for DSR1 FP4 both MTP and non-MTP 1k1k, 8k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/692
- config-keys:
- dsr1-fp8-b200-dynamo-sglang-mtp
description:
- "Add new 1P2D max-throughput MTP config for 1k1k"
- "MTP settings: speculative-num-steps=1, speculative-num-draft-tokens=2"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/697
- config-keys:
- dsr1-fp8-h200-dynamo-trt
description:
- "Add min-latency configurations for H200 Dynamo TRT"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/698
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Bump MI355X disagg FP8 recipe commit to 953f7c5 (bug fixes in sglang_disagg fork)"
- "1k1k: Switch prefill workers from DEP (tp:1, ep:8, dp-attn) to Pure TP (tp:8, ep:1) for MTP and non-MTP middle-of-curve 1P2D configs"
- "1k1k: Extend middle-of-curve concurrency range by adding conc=128 for both MTP and non-MTP 1P2D configs"
- "8k1k: Add new 2P1D (2-prefill, 1-decode) configs at conc [512, 1024] for both MTP (DECODE_MTP_SIZE=0) and non-MTP, with Pure TP prefill and DEP decode"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/700
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Bump MI355X disagg FP8 recipe commit to fix perf regression on 8k1k DEP8"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/701
- config-keys:
- qwen3.5-bf16-b200-sglang
description:
- "Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark"
- "Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/704
- config-keys:
- qwen3.5-bf16-mi355x-sglang
description:
- "Add Qwen3.5-397B-A17B BF16 SGLang benchmark for MI355X"
- "Image: rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260215"
- "Uses triton attention backend, TP=8, concurrency 4-64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/705
- config-keys:
- kimik2.5-int4-mi355x-vllm
description:
- "Add Kimi-K2.5 INT4 vLLM benchmark for MI355X"
- "Model: moonshotai/Kimi-K2.5 with --mm-encoder-tp-mode data"
- "Image: vllm/vllm-openai-rocm:v0.15.1"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/734
- config-keys:
- minimaxm2.5-fp8-mi355x-vllm
description:
- "Add MiniMax-M2.5 FP8 vLLM benchmark for MI355X"
- "Model: MiniMaxAI/MiniMax-M2.5 with --trust-remote-code"
- "Image: vllm/vllm-openai-rocm:v0.15.1"
- "Environment: VLLM_ROCM_USE_AITER=1"
- "TP=2 and TP=4, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/755
- config-keys:
- qwen3.5-fp8-mi355x-sglang
description:
- "Add Qwen3.5-397B-A17B-FP8 SGLang benchmark configuration for MI355X"
- "Image: rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260218"
- "Uses triton attention backend, TP=8, concurrency 4-64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/768
- config-keys:
- qwen3.5-bf16-b200-sglang
description:
- "Update Qwen3.5-397B-A17B BF16 SGLang B200 benchmark launch config"
- "Image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e"
- "Add trtllm_mha attention backend, flashinfer_trtllm MOE runner"
- "Add context-length, tokenizer-worker-num, env tuning (NCCL_NVLS_ENABLE, SGLANG_ENABLE_FLASHINFER_GEMM)"
- "Set cuda-graph-max-bs to match concurrency, scheduler-recv-interval based on concurrency"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/758
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add more configs for MI355X FP8 Disagg"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/770
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
description:
- "Update vLLM ROCm image from v0.14.0 to v0.15.1 for MI300X and MI325X GPT-OSS"
- "Gains: ROCm skinny GEMM dispatch fix, MoRI EP all2all backend, KV cache shuffle + paged attention for AITER"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/781
- config-keys:
- kimik2.5-int4-b200-vllm
description:
- "Add Kimi-K2.5 INT4 vLLM benchmark for B200"
- "Model: moonshotai/Kimi-K2.5 with --mm-encoder-tp-mode data and --trust-remote-code"
- "Image: vllm/vllm-openai:v0.15.1"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/735
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Update vLLM image from v0.13.0 to v0.15.1 for NVIDIA GPT-OSS configs"
- "Remove deprecated async-scheduling flag (now enabled by default since v0.14.0)"
- "Gains: CUTLASS MoE optimizations (~8% throughput), FP4 kernel improvements (~4% E2E on B200), torch.compile cold-start fix"
- "v0.15.1 includes fix for prefix cache hit rate of 0% on GPT-OSS hybrid attention models"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/789
- config-keys:
- dsr1-fp4-mi355x-atom
- dsr1-fp4-mi355x-atom-mtp
description:
- "Update search-space configurations for DSR1 FP4 MI355X ATOM and ATOM-MTP"
- "Comment out TP=4 configs, consolidate to TP=8 only"
- "Extend concurrency range to conc-end: 256 across all sequence lengths (1k1k, 1k8k, 8k1k)"
- "Fix MTP 1k8k conc-start from 256 to 4 to enable full concurrency sweep"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/699
- config-keys:
- glm5-fp8-mi355x-sglang
description:
- "Add GLM-5 FP8 SGLang benchmark for MI355X"
- "Model: zai-org/GLM-5-FP8 with NSA tilelang backends"
- "Image: rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
- gptoss-fp4-mi355x-vllm
description:
- "Update AMD GPT-OSS vLLM images to v0.16.0 (MI300X/MI325X from v0.15.1, MI355X from custom v0.10.1)"
- "MI355X: Fix env vars (VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION), add VLLM_ROCM_USE_AITER=1, remove deprecated flags"
- "MI355X: Simplify compilation config to cudagraph_mode FULL_AND_PIECEWISE, add HIP_VISIBLE_DEVICES Ray fix"
- "Gains: fused add+rmsnorm+pad for GPT-OSS (automatic via PassManager), AITER attention block size fix"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/806
- config-keys:
- dsr1-fp4-mi355x-sglang
- dsr1-fp8-mi300x-sglang
- dsr1-fp8-mi325x-sglang
- dsr1-fp8-mi355x-sglang
description:
- "Update SGLang image from v0.5.8 to v0.5.9 for AMD single-node DeepSeek R1 configs"
- "Key changes: AITER v0.1.10.post3 with FP8 Prefill/Decode/KV Cache, FP8 prefill attention kernel, MORI EP two-batch overlapping, OOM fix for DeepSeek weight loading"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/816
- config-keys:
- minimaxm2.5-fp8-h200-vllm
description:
- "Add MiniMax M2.5 FP8 single-node config for H200 with vLLM v0.16.0 (TP4)"
- "New benchmark script with --trust-remote-code for MiniMaxAI/MiniMax-M2.5"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys:
- minimaxm2.5-fp8-mi325x-vllm
description:
- "Add MiniMax-M2.5 FP8 vLLM benchmark for MI325X"
- "Model: MiniMaxAI/MiniMax-M2.5 with --trust-remote-code"
- "Image: vllm/vllm-openai-rocm:v0.16.0"
- "Environment: VLLM_ROCM_USE_AITER=1"
- "TP=2 and TP=4, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/836
- config-keys:
- minimaxm2.5-fp8-mi300x-vllm
description:
- "Add MiniMax-M2.5 FP8 vLLM benchmark for MI300X"
- "Model: MiniMaxAI/MiniMax-M2.5 with --trust-remote-code"
- "Image: vllm/vllm-openai-rocm:v0.16.0"
- "Environment: VLLM_ROCM_USE_AITER=1"
- "TP=2 and TP=4, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/837
- config-keys:
- kimik2.5-fp4-mi355x-vllm
description:
- "Add Kimi-K2.5 MXFP4 vLLM benchmark for MI355X"
- "Model: amd/Kimi-K2.5-MXFP4 with --mm-encoder-tp-mode data and --trust-remote-code"
- "Image: vllm/vllm-openai-rocm:v0.15.1"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/825
- config-keys:
- qwen3.5-bf16-mi325x-sglang
description:
- "Add Qwen3.5-397B-A17B BF16 SGLang benchmark for MI325X"
- "Image: lmsysorg/sglang:v0.5.9-rocm720-mi30x"
- "Uses triton attention backend, TP=8, concurrency 4-64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys:
- qwen3.5-bf16-mi300x-sglang
description:
- "Add Qwen3.5-397B-A17B BF16 SGLang benchmark for MI300X"
- "Image: lmsysorg/sglang:v0.5.9-rocm720-mi30x"
- "Uses triton attention backend, TP=8, concurrency 4-64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/843
- config-keys:
- qwen3.5-fp8-mi325x-sglang
description:
- "Add Qwen3.5-397B-A17B-FP8 SGLang benchmark for MI325X"
- "Image: lmsysorg/sglang:v0.5.9-rocm720-mi30x"
- "Following AMD Andy Luo's recipe with triton attention backend"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys:
- qwen3.5-fp8-mi300x-sglang
description:
- "Add Qwen3.5-397B-A17B-FP8 SGLang benchmark for MI300X"
- "Image: lmsysorg/sglang:v0.5.9-rocm720-mi30x"
- "Uses triton attention backend, TP=8, concurrency 4-64"
- "Following AMD Andy Luo's recipe"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/850
- config-keys:
- kimik2.5-int4-h200-vllm
description:
- "Add Kimi-K2.5 INT4 vLLM benchmark for H200"
- "Model: moonshotai/Kimi-K2.5 with --reasoning-parser kimi_k2 and --trust-remote-code"
- "Image: vllm/vllm-openai:v0.16.0"
- "TP=8, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
      - "Following https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/839
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
- dsr1-fp8-mi355x-sglang-disagg-mtp
- dsr1-fp4-mi355x-sglang-disagg
- dsr1-fp4-mi355x-sglang-disagg-mtp
description:
- "Add more sweep configs for MI355X FP8/FP4 Disagg"
      - "Add TP/DP/EP size < 8 support"
- "Support DSR1-0528 MTP Disagg"
- "Bump SGL mori image to Feb 27"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/823
- config-keys:
- minimaxm2.5-fp8-h100-vllm
description:
- "Add MiniMax-M2.5 FP8 vLLM benchmark for H100"
- "Model: MiniMaxAI/MiniMax-M2.5 with --trust-remote-code"
- "Image: vllm/vllm-openai:v0.16.0"
- "Switch from TP=8/EP=8 to TP=4/EP=4, concurrency 4-64 for 1k1k, 1k8k, and 8k1k"
- "Script uses conditional --enable-expert-parallel based on EP_SIZE env var"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/832
- config-keys:
- qwen3.5-fp8-b200-sglang
description:
- "Add Qwen3.5-397B-A17B-FP8 SGLang benchmark configuration for B200"
- "Image: lmsysorg/sglang:v0.5.9-cu129-amd64"
- "Uses trtllm_mha attention backend and flashinfer_trtllm MOE runner"
- "Enable SGLANG_ENABLE_FLASHINFER_GEMM=true, NCCL_NVLS_ENABLE=1"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/804
- config-keys:
- qwen3.5-fp8-h200-sglang
description:
      - "Add Qwen3.5-397B-A17B-FP8 SGLang benchmark configuration for H200"
- "Model: Qwen/Qwen3.5-397B-A17B-FP8, runner: h200, image: lmsysorg/sglang:v0.5.8-cu130-amd64"
- "Benchmark script: benchmarks/single_node/qwen3.5_fp8_h200.sh"
- "Server: reasoning-parser qwen3, tool-call-parser qwen3_coder, enable-flashinfer-allreduce-fusion, mem-fraction-static 0.8"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/855
- config-keys:
- dsr1-fp8-mi355x-sglang
description:
      - "Expand TP search space"
      - "Add kv-cache-fp8"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/865
- config-keys:
- qwen3.5-bf16-b200-sglang
- qwen3.5-bf16-mi300x-sglang
- qwen3.5-bf16-mi325x-sglang
- qwen3.5-bf16-mi355x-sglang
- qwen3.5-fp8-b200-sglang
- qwen3.5-fp8-h200-sglang
- qwen3.5-fp8-mi300x-sglang
- qwen3.5-fp8-mi325x-sglang
- qwen3.5-fp8-mi355x-sglang
description:
      - "Re-run evals for Qwen3.5 configs"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/892
evals-only: true
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
- gptoss-fp4-mi355x-vllm
description:
- "Update AMD GPT-OSS vLLM image from v0.16.0 to v0.17.0 for MI300X, MI325X, and MI355X"
- "MI355X: Switch model to amd/gpt-oss-120b-w-mxfp4-a-fp8 (MXFP4 weights + FP8 activations)"
- "MI355X: Add VLLM_ROCM_USE_AITER_TRITON_ROPE=1 for AITER triton RoPE kernel"
- "Add AMDGCN_USE_BUFFER_OPS=0 and VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 env vars"
- "Switch to --attention-backend ROCM_AITER_UNIFIED_ATTN and add fuse_rope_kvcache compilation pass"
- "Remove deprecated VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION/VLLM_ROCM_USE_AITER_MHA env vars and compilation-config cudagraph_mode"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/867
- config-keys:
- kimik2.5-fp4-b200-vllm
description:
- "Add Kimi K2.5 FP4 B200 vLLM benchmark configuration"
- "Image: vllm/vllm-openai:v0.17.0"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/862
- config-keys:
- minimaxm2.5-fp8-b200-vllm
description:
- "Add MiniMax-M2.5 FP8 vLLM benchmark for B200"
- "Model: MiniMaxAI/MiniMax-M2.5 with --trust-remote-code"
- "Image: vllm/vllm-openai:v0.17.0"
- "TP=2 and TP=4, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/757
- config-keys:
- dsr1-fp4-mi355x-sglang-disagg
- dsr1-fp4-mi355x-sglang-disagg-mtp
description:
- "Add more sweep configs for MI355X FP4 Disagg"
- "Bump FP4 image to Feb 27 latest"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/899
- config-keys:
- minimaxm2.5-fp8-h200-vllm
description:
- "Extend MiniMax M2.5 FP8 single-node config for H200 with vLLM v0.16.0 (TP8)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/869
- config-keys:
- dsr1-fp8-b200-dynamo-sglang
- dsr1-fp8-b200-dynamo-sglang-mtp
description:
- "Update B200 FP8 DSR1 8k1k dynamo-sglang recipes to new pareto configs"
- "Replace old per-file recipes with resolved variants from consolidated 8k1k.yaml"
- "14 variants: STP/MTP x low-latency/max-throughput with updated concurrencies and scale points"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/907
- config-keys:
- glm5-fp8-h200-sglang
description:
- "Add GLM-5 FP8 SGLang H200 single-node benchmark"
- "Model: zai-org/GLM-5-FP8, image: lmsysorg/sglang:glm5-hopper"
- "Benchmark script: benchmarks/single_node/glm5_fp8_h200.sh"
- "Tool-call-parser glm47, reasoning-parser glm45, mem-fraction-static 0.85"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/914
- config-keys:
- glm5-fp8-b200-sglang
description:
- "Add GLM-5 FP8 SGLang benchmark for B200"
- "Supports TP8 (low latency) and DEP8 (high throughput) modes with NSA attention backend"