# perf-changelog.yaml
- config-keys:
- 70b-fp8-*-vllm
description:
- 'Add compilation-config ''{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'' as extra config to all benchmarks/70b_fp8_mi*.sh scripts'
- "6-7% uplift for llama for 6/8 configs"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/95
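# A minimal sketch of where the compilation-config above lands (assumption: the
# 70b_fp8_mi*.sh scripts start the server via `vllm serve`; <MODEL> is a
# placeholder). The "-" prefix disables vLLM's custom op implementation so the
# torch.compile path handles that op instead:
#   vllm serve <MODEL> \
#     --compilation-config '{"custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'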
- config-keys:
- gptoss-fp4-*-trt
description:
- "Upgrade GPT-OSS TRT images from 'release:1.1.0rc2.post2' to '1.2.0rc0.post1'"
- "Add NCCL_GRAPH_REGISTER=0 to benchmarks/gptoss_fp4_b200_trt_slurm.sh"
- "Change kv_cache_config.dtype from 'auto' to 'fp8' in benchmarks/gptoss_fp4_b200_trt_slurm.sh"
- "Remove MOE_BACKEND=CUTLASS, now just defaults to TRTLLM"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/110
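# Sketch of the two server-side changes above (the extra-config file layout is an
# assumption; the env var and key names are as given in the entry):
#   export NCCL_GRAPH_REGISTER=0   # disable NCCL buffer registration under CUDA graphs
#   # TRT-LLM extra LLM-API config:
#   #   kv_cache_config:
#   #     dtype: fp8               # was: auto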
- config-keys:
- gptoss*
- dsr1*
description:
- "Remove Llama 70B runs to make room for multi-node disagg prefill+wideEP on h100/h200/b200/mi300/mi325/mi355"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/149
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Upgrade vLLM from 0.10.2 to 0.11.0 for GPT-OSS NVIDIA single-node configs"
- 'Add compilation-config ''{"cudagraph_mode":"PIECEWISE"}'' since vLLM 0.11.0 now defaults to FULL_AND_PIECEWISE'
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/159
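# Sketch (assumption: servers are started via `vllm serve`; <MODEL> is a
# placeholder). Pinning PIECEWISE keeps the pre-0.11.0 capture behavior rather
# than the new FULL_AND_PIECEWISE default:
#   vllm serve <MODEL> --compilation-config '{"cudagraph_mode": "PIECEWISE"}'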
- config-keys:
- dsr1*
description:
- "Fix bug where 1k8k and 8k1k full sweeps had incorrect max-model-len for DeepSeek"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/163
- config-keys:
- dsr1-fp4-b200-sglang
- dsr1-fp8-b200-sglang
- dsr1-fp8-h200-sglang
description:
- "Consolidate H200 and B200 SGLang configurations to use unified v0.5.5-cu129-amd64 image tag"
- "Update deprecated SGLang server arguments to current equivalents"
- "Replace --enable-ep-moe with --ep-size $EP_SIZE"
- "Replace --enable-flashinfer-trtllm-moe with --moe-runner-backend flashinfer_trtllm"
- "Add -e EP_SIZE to Docker run commands in launch scripts"
- "Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/204
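# Migration sketch for the deprecated arguments above (the docker run shape is an
# assumption about the launch scripts; <IMAGE> and <MODEL> are placeholders):
#   docker run ... -e EP_SIZE="$EP_SIZE" <IMAGE> \
#     python3 -m sglang.launch_server --model-path <MODEL> \
#       --ep-size "$EP_SIZE" --moe-runner-backend flashinfer_trtllm
#   # --ep-size replaces --enable-ep-moe
#   # --moe-runner-backend flashinfer_trtllm replaces --enable-flashinfer-trtllm-moe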
- config-keys:
- gptoss-fp4-mi355x-vllm
- gptoss-fp4-b200-vllm
description:
- "Extend concurrency to 128 for gptoss mi355x/b200 vllm configurations"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/209
- config-keys:
- gptoss-fp4-b200-trt
description:
- "Extend concurrency to 128 for gptoss b200 TRT configurations"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/233
- config-keys:
- "*gb200-dynamo-sglang"
description:
- "Introduce improvements in GB200 SGLang DSR1 submission"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/257
- config-keys:
- dsr1-fp8-h200-trt
description:
- "Update TRT image from nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc0.post1 to nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2"
- "Increase concurrency for some configurations"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/266
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Update vLLM image for NVIDIA configs from vLLM 0.11.0 to vLLM 0.11.2"
- "Add kv-cache-dtype: fp8 to benchmarks/gptoss_fp4_b200_docker.sh"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/273
- config-keys:
- gptoss-fp4-b200-trt
description:
- "Add benchmark script for GPTOSS FP4 B200 TRT-LLM"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/256
- config-keys:
- dsr1-fp4-gb200-dynamo-trt
- dsr1-fp4-gb200-dynamo-sglang
- dsr1-fp8-gb200-dynamo-sglang
description:
- "Add more configurations for GB200 SGLang DSR1"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/335
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update MI355x Deepseek-R1 FP4 SGLang Image to upstream v0.5.6.post1"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/330
- config-keys:
- dsr1-fp4-gb200-dynamo-sglang
- dsr1-fp8-gb200-dynamo-sglang
description:
- "fix: Pruning unnecessary concurrencies "
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/358
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Updating MI355x Deepseek-R1 FP4 SGLang Image to upstream v0.5.6.post2"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/369
- config-keys:
- dsr1-fp4-b200-sglang
- dsr1-fp8-b200-sglang
- dsr1-fp8-h200-sglang
description:
- "Update NVIDIA DeepSeek sglang Docker image from v0.5.5 to v0.5.6"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/276
- config-keys:
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
description:
- "Update vLLM image from v0.11.2 to v0.13.0"
- "Add VLLM_MXFP4_USE_MARLIN=1 to H100 and H200 benchmark scripts"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/327
- config-keys:
- dsr1-fp8-mi300x-sglang
- dsr1-fp8-mi325x-sglang
- dsr1-fp8-mi355x-sglang
description:
- Use upstream SGLang images on MI300, MI325, and MI355 for DSR1 FP8
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/332
- config-keys:
- gptoss-fp4-gb200-dynamo-trt
- gptoss-fp4-b200-trt
description:
- Explicitly set EP=TP for DP-attention configs in the B200 AGG nvidia-master file; the multinode refactor had inadvertently changed the default to EP=1
- Add GPT-OSS DISAGG configurations for GB200 1k1k and 8k1k.
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/387
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add PD disaggregation (1P2D) for Mi355X"
- "Includes with and without speculative decoding"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/348
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Updating MI355x Deepseek-R1 FP4 SGLang Image to upstream v0.5.7"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/395
- config-keys:
- dsr1-fp8-b200-sglang
description:
- "Adds TP4 configurations to DSR1-FP8 B200 SGLang deployment experiments"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/411
- config-keys:
- dsr1-fp4-b200-trt-mtp
- dsr1-fp8-b200-trt-mtp
- dsr1-fp8-h200-trt-mtp
description:
- Add MTP (Multi-Token Prediction) support for single-node TRT configs
- Add spec-decoding field to config entries and update launch scripts to select MTP benchmark scripts
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/392
- config-keys:
- dsr1-fp8-mi355x-atom
- dsr1-fp4-mi355x-atom
- gptoss-fp4-mi355x-atom
description:
- Add internal AMD ATOM inference engine for DeepSeek R1 FP8/FP4 and GPT-OSS FP4 on MI355X
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/419
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
description:
- "Update AMD MI300X and MI325X GPT-OSS 120B vLLM to use upstream ROCm image vllm/vllm-openai-rocm:v0.14.0"
- "Remove deprecated --async-scheduling flag (now enabled by default in vLLM v0.14.0)"
- "Remove deprecated --max-seq-len-to-capture flag"
- "Add HIP_VISIBLE_DEVICES env var for Ray compatibility in vLLM 0.14+"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/496
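# Sketch of the resulting launch (env/flag names as in the entry; the GPU list is
# a placeholder for however many devices the config uses):
#   export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # required for Ray on vLLM 0.14+
#   vllm serve <MODEL> ...
#   # --async-scheduling dropped (on by default in v0.14.0)
#   # --max-seq-len-to-capture dropped (flag removed upstream)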
- config-keys:
- dsr1-fp8-h200-sglang
description:
- "Update H200 DeepSeek R1 FP8 SGLang image from v0.5.6 to v0.5.7"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/538
- config-keys:
- dsr1-fp8-mi300x-sglang
description:
- "Update MI300X DeepSeek R1 FP8 SGLang image from v0.5.5.post3 to v0.5.7"
- "Add SGLANG_AITER_MLA_PERSIST=1 for persistent MLA kernel optimization"
- "Set --kv-cache-dtype fp8_e4m3 for fp8 KV cache"
- "Set --attention-backend aiter for AMD aiter attention backend"
- "Update chunked-prefill-size and max-prefill-tokens from 196608 to 131072"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/544
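# Combined sketch of the MI300X changes above (the launch_server entrypoint is an
# assumption; the env var and flags are as named in the entry):
#   SGLANG_AITER_MLA_PERSIST=1 python3 -m sglang.launch_server --model-path <MODEL> \
#     --kv-cache-dtype fp8_e4m3 --attention-backend aiter \
#     --chunked-prefill-size 131072 --max-prefill-tokens 131072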
- config-keys:
- dsr1-fp8-mi325x-sglang
description:
- "Update MI325X DeepSeek R1 FP8 SGLang image from v0.5.5.post3 to v0.5.7"
- "Add SGLANG_AITER_MLA_PERSIST=1 for persistent MLA kernel with fp8 KV cache"
- "Add --kv-cache-dtype fp8_e4m3 for explicit FP8 KV cache"
- "Add --attention-backend aiter for AMD aiter attention backend"
- "Reduce chunked-prefill-size from 196608 to 131072"
- "Reduce max-prefill-tokens from 196608 to 131072"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/545
- config-keys:
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
description:
- "Fix AITER env vars for vLLM v0.14.0 on AMD MI300X and MI325X"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/535
- config-keys:
- dsr1-fp8-mi355x-sglang
description:
- "Update MI355X DeepSeek R1 FP8 SGLang image from v0.5.5.post3 to v0.5.8"
- "Key fix: Disables mla persistent kernel when not using fp8 kv_cache (https://github.com/sgl-project/sglang/pull/17327)"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/572
- config-keys:
# NVIDIA single-node
- dsr1-fp4-b200-sglang
- dsr1-fp4-b200-trt
- dsr1-fp4-b200-trt-mtp
- dsr1-fp8-b200-sglang
- dsr1-fp8-b200-trt
- dsr1-fp8-b200-trt-mtp
- dsr1-fp8-h200-sglang
- dsr1-fp8-h200-trt
- dsr1-fp8-h200-trt-mtp
- gptoss-fp4-b200-trt
- gptoss-fp4-b200-vllm
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-trt
- gptoss-fp4-h200-vllm
# AMD single-node
- dsr1-fp4-mi355x-sglang
- dsr1-fp4-mi355x-atom
- dsr1-fp8-mi300x-sglang
- dsr1-fp8-mi325x-sglang
- dsr1-fp8-mi355x-sglang
- dsr1-fp8-mi355x-atom
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
- gptoss-fp4-mi355x-vllm
- gptoss-fp4-mi355x-atom
description:
- Add official GSM8k eval results to GPT-OSS and DeepSeek R1 scenarios
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/558
evals-only: true
- config-keys:
- dsr1-fp4-b300-dynamo-trt
description:
- "Add DSR1 FP4 B300 Dynamo TRT configurations"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/585
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update SGLang image from v0.5.7 to v0.5.8 for DeepSeek-R1 FP4 on MI355x"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/595
- config-keys:
- dsr1-fp8-b200-trt
description:
- "Update TensorRT-LLM container from release:1.1.0rc2.post2 to release:1.2.0rc6.post2"
- "Change default MOE backend from DEEPGEMM to TRTLLM"
- "Add dynamic piecewise CUDA graphs for 1k1k (CONC≥64) and 8k1k (CONC≥64) workloads"
- "Add delay batching (batch_wait_timeout_iters/batch_wait_max_tokens_ratio) for 1k1k high-concurrency"
- "Add dynamic KV cache memory fraction tuning (0.7-0.8) based on ISL/OSL/TP configuration"
- "Update search space: remove EP=TP constraint, add TP=4 configurations, extend concurrency ranges"
- "Add TLLM_OVERRIDE_LAYER_NUM=61 to avoid OOM errors"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/594
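# Sketch of where these knobs live (the extra-llm-api YAML layout and the
# free_gpu_memory_fraction key name are assumptions; the batch_wait_* keys are as
# named above; <...> values vary per ISL/OSL/TP point):
#   export TLLM_OVERRIDE_LAYER_NUM=61   # per the entry above, avoids OOM
#   # extra-llm-api config:
#   #   kv_cache_config:
#   #     free_gpu_memory_fraction: <0.7-0.8>
#   #   batch_wait_timeout_iters: <iters>        # delay batching for 1k1k high-conc
#   #   batch_wait_max_tokens_ratio: <ratio>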
- config-keys:
- dsr1-fp4-b200-dynamo-trt
description:
- "Update DSR1 FP4 B200 Dynamo TRT configurations"
- "Update TRTLLM version to 1.2.0rc6.post2"
- "Transform to use srt-slurm recipes"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/588
- config-keys:
- dsr1-fp8-h200-dynamo-trt
description:
- "Add DSR1 FP8 H200 Dynamo TRT-LLM disaggregated multinode configuration"
- "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
- "Runner: h200-dgxc with multinode and disagg enabled"
- "Includes MTP and STP configurations for 1k1k and 8k1k sequence lengths"
- "Concurrency levels: 4, 8, 16, 32, 64, 128, 256, 512"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/570
- config-keys:
- dsr1-fp4-gb200-dynamo-trt
description:
- "Update Dynamo TRT image from 0.5.1-rc0.pre3 to 0.8.1.post2"
- "Update TRT configurations"
- "Refactor configurations to use CONFIG_FILE-based recipes instead of inline parameter settings"
- "Introduce srt-slurm workflow for launching Dynamo jobs"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/510
- config-keys:
- dsr1-fp8-mi355x-sglang
description:
- "Disable torch.compile for MI355X DeepSeek-R1 FP8 SGLang"
- "set cuda-graph-max-bs to CONC"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/613
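# Sketch (launch shape is an assumption; both flags are standard SGLang server args):
#   python3 -m sglang.launch_server ... --cuda-graph-max-bs "$CONC"
#   # --enable-torch-compile is no longer passed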
- config-keys:
- dsr1-fp8-h200-dynamo-sglang
description:
- "Add DSR1 FP8 H200 Dynamo SGLang disaggregated multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8-cu130-runtime"
- "Runner: h200-multinode-slurm with multinode and disagg enabled"
- "Recipes sourced from srtslurm repo (recipes/h200/)"
- "1k1k configs: aggregated, low-latency (1P9D), high-throughput TEP (1P6D), DEP (1P6D)"
- "8k1k configs: aggregated, TEP configs (1P7D, 1P6D, 1P3D, 2P3D), DEP (1P1D)"
- "Concurrency levels range from 1 to 2048 depending on configuration"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/582
- config-keys:
- dsr1-fp4-b200-trt
description:
- "Update TensorRT-LLM container from release:1.1.0rc2.post2 to release:1.2.0rc6.post2"
- "Change default MOE backend from DEEPGEMM to TRTLLM"
- "Add dynamic piecewise CUDA graphs for 1k1k (TEP8 and CONC64)"
- "Update search space: remove EP=TP constraint, add TP=4 configurations, extend concurrency ranges"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/620
- config-keys:
- dsr1-fp4-gb300-dynamo-trt
description:
- "Add DeepSeek-R1 FP4 GB300 Dynamo TRT disaggregated multinode configurations"
- "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post2"
- "Includes MTP and STP configs for 1k1k, 1k8k, and 8k1k sequence lengths"
- "Add gb300-nv runner and launch script for srt-slurm integration"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/618
- config-keys:
- dsr1-fp4-mi355x-sglang-disagg
description:
- "enable PD/D for both MTP and non-MTP MI355X DeepSeek-R1 FP4 SGLang"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/622
- config-keys:
- dsr1-fp8-gb200-dynamo-trt
description:
- "Add DeepSeek R1 FP8 GB200 Dynamo TRT-LLM disaggregated multinode configurations"
- "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post2"
- "1k1k: 14 scenarios (7 MTP, 7 STP) with varying DP attention/TEP modes"
- "1k8k: 10 scenarios (5 MTP, 5 STP) for long output generation"
- "8k1k: 14 scenarios (7 MTP, 7 STP) for long context workloads"
- "Prefill workers: 1-5P, Decode workers: 1-4D, TP/EP: 8/16/32"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/617
- config-keys:
- dsr1-fp8-gb200-dynamo-trt
description:
- "Fix model_prefix argument in yaml configs"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/646
- config-keys:
- dsr1-fp8-b200-trt-mtp
description:
- Update to the latest TRTLLM 1.2 release container version
- Fine-tune the choice of parallelism in the nvidia-master file, moving most points to pure TP
- Enable piecewise CUDA graphs under most conditions
- Fine-tune max batch sizes and other settings
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/632
- config-keys:
- dsr1-fp8-gb300-dynamo-trt
description:
- "Add DeepSeek R1 FP8 GB300 Dynamo TRT-LLM disaggregated multinode configurations for 8k1k and 1k1k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/627
- config-keys:
- gptoss-fp4-b200-trt
description:
- "Update GPT-OSS FP4 B200 TRT pareto configurations and new container image"
- "Extend maximum concurrency to 256 across all sequence lengths"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/639
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add --use-chat-template argument to benchmark_serving script"
- "Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/647
- config-keys:
- dsr1-fp8-b200-sglang-mtp
description:
- "Add MTP (Multi-Token Prediction) support for DeepSeek R1 FP8 B200 SGLang using EAGLE speculative decoding"
- "Image: lmsysorg/sglang:v0.5.8-cu130-amd64"
- "Add benchmark script dsr1_fp8_b200_mtp.sh with EAGLE speculative decoding (num-steps=2, draft-tokens=3, topk=1)"
- "Update launch_b200-dgxc.sh to support SPEC_SUFFIX for MTP script selection"
- "Configurations: TP=8, EP=1, concurrency 4-64 for 1k1k, 1k8k, and 8k1k sequence lengths"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/626
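# Sketch of the EAGLE/MTP server arguments implied above (flag names follow
# SGLang's speculative-decoding options; the rest of the launch is elided):
#   python3 -m sglang.launch_server --model-path <MODEL> --tp 8 \
#     --speculative-algorithm EAGLE \
#     --speculative-num-steps 2 \
#     --speculative-eagle-topk 1 \
#     --speculative-num-draft-tokens 3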
- config-keys:
- dsr1-fp4-b200-trt-mtp
description:
- "Upgrade TensorRT-LLM container from release:1.1.0rc2.post2 to release:1.2.0rc6.post3"
- "Enable dynamic piecewise CUDA graphs for several conditions"
- "Adjust TP8/TP4 search space to reduce overlapping points"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/642
- config-keys:
- dsr1-fp8-b200-dynamo-sglang
description:
- "Add DSR1 FP8 B200 disaggregated SGLang multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
- "9 recipes: 4x 1k1k + 5x 8k1k, low-latency and max-throughput profiles"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/658
- config-keys:
- dsr1-fp8-gb300-dynamo-trt
description:
- "Add DeepSeek R1 FP8 GB300 Dynamo TRT-LLM disaggregated multinode configurations for 1k8k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/654
- config-keys:
- dsr1-fp8-h100-dynamo-trt
description:
- "Add DeepSeek R1 FP8 H100 Dynamo TRT-LLM disaggregated multinode configurations"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/651
- config-keys:
- dsr1-fp8-h200-dynamo-sglang
description:
- "Add MTP (EAGLE speculative decoding) configs alongside STP"
- "Update container to lmsysorg/sglang:v0.5.8.post1-cu130"
- "Remove aggregated configs, keep only disaggregated"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/640
- config-keys:
- dsr1-fp8-b300-dynamo-trt
description:
- "New B300 FP8 Dynamo TRT configurations"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/638
- config-keys:
- dsr1-fp8-h100-dynamo-trt
description:
- "Add DeepSeek R1 FP8 H100 Dynamo TRT-LLM disaggregated multinode configurations"
- "fix model_prefix bug from https://github.com/InferenceMAX/InferenceMAX/pull/651"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/663
- config-keys:
- dsr1-fp4-gb200-dynamo-sglang
description:
- "Update SGLang image from v0.5.5.post2 to v0.5.8-cu130"
- "Add FP4 model path separation via SRT_SLURM_MODEL_PREFIX in launch script"
- "Refactor to use CONFIG_FILE-based srt-slurm recipes instead of inline parameters"
- "Add 1k1k configurations: low-latency (1P2D), mid-curve (4P8D), max-tpt (4P12D)"
- "Add 8k1k configurations: low-latency (1P4D), mid-curve (6P12D), max-tpt (10P8D)"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/633
- config-keys:
- dsr1-fp8-gb200-dynamo-sglang
- dsr1-fp8-gb300-dynamo-sglang
description:
- "Update GB200 and GB300 configs for DSR1 FP8 SGLANG STP mode"
- "Image: lmsysorg/sglang:v0.5.8-cu130"
- "Update prefill/decode worker counts, TP/EP parallelism, and dp-attn settings for 1k1k and 8k1k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/635
- config-keys:
- dsr1-fp8-b200-dynamo-sglang-mtp
description:
- "Add DSR1 FP8 B200 disaggregated SGLang MTP multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
- "9 recipes: 4x 1k1k + 5x 8k1k, low-latency and max-throughput with EAGLE speculative decoding"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/667
- config-keys:
- dsr1-fp8-h100-dynamo-sglang
description:
- "Add DeepSeek-R1 FP8 H100 Dynamo SGLang STP disaggregated multinode configurations"
- "Image: lmsysorg/sglang:v0.5.8-cu130"
- "1k1k, 1k8k, 8k1k sequence lengths"
- "Two modes per seq-len: Max throughput TEP (1P2D) and Max throughput DEP (1P1D with dp-attention)"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/643
- config-keys:
- dsr1-fp8-h100-dynamo-sglang
description:
- "Add DeepSeek-R1 FP8 H100 Dynamo SGLang MTP disaggregated multinode configurations"
- "Image: lmsysorg/sglang:v0.5.8-cu130"
- "1k1k, 1k8k, 8k1k sequence lengths with MTP speculative decoding"
- "Two modes per seq-len: Max throughput TEP (1P2D) and Max throughput DEP (1P1D with dp-attention)"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/644
- config-keys:
- dsr1-fp8-mi355x-atom-mtp
- dsr1-fp4-mi355x-atom-mtp
description:
- "Add DSR1 FP8/FP4 MI355X ATOM with MTP configuration"
- "Image: rocm/atom:rocm7.2.0-ubuntu24.04-pytorch2.9-atom0.1.1"
- "Deepseek R1 with speculative decoding: 1k1k, 1k8k, 8k1k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/673
- config-keys:
- dsr1-fp4-b200-dynamo-sglang
description:
- "Add DSR1 FP4 B200 Dynamo SGLang STP mode"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-runtime"
- "1k1k configs: low-latency DEP (1P5D, 1P6D), max-throughput DEP (1P1D, 1P2D)"
- "8k1k configs: low-latency DEP/TEP (1P1D, 1P5D, 2P5D), TEP (1P1D), max-throughput DEP (7P2D)"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/672
- config-keys:
- dsr1-fp8-b200-dynamo-trt
description:
- "Introduce new DSR1 FP8 B200 Dynamo TRT configurations for 8k1k and 1k1k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/616
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
- dsr1-fp4-mi355x-sglang-disagg
description:
- "Bump MI355X MORI FP8/FP4 image to latest rocm/sgl-dev:sglang-0.5.8-rocm700-mi35x-mori-0210"
- "Bump mi355x sglang disagg recipe to sa-260211"
- "Add conc 4/8/16"
- "Use Pure TP with MTP=2 for 1k1k conc smaller than 128 and reduce MTP to 1 for DEP configs"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/674
- config-keys:
- dsr1-fp4-b200-dynamo-sglang-mtp
description:
- "Add B200 configs for DSR1 FP4 SGLANG MTP mode for 1k1k and 8k1k"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/683
- config-keys:
- dsr1-fp4-gb300-dynamo-sglang
description:
- "Add GB300 FP4 Dynamo SGLang disaggregated multinode configuration"
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-runtime"
- "Recipes sourced from srt-slurm repo (recipes/gb300-fp4/ folder)"
- "Add 1k1k configurations: low-latency (1P2D), mid-curve (4P8D), max-tpt (4P12D)"
- "Add 8k1k configurations: low-latency (1P4D), mid-curve (6P12D), max-tpt (10P8D)"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/636
- config-keys:
- dsr1-fp8-b200-dynamo-sglang-mtp
description:
- "Patches one missing concurrency point for "
- "DSR1 FP8 B200 disaggregated SGLang MTP multinode configuration. "
- "Image: lmsysorg/sglang:v0.5.8.post1-cu130-amd64"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/691
- config-keys:
- dsr1-fp8-b300-dynamo-trt
description:
- "Update max_num_tokens and max_batch_size for min-latency decode workers"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/690
- config-keys:
- dsr1-fp8-mi355x-sglang-disagg
description:
- "Add more sweep points for DSR1 FP8 both MTP and non-MTP 1k1k, 8k1k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/689
- config-keys:
- dsr1-fp8-b200-dynamo-trt
description:
- "Update max_num_tokens and max_batch_size for min-latency decode workers"
- "See srt-slurm recipe changes: https://github.com/ishandhanani/srt-slurm/pull/173"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/686
- config-keys:
- dsr1-fp8-mi325x-sglang
description:
- "Update MI325X DeepSeek R1 FP8 SGLang image from v0.5.7 to v0.5.8"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/692
- config-keys:
- dsr1-fp8-mi300x-sglang
description:
- "Update MI300X DeepSeek R1 FP8 SGLang image from v0.5.7 to v0.5.8"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/696
- config-keys:
- dsr1-fp4-mi355x-sglang-disagg
description:
- "Add more sweep points for DSR1 FP4 both MTP and non-MTP 1k1k, 8k1k"
pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/692