@@ -76,15 +76,15 @@ All performance values are measured in **output tokens per second per GPU**.
 ## Table of Contents
 
 - [Deepseek R1 0528](#deepseek-r1-0528)
-- [Deepseek R1 0528 - RTX Configurations](#deepseek-r1-0528-rtx-configurations)
+- [Deepseek R1 0528 - RTX 6000 Pro Blackwell Server Edition](#deepseek-r1-0528-rtx-configurations)
 - [GPT-OSS 120B](#gpt-oss-120b)
 - [GPT-OSS 20B](#gpt-oss-20b)
 - [LLaMA v3.3 70B](#llama-v33-70b)
-- [LLaMA v3.3 70B - RTX Configurations](#llama-v33-70b-rtx-configurations)
+- [LLaMA v3.3 70B - RTX 6000 Pro Blackwell Server Edition](#llama-v33-70b-rtx-configurations)
 - [Qwen3 235B A22B](#qwen3-235b-a22b)
-- [Qwen3 235B A22B - RTX Configurations](#qwen3-235b-a22b-rtx-configurations)
+- [Qwen3 235B A22B - RTX 6000 Pro Blackwell Server Edition](#qwen3-235b-a22b-rtx-configurations)
 - [Qwen3 30B A3B](#qwen3-30b-a3b)
-- [Qwen3 30B A3B - RTX Configurations](#qwen3-30b-a3b-rtx-configurations)
+- [Qwen3 30B A3B - RTX 6000 Pro Blackwell Server Edition](#qwen3-30b-a3b-rtx-configurations)
 
 ---
 
@@ -105,7 +105,7 @@ All performance values are measured in **output tokens per second per GPU**.
 
 <a id="deepseek-r1-0528-rtx-configurations"></a>
 
-# Deepseek R1 0528 - RTX Configurations (TP/PP)
+# Deepseek R1 0528 - RTX 6000 Pro Blackwell Server Edition (TP/PP)
 
 *Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*
 
@@ -165,7 +165,7 @@ All performance values are measured in **output tokens per second per GPU**.
 
 <a id="llama-v33-70b-rtx-configurations"></a>
 
-# LLaMA v3.3 70B - RTX Configurations (TP/PP)
+# LLaMA v3.3 70B - RTX 6000 Pro Blackwell Server Edition (TP/PP)
 
 *Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*
 
@@ -197,7 +197,7 @@ All performance values are measured in **output tokens per second per GPU**.
 
 <a id="qwen3-235b-a22b-rtx-configurations"></a>
 
-# Qwen3 235B A22B - RTX Configurations (TP/PP)
+# Qwen3 235B A22B - RTX 6000 Pro Blackwell Server Edition (TP/PP)
 
 *Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*
 
@@ -229,7 +229,7 @@ All performance values are measured in **output tokens per second per GPU**.
 
 <a id="qwen3-30b-a3b-rtx-configurations"></a>
 
-# Qwen3 30B A3B - RTX Configurations (TP/PP)
+# Qwen3 30B A3B - RTX 6000 Pro Blackwell Server Edition (TP/PP)
 
 *Shows Tensor Parallel (TP) and Pipeline Parallel (PP) configurations*
 
@@ -313,7 +313,7 @@ a model name (HuggingFace reference or path to a local model), a [generated data
 
 For dense / non-MoE models:
 ```shell
-trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --config $llm_options
+trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --config $llm_options --concurrency -1
 ```
 Llama 3.3
 
@@ -342,7 +342,7 @@ kv_cache_config:
   dtype: fp8
   # Hopper: use auto
 moe_config:
-  backend: CUTLASS
+  backend: TRTLLM
   # Hopper: use TRITON
 ```
 
@@ -364,6 +364,20 @@ kv_cache_config:
   dtype: fp8
 ```
 
+Kimi K2:
+
+`llm_options.yml`
+```yaml
+enable_attention_dp: true
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384]
+moe_config:
+  backend: CUTLASS
+kv_cache_config:
+  dtype: auto
+```
+
 Qwen3 MoE, Llama4 Maverick:
 
 `llm_options.yml`
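
For readers applying this diff, the full dense-model invocation it amends can be sketched as below. This is only an illustrative composition of the documented flags: the `tp_size`, `pp_size`, model, dataset, and options-file values are placeholder assumptions, not repository defaults, and the command is echoed rather than executed since `trtllm-bench` needs a GPU environment.

```shell
# Sketch of the dense-model benchmark command from the diff above.
# All concrete values are illustrative assumptions; substitute your own.
tp_size=8
pp_size=1
model_name="meta-llama/Llama-3.3-70B-Instruct"  # HuggingFace reference or local path
dataset_file="dataset.jsonl"                    # a generated benchmark dataset
llm_options="llm_options.yml"                   # one of the option files shown above

# Compose the command as the diff writes it, including the newly added
# --concurrency -1 flag, and print it for inspection.
cmd="trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --config $llm_options --concurrency -1"
echo "$cmd"
```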