Commit 6f9def6

test(bench_serving_dense_tp_4): cap context-length to 3072 (#1077)
The current TestBenchServingDenseTp4 launches the server without --context-length, so the control plane pads per-seq KV allocation to Qwen3-8B's full 40K context. RPA v3's DECODE block-size heuristic then picks bkv_sz=16384, far larger than the test's actual ~1K-token sequences, causing a ~13% throughput regression after #934 routed decode batches to the DECODE sub-kernel. Bound context-length to 3072 (which still covers test_itl's 1024+1024-token worst case) so RPA picks bkv_sz=3072 instead, and raise the throughput threshold from 9866 to 11000 (single-run measured: 12101 tok/s). See #1044 for the full root-cause analysis.
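The interaction described above can be sketched as follows. Note that `pick_bkv_sz`, the 16384 cap, and the min-based rule are illustrative assumptions chosen to reproduce the numbers in this commit, not the actual RPA v3 DECODE heuristic:

```python
# Hypothetical sketch of the block-size interaction described in this commit.
# pick_bkv_sz, the 16384 cap, and the min() rule are illustrative assumptions,
# not the real RPA v3 DECODE block-size heuristic.

def pick_bkv_sz(padded_kv_len: int, max_bkv: int = 16384) -> int:
    """Choose a KV block size that tracks the padded per-sequence KV length."""
    return min(padded_kv_len, max_bkv)

# Without --context-length, KV is padded to Qwen3-8B's full 40K context,
# so the heuristic lands on the 16384 cap despite ~1K-token test sequences.
print(pick_bkv_sz(40960))  # 16384
# Bounding --context-length to 3072 right-sizes the block instead.
print(pick_bkv_sz(3072))   # 3072
```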
1 parent c68d685 commit 6f9def6

1 file changed (3 additions, 1 deletion)

File tree

test/srt/test_bench_serving_dense_tp_4.py

```diff
@@ -52,6 +52,8 @@ def setUpClass(cls):
                 "--page-size",
                 "256",
                 "--disable-radix-cache",
+                "--context-length",
+                "3072",
             ],
             env={
                 "JAX_COMPILATION_CACHE_DIR": "/tmp/jax_compilation_cache",
@@ -105,7 +107,7 @@ def test_output_throughput_default_tp_4(self):
             f"### test_output_throughput_default_tp_4\n"
             f"Output throughput: {res['output_throughput']:.2f} token/s\n"
         )
-        self.assertGreater(res["output_throughput"], 9866)
+        self.assertGreater(res["output_throughput"], 11000)

     def test_ttft_default_tp_4(self):
         args = get_benchmark_args(
```
