Commit 6f9def6
test(bench_serving_dense_tp_4): cap context-length to 3072 (#1077)
The current TestBenchServingDenseTp4 launches the server without
--context-length, so the control plane pads per-seq KV allocation to
Qwen3-8B's full 40K context. RPA v3's DECODE block-size heuristic then
picks bkv_sz=16384 - far larger than the test's actual ~1K-token
sequences - causing a ~13% throughput regression after #934 routed decode
batches to the DECODE sub-kernel.
Bound context-length to 3072 (still covers test_itl's 1024+1024 token
worst case) so RPA picks bkv_sz=3072 instead, and raise the throughput
threshold from 9866 to 11000 (single-run measured 12101 tok/s).
See #1044 for full root-cause analysis.
1 parent c68d685 commit 6f9def6
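The arithmetic behind the two numbers chosen in the commit message can be checked with a small sketch. It uses only the figures quoted above (3072, 1024+1024, 9866, 11000, 12101). The constant and function names are hypothetical and are not taken from the test file, and the RPA block-size heuristic itself is not modeled.

```python
# Figures quoted in the commit message; all names here are hypothetical.
CONTEXT_LENGTH = 3072      # new --context-length cap
MAX_INPUT_TOKENS = 1024    # test_itl worst-case prompt length
MAX_OUTPUT_TOKENS = 1024   # test_itl worst-case generation length
OLD_THRESHOLD = 9866       # previous throughput threshold (tok/s)
NEW_THRESHOLD = 11000      # raised throughput threshold (tok/s)
MEASURED_TOK_S = 12101     # single-run measurement (tok/s)


def worst_case_tokens(input_len: int, output_len: int) -> int:
    """Total sequence length a single request can reach."""
    return input_len + output_len


def headroom(measured: float, threshold: float) -> float:
    """Fraction by which the measurement exceeds the threshold."""
    return (measured - threshold) / threshold


worst = worst_case_tokens(MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS)
slack = CONTEXT_LENGTH - worst

# The cap must still cover the longest sequence the test can produce.
assert worst <= CONTEXT_LENGTH

print(f"worst case: {worst} tokens, slack under cap: {slack} tokens")
print(f"headroom over new threshold: {headroom(MEASURED_TOK_S, NEW_THRESHOLD):.1%}")
print(f"headroom over old threshold: {headroom(MEASURED_TOK_S, OLD_THRESHOLD):.1%}")
```

The new threshold rests on a single measured run, so the headroom figure is only a rough indication of how much run-to-run variance the test can absorb.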
1 file changed
Lines changed: 3 additions & 1 deletion
The changed lines themselves were not preserved; only their positions are recoverable:

| Original file lines | Diff lines | Change |
|---|---|---|
| 52-57 | 52-59 | 2 lines added (diff lines 55-56) |
| 105-111 | 107-113 | 1 line replaced (original line 108 removed, diff line 110 added) |