test(bench_serving_dense_tp_4): cap context-length to 3072#1077
Merged
Conversation
The current TestBenchServingDenseTp4 launches the server without --context-length, so the control plane pads per-seq KV allocation to Qwen3-8B's full 40K context. RPA v3's DECODE block-size heuristic then picks bkv_sz=16384 - far larger than the test's actual ~1K-token sequences - causing ~13% throughput regression after #934 routed decode batches to the DECODE sub-kernel. Bound context-length to 3072 (still covers test_itl's 1024+1024 token worst case) so RPA picks bkv_sz=3072 instead, and raise the throughput threshold from 9866 to 11000 (single-run measured 12101 tok/s). See #1044 for full root-cause analysis.
aolemila approved these changes on May 14, 2026.
Summary
- Add `--context-length 3072` to `TestBenchServingDenseTp4`'s server launch args.
- Raise the `test_output_throughput_default_tp_4` threshold from 9866 → 11000 to match the new baseline.

Why
After #934 routed decode-only batches to RPA v3's DECODE sub-kernel, the kernel's default `bkv_sz = min(min_bkv_sz_to_peak, max_kv)` heuristic picks 16384 for this test, because the control plane pads per-seq KV allocation to Qwen3-8B's full 40K `context_len`. The DMA tile is ~16× larger than the test's actual ≤1K-token sequences, wasting HBM bandwidth and causing a ~13% throughput regression. Main has been intermittently failing this test since 2026-04-22 (e.g. runs 25787533738, 25713412493, 25668305532).

Bounding `--context-length` to 3072 lets RPA pick `bkv_sz=3072` instead. 3072 still covers the worst-case sub-test (test_itl: 1024 input + 1024 output = 2048 tokens) with margin.

Full root-cause analysis and alternative fixes are in #1044.
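The block-size behavior described above can be sketched as a few lines of Python. This is not the kernel's actual source; the constant `MIN_BKV_SZ_TO_PEAK = 16384` and the padded 40K value of 40960 are assumptions inferred from the numbers quoted in this PR (`min(16384, 40960) = 16384`, `min(16384, 3072) = 3072`).

```python
# Assumed: the smallest KV block size at which the DECODE sub-kernel's DMA
# reaches peak bandwidth. Inferred from this PR's numbers, not from source.
MIN_BKV_SZ_TO_PEAK = 16384


def pick_bkv_sz(max_kv: int) -> int:
    """Sketch of RPA v3's DECODE heuristic: bkv_sz = min(min_bkv_sz_to_peak, max_kv)."""
    return min(MIN_BKV_SZ_TO_PEAK, max_kv)


# Without --context-length, per-seq KV is padded to the model's full context
# (assumed 40960 for Qwen3-8B's "40K"): the heuristic picks a 16384-token tile,
# ~16x larger than the test's actual ~1K-token sequences.
print(pick_bkv_sz(40960))  # → 16384
# With --context-length 3072, max_kv is capped, so the tile shrinks to fit:
print(pick_bkv_sz(3072))   # → 3072
```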
Verification (TPU v6e-4, single run)
For comparison: PR #934 baseline (default context-length) measured 9985 tok/s on the same hardware; this change recovers it to 12101 (+21%), even slightly above PR #930's pre-regression 11484.
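The percentages above can be sanity-checked from the quoted single-run numbers; this is plain arithmetic on values stated in this PR, not additional measurements.

```python
# Throughput numbers (tok/s) quoted in this PR, all single runs on TPU v6e-4.
pr934_baseline = 9985   # PR #934, default context-length (regressed)
this_change = 12101     # this PR, --context-length 3072
pr930_pre_regression = 11484
new_threshold = 11000

recovery = (this_change - pr934_baseline) / pr934_baseline
print(f"recovery vs #934 baseline: {recovery:+.0%}")  # → +21%
print(this_change > pr930_pre_regression)  # → True (slightly above pre-regression)
print(this_change > new_threshold)         # → True (headroom above raised threshold)
```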
Test plan