
test(bench_serving_dense_tp_4): cap context-length to 3072#1077

Merged
JamesBrianD merged 1 commit into main from fix/dense-tp4-context-length on May 14, 2026
Conversation

@JamesBrianD
Collaborator

Summary

  • Adds --context-length 3072 to TestBenchServingDenseTp4's server launch args
  • Raises test_output_throughput_default_tp_4 threshold from 9866 → 11000 to match new baseline

Why

After #934 routed decode-only batches to RPA v3's DECODE sub-kernel, the kernel's default bkv_sz = min(min_bkv_sz_to_peak, max_kv) heuristic picks 16384 for this test, because the control plane pads per-seq KV allocation to Qwen3-8B's full 40K context_len. The resulting DMA tile is ~16× larger than the test's actual ≤1K-token sequences, wasting HBM bandwidth and causing a ~13% throughput regression. As a result, main has been intermittently failing this test since 2026-04-22 (e.g. runs 25787533738, 25713412493, 25668305532).

Bounding --context-length to 3072 lets RPA pick bkv_sz=3072 instead. 3072 still covers the worst-case sub-test (test_itl: 1024 input + 1024 output = 2048 tokens) with margin.
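The block-size selection described above can be sketched as follows. This is an illustrative model only: `bkv_sz = min(min_bkv_sz_to_peak, max_kv)` is quoted from the heuristic above, and the `16384` value for `min_bkv_sz_to_peak` is inferred from the observed picks, not taken from the kernel source.

```python
# Illustrative sketch of RPA v3's DECODE block-size heuristic, per the
# description above. MIN_BKV_SZ_TO_PEAK = 16384 is an assumed value
# consistent with the observed picks; the real kernel may derive it
# differently.
MIN_BKV_SZ_TO_PEAK = 16384

def pick_bkv_sz(max_kv: int) -> int:
    """bkv_sz = min(min_bkv_sz_to_peak, max_kv)."""
    return min(MIN_BKV_SZ_TO_PEAK, max_kv)

# Without --context-length: per-seq KV is padded to Qwen3-8B's full 40K
# context, so the heuristic saturates at MIN_BKV_SZ_TO_PEAK.
print(pick_bkv_sz(40960))  # -> 16384, ~16x the test's ~1K-token sequences

# With --context-length 3072: max_kv drops below the saturation point.
print(pick_bkv_sz(3072))   # -> 3072
```

Under this model, any cap at or below 16384 shrinks the DMA tile; 3072 is chosen because it is the smallest value that still covers the 2048-token worst case with margin.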

Full root-cause analysis and alternative fixes are in #1044.

Verification (TPU v6e-4, single run)

| sub-test | metric | threshold | measured | headroom |
|---|---|---|---|---|
| test_input_throughput | input_throughput | > 64960 | 68859 | +6% |
| test_output_throughput | output_throughput | > 11000 | 12101 | +10% |
| test_ttft | median_ttft_ms | < 38 | 28.56 | +25% |
| test_itl | median_itl_ms | < 8 | 6.29 | +21% |

For comparison: PR #934 baseline (default context-length) measured 9985 tok/s on the same hardware; this change recovers it to 12101 (+21%), even slightly above PR #930's pre-regression 11484.
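A quick arithmetic check of the comparison above, using only the throughput numbers quoted in this PR:

```python
# Throughput figures quoted above (tok/s, same TPU v6e-4 hardware).
baseline_934 = 9985         # PR #934 baseline, default context-length
pre_regression_930 = 11484  # PR #930, before the regression
measured = 12101            # this PR, single run

recovery_pct = (measured / baseline_934 - 1) * 100
print(f"recovery over #934 baseline: +{recovery_pct:.0f}%")  # +21%
print(measured > pre_regression_930)  # True: slightly above #930's number
```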

Test plan

  • All 4 sub-tests pass on TPU v6e-4
  • CI green on PR

The current TestBenchServingDenseTp4 launches the server without
--context-length, so the control plane pads per-seq KV allocation to
Qwen3-8B's full 40K context. RPA v3's DECODE block-size heuristic then
picks bkv_sz=16384 - far larger than the test's actual ~1K-token
sequences - causing ~13% throughput regression after #934 routed decode
batches to the DECODE sub-kernel.

Bound context-length to 3072 (still covers test_itl's 1024+1024 token
worst case) so RPA picks bkv_sz=3072 instead, and raise the throughput
threshold from 9866 to 11000 (single-run measured 12101 tok/s).

See #1044 for full root-cause analysis.

@JamesBrianD JamesBrianD merged commit 6f9def6 into main May 14, 2026
21 checks passed
@JamesBrianD JamesBrianD deleted the fix/dense-tp4-context-length branch May 14, 2026 05:57