
test(bench_serving_dense_tp_4): cap context-length to 3072#1077

Merged
JamesBrianD merged 1 commit into main from fix/dense-tp4-context-length on May 14, 2026
Conversation

@JamesBrianD
Collaborator

Summary

  • Adds --context-length 3072 to TestBenchServingDenseTp4's server launch args
  • Raises test_output_throughput_default_tp_4 threshold from 9866 → 11000 to match new baseline

Why

After #934 routed decode-only batches to RPA v3's DECODE sub-kernel, the kernel's default bkv_sz = min(min_bkv_sz_to_peak, max_kv) heuristic picks 16384 for this test, because the control plane pads per-seq KV allocation to Qwen3-8B's full 40K context_len. The resulting DMA tile is ~16× larger than the test's actual ≤1K-token sequences, wasting HBM bandwidth and causing a ~13% throughput regression. As a result, main has been intermittently failing this test since 2026-04-22 (e.g. runs 25787533738, 25713412493, 25668305532).

Bounding --context-length to 3072 lets RPA pick bkv_sz=3072 instead. 3072 still covers the worst-case sub-test (test_itl: 1024 input + 1024 output = 2048 tokens) with margin.
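The block-size selection described above can be sketched as follows. This is an illustrative model only: `bkv_sz = min(min_bkv_sz_to_peak, max_kv)` is quoted from the heuristic above, and the `16384` value for `min_bkv_sz_to_peak` is inferred from the observed picks, not taken from the kernel source.

```python
# Illustrative sketch of RPA v3's DECODE block-size heuristic, per the
# description above. MIN_BKV_SZ_TO_PEAK = 16384 is an assumed value
# consistent with the observed picks; the real kernel may derive it
# differently.
MIN_BKV_SZ_TO_PEAK = 16384

def pick_bkv_sz(max_kv: int) -> int:
    """bkv_sz = min(min_bkv_sz_to_peak, max_kv)."""
    return min(MIN_BKV_SZ_TO_PEAK, max_kv)

# Without --context-length: per-seq KV is padded to Qwen3-8B's full 40K
# context, so the heuristic saturates at MIN_BKV_SZ_TO_PEAK.
print(pick_bkv_sz(40960))  # -> 16384, ~16x the test's ~1K-token sequences

# With --context-length 3072: max_kv drops below the saturation point.
print(pick_bkv_sz(3072))   # -> 3072
```

Under this model, any cap at or below 16384 shrinks the DMA tile; 3072 is chosen because it is the smallest value that still covers the 2048-token worst case with margin.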

Full root-cause analysis and alternative fixes are in #1044.

Verification (TPU v6e-4, single run)

| sub-test | metric | threshold | measured | headroom |
|---|---|---|---|---|
| test_input_throughput | input_throughput | > 64960 | 68859 | +6% |
| test_output_throughput | output_throughput | > 11000 | 12101 | +10% |
| test_ttft | median_ttft_ms | < 38 | 28.56 | +25% |
| test_itl | median_itl_ms | < 8 | 6.29 | +21% |

For comparison: PR #934 baseline (default context-length) measured 9985 tok/s on the same hardware; this change recovers it to 12101 (+21%), even slightly above PR #930's pre-regression 11484.
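A quick arithmetic check of the comparison above, using only the throughput numbers quoted in this PR:

```python
# Throughput figures quoted above (tok/s, same TPU v6e-4 hardware).
baseline_934 = 9985         # PR #934 baseline, default context-length
pre_regression_930 = 11484  # PR #930, before the regression
measured = 12101            # this PR, single run

recovery_pct = (measured / baseline_934 - 1) * 100
print(f"recovery over #934 baseline: +{recovery_pct:.0f}%")  # +21%
print(measured > pre_regression_930)  # True: slightly above #930's number
```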

Test plan

  • All 4 sub-tests pass on TPU v6e-4
  • CI green on PR

The current TestBenchServingDenseTp4 launches the server without
--context-length, so the control plane pads per-seq KV allocation to
Qwen3-8B's full 40K context. RPA v3's DECODE block-size heuristic then
picks bkv_sz=16384 - far larger than the test's actual ~1K-token
sequences - causing ~13% throughput regression after #934 routed decode
batches to the DECODE sub-kernel.

Bound context-length to 3072 (still covers test_itl's 1024+1024 token
worst case) so RPA picks bkv_sz=3072 instead, and raise the throughput
threshold from 9866 to 11000 (single-run measured 12101 tok/s).

See #1044 for full root-cause analysis.

@JamesBrianD JamesBrianD merged commit 6f9def6 into main May 14, 2026
21 checks passed
@JamesBrianD JamesBrianD deleted the fix/dense-tp4-context-length branch May 14, 2026 05:57