Add more testing for chunked prefill by fzyzcjy · Pull Request #27506 · sgl-project/sglang

fzyzcjy · 2026-06-07T14:59:03Z

Stacked on #27502 (mixed-prefix gsm8k eval), which is this PR's base.

Adds the scripted-runtime test harness (one forward per script yield, precise per-step engine-state assertions over an HTTP-server + scheduler-hook driven engine) and a chunked-prefill test suite (test/manual/chunked_prefill/test_scripted_*.py + test_e2e_*.py), plus kv-canary PP fixtures and the registered scripted-runtime/chunked CI tests.

The whole suite was brought green on H200 (registered unit + GPU, manual scripted 89/89 classes, manual e2e 11/11 files) and passed a multi-agent weakening audit.

CI States

Latest PR Test (Base): 🚫 Run #27205607049
Latest PR Test (Extra): ✅ Run #27205605882

Per review: a one-line passthrough should call the original object directly, and any wrapper that must exist should return the original type rather than invent a new one. Remove all thin ScriptedReqHandle field getters and the thin ScriptedContext readers (engine_stats, row_pool_used, *_rids, batch_size, get_chunked_req_rid, chunked_in_flight_count, lock_refs_snapshot); tests now read r.req.<field> / t._scheduler.<...> directly. Keep start_req kwargs, abort(handle) (an HTTP action), and list_active_reqs() (aggregates four scheduler structures and returns List[Req]).

list_active_reqs collapses to list(set(_get_all_reqs(ctx))) (Req is hashable by identity, so the same object across scheduler structures dedups). Make the ScriptedContextReqStarter.start_req ignore_eos/priority/dp_rank kwargs required so the internal plumbing carries no silent defaults; the public facade keeps its ergonomic defaults.

Per the line-count rule (wrap multi-line computations, inline 1-liners): kv_pages derives a page count from kv_allocated_len and scheduler.page_size, and engine_stats aggregates several pool counters into one dict -- both are real computations, not thin field reads, so they earn a wrapper. Re-add their registered-core tests.

Returning 0 when find_req_by_rid finds nothing conflated 'req released' with 'req present holding 0 pages', which would let a kv_pages==0 assertion pass even if the req were wrongly missing -- a silent assertion weakener. Read the field on the live req instead; the released-state check still holds because right after run_until_finished the req is still in last_batch with kv_allocated_len==0.

Replace the removed thin-wrapper calls with direct access to the original objects: handle field reads become r.req.<field> (output_ids, req_pool_idx, cached_tokens, finished_reason, ...) and the dropped context readers become direct t._scheduler.<...> expressions (running_batch rids, chunked_req rid, req_to_token_pool sizes; lock_refs_snapshot -> get_all_node_lock_refs). Kept wrappers (kv_pages, engine_stats, abort, list_active_reqs, start_req kwargs) are used unchanged. Released-state reads that follow a drain are made None-tolerant (assert r.req is None or ...). These files still reference the not-yet-implemented multi-line APIs (chunks_done, force_*, exhaust_*, ...) and do not run until those land.

With tie_word_embeddings, lm_head was tied to model.embed_tokens unconditionally. Under PP the embedding lives on the first rank, so the last rank (which computes logits) ended up with a PPMissingLayer head and crashed in _compute_lm_head. Mirror Qwen2: tie only when the model is on a single PP rank, otherwise build a real ParallelLMHead on the last rank and copy the tied embedding weight into it during load_weights.

The kv-canary SingleForwardManager assumes one forward pass at a time; pipeline parallelism keeps several micro-batches in flight and breaks its phase checker. Enable the canary only when pp_size == 1.

Under pipeline parallelism, micro-batches stay in flight after the running batch looks empty; flushing the cache then trips a radix-tree assertion when a late micro-batch releases a node from the flushed tree. Drive extra scheduler steps to drain the pipeline before flush_cache.

Lets a script keep a request resident for a fixed number of decode steps regardless of EOS, which scenarios that need a steady KV-cache occupant rely on. Defaults to False so existing scripts are unchanged.

start_req gains a prompt_token parameter so a script can issue a request whose prompt does not share a radix prefix with another req (needed to make a chunked req actually prefill instead of hitting a resident req's cache). get_all_node_lock_refs now handles SWARadixCache TreeNodes, which track full_lock_ref and swa_lock_ref separately instead of a single lock_ref.

The scenario never drove the KV pool tight enough to park the chunked req: the resident competitor and r both used token id 1, so r hit the radix prefix cache and prefilled almost nothing. Give r a distinct prompt token, keep the competitor resident via ignore_eos, and use enable_mixed_chunk so the competitor decodes alongside r's prefill chunks -- its decode growth exceeds its 0.7x admission reservation and pushes the full pool to 1.00 on r's final chunk, parking r (add_chunked_req early-return) until the competitor frees its KV. All assertions (observed_early_return, r.finished, no locked nodes after drain) are unchanged.

…lama The scripted-runtime PP tests ran Llama-3.2-1B-Instruct, whose tied word embeddings crashed under pipeline parallelism, which had forced a fix into production llama.py. Switch SMALL_MODEL to Qwen3-0.6B, which already handles tied embeddings under PP in qwen3.py, and revert the llama.py change.

…ng it Revert the earlier chunked=True at the prefill-completion cache site in batch_result_processor: that branch runs for every request finishing prefill (chunked final chunk and non-chunked single prefill alike), so forcing chunked=True suppressed the legitimate hit_count bump on the general prefill path, not just chunked self-inserts. A request inserts its prompt prefix twice -- via cache_unfinished_req when prefill completes and via cache_finished_req on finish -- while decode nodes are inserted once, so prompt nodes settle at hit_count 2 and decode nodes at 1, identically for chunked and non-chunked prefill. #18843's middle-chunk chunked=True still prevents the unbounded early-node inflation. Replace the old all==1 expectation with side-by-side chunked and non-chunked tests asserting the hit_count value set is exactly {1, 2}.

The scripted ctx.flush_cache() previously only confirmed the FlushCacheReqInput arrived at the scheduler; it never checked the flush actually happened. The scheduler's flush_cache() is a no-op returning success=False whenever it is not fully idle, so a flush request on a busy scheduler silently did nothing. Make flush_cache a generator that yields once -- the scheduler consumes the buffered FlushCacheReqInput during the same recv_requests() call, but only after the script yields control back -- and then asserts the cache is fully flushed: radix tree empty and both the KV allocator and the req_to_token pool restored to their fully-free baselines (captured at context init to sidestep page-rounding / padding-slot conventions). Add assert_flushed (default True) so the deliberately-busy flush-during-chunked test can opt out. Update all call sites to use `yield from t.flush_cache()`.

Classify radix nodes by cumulative token position instead of only checking the hit_count value set: nodes ending within the prompt (prefill) must each be hit exactly twice, nodes beyond it (decode) exactly once. This pins the count to the correct node role rather than merely asserting both values occur.

Record the flush outcome on SchedulerFlushWrapper.last_success and have the scripted ctx.flush_cache() assert that scheduler.flush_cache() returned True, rather than re-deriving flushed-ness from radix/pool state. The scheduler already computes the authoritative success bool (False whenever it is not fully idle); reading it directly is simpler and exact. The scripted helper clears last_success before posting and asserts it is True after the yield that lets the scheduler run the flush. Drop the pool-baseline capture in ScriptedContext.__init__.

This reverts commit b3df1f2.

Drop the assert-it-is-really-flushed logic from scripted ctx.flush_cache and restore it to its original fire-and-confirm form, reverting the generator signature, the pool-availability baselines on ScriptedContext, and the `yield from t.flush_cache()` call sites. The flush verification will be reworked separately.

Without enable_mixed_chunk and without an ignore_eos competitor, the add_chunked_req hybrid-SWA early-return is unreachable through ordinary request traffic: while a chunked req is being prefilled, prefill preempts decode (so a running req never grows the pool) and no second req can be admitted (the chunk budget is held), so nothing tightens rem_total_tokens mid-prefill. The previous design relied on exactly that mixed-chunk decode kept alive by ignore_eos. Reproduce the pool state directly instead: admit a single chunked req r on an empty pool, then drain the full-attention allocator from the script so r's next chunk cannot be admitted -> the real scheduler takes the early-return and parks r (chunked_req set, _chunked_req_scheduled_last_iter False). Releasing the reserved KV lets r resume; the test still asserts the stash gate left no radix lock refs behind. Add ScriptedContext.reserve_full_attention_kv / full_attention_available_size (context/pool.py) for the deterministic full-pool pressure.

…e) into start_req API work Conflicts in scripted_runtime start_req, both sides extended the same signature: - context/req_starter.py: keep theirs' prompt_token (input_ids = [prompt_token]*prompt_len) and ours' priority/dp_rank payload passthrough; ignore_eos was added by both, kept once; use the shared sampling_params var (equivalent to theirs' inline {max_new_tokens, ignore_eos}). - context/api.py: union the public start_req kwargs (ignore_eos, priority, dp_rank, prompt_token) in both the signature and the delegated call; ours' abort/list_active_reqs/engine_stats auto-merged in untouched.

lock_refs (ScriptedReqHandle): the radix lock_ref held on the req's last_node (0 once last_node is cleared on release) -- a genuine multi-line derivation, no instrumentation. batch_composition (ScriptedContext): a pure read of the scheduler returning {prefill, decode, chunked, running} rid lists, with the chunked req excluded from prefill/decode so the subsets stay disjoint. Add registered tests for both.

Rework the regression to mirror issue #24252's actual trigger. The chunked-req hybrid-SWA early-return is driven by rem_total_tokens <= 0, where the running batch's reserved-decode offset is sum(max_new - output) * new_token_ratio. In production a KV-pool-full retraction jumps new_token_ratio (log: 0.098 -> 0.7589), ballooning that reservation and parking a mid-prefill chunked req. A retraction only fires on the decode path, and the scripted runtime steps deterministically with overlap scheduling off, so decode (hence the ratio jump) cannot run while a chunked req is being prefilled. So reproduce the retraction's effect directly: a resident holder with a large max_new carries the reservation, r is admitted at a low (decayed) ratio, then the ratio is jumped exactly as retract_decode does -> r is parked via the early-return. Restoring the decayed ratio lets r resume; the test asserts the stash gate left no radix lock refs. Drop the earlier full-attention pool-drain helper (context/pool.py + the two ScriptedContext methods); the ratio-jump path reproduces the real mechanism and no longer needs it.

…ests has_pending_chunk has no accessor and is equivalent to the existing is_chunking (true while the req holds the chunked_req slot, false after abort/pause/retract/finish); every assertion site is consistent with that, so rename rather than add a redundant one-line wrapper.

…anual tests pending_middle_outputs has no harness accessor and is exactly the existing Req.inflight_middle_chunks (incremented per dispatched middle chunk, decremented when its forward result is consumed, zeroed on retract/abort) -- a direct field read, not instrumentation. Rename all references (incl. test/script names and messages) to read r.req.inflight_middle_chunks. Drop the duplicate test_inflight_middle_chunks_non_negative the rename collided with.

…inish guard The scripted harness has no finish_event_count accessor, and 'finish emitted exactly once' is already enforced by the engine: output_streamer asserts 'not req.finished_output' before streaming a finished req, and the req pool asserts req_pool_idx is not None on free -- the scripted runtime surfaces any such scheduler-side crash as a failed test. So the count assertion is redundant: keep driving each scenario (abort / preempt / PP races) and assert the req finished; a double-finish would have crashed the engine. == 1 -> assert finished; <= 1 -> a comment noting the engine enforces at-most-one.

Add a single gated hook at the scheduler's run_batch chokepoint: when scripted_scheduler_hook is set, record each run batch (forward_iter, mode, rids, extend_rids, chunked_rid) into ScriptedSchedulerHook._batch_log (driver rank only; log cleared after the per-script reset drain). No scheduler business-logic change and zero production cost (hook is None otherwise). chunks_done(rid) is then derived in the harness as the number of logged batches where the req held the chunked_req slot and actually ran -- exactly the number of chunked prefill passes (0 when never chunked). Expose it as ScriptedContext.chunks_done / ScriptedReqHandle.chunks_done, so the manual suite's r.chunks_done works unchanged. Add registered tests (<=chunk -> 0, chunk+2 -> 2, 5*chunk -> 5).

ForwardMode is an IntEnum, so .name.lower() is simpler than the is_*() chain and also captures the modes the chain collapsed to 'unknown' (target_verify, split_prefill, dllm_extend, ...).

…nset mode is now Optional[str].

…o-v2 multimodal-spec fix

fzyzcjy · 2026-06-08T02:51:19Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Update after merging latest main into this branch (which pulls in #27512): base-c-8gpu-h200 (2) / test_mimo_v2 is now SUCCESS — #27512's multimodal pad-sentinel clamp fixed the scheduler-init crash (exit -3 from an out-of-vocab OOB gather in the MiMo-V2 + multi-layer-EAGLE draft path). Confirmed both by a fresh tip-of-main rerun-test (passed in 169s) and now in this PR's own CI. The earlier 'main also fails' cross-branch result was a timing artifact — my probe ran on a07d813, before #27512 landed.

Remaining red lanes are all unrelated to this PR's diff (PP cross-mb exclusion / scripted harness / gsm8k eval):

deepep-4gpu-h100 (test_deepep_small): torch.distributed.DistNetworkError ... EADDRINUSE ... port 43639 already in use → Rank 0 scheduler died during init — a CI-runner port-collision infra flake.
base-c-8gpu-h200 (0) (test_dsa_glm5_tp_mtp): throughput AssertionError: 141.63 not greater than 150 — a perf-threshold blip on a DSA/GLM5/MTP path this PR does not touch.
pr-test-extra-finish: aggregator cascade from the above.

Will /rerun-failed-ci once the round fully settles. Please push back if any conclusion is off.

fzyzcjy · 2026-06-08T02:52:24Z

/rerun-failed-ci

fzyzcjy · 2026-06-08T05:06:49Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Full rundown of every lane that went red on this PR and what each one actually was.

PR-introduced — all fixed

PP × chunked-prefill regression (TestRegressionPp reqs did not finish; test_e2e_pp gsm8k 0.49). The cross-mb in-flight exclusion SRT fix (69ef71edc4) had been dropped in an earlier main merge; once the harness's corrected finished-detection landed, the missing exclusion surfaced as a hang + KV corruption. Fix: cherry-picked the exclusion back, refined it to last-chunk EXTEND reqs only (it was over-excluding and pinning the long req forever), and added a batch-log finished-detection fallback for short-circuited a.finished and b.finished loops.
num_examples > available test lines AssertionError across ~9 lanes (base-a-1gpu-small, base-c/extra-b b200+h200 → test_basic_sanity_eagle3 / test_mimo_v2 / test_deepseek_v3_fp4 / test_flashinfer_a2a). A hard assert I'd added in the GSM8KEval prefix-pool refactor; upstream silently caps via list slicing and several tests deliberately pass num_examples larger than the dataset. Dropped the assert → 9 failures → 1.
base-c-8gpu-h200(2) / test_mimo_v2 server-startup crash (scheduler_0 exit -3). NOT a permanent infra timeout — it was an out-of-vocab OOB gather in the MiMo-V2 + multi-layer-EAGLE draft path, fixed upstream by [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512. My first cross-branch probe ran on a07d813 (before [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512 landed), which is why main 'also failed' then — a timing artifact. After merging latest main (which includes [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512) this lane is now SUCCESS (also confirmed by a tip-of-main rerun: passed in 169s).

Not PR-introduced — infra/perf, only surfaced after merging latest main

deepep lanes (test_deepep_small): EADDRINUSE — port 43639 already in use → Rank 0 scheduler died during init. A CI-runner port-collision infra flake; this PR doesn't touch MoE/DeepEP/dist-init. Most deepep lanes already re-ran green; the last couple are re-running now.
base-c-8gpu-h200(0) / test_dsa_glm5_tp_mtp: throughput 141.63 not > 150 — a perf-threshold miss on a DSA/GLM5/MTP path this PR's diff (PP exclusion / harness / gsm8k eval) does not touch. A tip-of-main cross-branch rerun is dispatched to confirm it's pre-existing.

Current: 70 SUCCESS / 0 FAILURE / 3 re-running. Will confirm the last reruns settle. Please push back if any conclusion is off.

fzyzcjy · 2026-06-08T07:07:33Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final babysit result.

#27506 is fully green — 77 SUCCESS / 0 FAILURE / 0 running.

All previously-red lanes resolved, each with cross-branch confirmation:

deepep (EADDRINUSE port 43639 collision → Rank-0 scheduler died): re-ran green on every lane (4-gpu-h100, 4-gpu-b200, 8-gpu-h200). Confirmed a CI-runner port-collision infra flake — a tip-of-main test_deepep_small rerun also passed cleanly (635s, 0 EADDRINUSE).
base-c-8gpu-h200(0) / test_dsa_glm5_tp_mtp (throughput 141.63 not > 150): re-ran SUCCESS; the tip-of-main cross-branch rerun of the same test also passed → a perf-threshold blip, not this PR (which doesn't touch DSA/GLM5/MTP).
base-c-8gpu-h200(2) / test_mimo_v2: fixed upstream by [Spec] Clamp multimodal pad sentinels in spec-v2 draft prefill embedding #27512 (multimodal pad-sentinel OOB-gather clamp), picked up by merging latest main.

PR-introduced failures were all fixed earlier in this babysit: the PP×chunked-prefill exclusion regression, and the num_examples > available assert. Net: green. Please push back if any conclusion is off.

… test _in_flight_other_mb_rids + filter_batch(exclude_in_flight_other_mb=...) were cherry-picked from feat/stateless_scheduler_b (commit 69ef71e), where a stateless-scheduler rewrite genuinely introduced a PP+chunked KV corruption (gsm8k 0.66 -> 0.77). That rewrite derived has_pending_chunk from admission-time state so it cleared one step too early, opening a window where a req's last-chunk forward is still in flight in mb_a while mb_b merges it into running_batch and decodes on empty output_ids. upstream/main structurally prevents this window and never had it: - per-mb last_mbs[mb_id] is committed only AFTER _pp_process_batch_result runs on that batch (scheduler_pp_mixin.py), so a req is merged for decode only once its forward result is processed; - inflight_middle_chunks defers output append until every chunk forward is processed (batch_result_processor.py), so decode never runs with empty output_ids; - has_pending_chunk / pending_middle_outputs (the mechanism the fix's docstring describes) never existed on main. So on this branch the code was dead defensive belt-and-suspenders. The accompanying test only asserted that the exclusion mechanism engages and is self-consistent (structural), never that omitting it corrupts output, and the only failure ever seen here was the fix's own over-exclusion. Drop both. filter_batch and the two scheduler call sites now match main exactly.

fzyzcjy · 2026-06-08T07:42:01Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Removed the _in_flight_other_mb_rids PP cross-mb chunked-exclude code and its scripted test after an investigation concluded it is dead defensive code on main; remaining PP tests verified green. Please push back if any conclusion is off.

Removed _in_flight_other_mb_rids (PP cross-mb in-flight chunked-exclude) + its scripted test (4eb5bc64a0).

This code (_in_flight_other_mb_rids + filter_batch(exclude_in_flight_other_mb=...)) was cherry-picked from a separate stateless-scheduler rewrite branch (commit 69ef71edc4), where a genuine PP+chunked KV-corruption regression existed (gsm8k 0.66 → 0.77). That regression was introduced by that rewrite: it derived the "still has a pending chunk" flag from admission-time state, so it cleared one step too early, opening a window where a req's last-chunk forward is still in flight in mb_a while mb_b merges it into running_batch and decodes on empty output_ids.

main structurally prevents that window and never had the bug:

per-mb last_mbs[mb_id] is committed only after _pp_process_batch_result runs on that batch (scheduler_pp_mixin.py), so a req is merged for decode only once its forward result has been processed;
inflight_middle_chunks defers the output append until every chunk forward is processed (batch_result_processor.py), so a decode never runs with empty output_ids;
has_pending_chunk / pending_middle_outputs (the data model the removed docstring described) never existed on main.

So on this branch the code was dead belt-and-suspenders, and its scripted test was tautological — it only asserted that the exclusion mechanism engages and is self-consistent, never that omitting it corrupts output. (The stateless branch itself later dropped the cross-mb scan, concluding the per-mb last_mbs invariant suffices.) filter_batch and the two scheduler call sites now match main exactly.

Verification (sci-h200, 4×GPU, this commit, Qwen3-0.6B): TestRegressionPp + test_scripted_pp.py → 11 passed / 1 skipped / 0 failed. Notably test_pp_last_chunk_cross_mb_kv_correctness (a dedicated cross-mb last-chunk KV-correctness test) passes without the removed exclusion, confirming it was not load-bearing.

…time_and_chunked_testing

fzyzcjy · 2026-06-08T09:01:08Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triaged the CI failures on the latest head after merging current main; classifying them and planning reruns. Please push back if any conclusion is off.

CI triage for head f240a513a6 (after merging latest main via #27502). All current failures are infra/perf flakes, not caused by this PR.

Context: after the merge this branch's production code (python/sglang/srt/) is now identical to main (the merge reconciled the earlier divergence). The branch's own registered scripted tests pass on this exact merged commit (rcli, sci-h200 4×GPU: test_scripted_swa_1gpu.py + test_scripted_core_1gpu.py → 8 passed; TestRegressionPp + test_scripted_pp.py earlier → 11 passed). So the red lanes below are the standard suite, not this PR's tests.

Fingerprints:

HTTP 504 Gateway Timeout during env/dependency setup (dies before tests run) — base-b-test-1-gpu-small (4/5/6/8), base-b-test-2-gpu-large (1), base-c-test-4-gpu-gb300 (0). A transient CI package-fetch wave.
extra-b-test-4-gpu-h100 (0) → test/registered/hicache/test_qwen35_hicache.py: zmq.error.ZMQError: Address already in use (tcp://*:5557) — port-collision infra flake (not this PR's 4-GPU scripted test).
extra-a-test-1-gpu-large (0) → test/registered/perf/test_bench_serving_1gpu_large.py: AssertionError: 1010.2 not less than 900 — perf-threshold blip (not this PR's swa/core scripted test).
base-c-test-8-gpu-h20 (1) — chronic H20 lane, ignored per CI policy.

Plan: let the round finish, then rerun the failed jobs once the 504 infra wave clears.

fzyzcjy · 2026-06-08T09:39:39Z

/rerun-failed-ci

fzyzcjy · 2026-06-08T10:03:37Z

/rerun-failed-ci

fzyzcjy · 2026-06-08T11:08:47Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Final CI triage after re-running the flaky lanes. Please push back if any conclusion is off.

Rerun update for head f240a513a6: re-running the failed jobs cleared all the infra/perf flakes (the 504-setup wave, the test_qwen35_hicache zmq port collision, the test_bench_serving perf blip, and the chronic H20 lane all went green). Now 64 SUCCESS / 8 SKIPPED, with a few long multi-GPU lanes still finishing.

One persistent failure remains — and it is not caused by this PR:

extra-b-test-4-gpu-b200 (0) → test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py: AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size' (exit -9). It failed on both the original run and the rerun (persistent, not a flake).

Why it's main-inherited, not this PR's: after merging current main, this branch's production code (python/sglang/srt/) is byte-identical to main (git diff upstream/main -- python/sglang/srt/ is empty), and this PR touches no LoRA / MoE code — it only adds tests + scripted-runtime harness. So the FusedMoEWithLoRA.hidden_size regression lives in main's LoRA+MoE path and reproduces independently of this PR. (I have not dispatched a fresh tip-of-main data point for this specific test; happy to if a cross-branch confirmation is wanted.)

fzyzcjy · 2026-06-08T12:26:17Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Root-caused the one remaining CI failure and confirmed it is a pre-existing main regression, not this PR. Please push back if any conclusion is off.

Root cause of the sole remaining red lane extra-b-test-4-gpu-b200 (0) (test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py):

File ".../sglang/srt/models/gpt_oss.py", line 294, in forward_normal
    hidden_dim_unpadded = self.experts.hidden_size
AttributeError: 'FusedMoEWithLoRA' object has no attribute 'hidden_size'

When LoRA is enabled, self.experts is wrapped by FusedMoEWithLoRA (srt/lora/layers.py), which does not expose hidden_size (no attribute, no __getattr__ proxy to base_layer). The bare self.experts.hidden_size access in gpt_oss.py was added by #27063 ("[AMD] Optimize gpt-oss-120B performance") and does not account for the LoRA wrapper, so gpt-oss + MoE-LoRA crashes during CUDA-graph capture (exit -9).

This is not introduced by this PR:

After merging current main, this branch's python/sglang/srt/ is byte-identical to main (git diff upstream/main -- python/sglang/srt/ is empty), and this PR touches neither models/gpt_oss.py nor lora/layers.py — it only adds tests + the scripted-runtime harness.
The failing line is still present on tip-of-main (1ff7c627cd), and there are no commits to gpt_oss.py/lora/layers.py between this branch's merge base and tip-of-main — i.e. main itself is still broken on this path, so re-merging main would not clear this lane. A search of open PRs found none fixing this specific gpt_oss × FusedMoEWithLoRA.hidden_size regression.

Minimal fix (separate, on main code — out of scope for this test-only PR): expose hidden_size on the LoRA MoE wrapper (proxy to base_layer), or have gpt_oss.py read getattr(self.experts, "hidden_size", None) or self.experts.base_layer.hidden_size.

Everything else on this PR is green (all infra/perf flakes cleared on rerun).

…time_and_chunked_testing

fzyzcjy · 2026-06-09T00:05:34Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Triage of the current CI round after re-merging latest main (which picked up the gpt-oss LoRA fix). Classifying each red lane; none are caused by this PR. Please push back if any conclusion is off.

Status for head 17f139b55c (CI still finishing — ~19 lanes running). The earlier blocker is resolved: the FusedMoEWithLoRA.hidden_size regression is gone (fixed on main by #27528, pulled in via this merge), and this branch's python/sglang/srt/ remains byte-identical to main (test-only). The CPU-registered test_mixed_prefix_gsm8k.py passes.

All remaining red lanes are infra / borderline-flake, in code this PR does not touch:

b200 lanes (extra-b/base-b/base-c …-4-gpu-b200, …-deepep-…-b200) — two distinct non-PR causes observed:
- dirty-runner GPU-memory pre-check abort: ERROR: memory >=10%: GPU 0 (63%) … FAIL — Aborting CI (dies before any test runs);
- borderline accuracy flake: test/registered/spec/eagle/test_eagle_infer_beta_dp_attention.py::test_a_gsm8k → AssertionError: np.float64(0.62) not greater than 0.62 (score landed exactly at the >0.62 threshold; spec-decode/DP-attn path, untouched here).
CPU lanes — test/registered/model_loading/test_external_models.py exit -9 (OOM loading large external checkpoints, e.g. Kimi-K2-NVFP4).
base-c-…-8-gpu-h20 — chronic H20 machine lane, ignored per CI policy.
large 1-/2-gpu base-b lanes — same infra-flake flavor (and on Add mixed-prefix gsm8k eval and its CPU unit test #27502 they fan out via fast-fail since that PR has no bypass-fastfail).

Plan: not re-running mid-round (would cancel in-flight lanes). Once the round settles I'll re-run the failed jobs; the b200 lanes additionally need the runner fleet to be clean to pass. Will keep watching.

fzyzcjy · 2026-06-09T04:04:55Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. Re-running the one remaining red lane (a perf-threshold flake) to confirm it's a flake. Please push back if any conclusion is off.

After re-running the failed jobs, every infra/b200 lane went green — the only remaining red is a perf-threshold flake, not this PR:

base-c-test-8-gpu-h200 (0) → test/registered/models_e2e/test_dsa_glm5_tp_mtp.py: AssertionError: 140.9 not greater than 150 (throughput ~141 vs the strict >150 threshold). DeepSeek-style glm5 TP+MTP throughput test; this branch is test-only (python/sglang/srt/ is byte-identical to main) and touches no model/perf code, so this is a shared-runner perf blip, not a regression here.

Re-running this single file to confirm.

fzyzcjy · 2026-06-09T04:04:57Z

/rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py

github-actions · 2026-06-09T04:05:21Z

Results for /rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py:

🚀 8-gpu-h200 (1 test): ✅ View workflow run

cd test/ && python3 registered/models_e2e/test_dsa_glm5_tp_mtp.py

fzyzcjy · 2026-06-09T06:04:43Z

🤖 Posted autonomously by Claude Code acting on the user's behalf. The single-file re-run of the last red lane passed, confirming it was a flake. Please push back if any conclusion is off.

Update: the /rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py dispatch passed (Rerun Test workflow → success) — so the lone remaining red (base-c-test-8-gpu-h200 (0), dsa_glm5 throughput 141 vs >150) is confirmed a perf-threshold flake, not a regression here. The rollup's base-c-8-gpu-h200 (0) check is just the stale result from the original round (the /rerun-test runs as a separate workflow and doesn't overwrite that check); it'll clear on a normal failed-job rerun.

Net: this PR is effectively green — every one of the ~18 lanes that went red across this round was infra (504-setup wave, b200 dirty-runner GPU-mem pre-check, CPU-OOM on external-model loads, chronic H20) or a perf-threshold flake (dsa_glm5, eagle-dp-attn), all of which passed on re-run. This branch's production code (python/sglang/srt/) is byte-identical to main, so none of these are caused by this PR.

…e_and_chunked_testing

fzyzcjy · 2026-06-09T12:19:57Z

fzyzcjy added 30 commits May 31, 2026 22:22

Apply black to scripted-runtime core tests

5053d8a

Enable kv-canary only for non-PP scripted runtime servers

6aed448

The kv-canary SingleForwardManager assumes one forward pass at a time; pipeline parallelism keeps several micro-batches in flight and breaks its phase checker. Enable the canary only when pp_size == 1.

Add per-request ignore_eos to scripted runtime start_req

06b0716

Lets a script keep a request resident for a fixed number of decode steps regardless of EOS, which scenarios that need a steady KV-cache occupant rely on. Defaults to False so existing scripts are unchanged.

Apply black formatting to scripted ctx api

b04c139

Revert "Assert scheduler flush_cache result in scripted flush_cache"

3703d72

This reverts commit b3df1f2.

Use ForwardMode.name.lower() for batch record mode

fd20c46

ForwardMode is an IntEnum, so .name.lower() is simpler than the is_*() chain and also captures the modes the chain collapsed to 'unknown' (target_verify, split_prefill, dllm_extend, ...).

Record batch mode as None instead of 'unknown' when forward_mode is u…

f8e6bad

…nset mode is now Optional[str].

Merge latest main (via #27502) into this branch — picks up #27512 mim…

c1d8ff1

…o-v2 multimodal-spec fix

fzyzcjy added 2 commits June 8, 2026 15:52

Merge branch 'main-upstream' into tom/mixed_prefix_gsm8k_eval

8c8f08d

Merge tom/mixed_prefix_gsm8k_eval (latest main) into tom/scripted_run…

f240a51

…time_and_chunked_testing

amd-bot mentioned this pull request Jun 8, 2026

[CI Monitor] Daily Report - 2026-06-08 bingxche/sglang-ci-bot#97

Open

fzyzcjy added 2 commits June 9, 2026 07:17

Merge branch 'main-upstream' into tom/mixed_prefix_gsm8k_eval

9b01721

Merge tom/mixed_prefix_gsm8k_eval (latest main) into tom/scripted_run…

17f139b

…time_and_chunked_testing

fzyzcjy changed the title ~~Add abundant testing for chunked prefill~~ Add more testing for chunked prefill Jun 9, 2026

Base automatically changed from tom/mixed_prefix_gsm8k_eval to main June 9, 2026 12:17

Merge remote-tracking branch 'upstream/main' into tom/scripted_runtim…

6e4b4dd

…e_and_chunked_testing

fzyzcjy merged commit 1368717 into main Jun 9, 2026
62 of 100 checks passed

fzyzcjy deleted the tom/scripted_runtime_and_chunked_testing branch June 9, 2026 12:19

amd-bot mentioned this pull request Jun 10, 2026

[CI Monitor] Daily Report - 2026-06-10 bingxche/sglang-ci-bot#99

Open

fzyzcjy mentioned this pull request Jun 13, 2026

DO NOT MERGE - CI sandbox for stateless scheduler b temp run #27667

Open

Conversation

fzyzcjy commented Jun 7, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI States

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

PR-introduced — all fixed

Not PR-introduced — infra/perf, only surfaced after merging latest main

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 8, 2026

Uh oh!

fzyzcjy commented Jun 9, 2026

Uh oh!

fzyzcjy commented Jun 9, 2026

Uh oh!

fzyzcjy commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fzyzcjy commented Jun 9, 2026

Uh oh!

Uh oh!

fzyzcjy commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fzyzcjy commented Jun 7, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented Jun 9, 2026 •

edited

Loading