DO NOT MERGE - CI sandbox for stateless scheduler b temp run #27667
fzyzcjy wants to merge 322 commits into
User-requested cleanup: retract_all's only meaningful caller is pause_generation(retract). Inline at the call site. UTs that import retract_all will break — accepted per user direction. Part of waiting_queue refactor chain.
The original code had the if/else order 'real-sample first / middle-chunk second'. The earlier refactor inverted it into 'middle-first + continue, then assert + real-sample', which introduced structural churn that diffed nearly the entire function body.
Invert back to the original branch order, replacing only the condition:
if req.pending_middle_outputs <= 0: → if not mode.is_intermediate():
...else: req.pending_middle_outputs -= 1 → ...else: (line removed)
Plus the for-loop zip gets a 'mode' element. Imports drop the now-unused OutputProcessMode.
Net diff in batch_result_processor and disaggregation/prefill is ~8 condition-level lines per loop instead of a full-block rewrite. Behavior unchanged.
Combines running_batch.reqs + chunked_reqs() into a single retract iteration. Each req goes through release_req + _deactivate + _add_request_to_queue uniformly. The previous explicit chunked-orphan release loop (~25 lines) is now subsumed by the combined iteration. Releases the chunked-resume reqs via the same release_req as running reqs — accepts the main-upstream pre-existing latent bug where disagg PREFILL chunked won't trigger sender.abort (see C14 abort_request for the same trade-off). Part of waiting_queue refactor chain.
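A minimal sketch of the combined retract iteration described above. `FakeReq`, `FakeScheduler`, `retract_everything`, and the `mid_chunk` / `kv_released` fields are hypothetical stand-ins for illustration, not the real scheduler classes; the three steps in the loop body only mimic release_req, _deactivate, and _add_request_to_queue.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeReq:
    # Hypothetical stand-in; only the fields this sketch needs.
    rid: str
    mid_chunk: bool = False    # True for a chunked req between two chunks
    kv_released: bool = False


@dataclass
class FakeScheduler:
    running: List[FakeReq] = field(default_factory=list)
    active: List[FakeReq] = field(default_factory=list)
    waiting_queue: List[FakeReq] = field(default_factory=list)

    def chunked_reqs(self) -> List[FakeReq]:
        # Assumption: chunked reqs are active but not in the running batch.
        return [r for r in self.active if r.mid_chunk and r not in self.running]

    def retract_everything(self) -> None:
        # One loop and one release path for both kinds of req.
        for req in [*self.running, *self.chunked_reqs()]:
            req.kv_released = True          # stands in for release_req
            self.active.remove(req)         # stands in for _deactivate
            self.waiting_queue.append(req)  # stands in for _add_request_to_queue
        self.running.clear()


a, b = FakeReq("a"), FakeReq("b")
c = FakeReq("c", mid_chunk=True)
sched = FakeScheduler(running=[a, b], active=[a, b, c])
sched.retract_everything()
```

Because the chunked req goes through the same loop body, there is no separate orphan-release loop left to keep in sync.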
Three filter_batch(only_decode_ready=True) call sites were purely defensive: at each location an upstream filter or merge step already guarantees no intermediate-mode reqs remain. The filter was a silent no-op in the common case and a silent fix-up if the invariant was violated. Replace with explicit asserts so future invariant violations surface loudly rather than being papered over.
Sites:
- disaggregation/decode.py: prebuilt batch should not carry chunked reqs (chunked is prefill-side only).
- scheduler.py is_prefill_only branch: last_batch filter+merge above already drops intermediate-mode reqs.
- scheduler.py mix_with_running prep: same invariant; split functional v1_spec_info_filtered from defensive only_decode_ready (assert).
Functional filter_batch(only_decode_ready=True) call sites (disagg/prefill.py, scheduler.py last_batch merge, etc.) are unchanged: those actually drop intermediate-mode reqs as part of the merge flow.
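To illustrate the trade-off between a defensive filter and an explicit assert, here is a small sketch. `FakeReq`, `defensive_filter`, and `assert_decode_ready` are hypothetical names invented for this example; they do not reproduce filter_batch or any real call site.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeReq:
    # Hypothetical stand-in for a scheduler request.
    rid: str
    is_intermediate: bool = False


def defensive_filter(reqs: List[FakeReq]) -> List[FakeReq]:
    # Old style: silently drops intermediate-mode reqs, hiding any violation.
    return [r for r in reqs if not r.is_intermediate]


def assert_decode_ready(reqs: List[FakeReq]) -> List[FakeReq]:
    # New style: the invariant is stated, and a violation fails loudly.
    bad = [r.rid for r in reqs if r.is_intermediate]
    assert not bad, f"intermediate-mode reqs leaked into batch: {bad}"
    return reqs


clean = [FakeReq("a"), FakeReq("b")]
dirty = clean + [FakeReq("c", is_intermediate=True)]

filtered = defensive_filter(dirty)  # violation disappears without a trace
try:
    assert_decode_ready(dirty)
    violation_reported = False
except AssertionError:
    violation_reported = True
```

In the common case both variants return the same batch; they differ only when the upstream invariant is already broken.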
…shed_req
The guarded scenario ('PP+chunked: same Req in multiple in-flight mbs[*] batches; last chunk slot releases first, sibling slot's pending result re-releases here') no longer reaches this code path after the OutputProcessMode refactor:
- middle-chunk results route to process_batch_result_prefill's ELSE branch (_apply_chunked_prefill_logprobs), which never touches req_pool_idx
- last-chunk + decode results route through the IF branch / _handle_finished_req exactly once per req per finish
- line 221 'if req.finished() or req.is_retracted: continue' provides a separate defense against stale-finished entries
If a same-req-in-two-running-batches bug regresses, release_kv_cache's own assert (tree_cache.supports_mamba()) will bomb loudly — which is preferable to the previous silent skip.
Replace the implicit "reset host_hit_length to short-circuit
init_load_back" contract with an explicit local variable that is
zeroed for reuse admissions:
effective_host_hit_length = 0 if is_resume else req.host_hit_length
Apply effective_host_hit_length to both consumers in add_one_req:
the budget-control subtraction (real_input_tokens) and the
init_load_back predicate.
Delete the prepare_for_extend `req.host_hit_length = 0` reset added
in d7fa48b. That reset was overloading host_hit_length — a
match_prefix output — as a trigger flag for init_load_back, and
required all post-admission code paths (including retract+re-admit)
to keep it reset. The local-variable approach removes that implicit
cross-function contract entirely, so req.host_hit_length recovers
single-writer semantics (written only by init_next_round_input).
Brings the init_load_back skip in line with the other reuse-vs-fresh
differences in add_one_req (_req_inc_lock_ref and budget_prefix),
which already use explicit `is_resume` branches.
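A hedged sketch of the local-variable approach from this commit message. `FakeReq` and `admission_inputs` are hypothetical, and the exact budget formula (extend length minus host hit length) and load-back predicate (host hit length greater than zero) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeReq:
    # Hypothetical stand-in; host_hit_length is a prefix-matching output.
    host_hit_length: int


def admission_inputs(req: FakeReq, is_resume: bool, extend_len: int) -> Tuple[int, bool]:
    # Zeroed for reuse admissions; req.host_hit_length itself is never written.
    effective_host_hit_length = 0 if is_resume else req.host_hit_length
    # Assumed shape of the two consumers named in the commit message.
    real_input_tokens = extend_len - effective_host_hit_length
    should_load_back = effective_host_hit_length > 0
    return real_input_tokens, should_load_back


req = FakeReq(host_hit_length=8)
fresh = admission_inputs(req, is_resume=False, extend_len=32)
resumed = admission_inputs(req, is_resume=True, extend_len=32)
```

The field keeps its matched value after both calls, which is the single-writer property the message refers to.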
…er split
Paste sglang-dev-d (main-upstream) old `add_chunked_req` back into `schedule_policy.py` under the temporary name `_add_chunked_req_restored` and add a top-of-`add_one_req` guard that routes chunked-resume reqs to it. This is a transitional state so the next commit can rename / dispatch without re-deriving behavior, in preparation for splitting `add_one_req` into `add_first_chunk_req` and `add_non_first_chunk_req`.
…with scheduler-side dispatch
Rename `PrefillAdder.add_one_req` → `add_first_chunk_req` and the transitional `_add_chunked_req_restored` (introduced in the previous commit) → `add_non_first_chunk_req`. Drop the in-callee dispatch guard and push the chunked-resume vs. fresh decision to the scheduler:
- `scheduler.py` chunked-resume admission → `add_non_first_chunk_req(req)`
- `scheduler.py` waiting_queue loop → `add_first_chunk_req(req, ...)`
- `dllm/mixin/scheduler.py` → `add_first_chunk_req(req, ...)`
- Tests in `test_prefill_adder.py` are fresh-path, so call `add_first_chunk_req`
Also rewrite the two scheduler comment blocks that referenced the old in-callee `is_resume` flag, and update stale cross-references in `mem_cache/*` that point at the resume reuse path.
Scheduler-side dispatch (previous commit) guarantees chunked-resume reqs never enter `add_first_chunk_req`, so the in-callee `is_resume` flag and its derived `effective_host_hit_length` / `budget_prefix` branches are now dead code. Remove all three and restore the function to fresh-only behavior: `host_hit_length` is the raw `req.host_hit_length`, `_req_inc_lock_ref(req)` is unconditional on every admission path, and `_update_prefill_budget` uses the actual `prefix_len`. Add a defensive assert at entry so a future scheduler-side dispatch bug crashes loudly instead of silently double-locking.
Minimal adaptations on top of the main-upstream `add_chunked_req` restoration:
- Add type annotation `-> AddReqResult`.
- Entry assert documents the scheduler-side dispatch invariant (chunked-resume only, never DLLM).
- Drop the unreachable DLLM branch (assert forbids it).
- Replace `return req if truncated else None` with `self.budget_state()`, and the SWA early `return req` with `AddReqResult.NO_TOKEN`. The scheduler doesn't read this return value today, but symmetry with `add_first_chunk_req` and consistent budget feedback is preferable to a `Req | None` ad-hoc shape.
- Append `req.set_scheduled_extend_len(...)` so dev-f's derived `has_pending_chunk` view sees the new admit on the next round.
Behavior otherwise stays strictly aligned with main-upstream `add_chunked_req`: no `truncation_align_size`, no `_swa_budget_for_req`, no `_lock_node` / `_req_inc_lock_ref` / `init_load_back` / `host_hit_length`. The main-upstream-era latent bug where deterministic inference + flashinfer + multi-chunk prefill drifts off alignment on continuation chunks is consciously preserved here; if we ever want to fix it, do so in a separate evaluated commit.
… position
Place add_non_first_chunk_req between add_dllm_staging_req and _lock_node to mirror main-upstream's layout (where add_chunked_req sat). Method ordering now aligns 1:1 with main-upstream's, minimizing diff.
The fill_ids array on Req was a copy of (origin_input_ids + output_ids [+ DLLM mask block])[:fill_len] -- it carried no information beyond the integer fill_len. Drop the array; store only fill_len. Token-content callers now go through build_fill_token_ids() / build_full_token_ids() which rebuild the sequence on demand. Length-only callers use req.fill_len directly.
Removes the dual-phase (full / truncated) state machine that fill_ids carried, dissolving the in-iter "full phase" that the SWA gate Scheduler._chunked_req_scheduled_last_iter was protecting against (the gate itself is left in place as a no-op safety net and can be cleaned up separately).
Also lets the DLLM mask-block in-place write at dllm/mixin/scheduler.py disappear: the generated tokens are already extended into output_ids on the next line, so rebuilding the full token sequence on the next iter naturally produces the same array.
API additions on Req:
- get_full_len() -> int: origin + output + (mask if DLLM)
- build_full_token_ids() -> array: the array form, O(L)
- build_fill_token_ids() -> array: build_full_token_ids()[:fill_len]
reset_for_retract now also clears fill_len so a retracted req has a clean integer state before re-admission.
The fill_len <-> kv_committed_len relationship is unchanged: they agree at prefill chunk boundaries and diverge during decode steps, just like before.
Restore if/else dllm structure (dllm branch is dead at runtime: assert + scheduler dispatch both rule out DLLM reqs) so body indentation matches main-upstream's 12-space else block. Saves ~13 lines of pure-indentation diff. Compress comment+assert to 2 lines. Drop stale return comment. Resulting diff against main-upstream's add_chunked_req body is now only the 4 necessary dev-f adaptations: signature + return annotation, entry assert + comment, return enum (NO_TOKEN / budget_state), and the trailing set_scheduled_extend_len.
get_full_len -> get_full_untruncated_fill_len
build_full_token_ids -> build_full_untruncated_fill_ids
The 'full untruncated' qualifier makes the contrast with the truncated fill_len/build_fill_token_ids() pair explicit at every call site.
Previously, _init_fill_ids_for_dllm set fill_len to the full
untruncated length (origin + output + block_size) at the top of
init_next_round_input, only to have admission immediately truncate
it back to prefix + block_size. That left fill_len with a transient
'full phase' on the DLLM path while the non-DLLM path was already
single-phase committed-truncated.
Move the phase-detection gate in determine_dllm_phase to use
get_full_untruncated_fill_len() — semantically that's what it's
asking ('is the full sequence long enough to inspect one block?')
— so we no longer need to write fill_len = full at the entry of
init_next_round_input. fill_len now uniformly means 'committed
truncated length' across DLLM and non-DLLM.
Switching strategy to Design E (stored full_untruncated_fill_ids + fill_len marker). The derive-only approach is correct but its 'pure-derived fill_ids' loses some debuggability and stores the mask block implicitly. Design E keeps the array stored, splits the ambiguous fill_ids field into (full_untruncated_fill_ids, fill_len) so admission only updates the integer marker.
Splits the ambiguous fill_ids field into two:
- full_untruncated_fill_ids: array stored. The full sequence
(origin + output, plus DLLM mask block). Rebuilt at the top of
each init_next_round_input; not mutated by admission.
- fill_len: int. Truncation marker. Admission writes only this.
A new method Req.get_fill_ids() returns
full_untruncated_fill_ids[:fill_len] — equivalent to the old fill_ids
in committed-truncated form.
This removes the dual-phase semantics of the old fill_ids field
('sometimes full, sometimes truncated') without touching DLLM's
mask-block container invariant. The in-place mask write at
dllm/mixin/scheduler.py is preserved (operates on the new array
with an explicit fill_len-based index range).
Eliminates the in-iter mutation that the SWA gate
_chunked_req_scheduled_last_iter was protecting against; that gate
can be removed as a follow-up.
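The split described above can be sketched roughly as follows. `FakeReq` and its `admit` method are hypothetical; only the field names full_untruncated_fill_ids and fill_len and the accessor get_fill_ids() come from the commit message. The use of `array.array` and the clamping in `admit` are assumptions, and the sketch does not model any other place that writes fill_len.

```python
from array import array
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeReq:
    # Hypothetical stand-in for the request object.
    origin_input_ids: List[int]
    output_ids: List[int] = field(default_factory=list)
    full_untruncated_fill_ids: array = field(default_factory=lambda: array("q"))
    fill_len: int = 0

    def init_next_round_input(self) -> None:
        # Rebuild the full sequence at the top of each round.
        self.full_untruncated_fill_ids = array(
            "q", [*self.origin_input_ids, *self.output_ids]
        )

    def admit(self, prefix_len: int, chunk_len: int) -> None:
        # Admission updates only the integer marker, never the array.
        total = len(self.full_untruncated_fill_ids)
        self.fill_len = min(prefix_len + chunk_len, total)

    def get_fill_ids(self) -> array:
        # Committed-truncated view of the stored sequence (a copy).
        return self.full_untruncated_fill_ids[: self.fill_len]


req = FakeReq(origin_input_ids=[1, 2, 3, 4, 5, 6])
req.init_next_round_input()
req.admit(prefix_len=0, chunk_len=4)
first_chunk = req.get_fill_ids().tolist()
req.admit(prefix_len=4, chunk_len=4)
second_chunk = req.get_fill_ids().tolist()
```

Each admission moves the marker while the stored array stays at its full length, so there is no point where one field means two different things.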
The Design E refactor added defensive clears of full_untruncated_fill_ids and fill_len in reset_for_retract, but OLD's reset_for_retract never touched fill_ids. Removing the clears restores byte-equivalence to OLD on the retract -> next-iter _init_fill_ids_for_dllm path, where the first-iter check (if not self.fill_ids / if self.fill_len == 0) controls dllm_block_offset reset vs advance. The defensive clear made sense semantically (a retracted req has no committed KV, so fill_len == 0 reads naturally), but PR1's contract is strict equivalence. Re-introduce the clear in PR2 once fill_len's semantics are pinned down to 'committed truncated length only'.
Three remaining reads of req.full_untruncated_fill_ids — in _prefetch_kvcache, init_next_round_input's match_prefix key build, and determine_dllm_phase — were byte-equivalent to the OLD req.fill_ids reads only because fill_len equals len(full) at those moments under PR1's contract. Replace them with the literal mechanical translation req.get_fill_ids()[...] so the equivalence no longer relies on that 'fill_len happens to be full here' observation. The DLLM mask in-place write at dllm/mixin/scheduler.py keeps full_untruncated_fill_ids[fill_len - new_tokens : fill_len] = ... because in-place assignment cannot go through get_fill_ids() (a slice returns a new array, not a view onto the underlying storage). Cost is one extra array allocation per call at each of the three sites. None are hot enough to matter (init runs once per req per iter; the slice is the same O(L) operation OLD already did, just via an intermediate get_fill_ids() step).
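The point about in-place assignment can be checked with a short standalone example. It assumes the ids are held in a Python `array.array` (a plain list behaves the same way here); the values and the variable names are made up for illustration.

```python
from array import array

full = array("q", [10, 11, 12, 13, 14, 15])
fill_len, new_tokens = 5, 2

# Slicing yields a new array, so writing through the slice is lost.
sliced = full[:fill_len]
sliced[fill_len - new_tokens : fill_len] = array("q", [0, 0])
after_slice_write = full.tolist()

# Slice assignment on the stored array itself mutates it in place.
full[fill_len - new_tokens : fill_len] = array("q", [77, 88])
after_direct_write = full.tolist()
```

This is why an accessor that returns a slice is fine for readers but cannot serve as the target of the mask write.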
After PR #26637 split Req.fill_ids into (full_untruncated_fill_ids, fill_len), fill_len still inherited the OLD field's dual phase: at init_next_round_input entry it was written to len(full_untruncated), then admission overwrote it with prefix + trunc. This commit makes fill_len mean 'committed truncated length' at every point in the request lifecycle.
Changes:
- init_next_round_input: drop the non-DLLM fill_len write; read input_len from len(full_untruncated_fill_ids) directly.
- _init_fill_ids_for_dllm: drop the fill_len write; only update full_untruncated_fill_ids and dllm_block_offset.
- determine_dllm_phase: gate uses len(full_untruncated_fill_ids) instead of fill_len. Semantically that gate asks 'is the full sequence long enough to inspect one block', so this is the right read.
- set_extend_input_len: logprob_start_len default reads len(self.full_untruncated_fill_ids) instead of self.fill_len.
- add_one_req post-init_load_back: uses len(req.full_untruncated_fill_ids) when recomputing extend_input_len.
- reset_for_retract: re-add self.fill_len = 0 (PR #26637 removed it to match OLD; restoring it now that fill_len's semantics demand a retracted req have committed length 0).
Behavior change: DLLM reqs retracted mid-decode now have dllm_block_offset reset to 0 on re-admission (since fill_len == 0 triggers the first-iter branch in _init_fill_ids_for_dllm). The OLD code's offset-advance-on-retract was inconsistent with the prefix-from-zero state and only worked because _update_block_offset clamped offset to prefix_len; the new behavior is more direct.
PR1 (now ahead) translated three reads of req.fill_ids literally to req.get_fill_ids() because fill_len happened to equal len(full) at those call moments under PR1's contract.
PR2 drops fill_len's untruncated phase. At the same three sites fill_len is now stale (previous admission's value, or 0 for a fresh request), so req.get_fill_ids() = full[:fill_len] would return an empty or partial array that doesn't match what these readers want.
Revert these three reads to req.full_untruncated_fill_ids, which states explicitly that 'we always want the full sequence here regardless of fill_len's committed value':
- scheduler.py _prefetch_kvcache: full sequence for hicache storage prefetch.
- schedule_batch.py init_next_round_input match_prefix key: full sequence to match against the radix tree.
- dllm/mixin/req.py determine_dllm_phase input_block: full sequence to inspect one block's mask content.
The diff between this commit and the PR1 reads is exactly where PR2's semantic shift becomes visible to call sites.
Removes the 'branches: [main]' filter from pull_request triggers on:
- pr-test.yml (base CUDA CI)
- pr-test-extra.yml (extra CUDA CI, label-gated)
- lint.yml (pull_request trigger only; keeps push trigger main-only)
With this filter in place, GitHub Actions ignores chain PRs entirely: when PR B has base = PR A's branch (instead of main), the PR test workflow never fires, so chain PRs can't get CI until each link merges. With the filter removed, PRs against any base run the PR test workflow, and the chain can be validated end-to-end without linearizing the merge order.
Hardware-specific pr-test-* workflows (amd, npu, xpu, musa, etc.) intentionally keep their main-only filter: they're label-gated for specific hardware concerns and don't need to fire on every chain PR.
This reverts commit 3416dea.
/tag-and-rerun-ci extra
/tag-and-rerun-ci
Job: https://github.com/sgl-project/sglang/actions/runs/27410818109/job/81011557989 (head Fingerprint: Classification: infra — the runner ran out of disk ( Side note: manual GPU validation of the hot spec area passed on both heads — |
Job: https://github.com/sgl-project/sglang/actions/runs/27411555797/job/81014678665 (runner Fingerprint (identical to the base-b(4) one): Classification: infra — out-of-disk on the novita H100 host family (second occurrence today, different runner). Not related to this branch. Will |
Job: https://github.com/sgl-project/sglang/actions/runs/27411555797/job/81014678680 (runner Fingerprint: Classification: suspected flake / loaded runner, needs one rerun to rule out a real spec perf regression. Margin is ~3-10% over the threshold, on the same novita H100 host family that produced two disk-full failures today. But since this branch touches the spec scheduling path, I'm posting a |
/rerun-test test/registered/perf/test_bench_serving_1gpu_large.py
Results for 🚀 |
Rerun result (https://github.com/sgl-project/sglang/actions/runs/27414605710, runner Two failures in a row rules out a one-off flake, but the latency variance (927 → 996 → 1439 → 3358 across four measurements on different novita H100 hosts) looks more like a degraded host fleet (same family produced two disk-full failures today) than a deterministic regression. To decide: dispatched |
/tag-and-rerun-ci extra
Cross-branch table for
Verdict: infra. The novita H100 host family is degraded today — host3 is out of disk (also killed base-b(4) and extra-a(1) earlier), and the latency numbers on host1/host5 are wildly unstable (927 → 3358 ms across runs), consistent with oversubscribed hosts rather than a deterministic regression in this branch. The branch's spec path also passed extensive manual GPU validation today (eagle stress/topk, kv_canary, PP gsm8k 0.775). Plan: keep this classified as infra; retry the perf shards once the fleet recovers rather than burning more reruns now. Flagging for the maintainers' attention: the |
This reverts commit 8a04683.
The stateless scheduler replaced the single Scheduler.chunked_req slot with partially_extended_reqs(). The manual tests from the chunked-prefill test suite still read the old attribute and failed with AttributeError (15 failures in test/manual/chunked_prefill). Add a chunked_req_of() helper to scripted_runtime_chunked_helpers preserving the at-most-one invariant and switch all readers to it.
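One plausible shape for such a helper is sketched below. The names chunked_req_of and partially_extended_reqs() come from the commit message; the body, the `FakeScheduler` class, and the choice to return `None` when nothing is mid-chunk are assumptions, not the actual helper in scripted_runtime_chunked_helpers.

```python
from typing import List, Optional


class FakeScheduler:
    # Hypothetical stand-in exposing only the method the helper needs.
    def __init__(self, partial: List[str]) -> None:
        self._partial = partial

    def partially_extended_reqs(self) -> List[str]:
        return list(self._partial)


def chunked_req_of(scheduler: FakeScheduler) -> Optional[str]:
    # Mirrors the old single-slot attribute: zero or one chunked req.
    reqs = scheduler.partially_extended_reqs()
    assert len(reqs) <= 1, f"expected at most one chunked req, got {len(reqs)}"
    return reqs[0] if reqs else None


none_pending = chunked_req_of(FakeScheduler([]))
one_pending = chunked_req_of(FakeScheduler(["req-0"]))
```

Keeping the assert inside the helper means a test that accidentally produces two mid-chunk reqs fails at the read site instead of silently picking one.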
The stateless scheduler removed the Req.inflight_middle_chunks 0/1 latch; the equivalent signal is req.phase is ReqPhase.EXTEND_NON_LAST (set on middle-chunk admission, cleared on last-chunk admission, and reset by reset_for_retract). Add inflight_middle_chunks_of() to scripted_runtime_chunked_helpers and switch all manual-test readers.
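A minimal sketch of how the old 0/1 latch could be derived from the phase. ReqPhase.EXTEND_NON_LAST and inflight_middle_chunks_of() are named in the commit message; the other enum members, the `FakeReq` class, and the function body are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ReqPhase(Enum):
    # Only EXTEND_NON_LAST is taken from the text; the rest are placeholders.
    EXTEND_NON_LAST = auto()
    EXTEND_LAST = auto()
    DECODE = auto()


@dataclass
class FakeReq:
    # Hypothetical stand-in for the request object.
    phase: Optional[ReqPhase] = None


def inflight_middle_chunks_of(req: FakeReq) -> int:
    # Reproduces the removed latch as a derived 0/1 value.
    return 1 if req.phase is ReqPhase.EXTEND_NON_LAST else 0


middle = inflight_middle_chunks_of(FakeReq(ReqPhase.EXTEND_NON_LAST))
last = inflight_middle_chunks_of(FakeReq(ReqPhase.EXTEND_LAST))
retracted = inflight_middle_chunks_of(FakeReq(None))
```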
ServerArgs dropped disable_piecewise_cuda_graph in the cuda-graph runner refactor (2495c02); the current equivalent for these tests is disable_prefill_cuda_graph. All TestSWA* classes failed setUpClass with TypeError on server launch.
The stateless scheduler replaced extend_input_len with the extend_range Range; add extend_input_len_of() to scripted_runtime_chunked_helpers and switch the special_case readers. Three TestSpecialCase tests failed with AttributeError on the removed field.
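The helper might look roughly like the sketch below. extend_input_len_of() and extend_range are named in the commit message, and an `end` attribute is implied elsewhere in this PR; the `Range` dataclass, its `start` field, and returning 0 when extend_range is None are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    # Hypothetical half-open range [start, end); `start` is an assumption.
    start: int
    end: int


@dataclass
class FakeReq:
    # Hypothetical stand-in; extend_range is None outside extend phases.
    extend_range: Optional[Range] = None


def extend_input_len_of(req: FakeReq) -> int:
    # Derive the old integer field from the range.
    if req.extend_range is None:
        return 0
    return req.extend_range.end - req.extend_range.start


chunk_len = extend_input_len_of(FakeReq(Range(start=128, end=192)))
decode_len = extend_input_len_of(FakeReq())
```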
The stateless scheduler replaced the fill_ids field with the get_fill_ids() accessor (full untruncated ids cut at extend_range.end), which matches the prefix+extend invariant this test asserts.
Manual chunked-prefill test sweep (PR #27506 suite vs this branch)
Ran the entire
Result: all runnable tests PASS
Not run (environment-limited only):
5 stale-API fixes (test-only; the runtime was correct)
All initial failures were the manual tests lagging behind the stateless-scheduler refactor, not runtime bugs. Fixed in the test helper + call sites (commits in this PR):
CI on this branch is green modulo the chronic H20 lane (and its |
test_decoded_req_output_ids_do_not_extend_chunked_prefill_bound built a DECODE-phase req (extend_range=None) with accumulated output_ids but never passed it to any assertion, so the decode-req invariant the docstring claims was untested. Add an assertion that mirrors the real scheduler decode path: _compute_is_extend_intermediate(req, forward_mode=ForwardMode.DECODE) returns False via the is_decode() short-circuit, without reading extend_range.
test_retract_clears_running_batch put both reqs in running_batch with phase=None, so partially_extended_reqs() returned [] and the [*running_batch.reqs, *partially_extended_reqs()] retract path never exercised its second term. Dropping that term (re-introducing the mid-chunk KV leak the stateless rewrite fixed) left the test green. Add an EXTEND_NON_LAST req that lives only in active_reqs (not running_batch.reqs) and assert it is also released, deactivated, and re-queued.
/rerun-failed-ci

DO NOT MERGE: temporary CI sandbox for the tom/stateless_scheduler_b branch.
Forked from tom/stateless_scheduler_b as tom/stateless_scheduler_b_temp_run to get a fresh full-CI signal for the current stateless-scheduler rewrite. Purpose is CI signal only; this PR will be closed without merging.
CI States
Latest PR Test (Base): ❌ Run #27485902919
Latest PR Test (Extra): ❌ Run #27485902886