Add more testing for chunked prefill#27506
Conversation
Per review: a one-line passthrough should call the original object directly, and any wrapper that must exist should return the original type rather than invent a new one. Remove all thin ScriptedReqHandle field getters and the thin ScriptedContext readers (engine_stats, row_pool_used, *_rids, batch_size, get_chunked_req_rid, chunked_in_flight_count, lock_refs_snapshot); tests now read r.req.<field> / t._scheduler.<...> directly. Keep start_req kwargs, abort(handle) (an HTTP action), and list_active_reqs() (aggregates four scheduler structures and returns List[Req]).
list_active_reqs collapses to list(set(_get_all_reqs(ctx))) (Req is hashable by identity, so the same object across scheduler structures dedups). Make the ScriptedContextReqStarter.start_req ignore_eos/priority/dp_rank kwargs required so the internal plumbing carries no silent defaults; the public facade keeps its ergonomic defaults.
Per the line-count rule (wrap multi-line computations, inline 1-liners): kv_pages derives a page count from kv_allocated_len and scheduler.page_size, and engine_stats aggregates several pool counters into one dict -- both are real computations, not thin field reads, so they earn a wrapper. Re-add their registered-core tests.
Returning 0 when find_req_by_rid finds nothing conflated 'req released' with 'req present holding 0 pages', which would let a kv_pages==0 assertion pass even if the req were wrongly missing -- a silent assertion weakener. Read the field on the live req instead; the released-state check still holds because right after run_until_finished the req is still in last_batch with kv_allocated_len==0.
Replace the removed thin-wrapper calls with direct access to the original objects: handle field reads become r.req.<field> (output_ids, req_pool_idx, cached_tokens, finished_reason, ...) and the dropped context readers become direct t._scheduler.<...> expressions (running_batch rids, chunked_req rid, req_to_token_pool sizes; lock_refs_snapshot -> get_all_node_lock_refs). Kept wrappers (kv_pages, engine_stats, abort, list_active_reqs, start_req kwargs) are used unchanged. Released-state reads that follow a drain are made None-tolerant (assert r.req is None or ...). These files still reference the not-yet-implemented multi-line APIs (chunks_done, force_*, exhaust_*, ...) and do not run until those land.
With tie_word_embeddings, lm_head was tied to model.embed_tokens unconditionally. Under PP the embedding lives on the first rank, so the last rank (which computes logits) ended up with a PPMissingLayer head and crashed in _compute_lm_head. Mirror Qwen2: tie only when the model is on a single PP rank, otherwise build a real ParallelLMHead on the last rank and copy the tied embedding weight into it during load_weights.
The kv-canary SingleForwardManager assumes one forward pass at a time; pipeline parallelism keeps several micro-batches in flight and breaks its phase checker. Enable the canary only when pp_size == 1.
Under pipeline parallelism, micro-batches stay in flight after the running batch looks empty; flushing the cache then trips a radix-tree assertion when a late micro-batch releases a node from the flushed tree. Drive extra scheduler steps to drain the pipeline before flush_cache.
Lets a script keep a request resident for a fixed number of decode steps regardless of EOS, which scenarios that need a steady KV-cache occupant rely on. Defaults to False so existing scripts are unchanged.
start_req gains a prompt_token parameter so a script can issue a request whose prompt does not share a radix prefix with another req (needed to make a chunked req actually prefill instead of hitting a resident req's cache). get_all_node_lock_refs now handles SWARadixCache TreeNodes, which track full_lock_ref and swa_lock_ref separately instead of a single lock_ref.
The scenario never drove the KV pool tight enough to park the chunked req: the resident competitor and r both used token id 1, so r hit the radix prefix cache and prefilled almost nothing. Give r a distinct prompt token, keep the competitor resident via ignore_eos, and use enable_mixed_chunk so the competitor decodes alongside r's prefill chunks -- its decode growth exceeds its 0.7x admission reservation and pushes the full pool to 1.00 on r's final chunk, parking r (add_chunked_req early-return) until the competitor frees its KV. All assertions (observed_early_return, r.finished, no locked nodes after drain) are unchanged.
…lama The scripted-runtime PP tests ran Llama-3.2-1B-Instruct, whose tied word embeddings crashed under pipeline parallelism, which had forced a fix into production llama.py. Switch SMALL_MODEL to Qwen3-0.6B, which already handles tied embeddings under PP in qwen3.py, and revert the llama.py change.
…ng it Revert the earlier chunked=True at the prefill-completion cache site in batch_result_processor: that branch runs for every request finishing prefill (chunked final chunk and non-chunked single prefill alike), so forcing chunked=True suppressed the legitimate hit_count bump on the general prefill path, not just chunked self-inserts. A request inserts its prompt prefix twice -- via cache_unfinished_req when prefill completes and via cache_finished_req on finish -- while decode nodes are inserted once, so prompt nodes settle at hit_count 2 and decode nodes at 1, identically for chunked and non-chunked prefill. #18843's middle-chunk chunked=True still prevents the unbounded early-node inflation. Replace the old all==1 expectation with side-by-side chunked and non-chunked tests asserting the hit_count value set is exactly {1, 2}.
The scripted ctx.flush_cache() previously only confirmed the FlushCacheReqInput arrived at the scheduler; it never checked the flush actually happened. The scheduler's flush_cache() is a no-op returning success=False whenever it is not fully idle, so a flush request on a busy scheduler silently did nothing. Make flush_cache a generator that yields once -- the scheduler consumes the buffered FlushCacheReqInput during the same recv_requests() call, but only after the script yields control back -- and then asserts the cache is fully flushed: radix tree empty and both the KV allocator and the req_to_token pool restored to their fully-free baselines (captured at context init to sidestep page-rounding / padding-slot conventions). Add assert_flushed (default True) so the deliberately-busy flush-during-chunked test can opt out. Update all call sites to use `yield from t.flush_cache()`.
Classify radix nodes by cumulative token position instead of only checking the hit_count value set: nodes ending within the prompt (prefill) must each be hit exactly twice, nodes beyond it (decode) exactly once. This pins the count to the correct node role rather than merely asserting both values occur.
Record the flush outcome on SchedulerFlushWrapper.last_success and have the scripted ctx.flush_cache() assert that scheduler.flush_cache() returned True, rather than re-deriving flushed-ness from radix/pool state. The scheduler already computes the authoritative success bool (False whenever it is not fully idle); reading it directly is simpler and exact. The scripted helper clears last_success before posting and asserts it is True after the yield that lets the scheduler run the flush. Drop the pool-baseline capture in ScriptedContext.__init__.
This reverts commit b3df1f2.
Drop the assert-it-is-really-flushed logic from scripted ctx.flush_cache and restore it to its original fire-and-confirm form, reverting the generator signature, the pool-availability baselines on ScriptedContext, and the `yield from t.flush_cache()` call sites. The flush verification will be reworked separately.
Without enable_mixed_chunk and without an ignore_eos competitor, the add_chunked_req hybrid-SWA early-return is unreachable through ordinary request traffic: while a chunked req is being prefilled, prefill preempts decode (so a running req never grows the pool) and no second req can be admitted (the chunk budget is held), so nothing tightens rem_total_tokens mid-prefill. The previous design relied on exactly that mixed-chunk decode kept alive by ignore_eos. Reproduce the pool state directly instead: admit a single chunked req r on an empty pool, then drain the full-attention allocator from the script so r's next chunk cannot be admitted -> the real scheduler takes the early-return and parks r (chunked_req set, _chunked_req_scheduled_last_iter False). Releasing the reserved KV lets r resume; the test still asserts the stash gate left no radix lock refs behind. Add ScriptedContext.reserve_full_attention_kv / full_attention_available_size (context/pool.py) for the deterministic full-pool pressure.
…e) into start_req API work
Conflicts in scripted_runtime start_req, both sides extended the same signature:
- context/req_starter.py: keep theirs' prompt_token (input_ids = [prompt_token]*prompt_len)
and ours' priority/dp_rank payload passthrough; ignore_eos was added by both, kept once;
use the shared sampling_params var (equivalent to theirs' inline {max_new_tokens, ignore_eos}).
- context/api.py: union the public start_req kwargs (ignore_eos, priority, dp_rank, prompt_token)
in both the signature and the delegated call; ours' abort/list_active_reqs/engine_stats
auto-merged in untouched.
lock_refs (ScriptedReqHandle): the radix lock_ref held on the req's last_node
(0 once last_node is cleared on release) -- a genuine multi-line derivation,
no instrumentation. batch_composition (ScriptedContext): a pure read of the
scheduler returning {prefill, decode, chunked, running} rid lists, with the
chunked req excluded from prefill/decode so the subsets stay disjoint. Add
registered tests for both.
Rework the regression to mirror issue #24252's actual trigger. The chunked-req hybrid-SWA early-return is driven by rem_total_tokens <= 0, where the running batch's reserved-decode offset is sum(max_new - output) * new_token_ratio. In production a KV-pool-full retraction jumps new_token_ratio (log: 0.098 -> 0.7589), ballooning that reservation and parking a mid-prefill chunked req. A retraction only fires on the decode path, and the scripted runtime steps deterministically with overlap scheduling off, so decode (hence the ratio jump) cannot run while a chunked req is being prefilled. So reproduce the retraction's effect directly: a resident holder with a large max_new carries the reservation, r is admitted at a low (decayed) ratio, then the ratio is jumped exactly as retract_decode does -> r is parked via the early-return. Restoring the decayed ratio lets r resume; the test asserts the stash gate left no radix lock refs. Drop the earlier full-attention pool-drain helper (context/pool.py + the two ScriptedContext methods); the ratio-jump path reproduces the real mechanism and no longer needs it.
…ests has_pending_chunk has no accessor and is equivalent to the existing is_chunking (true while the req holds the chunked_req slot, false after abort/pause/retract/finish); every assertion site is consistent with that, so rename rather than add a redundant one-line wrapper.
…anual tests pending_middle_outputs has no harness accessor and is exactly the existing Req.inflight_middle_chunks (incremented per dispatched middle chunk, decremented when its forward result is consumed, zeroed on retract/abort) -- a direct field read, not instrumentation. Rename all references (incl. test/script names and messages) to read r.req.inflight_middle_chunks. Drop the duplicate test_inflight_middle_chunks_non_negative the rename collided with.
…inish guard The scripted harness has no finish_event_count accessor, and 'finish emitted exactly once' is already enforced by the engine: output_streamer asserts 'not req.finished_output' before streaming a finished req, and the req pool asserts req_pool_idx is not None on free -- the scripted runtime surfaces any such scheduler-side crash as a failed test. So the count assertion is redundant: keep driving each scenario (abort / preempt / PP races) and assert the req finished; a double-finish would have crashed the engine. == 1 -> assert finished; <= 1 -> a comment noting the engine enforces at-most-one.
Add a single gated hook at the scheduler's run_batch chokepoint: when scripted_scheduler_hook is set, record each run batch (forward_iter, mode, rids, extend_rids, chunked_rid) into ScriptedSchedulerHook._batch_log (driver rank only; log cleared after the per-script reset drain). No scheduler business-logic change and zero production cost (hook is None otherwise). chunks_done(rid) is then derived in the harness as the number of logged batches where the req held the chunked_req slot and actually ran -- exactly the number of chunked prefill passes (0 when never chunked). Expose it as ScriptedContext.chunks_done / ScriptedReqHandle.chunks_done, so the manual suite's r.chunks_done works unchanged. Add registered tests (<=chunk -> 0, chunk+2 -> 2, 5*chunk -> 5).
ForwardMode is an IntEnum, so .name.lower() is simpler than the is_*() chain and also captures the modes the chain collapsed to 'unknown' (target_verify, split_prefill, dllm_extend, ...).
…nset mode is now Optional[str].
…o-v2 multimodal-spec fix
Remaining red lanes are all unrelated to this PR's diff (PP cross-mb exclusion / scripted harness / gsm8k eval):
Will |
|
/rerun-failed-ci |
PR-introduced — all fixed
Not PR-introduced — infra/perf, only surfaced after merging latest main
Current: 70 SUCCESS / 0 FAILURE / 3 re-running. Will confirm the last reruns settle. Please push back if any conclusion is off. |
#27506 is fully green — 77 SUCCESS / 0 FAILURE / 0 running. All previously-red lanes resolved, each with cross-branch confirmation:
PR-introduced failures were all fixed earlier in this babysit: the PP×chunked-prefill exclusion regression, and the |
… test _in_flight_other_mb_rids + filter_batch(exclude_in_flight_other_mb=...) were cherry-picked from feat/stateless_scheduler_b (commit 69ef71e), where a stateless-scheduler rewrite genuinely introduced a PP+chunked KV corruption (gsm8k 0.66 -> 0.77). That rewrite derived has_pending_chunk from admission-time state so it cleared one step too early, opening a window where a req's last-chunk forward is still in flight in mb_a while mb_b merges it into running_batch and decodes on empty output_ids. upstream/main structurally prevents this window and never had it: - per-mb last_mbs[mb_id] is committed only AFTER _pp_process_batch_result runs on that batch (scheduler_pp_mixin.py), so a req is merged for decode only once its forward result is processed; - inflight_middle_chunks defers output append until every chunk forward is processed (batch_result_processor.py), so decode never runs with empty output_ids; - has_pending_chunk / pending_middle_outputs (the mechanism the fix's docstring describes) never existed on main. So on this branch the code was dead defensive belt-and-suspenders. The accompanying test only asserted that the exclusion mechanism engages and is self-consistent (structural), never that omitting it corrupts output, and the only failure ever seen here was the fix's own over-exclusion. Drop both. filter_batch and the two scheduler call sites now match main exactly.
Removed This code (
So on this branch the code was dead belt-and-suspenders, and its scripted test was tautological — it only asserted that the exclusion mechanism engages and is self-consistent, never that omitting it corrupts output. (The stateless branch itself later dropped the cross-mb scan, concluding the per-mb Verification (sci-h200, 4×GPU, this commit, Qwen3-0.6B): |
CI triage for head Context: after the merge this branch's production code ( Fingerprints:
Plan: let the round finish, then rerun the failed jobs once the 504 infra wave clears. |
|
/rerun-failed-ci |
1 similar comment
|
/rerun-failed-ci |
Rerun update for head One persistent failure remains — and it is not caused by this PR:
Why it's main-inherited, not this PR's: after merging current |
Root cause of the sole remaining red lane When LoRA is enabled, This is not introduced by this PR:
Minimal fix (separate, on Everything else on this PR is green (all infra/perf flakes cleared on rerun). |
Status for head All remaining red lanes are infra / borderline-flake, in code this PR does not touch:
Plan: not re-running mid-round (would cancel in-flight lanes). Once the round settles I'll re-run the failed jobs; the b200 lanes additionally need the runner fleet to be clean to pass. Will keep watching. |
After re-running the failed jobs, every infra/b200 lane went green — the only remaining red is a perf-threshold flake, not this PR:
Re-running this single file to confirm. |
|
/rerun-test test/registered/models_e2e/test_dsa_glm5_tp_mtp.py |
|
Results for 🚀 |
Update: the Net: this PR is effectively green — every one of the ~18 lanes that went red across this round was infra (504-setup wave, b200 dirty-runner GPU-mem pre-check, CPU-OOM on external-model loads, chronic H20) or a perf-threshold flake (dsa_glm5, eagle-dp-attn), all of which passed on re-run. This branch's production code ( |
…e_and_chunked_testing

Stacked on #27502 (mixed-prefix gsm8k eval), which is this PR's base.
Adds the scripted-runtime test harness (one forward per script
yield, precise per-step engine-state assertions over an HTTP-server + scheduler-hook driven engine) and a chunked-prefill test suite (test/manual/chunked_prefill/test_scripted_*.py+test_e2e_*.py), plus kv-canary PP fixtures and the registered scripted-runtime/chunked CI tests.The whole suite was brought green on H200 (registered unit + GPU, manual scripted 89/89 classes, manual e2e 11/11 files) and passed a multi-agent weakening audit.
CI States
Latest PR Test (Base): 🚫 Run #27205607049
Latest PR Test (Extra): ✅ Run #27205605882