model: support deepseek v3 #624
Draft
rebel-kblee wants to merge 66 commits into
get_dp_padding called find_decode_batch_bucket with the max total
tokens across DP, which only equals the batch size for single-token
decode. For multi-token decode (e.g., speculative decoding with
batch=8 reqs and 2 tokens/req → 16 tokens), the bucket lookup tried to
resolve a batch bucket >= 16 and failed against typical
decode_batch_buckets like [1, 4, 8].
Pack num_tokens, num_reqs, and is_prefill into a single bit-packed
int32 (num_tokens in low 16 bits, num_reqs in bits 16..29, is_prefill
in bit 30) so a single all-reduce surfaces both per-rank token counts
and per-rank batch sizes. Use max(num_reqs) across DP for the decode
bucket lookup, and pad to batch_bucket_size * max_tokens_per_req so
the MoE max_pads_across_dp buffer fits every actual token position
even under multi-token decode. Single-token decode is unchanged
(max_tokens_per_req=1 reduces to the previous batch_bucket_size).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
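The packing layout and the padding rule described in this commit can be sketched in plain Python. This is only an illustration under assumptions: the helper names (`pack_dp_info`, `unpack_dp_info`, `find_bucket`, `decode_padding`) are hypothetical, the per-rank values are passed in as a list rather than gathered through a real all-reduce, and `max_tokens_per_req` is derived here as `ceil(num_tokens / num_reqs)`, which the commit message does not spell out.

```python
import math

# Bit layout taken from the commit message.
NUM_TOKENS_BITS = 16      # num_tokens in the low 16 bits
NUM_REQS_SHIFT = 16       # num_reqs in bits 16..29
NUM_REQS_BITS = 14
IS_PREFILL_SHIFT = 30     # is_prefill in bit 30

NUM_TOKENS_MASK = (1 << NUM_TOKENS_BITS) - 1
NUM_REQS_MASK = (1 << NUM_REQS_BITS) - 1


def pack_dp_info(num_tokens: int, num_reqs: int, is_prefill: bool) -> int:
    """Pack the three fields into one value that fits a signed int32."""
    assert 0 <= num_tokens <= NUM_TOKENS_MASK
    assert 0 <= num_reqs <= NUM_REQS_MASK
    return (
        num_tokens
        | (num_reqs << NUM_REQS_SHIFT)
        | (int(is_prefill) << IS_PREFILL_SHIFT)
    )


def unpack_dp_info(packed: int) -> tuple[int, int, bool]:
    """Inverse of pack_dp_info."""
    num_tokens = packed & NUM_TOKENS_MASK
    num_reqs = (packed >> NUM_REQS_SHIFT) & NUM_REQS_MASK
    is_prefill = bool((packed >> IS_PREFILL_SHIFT) & 1)
    return num_tokens, num_reqs, is_prefill


def find_bucket(buckets: list[int], n: int) -> int:
    """Hypothetical stand-in for the bucket lookup: smallest bucket >= n."""
    for bucket in sorted(buckets):
        if bucket >= n:
            return bucket
    raise ValueError(f"no bucket >= {n} in {buckets}")


def decode_padding(per_rank_packed: list[int], buckets: list[int]) -> int:
    """Pad to batch_bucket_size * max_tokens_per_req across all DP ranks."""
    infos = [unpack_dp_info(p) for p in per_rank_packed]
    max_reqs = max(reqs for _, reqs, _ in infos)
    # Assumption: tokens per request is ceil(num_tokens / num_reqs) per rank.
    max_tokens_per_req = max(
        math.ceil(tokens / reqs) if reqs else 1 for tokens, reqs, _ in infos
    )
    return find_bucket(buckets, max_reqs) * max_tokens_per_req
```

With buckets `[1, 4, 8]`, a rank running 8 requests at 2 tokens each resolves bucket 8 from the request count and pads to 16, whereas looking up a bucket for 16 tokens directly would fail.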
Three related fixes that together eliminate hot-path model_wrapper
recompiles seen when one DP rank runs spec decode while peers are idle
or single-token decoding:
- Plumb cross-DP max-tokens-per-request through get_dp_padding /
_prepare_inputs (4-tuple / 10-tuple returns). Every rank reports a
local query length (1 when local has no drafts) so the all-reduce
branch is taken uniformly across the DP group, and callers can pad
to the cross-DP max regardless of local draft state.
- Gate spec-decode padding in execute_model on
max_tokens_per_req_across_dp > 1 instead of local
scheduled_spec_decode_tokens. Previously a rank with no local drafts
skipped padding even when peers raised max_pads_across_dp, producing
an (input_ids[1], max_pads) tuple that did not match any warmup
compile slot.
- Reshape dummy_run input_ids/positions from (bucket, 1) to (bucket,
query_len) when the cross-DP max exceeds 1. The pre-baked dummy
state is (bucket, 1); idle ranks now mirror the spec-mode shape
peers expect.
Verified end-to-end with vllm bench serve under DP=4 + ngram spec
decode (num_speculative_tokens=3): warmup creates 7 compile slots
(prefill + padded-decode/decode-only for q in {1,2,4}) and no
model_wrapper recompile is triggered during the ~9 minutes of bench
traffic that follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
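The shape rule in the commit above can be sketched as follows. It is a minimal simulation, not the runner code: `decode_input_shape` and `step_shapes` are hypothetical names, the per-rank query lengths are passed as a list instead of going through an all-reduce, and the power-of-two round-up is an inference from the warmup slot list `q in {1,2,4}` for `num_speculative_tokens=3`.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def decode_input_shape(bucket: int, max_tokens_per_req_across_dp: int) -> tuple[int, int]:
    """Shape every rank feeds to the model this step.

    The gate is the cross-DP maximum, not the local draft state, so a rank
    with no drafts (or an idle rank in dummy_run) still expands to the shape
    its peers use.
    """
    if max_tokens_per_req_across_dp > 1:
        return (bucket, next_pow2(max_tokens_per_req_across_dp))
    return (bucket, 1)


def step_shapes(local_query_lens: list[int], bucket: int) -> list[tuple[int, int]]:
    """Every rank reports a local query length (1 when it has no drafts)."""
    max_q = max(local_query_lens)  # stands in for the cross-DP MAX
    return [decode_input_shape(bucket, max_q) for _ in local_query_lens]


# Rank 0 has 2 drafts (query length 3); ranks 1..3 are idle or single-token.
shapes = step_shapes([3, 1, 1, 1], bucket=8)
```

All four ranks end up with the same `(8, 4)` shape, which is the property that keeps the step inside the warmup compile slots.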
Runtime query length is now exactly 1 (no spec) or num_spec_tokens + 1
(full spec) — never an intermediate value. This eliminates the
non-pow2 query_len edge cases that previously forced hot-path
model_wrapper recompiles and reduces the warmup compile slot count
from 7 to 5 (prefill + padded/decode-only at q=1, q=4).
Runner (vllm_rbln/v1/worker/rbln_model_runner.py):
- __init__: assert (num_speculative_tokens + 1) is a power of two when
spec is configured. Required for MoE multicast's
max_pads / num_tokens divisibility.
- _warm_up_model_inner: query_len_range = [1, num_spec_tokens + 1]
(was the full pow2 sequence up to that maximum).
- execute_model: spec_decode_max_query_len is simply num_spec_tokens+1
when this rank has any draft tokens this step, else 1. The
pow2-round-up logic for intermediate sizes is gone.
Scheduler (vllm_rbln/v1/core/rbln_scheduler.py):
- spec_decode_cap update becomes a per-request binary block-boundary
decision: max_spec_decode_len if remaining_in_block and
remaining_in_maxlen can hold a full window, else 1. The retroactive
trim then aligns every scheduled req's num_scheduled_tokens onto the
same {1, num_spec_tokens+1} shape so the runner-side pad never
writes past anyone's block boundary.
- RBLNSchedulerOutput grows a `step_no_spec_required: bool` field set
True only when the binary cap was forced to 1 by the boundary check
(distinct from "no drafts proposed this step", which leaves it
False).
Cross-DP collective handling — distinguishes two reasons for a local
no-spec state and treats them differently:
- (a) boundary-induced (`step_no_spec_required=True` on some rank):
OR-reduce the flag across DP. On True, every rank scrubs its
scheduler_output (clears drafts, sets num_scheduled=1, recomputes
totals) before _prepare_inputs builds the input tensors. This keeps
the model_wrapper input shape uniform at query_len=1 across DP and
prevents pad-position KV writes past any rank's block boundary.
- (b) no-drafts-proposed (`step_no_spec_required=False`, local
scheduled_spec_decode_tokens empty): keep the existing cross-DP
MAX behavior so peers that do have drafts can run full spec. The
no-drafts rank gets padded to peers' query_len; pad positions land
on lookahead-allocated slots, which is functionally safe (their
outputs are discarded by the rejection sampler).
- dummy_run also participates in the new OR-reduce (voting 0) so the
host-side gloo all_reduce doesn't hang when one DP rank is idle
while peers run execute_model.
The added cross-DP communication is one extra int32 all_reduce per
step on the existing cpu_group, before model inference. Verified
end-to-end with DP=4 ngram spec decode over a 9-minute hot-path
bench: 5 warmup model_wrapper slots, 0 post-warmup recompiles. The
(a) collective-fallback path is not exercised by this bench
(input+output < block_size) and is planned for separate verification
via a long-output scenario or unit test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
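The binary cap and the two cross-DP cases can be sketched like this. The function names are hypothetical, and the OR-reduce and MAX-reduce are modeled with `any` and `max` over per-rank lists rather than a real gloo all_reduce.

```python
def is_power_of_two(n: int) -> bool:
    """Check used for the (num_speculative_tokens + 1) assertion."""
    return n > 0 and (n & (n - 1)) == 0


def binary_spec_cap(
    remaining_in_block: int, remaining_in_maxlen: int, max_spec_decode_len: int
) -> int:
    """Full window if it fits in both budgets, otherwise single-token decode."""
    fits = min(remaining_in_block, remaining_in_maxlen) >= max_spec_decode_len
    return max_spec_decode_len if fits else 1


def resolve_query_len(
    step_no_spec_required: list[bool], local_query_lens: list[int]
) -> int:
    """Query length all DP ranks use this step.

    (a) Any rank flags a boundary-induced no-spec step: everyone drops to 1.
    (b) Otherwise: cross-DP MAX, so ranks without drafts pad up to their peers.
    An idle rank in dummy_run contributes False and a local length of 1.
    """
    if any(step_no_spec_required):  # OR-reduce across DP
        return 1
    return max(local_query_lens)    # MAX-reduce across DP


num_spec_tokens = 3
assert is_power_of_two(num_spec_tokens + 1)
cap = binary_spec_cap(remaining_in_block=2, remaining_in_maxlen=100,
                      max_spec_decode_len=num_spec_tokens + 1)
```

Here `cap` is 1 because only two slots remain in the block, so this rank would set `step_no_spec_required=True` and every rank would run at `query_len=1`.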
- ruff-format reformatting across the three files touched by the
  preceding two commits (line wrapping / single-line conversions that
  ruff prefers).
- Add `assert batch_bucket_size is not None` in get_dp_padding's
  path-B branch so mypy can narrow the type before the
  `batch_bucket_size * max_tokens_per_req` multiplication.
No behavioral change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document why MoE DP + spec decoding needs cross-DP coordination and
how vllm-rbln handles it:
- Root cause: per-rank ngram outcomes diverge, so each DP rank's local
  decision splits between no-spec (`query_len=1`) and full-spec
  (`query_len=num_spec_tokens+1`). Local decisions alone leave the DP
  world with mismatched MoE collective shapes / forced recompiles.
- Solution: lift each rank's local decision into a single global
  decision via cross-DP collective MAX, so the model_wrapper compiles
  only for the shapes the DP world will drive together. Bit-packed
  int32 channel carries num_tokens, num_reqs, and is_prefill in a
  single all-reduce per step.
- Block-boundary edge case: legacy cross-DP collective fallback path
  (`step_no_spec_required` OR-reduce) is retained for back-compat but
  no longer fires under the sliding-window scheduler decision; pointer
  added to `docs/sliding_window.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the legacy collective binary-cap + retroactive-trim with a
per-request sliding-window mechanism. For each running decode req
whose remaining_in_block / remaining_in_maxlen budget cannot hold a
full num_spec_tokens+1 query window:
- Compute desired_slide = max_spec_decode_len - effective_remaining
- Trim num_scheduled_tokens to effective_remaining and re-trim
scheduled_spec_decode_tokens to match
- Record slide_distance in the new
RBLNSchedulerOutput.spec_decode_slide_distance map so the runner
can prepend that many past tokens to the query window and keep the
full num_spec_tokens+1 shape
Reqs whose full window already fits the current block are untouched
(they run normal full spec). The mechanism assumes
block_size >= num_spec_tokens + 1; under that invariant past tokens
are always sufficient (num_computed_tokens reaches block boundary
only after at least one full block's worth of past positions has
accumulated), so we assert rather than maintain a no-spec fallback:
assert effective_remaining >= 1
assert desired_slide <= available_past
The legacy step_no_spec_required field stays on RBLNSchedulerOutput
for backward compatibility with runner-side code that still reads it
(default False; runner cleanup follows in a later commit).
The retroactive trim block and spec_decode_cap variable that
implemented the previous batch-wide collective fallback are removed.
This is scheduler-only; the runner has not yet been taught how to
consume spec_decode_slide_distance — that's the next step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
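A minimal sketch of the per-request boundary decision described in this commit, assuming the inputs are already available as plain integers. `slide_for_request` is a hypothetical helper, and `available_past` stands for however many already-computed positions the request has.

```python
def slide_for_request(
    num_scheduled: int,
    drafts: list[int],
    remaining_in_block: int,
    remaining_in_maxlen: int,
    available_past: int,
    max_spec_decode_len: int,
) -> tuple[int, list[int], int]:
    """Return (num_scheduled, kept_drafts, slide_distance) for one decode req."""
    effective_remaining = min(remaining_in_block, remaining_in_maxlen)
    if effective_remaining >= max_spec_decode_len:
        # Full window fits in the current block: request is untouched.
        return num_scheduled, drafts, 0

    # Invariants asserted instead of keeping a no-spec fallback.
    assert effective_remaining >= 1
    desired_slide = max_spec_decode_len - effective_remaining
    assert desired_slide <= available_past

    # Trim the advance to the budget; one scheduled token is the base token,
    # the rest are drafts.
    new_n = min(num_scheduled, effective_remaining)
    kept_drafts = drafts[: new_n - 1]
    return new_n, kept_drafts, desired_slide


# Two slots left in the block, window of 4 (num_spec_tokens = 3).
result = slide_for_request(
    num_scheduled=4, drafts=[11, 12, 13],
    remaining_in_block=2, remaining_in_maxlen=100,
    available_past=1022, max_spec_decode_len=4,
)
```

In this example the request advances by 2 tokens, keeps one draft, and records a slide distance of 2, so the runner can prepend two past tokens and keep the 4-wide window.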
Wire `RBLNSchedulerOutput.spec_decode_slide_distance` into
`_prepare_inputs` so the query window for boundary-affected reqs gets
prepended with past positions whose KV is already in cache. Reqs
without slide are unaffected: when the dict is empty the math
collapses back to the standard flow.
Concretely, at the top of _prepare_inputs:
- Build a per-req `slide_arr` from spec_decode_slide_distance.
- `query_lengths = num_scheduled_tokens + slide_arr`: the actual
  per-req window length the model will see (= num_spec_tokens+1 for
  boundary reqs, num_scheduled_tokens otherwise).
- `total_query_tokens = sum(query_lengths)`: the flat token count the
  runner has to materialize.
- `req_indices`, `cu_num_tokens`, `arange` are built off query_lengths
  so each req gets contiguous slots for its full window.
- `positions_np = (num_computed_cpu - slide_arr) + arange`, shifting
  the window backward by `slide` for boundary reqs so all positions
  land within already-allocated current-block KV slots.
Downstream tensor sizing (input_ids / positions / slot_mapping /
mrope_positions / CommonAttentionMetadata.num_actual_tokens) is
switched from `total_num_scheduled_tokens` (logical advance) to
`total_query_tokens`, and `max_num_scheduled_tokens` /
`get_dp_padding(num_tokens=...)` are likewise computed against the
sliding-aware lengths.
`block_table.compute_slot_mapping(req_indices, positions_np)` is
called with the sliding-extended (req_indices, positions_np), so past
positions resolve to their existing valid slots: past KV gets
idempotently re-written, no -1 sentinel needed.
Cross-DP / dummy_run / spec_decode_metadata / rejection sampler are
not touched yet (Tasks #6–#8). The legacy collective fallback path
remains in place; sliding takes over for the boundary cases the old
flag used to handle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test groups under tests/torch_compile/unit/test_sliding_window.py:
1) TestSchedulerSliding — drive RBLNScheduler.schedule() through five
boundary-or-not configurations and assert:
- no slide entry / unchanged num_scheduled / kept drafts when full
window fits;
- slide_distance == max_spec_decode_len - effective_remaining and
num_scheduled trimmed to effective_remaining when boundary hits;
- drafts trimmed to (effective_remaining - 1);
- step_no_spec_required stays False (no collective fallback);
- per-req independence (one req slides, the other does not).
2) TestRunnerSlidingMath — reimplement the per-req block at the top of
RBLNModelRunner._prepare_inputs as pure numpy and assert that:
- positions land at [T - slide .. T + R - 1] for boundary reqs;
- input_ids match the corresponding token_ids_cpu entries — i.e.
the past tokens really show up at the start of the query window;
- no-slide case is identity-equivalent to the standard flow;
- a mixed batch (one slide, one not) produces the expected
concatenated layout.
Also: drop the `assert not num_speculative_tokens` placeholder in the
create_scheduler test helper. The helper already wires through
SpeculativeConfig when the arg is provided, so the assertion was the
only thing blocking spec-decode unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a TestSlidingLogitsIndices class that replays the per-req block of
_calc_spec_decode_metadata as pure numpy and verifies that, when
cu_num_scheduled_tokens is fed the sliding-aware query_lengths
(num_scheduled + slide), the resulting logits_indices correctly skip
the prepended past positions of every req's window.
Cases covered:
- baseline (no slide, full spec drafts): all positions sampled,
- boundary slide=2 drafts=1: flat positions [past,past,base,draft]
  yield logits_indices = [2, 3],
- boundary slide=3 drafts=0: only the base logit is sampled,
- mixed batch with one full-spec req and one boundary req: past
  positions of the boundary req are skipped in the flat layout while
  the full-spec req contributes all its positions.
No new logic was needed in the runner. Task #4's switch of
cu_num_tokens from num_scheduled to query_lengths is what already
makes the math work. This test pins the invariant down so future
refactors can't silently break it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
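The invariant the test class pins down can be reproduced with a few lines of list arithmetic. This is a sketch, not the runner's numpy code: `sliding_logits_indices` is a hypothetical helper that takes per-request query lengths (slide included) and draft counts, and samples the last `num_draft + 1` positions of each window.

```python
def sliding_logits_indices(
    query_lengths: list[int], num_draft_tokens: list[int]
) -> list[int]:
    """Flat indices whose logits are sampled; prepended past slots are skipped."""
    logits_indices: list[int] = []
    cu_end = 0
    for qlen, n_draft in zip(query_lengths, num_draft_tokens):
        cu_end += qlen                      # cumulative end of this req's window
        num_sampled = n_draft + 1           # base token + drafts
        assert num_sampled <= qlen
        logits_indices.extend(range(cu_end - num_sampled, cu_end))
    return logits_indices


baseline = sliding_logits_indices([4], [3])          # no slide, 3 drafts
boundary = sliding_logits_indices([4], [1])          # slide=2, 1 draft
base_only = sliding_logits_indices([4], [0])         # slide=3, 0 drafts
mixed = sliding_logits_indices([4, 4], [3, 1])       # full-spec + boundary
```

The boundary case yields `[2, 3]` for the layout `[past, past, base, draft]`, matching the case listed in the commit message.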
… trim
Add TestSlidingDraftTokenExtraction class verifying that the draft
tokens the rejection sampler actually validates are the post-trim
surviving drafts — not the original pre-slide proposals — and that
no past tokens leak into the draft tensor.
The extraction logic exercised matches what _calc_spec_decode_metadata
does in the runner:
draft_token_ids = input_ids[logits_indices][target_logits_indices + 1]
With sliding-aware cu_num_tokens (= cumsum of query_lengths including
slide), this expression should pick exactly the drafts the scheduler
kept in scheduled_spec_decode_tokens.
Cases covered:
- baseline no-slide full spec — all 3 drafts extracted unchanged,
- slide=2 with 1 kept draft — only the surviving draft returned
(note dropped drafts never enter input_ids in the first place;
the test pins down that past-token slots aren't misread as drafts
either),
- slide=3 no drafts — empty extraction,
- mixed batch (one full-spec, one boundary-with-1-draft) — drafts
concatenated per-req, no contamination across req boundaries.
No runner-side or rejection-sampler code change was needed for
Task #7. Scheduler-side trim (Task #3) propagates through the
num_draft_tokens / cu_num_draft_tokens / logits_indices machinery
automatically because Task #4 already switched cu_num_tokens to be
sliding-aware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
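The extraction expression quoted above can be replayed on small lists. The sketch below is an illustration under assumptions: `extract_draft_tokens` is a hypothetical helper, lists stand in for tensors, and `target_logits_indices` is built as the first `num_draft` sampled slots of each request.

```python
def extract_draft_tokens(
    input_ids: list[int], query_lengths: list[int], num_draft_tokens: list[int]
) -> list[int]:
    """Replay input_ids[logits_indices][target_logits_indices + 1] on lists."""
    logits_indices: list[int] = []
    target_logits_indices: list[int] = []
    cu_end = 0        # cumulative query tokens (slide included)
    cu_sampled = 0    # cumulative sampled positions
    for qlen, n_draft in zip(query_lengths, num_draft_tokens):
        cu_end += qlen
        num_sampled = n_draft + 1
        logits_indices.extend(range(cu_end - num_sampled, cu_end))
        # Positions whose logits verify a draft: all sampled slots but the last.
        target_logits_indices.extend(range(cu_sampled, cu_sampled + n_draft))
        cu_sampled += num_sampled

    sampled_ids = [input_ids[i] for i in logits_indices]
    return [sampled_ids[t + 1] for t in target_logits_indices]


# Req 0: full spec      -> [base=100, drafts 1, 2, 3]
# Req 1: slide=2, 1 kept -> [past=900, past=901, base=200, draft=7]
flat_input_ids = [100, 1, 2, 3, 900, 901, 200, 7]
drafts = extract_draft_tokens(flat_input_ids, [4, 4], [3, 1])
```

The result contains only the surviving drafts of each request in order; the past tokens 900 and 901 never appear.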
Add TestSlidingEdgeCases class to pin down the three guards that keep
the sliding-window decision a no-op outside its intended scope:
1) num_speculative_tokens=None: the entire sliding block in scheduler
   is gated on self.num_spec_tokens > 0. Verify that disabling spec
   leaves spec_decode_slide_distance empty and the req runs as
   standard single-token decode.
2) Prefill-phase req: the sliding block is also gated on
   `not is_prefill(request)`. A req still in prefill must never get a
   slide entry even if its prompt length places it near a block
   boundary.
3) Decode req comfortably mid-block: when remaining_in_block is much
   larger than num_spec_tokens+1, the boundary condition
   `effective_remaining < max_spec_decode_len` is False and the req
   must run at full num_scheduled with no slide entry.
These regressions would silently corrupt non-spec or prefill flows so
they're worth pinning down explicitly. Total sliding-window test count
is now 20 (3 + 5 + 4 + 4 + 4 across the five classes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ility
Add an INFO log line in the per-req boundary-detected branch of
RBLNScheduler.schedule(). Fires only when a req's effective_remaining
drops below num_spec_tokens + 1 and we record a slide_distance, i.e.,
exactly when the new sliding-window mechanism is exercised. Rate is
bounded by the boundary-hit frequency (~num_spec_tokens / block_size
per req per step, ≈0.3% for the typical 1024/3 config), so this is
workload-cheap noise while end-to-end runs are validating the path.
Each line carries num_computed, remaining_in_block,
remaining_in_maxlen, slide_distance, the resulting advance, and the
count of drafts kept: enough to reconstruct the per-req scheduler
decision from server.log without enabling DEBUG.
May be downgraded to logger.debug after sliding has been validated in
production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
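The quoted rate can be checked by counting block offsets at which the boundary condition holds. The sketch assumes a request that advances one token per step and a `remaining_in_maxlen` that never binds; `boundary_hit_fraction` is a hypothetical helper.

```python
def boundary_hit_fraction(block_size: int, num_spec_tokens: int) -> float:
    """Fraction of in-block offsets where remaining_in_block < num_spec_tokens + 1."""
    window = num_spec_tokens + 1
    hits = sum(
        1
        for offset in range(block_size)          # num_computed % block_size
        if block_size - offset < window          # remaining_in_block < window
    )
    return hits / block_size


rate = boundary_hit_fraction(block_size=1024, num_spec_tokens=3)
```

For the 1024/3 configuration the condition holds at offsets 1021, 1022, and 1023, so `rate` is 3/1024, roughly 0.29%.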
`_prepare_inputs` prepends past tokens to a sliding decode req's query
window, so the flat input layout grows by sum(slide_distance) beyond
scheduler_output.total_num_scheduled_tokens. Three downstream sites
sliced on the pre-sliding count and triggered an IndexError once a
real boundary fired:
- `_preprocess.num_input_tokens` lost the past-token suffix it should
  have read.
- The outer `execute_model`'s `num_tokens_unpadded` (feeding
  `_get_slot_mappings`) skipped past slots.
- `pad_speculative_draft_tokens` + the `unpadded_to_padded` remap used
  num_scheduled_tokens[req] only, while `logits_indices` already
  carried indices into the (scheduled + slide) layout.
Each site now adds the per-req or aggregate slide_distance pulled from
spec_decode_slide_distance, restoring the slide-aware invariant the
runner's own input building relies on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small polish items on the sliding-window scheduler decision:
- Replace the legacy "Full-spec-or-no-spec binary cap" comment block
  with one that describes the actual sliding-window design (slide
  query window backward, idempotent KV re-write, drafts that would
  cross the boundary get trimmed). The old wording survived from the
  collective-fallback approach and no longer matches the code.
- Extend the sliding info log with a `proposed_drafts` field
  (= old_n - 1 = what ngram returned before the trim). Pair with the
  existing `kept_drafts` cap and the runner-side `num_draft_tokens` to
  expose draft drops directly in logs: a sliding step drops
  proposed_drafts - kept_drafts drafts.
While here, collapse the redundant
`if not is_prefill: if num_spec > 0` pair that ruff (SIM102) flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an env-gated diagnostic that dumps the full per-step trace for the
first N sliding requests on each worker. Off by default
(VLLM_RBLN_SLIDING_TRACE_REQS=0), so production runs incur no logging
cost. When enabled, every sliding event for a tracked request emits
positions, input_ids, slot block ids, the logits indices that survived
past-position exclusion, and the num_draft_tokens reaching the
rejection sampler. CPU host code in `_prepare_inputs`; never traced
into the compiled model graph.
Captures the FULL per-req sequence (1021 → 1022 → 1023 boundary events
for the same req) so the timeline can be reconstructed from contiguous
log lines. This is useful for verifying past-prepend / slot range /
logits exclusion / effective drafts together against the scheduler's
spec-decode sliding log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk through what the spec-decode sliding window does, how the
scheduler and runner instrument it, and what a real
`vllm bench serve` run shows. Three case studies pulled from
MiniMax-M2.5 traces:
- ngram-miss boundary traversal (slide 1 → 2 → 3 for one req), proving
  past tokens are idempotently re-fed, slot mapping stays inside the
  current block, past logits are excluded, and num_draft_tokens
  reflects what the rejection sampler actually sees.
- ngram-hit with kept drafts (slide=1, 2 drafts kept).
- ngram-hit with drafts dropped by sliding (slide=1/2/3, dropping
  1/2/3 drafts respectively when ngram fills the proposal cap): the
  central design payoff.
Also documents how to reproduce (env vars, bench command, log greps)
and notes about the same-num_computed log artifact observed during
stress testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sliding window previously fired only at block boundaries; off the
boundary the runtime query length still followed the proposer's output
(1 base + up to num_spec_tokens drafts), and ranks with no drafts
dispatched a separate (8, 1) no_spec shape. Cross-DP MAX papered over
per-rank divergence at runtime, so the no_spec compile slot was warmed
up but almost never used in DP setups.
Make the lift explicit and local: the scheduler always pads each
decode req's query window to num_spec_tokens + 1 via slide_distance,
regardless of how many drafts the proposer returned or whether a
boundary is in sight. Variable-length proposers (ngram, suffix
decoding) and fixed-length proposers (MTP, EAGLE) both converge to the
same runtime shape; boundary squeeze is one special case of the same
padding rule.
Changes:
- Scheduler: replace boundary-only sliding with
  `new_n = min(old_n, effective_remaining)` and
  `desired_slide = max_spec_decode_len - new_n`. Fires whenever the
  deficit is non-zero (length pad, boundary squeeze, or both); trims
  drafts only when boundary actually shortened the advance.
- Runner warmup: drop the (8, 1) no_spec model_wrapper compile;
  `query_len_range = [num_spec_tokens + 1]` only. Saves two compile
  slots in MoE warmup.
- Runner runtime: `spec_decode_max_query_len` is unconditionally
  num_spec_tokens + 1 when spec is configured. Cross-DP MAX is now a
  no-op (every rank votes the same value) but kept for shape
  uniformity guarantees.
- Runner dummy_run: idle DP ranks now also report num_spec_tokens + 1
  and expand their dummy input to (bucket, num_spec_tokens + 1).
  Without this an all-idle step (all 4 DP ranks dummy_run) would
  collapse to (bucket, 1) and trigger a hot-path model_wrapper
  recompile against the dropped no_spec slot.
- Tests: add `TestSlidingVariableLengthPadding` scheduler cases (zero /
  partial drafts off the boundary, partial drafts at the boundary) and
  runner-math cases that mirror the local-only path cross-DP MAX used
  to provide.
- Docs: rewrite `docs/sliding_window.md` opening to describe the
  unified design and update the scheduler log field semantics.
Verified end-to-end on MiniMax-M2.5 / DP=4 / EP=4 / num_spec=3 with a
128-prompt bench (input 512, output 1500, rps=4):
- 0 bench-time model_wrapper recompiles (was 2 before)
- 0 errors, 128/128 successful
- 38140 sliding events, 93% slide=3 (ngram-miss length-pad path)
- Output throughput 571 tok/s (was 376 tok/s, +52%)
- ITL p99 144ms (was 1500ms+, bimodal collapsed)
- Acceptance rate 38.9% (was 32.6%)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
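The unified padding rule from the scheduler bullet can be sketched as a single function. The two formulas are quoted from the commit message; `pad_to_full_window` and its return layout are hypothetical, and `old_n` is taken to be the base token plus the proposed drafts.

```python
def pad_to_full_window(
    old_n: int,
    drafts: list[int],
    effective_remaining: int,
    max_spec_decode_len: int,
) -> tuple[int, list[int], int]:
    """Return (new_n, kept_drafts, slide_distance) for one decode req."""
    new_n = min(old_n, effective_remaining)
    desired_slide = max_spec_decode_len - new_n
    # Drafts are trimmed only when the boundary shortened the advance.
    kept_drafts = drafts[: new_n - 1]
    return new_n, kept_drafts, desired_slide


W = 4  # num_spec_tokens + 1 with num_spec_tokens = 3
cases = {
    "ngram miss, mid-block": pad_to_full_window(1, [], 100, W),
    "partial drafts, mid-block": pad_to_full_window(2, [5], 100, W),
    "full drafts, mid-block": pad_to_full_window(4, [5, 6, 7], 100, W),
    "full drafts, at boundary": pad_to_full_window(4, [5, 6, 7], 2, W),
}
# Every case produces the same runtime query length.
window_lengths = {name: n + slide for name, (n, _, slide) in cases.items()}
```

Under this rule `new_n + slide_distance` always equals `num_spec_tokens + 1`, which is why a single compiled query length suffices; the ngram-miss case gives slide=3, the path reported as 93% of sliding events in the bench.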
…RBLN-SW/vllm-rbln into support_deepseek_v3
…e trace logs to DEBUG
- test_sliding_window.py -> test_query_backfill.py (+ 6 class names)
- log prefix: "spec-decode sliding" -> "spec-decode backfill"
  (scheduler)
- log prefix: "sliding-trace" -> "backfill-trace" (runner)
- env var: VLLM_RBLN_SLIDING_TRACE_REQS ->
  VLLM_RBLN_BACKFILL_TRACE_REQS
- demote scheduler + runner per-step diagnostics from INFO to DEBUG
- identifiers (slide_distance, _run_sliding_math, etc.) kept for
  stability with naming-note docstrings explaining the equivalence
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backfill removes the boundary cap + retroactive trim mechanism these 7 tests guarded. Keep at_block_boundary, no_spec_tokens_no_retroactive_trim, prefill_triggers_no_mixed_batching (general scheduler invariants). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- new 3-section layout: Problem / Key idea / Example + Appendix
- query_backfill.md: 343 -> 199 lines (running trace preserved)
- cross_dp_spec_decode.md: 149 -> 145 lines
- add naming-note: "sliding window" == "query backfill" (equivalent)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-commit CI flagged this hunk for reformatting (single-line ternary). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🚀 Summary of Changes
Support DeepSeek V3 with the MTP module.