model: support deepseek v3 #624

Draft
rebel-kblee wants to merge 66 commits into dev from support_deepseek_v3
Conversation

@rebel-kblee

@rebel-kblee rebel-kblee commented May 26, 2026

Contributor

🚀 Summary of Changes

Support DeepSeek V3 with the MTP module.

What does this PR do? What feature, fix, or improvement does it bring?


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


rebel-kblee and others added 30 commits May 6, 2026 06:23
get_dp_padding called find_decode_batch_bucket with the max total tokens
across DP, which only equals the batch size for single-token decode. For
multi-token decode (e.g., speculative decoding with batch=8 reqs and 2
tokens/req → 16 tokens), the bucket lookup tried to resolve a batch
bucket >= 16 and failed against typical decode_batch_buckets like
[1, 4, 8].
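
The failure mode above can be reproduced with a minimal sketch. `find_bucket` is a hypothetical stand-in for `find_decode_batch_bucket` (whose real signature is not shown in this PR); it assumes the lookup returns the smallest bucket that is at least the requested size.

```python
from bisect import bisect_left


def find_bucket(buckets: list[int], size: int) -> int:
    """Return the smallest bucket >= size (hypothetical stand-in)."""
    idx = bisect_left(buckets, size)
    if idx == len(buckets):
        raise ValueError(f"no decode batch bucket >= {size} in {buckets}")
    return buckets[idx]


decode_batch_buckets = [1, 4, 8]
num_reqs, tokens_per_req = 8, 2

# Looking up by request count resolves to the 8-request bucket.
bucket_by_reqs = find_bucket(decode_batch_buckets, num_reqs)

# Looking up by total tokens (8 * 2 = 16) finds no matching bucket.
try:
    find_bucket(decode_batch_buckets, num_reqs * tokens_per_req)
    lookup_by_tokens_failed = False
except ValueError:
    lookup_by_tokens_failed = True
```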

Pack num_tokens, num_reqs, and is_prefill into a single bit-packed int32
(num_tokens in low 16 bits, num_reqs in bits 16..29, is_prefill in bit 30)
so a single all-reduce surfaces both per-rank token counts and per-rank
batch sizes. Use max(num_reqs) across DP for the decode bucket lookup,
and pad to batch_bucket_size * max_tokens_per_req so the MoE
max_pads_across_dp buffer fits every actual token position even under
multi-token decode. Single-token decode is unchanged
(max_tokens_per_req=1 reduces to the previous batch_bucket_size).
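
A minimal sketch of the bit layout and the padding rule described above. The helper names (`pack_dp_state`, `unpack_dp_state`, `decode_padding`) are illustrative assumptions, not the functions in this PR; only the field positions and the `batch_bucket_size * max_tokens_per_req` product come from the commit text.

```python
TOKEN_BITS = 16      # num_tokens occupies bits 0..15
REQ_BITS = 14        # num_reqs occupies bits 16..29
PREFILL_SHIFT = 30   # is_prefill occupies bit 30


def pack_dp_state(num_tokens: int, num_reqs: int, is_prefill: bool) -> int:
    """Pack the three fields into one non-negative int32-sized value."""
    assert 0 <= num_tokens < (1 << TOKEN_BITS)
    assert 0 <= num_reqs < (1 << REQ_BITS)
    return (
        num_tokens
        | (num_reqs << TOKEN_BITS)
        | (int(is_prefill) << PREFILL_SHIFT)
    )


def unpack_dp_state(packed: int) -> tuple[int, int, bool]:
    """Inverse of pack_dp_state."""
    num_tokens = packed & ((1 << TOKEN_BITS) - 1)
    num_reqs = (packed >> TOKEN_BITS) & ((1 << REQ_BITS) - 1)
    is_prefill = bool((packed >> PREFILL_SHIFT) & 1)
    return num_tokens, num_reqs, is_prefill


def decode_padding(batch_bucket_size: int, max_tokens_per_req: int) -> int:
    """Padded token count for a decode step."""
    return batch_bucket_size * max_tokens_per_req
```

With 8 requests at 2 tokens each, `decode_padding(8, 2)` gives 16 padded positions, and `decode_padding(8, 1)` reduces to the bucket size itself.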

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three related fixes that together eliminate hot-path model_wrapper
recompiles seen when one DP rank runs spec decode while peers are idle
or single-token decoding:

- Plumb cross-DP max-tokens-per-request through get_dp_padding /
  _prepare_inputs (4-tuple / 10-tuple returns). Every rank reports a
  local query length (1 when local has no drafts) so the all-reduce
  branch is taken uniformly across the DP group, and callers can pad
  to the cross-DP max regardless of local draft state.

- Gate spec-decode padding in execute_model on
  max_tokens_per_req_across_dp > 1 instead of local
  scheduled_spec_decode_tokens. Previously a rank with no local drafts
  skipped padding even when peers raised max_pads_across_dp, producing
  an (input_ids[1], max_pads) tuple that did not match any warmup
  compile slot.

- Reshape dummy_run input_ids/positions from (bucket, 1) to (bucket,
  query_len) when the cross-DP max exceeds 1. The pre-baked dummy
  state is (bucket, 1); idle ranks now mirror the spec-mode shape
  peers expect.
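
The gating and dummy-shape rules above can be sketched as follows. Both helpers are hypothetical, and the zero-filled dummy tensors are an assumption; the PR does not show what the pre-baked dummy state contains.

```python
import numpy as np


def needs_spec_padding(max_tokens_per_req_across_dp: int) -> bool:
    """Pad whenever any DP rank runs multi-token decode, not only this one."""
    return max_tokens_per_req_across_dp > 1


def dummy_inputs(bucket: int, max_tokens_per_req_across_dp: int):
    """Build idle-rank dummy inputs that mirror the peers' query length."""
    query_len = max(1, max_tokens_per_req_across_dp)
    input_ids = np.zeros((bucket, query_len), dtype=np.int32)
    positions = np.zeros((bucket, query_len), dtype=np.int32)
    return input_ids, positions
```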

Verified end-to-end with vllm bench serve under DP=4 + ngram spec
decode (num_speculative_tokens=3): warmup creates 7 compile slots
(prefill + padded-decode/decode-only for q in {1,2,4}) and no
model_wrapper recompile is triggered during the ~9 minutes of bench
traffic that follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runtime query length is now exactly 1 (no spec) or num_spec_tokens + 1
(full spec) — never an intermediate value. This eliminates the
non-pow2 query_len edge cases that previously forced hot-path
model_wrapper recompiles and reduces the warmup compile slot count
from 7 to 5 (prefill + padded/decode-only at q=1, q=4).

Runner (vllm_rbln/v1/worker/rbln_model_runner.py):
- __init__: assert (num_speculative_tokens + 1) is a power of two when
  spec is configured. Required for MoE multicast's
  max_pads / num_tokens divisibility.
- _warm_up_model_inner: query_len_range = [1, num_spec_tokens + 1]
  (was the full pow2 sequence up to that maximum).
- execute_model: spec_decode_max_query_len is simply num_spec_tokens+1
  when this rank has any draft tokens this step, else 1. The
  pow2-round-up logic for intermediate sizes is gone.
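
A minimal sketch of the three runner rules listed above. The function names are illustrative assumptions; only the power-of-two requirement and the `{1, num_spec_tokens + 1}` query lengths come from the commit text.

```python
from typing import Optional


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (a single set bit)."""
    return n > 0 and (n & (n - 1)) == 0


def query_len_range(num_spec_tokens: Optional[int]) -> list[int]:
    """Query lengths to warm up: 1 only, or 1 plus the full spec window."""
    if not num_spec_tokens:
        return [1]
    assert is_power_of_two(num_spec_tokens + 1), (
        "num_speculative_tokens + 1 must be a power of two"
    )
    return [1, num_spec_tokens + 1]


def spec_decode_max_query_len(num_spec_tokens: int, has_local_drafts: bool) -> int:
    """Binary runtime decision: full window if this rank has drafts, else 1."""
    return num_spec_tokens + 1 if has_local_drafts else 1
```

With `num_spec_tokens=3` the range is `[1, 4]`, which is consistent with the 5 warmup slots quoted above if each query length gets a padded-decode and a decode-only slot in addition to the prefill slot.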

Scheduler (vllm_rbln/v1/core/rbln_scheduler.py):
- spec_decode_cap update becomes a per-request binary block-boundary
  decision: max_spec_decode_len if remaining_in_block and
  remaining_in_maxlen can hold a full window, else 1. The retroactive
  trim then aligns every scheduled req's num_scheduled_tokens onto the
  same {1, num_spec_tokens+1} shape so the runner-side pad never
  writes past anyone's block boundary.
- RBLNSchedulerOutput grows a `step_no_spec_required: bool` field set
  True only when the binary cap was forced to 1 by the boundary check
  (distinct from "no drafts proposed this step", which leaves it
  False).

Cross-DP collective handling — distinguishes two reasons for a local
no-spec state and treats them differently:
- (a) boundary-induced (`step_no_spec_required=True` on some rank):
  OR-reduce the flag across DP. On True, every rank scrubs its
  scheduler_output (clears drafts, sets num_scheduled=1, recomputes
  totals) before _prepare_inputs builds the input tensors. This keeps
  the model_wrapper input shape uniform at query_len=1 across DP and
  prevents pad-position KV writes past any rank's block boundary.
- (b) no-drafts-proposed (`step_no_spec_required=False`, local
  scheduled_spec_decode_tokens empty): keep the existing cross-DP
  MAX behavior so peers that do have drafts can run full spec. The
  no-drafts rank gets padded to peers' query_len; pad positions land
  on lookahead-allocated slots, which is functionally safe (their
  outputs are discarded by the rejection sampler).
- dummy_run also participates in the new OR-reduce (voting 0) so the
  host-side gloo all_reduce doesn't hang when one DP rank is idle
  while peers run execute_model.
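
The two cases can be sketched without a real process group by treating the DP world as a list of per-rank votes. `RankVote` and `resolve_query_len` are hypothetical names; the `any` and `max` calls stand in for the OR-reduce and MAX all-reduce on the CPU group.

```python
from dataclasses import dataclass


@dataclass
class RankVote:
    step_no_spec_required: bool  # True only for a boundary-induced fallback
    local_query_len: int         # 1, or num_spec_tokens + 1


def resolve_query_len(votes: list[RankVote]) -> int:
    """Return the query length every DP rank should run this step."""
    # (a) Boundary-induced: one rank's flag forces every rank to query_len=1.
    if any(v.step_no_spec_required for v in votes):
        return 1
    # (b) No drafts proposed locally: follow the cross-DP maximum.
    return max(v.local_query_len for v in votes)


# An idle rank in dummy_run votes (False, 1) so the collective still completes.
idle = RankVote(step_no_spec_required=False, local_query_len=1)
```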

The added cross-DP communication is one extra int32 all_reduce per
step on the existing cpu_group, before model inference. Verified
end-to-end with DP=4 ngram spec decode over a 9-minute hot-path
bench: 5 warmup model_wrapper slots, 0 post-warmup recompiles. The
(a) collective-fallback path is not exercised by this bench
(input+output < block_size) and is planned for separate verification
via a long-output scenario or unit test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ruff-format reformatting across the three files touched by the
  preceding two commits (line wrapping / single-line conversions
  that ruff prefers).
- Add `assert batch_bucket_size is not None` in get_dp_padding's
  path-B branch so mypy can narrow the type before the
  `batch_bucket_size * max_tokens_per_req` multiplication.

No behavioral change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document why MoE DP + spec decoding needs cross-DP coordination and
how vllm-rbln handles it:

- Root cause: per-rank ngram outcomes diverge, so each DP rank's
  local decision splits between no-spec (`query_len=1`) and full-spec
  (`query_len=num_spec_tokens+1`). Local decisions alone leave the DP
  world with mismatched MoE collective shapes / forced recompiles.
- Solution: lift each rank's local decision into a single global
  decision via cross-DP collective MAX, so the model_wrapper compiles
  only for the shapes the DP world will drive together. Bit-packed
  int32 channel carries num_tokens, num_reqs, and is_prefill in a
  single all-reduce per step.
- Block-boundary edge case: legacy cross-DP collective fallback path
  (`step_no_spec_required` OR-reduce) is retained for back-compat but
  no longer fires under the sliding-window scheduler decision —
  pointer added to `docs/sliding_window.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the legacy collective binary-cap + retroactive-trim with a
per-request sliding-window mechanism. For each running decode req
whose remaining_in_block / remaining_in_maxlen budget cannot hold a
full num_spec_tokens+1 query window:

  - Compute desired_slide = max_spec_decode_len - effective_remaining
  - Trim num_scheduled_tokens to effective_remaining and re-trim
    scheduled_spec_decode_tokens to match
  - Record slide_distance in the new
    RBLNSchedulerOutput.spec_decode_slide_distance map so the runner
    can prepend that many past tokens to the query window and keep the
    full num_spec_tokens+1 shape

Reqs whose full window already fits the current block are untouched
(they run normal full spec). The mechanism assumes
block_size >= num_spec_tokens + 1; under that invariant past tokens
are always sufficient (num_computed_tokens reaches block boundary
only after at least one full block's worth of past positions has
accumulated), so we assert rather than maintain a no-spec fallback:

  assert effective_remaining >= 1
  assert desired_slide <= available_past
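
A minimal sketch of the per-request decision, under stated assumptions: `remaining_in_block` is taken as the distance to the next block boundary, `effective_remaining` as the minimum of the block and max-length budgets, `available_past` as `num_computed`, and the request is assumed to have been scheduled a full window of drafts. None of these definitions are spelled out in the commit text.

```python
def boundary_slide(
    num_computed: int,
    block_size: int,
    max_model_len: int,
    num_spec_tokens: int,
) -> tuple[int, int, int]:
    """Return (num_scheduled, slide_distance, kept_drafts) for one decode req."""
    max_spec_decode_len = num_spec_tokens + 1
    assert block_size >= max_spec_decode_len

    remaining_in_block = block_size - (num_computed % block_size)
    remaining_in_maxlen = max_model_len - num_computed
    effective_remaining = min(remaining_in_block, remaining_in_maxlen)

    # Full window fits: the request is left untouched.
    if effective_remaining >= max_spec_decode_len:
        return max_spec_decode_len, 0, num_spec_tokens

    assert effective_remaining >= 1
    desired_slide = max_spec_decode_len - effective_remaining
    available_past = num_computed  # assumption: every computed token is reusable
    assert desired_slide <= available_past

    # Trim the advance to the budget; one slot is the base token, rest are drafts.
    return effective_remaining, desired_slide, effective_remaining - 1
```

For `block_size=1024` and `num_spec_tokens=3`, a request at `num_computed=1022` advances by 2 tokens, slides back by 2, and keeps 1 draft.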

The legacy step_no_spec_required field stays on RBLNSchedulerOutput
for backward compatibility with runner-side code that still reads it
(default False; runner cleanup follows in a later commit).

The retroactive trim block and spec_decode_cap variable that
implemented the previous batch-wide collective fallback are removed.

This is scheduler-only; the runner has not yet been taught how to
consume spec_decode_slide_distance — that's the next step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire `RBLNSchedulerOutput.spec_decode_slide_distance` into
`_prepare_inputs` so the query window for boundary-affected reqs
gets prepended with past positions whose KV is already in cache.
Reqs without slide are unaffected — when the dict is empty the math
collapses back to the standard flow.

Concretely, at the top of _prepare_inputs:
- Build a per-req `slide_arr` from spec_decode_slide_distance.
- `query_lengths = num_scheduled_tokens + slide_arr` — the actual
  per-req window length the model will see (= num_spec_tokens+1 for
  boundary reqs, num_scheduled_tokens otherwise).
- `total_query_tokens = sum(query_lengths)` — the flat token count
  the runner has to materialize.
- `req_indices`, `cu_num_tokens`, `arange` are built off
  query_lengths so each req gets contiguous slots for its full
  window.
- `positions_np = (num_computed_cpu - slide_arr) + arange`, shifting
  the window backward by `slide` for boundary reqs so all positions
  land within already-allocated current-block KV slots.
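
The index math in the list above can be sketched in pure NumPy. This is an illustration of the described arithmetic, not the runner code itself; `backfilled_positions` is a hypothetical helper and its arguments are plain per-request arrays.

```python
import numpy as np


def backfilled_positions(num_computed, num_scheduled, slide):
    """Return (query_lengths, req_indices, positions) for a flat token layout."""
    num_computed = np.asarray(num_computed)
    num_scheduled = np.asarray(num_scheduled)
    slide = np.asarray(slide)

    # Window length the model actually sees for each request.
    query_lengths = num_scheduled + slide
    total_query_tokens = int(query_lengths.sum())

    # Request index of every flat token slot.
    req_indices = np.repeat(np.arange(len(query_lengths)), query_lengths)

    # Offset of each slot inside its own request window.
    cu_num_tokens = np.cumsum(query_lengths)
    window_starts = cu_num_tokens - query_lengths
    arange = np.arange(total_query_tokens) - np.repeat(window_starts, query_lengths)

    # Shift the window backward by `slide` for boundary requests.
    positions = (num_computed - slide)[req_indices] + arange
    return query_lengths, req_indices, positions
```

For a batch with one boundary request (`num_computed=1022`, 2 scheduled, slide 2) and one regular request (`num_computed=100`, 4 scheduled, slide 0), the positions come out as 1020..1023 followed by 100..103.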

Downstream tensor sizing (input_ids / positions / slot_mapping /
mrope_positions / CommonAttentionMetadata.num_actual_tokens) is
switched from `total_num_scheduled_tokens` (logical advance) to
`total_query_tokens`, and `max_num_scheduled_tokens` /
`get_dp_padding(num_tokens=...)` are likewise computed against the
sliding-aware lengths.

`block_table.compute_slot_mapping(req_indices, positions_np)` is
called with the sliding-extended (req_indices, positions_np), so
past positions resolve to their existing valid slots — past KV
gets idempotently re-written, no -1 sentinel needed.
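
A minimal sketch of why re-feeding past positions is idempotent, assuming the usual paged-KV mapping in which a slot is determined only by the request's block table and the token position. The standalone function below is an illustration, not the `block_table.compute_slot_mapping` method itself.

```python
import numpy as np


def compute_slot_mapping(block_table, req_indices, positions, block_size):
    """Map (request, position) pairs to flat KV-cache slot ids."""
    block_table = np.asarray(block_table)
    req_indices = np.asarray(req_indices)
    positions = np.asarray(positions)
    block_numbers = block_table[req_indices, positions // block_size]
    return block_numbers * block_size + positions % block_size


block_table = [[7, 9]]  # request 0 owns physical blocks 7 and 9
block_size = 4

# Slots written when positions 0 and 1 were first computed.
first_pass = compute_slot_mapping(block_table, [0, 0], [0, 1], block_size)

# A later backfilled window re-feeds positions 0 and 1 ahead of 2 and 3.
backfilled = compute_slot_mapping(block_table, [0] * 4, [0, 1, 2, 3], block_size)
```

The first two entries of `backfilled` equal `first_pass`, so the past KV entries are overwritten in place with the same values rather than redirected elsewhere.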

Cross-DP / dummy_run / spec_decode_metadata / rejection sampler are
not touched yet (Tasks #6 to #8). The legacy collective fallback path
remains in place; sliding takes over for the boundary cases the old
flag used to handle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test groups under tests/torch_compile/unit/test_sliding_window.py:

1) TestSchedulerSliding — drive RBLNScheduler.schedule() through five
   boundary-or-not configurations and assert:
   - no slide entry / unchanged num_scheduled / kept drafts when full
     window fits;
   - slide_distance == max_spec_decode_len - effective_remaining and
     num_scheduled trimmed to effective_remaining when boundary hits;
   - drafts trimmed to (effective_remaining - 1);
   - step_no_spec_required stays False (no collective fallback);
   - per-req independence (one req slides, the other does not).

2) TestRunnerSlidingMath — reimplement the per-req block at the top of
   RBLNModelRunner._prepare_inputs as pure numpy and assert that:
   - positions land at [T - slide .. T + R - 1] for boundary reqs;
   - input_ids match the corresponding token_ids_cpu entries — i.e.
     the past tokens really show up at the start of the query window;
   - no-slide case is identity-equivalent to the standard flow;
   - a mixed batch (one slide, one not) produces the expected
     concatenated layout.

Also: drop the `assert not num_speculative_tokens` placeholder in the
create_scheduler test helper. The helper already wires through
SpeculativeConfig when the arg is provided, so the assertion was the
only thing blocking spec-decode unit tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a TestSlidingLogitsIndices class that replays the per-req block of
_calc_spec_decode_metadata as pure numpy and verifies that, when
cu_num_scheduled_tokens is fed the sliding-aware query_lengths
(num_scheduled + slide), the resulting logits_indices correctly skip
the prepended past positions of every req's window.

Cases covered:
- baseline (no slide, full spec drafts) — all positions sampled,
- boundary slide=2 drafts=1 — flat positions [past,past,base,draft]
  yield logits_indices = [2, 3],
- boundary slide=3 drafts=0 — only the base logit is sampled,
- mixed batch with one full-spec req and one boundary req — past
  positions of the boundary req are skipped in the flat layout while
  the full-spec req contributes all its positions.
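
The cases above can be reproduced with a short NumPy sketch. `sampled_logits_indices` is a hypothetical helper that follows the description (the last `num_draft_tokens + 1` positions of each window are sampled); it assumes a non-empty batch.

```python
import numpy as np


def sampled_logits_indices(query_lengths, num_draft_tokens):
    """Flat indices of the positions whose logits are sampled."""
    query_lengths = np.asarray(query_lengths)
    num_sampled = np.asarray(num_draft_tokens) + 1  # base token + drafts

    cu_num_tokens = np.cumsum(query_lengths)
    cu_num_sampled = np.cumsum(num_sampled)

    # Per-request offsets 0..num_sampled-1, concatenated across requests.
    arange = np.arange(cu_num_sampled[-1]) - np.repeat(
        cu_num_sampled - num_sampled, num_sampled
    )
    # Each request samples only the tail of its window; prepended past
    # positions at the head of the window are skipped.
    return np.repeat(cu_num_tokens - num_sampled, num_sampled) + arange
```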

No new logic was needed in the runner — Task #4's switch of
cu_num_tokens from num_scheduled to query_lengths is what already
makes the math work. This test pins the invariant down so future
refactors can't silently break it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… trim

Add TestSlidingDraftTokenExtraction class verifying that the draft
tokens the rejection sampler actually validates are the post-trim
surviving drafts — not the original pre-slide proposals — and that
no past tokens leak into the draft tensor.

The extraction logic exercised matches what _calc_spec_decode_metadata
does in the runner:
    draft_token_ids = input_ids[logits_indices][target_logits_indices + 1]
With sliding-aware cu_num_tokens (= cumsum of query_lengths including
slide), this expression should pick exactly the drafts the scheduler
kept in scheduled_spec_decode_tokens.

Cases covered:
- baseline no-slide full spec — all 3 drafts extracted unchanged,
- slide=2 with 1 kept draft — only the surviving draft returned
  (note dropped drafts never enter input_ids in the first place;
  the test pins down that past-token slots aren't misread as drafts
  either),
- slide=3 no drafts — empty extraction,
- mixed batch (one full-spec, one boundary-with-1-draft) — drafts
  concatenated per-req, no contamination across req boundaries.
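
A self-contained NumPy sketch of the extraction expression quoted above. The helper names are illustrative assumptions; only the final indexing expression comes from the commit text, and the surrounding index construction is an assumption modeled on it.

```python
import numpy as np


def _cumsum_and_arange(counts):
    """Return cumsum(counts) and per-request offsets 0..count-1, concatenated."""
    cu = np.cumsum(counts)
    total = int(cu[-1]) if len(cu) else 0
    return cu, np.arange(total) - np.repeat(cu - counts, counts)


def extract_draft_token_ids(input_ids, query_lengths, num_draft_tokens):
    """Pick the surviving draft tokens out of the flat backfilled input."""
    input_ids = np.asarray(input_ids)
    query_lengths = np.asarray(query_lengths)
    num_draft_tokens = np.asarray(num_draft_tokens)
    num_sampled = num_draft_tokens + 1

    cu_num_tokens = np.cumsum(query_lengths)
    cu_num_sampled, sampled_arange = _cumsum_and_arange(num_sampled)
    logits_indices = (
        np.repeat(cu_num_tokens - num_sampled, num_sampled) + sampled_arange
    )

    _, draft_arange = _cumsum_and_arange(num_draft_tokens)
    target_logits_indices = (
        np.repeat(cu_num_sampled - num_sampled, num_draft_tokens) + draft_arange
    )
    return input_ids[logits_indices][target_logits_indices + 1]
```

For a window `[past, past, base, draft] = [11, 12, 13, 99]` with one kept draft, the function returns `[99]`; the two past tokens never reach the draft tensor.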

No runner-side or rejection-sampler code change was needed for
Task #7. Scheduler-side trim (Task #3) propagates through the
num_draft_tokens / cu_num_draft_tokens / logits_indices machinery
automatically because Task #4 already switched cu_num_tokens to be
sliding-aware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add TestSlidingEdgeCases class to pin down the three guards that keep
the sliding-window decision a no-op outside its intended scope:

1) num_speculative_tokens=None — the entire sliding block in scheduler
   is gated on self.num_spec_tokens > 0. Verify that disabling spec
   leaves spec_decode_slide_distance empty and the req runs as
   standard single-token decode.

2) Prefill-phase req — the sliding block is also gated on
   `not is_prefill(request)`. A req still in prefill must never get a
   slide entry even if its prompt length places it near a block
   boundary.

3) Decode req comfortably mid-block — when remaining_in_block is much
   larger than num_spec_tokens+1, the boundary condition
   `effective_remaining < max_spec_decode_len` is False and the req
   must run at full num_scheduled with no slide entry.

These regressions would silently corrupt non-spec or prefill flows so
they're worth pinning down explicitly. Total sliding-window test
count is now 20 (3 + 5 + 4 + 4 + 4 across the five classes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ility

Add an INFO log line in the per-req boundary-detected branch of
RBLNScheduler.schedule(). Fires only when a req's effective_remaining
drops below num_spec_tokens + 1 and we record a slide_distance —
i.e., exactly when the new sliding-window mechanism is exercised.

Rate is bounded by the boundary-hit frequency (~num_spec_tokens /
block_size per req per step, ≈0.3% for the typical 1024/3 config),
so this is workload-cheap noise while end-to-end runs are validating
the path. Each line carries num_computed, remaining_in_block,
remaining_in_maxlen, slide_distance, the resulting advance, and the
count of drafts kept — enough to reconstruct the per-req scheduler
decision from server.log without enabling DEBUG.
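
The rate estimate can be checked with a few lines of arithmetic. The sketch counts the in-block offsets at which a full `num_spec_tokens + 1` window no longer fits; treating every offset as equally likely is a simplifying assumption.

```python
def boundary_hit_rate(num_spec_tokens: int, block_size: int) -> float:
    """Fraction of in-block offsets where the full spec window does not fit."""
    window = num_spec_tokens + 1
    hits = sum(1 for offset in range(block_size) if block_size - offset < window)
    return hits / block_size


rate = boundary_hit_rate(num_spec_tokens=3, block_size=1024)  # 3 / 1024
```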

May be downgraded to logger.debug after sliding has been validated in
production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_prepare_inputs` prepends past tokens to a sliding decode req's query
window, so the flat input layout grows by sum(slide_distance) beyond
scheduler_output.total_num_scheduled_tokens. Three downstream sites
sliced on the pre-sliding count and triggered an IndexError once a real
boundary fired:

- `_preprocess.num_input_tokens` lost the past-token suffix it should
  have read.
- The outer `execute_model`'s `num_tokens_unpadded` (feeding
  `_get_slot_mappings`) skipped past slots.
- `pad_speculative_draft_tokens` + the `unpadded_to_padded` remap used
  num_scheduled_tokens[req] only, while `logits_indices` already
  carried indices into the (scheduled + slide) layout.

Each site now adds the per-req or aggregate slide_distance pulled from
spec_decode_slide_distance, restoring the slide-aware invariant the
runner's own input building relies on.
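
A minimal sketch of the slide-aware count the three sites now share. `total_query_tokens` is a hypothetical helper; the two dictionaries mirror the per-request maps named in the commit text.

```python
def total_query_tokens(
    num_scheduled_tokens: dict[str, int],
    spec_decode_slide_distance: dict[str, int],
) -> int:
    """Flat input length: scheduled tokens plus prepended past tokens."""
    return sum(
        n + spec_decode_slide_distance.get(req_id, 0)
        for req_id, n in num_scheduled_tokens.items()
    )


scheduled = {"req-a": 2, "req-b": 4}
slides = {"req-a": 2}  # only req-a hit a boundary

pre_sliding_count = sum(scheduled.values())               # 6
slide_aware_count = total_query_tokens(scheduled, slides)  # 8
```

Slicing a length-8 flat layout with the pre-sliding count of 6 drops the last two slots, which is the kind of mismatch the three fixes address.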

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small polish items on the sliding-window scheduler decision:

- Replace the legacy "Full-spec-or-no-spec binary cap" comment block
  with one that describes the actual sliding-window design (slide
  query window backward, idempotent KV re-write, drafts that would
  cross the boundary get trimmed). The old wording survived from the
  collective-fallback approach and no longer matches the code.
- Extend the sliding info log with a `proposed_drafts` field
  (= old_n - 1 = what ngram returned before the trim). Pair with the
  existing `kept_drafts` cap and the runner-side `num_draft_tokens` to
  expose draft drops directly in logs: a sliding step drops
  proposed_drafts - kept_drafts drafts.

While here, collapse the redundant `if not is_prefill: if num_spec > 0`
pair that ruff (SIM102) flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an env-gated diagnostic that dumps the full per-step trace for
the first N sliding requests on each worker. Off by default
(VLLM_RBLN_SLIDING_TRACE_REQS=0), so production runs incur no logging
cost. When enabled, every sliding event for a tracked request emits
positions, input_ids, slot block ids, the logits indices that survived
past-position exclusion, and the num_draft_tokens reaching the
rejection sampler. CPU host code in `_prepare_inputs`; never traced
into the compiled model graph.
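
One way such an env-gated tracker could look, as a sketch only. `SlidingTracer` and `trace_limit` are hypothetical names; the environment variable and its default of 0 come from the commit text.

```python
import os
from typing import Mapping, Optional


def trace_limit(env: Optional[Mapping[str, str]] = None) -> int:
    """Read how many sliding requests to trace (0 disables tracing)."""
    env = os.environ if env is None else env
    return int(env.get("VLLM_RBLN_SLIDING_TRACE_REQS", "0"))


class SlidingTracer:
    """Track the first `limit` request ids and keep tracing them afterwards."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.tracked: set[str] = set()

    def should_trace(self, req_id: str) -> bool:
        if req_id in self.tracked:
            return True
        if len(self.tracked) < self.limit:
            self.tracked.add(req_id)
            return True
        return False
```

Keeping already-tracked ids eligible is what lets the log show the full per-request sequence rather than only the first event.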

Captures the FULL per-req sequence (1021 → 1022 → 1023 boundary
events for the same req) so the timeline can be reconstructed from
contiguous log lines — useful for verifying past-prepend / slot
range / logits exclusion / effective drafts together against the
scheduler's spec-decode sliding log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk through what the spec-decode sliding window does, how the
scheduler and runner instrument it, and what a real `vllm bench serve`
run shows. Three case studies pulled from MiniMax-M2.5 traces:

- ngram-miss boundary traversal (slide 1 → 2 → 3 for one req), proving
  past tokens are idempotently re-fed, slot mapping stays inside the
  current block, past logits are excluded, and num_draft_tokens
  reflects what the rejection sampler actually sees.
- ngram-hit with kept drafts (slide=1, 2 drafts kept).
- ngram-hit with drafts dropped by sliding (slide=1/2/3, dropping
  1/2/3 drafts respectively when ngram fills the proposal cap) —
  the central design payoff.

Also documents how to reproduce (env vars, bench command, log greps)
and notes about the same-num_computed log artifact observed during
stress testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sliding window previously fired only at block boundaries; off the
boundary the runtime query length still followed the proposer's
output (1 base + up to num_spec_tokens drafts), and ranks with no
drafts dispatched a separate (8, 1) no_spec shape. Cross-DP MAX
papered over per-rank divergence at runtime, so the no_spec compile
slot was warmed up but almost never used in DP setups.

Make the lift explicit and local: the scheduler always pads each
decode req's query window to num_spec_tokens + 1 via slide_distance,
regardless of how many drafts the proposer returned or whether a
boundary is in sight. Variable-length proposers (ngram, suffix
decoding) and fixed-length proposers (MTP, EAGLE) both converge to
the same runtime shape; boundary squeeze is one special case of the
same padding rule.

Changes:

- Scheduler: replace boundary-only sliding with `new_n = min(old_n,
  effective_remaining)` and `desired_slide = max_spec_decode_len -
  new_n`. Fires whenever the deficit is non-zero (length pad,
  boundary squeeze, or both); trims drafts only when boundary
  actually shortened the advance.
- Runner warmup: drop the (8, 1) no_spec model_wrapper compile —
  `query_len_range = [num_spec_tokens + 1]` only. Saves two compile
  slots in MoE warmup.
- Runner runtime: `spec_decode_max_query_len` is unconditionally
  num_spec_tokens + 1 when spec is configured. Cross-DP MAX is now a
  no-op (every rank votes the same value) but kept for shape
  uniformity guarantees.
- Runner dummy_run: idle DP ranks now also report num_spec_tokens + 1
  and expand their dummy input to (bucket, num_spec_tokens + 1).
  Without this an all-idle step (all 4 DP ranks dummy_run) would
  collapse to (bucket, 1) and trigger a hot-path model_wrapper
  recompile against the dropped no_spec slot.
- Tests: add `TestSlidingVariableLengthPadding` scheduler cases
  (zero / partial drafts off the boundary, partial drafts at the
  boundary) and runner-math cases that mirror the local-only path
  cross-DP MAX used to provide.
- Docs: rewrite `docs/sliding_window.md` opening to describe the
  unified design and update the scheduler log field semantics.
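
The unified scheduler rule can be sketched as a single function. `backfill_decision` is a hypothetical name, and treating `old_n` as one base token plus the proposed drafts is an assumption based on the log-field description elsewhere in this PR.

```python
def backfill_decision(
    old_n: int, effective_remaining: int, num_spec_tokens: int
) -> tuple[int, int, int]:
    """Return (new_n, desired_slide, kept_drafts) for one decode request.

    old_n is the advance the proposer asked for: 1 base token + drafts.
    """
    max_spec_decode_len = num_spec_tokens + 1
    new_n = min(old_n, effective_remaining)      # boundary squeeze, if any
    desired_slide = max_spec_decode_len - new_n  # length pad and/or squeeze
    kept_drafts = new_n - 1                      # trimmed only when squeezed
    return new_n, desired_slide, kept_drafts
```

In every case `new_n + desired_slide` equals `num_spec_tokens + 1`, so the runtime query length is constant: a request with no drafts off the boundary yields `(1, 3, 0)`, and a full proposal squeezed to 2 remaining slots yields `(2, 2, 1)`.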

Verified end-to-end on MiniMax-M2.5 / DP=4 / EP=4 / num_spec=3 with
a 128-prompt bench (input 512, output 1500, rps=4):
- 0 bench-time model_wrapper recompiles (was 2 before)
- 0 errors, 128/128 successful
- 38140 sliding events, 93% slide=3 (ngram-miss length-pad path)
- Output throughput 571 tok/s (was 376 tok/s, +52%)
- ITL p99 144ms (was 1500ms+, bimodal collapsed)
- Acceptance rate 38.9% (was 32.6%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rebel-wonsubkim and others added 21 commits May 15, 2026 15:02
…e trace logs to DEBUG

- test_sliding_window.py -> test_query_backfill.py (+ 6 class names)
- log prefix: "spec-decode sliding" -> "spec-decode backfill" (scheduler)
- log prefix: "sliding-trace" -> "backfill-trace" (runner)
- env var: VLLM_RBLN_SLIDING_TRACE_REQS -> VLLM_RBLN_BACKFILL_TRACE_REQS
- demote scheduler + runner per-step diagnostics from INFO to DEBUG
- identifiers (slide_distance, _run_sliding_math, etc.) kept for stability
  with naming-note docstrings explaining the equivalence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backfill removes the boundary cap + retroactive trim mechanism these
7 tests guarded. Keep at_block_boundary, no_spec_tokens_no_retroactive_trim,
prefill_triggers_no_mixed_batching (general scheduler invariants).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- new 3-section layout: Problem / Key idea / Example + Appendix
- query_backfill.md: 343 -> 199 lines (running trace preserved)
- cross_dp_spec_decode.md: 149 -> 145 lines
- add naming-note: "sliding window" == "query backfill" (equivalent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-commit CI flagged this hunk for reformatting (single-line ternary).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 participants