model: support deepseek v3 by rebel-kblee · Pull Request #624 · RBLN-SW/vllm-rbln

rebel-kblee · 2026-05-26T07:37:21Z

🚀 Summary of Changes

Support DeepSeek V3 with the MTP module.

Currently only k=1 is supported.
k>1 (TODO).
Check e2e verification once weight free is fixed.
MTP support is based on the mechanism from feature(spec_dec): implement spec decode backfill for fixed length drafts (full_spec only) #604.

📌 Related Issues / Tickets

Resolves #
Related to #

✅ Type of Change

🚀 Release (release)
✨ Feature (feature)
🧠 Model support (model)
🧬 Core engine changes (core)
🛠 Bug fix (fix)
⚙️ Performance improvement (perf)
🔁 Refactor or code cleanup (refactor)
📄 Documentation (docs)
❓ Other (other): please describe

📋 Checklist

PR title follows Conventional Commits format
This PR is linked to an existing issue
The test method is described, and the expected result is clearly stated
Relevant documentation has been updated (if applicable)

💬 Notes

get_dp_padding called find_decode_batch_bucket with the max total tokens across DP, which only equals the batch size for single-token decode. For multi-token decode (e.g., speculative decoding with batch=8 reqs and 2 tokens/req → 16 tokens), the bucket lookup tried to resolve a batch bucket >= 16 and failed against typical decode_batch_buckets like [1, 4, 8]. Pack num_tokens, num_reqs, and is_prefill into a single bit-packed int32 (num_tokens in low 16 bits, num_reqs in bits 16..29, is_prefill in bit 30) so a single all-reduce surfaces both per-rank token counts and per-rank batch sizes. Use max(num_reqs) across DP for the decode bucket lookup, and pad to batch_bucket_size * max_tokens_per_req so the MoE max_pads_across_dp buffer fits every actual token position even under multi-token decode. Single-token decode is unchanged (max_tokens_per_req=1 reduces to the previous batch_bucket_size). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
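
The bit layout and padding rule described in this commit can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `pack_dp_state`, `unpack_dp_state`, `find_bucket`, and `decode_padding` are hypothetical names, and the list of per-rank packed values stands in for whatever the all-reduce makes visible on each rank. Only the field positions (low 16 bits, bits 16..29, bit 30) come from the commit message.

```python
# Hypothetical helpers; only the bit layout is taken from the commit message.
NUM_TOKENS_MASK = 0xFFFF      # num_tokens lives in bits 0..15
NUM_REQS_SHIFT = 16           # num_reqs lives in bits 16..29
NUM_REQS_MASK = 0x3FFF        # 14 bits
IS_PREFILL_SHIFT = 30         # is_prefill lives in bit 30


def pack_dp_state(num_tokens, num_reqs, is_prefill):
    """Pack the three per-rank values into one non-negative int32."""
    assert 0 <= num_tokens <= NUM_TOKENS_MASK
    assert 0 <= num_reqs <= NUM_REQS_MASK
    return (
        num_tokens
        | (num_reqs << NUM_REQS_SHIFT)
        | (int(is_prefill) << IS_PREFILL_SHIFT)
    )


def unpack_dp_state(packed):
    """Inverse of pack_dp_state: returns (num_tokens, num_reqs, is_prefill)."""
    num_tokens = packed & NUM_TOKENS_MASK
    num_reqs = (packed >> NUM_REQS_SHIFT) & NUM_REQS_MASK
    is_prefill = bool((packed >> IS_PREFILL_SHIFT) & 1)
    return num_tokens, num_reqs, is_prefill


def find_bucket(buckets, n):
    """Smallest bucket >= n (assumed lookup semantics); raises if none fits."""
    for bucket in sorted(buckets):
        if bucket >= n:
            return bucket
    raise ValueError(f"no bucket >= {n} in {sorted(buckets)}")


def decode_padding(per_rank_packed, buckets, max_tokens_per_req):
    """Look up the bucket by max(num_reqs), then scale by tokens per request."""
    max_reqs = max(unpack_dp_state(p)[1] for p in per_rank_packed)
    return find_bucket(buckets, max_reqs) * max_tokens_per_req


# One rank decodes 8 reqs at 2 tokens/req (16 tokens); a peer decodes 1 req.
ranks = [pack_dp_state(16, 8, False), pack_dp_state(2, 1, False)]
padding = decode_padding(ranks, [1, 4, 8], max_tokens_per_req=2)
```

With the buckets `[1, 4, 8]`, looking up the token count 16 directly would raise, which mirrors the failure described above; looking up `max(num_reqs) = 8` succeeds and the padded size becomes 8 * 2 = 16.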

Two related fixes that together eliminate hot-path model_wrapper recompiles seen when one DP rank runs spec decode while peers are idle or single-token decoding:
- Plumb cross-DP max-tokens-per-request through get_dp_padding / _prepare_inputs (4-tuple / 10-tuple returns). Every rank reports a local query length (1 when local has no drafts) so the all-reduce branch is taken uniformly across the DP group, and callers can pad to the cross-DP max regardless of local draft state.
- Gate spec-decode padding in execute_model on max_tokens_per_req_across_dp > 1 instead of local scheduled_spec_decode_tokens. Previously a rank with no local drafts skipped padding even when peers raised max_pads_across_dp, producing an (input_ids[1], max_pads) tuple that did not match any warmup compile slot.
- Reshape dummy_run input_ids/positions from (bucket, 1) to (bucket, query_len) when the cross-DP max exceeds 1. The pre-baked dummy state is (bucket, 1); idle ranks now mirror the spec-mode shape peers expect.
Verified end-to-end with vllm bench serve under DP=4 + ngram spec decode (num_speculative_tokens=3): warmup creates 7 compile slots (prefill + padded-decode/decode-only for q in {1,2,4}) and no model_wrapper recompile is triggered during the ~9 minutes of bench traffic that follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runtime query length is now exactly 1 (no spec) or num_spec_tokens + 1 (full spec), never an intermediate value. This eliminates the non-pow2 query_len edge cases that previously forced hot-path model_wrapper recompiles and reduces the warmup compile slot count from 7 to 5 (prefill + padded/decode-only at q=1, q=4).
Runner (vllm_rbln/v1/worker/rbln_model_runner.py):
- __init__: assert (num_speculative_tokens + 1) is a power of two when spec is configured. Required for MoE multicast's max_pads / num_tokens divisibility.
- _warm_up_model_inner: query_len_range = [1, num_spec_tokens + 1] (was the full pow2 sequence up to that maximum).
- execute_model: spec_decode_max_query_len is simply num_spec_tokens+1 when this rank has any draft tokens this step, else 1. The pow2-round-up logic for intermediate sizes is gone.
Scheduler (vllm_rbln/v1/core/rbln_scheduler.py):
- spec_decode_cap update becomes a per-request binary block-boundary decision: max_spec_decode_len if remaining_in_block and remaining_in_maxlen can hold a full window, else 1. The retroactive trim then aligns every scheduled req's num_scheduled_tokens onto the same {1, num_spec_tokens+1} shape so the runner-side pad never writes past anyone's block boundary.
- RBLNSchedulerOutput grows a `step_no_spec_required: bool` field set True only when the binary cap was forced to 1 by the boundary check (distinct from "no drafts proposed this step", which leaves it False).
Cross-DP collective handling distinguishes two reasons for a local no-spec state and treats them differently:
- (a) boundary-induced (`step_no_spec_required=True` on some rank): OR-reduce the flag across DP. On True, every rank scrubs its scheduler_output (clears drafts, sets num_scheduled=1, recomputes totals) before _prepare_inputs builds the input tensors. This keeps the model_wrapper input shape uniform at query_len=1 across DP and prevents pad-position KV writes past any rank's block boundary.
- (b) no-drafts-proposed (`step_no_spec_required=False`, local scheduled_spec_decode_tokens empty): keep the existing cross-DP MAX behavior so peers that do have drafts can run full spec. The no-drafts rank gets padded to peers' query_len; pad positions land on lookahead-allocated slots, which is functionally safe (their outputs are discarded by the rejection sampler).
- dummy_run also participates in the new OR-reduce (voting 0) so the host-side gloo all_reduce doesn't hang when one DP rank is idle while peers run execute_model.
The added cross-DP communication is one extra int32 all_reduce per step on the existing cpu_group, before model inference.
Verified end-to-end with DP=4 ngram spec decode over a 9-minute hot-path bench: 5 warmup model_wrapper slots, 0 post-warmup recompiles. The (a) collective-fallback path is not exercised by this bench (input+output < block_size) and is planned for separate verification via a long-output scenario or unit test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
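
The binary cap and the boundary-induced OR-reduce from this commit can be illustrated with a small sketch. The function names are hypothetical, the "budget" is assumed to be the minimum of `remaining_in_block` and `remaining_in_maxlen`, and the cross-DP reduce is modeled as `any()` over a list of per-rank flags rather than a real collective.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; used for the (num_spec_tokens + 1) check."""
    return n > 0 and (n & (n - 1)) == 0


def binary_spec_cap(remaining_in_block, remaining_in_maxlen, num_spec_tokens):
    """Return (cap, step_no_spec_required) for one request.

    cap is num_spec_tokens + 1 when a full window fits, else 1.
    The flag is True only when the boundary check forced the cap to 1.
    """
    max_spec_decode_len = num_spec_tokens + 1
    assert is_power_of_two(max_spec_decode_len)
    # Assumption: the usable budget is the tighter of the two limits.
    fits = min(remaining_in_block, remaining_in_maxlen) >= max_spec_decode_len
    return (max_spec_decode_len if fits else 1), (not fits)


def or_reduce_across_dp(per_rank_flags):
    """Stand-in for the cross-DP OR-reduce; idle ranks vote False."""
    return any(per_rank_flags)


# Rank 0 is mid-block, rank 1 has 2 slots left in its block, rank 2 is idle.
cap0, flag0 = binary_spec_cap(500, 4096, num_spec_tokens=3)
cap1, flag1 = binary_spec_cap(2, 4096, num_spec_tokens=3)
all_ranks_drop_spec = or_reduce_across_dp([flag0, flag1, False])
```

In this example rank 1 cannot hold a 4-token window, so the reduced flag is True and every rank would fall back to query_len=1 for the step, which is the case (a) behavior described above.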

- ruff-format reformatting across the three files touched by the preceding two commits (line wrapping / single-line conversions that ruff prefers).
- Add `assert batch_bucket_size is not None` in get_dp_padding's path-B branch so mypy can narrow the type before the `batch_bucket_size * max_tokens_per_req` multiplication.
No behavioral change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Document why MoE DP + spec decoding needs cross-DP coordination and how vllm-rbln handles it:
- Root cause: per-rank ngram outcomes diverge, so each DP rank's local decision splits between no-spec (`query_len=1`) and full-spec (`query_len=num_spec_tokens+1`). Local decisions alone leave the DP world with mismatched MoE collective shapes / forced recompiles.
- Solution: lift each rank's local decision into a single global decision via cross-DP collective MAX, so the model_wrapper compiles only for the shapes the DP world will drive together. Bit-packed int32 channel carries num_tokens, num_reqs, and is_prefill in a single all-reduce per step.
- Block-boundary edge case: legacy cross-DP collective fallback path (`step_no_spec_required` OR-reduce) is retained for back-compat but no longer fires under the sliding-window scheduler decision (pointer added to `docs/sliding_window.md`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the legacy collective binary-cap + retroactive-trim with a per-request sliding-window mechanism. For each running decode req whose remaining_in_block / remaining_in_maxlen budget cannot hold a full num_spec_tokens+1 query window:
- Compute desired_slide = max_spec_decode_len - effective_remaining
- Trim num_scheduled_tokens to effective_remaining and re-trim scheduled_spec_decode_tokens to match
- Record slide_distance in the new RBLNSchedulerOutput.spec_decode_slide_distance map so the runner can prepend that many past tokens to the query window and keep the full num_spec_tokens+1 shape
Reqs whose full window already fits the current block are untouched (they run normal full spec).
The mechanism assumes block_size >= num_spec_tokens + 1; under that invariant past tokens are always sufficient (num_computed_tokens reaches block boundary only after at least one full block's worth of past positions has accumulated), so we assert rather than maintain a no-spec fallback:
    assert effective_remaining >= 1
    assert desired_slide <= available_past
The legacy step_no_spec_required field stays on RBLNSchedulerOutput for backward compatibility with runner-side code that still reads it (default False; runner cleanup follows in a later commit). The retroactive trim block and spec_decode_cap variable that implemented the previous batch-wide collective fallback are removed.
This is scheduler-only; the runner has not yet been taught how to consume spec_decode_slide_distance. That's the next step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
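
The per-request boundary decision in this commit can be sketched as a pure function. `boundary_slide` is a hypothetical name, `effective_remaining` is assumed to be the minimum of `remaining_in_block` and `remaining_in_maxlen`, and the example assumes the request arrives with a full window scheduled (1 base token + num_spec_tokens drafts).

```python
def boundary_slide(num_scheduled, remaining_in_block, remaining_in_maxlen,
                   num_spec_tokens, available_past):
    """Return (new_num_scheduled, kept_drafts, slide_distance) for one req."""
    max_spec_decode_len = num_spec_tokens + 1
    # Assumption: the budget is the tighter of the block and max-length limits.
    effective_remaining = min(remaining_in_block, remaining_in_maxlen)

    if effective_remaining >= max_spec_decode_len:
        # Full window fits: request is untouched, no slide entry is recorded.
        return num_scheduled, num_scheduled - 1, 0

    # The same two invariants the commit asserts instead of a no-spec fallback.
    assert effective_remaining >= 1
    desired_slide = max_spec_decode_len - effective_remaining
    assert desired_slide <= available_past

    # Advance only as far as the budget allows; one slot is the base token,
    # the rest are drafts that survive the trim.
    return effective_remaining, effective_remaining - 1, desired_slide


# block_size = 1024 and 1022 tokens already computed: 2 slots left in block.
result = boundary_slide(
    num_scheduled=4,
    remaining_in_block=1024 - 1022,
    remaining_in_maxlen=3000,
    num_spec_tokens=3,
    available_past=1022,
)
# result == (2, 1, 2): advance 2 tokens, keep 1 draft, prepend 2 past tokens
```

The query window the runner eventually sees is `new_num_scheduled + slide_distance`, which stays at num_spec_tokens + 1 in the boundary case.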

Wire `RBLNSchedulerOutput.spec_decode_slide_distance` into `_prepare_inputs` so the query window for boundary-affected reqs gets prepended with past positions whose KV is already in cache. Reqs without slide are unaffected: when the dict is empty the math collapses back to the standard flow.
Concretely, at the top of _prepare_inputs:
- Build a per-req `slide_arr` from spec_decode_slide_distance.
- `query_lengths = num_scheduled_tokens + slide_arr`: the actual per-req window length the model will see (= num_spec_tokens+1 for boundary reqs, num_scheduled_tokens otherwise).
- `total_query_tokens = sum(query_lengths)`: the flat token count the runner has to materialize.
- `req_indices`, `cu_num_tokens`, `arange` are built off query_lengths so each req gets contiguous slots for its full window.
- `positions_np = (num_computed_cpu - slide_arr) + arange`, shifting the window backward by `slide` for boundary reqs so all positions land within already-allocated current-block KV slots.
Downstream tensor sizing (input_ids / positions / slot_mapping / mrope_positions / CommonAttentionMetadata.num_actual_tokens) is switched from `total_num_scheduled_tokens` (logical advance) to `total_query_tokens`, and `max_num_scheduled_tokens` / `get_dp_padding(num_tokens=...)` are likewise computed against the sliding-aware lengths.
`block_table.compute_slot_mapping(req_indices, positions_np)` is called with the sliding-extended (req_indices, positions_np), so past positions resolve to their existing valid slots: past KV gets idempotently re-written, no -1 sentinel needed.
Cross-DP / dummy_run / spec_decode_metadata / rejection sampler are not touched yet (Tasks #6–#8). The legacy collective fallback path remains in place; sliding takes over for the boundary cases the old flag used to handle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
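
The index math listed in this commit can be reproduced without any tensors. The sketch below uses plain Python lists instead of numpy so it runs anywhere; `build_query_layout` is a hypothetical name, and the inputs are per-request lists in batch order rather than the runner's real buffers.

```python
def build_query_layout(num_scheduled_tokens, num_computed, slide_distance):
    """Flatten per-request query windows, shifting sliding reqs backward.

    Returns (query_lengths, total_query_tokens, req_indices, positions).
    """
    # Window length the model sees: logical advance plus prepended past tokens.
    query_lengths = [
        n + s for n, s in zip(num_scheduled_tokens, slide_distance)
    ]
    total_query_tokens = sum(query_lengths)

    req_indices = []
    positions = []
    for req, qlen in enumerate(query_lengths):
        # Start `slide` positions before the first not-yet-computed position.
        start = num_computed[req] - slide_distance[req]
        for offset in range(qlen):
            req_indices.append(req)
            positions.append(start + offset)
    return query_lengths, total_query_tokens, req_indices, positions


# req 0: full spec window mid-block; req 1: 2 slots left in a 1024-token block.
layout = build_query_layout(
    num_scheduled_tokens=[4, 2],
    num_computed=[100, 1022],
    slide_distance=[0, 2],
)
query_lengths, total_query_tokens, req_indices, positions = layout
# positions == [100, 101, 102, 103, 1020, 1021, 1022, 1023]
```

For the boundary request the window covers positions 1020..1023, all inside the current block (0..1023 for block_size 1024, an assumed value), while the non-sliding request keeps the standard layout.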

Two test groups under tests/torch_compile/unit/test_sliding_window.py:
1) TestSchedulerSliding: drive RBLNScheduler.schedule() through five boundary-or-not configurations and assert:
   - no slide entry / unchanged num_scheduled / kept drafts when full window fits;
   - slide_distance == max_spec_decode_len - effective_remaining and num_scheduled trimmed to effective_remaining when boundary hits;
   - drafts trimmed to (effective_remaining - 1);
   - step_no_spec_required stays False (no collective fallback);
   - per-req independence (one req slides, the other does not).
2) TestRunnerSlidingMath: reimplement the per-req block at the top of RBLNModelRunner._prepare_inputs as pure numpy and assert that:
   - positions land at [T - slide .. T + R - 1] for boundary reqs;
   - input_ids match the corresponding token_ids_cpu entries, i.e. the past tokens really show up at the start of the query window;
   - no-slide case is identity-equivalent to the standard flow;
   - a mixed batch (one slide, one not) produces the expected concatenated layout.
Also: drop the `assert not num_speculative_tokens` placeholder in the create_scheduler test helper. The helper already wires through SpeculativeConfig when the arg is provided, so the assertion was the only thing blocking spec-decode unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a TestSlidingLogitsIndices class that replays the per-req block of _calc_spec_decode_metadata as pure numpy and verifies that, when cu_num_scheduled_tokens is fed the sliding-aware query_lengths (num_scheduled + slide), the resulting logits_indices correctly skip the prepended past positions of every req's window.
Cases covered:
- baseline (no slide, full spec drafts): all positions sampled,
- boundary slide=2 drafts=1: flat positions [past,past,base,draft] yield logits_indices = [2, 3],
- boundary slide=3 drafts=0: only the base logit is sampled,
- mixed batch with one full-spec req and one boundary req: past positions of the boundary req are skipped in the flat layout while the full-spec req contributes all its positions.
No new logic was needed in the runner. Task #4's switch of cu_num_tokens from num_scheduled to query_lengths is what already makes the math work. This test pins the invariant down so future refactors can't silently break it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
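
The cases above follow from one rule: each request samples the last `num_draft_tokens + 1` positions of its own slice in the flat layout. A minimal list-based sketch of that rule, under the assumption that this matches what `_calc_spec_decode_metadata` computes (`sliding_logits_indices` is a hypothetical name):

```python
def sliding_logits_indices(query_lengths, num_draft_tokens):
    """Flat indices whose logits are sampled, skipping prepended past tokens."""
    indices = []
    cu_num_tokens = 0  # running cumulative sum of sliding-aware lengths
    for qlen, n_draft in zip(query_lengths, num_draft_tokens):
        cu_num_tokens += qlen
        num_sampled = n_draft + 1  # drafts plus the base token
        # The sampled positions are the tail of this request's window.
        indices.extend(range(cu_num_tokens - num_sampled, cu_num_tokens))
    return indices


baseline = sliding_logits_indices([4], [3])         # no slide, 3 drafts
boundary = sliding_logits_indices([4], [1])         # slide=2, 1 draft kept
base_only = sliding_logits_indices([4], [0])        # slide=3, no drafts
mixed = sliding_logits_indices([4, 4], [3, 1])      # full spec + boundary req
```

Here `boundary` evaluates to `[2, 3]`, matching the `[past,past,base,draft]` case in the commit message, and in `mixed` the second request contributes only flat positions 6 and 7.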

… trim
Add TestSlidingDraftTokenExtraction class verifying that the draft tokens the rejection sampler actually validates are the post-trim surviving drafts, not the original pre-slide proposals, and that no past tokens leak into the draft tensor.
The extraction logic exercised matches what _calc_spec_decode_metadata does in the runner:
    draft_token_ids = input_ids[logits_indices][target_logits_indices + 1]
With sliding-aware cu_num_tokens (= cumsum of query_lengths including slide), this expression should pick exactly the drafts the scheduler kept in scheduled_spec_decode_tokens.
Cases covered:
- baseline no-slide full spec: all 3 drafts extracted unchanged,
- slide=2 with 1 kept draft: only the surviving draft returned (note dropped drafts never enter input_ids in the first place; the test pins down that past-token slots aren't misread as drafts either),
- slide=3 no drafts: empty extraction,
- mixed batch (one full-spec, one boundary-with-1-draft): drafts concatenated per-req, no contamination across req boundaries.
No runner-side or rejection-sampler code change was needed for Task #7. Scheduler-side trim (Task #3) propagates through the num_draft_tokens / cu_num_draft_tokens / logits_indices machinery automatically because Task #4 already switched cu_num_tokens to be sliding-aware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
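
The extraction expression quoted in this commit can be replayed on plain lists. In this sketch `extract_draft_tokens` is a hypothetical name, `logits_indices` is passed in precomputed, and `target_logits_indices` is assumed to index every sampled slot of a request except its last one, so that `+ 1` lands on the draft tokens.

```python
def extract_draft_tokens(input_ids, logits_indices, num_draft_tokens):
    """List form of input_ids[logits_indices][target_logits_indices + 1]."""
    sampled = [input_ids[i] for i in logits_indices]

    # Each request owns (num_draft + 1) consecutive entries of `sampled`;
    # the first num_draft of them are its target-logit slots.
    target_logits_indices = []
    offset = 0
    for n_draft in num_draft_tokens:
        target_logits_indices.extend(range(offset, offset + n_draft))
        offset += n_draft + 1

    return [sampled[t + 1] for t in target_logits_indices]


# Boundary req: window is [past, past, base, draft]; token ids are made up.
boundary_drafts = extract_draft_tokens(
    input_ids=[11, 12, 13, 99], logits_indices=[2, 3], num_draft_tokens=[1]
)

# Mixed batch: full-spec req (base 20, drafts 21..23) then the boundary req.
mixed_drafts = extract_draft_tokens(
    input_ids=[20, 21, 22, 23, 11, 12, 13, 99],
    logits_indices=[0, 1, 2, 3, 6, 7],
    num_draft_tokens=[3, 1],
)
```

`boundary_drafts` is `[99]`: the two prepended past tokens (11, 12) and the base token (13) never appear as drafts, which is the property the test class pins down.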

Add TestSlidingEdgeCases class to pin down the three guards that keep the sliding-window decision a no-op outside its intended scope:
1) num_speculative_tokens=None: the entire sliding block in scheduler is gated on self.num_spec_tokens > 0. Verify that disabling spec leaves spec_decode_slide_distance empty and the req runs as standard single-token decode.
2) Prefill-phase req: the sliding block is also gated on `not is_prefill(request)`. A req still in prefill must never get a slide entry even if its prompt length places it near a block boundary.
3) Decode req comfortably mid-block: when remaining_in_block is much larger than num_spec_tokens+1, the boundary condition `effective_remaining < max_spec_decode_len` is False and the req must run at full num_scheduled with no slide entry.
These regressions would silently corrupt non-spec or prefill flows so they're worth pinning down explicitly. Total sliding-window test count is now 20 (3 + 5 + 4 + 4 + 4 across the five classes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ility Add an INFO log line in the per-req boundary-detected branch of RBLNScheduler.schedule(). Fires only when a req's effective_remaining drops below num_spec_tokens + 1 and we record a slide_distance — i.e., exactly when the new sliding-window mechanism is exercised. Rate is bounded by the boundary-hit frequency (~num_spec_tokens / block_size per req per step, ≈0.3% for the typical 1024/3 config), so this is workload-cheap noise while end-to-end runs are validating the path. Each line carries num_computed, remaining_in_block, remaining_in_maxlen, slide_distance, the resulting advance, and the count of drafts kept — enough to reconstruct the per-req scheduler decision from server.log without enabling DEBUG. May be downgraded to logger.debug after sliding has been validated in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_prepare_inputs` prepends past tokens to a sliding decode req's query window, so the flat input layout grows by sum(slide_distance) beyond scheduler_output.total_num_scheduled_tokens. Three downstream sites sliced on the pre-sliding count and triggered an IndexError once a real boundary fired:
- `_preprocess.num_input_tokens` lost the past-token suffix it should have read.
- The outer `execute_model`'s `num_tokens_unpadded` (feeding `_get_slot_mappings`) skipped past slots.
- `pad_speculative_draft_tokens` + the `unpadded_to_padded` remap used num_scheduled_tokens[req] only, while `logits_indices` already carried indices into the (scheduled + slide) layout.
Each site now adds the per-req or aggregate slide_distance pulled from spec_decode_slide_distance, restoring the slide-aware invariant the runner's own input building relies on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two small polish items on the sliding-window scheduler decision:
- Replace the legacy "Full-spec-or-no-spec binary cap" comment block with one that describes the actual sliding-window design (slide query window backward, idempotent KV re-write, drafts that would cross the boundary get trimmed). The old wording survived from the collective-fallback approach and no longer matches the code.
- Extend the sliding info log with a `proposed_drafts` field (= old_n - 1 = what ngram returned before the trim). Pair with the existing `kept_drafts` cap and the runner-side `num_draft_tokens` to expose draft drops directly in logs: a sliding step drops proposed_drafts - kept_drafts drafts.
While here, collapse the redundant `if not is_prefill: if num_spec > 0` pair that ruff (SIM102) flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add an env-gated diagnostic that dumps the full per-step trace for the first N sliding requests on each worker. Off by default (VLLM_RBLN_SLIDING_TRACE_REQS=0), so production runs incur no logging cost. When enabled, every sliding event for a tracked request emits positions, input_ids, slot block ids, the logits indices that survived past-position exclusion, and the num_draft_tokens reaching the rejection sampler. CPU host code in `_prepare_inputs`; never traced into the compiled model graph. Captures the FULL per-req sequence (1021 → 1022 → 1023 boundary events for the same req) so the timeline can be reconstructed from contiguous log lines — useful for verifying past-prepend / slot range / logits exclusion / effective drafts together against the scheduler's spec-decode sliding log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
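
One way such an env-gated "first N requests" trace could be structured is sketched below. Only the environment variable name and its default of 0 come from the commit message; the `SlidingTraceGate` class, its methods, and the injectable `env` mapping are hypothetical and are not the diagnostic added by this PR.

```python
import os


class SlidingTraceGate:
    """Decide whether a request's sliding events should be traced.

    Tracks the first N distinct request ids, where N is read once from
    VLLM_RBLN_SLIDING_TRACE_REQS (0 or unset disables tracing entirely).
    """

    def __init__(self, env=None):
        # `env` is injectable for testing; defaults to the process environment.
        env = os.environ if env is None else env
        self.limit = int(env.get("VLLM_RBLN_SLIDING_TRACE_REQS", "0"))
        self.tracked = set()

    def should_trace(self, req_id):
        if self.limit <= 0:
            return False  # disabled: no bookkeeping, no logging cost
        if req_id in self.tracked:
            return True   # keep following an already-tracked request
        if len(self.tracked) < self.limit:
            self.tracked.add(req_id)
            return True
        return False      # quota of tracked requests is exhausted


gate = SlidingTraceGate({"VLLM_RBLN_SLIDING_TRACE_REQS": "2"})
decisions = [gate.should_trace(r) for r in ["a", "b", "c", "a"]]
```

Because an already-tracked request keeps returning True, every later event for that request is still emitted, which is what allows a full per-request timeline to be reconstructed from the log.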

Walk through what the spec-decode sliding window does, how the scheduler and runner instrument it, and what a real `vllm bench serve` run shows. Three case studies pulled from MiniMax-M2.5 traces:
- ngram-miss boundary traversal (slide 1 → 2 → 3 for one req), proving past tokens are idempotently re-fed, slot mapping stays inside the current block, past logits are excluded, and num_draft_tokens reflects what the rejection sampler actually sees.
- ngram-hit with kept drafts (slide=1, 2 drafts kept).
- ngram-hit with drafts dropped by sliding (slide=1/2/3, dropping 1/2/3 drafts respectively when ngram fills the proposal cap): the central design payoff.
Also documents how to reproduce (env vars, bench command, log greps) and notes about the same-num_computed log artifact observed during stress testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sliding window previously fired only at block boundaries; off the boundary the runtime query length still followed the proposer's output (1 base + up to num_spec_tokens drafts), and ranks with no drafts dispatched a separate (8, 1) no_spec shape. Cross-DP MAX papered over per-rank divergence at runtime, so the no_spec compile slot was warmup'd but almost never used in DP setups.
Make the lift explicit and local: the scheduler always pads each decode req's query window to num_spec_tokens + 1 via slide_distance, regardless of how many drafts the proposer returned or whether a boundary is in sight. Variable-length proposers (ngram, suffix decoding) and fixed-length proposers (MTP, EAGLE) both converge to the same runtime shape; boundary squeeze is one special case of the same padding rule.
Changes:
- Scheduler: replace boundary-only sliding with `new_n = min(old_n, effective_remaining)` and `desired_slide = max_spec_decode_len - new_n`. Fires whenever the deficit is non-zero (length pad, boundary squeeze, or both); trims drafts only when boundary actually shortened the advance.
- Runner warmup: drop the (8, 1) no_spec model_wrapper compile; `query_len_range = [num_spec_tokens + 1]` only. Saves two compile slots in MoE warmup.
- Runner runtime: `spec_decode_max_query_len` is unconditionally num_spec_tokens + 1 when spec is configured. Cross-DP MAX is now a no-op (every rank votes the same value) but kept for shape uniformity guarantees.
- Runner dummy_run: idle DP ranks now also report num_spec_tokens + 1 and expand their dummy input to (bucket, num_spec_tokens + 1). Without this an all-idle step (all 4 DP ranks dummy_run) would collapse to (bucket, 1) and trigger a hot-path model_wrapper recompile against the dropped no_spec slot.
- Tests: add `TestSlidingVariableLengthPadding` scheduler cases (zero / partial drafts off the boundary, partial drafts at the boundary) and runner-math cases that mirror the local-only path cross-DP MAX used to provide.
- Docs: rewrite `docs/sliding_window.md` opening to describe the unified design and update the scheduler log field semantics.
Verified end-to-end on MiniMax-M2.5 / DP=4 / EP=4 / num_spec=3 with a 128-prompt bench (input 512, output 1500, rps=4):
- 0 bench-time model_wrapper recompiles (was 2 before)
- 0 errors, 128/128 successful
- 38140 sliding events, 93% slide=3 (ngram-miss length-pad path)
- Output throughput 571 tok/s (was 376 tok/s, +52%)
- ITL p99 144ms (was 1500ms+, bimodal collapsed)
- Acceptance rate 38.9% (was 32.6%)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
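
The unified padding rule from this commit reduces to two lines of arithmetic. The sketch below wraps them in a hypothetical `backfill_decision` function; `old_n` is the proposer-driven advance (1 base token + proposed drafts) and `effective_remaining` is taken as given, since its exact computation is not shown in the commit message.

```python
def backfill_decision(old_n, effective_remaining, num_spec_tokens):
    """Return (new_n, desired_slide, kept_drafts) for one decode request."""
    max_spec_decode_len = num_spec_tokens + 1
    # Boundary squeeze: never advance past the remaining budget.
    new_n = min(old_n, effective_remaining)
    assert new_n >= 1
    # Length pad: whatever is missing from the full window is backfilled.
    desired_slide = max_spec_decode_len - new_n
    # One slot of the advance is the base token; the rest are drafts.
    kept_drafts = new_n - 1
    return new_n, desired_slide, kept_drafts


ngram_miss = backfill_decision(old_n=1, effective_remaining=500, num_spec_tokens=3)
partial = backfill_decision(old_n=3, effective_remaining=500, num_spec_tokens=3)
squeezed = backfill_decision(old_n=4, effective_remaining=2, num_spec_tokens=3)
full = backfill_decision(old_n=4, effective_remaining=500, num_spec_tokens=3)

# Every case yields the same runtime query length of num_spec_tokens + 1.
query_lens = {n + s for n, s, _ in (ngram_miss, partial, squeezed, full)}
```

The ngram-miss case gives slide=3, which corresponds to the dominant "slide=3 (ngram-miss length-pad path)" bucket reported in the bench results, and `query_lens` collapses to the single value 4, which is why only one decode shape needs to be compiled at warmup.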

get_dp_padding called find_decode_batch_bucket with the max total tokens across DP, which only equals the batch size for single-token decode. For multi-token decode (e.g., speculative decoding with batch=8 reqs and 2 tokens/req → 16 tokens), the bucket lookup tried to resolve a batch bucket >= 16 and failed against typical decode_batch_buckets like [1, 4, 8]. Pack num_tokens, num_reqs, and is_prefill into a single bit-packed int32 (num_tokens in low 16 bits, num_reqs in bits 16..29, is_prefill in bit 30) so a single all-reduce surfaces both per-rank token counts and per-rank batch sizes. Use max(num_reqs) across DP for the decode bucket lookup, and pad to batch_bucket_size * max_tokens_per_req so the MoE max_pads_across_dp buffer fits every actual token position even under multi-token decode. Single-token decode is unchanged (max_tokens_per_req=1 reduces to the previous batch_bucket_size). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related fixes that together eliminate hot-path model_wrapper recompiles seen when one DP rank runs spec decode while peers are idle or single-token decoding: - Plumb cross-DP max-tokens-per-request through get_dp_padding / _prepare_inputs (4-tuple / 10-tuple returns). Every rank reports a local query length (1 when local has no drafts) so the all-reduce branch is taken uniformly across the DP group, and callers can pad to the cross-DP max regardless of local draft state. - Gate spec-decode padding in execute_model on max_tokens_per_req_across_dp > 1 instead of local scheduled_spec_decode_tokens. Previously a rank with no local drafts skipped padding even when peers raised max_pads_across_dp, producing an (input_ids[1], max_pads) tuple that did not match any warmup compile slot. - Reshape dummy_run input_ids/positions from (bucket, 1) to (bucket, query_len) when the cross-DP max exceeds 1. The pre-baked dummy state is (bucket, 1); idle ranks now mirror the spec-mode shape peers expect. Verified end-to-end with vllm bench serve under DP=4 + ngram spec decode (num_speculative_tokens=3): warmup creates 7 compile slots (prefill + padded-decode/decode-only for q in {1,2,4}) and no model_wrapper recompile is triggered during the ~9 minutes of bench traffic that follows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runtime query length is now exactly 1 (no spec) or num_spec_tokens + 1 (full spec) — never an intermediate value. This eliminates the non-pow2 query_len edge cases that previously forced hot-path model_wrapper recompiles and reduces the warmup compile slot count from 7 to 5 (prefill + padded/decode-only at q=1, q=4). Runner (vllm_rbln/v1/worker/rbln_model_runner.py): - __init__: assert (num_speculative_tokens + 1) is a power of two when spec is configured. Required for MoE multicast's max_pads / num_tokens divisibility. - _warm_up_model_inner: query_len_range = [1, num_spec_tokens + 1] (was the full pow2 sequence up to that maximum). - execute_model: spec_decode_max_query_len is simply num_spec_tokens+1 when this rank has any draft tokens this step, else 1. The pow2-round-up logic for intermediate sizes is gone. Scheduler (vllm_rbln/v1/core/rbln_scheduler.py): - spec_decode_cap update becomes a per-request binary block-boundary decision: max_spec_decode_len if remaining_in_block and remaining_in_maxlen can hold a full window, else 1. The retroactive trim then aligns every scheduled req's num_scheduled_tokens onto the same {1, num_spec_tokens+1} shape so the runner-side pad never writes past anyone's block boundary. - RBLNSchedulerOutput grows a `step_no_spec_required: bool` field set True only when the binary cap was forced to 1 by the boundary check (distinct from "no drafts proposed this step", which leaves it False). Cross-DP collective handling — distinguishes two reasons for a local no-spec state and treats them differently: - (a) boundary-induced (`step_no_spec_required=True` on some rank): OR-reduce the flag across DP. On True, every rank scrubs its scheduler_output (clears drafts, sets num_scheduled=1, recomputes totals) before _prepare_inputs builds the input tensors. This keeps the model_wrapper input shape uniform at query_len=1 across DP and prevents pad-position KV writes past any rank's block boundary. 
- (b) no-drafts-proposed (`step_no_spec_required=False`, local scheduled_spec_decode_tokens empty): keep the existing cross-DP MAX behavior so peers that do have drafts can run full spec. The no-drafts rank gets padded to peers' query_len; pad positions land on lookahead-allocated slots, which is functionally safe (their outputs are discarded by the rejection sampler). - dummy_run also participates in the new OR-reduce (voting 0) so the host-side gloo all_reduce doesn't hang when one DP rank is idle while peers run execute_model. The added cross-DP communication is one extra int32 all_reduce per step on the existing cpu_group, before model inference. Verified end-to-end with DP=4 ngram spec decode over a 9-minute hot-path bench: 5 warmup model_wrapper slots, 0 post-warmup recompiles. The (a) collective-fallback path is not exercised by this bench (input+output < block_size) and is planned for separate verification via a long-output scenario or unit test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two small polish items on the sliding-window scheduler decision: - Replace the legacy "Full-spec-or-no-spec binary cap" comment block with one that describes the actual sliding-window design (slide query window backward, idempotent KV re-write, drafts that would cross the boundary get trimmed). The old wording survived from the collective-fallback approach and no longer matches the code. - Extend the sliding info log with a `proposed_drafts` field (= old_n - 1 = what ngram returned before the trim). Pair with the existing `kept_drafts` cap and the runner-side `num_draft_tokens` to expose draft drops directly in logs: a sliding step drops proposed_drafts - kept_drafts drafts. While here, collapse the redundant `if not is_prefill: if num_spec > 0` pair that ruff (SIM102) flags. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add an env-gated diagnostic that dumps the full per-step trace for the first N sliding requests on each worker. Off by default (VLLM_RBLN_SLIDING_TRACE_REQS=0), so production runs incur no logging cost. When enabled, every sliding event for a tracked request emits positions, input_ids, slot block ids, the logits indices that survived past-position exclusion, and the num_draft_tokens reaching the rejection sampler. CPU host code in `_prepare_inputs`; never traced into the compiled model graph. Captures the FULL per-req sequence (1021 → 1022 → 1023 boundary events for the same req) so the timeline can be reconstructed from contiguous log lines — useful for verifying past-prepend / slot range / logits exclusion / effective drafts together against the scheduler's spec-decode sliding log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Walk through what the spec-decode sliding window does, how the scheduler and runner instrument it, and what a real `vllm bench serve` run shows. Three case studies pulled from MiniMax-M2.5 traces: - ngram-miss boundary traversal (slide 1 → 2 → 3 for one req), proving past tokens are idempotently re-fed, slot mapping stays inside the current block, past logits are excluded, and num_draft_tokens reflects what the rejection sampler actually sees. - ngram-hit with kept drafts (slide=1, 2 drafts kept). - ngram-hit with drafts dropped by sliding (slide=1/2/3, dropping 1/2/3 drafts respectively when ngram fills the proposal cap) — the central design payoff. Also documents how to reproduce (env vars, bench command, log greps) and notes about the same-num_computed log artifact observed during stress testing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sliding window previously fired only at block boundaries; off the boundary the runtime query length still followed the proposer's output (1 base + up to num_spec_tokens drafts), and ranks with no drafts dispatched a separate (8, 1) no_spec shape. Cross-DP MAX papered over per-rank divergence at runtime, so the no_spec compile slot was warmed up but almost never used in DP setups.

Make the lift explicit and local: the scheduler always pads each decode req's query window to num_spec_tokens + 1 via slide_distance, regardless of how many drafts the proposer returned or whether a boundary is in sight. Variable-length proposers (ngram, suffix decoding) and fixed-length proposers (MTP, EAGLE) both converge to the same runtime shape; boundary squeeze is one special case of the same padding rule.

Changes:
- Scheduler: replace boundary-only sliding with `new_n = min(old_n, effective_remaining)` and `desired_slide = max_spec_decode_len - new_n`. Fires whenever the deficit is non-zero (length pad, boundary squeeze, or both); trims drafts only when the boundary actually shortened the advance.
- Runner warmup: drop the (8, 1) no_spec model_wrapper compile; `query_len_range = [num_spec_tokens + 1]` only. Saves two compile slots in MoE warmup.
- Runner runtime: `spec_decode_max_query_len` is unconditionally num_spec_tokens + 1 when spec is configured. Cross-DP MAX is now a no-op (every rank votes the same value) but kept for shape uniformity guarantees.
- Runner dummy_run: idle DP ranks now also report num_spec_tokens + 1 and expand their dummy input to (bucket, num_spec_tokens + 1). Without this, an all-idle step (all 4 DP ranks in dummy_run) would collapse to (bucket, 1) and trigger a hot-path model_wrapper recompile against the dropped no_spec slot.
- Tests: add `TestSlidingVariableLengthPadding` scheduler cases (zero / partial drafts off the boundary, partial drafts at the boundary) and runner-math cases that mirror the local-only path cross-DP MAX used to provide.
- Docs: rewrite `docs/sliding_window.md` opening to describe the unified design and update the scheduler log field semantics.

Verified end-to-end on MiniMax-M2.5 / DP=4 / EP=4 / num_spec=3 with a 128-prompt bench (input 512, output 1500, rps=4):
- 0 bench-time model_wrapper recompiles (was 2 before)
- 0 errors, 128/128 successful
- 38140 sliding events, 93% slide=3 (ngram-miss length-pad path)
- Output throughput 571 tok/s (was 376 tok/s, +52%)
- ITL p99 144ms (was 1500ms+, bimodal collapsed)
- Acceptance rate 38.9% (was 32.6%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
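
For readers following the padding rule above, here is a minimal sketch of the two scheduler formulas in isolation. It is not the scheduler code from this PR. The names `BackfillPlan` and `plan_backfill` are hypothetical, `old_n` is assumed to be 1 base token plus the drafts the proposer returned, `effective_remaining` is assumed to be the number of token slots left in the request's current block, and `max_spec_decode_len` is assumed to equal `num_spec_tokens + 1`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackfillPlan:
    new_n: int           # new tokens the request advances by this step
    slide_distance: int  # past tokens re-fed to pad the query window
    drafts_kept: int     # drafts that survive a boundary squeeze


def plan_backfill(num_drafts: int, effective_remaining: int,
                  num_spec_tokens: int) -> BackfillPlan:
    """Hypothetical helper: pad one decode request to a fixed query length."""
    if not 0 <= num_drafts <= num_spec_tokens:
        raise ValueError("num_drafts must be in [0, num_spec_tokens]")
    if effective_remaining < 1:
        raise ValueError("effective_remaining must be at least 1")

    # Assumption: the fixed runtime query length is num_spec_tokens + 1.
    max_spec_decode_len = num_spec_tokens + 1
    # Assumption: 1 base token plus whatever the proposer returned.
    old_n = 1 + num_drafts

    # The two formulas quoted in the commit message.
    new_n = min(old_n, effective_remaining)
    desired_slide = max_spec_decode_len - new_n

    # Drafts are trimmed only if the boundary shortened the advance.
    return BackfillPlan(new_n=new_n, slide_distance=desired_slide,
                        drafts_kept=new_n - 1)


if __name__ == "__main__":
    # num_spec_tokens=3, so every plan sums to a query length of 4.
    for drafts, remaining in [(0, 16), (2, 16), (3, 3), (3, 1)]:
        print(drafts, remaining, plan_backfill(drafts, remaining, 3))
```

Under these assumptions, `slide_distance + new_n` always equals `num_spec_tokens + 1`, which is the property that lets the runner compile a single query-length shape. With `num_spec_tokens=3`, zero drafts off the boundary gives a slide of 3 (the length-pad path), and a full proposal with only 3 slots remaining gives a slide of 1 with one draft dropped.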

…RBLN-SW/vllm-rbln into support_deepseek_v3

…e trace logs to DEBUG

- test_sliding_window.py -> test_query_backfill.py (+ 6 class names)
- log prefix: "spec-decode sliding" -> "spec-decode backfill" (scheduler)
- log prefix: "sliding-trace" -> "backfill-trace" (runner)
- env var: VLLM_RBLN_SLIDING_TRACE_REQS -> VLLM_RBLN_BACKFILL_TRACE_REQS
- demote scheduler + runner per-step diagnostics from INFO to DEBUG
- identifiers (slide_distance, _run_sliding_math, etc.) kept for stability with naming-note docstrings explaining the equivalence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Backfill removes the boundary cap + retroactive trim mechanism these 7 tests guarded. Keep at_block_boundary, no_spec_tokens_no_retroactive_trim, prefill_triggers_no_mixed_batching (general scheduler invariants). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- new 3-section layout: Problem / Key idea / Example + Appendix
- query_backfill.md: 343 -> 199 lines (running trace preserved)
- cross_dp_spec_decode.md: 149 -> 145 lines
- add naming-note: "sliding window" == "query backfill" (equivalent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-commit CI flagged this hunk for reformatting (single-line ternary). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-26T07:59:26Z

Codecov Report

❌ Patch coverage is 23.90158% with 433 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
vllm_rbln/v1/worker/rbln_model_runner.py	10.06%	133 Missing and 1 partial ⚠️
...lm_rbln/v1/attention/backends/mla/flashattn_mla.py	40.74%	64 Missing ⚠️
vllm_rbln/v1/spec_decode/eagle.py	12.50%	63 Missing ⚠️
..._rbln/model_executor/model_loader/weight_loader.py	3.92%	49 Missing ⚠️
..._rbln/model_executor/layers/attention/attention.py	25.00%	30 Missing ⚠️
vllm_rbln/model_executor/layers/mla.py	11.76%	30 Missing ⚠️
...llm_rbln/model_executor/layers/quantization/fp8.py	0.00%	20 Missing ⚠️
vllm_rbln/model_executor/layers/fused_moe/layer.py	0.00%	9 Missing ⚠️
vllm_rbln/models/deepseek_v2.py	0.00%	8 Missing ⚠️
vllm_rbln/v1/attention/backends/flash_attention.py	11.11%	8 Missing ⚠️
... and 4 more


…rt_deepseek_v3

rebel-kblee and others added 30 commits May 6, 2026 06:23

initial commit for support deepseekv3

7f079fa

remove unused

f2bca04

support mla in backend

8ff0a9c

repeat interleave + crop for block quant with mla

ce734ac

pattern matching & rope

3feb9c7

fix scatter element

3abcefc

seq_len span

04f70b6

enable mla_attn b1 batch decode

1166dc5

fix condition

cff2d23

consider CR03 with mla

353f785

rebel-wonsubkim and others added 21 commits May 15, 2026 15:02

Merge branch 'feat/spec-decode-sliding-window' of https://github.com/…

7a2324a

…RBLN-SW/vllm-rbln into support_deepseek_v3

dev rebase

f1f1be2

available mtp & bugfix in attn metadata

75e856c

single compile ctx

0aaa1f0

temporal fix & remove mark_static address

58e57ef

fix root cause? need to check

f847d5a

enable num_spec > 1

c0d7bb9

style: apply ruff format to forward_context.py

6178a76


sync spec-dec branch

e6d9997

support dp with mtp

24f8c7f

add padding in eagle proposer with dp > 1

375612f

revert

8bde854

pad slot_mappings

82c8689

fix logic when chunked prefill size < prompt len

82094ec

rebel-kblee requested review from rebel-mhkang, rebel-thkim and rebel-wonsubkim May 26, 2026 07:50

rebel-eunji and others added 5 commits May 26, 2026 20:56

Merge branch 'dev' into support_deepseek_v3

63d55c2

fix runnert for measure

5b56c1e

support k>1 (draft(

442325d

Merge branch 'dev' of https://github.com/RBLN-SW/vllm-rbln into suppo…

63d1a04

…rt_deepseek_v3

sync with rebase fp8 branch

cca9fad


model: support deepseek v3 #624
rebel-kblee wants to merge 66 commits into dev from support_deepseek_v3


5 participants
