feature: cross-block no-spec fallback for variable-length spec decoding proposers #649

Open
rebel-wonsubkim wants to merge 3 commits into dev from feat/spec-decode-cross-block-nospec

@rebel-wonsubkim rebel-wonsubkim commented Jun 6, 2026

🚀 Summary of Changes

What does this PR do? What feature, fix, or improvement does it bring?

Motivation

Query backfill (#604) keeps every decode step at a single (batch, num_spec+1) shape by prepending past tokens. But variable-length proposers
(ngram/suffix) can need backfill that crosses a KV block boundary: right after entering a new block, a short draft would pull past tokens from the
previous block, which the single-block decode path cannot express, so the KV cache is silently corrupted. (Fixed-length proposers never hit this
because num_spec+1 <= block_size keeps their backfill in-block.)

Key idea

  • Detect cross-block in the scheduler (desired_slide > tokens_used_in_block) and fall back to no-spec (query_len=1) for that step.
  • Cross-DP collective: any rank tripping it → every rank drops to no-spec via the step_no_spec_required OR-reduce, so all ranks keep the same decode
    shape.
  • Compile the no-spec graph in warmup. A decode graph's compile key has two axes — input_ids shape and max_pads_across_dp size (num_padded). The warmup
    dummy drives the same runtime path (RBLNSchedulerOutput(step_no_spec_required=True)) so num_padded matches too; setting query_len=1 alone would still
    hot-path recompile at runtime.
  • Two runtime asserts guard the invariant: no backfill window may cross a block boundary, and the no-spec scrub must clear all slide and force
    query_len=1.
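
The detection and the cross-DP vote described above can be sketched roughly as follows. This is a minimal illustration, not the code in rbln_scheduler.py: the `DecodeRequest` class, its fields, and the helper names are hypothetical, and it assumes that `desired_slide` equals `num_spec - draft_len` and that `tokens_used_in_block` equals `num_computed_tokens % block_size`. The real OR-reduce is a collective across DP ranks; here it is imitated with a plain `any()` over per-rank lists.

```python
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    """Hypothetical per-request view; field names are illustrative only."""
    num_computed_tokens: int  # tokens already written to the KV cache
    draft_len: int            # draft tokens proposed for this step


def needs_cross_block_backfill(req, num_spec, block_size):
    # Assumption: backfill prepends enough past tokens to reach num_spec + 1.
    desired_slide = num_spec - req.draft_len
    # Assumption: tokens the request already holds in its current KV block.
    tokens_used_in_block = req.num_computed_tokens % block_size
    # The backfill window would reach into the previous block.
    return desired_slide > tokens_used_in_block


def step_no_spec_required(per_rank_requests, num_spec, block_size):
    # Stand-in for the cross-DP OR-reduce: one tripping rank forces every rank.
    local_flags = [
        any(needs_cross_block_backfill(r, num_spec, block_size) for r in reqs)
        for reqs in per_rank_requests
    ]
    return any(local_flags)


def query_len(no_spec, num_spec):
    # No-spec steps decode one token; spec steps keep the fixed shape.
    return 1 if no_spec else num_spec + 1
```

With block_size=1024 and num_spec=3, a request that has just entered a new block (`num_computed_tokens=1025`, one token in the block) with a one-token draft would need a slide of 2, so the whole step drops to `query_len=1` on every rank.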

Verification

Full model (MiniMax-M2.5, DP4/EP4, ngram num_spec=3), isl=512/osl=2048: cross-block no-spec fired 322× organically, with 0 recompile / 0 hot-path / 0
assert, outputs coherent.

▎ Full motivation & key idea are kept in the rbln_scheduler.py module docstring.

host tensor - verified

RBLN_WEIGHT_FREE=0 VLLM_RBLN_USE_DEVICE_TENSOR=0 VLLM_RBLN_SAMPLER=0 \
vllm serve MiniMaxAI/MiniMax-M2.5 --port 8000 --data-parallel-size 4 --enable-expert-parallel \
--max-model-len 196608 --block-size 1024 --enable-chunked-prefill \
--max-num-batched-tokens 512 --max-num-seqs 8 --gpu-memory-utilization 0.8 --trust-remote-code \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'

device tensor - verification in progress

RBLN_WEIGHT_FREE=1 VLLM_RBLN_USE_DEVICE_TENSOR=1 VLLM_RBLN_SAMPLER=0 \
vllm serve MiniMaxAI/MiniMax-M2.5 --port 8000 --data-parallel-size 4 --enable-expert-parallel \
--max-model-len 196608 --block-size 1024 --enable-chunked-prefill \
--max-num-batched-tokens 512 --max-num-seqs 8 --gpu-memory-utilization 0.8 --trust-remote-code \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'

vllm bench serve \
    --model MiniMaxAI/MiniMax-M2.5 --backend vllm \
    --base-url http://127.0.0.1:8000 --endpoint /v1/completions \
    --dataset-name random --random-input-len 512 --random-output-len 2048 \
    --num-prompts 320 --max-concurrency 32 --request-rate 4 \
    --percentile-metrics ttft,tpot,itl,e2el --temperature 0

📌 Related Issues / Tickets


✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


…roposers

When in-block query backfill would cross a KV block boundary (variable-length
proposers entering a new block with a short draft), the step falls back to
no-spec (query_len=1) via the cross-DP step_no_spec_required OR-reduce. Warmup
now compiles the no-spec decode graph on both guard axes (input_ids shape and
max_pads_across_dp size), and two runtime asserts guard the in-block invariant.
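
Why warmup has to match both guard axes can be illustrated with a toy cache keyed on the pair (input_ids shape, num_padded). This is only a sketch under assumptions: `GraphCache` and its methods are hypothetical and do not mirror rbln_model_runner.py, and the shapes and `num_padded` values below are made up for the example.

```python
class GraphCache:
    """Hypothetical stand-in for a compiled-graph cache; not the real runner."""

    def __init__(self):
        self._graphs = {}
        self.recompiles = 0  # counts hot-path compiles that warmup missed

    def warmup(self, input_ids_shape, num_padded):
        # Pre-compile a graph for one (shape, num_padded) key.
        self._graphs[(tuple(input_ids_shape), num_padded)] = "compiled"

    def run(self, input_ids_shape, num_padded):
        key = (tuple(input_ids_shape), num_padded)
        if key not in self._graphs:
            # Either axis differing from warmup triggers a recompile.
            self.recompiles += 1
            self._graphs[key] = "compiled"
        return self._graphs[key]


# Warmup that matches only the input_ids shape still misses at runtime.
shape_only = GraphCache()
shape_only.warmup((8, 1), num_padded=4)
shape_only.run((8, 1), num_padded=1)

# Warmup that drives the same path as runtime matches both axes.
both_axes = GraphCache()
both_axes.warmup((8, 1), num_padded=1)
both_axes.run((8, 1), num_padded=1)
```

In this toy model `shape_only.recompiles` ends at 1 and `both_axes.recompiles` stays at 0, which is the behavior the warmup change is meant to secure.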

Motivation & key idea are documented in the rbln_scheduler.py module docstring.

Signed-off-by: wonsub kim <subang0@rebellions.ai>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rebel-wonsubkim rebel-wonsubkim changed the title feat(spec-decode): cross-block no-spec fallback for variable-length p… feature: cross-block no-spec fallback for variable-length spec decoding proposers Jun 6, 2026
codecov Bot commented Jun 6, 2026
Codecov Report

❌ Patch coverage is 34.88372% with 28 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| vllm_rbln/v1/worker/rbln_model_runner.py | 6.89% | 26 Missing and 1 partial ⚠️ |
| vllm_rbln/v1/core/rbln_scheduler.py | 92.85% | 0 Missing and 1 partial ⚠️ |

rebel-wonsubkim and others added 2 commits June 7, 2026 11:09
…ock no-spec

Fixed-length proposers (eagle/eagle3/mtp/medusa) compile no no-spec graph and
their DP-idle peers vote num_spec+1, so a cross-block no-spec step would
hot-path recompile and break cross-DP full-spec shape agreement. Guard the
cross-block branch with a runtime assert so this invariant violation (only
reachable when max_model_len % block_size is in (0, num_spec+1) and a request
reaches the final block) fails loudly instead of corrupting.

Signed-off-by: wonsub kim <subang0@rebellions.ai>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-edge threshold

Add scheduler-side unit tests (no model) for cross-block no-spec across
variable- and fixed-length proposers. The fixed-length sweep showed cross-block
fires iff max_model_len % block_size <= num_spec+1 (decode is capped to
max_model_len-1); correct the guard assert message and docstring.

Signed-off-by: wonsub kim <subang0@rebellions.ai>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rebel-jinhwan rebel-jinhwan added the torch.compile torch.compile based implementation label Jun 9, 2026