feature: cross-block no-spec fallback for variable-length spec decoding proposers by rebel-wonsubkim · Pull Request #649 · RBLN-SW/vllm-rbln

rebel-wonsubkim · 2026-06-06T08:27:52Z

🚀 Summary of Changes

What does this PR do? What feature, fix, or improvement does it bring?

Motivation

Query backfill (#604) keeps every decode step at a single (batch, num_spec+1) shape by prepending past tokens. With variable-length proposers
(ngram/suffix), however, the backfill can cross a KV block boundary: right after a request enters a new block, a short draft would pull past tokens
from the previous block. The single-block decode path cannot express that, and the result is silent KV corruption. (Fixed-length proposers never hit
this, because num_spec+1 <= block_size keeps their backfill in-block.)

Key idea

Detect cross-block backfill in the scheduler (desired_slide > tokens_used_in_block) and fall back to no-spec (query_len=1) for that step.
Cross-DP collective: if any rank trips the check, every rank drops to no-spec via the step_no_spec_required OR-reduce, so all ranks keep the same
decode shape.
Compile the no-spec graph in warmup. A decode graph's compile key has two axes: input_ids shape and max_pads_across_dp size (num_padded). The warmup
dummy drives the same runtime path (RBLNSchedulerOutput(step_no_spec_required=True)) so that num_padded matches too; setting query_len=1 alone would
still trigger a hot-path recompile at runtime.
Two runtime asserts guard the invariant: no backfill window may cross a block boundary, and the no-spec scrub must clear every slide and force
query_len=1.
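
The detection and the cross-DP OR-reduce described above can be sketched in a few lines. This is a minimal illustration, not the code in rbln_scheduler.py: the DecodeRequest class, its fields, and the helper names are hypothetical, and it assumes that desired_slide equals num_spec minus the draft length and that tokens_used_in_block equals num_computed_tokens modulo block_size. The real collective is replaced here by a plain any() over per-rank flags.

```python
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    """Hypothetical per-request state; field names are illustrative only."""
    num_computed_tokens: int  # tokens already written to the KV cache
    draft_len: int            # draft tokens proposed for this step


def crosses_block(req: DecodeRequest, num_spec: int, block_size: int) -> bool:
    # Assumption: backfill prepends (num_spec - draft_len) past tokens so the
    # query is always num_spec + 1 wide.
    desired_slide = num_spec - req.draft_len
    # Assumption: number of tokens already stored in the current KV block.
    tokens_used_in_block = req.num_computed_tokens % block_size
    return desired_slide > tokens_used_in_block


def rank_needs_no_spec(reqs, num_spec: int, block_size: int) -> bool:
    # A rank trips the fallback if any of its requests would cross a block.
    return any(crosses_block(r, num_spec, block_size) for r in reqs)


def or_reduce(rank_flags: list[bool]) -> bool:
    # Stand-in for the cross-DP collective: one tripped rank trips them all.
    return any(rank_flags)


def step_query_len(rank_flags: list[bool], num_spec: int) -> int:
    # Every rank uses the same shape: 1 on a no-spec step, num_spec + 1 otherwise.
    return 1 if or_reduce(rank_flags) else num_spec + 1


if __name__ == "__main__":
    num_spec, block_size = 3, 1024
    rank0 = [DecodeRequest(num_computed_tokens=1030, draft_len=0)]  # in-block
    rank1 = [DecodeRequest(num_computed_tokens=1024, draft_len=1)]  # new block
    flags = [rank_needs_no_spec(r, num_spec, block_size) for r in (rank0, rank1)]
    print(flags, step_query_len(flags, num_spec))
```

In this toy run the second rank has just entered a new block with a one-token draft, so its two-token backfill would reach into the previous block; the OR-reduce then drops both ranks to query_len=1 for that step.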

Verification

Full model (MiniMax-M2.5, DP4/EP4, ngram num_spec=3), isl=512/osl=2048: cross-block no-spec fired 322× organically, with 0 recompile / 0 hot-path / 0
assert, outputs coherent.

▎ Full motivation & key idea are kept in the rbln_scheduler.py module docstring.

host tensor - verified

RBLN_WEIGHT_FREE=0 VLLM_RBLN_USE_DEVICE_TENSOR=0 VLLM_RBLN_SAMPLER=0 \
vllm serve MiniMaxAI/MiniMax-M2.5 --port 8000 --data-parallel-size 4 --enable-expert-parallel \
--max-model-len 196608 --block-size 1024 --enable-chunked-prefill \
--max-num-batched-tokens 512 --max-num-seqs 8 --gpu-memory-utilization 0.8 --trust-remote-code \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'

device tensor - verification in progress

RBLN_WEIGHT_FREE=1 VLLM_RBLN_USE_DEVICE_TENSOR=1 VLLM_RBLN_SAMPLER=0 \
vllm serve MiniMaxAI/MiniMax-M2.5 --port 8000 --data-parallel-size 4 --enable-expert-parallel \
--max-model-len 196608 --block-size 1024 --enable-chunked-prefill \
--max-num-batched-tokens 512 --max-num-seqs 8 --gpu-memory-utilization 0.8 --trust-remote-code \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 5, "prompt_lookup_min": 2}'

vllm bench serve \
    --model MiniMaxAI/MiniMax-M2.5 --backend vllm \
    --base-url http://127.0.0.1:8000 --endpoint /v1/completions \
    --dataset-name random --random-input-len 512 --random-output-len 2048 \
    --num-prompts 320 --max-concurrency 32 --request-rate 4 \
    --percentile-metrics ttft,tpot,itl,e2el --temperature 0

📌 Related Issues / Tickets

Resolves #
Related to feature(spec_dec): implement spec decode backfill for fixed length drafts (full_spec only) #604

✅ Type of Change

🚀 Release (release)
✨ Feature (feature)
🧠 Model support (model)
🧬 Core engine changes (core)
🛠 Bug fix (fix)
⚙️ Performance improvement (perf)
🔁 Refactor or code cleanup (refactor)
📄 Documentation (docs)
❓ Other (other): please describe

🧪 How to Test

Run ...
Verify output: ...
Edge case tested: ...

📸 Screenshots / Logs (if applicable)

📋 Checklist

PR title follows Conventional Commits format
This PR is linked to an existing issue
The test method is described, and the expected result is clearly stated
Relevant documentation has been updated (if applicable)

💬 Notes

…roposers

When in-block query backfill would cross a KV block boundary (variable-length proposers entering a new block with a short draft), the step falls back to no-spec (query_len=1) via the cross-DP step_no_spec_required OR-reduce. Warmup now compiles the no-spec decode graph on both guard axes (input_ids shape and max_pads_across_dp size), and two runtime asserts guard the in-block invariant. Motivation & key idea are documented in the rbln_scheduler.py module docstring.

Signed-off-by: wonsub kim <subang0@rebellions.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-06T08:44:37Z

Codecov Report

❌ Patch coverage is 34.88372% with 28 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
vllm_rbln/v1/worker/rbln_model_runner.py	6.89%	26 Missing and 1 partial ⚠️
vllm_rbln/v1/core/rbln_scheduler.py	92.85%	0 Missing and 1 partial ⚠️


…ock no-spec

Fixed-length proposers (eagle/eagle3/mtp/medusa) compile no no-spec graph and their DP-idle peers vote num_spec+1, so a cross-block no-spec step would hot-path recompile and break cross-DP full-spec shape agreement. Guard the cross-block branch with a runtime assert so this invariant violation (only reachable when max_model_len % block_size is in (0, num_spec+1) and a request reaches the final block) fails loudly instead of corrupting.

Signed-off-by: wonsub kim <subang0@rebellions.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-edge threshold

Add scheduler-side unit tests (no model) for cross-block no-spec across variable- and fixed-length proposers. The fixed-length sweep showed cross-block fires iff max_model_len % block_size <= num_spec+1 (decode is capped to max_model_len-1); correct the guard assert message and docstring.

Signed-off-by: wonsub kim <subang0@rebellions.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rebel-wonsubkim requested review from junstar92, rebel-jaehwang and rebel-jinhwan June 6, 2026 08:28

rebel-wonsubkim changed the title ~~feat(spec-decode): cross-block no-spec fallback for variable-length p…~~ feature: cross-block no-spec fallback for variable-length spec decoding proposers Jun 6, 2026

rebel-wonsubkim and others added 2 commits June 7, 2026 11:09

rebel-jinhwan assigned rebel-wonsubkim Jun 9, 2026

rebel-jinhwan added the torch.compile torch.compile based implementation label Jun 9, 2026

rebel-wonsubkim wants to merge 3 commits into dev from feat/spec-decode-cross-block-nospec
