feature: cross-block no-spec fallback for variable-length spec decoding proposers #649
Open
rebel-wonsubkim wants to merge 3 commits into
…roposers When in-block query backfill would cross a KV block boundary (variable-length proposers entering a new block with a short draft), the step falls back to no-spec (query_len=1) via the cross-DP step_no_spec_required OR-reduce. Warmup now compiles the no-spec decode graph on both guard axes (input_ids shape and max_pads_across_dp size), and two runtime asserts guard the in-block invariant. Motivation & key idea are documented in the rbln_scheduler.py module docstring. Signed-off-by: wonsub kim <subang0@rebellions.ai> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ock no-spec Fixed-length proposers (eagle/eagle3/mtp/medusa) compile no no-spec graph and their DP-idle peers vote num_spec+1, so a cross-block no-spec step would hot-path recompile and break cross-DP full-spec shape agreement. Guard the cross-block branch with a runtime assert so this invariant violation (only reachable when max_model_len % block_size is in (0, num_spec+1) and a request reaches the final block) fails loudly instead of corrupting. Signed-off-by: wonsub kim <subang0@rebellions.ai> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-edge threshold Add scheduler-side unit tests (no model) for cross-block no-spec across variable- and fixed-length proposers. The fixed-length sweep showed cross-block fires iff max_model_len % block_size <= num_spec+1 (decode is capped to max_model_len-1); correct the guard assert message and docstring. Signed-off-by: wonsub kim <subang0@rebellions.ai> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
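As a rough illustration of the threshold stated in the commit message above, the check below evaluates the predicate literally as written there (`max_model_len % block_size <= num_spec+1`). The function name is hypothetical and is not taken from the PR; the real guard lives in the scheduler and may differ in detail, so treat this as a sketch for sanity-checking a configuration.

```python
def fixed_length_cross_block_reachable(
    max_model_len: int, block_size: int, num_spec: int
) -> bool:
    """Hypothetical helper: literal form of the threshold in the commit message.

    Returns True when, per that message, a fixed-length proposer could reach
    a cross-block no-spec step in the final block.
    """
    if block_size <= 0 or num_spec < 0 or max_model_len <= 0:
        raise ValueError("expected positive sizes and a non-negative num_spec")
    # Taken literally, the predicate also holds when the remainder is 0.
    return max_model_len % block_size <= num_spec + 1


# Example configurations (values chosen only for illustration).
print(fixed_length_cross_block_reachable(4099, 16, 3))  # remainder 3 -> True
print(fixed_length_cross_block_reachable(4104, 16, 3))  # remainder 8 -> False
```

A configuration check like this could run at startup so that an affected combination of `max_model_len`, `block_size` and `num_spec` is reported before any request reaches the final block.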
🚀 Summary of Changes
Motivation
Query backfill (#604) keeps every decode step at a single (batch, num_spec+1) shape by prepending past tokens. But variable-length proposers
(ngram/suffix) can need backfill that crosses a KV block boundary: right after entering a new block, a short draft would pull past tokens from the
previous block, which the single-block decode path can't express → silent KV corruption. (Fixed-length proposers never hit this — num_spec+1 <=
block_size keeps their backfill in-block.)
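The boundary condition described above can be sketched with a small model. This is a simplification under stated assumptions, not the scheduler's code: `pos` is taken to be the 0-indexed position of the first real query token, backfill is assumed to prepend exactly `num_spec - draft_len` past tokens, and both function names are hypothetical.

```python
def backfill_len(draft_len: int, num_spec: int) -> int:
    """Past tokens to prepend so the query reaches num_spec + 1 tokens."""
    if not 0 <= draft_len <= num_spec:
        raise ValueError("draft_len must be in [0, num_spec]")
    return num_spec - draft_len


def backfill_crosses_block(
    pos: int, draft_len: int, num_spec: int, block_size: int
) -> bool:
    """True if the prepended tokens would start in an earlier KV block.

    Assumption: the query covers positions [pos - pad, pos + draft_len],
    where pad = backfill_len(draft_len, num_spec).
    """
    offset_in_block = pos % block_size
    return backfill_len(draft_len, num_spec) > offset_in_block


block_size, num_spec = 16, 3
# First slot of a new block, short draft: backfill reaches into the previous block.
print(backfill_crosses_block(16, 1, num_spec, block_size))  # True
# Same draft a few tokens later: the backfill stays inside the block.
print(backfill_crosses_block(20, 1, num_spec, block_size))  # False
# Full-length draft: nothing to backfill, so no crossing.
print(backfill_crosses_block(16, 3, num_spec, block_size))  # False
```

In this model a full-length draft never needs backfill, which matches the remark that fixed-length proposers stay in-block under normal conditions.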
Key idea
shape.
dummy drives the same runtime path (RBLNSchedulerOutput(step_no_spec_required=True)) so num_padded matches too; setting query_len=1 alone would still
hot-path recompile at runtime.
query_len=1.
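The cross-DP OR-reduce mentioned in the first commit message can be illustrated with a single-process simulation. This is only a sketch: the DP ranks are modelled as a plain list of booleans, `decode_query_len` is a hypothetical helper, and the actual implementation would presumably use a collective across ranks rather than `any()`.

```python
from typing import Sequence


def decode_query_len(local_no_spec_flags: Sequence[bool], num_spec: int) -> int:
    """Pick one query length for every DP rank in this step.

    Each entry is one rank's local vote: True if in-block backfill would
    cross a KV block boundary for any of its requests. The votes are
    OR-reduced so all ranks agree on a single decode shape.
    """
    step_no_spec_required = any(local_no_spec_flags)
    # One rank needing no-spec forces query_len=1 everywhere; otherwise
    # every rank runs the full (batch, num_spec + 1) shape.
    return 1 if step_no_spec_required else num_spec + 1


# Four simulated DP ranks, ngram-style num_spec=3.
print(decode_query_len([False, True, False, False], num_spec=3))  # 1
print(decode_query_len([False, False, False, False], num_spec=3))  # 4
```

Reducing the flag rather than the query length keeps the decision binary, so only two decode shapes ever need to be compiled in this model: the full-spec shape and the no-spec shape.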
Verification
Full model (MiniMax-M2.5, DP4/EP4, ngram num_spec=3), isl=512/osl=2048: cross-block no-spec fired 322× organically, with 0 recompile / 0 hot-path / 0
assert, outputs coherent.
▎ Full motivation & key idea are kept in the rbln_scheduler.py module docstring.
host tensor - verification complete
device tensor - verification in progress
📌 Related Issues / Tickets
✅ Type of Change
release / feature / model / core / fix / perf / refactor / docs / other (please describe)
🧪 How to Test
📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes