
test(spec): TestNextNV25Pro E2E (#1053 P1-8)#1097

Open
zorrofox wants to merge 10 commits into
sgl-project:epic/mtp-refactor-phase1 from
zorrofox:feat/p1-8-nextn-e2e-test

Conversation

@zorrofox
Contributor

Stacked on #1089 — only review e5d5628.

Adds env-gated TestNextNV25Pro to test_speculative_decoding.py. V2.5-Pro (1T) requires v6e-64 (16 hosts) which popen_launch_server can't orchestrate, so this skips unless SGLANG_NEXTN_E2E_URL points at an externally-managed NEXTN server.

Three checks at bs=1 topk=1 greedy (Phase-1 deliverable scope):

  • test_greedy_sanity: single /generate succeeds
  • test_multi_prompt_stable: 4 serial mixed-length requests (incl. a 280-token one crossing the page=256 boundary) must all succeed, and the server must stay alive after idle; regression coverage for #1089's KV-leak and _fetch_mask Mosaic fixes
  • test_raw_completion_accuracy: 5 MMLU-style raw-completion prompts, ≥4/5. Doesn't use run_eval mmlu because chat mode triggers <think> reasoning that the parser can't score.
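The env-gating described above can be sketched as follows. This is an illustrative skeleton, not the actual class body; only the class name, test names, and the SGLANG_NEXTN_E2E_URL variable come from this PR, and the test bodies are elided:

```python
import os
import unittest

# Skip unless an externally-managed NEXTN server is available, since
# popen_launch_server cannot orchestrate the required v6e-64 (16 hosts).
NEXTN_E2E_URL = os.environ.get("SGLANG_NEXTN_E2E_URL")


@unittest.skipUnless(
    NEXTN_E2E_URL,
    "V2.5-Pro needs v6e-64 (16 hosts); set SGLANG_NEXTN_E2E_URL to run",
)
class TestNextNV25Pro(unittest.TestCase):
    def test_greedy_sanity(self):
        # Would POST a single /generate request to NEXTN_E2E_URL
        # and assert it succeeds; body elided in this sketch.
        pass
```

With the variable unset, the whole class reports as skipped rather than failed, so CI on v6e-4 stays green.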

Manual run (v6e-64, V2.5-Pro 3-layer NEXTN, --ep-size 64 --moe-backend epmoe)

test_greedy_sanity ... ok
test_multi_prompt_stable ... ok
test_raw_completion_accuracy ... ok
Ran 3 tests in 14.445s
crashes: 0

accept-len 2.41 over 55 decode rounds (cold+warm; warm-only ~2.5–2.8 from #1089).

In CI (v6e-4) the class is skipped — TestSpeculativeDecoding (Qwen3-32B EAGLE3) is unchanged.

Test User added 10 commits May 14, 2026 13:35
…1053 P1-4)

3-layer MTP for MiMo-V2.5-Pro: N draft model runners (one per mtp_layer_idx),
hidden states chained layer→layer in draft_extend_*; draft_forward step i uses
runner(i). Verify/orchestration reused from EAGLEWorker via draft_worker
injection.
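The runner-per-layer chaining can be sketched as below. `DraftRunner` and `step` are illustrative stand-ins, not the real model-runner classes; only the one-runner-per-mtp_layer_idx structure and the layer-to-layer hidden-state chaining come from the commit message:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DraftRunner:
    """Stand-in for one draft model runner (one per mtp_layer_idx)."""
    mtp_layer_idx: int
    # (tokens, hidden) -> (next_token, next_hidden); a toy signature.
    step: Callable[[list, int], Tuple[int, int]]


def draft_forward(runners: List[DraftRunner], tokens: list, hidden: int):
    """Draft step i uses runner(i); each layer's output hidden state
    feeds the next layer's input, as in the draft_extend_* chaining."""
    drafted = []
    for runner in runners:
        next_token, hidden = runner.step(tokens + drafted, hidden)
        drafted.append(next_token)
    return drafted, hidden
```

The design point is that verify/orchestration stay in the shared worker; only this per-layer draft loop differs from single-layer EAGLE.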

Includes:
- NEXTN SpeculativeAlgorithm + scheduler dispatch
- model_config: draft arch=MiMoV2MTPForCausalLM, num_hidden_layers=1, quant
  ignore eh_proj/o_proj (bf16 in checkpoint)
- mimo_v2_nextn: 4-tuple __call__ + concat reshard
- mimo_v2_flash: get_embed_and_head
- fa_backend: TARGET_VERIFY custom_mask page-align pad + swa_page_indices
- rpa_v3: _fetch_mask aligned mask stride + pl.multiple_of hints
- spec prefill via padded get_model_worker_batch (epmoe padding-sensitivity)
- EAGLEWorker.__init__ accepts draft_worker (fold sgl-project#1080 review (4))

E2E v6e-64: bs=1 topk=1 greedy accept-len ~2.5 (5-prompt mix), output matches
nospec for first N>=13 tokens (later divergence is upstream epmoe bf16
padding-sensitivity, tracked separately).
…ject#1080 review (1))

verify(), forward_target_extend(), forward_batch_speculative_generation()
only touch BaseSpecWorker state (target_worker, draft_worker, mesh,
speculative_num_*); moving them up makes EAGLEWorker and
MultiLayerEAGLEWorker thin draft_worker-injection wrappers.

BaseDraftWorker gains explicit abstract draft_extend_for_{prefill,decode}
+ draft_model_runner so the contract is visible.
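The explicit contract can be sketched as a minimal ABC. Method and property names follow the commit message; the signatures and everything else here are assumptions:

```python
from abc import ABC, abstractmethod


class BaseDraftWorker(ABC):
    """Sketch of the visible draft-worker contract described above."""

    @property
    @abstractmethod
    def draft_model_runner(self):
        """The model runner backing this draft worker."""

    @abstractmethod
    def draft_extend_for_prefill(self, batch):
        """Extend the draft state during prefill."""

    @abstractmethod
    def draft_extend_for_decode(self, batch):
        """Extend the draft state during decode."""
```

A concrete worker (single-layer or multi-layer) must implement all three, so a missing override fails at construction time instead of deep inside the decode loop.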

(3) precompile multi-layer warm-up left as TODO (currently
--disable-precompile in E2E; only layer 0 would be warmed via
_worker delegation).
EAGLE/EAGLE3 dense targets don't hit sgl-project#1090 (no MoE-EP) and
EagleDraftWorker.draft_extend_for_prefill doesn't yet handle padded
mwb (sharded [:real_bs] slice → ShardingTypeError). Restores CI
test_speculative_decoding (Qwen3-32B EAGLE3).
Kernel _fetch_mask now uses page-aligned mask row stride (cu_kv_lens
delta) so DMA offset/size are 8-divisible. Update ref impl + test
helper to construct masks with the same padded layout (host-side
fa_backend already does). Restores test_flashattention custom_mask
tests.
The host-side mask pad + kernel pl.multiple_of hints were added for the
NEXTN hybrid-SWA verify DMA 8-align crash, but they regress EAGLE3
(dense Qwen3-32B): accept-len 1.5→1.06 and per-round JIT cache_miss
(padded mask shape varies with seq_len → recompile). Restoring the
original unpadded mask path so EAGLE3 CI recovers; the NEXTN hybrid
case will be re-fixed via a hybrid-only path that doesn't touch the
dense kernel contract. swa_page_indices for TARGET_VERIFY is kept
(independent hybrid fix).
…ory leak)

Spec decode allocates KV via EagleDraftInput.prepare_for_decode, not
ScheduleBatch.prepare_for_decode, so the per-step kv_committed_len bump
never happens. cache_finished_req then only frees the prefill-time
committed range, leaking every decode-allocated page. EAGLE3 CI (bs=16)
never hits idle so check_memory never fires; NEXTN bs=1 idles after each
request and crashes with 'token_to_kv_pool_allocator memory leak'.
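The leak mechanics can be illustrated with a toy page-accounting model. Everything below is purely illustrative (the real bookkeeping lives in the KV pool and kv_committed_len); it only mirrors the shape of the bug: decode-allocated pages never enter the committed range, and freeing by committed range leaks them:

```python
class ToyKVPool:
    """Toy allocator tracking which pages are currently held."""

    def __init__(self):
        self.allocated = set()

    def alloc(self, pages):
        self.allocated |= set(pages)

    def free(self, pages):
        self.allocated -= set(pages)


def run_request(pool, prefill_pages, decode_pages, bump_committed_on_decode):
    pool.alloc(prefill_pages)
    committed = list(prefill_pages)   # committed at prefill time
    pool.alloc(decode_pages)          # spec-decode allocation path
    if bump_committed_on_decode:      # the missing per-step bump
        committed += list(decode_pages)
    pool.free(committed)              # finish frees only the committed range
    return pool.allocated             # anything left here is leaked
```

Without the bump, every decode-allocated page survives the free, which is exactly what the idle-time memory check trips on at bs=1.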
…l-side static)

NEXTN (V2.5-Pro, page=256) hits Mosaic tiling(8) proof in _fetch_mask;
EAGLE3 (page=64) does not. Derive mask_aligned_to_cu_kv inside the
kernel from kv_cache page_size (static shape) — passing it as a kwarg
from fa_backend gets traced through jit. Host pads mask rows under the
same page_size>=256 condition. Dense path unchanged from upstream.
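The host-side padding condition can be sketched as a standalone helper. The page_size>=256 threshold comes from the commit message; the helper name and the exact rounding are assumptions about the real fa_backend logic:

```python
def padded_mask_stride(kv_len: int, page_size: int) -> int:
    """Pad a mask row stride up to a multiple of page_size so the
    kernel's DMA offset/size stay 8-divisible (hybrid path only)."""
    if page_size >= 256:
        # round kv_len up to the next multiple of page_size
        return -(-kv_len // page_size) * page_size
    return kv_len  # dense / small-page path unchanged
```

This matches the split described above: NEXTN (page=256) takes the padded branch, while EAGLE3 (page=64) keeps the original unpadded layout.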
… v6e-64 multi-host)

V2.5-Pro NEXTN needs v6e-64 (16 hosts) which popen_launch_server can't
orchestrate, so this skips unless SGLANG_NEXTN_E2E_URL points at an
externally-managed server. Three checks at bs=1 topk=1 greedy:

- test_greedy_sanity: single /generate succeeds
- test_multi_prompt_stable: serial mixed-length reqs (incl. 280-tok
  crossing the page=256 boundary) all succeed and server stays alive
  after idle (regression for the spec KV leak + _fetch_mask Mosaic fix)
- test_raw_completion_accuracy: 5 MMLU-style raw-completion prompts
  (run_eval mmlu uses chat which triggers <think> reasoning that the
  parser can't score)

Manual run on v6e-64 (V2.5-Pro, 3-layer NEXTN): 3/3 pass in 14s warm,
crashes=0, accept-len 2.41 over 55 decode rounds.
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements support for the V2.5-Pro (1T) model within the speculative decoding framework. It introduces a multi-layer draft worker architecture to handle multi-layer MTP models and adds robust E2E tests to verify greedy generation, stability under mixed-length prompts, and completion accuracy. Additionally, it addresses critical issues related to KV-pool management and Mosaic kernel tiling constraints.

Highlights

  • New E2E Testing for V2.5-Pro: Added TestNextNV25Pro to test_speculative_decoding.py to support V2.5-Pro (1T) model testing via an externally-managed server.
  • Multi-Layer MTP Support: Introduced MultiLayerDraftWorker and MultiLayerEAGLEWorker to support multi-layer speculative decoding architectures.
  • Bug Fixes: Resolved KV-leak issues and fixed _fetch_mask Mosaic tiling constraints for page sizes >= 256.
