test(spec): TestNextNV25Pro E2E (#1053 P1-8) #1097
Conversation
…1053 P1-4) 3-layer MTP for MiMo-V2.5-Pro: N draft model runners (one per mtp_layer_idx), hidden states chained layer→layer in draft_extend_*; draft_forward step i uses runner(i). Verify/orchestration reused from EAGLEWorker via draft_worker injection. Includes:
- NEXTN SpeculativeAlgorithm + scheduler dispatch
- model_config: draft arch=MiMoV2MTPForCausalLM, num_hidden_layers=1, quant ignore eh_proj/o_proj (bf16 in checkpoint)
- mimo_v2_nextn: 4-tuple __call__ + concat reshard
- mimo_v2_flash: get_embed_and_head
- fa_backend: TARGET_VERIFY custom_mask page-align pad + swa_page_indices
- rpa_v3: _fetch_mask aligned mask stride + pl.multiple_of hints
- spec prefill via padded get_model_worker_batch (epmoe padding-sensitivity)
- EAGLEWorker.__init__ accepts draft_worker (fold sgl-project#1080 review (4))

E2E v6e-64: bs=1 topk=1 greedy accept-len ~2.5 (5-prompt mix); output matches nospec for the first N>=13 tokens (later divergence is upstream epmoe bf16 padding-sensitivity, tracked separately).
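The layer→layer chaining above can be sketched with toy stand-ins (the arithmetic "runners" below are purely illustrative; the real runners wrap MiMoV2MTPForCausalLM layers and return logits/hidden-state arrays):

```python
# Toy stand-in for a per-layer draft model runner: takes the previous
# hidden state, returns (draft_token, next_hidden). Hypothetical shapes.
def make_toy_runner(layer_idx):
    def runner(token, hidden):
        next_hidden = hidden + layer_idx + 1   # stand-in for a transformer layer
        draft_token = token + next_hidden      # stand-in for lm_head + argmax
        return draft_token, next_hidden
    return runner

def draft_forward(runners, token, hidden):
    """Step i uses runner(i); hidden states are chained layer -> layer."""
    drafts = []
    for runner in runners:
        token, hidden = runner(token, hidden)
        drafts.append(token)
    return drafts, hidden

runners = [make_toy_runner(i) for i in range(3)]   # 3-layer MTP
drafts, _ = draft_forward(runners, token=0, hidden=0)   # [1, 4, 10]
```

The point being illustrated: each MTP layer consumes the hidden state produced by the previous one, so the N runners cannot run independently; they form a sequential chain per draft step.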
…ject#1080 review (1)) verify(), forward_target_extend(), forward_batch_speculative_generation() only touch BaseSpecWorker state (target_worker, draft_worker, mesh, speculative_num_*); moving them up makes EAGLEWorker and MultiLayerEAGLEWorker thin draft_worker-injection wrappers. BaseDraftWorker gains explicit abstract draft_extend_for_{prefill,decode} + draft_model_runner so the contract is visible. (3) precompile multi-layer warm-up left as TODO (currently --disable-precompile in E2E; only layer 0 would be warmed via _worker delegation).
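The hierarchy described above, where orchestration lives on the base class and concrete workers are thin draft_worker-injection wrappers, can be modeled minimally (all classes below are toy stand-ins, not the real sgl-jax API):

```python
class ToyTarget:
    def verify(self, batch, draft):
        return [t for t in draft if t % 2 == 0]   # stand-in for verification

class ToyDraftWorker:
    def draft(self, batch):
        return list(range(batch))                 # stand-in for draft tokens

class BaseSpecWorker:
    """Owns orchestration; subclasses only choose which draft worker to inject."""
    def __init__(self, target_worker, draft_worker):
        self.target_worker = target_worker
        self.draft_worker = draft_worker

    def forward_batch_speculative_generation(self, batch):
        draft = self.draft_worker.draft(batch)
        return self.target_worker.verify(batch, draft)

class EAGLEWorker(BaseSpecWorker):                # thin injection wrapper
    def __init__(self, target_worker, draft_worker=None):
        super().__init__(target_worker, draft_worker or ToyDraftWorker())

worker = EAGLEWorker(ToyTarget())
accepted = worker.forward_batch_speculative_generation(5)   # [0, 2, 4]
```

A MultiLayerEAGLEWorker in this scheme differs only in which draft worker it passes to `super().__init__`, which is why moving verify()/forward_* up makes both subclasses nearly empty.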
EAGLE/EAGLE3 dense targets don't hit sgl-project#1090 (no MoE-EP) and EagleDraftWorker.draft_extend_for_prefill doesn't yet handle padded mwb (sharded [:real_bs] slice → ShardingTypeError). Restores CI test_speculative_decoding (Qwen3-32B EAGLE3).
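The `[:real_bs]` hazard above can be illustrated shape-wise with NumPy (a hedged sketch of the padding-sensitivity workaround, not the sharded JAX code: slicing a sharded padded batch back to a non-aligned `real_bs` is what trips the ShardingTypeError, so the alternative is to keep the padded shape and mask pad rows):

```python
import numpy as np

padded_bs, real_bs, d = 8, 5, 4
hidden = np.arange(padded_bs * d, dtype=np.float32).reshape(padded_bs, d)

# Instead of hidden[:real_bs] (shape change -> resharding), keep the padded
# shape and zero out the pad rows with a validity mask.
valid = np.arange(padded_bs) < real_bs           # [True]*5 + [False]*3
masked = np.where(valid[:, None], hidden, 0.0)   # padded shape preserved
```

This keeps every array at the static padded batch size, which is also what the padded `get_model_worker_batch` spec-prefill path relies on.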
Kernel _fetch_mask now uses page-aligned mask row stride (cu_kv_lens delta) so DMA offset/size are 8-divisible. Update ref impl + test helper to construct masks with the same padded layout (host-side fa_backend already does). Restores test_flashattention custom_mask tests.
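The alignment arithmetic can be sketched as follows (hypothetical helper name; the real kernel derives the stride from cu_kv_lens deltas, and the 8 here matches the 8-divisibility the DMA requires):

```python
import numpy as np

def page_align_cu(kv_lens, align=8):
    """Round each row's kv length up to a multiple of `align` before the
    cumulative sum, so every row offset and size in the flattened mask is
    align-divisible. Illustrative only."""
    padded = ((np.asarray(kv_lens) + align - 1) // align) * align
    return np.concatenate([[0], np.cumsum(padded)])

cu = page_align_cu([5, 12, 3])   # offsets [0, 8, 24, 32], all 8-divisible
```

Because the reference implementation and test helper index into the same flattened mask, they must build it with the identical padded layout, which is what the "same padded layout" update above refers to.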
…ract" This reverts commit 9f11539.
The host-side mask pad + kernel pl.multiple_of hints were added for the NEXTN hybrid-SWA verify DMA 8-align crash, but they regress EAGLE3 (dense Qwen3-32B): accept-len 1.5→1.06 and per-round JIT cache_miss (padded mask shape varies with seq_len → recompile). Restoring the original unpadded mask path so EAGLE3 CI recovers; the NEXTN hybrid case will be re-fixed via a hybrid-only path that doesn't touch the dense kernel contract. swa_page_indices for TARGET_VERIFY is kept (independent hybrid fix).
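The per-round cache_miss above is a shape-keyed compile-cache effect; a pure-Python model (hedged illustration, not JAX code) shows why a mask shape that tracks seq_len misses every round:

```python
# Minimal model of a jit compile cache keyed by input shape: one "compile"
# per distinct shape seen. Padding rows to a multiple of 8 still leaves the
# shape a function of seq_len, so each decode round produces a new shape.
compiled_shapes = set()

def call_kernel(mask_shape):
    if mask_shape not in compiled_shapes:
        compiled_shapes.add(mask_shape)   # cache_miss -> recompile
    return len(compiled_shapes)

for seq_len in (130, 140, 150):
    rows = ((seq_len + 7) // 8) * 8       # 136, 144, 152: three shapes
    call_kernel((rows,))

recompiles = len(compiled_shapes)         # 3: a miss on every round
```

The unpadded path avoids this only because the dense kernel's mask shape was already stable under the existing bucketing; restoring it is therefore the EAGLE3-safe choice.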
…ory leak) Spec decode allocates KV via EagleDraftInput.prepare_for_decode, not ScheduleBatch.prepare_for_decode, so the per-step kv_committed_len bump never happens. cache_finished_req then only frees the prefill-time committed range, leaking every decode-allocated page. EAGLE3 CI (bs=16) never hits idle so check_memory never fires; NEXTN bs=1 idles after each request and crashes with 'token_to_kv_pool_allocator memory leak'.
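The invariant being violated can be modeled in a few lines (toy classes, hypothetical field names mirroring the description): every page allocated during decode must be reflected in kv_committed_len, or the free path only covers the prefill range.

```python
class Req:
    def __init__(self, prefill_pages):
        self.allocated = list(prefill_pages)
        self.kv_committed_len = len(prefill_pages)   # committed at prefill

def decode_step(req, page):
    req.allocated.append(page)
    req.kv_committed_len += 1   # the bump the spec-decode path was missing

def cache_finished_req(req):
    freed = req.allocated[:req.kv_committed_len]     # frees committed range only
    return len(req.allocated) - len(freed)           # pages leaked

req = Req(prefill_pages=[0, 1])
for page in (2, 3, 4):
    decode_step(req, page)
leaked = cache_finished_req(req)   # 0 with the bump; 3 without it
```

This also explains the CI asymmetry: at bs=16 the pool never drains enough for check_memory to fire, while bs=1 idles after every request and trips the leak check immediately.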
…l-side static) NEXTN (V2.5-Pro, page=256) hits Mosaic tiling(8) proof in _fetch_mask; EAGLE3 (page=64) does not. Derive mask_aligned_to_cu_kv inside the kernel from kv_cache page_size (static shape) — passing it as a kwarg from fa_backend gets traced through jit. Host pads mask rows under the same page_size>=256 condition. Dense path unchanged from upstream.
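The static-vs-traced distinction above can be illustrated with plain Python (a hedged analogy, not Pallas code): a value closed over at kernel build time stays a concrete Python int usable in alignment decisions, whereas a kwarg crossing a jit boundary arrives as a tracer with no concrete value.

```python
def build_kernel(page_size):
    """page_size is static here: it is baked in when the kernel is built,
    so shape/alignment decisions can branch on it. Hypothetical names."""
    mask_aligned_to_cu_kv = page_size >= 256        # decided at build time

    def kernel(mask_rows):
        stride = page_size if mask_aligned_to_cu_kv else 8
        return [r * stride for r in range(mask_rows)]

    return kernel

kernel = build_kernel(page_size=256)   # NEXTN-style page size
offsets = kernel(3)                    # [0, 256, 512]
```

Passing page_size as a runtime kwarg instead would make `page_size >= 256` a traced comparison, which is exactly why the fix derives it from the kv_cache's static shape inside the kernel.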
Addresses sgl-project#1089 review (2)/(4).
… v6e-64 multi-host) V2.5-Pro NEXTN needs v6e-64 (16 hosts), which popen_launch_server can't orchestrate, so this skips unless SGLANG_NEXTN_E2E_URL points at an externally-managed server. Three checks at bs=1 topk=1 greedy:
- test_greedy_sanity: single /generate succeeds
- test_multi_prompt_stable: serial mixed-length reqs (incl. 280-tok crossing the page=256 boundary) all succeed and the server stays alive after idle (regression for the spec KV leak + _fetch_mask Mosaic fix)
- test_raw_completion_accuracy: 5 MMLU-style raw-completion prompts (run_eval mmlu uses chat, which triggers <think> reasoning that the parser can't score)

Manual run on v6e-64 (V2.5-Pro, 3-layer NEXTN): 3/3 pass in 14s warm, crashes=0, accept-len 2.41 over 55 decode rounds.
Stacked on #1089 — only review e5d5628.
Adds env-gated `TestNextNV25Pro` to `test_speculative_decoding.py`. V2.5-Pro (1T) requires v6e-64 (16 hosts), which `popen_launch_server` can't orchestrate, so the class skips unless `SGLANG_NEXTN_E2E_URL` points at an externally-managed NEXTN server. Three checks at bs=1 topk=1 greedy (Phase-1 deliverable scope):
- `test_greedy_sanity`: a single `/generate` succeeds
- `test_multi_prompt_stable`: 4 serial mixed-length reqs (incl. a 280-tok prompt crossing the page=256 boundary) all succeed and the server stays alive after idle; regression test for the KV-leak and `_fetch_mask` Mosaic fixes from #1089 (MultiLayerEAGLEWorker/MultiLayerDraftWorker, #1053 P1-4)
- `test_raw_completion_accuracy`: 5 MMLU-style raw-completion prompts, ≥4/5. Doesn't use `run_eval mmlu` because chat mode triggers `<think>` reasoning that the parser can't score.

Manual run (v6e-64, V2.5-Pro 3-layer NEXTN, `--ep-size 64 --moe-backend epmoe`): accept-len 2.41 over 55 decode rounds (cold+warm; warm-only ~2.5–2.8 from #1089).
In CI (v6e-4) the class is skipped; `TestSpeculativeDecoding` (Qwen3-32B EAGLE3) is unchanged.
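The env-gating can be sketched with stdlib unittest (a hedged outline only; the real `TestNextNV25Pro` in `test_speculative_decoding.py` issues actual `/generate` requests against the external server):

```python
import os
import unittest

# If SGLANG_NEXTN_E2E_URL is unset (e.g. in v6e-4 CI), the whole class skips.
NEXTN_URL = os.environ.get("SGLANG_NEXTN_E2E_URL")

@unittest.skipUnless(NEXTN_URL, "needs an externally-managed v6e-64 NEXTN server")
class TestNextNV25Pro(unittest.TestCase):
    def test_greedy_sanity(self):
        # Hypothetical body: POST one greedy /generate to NEXTN_URL and
        # assert a non-empty completion. Elided here; needs a live server.
        self.assertTrue(NEXTN_URL)
```

`skipUnless` at class level marks every test in the class as skipped when the condition is falsy, so the suite still reports the class rather than silently omitting it.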