test(spec): TestNextNV25Pro E2E (#1053 P1-8) #1097
Conversation
…1053 P1-4) 3-layer MTP for MiMo-V2.5-Pro: N draft model runners (one per mtp_layer_idx), hidden states chained layer→layer in draft_extend_*; draft_forward step i uses runner(i). Verify/orchestration reused from EAGLEWorker via draft_worker injection. Includes:
- NEXTN SpeculativeAlgorithm + scheduler dispatch
- model_config: draft arch=MiMoV2MTPForCausalLM, num_hidden_layers=1, quant ignore eh_proj/o_proj (bf16 in checkpoint)
- mimo_v2_nextn: 4-tuple __call__ + concat reshard
- mimo_v2_flash: get_embed_and_head
- fa_backend: TARGET_VERIFY custom_mask page-align pad + swa_page_indices
- rpa_v3: _fetch_mask aligned mask stride + pl.multiple_of hints
- spec prefill via padded get_model_worker_batch (epmoe padding-sensitivity)
- EAGLEWorker.__init__ accepts draft_worker (fold sgl-project#1080 review (4))

E2E v6e-64: bs=1 topk=1 greedy accept-len ~2.5 (5-prompt mix); output matches nospec for the first N>=13 tokens (later divergence is upstream epmoe bf16 padding-sensitivity, tracked separately).
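The layer→layer chaining above can be sketched with toy stand-ins (the arithmetic "runners" below are purely illustrative; the real runners wrap MiMoV2MTPForCausalLM layers and return logits/hidden-state arrays):

```python
# Toy stand-in for a per-layer draft model runner: takes the previous
# hidden state, returns (draft_token, next_hidden). Hypothetical shapes.
def make_toy_runner(layer_idx):
    def runner(token, hidden):
        next_hidden = hidden + layer_idx + 1   # stand-in for a transformer layer
        draft_token = token + next_hidden      # stand-in for lm_head + argmax
        return draft_token, next_hidden
    return runner

def draft_forward(runners, token, hidden):
    """Step i uses runner(i); hidden states are chained layer -> layer."""
    drafts = []
    for runner in runners:
        token, hidden = runner(token, hidden)
        drafts.append(token)
    return drafts, hidden

runners = [make_toy_runner(i) for i in range(3)]   # 3-layer MTP
drafts, _ = draft_forward(runners, token=0, hidden=0)   # [1, 4, 10]
```

The point being illustrated: each MTP layer consumes the hidden state produced by the previous one, so the N runners cannot run independently; they form a sequential chain per draft step.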
…ject#1080 review (1)) verify(), forward_target_extend(), forward_batch_speculative_generation() only touch BaseSpecWorker state (target_worker, draft_worker, mesh, speculative_num_*); moving them up makes EAGLEWorker and MultiLayerEAGLEWorker thin draft_worker-injection wrappers. BaseDraftWorker gains explicit abstract draft_extend_for_{prefill,decode} + draft_model_runner so the contract is visible. (3) precompile multi-layer warm-up left as TODO (currently --disable-precompile in E2E; only layer 0 would be warmed via _worker delegation).
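The hierarchy described above, where orchestration lives on the base class and concrete workers are thin draft_worker-injection wrappers, can be modeled minimally (all classes below are toy stand-ins, not the real sgl-jax API):

```python
class ToyTarget:
    def verify(self, batch, draft):
        return [t for t in draft if t % 2 == 0]   # stand-in for verification

class ToyDraftWorker:
    def draft(self, batch):
        return list(range(batch))                 # stand-in for draft tokens

class BaseSpecWorker:
    """Owns orchestration; subclasses only choose which draft worker to inject."""
    def __init__(self, target_worker, draft_worker):
        self.target_worker = target_worker
        self.draft_worker = draft_worker

    def forward_batch_speculative_generation(self, batch):
        draft = self.draft_worker.draft(batch)
        return self.target_worker.verify(batch, draft)

class EAGLEWorker(BaseSpecWorker):                # thin injection wrapper
    def __init__(self, target_worker, draft_worker=None):
        super().__init__(target_worker, draft_worker or ToyDraftWorker())

worker = EAGLEWorker(ToyTarget())
accepted = worker.forward_batch_speculative_generation(5)   # [0, 2, 4]
```

A MultiLayerEAGLEWorker in this scheme differs only in which draft worker it passes to `super().__init__`, which is why moving verify()/forward_* up makes both subclasses nearly empty.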
EAGLE/EAGLE3 dense targets don't hit sgl-project#1090 (no MoE-EP) and EagleDraftWorker.draft_extend_for_prefill doesn't yet handle padded mwb (sharded [:real_bs] slice → ShardingTypeError). Restores CI test_speculative_decoding (Qwen3-32B EAGLE3).
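The `[:real_bs]` hazard above can be illustrated shape-wise with NumPy (a hedged sketch of the padding-sensitivity workaround, not the sharded JAX code: slicing a sharded padded batch back to a non-aligned `real_bs` is what trips the ShardingTypeError, so the alternative is to keep the padded shape and mask pad rows):

```python
import numpy as np

padded_bs, real_bs, d = 8, 5, 4
hidden = np.arange(padded_bs * d, dtype=np.float32).reshape(padded_bs, d)

# Instead of hidden[:real_bs] (shape change -> resharding), keep the padded
# shape and zero out the pad rows with a validity mask.
valid = np.arange(padded_bs) < real_bs           # [True]*5 + [False]*3
masked = np.where(valid[:, None], hidden, 0.0)   # padded shape preserved
```

This keeps every array at the static padded batch size, which is also what the padded `get_model_worker_batch` spec-prefill path relies on.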
Kernel _fetch_mask now uses page-aligned mask row stride (cu_kv_lens delta) so DMA offset/size are 8-divisible. Update ref impl + test helper to construct masks with the same padded layout (host-side fa_backend already does). Restores test_flashattention custom_mask tests.
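The alignment arithmetic can be sketched as follows (hypothetical helper name; the real kernel derives the stride from cu_kv_lens deltas, and the 8 here matches the 8-divisibility the DMA requires):

```python
import numpy as np

def page_align_cu(kv_lens, align=8):
    """Round each row's kv length up to a multiple of `align` before the
    cumulative sum, so every row offset and size in the flattened mask is
    align-divisible. Illustrative only."""
    padded = ((np.asarray(kv_lens) + align - 1) // align) * align
    return np.concatenate([[0], np.cumsum(padded)])

cu = page_align_cu([5, 12, 3])   # offsets [0, 8, 24, 32], all 8-divisible
```

Because the reference implementation and test helper index into the same flattened mask, they must build it with the identical padded layout, which is what the "same padded layout" update above refers to.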
…ract" This reverts commit 9f11539.
The host-side mask pad + kernel pl.multiple_of hints were added for the NEXTN hybrid-SWA verify DMA 8-align crash, but they regress EAGLE3 (dense Qwen3-32B): accept-len 1.5→1.06 and per-round JIT cache_miss (padded mask shape varies with seq_len → recompile). Restoring the original unpadded mask path so EAGLE3 CI recovers; the NEXTN hybrid case will be re-fixed via a hybrid-only path that doesn't touch the dense kernel contract. swa_page_indices for TARGET_VERIFY is kept (independent hybrid fix).
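The per-round cache_miss above is a shape-keyed compile-cache effect; a pure-Python model (hedged illustration, not JAX code) shows why a mask shape that tracks seq_len misses every round:

```python
# Minimal model of a jit compile cache keyed by input shape: one "compile"
# per distinct shape seen. Padding rows to a multiple of 8 still leaves the
# shape a function of seq_len, so each decode round produces a new shape.
compiled_shapes = set()

def call_kernel(mask_shape):
    if mask_shape not in compiled_shapes:
        compiled_shapes.add(mask_shape)   # cache_miss -> recompile
    return len(compiled_shapes)

for seq_len in (130, 140, 150):
    rows = ((seq_len + 7) // 8) * 8       # 136, 144, 152: three shapes
    call_kernel((rows,))

recompiles = len(compiled_shapes)         # 3: a miss on every round
```

The unpadded path avoids this only because the dense kernel's mask shape was already stable under the existing bucketing; restoring it is therefore the EAGLE3-safe choice.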
…ory leak) Spec decode allocates KV via EagleDraftInput.prepare_for_decode, not ScheduleBatch.prepare_for_decode, so the per-step kv_committed_len bump never happens. cache_finished_req then only frees the prefill-time committed range, leaking every decode-allocated page. EAGLE3 CI (bs=16) never hits idle so check_memory never fires; NEXTN bs=1 idles after each request and crashes with 'token_to_kv_pool_allocator memory leak'.
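The invariant being violated can be modeled in a few lines (toy classes, hypothetical field names mirroring the description): every page allocated during decode must be reflected in kv_committed_len, or the free path only covers the prefill range.

```python
class Req:
    def __init__(self, prefill_pages):
        self.allocated = list(prefill_pages)
        self.kv_committed_len = len(prefill_pages)   # committed at prefill

def decode_step(req, page):
    req.allocated.append(page)
    req.kv_committed_len += 1   # the bump the spec-decode path was missing

def cache_finished_req(req):
    freed = req.allocated[:req.kv_committed_len]     # frees committed range only
    return len(req.allocated) - len(freed)           # pages leaked

req = Req(prefill_pages=[0, 1])
for page in (2, 3, 4):
    decode_step(req, page)
leaked = cache_finished_req(req)   # 0 with the bump; 3 without it
```

This also explains the CI asymmetry: at bs=16 the pool never drains enough for check_memory to fire, while bs=1 idles after every request and trips the leak check immediately.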
…l-side static) NEXTN (V2.5-Pro, page=256) hits Mosaic tiling(8) proof in _fetch_mask; EAGLE3 (page=64) does not. Derive mask_aligned_to_cu_kv inside the kernel from kv_cache page_size (static shape) — passing it as a kwarg from fa_backend gets traced through jit. Host pads mask rows under the same page_size>=256 condition. Dense path unchanged from upstream.
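The static-vs-traced distinction above can be illustrated with plain Python (a hedged analogy, not Pallas code): a value closed over at kernel build time stays a concrete Python int usable in alignment decisions, whereas a kwarg crossing a jit boundary arrives as a tracer with no concrete value.

```python
def build_kernel(page_size):
    """page_size is static here: it is baked in when the kernel is built,
    so shape/alignment decisions can branch on it. Hypothetical names."""
    mask_aligned_to_cu_kv = page_size >= 256        # decided at build time

    def kernel(mask_rows):
        stride = page_size if mask_aligned_to_cu_kv else 8
        return [r * stride for r in range(mask_rows)]

    return kernel

kernel = build_kernel(page_size=256)   # NEXTN-style page size
offsets = kernel(3)                    # [0, 256, 512]
```

Passing page_size as a runtime kwarg instead would make `page_size >= 256` a traced comparison, which is exactly why the fix derives it from the kv_cache's static shape inside the kernel.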
Addresses sgl-project#1089 review (2)/(4).
… v6e-64 multi-host) V2.5-Pro NEXTN needs v6e-64 (16 hosts), which popen_launch_server can't orchestrate, so this skips unless SGLANG_NEXTN_E2E_URL points at an externally-managed server. Three checks at bs=1 topk=1 greedy:
- test_greedy_sanity: single /generate succeeds
- test_multi_prompt_stable: serial mixed-length reqs (incl. 280-tok crossing the page=256 boundary) all succeed and the server stays alive after idle (regression for the spec KV leak + _fetch_mask Mosaic fix)
- test_raw_completion_accuracy: 5 MMLU-style raw-completion prompts (run_eval mmlu uses chat, which triggers <think> reasoning that the parser can't score)

Manual run on v6e-64 (V2.5-Pro, 3-layer NEXTN): 3/3 pass in 14s warm, crashes=0, accept-len 2.41 over 55 decode rounds.
Stacked on #1089 — only review e5d5628.
Adds env-gated `TestNextNV25Pro` to `test_speculative_decoding.py`. V2.5-Pro (1T) requires v6e-64 (16 hosts), which `popen_launch_server` can't orchestrate, so the class skips unless `SGLANG_NEXTN_E2E_URL` points at an externally-managed NEXTN server. Three checks at bs=1 topk=1 greedy (Phase-1 deliverable scope):
- `test_greedy_sanity`: a single `/generate` succeeds
- `test_multi_prompt_stable`: 4 serial mixed-length reqs (incl. a 280-tok prompt crossing the page=256 boundary) all succeed and the server stays alive after idle; regression test for the KV-leak and `_fetch_mask` Mosaic fixes from #1089 (MultiLayerEAGLEWorker/MultiLayerDraftWorker, #1053 P1-4)
- `test_raw_completion_accuracy`: 5 MMLU-style raw-completion prompts, ≥4/5. Doesn't use `run_eval mmlu` because chat mode triggers `<think>` reasoning that the parser can't score.

Manual run (v6e-64, V2.5-Pro 3-layer NEXTN, `--ep-size 64 --moe-backend epmoe`): accept-len 2.41 over 55 decode rounds (cold+warm; warm-only ~2.5–2.8 from #1089).
In CI (v6e-4) the class is skipped; `TestSpeculativeDecoding` (Qwen3-32B EAGLE3) is unchanged.
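The env-gating can be sketched with stdlib unittest (a hedged outline only; the real `TestNextNV25Pro` in `test_speculative_decoding.py` issues actual `/generate` requests against the external server):

```python
import os
import unittest

# If SGLANG_NEXTN_E2E_URL is unset (e.g. in v6e-4 CI), the whole class skips.
NEXTN_URL = os.environ.get("SGLANG_NEXTN_E2E_URL")

@unittest.skipUnless(NEXTN_URL, "needs an externally-managed v6e-64 NEXTN server")
class TestNextNV25Pro(unittest.TestCase):
    def test_greedy_sanity(self):
        # Hypothetical body: POST one greedy /generate to NEXTN_URL and
        # assert a non-empty completion. Elided here; needs a live server.
        self.assertTrue(NEXTN_URL)
```

`skipUnless` at class level marks every test in the class as skipped when the condition is falsy, so the suite still reports the class rather than silently omitting it.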