Description
Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
When running the MiMo-V2-Flash model on SGLang, I see a precision issue with the combination CUDA Graph + MTP + page_size = 64.
With CUDA Graph + MTP + page_size = 1, the precision is correct.
Has anyone investigated or attempted to fix this issue?
Reproduction
CUDA Graph + MTP + page_size = 64:
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path /ssd3/MiMo-V2-Flash --max-total-tokens 835584 --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 8806 --trust-remote-code --tp-size 8 --page-size 64 --cuda-graph-max-bs 64 --max-running-requests 64 --disable-overlap-schedule --attention-backend fa3 --mem-fraction-static 0.9 --dp-size 2 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-num-draft-tokens 4 --speculative-eagle-topk 1
CUDA Graph + MTP + page_size = 1:
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server --model-path /ssd3/MiMo-V2-Flash --max-total-tokens 835584 --disable-radix-cache --decode-log-interval 1 --host 0.0.0.0 --port 8806 --trust-remote-code --tp-size 8 --page-size 1 --cuda-graph-max-bs 64 --max-running-requests 64 --disable-overlap-schedule --attention-backend fa3 --mem-fraction-static 0.9 --dp-size 2 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-num-draft-tokens 4 --speculative-eagle-topk 1
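The two commands above appear to differ only in the value of --page-size. As a convenience for anyone reproducing this, here is a minimal Python sketch that generates both command lines from one shared flag list, so the page size is the only variable. The helper names (`COMMON_FLAGS`, `build_launch_cmd`, `render`) are hypothetical and not part of SGLang; the flags and values are copied from the commands above, and --page-size is moved to the front, which assumes flag order does not matter to the launcher.

```python
import shlex

# Environment variable used in both runs (copied from the commands above).
ENV = {"SGLANG_ENABLE_SPEC_V2": "1"}

# Flags shared by both runs, copied verbatim from the commands above.
COMMON_FLAGS = [
    "--model-path", "/ssd3/MiMo-V2-Flash",
    "--max-total-tokens", "835584",
    "--disable-radix-cache",
    "--decode-log-interval", "1",
    "--host", "0.0.0.0",
    "--port", "8806",
    "--trust-remote-code",
    "--tp-size", "8",
    "--cuda-graph-max-bs", "64",
    "--max-running-requests", "64",
    "--disable-overlap-schedule",
    "--attention-backend", "fa3",
    "--mem-fraction-static", "0.9",
    "--dp-size", "2",
    "--enable-dp-attention",
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "3",
    "--speculative-num-draft-tokens", "4",
    "--speculative-eagle-topk", "1",
]


def build_launch_cmd(page_size):
    """Return the argv list for one run; only --page-size varies (hypothetical helper)."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--page-size", str(page_size),
        *COMMON_FLAGS,
    ]


def render(page_size):
    """Return a copy-pasteable shell line including the environment prefix."""
    env_prefix = " ".join(f"{key}={value}" for key, value in ENV.items())
    return f"{env_prefix} {shlex.join(build_launch_cmd(page_size))}"


if __name__ == "__main__":
    # Failing configuration first (page size 64), then the working one (page size 1).
    for page_size in (64, 1):
        print(render(page_size))
```

Both servers bind to port 8806, so the two configurations would need to be launched one after the other rather than side by side.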
Environment
H200

