Description
Does GLM support PD disaggregation together with MTP? I tried to test it, but the accept length in the log is always 1 (the draft prediction fails every time) and performance is poor. I used the launch commands below. Is something wrong with them?
args for prefill node:
SGLANG_ENABLE_SPEC_V2=1 SGLANG_DISAGGREGATION_QUEUE_SIZE=1 SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=1 MC_TE_METRIC=1 SGLANG_SET_CPU_AFFINITY=true python -m sglang.launch_server --model /models/GLM-4.6-FP8/ --trust-remote-code --watchdog-timeout "1000000" --mem-fraction-static 0.8 --max-running-requests 40 --disaggregation-mode prefill --tp-size 8 --kv-cache-dtype fp8_e4m3 --host 0.0.0.0 --chunked-prefill-size 16384 --attention-backend fa3 --enable-metrics --disaggregation-ib-device mlx5_0 --page-size 64 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
args for decode node:
SGLANG_ENABLE_SPEC_V2=1 SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION=512 SGLANG_SET_CPU_AFFINITY=true python -m sglang.launch_server --model /models/GLM-4.6-FP8/ --trust-remote-code --watchdog-timeout "1000000" --mem-fraction-static 0.9 --tp-size 8 --kv-cache-dtype fp8_e4m3 --disaggregation-mode decode --prefill-round-robin-balance --host 0.0.0.0 --chunked-prefill-size 16384 --attention-backend fa3 --max-running-requests 80 --enable-metrics --disaggregation-ib-device mlx5_0 --page-size 64 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
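To make the two commands easier to compare, here is a minimal Python sketch that diffs their flags. It is not part of SGLang: the helper names `parse_flags` and `diff_flags` are made up for illustration, the flag strings are copied from the commands above, and the environment variables and the `python -m sglang.launch_server` prefix are left out.

```python
import shlex

# Flags copied verbatim from the prefill and decode commands above.
PREFILL_ARGS = """
--model /models/GLM-4.6-FP8/ --trust-remote-code --watchdog-timeout "1000000"
--mem-fraction-static 0.8 --max-running-requests 40 --disaggregation-mode prefill
--tp-size 8 --kv-cache-dtype fp8_e4m3 --host 0.0.0.0 --chunked-prefill-size 16384
--attention-backend fa3 --enable-metrics --disaggregation-ib-device mlx5_0
--page-size 64 --speculative-algorithm NEXTN --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4
"""

DECODE_ARGS = """
--model /models/GLM-4.6-FP8/ --trust-remote-code --watchdog-timeout "1000000"
--mem-fraction-static 0.9 --tp-size 8 --kv-cache-dtype fp8_e4m3
--disaggregation-mode decode --prefill-round-robin-balance --host 0.0.0.0
--chunked-prefill-size 16384 --attention-backend fa3 --max-running-requests 80
--enable-metrics --disaggregation-ib-device mlx5_0 --page-size 64
--speculative-algorithm NEXTN --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4
"""


def parse_flags(arg_string):
    """Map each --flag to its value, or to True for a bare switch."""
    tokens = shlex.split(arg_string)
    flags = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        # A flag has a value only if the next token is not itself a flag.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        flags[name] = tokens[i + 1] if has_value else True
        i += 2 if has_value else 1
    return flags


def diff_flags(a, b):
    """Return {flag: (value_in_a, value_in_b)} for differing flags; None means absent."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


prefill = parse_flags(PREFILL_ARGS)
decode = parse_flags(DECODE_ARGS)
differences = diff_flags(prefill, decode)

for flag, (prefill_value, decode_value) in differences.items():
    print(f"{flag}: prefill={prefill_value} decode={decode_value}")
```

The only flags that differ are `--disaggregation-mode`, `--max-running-requests`, `--mem-fraction-static`, and `--prefill-round-robin-balance`, so the speculative decoding settings are identical on both nodes.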