sglang: bump to v0.5.12-cu129; migrate Qwen3.5-122B off vLLM#42
Closed
Evrard-Nil wants to merge 1 commit into
Closed
sglang: bump to v0.5.12-cu129; migrate Qwen3.5-122B off vLLM#42Evrard-Nil wants to merge 1 commit into
Evrard-Nil wants to merge 1 commit into
Conversation
- GLM-5.1, Qwen3.6 (small-models), FLUX (small-models): swap the
pre-v0.5.11 :dev pin (e1eee3f, built 2026-04-30) and the very old
:latest-ish FLUX pin (8ece90ad, 2026-01-21) for v0.5.12-cu129
(9e02c8e1, 2026-05-16). cu129 keeps the CUDA 12.9 driver story
unchanged; CUDA 13 default in v0.5.11 sidesteps the host-driver
question for now.
- Qwen3.5-122B-A10B: replace the vLLM v0.20.0 stanza with an sglang
service modeled on the Qwen3.6 cookbook. 4-way TP per instance,
fp8 KV cache, qwen3 reasoning + qwen3_coder tool parsers preserved.
EAGLE spec decoding added (Spec V2 is default since 0.5.11).
- Qwen3.5 context drops from 1.01M (yarn rope hf-override) to native
262144 — yarn override removed. Clients using >256k context will
start failing; agreed in advance.
- Containers renamed vllm-qwen35-122b-{1,2} → qwen35-{1,2}; Datadog
source/service/metrics flipped vllm → sglang. Proxy container
(vllm-proxy-qwen35, vllm-proxy-rs) keeps its name.
Note: EAGLE on Qwen3.5 mirrors the 3.6 flag set, which relies on the
model's native MTP head. If Qwen3.5 lacks one, sglang will fail at
boot — verify on a single CVM before rolling.
Contributor
Author
|
Closing — keeping changes on local branch only. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
sglang:devpin built 2026-04-30 (pre-0.5.11) to the releasedlmsysorg/sglang:v0.5.12-cu129@sha256:9e02c8e1…(built 2026-05-16).-cu129tag — driver story unchanged from today.Why now
The current
:devpin is a snapshot of master between v0.5.10.post1 and v0.5.11. Both subsequent stable releases name our workloads explicitly:Changes per file
GLM-5.1.yamlSingle-line image bump.
small-models.yamldockerfile_inline: bumps theFROMdigest.python[diffusion]extras still apply.Qwen3.5-122B.yaml— the substantive changeTranslation of the existing vLLM flags:
--tensor-parallel-size 4--tp 4--gpu-memory-utilization 0.88--mem-fraction-static 0.88--max-model-len 1010000--context-length 262144⚠--hf-overrides {yarn rope…}--kv-cache-dtype fp8_e4m3--kv-cache-dtype fp8_e4m3--max-num-batched-tokens 8192--chunked-prefill-size 16384(matches GLM-5.1/Qwen3.6)--reasoning-parser qwen3--tool-call-parser qwen3_coder--enable-auto-tool-choice--enable-chunked-prefill--enable-mixed-chunk--enable-prefix-caching--enable-prompt-tokens-details--enable-cache-reportcovers similar telemetryVLLM_ALLOW_LONG_MAX_MODEL_LEN=1Added (mirroring Qwen3.6):
--speculative-algorithm EAGLE+ 3 steps / topk 1 / 4 draft tokens--num-continuous-decode-steps 5--model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}'--enable-cache-report,--enable-metrics,--trust-remote-code,--log-requests-level 0,--served-model-name Qwen/Qwen3.5-122B-A10BSGLANG_ENABLE_SPEC_V2=1(explicit; default since 0.5.11)kernel_cachevolume for DeepGEMM JIT cacheContainer/observability rename:
vllm-qwen35-122b-{1,2}→qwen35-{1,2}source/servicevllm→sglang; metric prefix filtervllm:.*→sglang:**vllm-qwen35-122b-common→*qwen35-122b-commonVLLM_BACKEND_URLSupdated to the new container namesvllm-proxy-qwen35, vllm-proxy-rs) name kept — it's still the same Rust proxyWatch items
SGLANG_ENABLE_SPEC_V2=1.compose/upthat recreates the Qwen3.6 / FLUX proxies — SNI routing silently misroutes otherwise.service:vllm-qwen35-122b-1(or*-2) need to be updated toqwen35-1/-2andsource:sglang.Rollout plan
sglang:*metrics + a synthetic chat-completions call againstqwen35-122b.completions.near.ai.--speculative-*flags andSGLANG_ENABLE_SPEC_V2=1and try again — that's the only Qwen3.5-specific risk.