sglang: bump to v0.5.12-cu129; migrate Qwen3.5-122B off vLLM by Evrard-Nil · Pull Request #42 · nearai/cvm-compose-files

Evrard-Nil · 2026-05-22T09:55:50Z

Summary

Bumps GLM-5.1, Qwen3.6 (small-models), and FLUX (small-models) from a moving sglang:dev pin built 2026-04-30 (pre-0.5.11) to the released lmsysorg/sglang:v0.5.12-cu129@sha256:9e02c8e1… (built 2026-05-16).
Migrates Qwen3.5-122B-A10B off vLLM v0.20.0 onto the same sglang v0.5.12 image.
Stays on CUDA 12.9 via the -cu129 tag — driver story unchanged from today.

Why now

The current :dev pin is a snapshot of master between v0.5.10.post1 and v0.5.11. Both subsequent stable releases name our workloads explicitly:

v0.5.11 — Qwen 3.6 optimizations, GLM-5.1 work, Spec V2 default
v0.5.12 — FlashInfer 0.6.8.post1 → 0.6.11.post1, EAGLE-3 refinements, unified NVIDIA image

Changes per file

`GLM-5.1.yaml`

Single-line image bump.

`small-models.yaml`

Qwen3.6 service: image bump + freshen the comment (now mentions Spec V2 as default).
FLUX dockerfile_inline: bumps the FROM digest. python[diffusion] extras still apply.

`Qwen3.5-122B.yaml` — the substantive change

Translation of the existing vLLM flags:

| vLLM flag | sglang equivalent |
| --- | --- |
| `--tensor-parallel-size 4` | `--tp 4` |
| `--gpu-memory-utilization 0.88` | `--mem-fraction-static 0.88` |
| `--max-model-len 1010000` | `--context-length 262144` ⚠ |
| `--hf-overrides {yarn rope…}` | dropped ⚠ |
| `--kv-cache-dtype fp8_e4m3` | `--kv-cache-dtype fp8_e4m3` |
| `--max-num-batched-tokens 8192` | `--chunked-prefill-size 16384` (matches GLM-5.1/Qwen3.6) |
| `--reasoning-parser qwen3` | same |
| `--tool-call-parser qwen3_coder` | same |
| `--enable-auto-tool-choice` | dropped (implicit in sglang when parser is set) |
| `--enable-chunked-prefill` | replaced by `--enable-mixed-chunk` |
| `--enable-prefix-caching` | dropped (radix cache is sglang default) |
| `--enable-prompt-tokens-details` | no direct equivalent; `--enable-cache-report` covers similar telemetry |
| `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` | dropped |
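
The flag mapping above can be sketched as a small translation helper, which may be handy for diffing the old and new command lines during review. This is a minimal illustration, not code from the PR: the `translate` function and the dictionary names are hypothetical, and the mapping of `--enable-prompt-tokens-details` to `--enable-cache-report` follows the "similar telemetry" note rather than a true equivalence. The `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable is not an argv flag and is simply not carried over.

```python
# Hypothetical helper mirroring the translation table; not part of the PR.

# Value-taking flags that are renamed; the value carries over unchanged.
RENAMED = {
    "--tensor-parallel-size": "--tp",
    "--gpu-memory-utilization": "--mem-fraction-static",
}
# Value-taking flags whose value is replaced by the one chosen in this PR.
OVERRIDDEN = {
    "--max-model-len": ["--context-length", "262144"],
    "--max-num-batched-tokens": ["--chunked-prefill-size", "16384"],
}
# Value-taking flags that are identical on both sides.
UNCHANGED = {"--kv-cache-dtype", "--reasoning-parser", "--tool-call-parser"}
# Value-taking flags dropped together with their value.
DROPPED = {"--hf-overrides"}
# Boolean flags: None means dropped, otherwise the sglang replacement.
BOOLEAN = {
    "--enable-auto-tool-choice": None,
    "--enable-prefix-caching": None,
    "--enable-chunked-prefill": "--enable-mixed-chunk",
    "--enable-prompt-tokens-details": "--enable-cache-report",  # approximate
}


def translate(vllm_args):
    """Translate a vLLM argv list into the sglang argv list per the table."""
    out, i = [], 0
    while i < len(vllm_args):
        flag = vllm_args[i]
        if flag in BOOLEAN:
            if BOOLEAN[flag] is not None:
                out.append(BOOLEAN[flag])
            i += 1
            continue
        known = flag in RENAMED or flag in OVERRIDDEN or flag in UNCHANGED or flag in DROPPED
        if not known:
            raise ValueError(f"no mapping for {flag}")
        if i + 1 >= len(vllm_args):
            raise ValueError(f"{flag} expects a value")
        value = vllm_args[i + 1]
        if flag in RENAMED:
            out += [RENAMED[flag], value]
        elif flag in OVERRIDDEN:
            out += OVERRIDDEN[flag]
        elif flag in UNCHANGED:
            out += [flag, value]
        # DROPPED flags contribute nothing to the output.
        i += 2
    return out


if __name__ == "__main__":
    old = ["--tensor-parallel-size", "4", "--max-model-len", "1010000",
           "--enable-prefix-caching", "--enable-chunked-prefill"]
    print(" ".join(translate(old)))
```

Raising on unknown flags is deliberate: a flag that is silently passed through would hide exactly the kind of mismatch the ⚠ rows call out.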

Added (mirroring Qwen3.6):

--speculative-algorithm EAGLE with 3 steps, topk 1, and 4 draft tokens
--num-continuous-decode-steps 5
--model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}'
--enable-cache-report, --enable-metrics, --trust-remote-code, --log-requests-level 0, --served-model-name Qwen/Qwen3.5-122B-A10B
SGLANG_ENABLE_SPEC_V2=1 (explicit; default since 0.5.11)
kernel_cache volume for DeepGEMM JIT cache

Container/observability rename:

vllm-qwen35-122b-{1,2} → qwen35-{1,2}
Datadog source/service vllm → sglang; metric prefix filter vllm:.* → sglang:*
Anchor *vllm-qwen35-122b-common → *qwen35-122b-common
Health check + proxy VLLM_BACKEND_URLS updated to the new container names
Proxy container (vllm-proxy-qwen35, vllm-proxy-rs) name kept — it's still the same Rust proxy
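
As a rough illustration of the rename, the sketch below rewrites tag filters in a saved query string. It assumes, hypothetically, that queries reference the containers as `service:<container name>` and the log source as `source:vllm`; the `rewrite_query` helper is not part of the PR. The lookarounds keep the proxy names (`vllm-proxy-qwen35`, `vllm-proxy-rs`) untouched, since those are not renamed.

```python
import re

# Hypothetical helper: rewrite tag filters after the container/source rename.
# Matches only the two renamed backends, never the vllm-proxy-* containers.
SERVICE_RE = re.compile(r"(?<![\w-])service:vllm-qwen35-122b-([12])(?![\w-])")
# Matches the bare `source:vllm` tag, not e.g. `source:vllm-proxy-rs`.
SOURCE_RE = re.compile(r"(?<![\w-])source:vllm(?![\w-])")


def rewrite_query(query: str) -> str:
    """Return the query with old service/source tags replaced by the new ones."""
    query = SERVICE_RE.sub(r"service:qwen35-\1", query)
    query = SOURCE_RE.sub("source:sglang", query)
    return query


if __name__ == "__main__":
    print(rewrite_query("source:vllm service:vllm-qwen35-122b-1"))
```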

Watch items

⚠ Qwen3.5 context drops from 1.01M to the native 262144. Any client sending more than 256k tokens of context will start failing; this was agreed in advance.
⚠ EAGLE on Qwen3.5 mirrors the Qwen3.6 flag set (no explicit draft model — relies on the model's native MTP head). If Qwen3.5-122B-A10B does not have one, sglang will fail at boot. Verify on one CVM before rolling the fleet.
Spec V2 is default since v0.5.11; GLM-5.1 doesn't use spec decoding so no behavior change there. Qwen3.6 already opted in via SGLANG_ENABLE_SPEC_V2=1.
Per [feedback_small_models_nginx_sni_cache], restart nginx on the small-models host after any compose/up that recreates the Qwen3.6 / FLUX proxies — SNI routing silently misroutes otherwise.
Datadog dashboards that filter by service:vllm-qwen35-122b-1 (or *-2) need to be updated to qwen35-1/-2 and source:sglang.
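
For the context-length watch item, a client-side pre-flight check could look like the sketch below. It assumes the limit applies to prompt tokens plus requested output tokens, which is the usual convention but is not confirmed here for sglang; the `fits_context` helper and its parameters are hypothetical.

```python
# 262144 is the --context-length from this PR, i.e. 256 * 1024 tokens.
NATIVE_CONTEXT = 256 * 1024


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 limit: int = NATIVE_CONTEXT) -> bool:
    """True if the request fits the window (assumes prompt + output share it)."""
    return prompt_tokens + max_new_tokens <= limit


if __name__ == "__main__":
    # A request sized for the old 1.01M window no longer fits.
    print(fits_context(prompt_tokens=1_010_000, max_new_tokens=0))
```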

Rollout plan

Pick one Qwen3.5 host, deploy via compose-manager, watch sglang:* metrics + a synthetic chat-completions call against qwen35-122b.completions.near.ai.
If EAGLE fails at boot, strip the four --speculative-* flags and SGLANG_ENABLE_SPEC_V2=1 and try again — that's the only Qwen3.5-specific risk.
Roll GLM-5.1 (4 hosts) one at a time via the standard rolling-deploy pattern.
Roll small-models (Qwen3.6 + FLUX) on the 2 hosts; restart nginx afterward on each.
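
The EAGLE fallback in the plan above (strip the four --speculative-* flags and SGLANG_ENABLE_SPEC_V2=1) can be sketched as a filter over the launch arguments and environment. This is an illustration only: `strip_spec_decoding` is a hypothetical helper, the spelled-out flag names in the example are a reading of the "3 steps / topk 1 / 4 draft tokens" shorthand, and the code assumes every `--speculative-*` flag takes exactly one value.

```python
def strip_spec_decoding(args, env):
    """Return copies of args/env with speculative decoding settings removed."""
    kept = []
    skip_value = False
    for token in args:
        if skip_value:
            # This token is the value of a stripped `--speculative-*` flag.
            skip_value = False
            continue
        if token.startswith("--speculative-"):
            # `--flag=value` carries its value inline; `--flag value` does not.
            skip_value = "=" not in token
            continue
        kept.append(token)
    new_env = {k: v for k, v in env.items() if k != "SGLANG_ENABLE_SPEC_V2"}
    return kept, new_env


if __name__ == "__main__":
    args = ["--tp", "4",
            "--speculative-algorithm", "EAGLE",
            "--speculative-num-steps", "3",
            "--speculative-eagle-topk", "1",
            "--speculative-num-draft-tokens", "4",
            "--num-continuous-decode-steps", "5"]
    env = {"SGLANG_ENABLE_SPEC_V2": "1"}
    print(strip_spec_decoding(args, env))
```

The helper returns new objects rather than mutating its inputs, so the original configuration stays available for the retry once the MTP head question is settled.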

- GLM-5.1, Qwen3.6 (small-models), FLUX (small-models): swap the pre-v0.5.11 :dev pin (e1eee3f, built 2026-04-30) and the very old :latest-ish FLUX pin (8ece90ad, 2026-01-21) for v0.5.12-cu129 (9e02c8e1, 2026-05-16). Staying on cu129 keeps the CUDA 12.9 driver story unchanged, which sidesteps the host-driver question raised by the CUDA 13 default in v0.5.11 for now.
- Qwen3.5-122B-A10B: replace the vLLM v0.20.0 stanza with an sglang service modeled on the Qwen3.6 cookbook. 4-way TP per instance, fp8 KV cache, qwen3 reasoning + qwen3_coder tool parsers preserved. EAGLE spec decoding added (Spec V2 is default since 0.5.11).
- Qwen3.5 context drops from 1.01M (yarn rope hf-override) to native 262144; the yarn override is removed. Clients using >256k context will start failing; agreed in advance.
- Containers renamed vllm-qwen35-122b-{1,2} → qwen35-{1,2}; Datadog source/service/metrics flipped vllm → sglang. Proxy container (vllm-proxy-qwen35, vllm-proxy-rs) keeps its name.

Note: EAGLE on Qwen3.5 mirrors the 3.6 flag set, which relies on the model's native MTP head. If Qwen3.5 lacks one, sglang will fail at boot, so verify on a single CVM before rolling.

Evrard-Nil · 2026-05-22T10:01:01Z

Closing — keeping changes on local branch only.

Evrard-Nil closed this May 22, 2026

Evrard-Nil deleted the feat/sglang-v0.5.12 branch May 22, 2026 10:01

Evrard-Nil wants to merge 1 commit into main from feat/sglang-v0.5.12
