
sglang: bump to v0.5.12-cu129; migrate Qwen3.5-122B off vLLM #42

Closed
Evrard-Nil wants to merge 1 commit into main from feat/sglang-v0.5.12

@Evrard-Nil


Summary

  • Bumps GLM-5.1, Qwen3.6 (small-models), and FLUX (small-models) from a moving sglang:dev pin built 2026-04-30 (pre-0.5.11) to the released lmsysorg/sglang:v0.5.12-cu129@sha256:9e02c8e1… (built 2026-05-16).
  • Migrates Qwen3.5-122B-A10B off vLLM v0.20.0 onto the same sglang v0.5.12 image.
  • Stays on CUDA 12.9 via the -cu129 tag — driver story unchanged from today.

Why now

The current :dev pin is a snapshot of master between v0.5.10.post1 and v0.5.11. Both subsequent stable releases name our workloads explicitly:

  • v0.5.11 — Qwen 3.6 optimizations, GLM-5.1 work, Spec V2 default
  • v0.5.12 — FlashInfer 0.6.8.post1 → 0.6.11.post1, EAGLE-3 refinements, unified NVIDIA image

Changes per file

GLM-5.1.yaml

Single-line image bump.

small-models.yaml

  • Qwen3.6 service: image bump + freshen the comment (now mentions Spec V2 as default).
  • FLUX dockerfile_inline: bumps the FROM digest. python[diffusion] extras still apply.

Qwen3.5-122B.yaml — the substantive change

Translation of the existing vLLM flags:

| vLLM flag | sglang equivalent |
| --- | --- |
| --tensor-parallel-size 4 | --tp 4 |
| --gpu-memory-utilization 0.88 | --mem-fraction-static 0.88 |
| --max-model-len 1010000 | --context-length 262144 |
| --hf-overrides {yarn rope…} | dropped |
| --kv-cache-dtype fp8_e4m3 | --kv-cache-dtype fp8_e4m3 |
| --max-num-batched-tokens 8192 | --chunked-prefill-size 16384 (matches GLM-5.1/Qwen3.6) |
| --reasoning-parser qwen3 | same |
| --tool-call-parser qwen3_coder | same |
| --enable-auto-tool-choice | dropped (implicit in sglang when parser is set) |
| --enable-chunked-prefill | replaced by --enable-mixed-chunk |
| --enable-prefix-caching | dropped (radix cache is sglang default) |
| --enable-prompt-tokens-details | no direct equivalent; --enable-cache-report covers similar telemetry |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 | dropped |
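
For reviewers who want to sanity-check the translation, the table above can be written as a small lookup. This is only an illustrative sketch, not code from this PR: the `translate` helper and its dictionaries are hypothetical names, and the mapping covers only the flags listed in the table.

```python
# Hypothetical helper: encodes the flag table from this PR description.
# Flags that take a value and have an sglang counterpart.
RENAMED = {
    "--tensor-parallel-size": "--tp",
    "--gpu-memory-utilization": "--mem-fraction-static",
    "--max-model-len": "--context-length",
    "--kv-cache-dtype": "--kv-cache-dtype",
    "--max-num-batched-tokens": "--chunked-prefill-size",
    "--reasoning-parser": "--reasoning-parser",
    "--tool-call-parser": "--tool-call-parser",
}
# Values this PR changes instead of carrying over unchanged.
VALUE_OVERRIDES = {
    "--context-length": "262144",
    "--chunked-prefill-size": "16384",
}
# Boolean flags: None means dropped, otherwise the replacement flag.
BOOLEAN = {
    "--enable-auto-tool-choice": None,
    "--enable-chunked-prefill": "--enable-mixed-chunk",
    "--enable-prefix-caching": None,
    # No direct equivalent; the table names this as the closest substitute.
    "--enable-prompt-tokens-details": "--enable-cache-report",
}
# Flags that take a value and are dropped together with it.
DROPPED_WITH_VALUE = {"--hf-overrides"}


def translate(vllm_args):
    """Translate a tokenized vLLM argument list into sglang arguments."""
    out = []
    tokens = iter(vllm_args)
    for flag in tokens:
        if flag in RENAMED:
            new_flag = RENAMED[flag]
            value = next(tokens)
            out += [new_flag, VALUE_OVERRIDES.get(new_flag, value)]
        elif flag in BOOLEAN:
            if BOOLEAN[flag] is not None:
                out.append(BOOLEAN[flag])
        elif flag in DROPPED_WITH_VALUE:
            next(tokens)  # discard the flag's value as well
        else:
            raise ValueError(f"no translation recorded for {flag!r}")
    return out
```

Raising on unknown flags is deliberate: a flag missing from the table should be a review question rather than something silently passed through.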

Added (mirroring Qwen3.6):

  • --speculative-algorithm EAGLE + 3 steps / topk 1 / 4 draft tokens
  • --num-continuous-decode-steps 5
  • --model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}'
  • --enable-cache-report, --enable-metrics, --trust-remote-code, --log-requests-level 0, --served-model-name Qwen/Qwen3.5-122B-A10B
  • SGLANG_ENABLE_SPEC_V2=1 (explicit; default since 0.5.11)
  • kernel_cache volume for DeepGEMM JIT cache

Container/observability rename:

  • vllm-qwen35-122b-{1,2} → qwen35-{1,2}
  • Datadog source/service vllm → sglang; metric prefix filter vllm:.* → sglang:*
  • Anchor *vllm-qwen35-122b-common → *qwen35-122b-common
  • Health check + proxy VLLM_BACKEND_URLS updated to the new container names
  • Proxy container (vllm-proxy-qwen35, vllm-proxy-rs) name kept — it's still the same Rust proxy

Watch items

  • Qwen3.5 context drops from 1.01M to native 262144. Any client sending more than 262144 tokens (256k) of context will start failing; this was agreed in advance.
  • EAGLE on Qwen3.5 mirrors the Qwen3.6 flag set (no explicit draft model — relies on the model's native MTP head). If Qwen3.5-122B-A10B does not have one, sglang will fail at boot. Verify on one CVM before rolling the fleet.
  • Spec V2 is default since v0.5.11; GLM-5.1 doesn't use spec decoding so no behavior change there. Qwen3.6 already opted in via SGLANG_ENABLE_SPEC_V2=1.
  • Per [feedback_small_models_nginx_sni_cache], restart nginx on the small-models host after any compose/up that recreates the Qwen3.6 / FLUX proxies — SNI routing silently misroutes otherwise.
  • DataDog dashboards that filter by service:vllm-qwen35-122b-1 (or *-2) need to be updated to qwen35-1/-2 and source:sglang.
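
The dashboard update in the last item is a mechanical string rewrite, so it could be scripted. A minimal sketch, assuming dashboard queries are available as plain strings using the `service:` and `source:` tags named above; `rewrite_query` is a hypothetical helper, and fetching or saving dashboards is out of scope here.

```python
import re

# Matches only the two renamed backend containers; the lookahead keeps
# longer names (e.g. a hypothetical "...-1-canary") untouched.
_SERVICE = re.compile(r"\bservice:vllm-qwen35-122b-([12])(?![\w-])")
# Matches source:vllm exactly, so the proxy names that keep their
# "vllm-proxy-*" prefix are left alone.
_SOURCE = re.compile(r"\bsource:vllm(?![\w-])")


def rewrite_query(query: str) -> str:
    """Rewrite old vLLM tags in a dashboard query to the new names."""
    query = _SERVICE.sub(r"service:qwen35-\1", query)
    return _SOURCE.sub("source:sglang", query)
```

For example, `rewrite_query("service:vllm-qwen35-122b-1 source:vllm")` yields `service:qwen35-1 source:sglang`, while a query on `vllm-proxy-rs` is returned unchanged.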

Rollout plan

  1. Pick one Qwen3.5 host, deploy via compose-manager, watch sglang:* metrics + a synthetic chat-completions call against qwen35-122b.completions.near.ai.
  2. If EAGLE fails at boot, strip the four --speculative-* flags and SGLANG_ENABLE_SPEC_V2=1 and try again — that's the only Qwen3.5-specific risk.
  3. Roll GLM-5.1 (4 hosts) one at a time via the standard rolling-deploy pattern.
  4. Roll small-models (Qwen3.6 + FLUX) on the 2 hosts; restart nginx afterward on each.
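
The fallback in step 2 amounts to filtering the launch arguments and the environment. A rough sketch of that filter follows; `strip_speculative` is a hypothetical helper, and it assumes every `--speculative-*` flag in the stanza takes exactly one value, which should be checked against the actual compose file.

```python
def strip_speculative(args, env):
    """Return copies of args and env with EAGLE speculative decoding removed.

    Assumption: each --speculative-* flag is followed by exactly one value
    (or carries it inline as --flag=value).
    """
    kept = []
    skip_value = False
    for token in args:
        if skip_value:
            # This token is the value of a flag we just dropped.
            skip_value = False
            continue
        if token.startswith("--speculative-"):
            # "--flag=value" has its value inline; "--flag value" does not.
            skip_value = "=" not in token
            continue
        kept.append(token)
    # Drop the explicit Spec V2 opt-in along with the flags.
    new_env = {k: v for k, v in env.items() if k != "SGLANG_ENABLE_SPEC_V2"}
    return kept, new_env
```

Returning new objects instead of mutating the inputs keeps the original stanza available for comparison if the retry also fails.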

- GLM-5.1, Qwen3.6 (small-models), FLUX (small-models): swap the
  pre-v0.5.11 :dev pin (e1eee3f, built 2026-04-30) and the very old
  :latest-ish FLUX pin (8ece90ad, 2026-01-21) for v0.5.12-cu129
  (9e02c8e1, 2026-05-16). cu129 keeps the CUDA 12.9 driver story
  unchanged; staying off the CUDA 13 default introduced in v0.5.11
  sidesteps the host-driver question for now.
- Qwen3.5-122B-A10B: replace the vLLM v0.20.0 stanza with an sglang
  service modeled on the Qwen3.6 cookbook. 4-way TP per instance,
  fp8 KV cache, qwen3 reasoning + qwen3_coder tool parsers preserved.
  EAGLE spec decoding added (Spec V2 is default since 0.5.11).
- Qwen3.5 context drops from 1.01M (yarn rope hf-override) to native
  262144 — yarn override removed. Clients using >256k context will
  start failing; agreed in advance.
- Containers renamed vllm-qwen35-122b-{1,2} → qwen35-{1,2}; Datadog
  source/service/metrics flipped vllm → sglang. Proxy container
  (vllm-proxy-qwen35, vllm-proxy-rs) keeps its name.

Note: EAGLE on Qwen3.5 mirrors the 3.6 flag set, which relies on the
model's native MTP head. If Qwen3.5 lacks one, sglang will fail at
boot — verify on a single CVM before rolling.
@Evrard-Nil


Closing — keeping changes on local branch only.

@Evrard-Nil Evrard-Nil closed this May 22, 2026
@Evrard-Nil Evrard-Nil deleted the feat/sglang-v0.5.12 branch May 22, 2026 10:01