Releases: sgl-project/sglang
v0.5.7
Highlights
- New Model Support:
- Day 0 Support for Mimo-V2-Flash: #15207, https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
- Day 0 Support for Nemotron-Nano-v3: https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/
- Day 0 Support for LLaDA 2.0: https://lmsys.org/blog/2025-12-19-diffusion-llm/
- [SGLang-Diffusion] Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered
- EAGLE 3 speculative decoding draft models for popular models: https://lmsys.org/blog/2025-12-23-spec-bundle-phase-1/
- Model Gateway v0.3.0 Release: https://docs.sglang.io/advanced_features/sgl_model_gateway.html
- Scalable pipeline parallelism with dynamic chunking support for ultra-long contexts (PP Refactor Roadmap #11857)
- Encoder Disaggregation for Multi-modal models (Roadmap #15118)
- SGLang-Diffusion:
- Set `--dit-layerwise-offload true` to reduce peak VRAM usage by up to 30GB and improve performance by up to 58% for all models
- Significantly reduce the latency of Qwen-Image-Edit, making it one of the fastest among all open-source solutions. More improvements are on the way
- Add support for AMD/4090/5090, along with additional attention choices (sage-attn, sage-attn3), more parallelism options (TP), and enhancements to the HTTP API (Google Vertex supported)
- Cache-dit integration to improve performance by up to 165%
What's Changed
- Refactor custom allreduce logics by @iforgetmyname in #13710
- [Doc] Update DeepSeek-V3.2 document by @Fridge003 in #14321
- Feature/support distilled vae generic by @baonudesifeizhai in #14195
- [Performance] Optimize NSA Indexer K/S Buffer Access with Fused Triton Kernels by @Johnsonms in #13812
- Update CODEOWNERS for multimodal by @mickqian in #14329
- [bug fix] use npu phy id in container env by @jinke446 in #14266
- [model-gateway] multimodality initialization by @slin1237 in #13350
- [Doc] Fix DeepSeek V32 Doc by @Fridge003 in #14336
- sync attention, deepseek doc by @b8zhong in #14335
- [PD] Support decode pp for PD disaggregation by @ShangmingCai in #14265
- [model-gateway] add image processor and transformer structure by @slin1237 in #14344
- [CPU] Support chunk_gated_delta_rule kernel for Qwen3-Next by @Valentine233 in #12441
- [bugfix] Fix prefill tbo disabled when --deepep-mode=auto by @yuhyao in #14333
- [CI] update estimated elapsed time of some unittests by @ch-wan in #14347
- [NPU] bug fix: w_vc need contiguous for NPU batch_matmul_transpose ops by @ZhengdQin in #13980
- [bugfix] NpuFuseEPMoE miss initialization parameters by @chenxu140 in #14295
- [Ascend] fix AscendAttnMaskBuilder bug to support float16 models by @MichelleWu351 in #14271
- Tiny adjust CI testcases by @hnyls2002 in #14362
- [NPU][Doc] updated installation guide for Ascend NPU by @VDV1985 in #13585
- Feature/add vae path to cli doc#14004 by @baonudesifeizhai in #14355
- [CPU] add fused_qkvzba_split_reshape_cat kernel for Qwen3-next by @blzheng in #12330
- Single Batch Overlap for MoE Models by @Sulfur6 in #9660
- Move custom_ops under layers; move _custom_ops.py → custom_all_reduce_ops.py by @merrymercy in #14326
- [model-gateway] add llava model image processor and tests by @slin1237 in #14371
- ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back by @sunxxuns in #14226
- [Tiny]Small fixes in deepseek v32 doc by @Fridge003 in #14372
- Fix validation to detect missing model files before loading by @alisonshao in #14253
- [model-gateway] add qwen2_vl model image processor and tests by @slin1237 in #14374
- [model-gateway] add qwen2.5_vl model image processor by @slin1237 in #14375
- Revert "Revert "enable csgmv automatically on cuda"" by @b8zhong in #14277
- [model-gateway] use worker crate in openai router by @slin1237 in #14330
- [model-gateway] add qwen3_vl model image processor by @slin1237 in #14377
- Fix sgl-router silently parse selector wrongly causing OME fail to discover pods by @fzyzcjy in #14359
- [sgl-kernel][Feat][B200][1/N]Support MXFP8 Grouped GEMM in Blackwell by @HydraQYH in #13731
- [CPU] document updates by @ZailiWang in #14272
- Support PP x PD decode with nixl backend by @bluecoffee8 in #14392
- [VLM] Introduce Cache for positional embedding ids for Qwen-VL family by @yuan-luo in #14292
- use faster covnersion from float8_e4m3fn to bfloat16 by @mingfeima in #12316
- [model-gateway][doc] Add STDIO Explicitly to Example in README by @xuwenyihust in #14393
- [CPU] add support for mamba causal conv1d for qwen3-next by @mingfeima in #12309
- [model-gateway] add phi3 vision image processor by @slin1237 in #14381
- [model-gateway] introduce provider in openai router by @slin1237 in #14394
- [AMD] fix the regression issue for DeepseekV3 on MI300 by @yctseng0211 in #14383
- [NPU][1/N] NPU basic functions refactor and new modelslim quant type by @iforgetmyname in #13359
- [CPU] Optimize small oc GEMM for Qwen3-next on CPU by @jianan-gu in #12446
- Try to fix B200 DeepEP error by @fzyzcjy in #14399
- [1/2] Add rope kernel in sgl-kernel by @Qiaolin-Yu in #14334
- [bug fix] fix ima with get_mla_kv_buffer_kernel overflow by @XucSh in #14224
- Add Mistral Large 3 support. by @dcampora in #14213
- [diffusion] fix gen video doc by @yeahdongcn in #14409
- Add 'NPU' to the runtime exception message in `get_device` by @rauletorresc in #14225
- Add mooncake `transfer_engine_bench` into manual test by @hnyls2002 in #14429
- [model-gateway] add phi4 vision image processor by @slin1237 in #14430
- diffusion: Add Configurable Generator Device and Seed Support via API by @niehen6174 in #14366
- [model-gateway] introduce request ctx for oai router by @slin1237 in #14434
- [NPU]add nightly-test-npu by @cherryblo in #14143
- [model-gateway] add llama4 vision image processor by @slin1237 in #14438
- [model-gateway] extract conversation out of oai router by @slin1237 in #14440
- [DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 by @YAMY1234 in #14325
- [model-gateway] move oai header util to router header util by @slin1237 in #14441
- [FIX] trtllm-moe-fp4-renorm for Qwen series models by @samuellees in #14350
- add doc for quantized kv cache by @b8zhong in #14348
- fix: Correct environment variable syntax in docker-compose configuration by @yankay in #8287
- [model-gateway] move all responses api event from oai to proto by @slin1237 in #14446
- [model-gateway] add mistral 3 image processor by @slin1237 in #14445
- [model-gateway] grpc to leverage event type by @slin1237 in #14450
- ministral3 by @JustinTong0323 in #14251
- [Bug] fix not desired disable fused share experts caused by rocm logic by @ocss884 in #14432
- Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE by @sglang-bot in #14421
- further optimze model load by @zyksir in #13836
- Add CI permissions for user 'yushengsu-thu' by @alisonshao in #14468
- [ez] Fix ty...
Release Gateway-v0.3.0
🚀 SGLang Model Gateway v0.3.0 Released!
We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!
⚠️ Breaking Changes
📊 Metrics Architecture Redesigned
Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.
🔧 UUID-Based Worker Resource Management
Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.
✨ New Features
🌐 Unified Inference Gateway Mode (IGW)
Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:
- gRPC router (PD and regular mode)
- HTTP router (PD and regular mode)
- OpenAI router
Auto-enabled with service discovery. Deploy once, route everything - handle all traffic patterns across your entire inference fleet from a single gateway instance.
🔤 Tokenize/Detokenize HTTP Endpoints
- Direct HTTP endpoints for tokenization operations
- Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
- TokenizerRegistry for efficient dynamic loading
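Conceptually, the tokenize/detokenize pair exposes an invertible mapping between text and token IDs. A minimal sketch of that contract, using a toy whitespace tokenizer (the real endpoints delegate to per-model tokenizers through the TokenizerRegistry; the function names and vocabulary here are illustrative assumptions, not the gateway's HTTP schema):

```python
# Toy tokenizer illustrating the tokenize/detokenize round-trip contract.
# All names and the vocabulary scheme are illustrative assumptions; the real
# endpoints delegate to per-model tokenizers via the TokenizerRegistry.

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign a stable ID to every whitespace-separated token."""
    vocab: dict[str, int] = {}
    for text in corpus:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to token IDs (what a tokenize endpoint returns)."""
    return [vocab[word] for word in text.split()]

def detokenize(ids: list[int], vocab: dict[str, int]) -> str:
    """Map token IDs back to text (what a detokenize endpoint returns)."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

vocab = build_vocab(["the quick brown fox"])
ids = tokenize("quick fox", vocab)
assert detokenize(ids, vocab) == "quick fox"  # round-trip holds
```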
🧠 Parser Endpoints
- `/parse/reasoning` - Parse reasoning outputs
- `/parse/function_call` - Parse function call responses
- GLM-4 function call parser - Contributed directly by the GLM team for latest GLM models
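As a conceptual sketch of what reasoning parsing does: it splits the model's think block from the final answer. The `<think>` delimiters and result field names below are illustrative assumptions, not necessarily the gateway's wire format:

```python
import re

def parse_reasoning(text: str) -> dict:
    """Split a <think>...</think> reasoning block from the final answer.
    Delimiters and field names are illustrative assumptions, not the
    gateway's actual response schema."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    content = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return {"reasoning": reasoning, "content": content}

out = parse_reasoning("<think>compare options</think>Use option B.")
assert out == {"reasoning": "compare options", "content": "Use option B."}
```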
📊 Embeddings Support
Native embeddings endpoint for gRPC router - expand beyond text generation to embedding workloads.
🔐 Server-Side TLS Support
Secure your gateway deployments with native TLS support.
🌐 Go Implementation, contributed by the iFlytek MaaS team
Complete Go SGLang Model Gateway with OpenAI-compatible API server - bringing SGLang to the Go ecosystem!
⚡ Major Enhancements
Control Plane - Workflow Engine
Intelligent lifecycle orchestration with:
- DAG-based parallel execution with pre-computed dependency graphs
- Concurrent event processing for maximum throughput
- Modular add/remove/update workflows
Performance Optimization
- Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
- Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
- Optimized router management: Improved selection algorithms and state management
Resilience & Reliability
- Retry and circuit breaker support for OpenAI and gRPC routers
- Enhanced circuit breaker with better state management
- Graceful shutdown for TLS and non-TLS servers
- Unified error responses with error codes and X-SMG-Error-Code headers
Infrastructure
- Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
- Custom Prometheus duration buckets
- Improved logging across all modules
🐛 Bug Fixes & Stability
- Fixed cache-aware routing in gRPC mode
- Resolved load metric tracking and double-decrease issues for cache-aware load balancing
- Improved backward compatibility for GET endpoints
- Fixed gRPC scheduler launcher issues
- Fixed token bucket negative duration panics
- Resolved MCP server initialization issues
📚 Documentation
Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.
⚠️ Migration checklist:
- Update Prometheus dashboards for new metrics
- Update worker API integrations for UUID-based management
- Review new error response format
⚡ Built for speed. Engineered for scale. Production-proven.
Gateway Changes (108 commits)
- [model-gateway] release smg 0.3.0 (#15781) by @slin1237 in #15781
- [model-gateway] Fix logging module name, parse endpoint context, and tokenizer factory (#15782) by @slin1237 in #15782
- [model-gateway] Implement Zero-Copy Vision Tensor Access (#15750) by @ppraneth in #15750
- [model-gateway] Fix IGW routing and optimize RouterManager (#15741) by @slin1237 in #15741
- Fix smg_http_requests_total semantics (#15655) by @fzyzcjy in #15655
- [model-gateway]Enable IGW mode with gRPC router and auto enable IGW when service discovery is turned on (#15459) by @YouNeedCryDear in #15459
- [docs] major SGL Model Gateway documentation update (#15715) by @slin1237 in #15715
- [model-gateway] add back router worker health metric and fix init state (#15622) by @fzyzcjy in #15622
- [mode;-gateway] add back fixes of incorrect metrics after worker removal (#15624) by @fzyzcjy in #15624
- [model-gateway] Add tokenize/detokenize HTTP endpoints and tokenizer management (#15702) by @slin1237 in #15702
- [model-gateway] Fix tokenizer caching and improve error handling (#15695) by @slin1237 in #15695
- [model-gateway]: add gRPC router embeddings endpoint implementation (#15273) by @Ratish1 in #15273
- [model-gateway] Optimize router selection with lock-free snapshots (#15672) by @ppraneth in #15672
- [model-gateway] Replace tokenizer with tokenizer registry for dynamic tokenizer loading in gRPC router (#12968) by @YouNeedCryDear in #12968
- Improve engine customization interface (#15635) by @merrymercy in #15635
- Tiny add back missing router per attempt response metric (#15621) by @fzyzcjy in #15621
- Fix router gRPC mode launch error caused by async loading (#15368) by @fzyzcjy in #15368
- [model-gateway] return 503 when all workers are circuit-broken (#15611) by @slin1237 in #15611
- [model-gateway] add retry support to OpenAI router chat endpoint (#15589) by @slin1237 in #15589
- Optimize Rust CI builds with proper sccache configuration (#15581) by @slin1237 in #15581
- [model-gateway] add retry and circuit breaker support to gRPC routers (#15585) by @slin1237 in #15585
- [model-gateway] refactor WorkerManager with fan_out helper and thin handlers (#15583) by @slin1237 in #15583
- [model-gateway] add WorkerService abstraction for worker business logic (#15580) by @slin1237 in #15580
- [model-gateway] minor code clean up (#15578) by @slin1237 in #15578
- [model-gateway] Use UUIDs for router-managed worker resources (#15540) by @alphabetc1 in #15540
- [model-gateway] /parse/easoning and parse/function_call for sgl-model-gateway (#15568) by @UbeCc in #15568
- [model-gateway]: Tool parser for glm47 (#15520) by @UbeCc in #15520
- [model-gateway] bugfix: backward compatibility for GET endpoints (#15413) by @alphabetc1 in #15413
- [model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching (#15515) by @ppraneth in #15515
- [model-gateway] add model gateway multi-arch docker build, test and document docker image (#15544) by @slin1237 in #15544
- [model-gateway] Implement RAII load guard with response body attachment (#15507) by @slin1237 in #15507
- [router] bugfix: cache_aware in grpc inbalance forward (#15473) by @llfl in #15473
- [model-gateway] simplify workflow engine backoff and reduce duplicate reads (#15505) by @slin1237 in #15505
- [model-gateway] Run workflow event subscribers concurrently (#15504) by @slin1237 in #15504
- [model-gateway] Optimize workflow engine with pre-computed dependency graph (#15503) by @slin1237 in #15503
- [model-gateway] Improve logging across core modules (#15497) by @slin1237 in #15497
- [model-gateway] Improve logging in policies module (#15496) by @slin1237 in #15496
- [model-gateway] Improve logging in data_connector module (#15495) by @slin1237 in #15495
- [model-gateway] refactor: extract common graceful shutdown code before TLS branch (#15494) by @slin1237 in #15494
- [model-gateway] fix graceful shutdown for TLS/Non-TLS server (#15491) by @slin1237 in #15491
- [model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups (#15361) by @slin1237 in #15361
- [model-gateway] optimize worker registry and reduce lock contention in grpc client fetch (#15336) by @slin1237 in #15336
- [model-gateway] reduce cpu overhead (#15316) by @slin1237 in #15316
- Super tin...
Release Gateway-v0.2.4
🚀 SGLang Model Gateway v0.2.4 Released!
We're excited to announce SGLang Model Gateway v0.2.4 – a massive release focused on performance, security, and production-ready observability!
✨ Headline Features
⚡ Major Performance Optimizations
We've invested heavily in performance across the entire stack:
- Optimized radix tree for cache-aware load balancing – Smarter routing decisions with lower overhead
- Tokenizer optimization – Dramatically reduced CPU and memory footprint during tokenization
- Core module optimization – HTTP and gRPC routers now run leaner and faster
- Efficient OTEL implementation – Production-grade observability with minimal performance impact
🔌 Industry-First WASM Middleware Support
Programmable middleware using WebAssembly! Extend your gateway with safe, isolated plugins. Build custom routing logic, transform requests/responses, or integrate proprietary systems – all without touching core code. Your gateway, your rules.
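The middleware model described above can be pictured as a chain of request transforms. This minimal Python sketch shows only the pattern; the actual feature runs sandboxed WASM modules with their own ABI, so every name below is an illustrative assumption:

```python
# Minimal middleware-chain pattern, illustrating what request-transforming
# plugins do conceptually. The real feature executes WASM modules; all
# function and field names here are illustrative assumptions.

def add_trace_header(request: dict) -> dict:
    """Example transform: attach a tracing header to the request."""
    request.setdefault("headers", {})["x-trace-id"] = "trace-123"
    return request

def block_large_bodies(request: dict) -> dict:
    """Example guard: reject oversized request bodies."""
    if len(request.get("body", "")) > 1024:
        raise ValueError("request body too large")
    return request

def run_middleware(request: dict, chain) -> dict:
    """Apply each middleware in order; any one may transform or reject."""
    for middleware in chain:
        request = middleware(request)
    return request

req = run_middleware({"body": "hello"}, [add_trace_header, block_large_bodies])
assert req["headers"]["x-trace-id"] == "trace-123"
```

The sandboxing is the point of using WASM here: a plugin can transform traffic without being able to touch the gateway's process state.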
📊 Production-Grade Observability
Full OpenTelemetry integration with distributed tracing for both HTTP and gRPC. Track requests across your entire inference stack with native trace context propagation. Finally, real visibility into your LLM infrastructure.
⚡ Built for speed. Hardened for security. Ready for production.
Gateway Changes (98 commits)
- [model-gateway] release gateway 0.2.4 (#14763) by @slin1237 in #14763
- [Perf] Optimize radix tree for cache-aware load balancin (#14758) by @slin1237 in #14758
- [SMG] perf: optimize tokenizer for reduced CPU and memory overhead (#14752) by @slin1237 in #14752
- [model-gateway] optimize core modules (#14751) by @slin1237 in #14751
- Tiny extract select_worker_min_load (#14648) by @fzyzcjy in #14648
- [ci][smg] fix docker release ci and add it to pr test (#14683) by @slin1237 in #14683
- Tiny support sgl-router http response status code metrics (#14689) by @fzyzcjy in #14689
- [SMG]feat: implement TokenGuardBody for managing token return (#14653) by @jimmy-evo in #14653
- [model-gateway] add OTEL integration to grpc router (#14671) by @slin1237 in #14671
- Fix cache-aware router should pick min load instead of min tenant size (#14650) by @fzyzcjy in #14650
- [model-gateway] Optimize memory usage in HTTP router (#14667) by @slin1237 in #14667
- [model-gateway] fix WASM arbitrary file read security vol (#14664) by @slin1237 in #14664
- [model-gateway] reduce cpu overhead in grpc router (#14663) by @slin1237 in #14663
- [model-gateway] reducing cpu overhead in various of places (#14658) by @slin1237 in #14658
- Fix dp-aware incompatible with service-discovery (#14629) by @fzyzcjy in #14629
- Super tiny fix unused code in router (#14618) by @fzyzcjy in #14618
- [model-gateway] fix WASM unbounded request/response body read vuln (#14612) by @slin1237 in #14612
- Super tiny remove unused select_worker_pair (#14609) by @fzyzcjy in #14609
- [model-gateway] refactor otel to be more efficient (#14604) by @slin1237 in #14604
- Tiny fix missing policy decision recording (#14605) by @fzyzcjy in #14605
- [model-gateway] fix WASM memory limit per module (#14600) by @slin1237 in #14600
- [model-gateway] reorganize metrics, logging, and otel to its own module (#14590) by @slin1237 in #14590
- [model-gateway] Fixed WASM Security Vulnerability - Execution Timeout (#14588) by @slin1237 in #14588
- [model-gateway] extra accumulator and tool handler in oai router (#14587) by @slin1237 in #14587
- [Bug fix] Add /model_info endpoint to mini_lb (#14535) by @alisonshao in #14535
- [model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) (#13897) by @sufeng-buaa in #13897
- [model-gateway] fix left over sgl-router names in wasm (#14514) by @slin1237 in #14514
- [model-gateway] fix logs in smg workflow (#14513) by @slin1237 in #14513
- [model-gateway] fix left over sgl-router names to sgl-model-gateway (#14512) by @slin1237 in #14512
- [model-gateway] change sgl-router to sgl-model-gateway (#14312) by @slin1237 in #14312
- [model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT (#14405) by @xuwenyihust in #14405
- Fix removing worker will make it healthy forever in prometheus metrics (#14420) by @fzyzcjy in #14420
- [model-gateway] fix server info comment (#14508) by @slin1237 in #14508
- [model-gateway] reorganized conversation handler (#14507) by @slin1237 in #14507
- [model-gateway] Add WASM support for middleware (#12471) by @tonyluj in #12471
- [model-gateway] move conversation to first class routing (#14506) by @slin1237 in #14506
- [misc] add model arch and type to server info and use it for harmony (#14456) by @slin1237 in #14456
- [model-gateway] grpc to leverage event type (#14450) by @slin1237 in #14450
- [model-gateway] add mistral 3 image processor (#14445) by @slin1237 in #14445
- [model-gateway] move all responses api event from oai to proto (#14446) by @slin1237 in #14446
- [model-gateway] move oai header util to router header util (#14441) by @slin1237 in #14441
- [model-gateway] extract conversation out of oai router (#14440) by @slin1237 in #14440
- [model-gateway] add llama4 vision image processor (#14438) by @slin1237 in #14438
- [model-gateway] introduce request ctx for oai router (#14434) by @slin1237 in #14434
- [model-gateway] add phi4 vision image processor (#14430) by @slin1237 in #14430
- Add Mistral Large 3 support. (#14213) by @dcampora in #14213
- [model-gateway] introduce provider in openai router (#14394) by @slin1237 in #14394
- [model-gateway] add phi3 vision image processor (#14381) by @slin1237 in #14381
- [model-gateway][doc] Add STDIO Explicitly to Example in README (#14393) by @xuwenyihust in #14393
- Fix sgl-router silently parse selector wrongly causing OME fail to discover pods (#14359) by @fzyzcjy in #14359
- [model-gateway] add qwen3_vl model image processor (#14377) by @slin1237 in #14377
- [model-gateway] use worker crate in openai router (#14330) by @slin1237 in #14330
- [model-gateway] add qwen2.5_vl model image processor (#14375) by @slin1237 in #14375
- [model-gateway] add qwen2_vl model image processor and tests (#14374) by @slin1237 in #14374
- [model-gateway] add llava model image processor and tests (#14371) by @slin1237 in #14371
- [model-gateway] add image processor and transformer structure (#14344) by @slin1237 in #14344
- [model-gateway] multimodality initialization (#13350) by @slin1237 in #13350
- [model-gateway] add workflow for external model providers (#14323) by @slin1237 in #14323
- [model-gateway] change rust package name to sgl-model-gateway instead (#14283) by @slin1237 in #14283
- [model-gateway] fix version output (#14276) by @slin1237 in #14276
- [model-gateway] include smg version command in py binding (#14274) by @slin1237 in #14274
- [model-gateway] add audio and moderation in model card (#14263) by @slin1237 in #14263
- [model-gateway] Add e2e tests of streaming events and tool choice for response api (#13880) by @XinyueZhang369 in #13880
- [model-gateway] Migrate Worker trait to model-aware methods (#14250) by @slin1237 in #14250
- [model-gateway] add ModelCard support to WorkerMetadata (#14243) by @slin1237 in #14243
- [...
Release v0.5.6
Highlights
- Support for DeepSeek V3.2/V3.2 Speciale #14249
- Blockwise diffusion language model support #12588
- Support for new diffusion models (Flux2 #14000, Z-image #14067)
- Introduce JIT Kernels #13453
- Upgrade to Torch 2.9 #12969
- Kimi-K2-Thinking model enhancement #12882
- Memory management/Overlap spec compatibility #12224 #12839
- More performance optimizations: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2...
- CI/CD Enhancement
What's Changed
- [router][grpc] Add more mcp test cases to responses api by @CatherineSue in #12749
- [Intel]Add 'intel_xpu' attention backend for llama4 by @gaopengff in #11051
- [Intel XPU]Update pytorch xpu to 2.9 by @gaopengff in #12363
- [Docs] fix dead links in multiple documentation pages by @mattheliu in #12764
- [mem pool] bugfix: wrong position for self.device in Mamba by @stmatengss in #12684
- [Fix]HTTP Stream raise exception by @jimmy-evo in #11904
- [CPU] Fix TP padding case with weight block size by @jianan-gu in #8243
- [docs] Remove redundant --disable-radix-cache option from by @rchalamala in #12717
- Pin uvloop to 0.21.0 by @yeahdongcn in #12279
- [fix] Only enable flashinfer all reduce fusion by default for single-node servers by @leejnau in #12724
- chore: update CODEOWNERS by @zhyncs in #12795
- Fix hang in deepgemm compilation with symmetric memory enabled by @nvcastet in #12715
- Add bot-bump-kernel-version-to-sglang workflow by @alisonshao in #12794
- ignore the deepgemm check when the model weight with nvfp4 and moe ba… by @rainj-me in #12782
- [AMD] Update wave-lang to 3.8.2 by @xintin in #12576
- [DeepSeek-V3.2][NSA] Enable MHA Pathway for Short Sequence Prefill on B200 (SM100) by @YAMY1234 in #12788
- [hotfix]: Resolve ModuleNotFoundError in PD deployment for is_in_ci() by @hzh0425 in #12772
- [HotFix]: Add missing SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL env var by @hzh0425 in #12776
- Add PP support for dots_vlm by @gty111 in #12763
- fixes hardcoded "cuda" device references in unit tests to use a dynamic device selection by @kalyank007 in #12761
- fix multimodal gen issues by @yhyang201 in #12765
- [Test] Add DeepSeekV3.2 NSA Indexer Test Suite by @Johnsonms in #12520
- [Bugfix] Fix illegal memory access by @elvischenv in #12758
- [MoE] Add Comprehensive MoE Integration Tests by @Jonahcb in #12090
- [Deepseek V3.2] Only skip Indexer logits computation when is_extend_without_speculative by @hlu1 in #12816
- Fix missing dp_max_padding argument in set_dp_buffer_len by @Chen-0210 in #12812
- optm(checkpoint-engine): disable multi-thread loading when update weights by @BraveY in #12374
- Fix piecewise cuda graph ci test by @ispobock in #12836
- update multimodal_gen readme by @mickqian in #12825
- [router] Support structured model output for openai and grpc router by @key4ng in #12431
- Fix data parallel controller launch for num nodes > 2 by @merrymercy in #12822
- remove the fa4 page_size hardcode to 128 restriction on mla model arch by @rainj-me in #12801
- sglang diffusion announcement by @wisclmy0611 in #12856
- add back flashinfer jit cache to dev docker by @b8zhong in #12851
- [router][grpc] Refactor: Add builders for chat and responses by @CatherineSue in #12852
- [router][grpc] Move all error logs to their call sites by @CatherineSue in #12859
- [router] Switch MCP tests from DeepWiki to self-hosted Brave search server by @key4ng in #12849
- Add nightly performance test for GPT-OSS 4GPU models by @alisonshao in #12805
- [sgl-kernel][Deepseek V3.2] Add row_starts to topk kernel by @hlu1 in #12582
- [CI] Fix huggingface access for test_flash_attention_4.py by @Fridge003 in #12846
- [Auto Sync] Update activation.py, logits_processor.py, rota... (20251107) by @merrymercy in #12853
- [Docs][DeepseekV3.2] Update deepseekv3.2 docs for mha short seq prefill by @YAMY1234 in #12868
- Support capturing aux_hidden_states for minimax m2. by @pyc96 in #12798
- [CI] Tiny adjust CI esitmation time by @hnyls2002 in #12886
- [DP-Attn] Clarify MLP sync / idle batch preparation logic by @hnyls2002 in #12843
- Fix sending all requests to the first rank in DP attention by @fzyzcjy in #12832
- Apply moe_reduce_sum kernel for fused_marlin_moe by @ispobock in #12888
- use fast stream instead of torch.cuda.current_stream in llama 4 shared experts overlap by @b8zhong in #12811
- [Fix] Fix trtllm-mla backend when chunked prefix cache is disabled by @Fridge003 in #12361
- Refs/heads/add nightly test multi gpu configs by @alisonshao in #12870
- chore: bump sgl-kernel version to 0.3.16.post6 by @sglang-bot in #12889
- Update CODEOWNERS by @ispobock in #12897
- Tiny simplify `can_run_dp_cuda_graph` gather logic by @hnyls2002 in #12891
- Fix spec decoding acc length for dpsk-r1-fp4 tp8 by @Qiaolin-Yu in #12896
- Revert "Fix spec decoding acc length for dpsk-r1-fp4 tp8" by @Qiaolin-Yu in #12900
- Add Deepseek models into nightly tests by @Kangyan-Zhou in #12865
- Fix empty server args in marlin moe test by @ispobock in #12904
- Fix duplicate nightly test name by @Kangyan-Zhou in #12905
- Add HF cleanup logic in ci_install_dependency.sh by @Kangyan-Zhou in #12895
- fallback to triton mm_persistent kernel when deepGemm fail by @zminglei in #12911
- Add kimi k2 thinking to ci by @ispobock in #12907
- Fix Deepseek nightly tests by @Kangyan-Zhou in #12906
- Add Jet-Nemotron by @futrime in #12448
- [CI] increase ut buckets & adjust estimation time. by @hnyls2002 in #12919
- [PD] feat: refactor custom mem pool and add barex pd support by @stmatengss in #12332
- [CI] Fix `matrix.part` in pr-test. by @hnyls2002 in #12920
- Adjust server launch time in ci by @ispobock in #12917
- feat: basic support for server-level multimodal cache by @mickqian in #10775
- Refactor / Unify event loop across PD-Disagg, Overlap, DP-Attn cases by @hnyls2002 in #12839
- [lint] tiny fix unimported packages. by @hnyls2002 in #12927
- ci: try to fix gpg error during kernel build by @ishandhanani in #12928
- Support piecewise cuda graph for MLA by @ispobock in #11812
- diffusion: skip full CI suite for multimodal_gen changes by @mickqian in #12940
- Minor code cleanup / improvement for `PREBUILT_EXTEND` mode by @hnyls2002 in #12948
- Bugfix: LMCache Connector with Sglang by @MMuzzammil1 in #12946
- [Docs] Add docs for Qwen3-VL image and video support by @adarshxs in #12554
- [Refactor] rename set_index_k_and_scale_buffer to set_index_k_scale_b… by @edwingao28 in #12956
- Refactor KTransformers heterogeneous compute with unified GPU-quantization backend by @Atream in #12834
- diffusion: fix detected file changes rule in CI by @mickqian in #12943
- c...
Release Gateway-v0.2.3
🚀 SGLang Model Gateway - New Release!
We're excited to announce another powerful update to SGLang Model Gateway with performance improvements and expanded database support!
✨ Headline Features
⚡ Bucket Mode Routing - 20-30% Performance Boost
Introducing our new bucket-based routing algorithm that dramatically improves performance in PD mode: up to 20-30% improvements in TTFT (Time To First Token) and overall throughput.
💾 PostgreSQL Support for Chat History Management
Flexibility in data storage! We now support PostgreSQL alongside OracleDB and in-memory storage for chat history management.
🛠️ Enhanced Model Tool & Structured Output Support
- MiniMax M2 model support!
- Structured model output for OpenAI and gRPC router
- Streaming parsing with Tool Choice in chat completions API
- Tool_choice support for Responses API
- OutputItemDone events with output item array storage for better observability
🐛 Stability & Quality Improvements
Multiple bug fixes for model validation, streaming logic, reasoning content indexing, and CI stability enhancements.
🔧 Code Quality Enhancements
Refactored builders for chat and responses, restructured modules for better maintainability, and consolidated error handling.
Try the latest version: `pip install sglang-router --upgrade`
What's Changed in Gateway
Gateway Changes (45 commits)
- [model-gateway] smg release 0.2.3 (#13312) by @slin1237 in #13312
- [router]Replace requests lib with openai in e2e_response_api (#13293) by @XinyueZhang369 in #13293
- fix outdated router doc (#13255) by @fzyzcjy in #13255
- [router][grpc] Refine docs in minimax_m2 to match other parsers (#13218) by @CatherineSue in #13218
- fix: display served_model_name in /v1/models (#13155) by @Sunhaihua1 in #13155
- [router] minmax-m2 xml tool parser (#13148) by @slin1237 in #13148
- [router] remove worker url requirement (#13172) by @slin1237 in #13172
- [router] Fix Flaky test_circuit_breaker_opens_and_recovers (#13164) by @XinyueZhang369 in #13164
- [router] Add comprehensive validation to Responses API (#13127) by @key4ng in #13127
- bugfix: multi-model routing for /generate api (#12979) by @SYChen123 in #12979
- [router][grpc] Support vllm backend for grpc router (#13120) by @CatherineSue in #13120
- [router] add minmax m2 reasoning parser (#13137) by @slin1237 in #13137
- [router] Support complex assistant and tool messages in /chat/completions (#12860) by @hellodanylo in #12860
- [router] move radix tree to policy crate and addreses some code styles (#13131) by @slin1237 in #13131
- [Router] use call_id instead of id for matching function calls in Responses API for Harmony (#13056) by @zhaowenzi in #13056
- Revert "fix: display served_model_name in /v1/models" (#13093) by @CatherineSue in #13093
- fix: display served_model_name in /v1/models (#13063) by @Sunhaihua1 in #13063
- [router] add postgres databases data connector (#12218) by @lengrongfu in #12218
- [router][ci] Quick Improvement to make CI more stable (#12869) by @key4ng in #12869
- [router][ci] Fix maturin build (#13012) by @key4ng in #13012
- [router] bucket policy (#11719) by @syy-hw in #11719
- [router] Switch MCP tests from DeepWiki to self-hosted Brave search server (#12849) by @key4ng in #12849
- [router][grpc] Move all error logs to their call sites (#12859) by @CatherineSue in #12859
- [router][grpc] Refactor: Add builders for chat and responses (#12852) by @CatherineSue in #12852
- [router] Support structured model output for openai and grpc router (#12431) by @key4ng in #12431
- [router][grpc] Add more mcp test cases to responses api (#12749) by @CatherineSue in #12749
- fix ci (#12760) by @key4ng in #12760
- Add timing metrics for requests (#12646) by @cicirori in #12646
- [router][ci] Disable cache (#12752) by @key4ng in #12752
- [router][grpc] Support mixin tool calls in Responses API (#12736) by @CatherineSue in #12736
- Revert "[router] web_search_preview tool basic implementation" (#12716) by @key4ng in #12716
- [router] add basic ci tests for gpt-oss model support (#12651) by @key4ng in #12651
- [router][quick fix] Add minimal option for reasoning effort in spec (#12711) by @key4ng in #12711
- [router][grpc] Make harmony parser checks recipient first before channel (#12713) by @CatherineSue in #12713
- [router][ci] speed up python binding to 1.5 min (#12673) by @key4ng in #12673
- [router] fix: validate HTTP status codes in health check (#12631) by @wyx-0203 in #12631
- [router][grpc] Support streaming parsing with Tool Choice in chat completions API (#12677) by @CatherineSue in #12677
- [router][grpc] Implement tool_choice support for Responses API (#12668) by @CatherineSue in #12668
- [router][grpc] Emit OutputItemDone event and store output item array (#12656) by @CatherineSue in #12656
- [router][grpc] Fix index issues in reasoning content and missing streaming events (#12650) by @CatherineSue in #12650
- [router][grpc] Fix model validation, tool call check, streaming logic and misc in responses (#12616) by @CatherineSue in #12616
- Support aggregating engine metrics in sgl-router (#11456) by @fzyzcjy in #11456
- [router][grpc] Restructure modules and code clean up (#12598) by @CatherineSue in #12598
- [router][grpc] Consolidate error messages build in error.rs (#12301) by @CatherineSue in #12301
- [ci] install released version router (#12410) by @key4ng in #12410
New Contributors
- @XinyueZhang369 made their first contribution in 2cdde3d46
- @Sunhaihua1 made their first contribution in a06c44f90
- @zhaowenzi made their first contribution in 7b877ab83
- @cicirori made their first contribution in 58095cb00
- @wyx-0203 made their first contribution in 3651cfbf6
- @syy-hw made their first contribution in 611a4fd08
- @SYChen123 made their first contribution in 4ef439054
- @hellodanylo made their first contribution in d28caaf60
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.2...gateway-v0.2.3
Release v0.5.5
Highlights
- Day 0 support for Kimi-K2-Thinking https://huggingface.co/moonshotai/Kimi-K2-Thinking
- Day 0 support for Minimax-M2 https://huggingface.co/MiniMaxAI/MiniMax-M2
- Video and image generation support https://lmsys.org/blog/2025-11-07-sglang-diffusion/
- Q4 Roadmap: #12780
- Blackwell kernel optimizations and MoE runner backend refactor
- Overlap spec and prefill cuda graph support more models
What's Changed
- [8/n] decouple quantization impl from vllm dependency - gguf srt by @FlamingoPg in #11964
- lang: support direct video inference by @mickqian in #9936
- Enable Llama 4 + TRTLLM MHA by @b8zhong in #12003
- Refactor Triton-kernel MoE runner integration by @Jonahcb in #11795
- use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk by @b8zhong in #11816
- Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in #11909
- Revert "lang: support direct video inference" by @merrymercy in #12038
- support more model in piecewise cuda graph by @narutolhy in #11745
- [Fix] Fix lint to pass CI by @Fridge003 in #12037
- Revert "[Fix] Fix lint to pass CI" by @Fridge003 in #12042
- fix: fix MMMU loading issue by @ZailiWang in #11759
- Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once by @xu-yfei in #10953
- Add gguf dependency for cpu/xpu by @ZailiWang in #12041
- fix: the hardcode hf repo name comparison for deepseek-ocr by @rainj-me in #12031
- Install numactl in Dockerfile for GH200/GB200/GB300 by @fzyzcjy in #11853
- [router] Add mTLS Support for Router-to-Worker Communication by @slin1237 in #12019
- Tiny cleanup send_single by @fzyzcjy in #12056
- Refactoring GLM-4.5 and GLM-4.5V related implementations by @zRzRzRzRzRzRzR in #11800
- [Fix] fix missing `ipc_name` of `__getitem__` in some IO structs by @whybeyoung in #12053
- fix: bench_serving ITL calculation when using spec-decoding by @JustinTong0323 in #12064
- Fix dpsk-r1-fp4 launching crash by @Qiaolin-Yu in #12063
- Revise POINTSV15Chat model by @yuan-luo in #12049
- Add 'gguf' to project dependencies by @Muqi1029 in #12046
- [Profiler] expand '~' by @Muqi1029 in #11999
- [b200] fix piecewise cuda graph launch bug by @BBuf in #12067
- Fix multi processing serializer bug by @fzyzcjy in #11958
- [Fix]: HiCache hasher failed when EAGLE mode enabled by @leavelet in #12025
- adjust dynamic vs static outputs comparison in test_lora_update.py by @glenliu21 in #11884
- [router] implement response api get input item function and refactor input/output store by @key4ng in #11924
- fix(compile_utils, ep_moe): update environment variable and dtype check by @ishandhanani in #12034
- [router] fix ut router config init to use build pattern by @slin1237 in #12084
- docs(server-arguments): add allowed options for each argument by @Jonahcb in #11560
- [router] migrate app context to builder pattern 1/n by @slin1237 in #12086
- [router] migrate app context to builder pattern 2/n by @slin1237 in #12089
- [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files by @CatherineSue in #12091
- [1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU by @zminglei in #12000
- Fix: Update blog link by @LucaLow in #12071
- perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph by @cicirori in #12093
- [2/N]Support DeepSeek-R1 w4a8 low latency deepep by @ayrnb in #8464
- Enhance tests in deterministic kernels by @fzyzcjy in #12070
- [Doc] Add documentation for DeepSeek V3.2 by @Fridge003 in #11877
- [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE by @ch-wan in #12054
- Support true on-policy by @fzyzcjy in #12058
- [Docs] update sgl-kernel readme by @FlamingoPg in #11379
- Fix 'KeyError' for per_token expert distribution recorder by @vipwangerxiao in #9501
- Fix kernel version bump file by @Kangyan-Zhou in #12087
- [Fix] Set global args in cpu test by @Fridge003 in #12105
- chore: bump sgl-kernel version to 0.3.16.post4 by @sglang-bot in #12103
- [Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) by @merrymercy in #12083
- [router] Refactor data connector architecture with unified storage modules by @key4ng in #12096
- fix: release workflow should work on both archs by @ishandhanani in #12110
- [bugs] docker file name should be .Dockerfile so it can properly render by @slin1237 in #11869
- Clean up server args & Add CI scripts by @merrymercy in #12124
- [Misc] Improve the error message of failed import by @DarkSharpness in #12119
- [CI] Add ci monitor balance workflow by @BBuf in #11962
- Skip TestLlama4LoRA in CI by @lifuhuang in #12098
- clean up github tokens by @merrymercy in #12126
- Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation by @YAMY1234 in #12115
- Fix token for CI monitor by @merrymercy in #12127
- Reenable b200 tests by @Kangyan-Zhou in #11814
- Update document index for DeepSeek-v32 docs by @Fridge003 in #12101
- Update sgl-kernel version to 0.3.16.post4 by @Fridge003 in #12125
- [Doc] Fix format for deepseek v3.2 document by @Fridge003 in #12130
- Accelerate deepseek fp4 b200 ci by @Qiaolin-Yu in #11993
- Clean up server launch code and multi tokenizer by @merrymercy in #12132
- [Test] Add dsv3.2 nsa backend testing by @Johnsonms in #11936
- [docs] upd docker files names everywhere by @vincentzed in #12133
- Make bmm batch invariant injection optional by @fzyzcjy in #12118
- [Doc] Small update of DeepSeek v3.2 document by @Fridge003 in #12138
- docs: update README by @zhyncs in #12139
- [router] MCP Manager - Support Connection Pooling, Tool Inventory and Proxy by @slin1237 in #12097
- [NVIDIA] Change default quant method for model_opt by @kaixih in #11991
- [router] update smg code owners for each component by @slin1237 in #12141
- [router] cleaned up all the redundant comments in the config module by @CatherineSue in #12147
- Clean up attention backend selection code & Other minor rename by @merrymercy in #12136
- [log] Make forward iter count optional by @hnyls2002 in #12116
- [misc] depdencies & enviroment flag by @hnyls2002 in #12113
- [quantization] AWQ Marlin doesn't work when dtype is bfloat16 by @kevin85421 in #11494
- [HiCache]Page head layout IO kernel by @huangtingwei9988 in #11615
- Do not use `MagicMock` to mock `server_args` in tests by @hnyls2002 in #12154
- [router][grpc] Fix tool call id in `parse_json_schema_response` by @catheri...
Release Gateway-v0.2.2
🚀 SGLang Model Gateway v0.2.2 Released!
✨ Features
🎯 Industry-First Responses API for All Models
We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for Llama, DeepSeek, Qwen, and more – with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models.
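Multi-turn conversations work by chaining response IDs rather than replaying history. As a minimal sketch (the model name, response id, and helper below are illustrative, not part of the gateway's documented schema), a follow-up request carries `previous_response_id` so the gateway can reconstruct chat history server-side:

```python
# Hypothetical sketch of multi-turn Responses API request bodies.
# The chaining via `previous_response_id` follows OpenAI's Responses
# API spec; the model name and id values are placeholder assumptions.
def build_responses_request(model, user_input, previous_response_id=None):
    """Build a /v1/responses request body for one conversation turn."""
    body = {"model": model, "input": user_input, "store": True}
    if previous_response_id is not None:
        # Ties this turn to the stored history of an earlier response
        body["previous_response_id"] = previous_response_id
    return body

first = build_responses_request("deepseek-v3", "What is SGLang?")
# Suppose the gateway answered with id "resp_123"; the next turn sends
# only the new input plus that id -- no manual history replay needed.
follow_up = build_responses_request(
    "deepseek-v3",
    "Summarize that in one sentence.",
    previous_response_id="resp_123",
)
```

Each body would then be POSTed to the gateway's `/v1/responses` endpoint with any OpenAI-compatible client.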
☸️ Production-Ready Kubernetes Operations
Taking large-scale deployments seriously! We now support native gRPC health check endpoints, making it effortless to deploy and operate SGLang at scale on Kubernetes with proper health monitoring and orchestration.
🔐 Your Network, Your Control
- mTLS Support: Secure gateway-to-SGLang communication whether you're running on edge, remote cloud, multi-cloud, or hybrid environments – we've got you covered
- MCP Proxy Enhancements: Configure proxies globally or per individual MCP server – complete network control in your hands
🤖 Harmony Pipeline
Introducing our unified OpenAI-native architecture with GPT OSS model support for both Responses API and Chat Completion – fully integrated with MCP and intelligent storage management.
🌍 Universal Platform Support
A major leap in accessibility! SGLang Model Gateway now runs on nearly every operating system and architecture: Linux, Windows, Mac, x86, and ARM. Even better – we support all Python versions from 3.8 to 3.14 in a single wheel file, while reducing wheel size by more than 40%. Deploy anywhere, on any Python version, with unprecedented efficiency!
⚡ Additional Enhancements
- Multi-worker URL support for better load distribution
- Connection pooling and tool inventory for MCP
- Native OpenAI web search tool support and function calling for OpenAI router
🐛 Stability Improvements
We've squashed numerous bugs including background task handling, tool call IDs, conversation management, and installation dependencies.
Try it now: pip install sglang-router==0.2.2
What's Changed in Gateway
Gateway Changes (48 commits)
- [router] 0.2.2 release (#12399) by @slin1237 in #12399
- [router] web_search_preview tool basic implementation (#12290) by @key4ng in #12290
- [router] Function call support for openai router Responses API (#12386) by @key4ng in #12386
- [router] Fix safety_identifier missing (#12404) by @key4ng in #12404
- [router] use safety_identifier replace user on chat history storage (#12185) by @lengrongfu in #12185
- [router] harmony responses api streaming support (#12395) by @slin1237 in #12395
- [router] Harmony Pipeline: Chat Completion & Responses API with MCP Support (#12153) by @slin1237 in #12153
- [bug] fix router installation to include additional dependency (#12348) by @slin1237 in #12348
- [router] refactor mcp to use LRU and fix pooling bug (#12346) by @CatherineSue in #12346
- [bug] fix router pypi license file (#12345) by @slin1237 in #12345
- [router] fix router release workflow and add build test in PR (#12315) by @CatherineSue in #12315
- [Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang (#12338) by @sufeng-buaa in #12338
- [router][grpc] Fix inconsistent behavior of conversation_id not found (#12299) by @CatherineSue in #12299
- [router] support arm, windows, mac, linux, reduce wheel size and number (#12285) by @slin1237 in #12285
- [rust][ci] Add end-to-end tests for Oracle history backend (#12233) by @key4ng in #12233
- [router] upgrade grpc dependency and py 3.13 3.14 support (#12284) by @slin1237 in #12284
- [router] Fix type unmatch during validation (#12257) by @key4ng in #12257
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 (#10804) by @sufeng-buaa in #10804
- [router] configure workflow retries and timeout based on routerConfig (#12252) by @slin1237 in #12252
- [router] use mcp struct from sdk and clean up code across codebase (#12249) by @slin1237 in #12249
- [router] remove code duplication (#12245) by @slin1237 in #12245
- [sgl-route] Optimize the use of constant slices and retain to simplif… (#12159) by @lengrongfu in #12159
- [router] Remove SharedXxxStorage type aliases to make Arc explicit (#12171) by @CatherineSue in #12171
- [router][grpc] Add `ResponsesContext` and fix error propagation in responses api (#12164) by @CatherineSue in #12164
- [misc][grpc] Remove duplicate log (#12168) by @CatherineSue in #12168
- [router] centralize mcp tool args handling (#12155) by @slin1237 in #12155
- [router][grpc] Fix tool call id in `parse_json_schema_response` (#12152) by @CatherineSue in #12152
- [router] cleaned up all the redundant comments in the config module (#12147) by @CatherineSue in #12147
- [router] MCP Manager Refactoring - Flat Architecture with Connection Pooling (#12097) by @slin1237 in #12097
- [router] Refactor data connector architecture with unified storage modules (#12096) by @key4ng in #12096
- [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files (#12091) by @CatherineSue in #12091
- [router] migrate app context to builder pattern 2/n (#12089) by @slin1237 in #12089
- [router] migrate app context to builder pattern 1/n (#12086) by @slin1237 in #12086
- [router] fix ut router config init to use build pattern (#12084) by @slin1237 in #12084
- [router] implement response api get input item function and refactor input/output store (#11924) by @key4ng in #11924
- [router] Add mTLS Support for Router-to-Worker Communication (#12019) by @slin1237 in #12019
- [router] Add builder pattern for RouterConfig with zero duplication (#12030) by @slin1237 in #12030
- [router][CI] Clean up imports and prints statements in sgl-router/py_test (#12024) by @CatherineSue in #12024
- [router] change ci names and update log level in ci (#12021) by @slin1237 in #12021
- [Router] Consolidate ConnectionMode enum to core module (#11937) by @YouNeedCryDear in #11937
- [router] Add comprehensive E2E tests for Response API (#11988) by @key4ng in #11988
- [grpc] Support gRPC standard health check (#11955) by @CatherineSue in #11955
- [router] create worker removal step and clean up worker manager (#11921) by @slin1237 in #11921
- [router] Support multiple worker URLs for OpenAI router (#11723) by @key4ng in #11723
- [router][grpc] Fix background tasks stored with wrong id (#11945) by @CatherineSue in #11945
- [router] Add gRPC E2E test suite (#11790) by @key4ng in #11790
- [router][grpc] Support `v1/responses` API (#11926) by @CatherineSue in #11926
- Fix openai input_text type compatibility (#11935) by @key4ng in #11935
New Contributors
- @lengrongfu made their first contribution in 09af0a7b5
- @sufeng-buaa made their first contribution in ea9610600
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.1...gateway-v0.2.2
Release v0.5.4
Highlights
- AMD AI Dev Day 2025 SGLang (slide), PyTorch Conference 2025 SGLang (slide)
- Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
- [beta] Overlap scheduler for speculative decoding: #11762
- [beta] Piecewise CUDA graph for prefill: #11490
- Prefix cache for qwen3 next and GDN/mamba models: #11214
- Fullset optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
- Various Blackwell kernel optimizations
- DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
- KTransformer integration: https://lmsys.org/blog/2025-10-22-KTransformers/
- New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
- Native ModelOpt quantization support
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
- Remove env var warnings for release by @merrymercy in #11262
- Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
- [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
- disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
- docker: add manifest to versioned docker releases by @ishandhanani in #11268
- [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
- [router][grpc] Refine streaming processes by @CatherineSue in #11277
- Fix code sync scripts by @merrymercy in #11276
- [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
- Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
- Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
- Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
- fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
- docs: update sgl-kernel README by @zhyncs in #11286
- chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
- convert test_deterministic into unit tests by @skyzh in #11095
- Feature/longbench v2 evaluation utils by @alhridoy in #10949
- [ci] fix pp test by @hnyls2002 in #11294
- EAGLE cache fix for SWARadixCache by @ispobock in #11231
- Remove overlap thread by @hnyls2002 in #11210
- [router] add reasoning and tool parser argument in router by @slin1237 in #11290
- Remove sampling info events and overlap thread file by @hnyls2002 in #11300
- Introduce future indices by @hnyls2002 in #11301
- [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
- [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
- [router] add get server info and get model info in grpc server by @slin1237 in #11303
- [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
- [Doc] HiCache Design Documents by @ykwd in #11027
- [Doc]: Best Practice for HICache by @hzh0425 in #11001
- [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
- [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
- Update tool parser and related documentation by @JustinTong0323 in #11223
- [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
- [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
- [router] support Openai router conversation API CRUD by @key4ng in #11297
- [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
- [router] cleanup worker health check to return early by @slin1237 in #11310
- [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
- Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
- ci: unify the model launch method of nightly ci by @mickqian in #11230
- [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
- update sampling_params documentation with defaults by @JustinTong0323 in #11315
- Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
- Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
- [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
- [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
- Skip weight loading in deepgemm compilation by @ch-wan in #11312
- [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
- [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
- fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
- Support LoRA in bench_serving oai interface by @lifuhuang in #11318
- benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
- [CI] improve disaggregation CI. by @hnyls2002 in #11264
- model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
- [router] refactor generate to use new pipeline arch by @slin1237 in #11323
- [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
- [router] Fix all unused_qualifications by @CatherineSue in #11341
- [router] Support history management using conversation by @key4ng in #11339
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
- fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
- [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
- [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
- [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
- [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
- [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
- [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
- [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in https://github.c...
Release Gateway-v0.2.1
🚀 SGLang Model Gateway v0.2.1 Released!
This release focuses on stability, cleanup, and two big new performance features.
🧾 Docs & CI
- Updated router documentation to reflect recent feature additions
🧹 Code Cleanup
- Refactored StopSequenceDecoder for cleaner incremental decoding
- Added spec.rs test harness under spec/ for structured unit tests
🐞 Bug Fixes
- Fixed UTF-8 boundary in stop-sequence decoding
- Fixed gRPC timeout configuration
- Fixed worker filtering, tool-choice normalization, and bootstrap-port handling
- Additional gRPC server warm-up and concurrency fixes
🌟 New Features
- Two-Level Tokenizer Caching (L0 + L1)
- L0: exact-match cache for repeated prompts
- L1: prefix-aware cache at special-token boundaries
- OpenAI-Style Classification API → new /v1/classifications endpoint, shout out to yanbo for the contribution
- Worker Management Workflow Engine → improved async registration, worker self-discovery, and health orchestration
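The two cache levels above can be sketched as follows. This is an illustrative model of the idea only, not the router's Rust implementation: L0 memoizes whole prompts, while L1 reuses the token ids of the longest previously seen prefix ending at a special-token boundary (the `<|im_end|>` marker and the char-level tokenizer in the demo are assumptions for the sketch), so only the tail is re-tokenized.

```python
# Hedged sketch of a two-level (L0 exact-match + L1 prefix) tokenizer
# cache. Cutting the L1 prefix only at a special-token boundary keeps
# "prefix ids + tail ids" identical to tokenizing the full string.
SPECIAL = "<|im_end|>"

class TwoLevelCache:
    def __init__(self, tokenize):
        self.tokenize = tokenize   # underlying tokenizer function
        self.l0 = {}               # L0: exact prompt -> token ids
        self.l1 = {}               # L1: special-token prefix -> token ids

    def encode(self, text):
        if text in self.l0:                       # L0 hit: repeated prompt
            return self.l0[text]
        cut = text.rfind(SPECIAL)
        prefix = text[: cut + len(SPECIAL)] if cut != -1 else None
        ids = None
        if prefix is not None and prefix in self.l1:
            # L1 hit: reuse prefix ids, tokenize only the new tail
            ids = self.l1[prefix] + self.tokenize(text[len(prefix):])
        if ids is None:
            ids = self.tokenize(text)             # cold path: full encode
        if prefix is not None and prefix not in self.l1:
            self.l1[prefix] = self.tokenize(prefix)
        self.l0[text] = ids
        return ids

cache = TwoLevelCache(list)                  # char-level tokenizer for the demo
ids1 = cache.encode("sys<|im_end|>hello")    # cold: full tokenization
ids2 = cache.encode("sys<|im_end|>world")    # warm: prefix served from L1
```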
What's Changed in Gateway
Gateway Changes (26 commits)
- [router] release router 0.2.1 (#11885) by @slin1237 in #11885
- [router][grpc] Fix wram-up random token ids for small models (#11887) by @CatherineSue in #11887
- [router] clean up workflow logs to debug for implementation details logs (#11886) by @slin1237 in #11886
- fix(sql-router): fix conflict port in test (#11826) by @htiennv in #11826
- [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` (#11882) by @CatherineSue in #11882
- [router] remove encoding header for oai router (#11881) by @slin1237 in #11881
- [router] Worker Management Workflow Engine (#11868) by @slin1237 in #11868
- [2/2] [feature] support openai like classification api in router (#11670) by @whybeyoung in #11670
- [router] Add Configurable L0 and L1 Tokenizer Caching (#11688) by @slin1237 in #11688
- [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) by @CatherineSue in #11798
- [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files (#11685) by @CatherineSue in #11685
- [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log (#11775) by @CatherineSue in #11775
- [doc] update router document (#11767) by @key4ng in #11767
- [router] fix grpc client time out to 1h (#11768) by @slin1237 in #11768
- [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder (#11766) by @slin1237 in #11766
- Revert "[router] fix get_models endpoint for openai router (#11687)" (#11740) by @key4ng in #11687
- [router] Add rustfmt and set group imports by default (#11732) by @CatherineSue in #11732
- [router] add spec.rs to enables tests under spec folder (#11734) by @key4ng in #11734
- [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut (#11731) by @CatherineSue in #11731
- [router][grpc] add dissag info to warm up in grpc server (#11727) by @slin1237 in #11727
- [router] fix p and d worker filtering and bootstrap port handling (#11729) by @slin1237 in #11729
- [Router] Refactor protocol definitions: split spec.rs into modular files (#11677) by @key4ng in #11677
- [router] fix get_models endpoint for openai router (#11687) by @key4ng in #11687
- [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676) by @slin1237 in #11676
- [router][grpc] Simplify model_id determination (#11684) by @CatherineSue in #11684
- [router] Fix response api related spec (#11621) by @key4ng in #11621
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.0...gateway-v0.2.1
Release Gateway-v0.2.0
🚀 Release: SGLang Model Gateway v0.2.0 (formerly “SGLang Router”)
🔥 What’s new
🧠 Multi-Model Inference Gateway (IGW) Mode
IGW turns one router into many — letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface.
You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway handle routing, health checks, and load balancing.
Whether you’re mixing Llama, Mistral, and DeepSeek, or orchestrating per-tenant routing in enterprise setups, IGW gives you total control.
Your fleet, your rules. ⚡
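The registration flow above can be sketched as request payloads. This is a hedged illustration only: the exact field names (`labels`, `policy`, `tier`) and the worker URLs are assumptions about the payload shape, not the documented `/workers` schema.

```python
# Hypothetical sketch: registering two model workers with the
# multi-model gateway via its /workers endpoint. Field names and
# values below are illustrative assumptions, not the real schema.
def worker_registration(url, model_id, policy, tier):
    return {
        "url": url,            # worker base URL the gateway routes to
        "model_id": model_id,  # model served by this worker
        # per-model routing metadata the gateway can match on
        "labels": {"policy": policy, "tier": tier},
    }

payloads = [
    worker_registration("http://10.0.0.1:30000", "llama-3.1-8b",
                        "cache_aware", "standard"),
    worker_registration("http://10.0.0.2:30000", "deepseek-v3",
                        "power_of_two", "premium"),
]
# Each payload would then be POSTed to the gateway, e.g.:
#   requests.post(f"{gateway_url}/workers", json=payloads[0])
```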
⚡ gRPC Mode: Rust-Powered, Built for Throughput
This is the heart of 0.2.0. The new gRPC data plane runs entirely in Rust — tokenizer, reasoning parser, and tool parser included — giving you native-speed performance and lower latency.
You can connect to gRPC-based SGLang workers, stream tokens in real time, and even handle OpenAI-compatible APIs like
🌐 OpenAI-Compatible Gateway
Seamlessly proxy requests to OpenAI, while keeping data control local.
Conversation history, responses, and background jobs all flow through the gateway — same API, enterprise privacy.
💾 Pluggable History Storage
Choose between `memory`, `none`, or `oracle` for conversation and /v1/responses data.
- `memory`: Fastest for ephemeral runs.
- `none`: Zero persistence, zero latency overhead.
- `oracle`: Full persistence via Oracle ATP with connection pooling and credentials support.
🧩 Pluggable MCP Integration
The gateway now natively speaks MCP across all transports (STDIO, HTTP, SSE, Streamable), so your tools can plug directly into reasoning and response loops — perfect for agentic workflows and cross-model orchestration.
🛡️ Reliability & Observability Upgrades
Built-in:
- Retries with exponential backoff + jitter
- Per-worker circuit breakers
- Token-bucket rate limiting & FIFO queuing
- Prometheus metrics for latency, load, queue depth, PD pipelines, tokenizer speed, and MCP activity
- Structured tracing & request-ID propagation
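The retry policy named above can be sketched as exponential backoff with full jitter: the delay ceiling grows as base × 2^attempt up to a cap, and each actual sleep is drawn uniformly from [0, ceiling] so synchronized clients do not retry in lockstep. This is an illustrative model of the technique, not the gateway's Rust implementation; the parameter values are assumptions.

```python
# Hedged sketch of retries with exponential backoff + full jitter.
# Ceiling doubles per attempt (capped); actual delay is uniform in
# [0, ceiling], which spreads out retry storms across clients.
import random

def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Return the jittered sleep duration for each retry attempt."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng() * ceiling)             # full jitter: U[0, ceiling]
    return delays

# With jitter disabled (rng always returns 1.0) the raw ceilings show:
print(backoff_delays(4, rng=lambda: 1.0))  # [0.1, 0.2, 0.4, 0.8]
```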
✨ SGLang Model Gateway v0.2.0 — built in Rust, designed for scale, ready for reasoning.
What's Changed in Gateway
Gateway Changes (238 commits)
- [router] upgrade to 0.2.0 (#11642) by @slin1237 in #11642
- [router] add worker self discovery for metadata (#11638) by @slin1237 in #11638
- [router][grpc] add warm up to grpc server (#11627) by @slin1237 in #11627
- [router] update router readme to latest features (#11619) by @slin1237 in #11619
- [router] add chang and keyang to sgl router author (#11620) by @slin1237 in #11620
- [router] cleanup app context and move to startup (#11617) by @slin1237 in #11617
- [router] add py binding and readme for openai router and history backend (#11453) by @key4ng in #11453
- [router] when given both local tokenizer and chat template, log all (#11601) by @slin1237 in #11601
- [router] allow router launch server to use grpc mode (#11600) by @slin1237 in #11600
- [router] delete useless table content comment in spec (#11597) by @slin1237 in #11597
- [router] change worker api to async instead of sync (#11566) by @slin1237 in #11566
- [router] update generate spec to align with sgl io struct (#11591) by @slin1237 in #11591
- [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588) by @CatherineSue in #11588
- [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564) by @CatherineSue in #11564
- [router][grpc] Add error handling to `generate_tool_constraints` (#11562) by @CatherineSue in #11562
- [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483) by @Jonahcb in #11483
- [router] allow user to specify chat template path (#11549) by @slin1237 in #11549
- [router][grpc] Further delegate non-stream processing to `processing.rs` (#11553) by @CatherineSue in #11553
- [router][Fix] Include grpc reflection runtime dependency (#11419) by @ai-jz in #11419
- [router] allow tokenizer path to be dir (#11530) by @slin1237 in #11530
- [router] openai router: support grok model (#11511) by @key4ng in #11511
- Fix the GPT function calling regex to allow dash in the name (#10577) by @antoine-roux in #10577
- [Router]: Small Typo in a comment within tree.rs (#11489) by @xuwenyihust in #11489
- Super tiny delete unused openai router in sgl-router (#11448) by @fzyzcjy in #11448
- [router][grpc] Consolidate parser checks for chat completions (#11439) by @CatherineSue in #11439
- [router] leverage RAII to actively cancel request during client disconnect (#11399) by @slin1237 in #11399
- [router] disable rate limiter by default (#11435) by @slin1237 in #11435
- [router] Fix ci nvcc not found error (#11411) by @key4ng in #11411
- move more files under srt/utils (#11285) by @merrymercy in #11285
- [router] conversation item API: create, retrieve and delete (#11369) by @key4ng in #11369
- [router] change grpc client from mutable to clone (#11394) by @slin1237 in #11394
- [router][grpc] Replace fake health check with correct ones (#11387) by @CatherineSue in #11387
- [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (#11373) by @CatherineSue in #11373
- [router][lint] Add unused_qualifications to cargo lint warnings (#11366) by @CatherineSue in #11366
- [router] Refactor OpenAI router: split monolithic file and move location (#11359) by @key4ng in #11359
- [router][grpc] disable health check generation and increase timeout (#11353) by @slin1237 in #11353
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (#11342) by @CatherineSue in #11342
- [router] Support history management using conversation (#11339) by @key4ng in #11339
- [router] Fix all unused_qualifications (#11341) by @CatherineSue in #11341
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340) by @CatherineSue in #11340
- [router] improve reasoning parser lock and reduce req cloning (#11336) by @slin1237 in #11336
- [router] refactor generate to use new pipeline arch (#11323) by @slin1237 in #11323
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (#11314) by @CatherineSue in #11314
- [router] cleanup worker health check to return early (#11310) by @slin1237 in #11310
- [router] support Openai router conversation API CRUD (#11297) by @key4ng in #11297
- [router][grpc] Fix error message format in grpc chat handler (#11307) by @CatherineSue in #11307
- [router][grpc] Fix sampling_params.stop_strs is None (#11306) by @CatherineSue in #11306
- [router] fix grpc connection conversion and add optimization (#11305) by @slin1237 in #11305
- [router][grpc] Refactor chat template content format detection (#11288) by @CatherineSue in #11288
- [router] add get server info and get model info in grpc server (#11303) by @slin1237 in #11303
- [router] add reasoning and tool parser argument in router (#11290) by @slin1237 in #11290
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (#11283) by @CatherineSue in #11283
- [router][grpc] Refine streaming processes (#11277) by @CatherineSue in #11277
- [router][tool call] Clean up redundant
detect_formatandhas_tool_markers(#11270) by @CatherineSue in #11270 - [router] add ipv6 support across all components (#11219) by @slin1237 in #11219
- [router] add grpc router pd mode for chat and generate (#11140) by @slin1237 in #11140
- [router] fix get load response parsin...