Releases: sgl-project/sglang
v0.5.7
Highlights
- New Model Support:
- Day 0 Support for Mimo-V2-Flash: #15207, https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
- Day 0 Support for Nemotron-Nano-v3: https://lmsys.org/blog/2025-12-15-run-nvidia-nemotron-3-nano/
- Day 0 Support for LLaDA 2.0: https://lmsys.org/blog/2025-12-19-diffusion-llm/
- [SGLang-Diffusion] Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered
- EAGLE 3 speculative decoding draft models for popular models: https://lmsys.org/blog/2025-12-23-spec-bundle-phase-1/
- Model Gateway v0.3.0 Release: https://docs.sglang.io/advanced_features/sgl_model_gateway.html
- Scalable pipeline parallelism with dynamic chunking support for ultra-long contexts (PP Refactor Roadmap #11857)
- Encoder Disaggregation for Multi-modal models (Roadmap #15118)
- SGLang-Diffusion:
- Set `--dit-layerwise-offload true` to reduce peak VRAM usage by up to 30GB and improve performance by up to 58% for all models
- Significantly reduce the latency of Qwen-Image-Edit, making it one of the fastest among all open-source solutions. More improvements are on the way
- Add support for AMD/4090/5090, along with additional attention choices (sage-attn, sage-attn3), more parallelism options (TP), and enhancements to the HTTP API (Google Vertex supported)
- Cache-dit integration to improve performance by up to 165%
What's Changed
- Refactor custom allreduce logics by @iforgetmyname in #13710
- [Doc] Update DeepSeek-V3.2 document by @Fridge003 in #14321
- Feature/support distilled vae generic by @baonudesifeizhai in #14195
- [Performance] Optimize NSA Indexer K/S Buffer Access with Fused Triton Kernels by @Johnsonms in #13812
- Update CODEOWNERS for multimodal by @mickqian in #14329
- [bug fix] use npu phy id in container env by @jinke446 in #14266
- [model-gateway] multimodality initialization by @slin1237 in #13350
- [Doc] Fix DeepSeek V32 Doc by @Fridge003 in #14336
- sync attention, deepseek doc by @b8zhong in #14335
- [PD] Support decode pp for PD disaggregation by @ShangmingCai in #14265
- [model-gateway] add image processor and transformer structure by @slin1237 in #14344
- [CPU] Support chunk_gated_delta_rule kernel for Qwen3-Next by @Valentine233 in #12441
- [bugfix] Fix prefill tbo disabled when --deepep-mode=auto by @yuhyao in #14333
- [CI] update estimated elapsed time of some unittests by @ch-wan in #14347
- [NPU] bug fix: w_vc need contiguous for NPU batch_matmul_transpose ops by @ZhengdQin in #13980
- [bugfix] NpuFuseEPMoE miss initialization parameters by @chenxu140 in #14295
- [Ascend] fix AscendAttnMaskBuilder bug to support float16 models by @MichelleWu351 in #14271
- Tiny adjust CI testcases by @hnyls2002 in #14362
- [NPU][Doc] updated installation guide for Ascend NPU by @VDV1985 in #13585
- Feature/add vae path to cli doc#14004 by @baonudesifeizhai in #14355
- [CPU] add fused_qkvzba_split_reshape_cat kernel for Qwen3-next by @blzheng in #12330
- Single Batch Overlap for MoE Models by @Sulfur6 in #9660
- Move custom_ops under layers; move _custom_ops.py → custom_all_reduce_ops.py by @merrymercy in #14326
- [model-gateway] add llava model image processor and tests by @slin1237 in #14371
- ci: Migrate AMD workflows to new MI325 runners; temporarily disabled failed CI's to be added back by @sunxxuns in #14226
- [Tiny]Small fixes in deepseek v32 doc by @Fridge003 in #14372
- Fix validation to detect missing model files before loading by @alisonshao in #14253
- [model-gateway] add qwen2_vl model image processor and tests by @slin1237 in #14374
- [model-gateway] add qwen2.5_vl model image processor by @slin1237 in #14375
- Revert "Revert "enable csgmv automatically on cuda"" by @b8zhong in #14277
- [model-gateway] use worker crate in openai router by @slin1237 in #14330
- [model-gateway] add qwen3_vl model image processor by @slin1237 in #14377
- Fix sgl-router silently parse selector wrongly causing OME fail to discover pods by @fzyzcjy in #14359
- [sgl-kernel][Feat][B200][1/N]Support MXFP8 Grouped GEMM in Blackwell by @HydraQYH in #13731
- [CPU] document updates by @ZailiWang in #14272
- Support PP x PD decode with nixl backend by @bluecoffee8 in #14392
- [VLM] Introduce Cache for positional embedding ids for Qwen-VL family by @yuan-luo in #14292
- use faster covnersion from float8_e4m3fn to bfloat16 by @mingfeima in #12316
- [model-gateway][doc] Add STDIO Explicitly to Example in README by @xuwenyihust in #14393
- [CPU] add support for mamba causal conv1d for qwen3-next by @mingfeima in #12309
- [model-gateway] add phi3 vision image processor by @slin1237 in #14381
- [model-gateway] introduce provider in openai router by @slin1237 in #14394
- [AMD] fix the regression issue for DeepseekV3 on MI300 by @yctseng0211 in #14383
- [NPU][1/N] NPU basic functions refactor and new modelslim quant type by @iforgetmyname in #13359
- [CPU] Optimize small oc GEMM for Qwen3-next on CPU by @jianan-gu in #12446
- Try to fix B200 DeepEP error by @fzyzcjy in #14399
- [1/2] Add rope kernel in sgl-kernel by @Qiaolin-Yu in #14334
- [bug fix] fix ima with get_mla_kv_buffer_kernel overflow by @XucSh in #14224
- Add Mistral Large 3 support. by @dcampora in #14213
- [diffusion] fix gen video doc by @yeahdongcn in #14409
- Add 'NPU' to the runtime exception message in `get_device` by @rauletorresc in #14225
- Add mooncake `transfer_engine_bench` into manual test by @hnyls2002 in #14429
- [model-gateway] add phi4 vision image processor by @slin1237 in #14430
- diffusion: Add Configurable Generator Device and Seed Support via API by @niehen6174 in #14366
- [model-gateway] introduce request ctx for oai router by @slin1237 in #14434
- [NPU]add nightly-test-npu by @cherryblo in #14143
- [model-gateway] add llama4 vision image processor by @slin1237 in #14438
- [model-gateway] extract conversation out of oai router by @slin1237 in #14440
- [DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 by @YAMY1234 in #14325
- [model-gateway] move oai header util to router header util by @slin1237 in #14441
- [FIX] trtllm-moe-fp4-renorm for Qwen series models by @samuellees in #14350
- add doc for quantized kv cache by @b8zhong in #14348
- fix: Correct environment variable syntax in docker-compose configuration by @yankay in #8287
- [model-gateway] move all responses api event from oai to proto by @slin1237 in #14446
- [model-gateway] add mistral 3 image processor by @slin1237 in #14445
- [model-gateway] grpc to leverage event type by @slin1237 in #14450
- ministral3 by @JustinTong0323 in #14251
- [Bug] fix not desired disable fused share experts caused by rocm logic by @ocss884 in #14432
- Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE by @sglang-bot in #14421
- further optimze model load by @zyksir in #13836
- Add CI permissions for user 'yushengsu-thu' by @alisonshao in #14468
- [ez] Fix ty...
Release Gateway-v0.3.0
🚀 SGLang Model Gateway v0.3.0 Released!
We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!
⚠️ Breaking Changes
📊 Metrics Architecture Redesigned
Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.
🔧 UUID-Based Worker Resource Management
Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.
✨ New Features
🌐 Unified Inference Gateway Mode (IGW)
Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:
- gRPC router (PD and regular mode)
- HTTP router (PD and regular mode)
- OpenAI router
Auto-enabled with service discovery. Deploy once, route everything - handle all traffic patterns across your entire inference fleet from a single gateway instance.
🔤 Tokenize/Detokenize HTTP Endpoints
- Direct HTTP endpoints for tokenization operations
- Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
- TokenizerRegistry for efficient dynamic loading
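Conceptually, the tokenize/detokenize pair exposes an invertible mapping between text and token IDs. A minimal sketch of that contract, using a toy whitespace tokenizer (the real endpoints delegate to per-model tokenizers through the TokenizerRegistry; the function names and vocabulary here are illustrative assumptions, not the gateway's HTTP schema):

```python
# Toy tokenizer illustrating the tokenize/detokenize round-trip contract.
# All names and the vocabulary scheme are illustrative assumptions; the real
# endpoints delegate to per-model tokenizers via the TokenizerRegistry.

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign a stable ID to every whitespace-separated token."""
    vocab: dict[str, int] = {}
    for text in corpus:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Map text to token IDs (what a tokenize endpoint returns)."""
    return [vocab[word] for word in text.split()]

def detokenize(ids: list[int], vocab: dict[str, int]) -> str:
    """Map token IDs back to text (what a detokenize endpoint returns)."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

vocab = build_vocab(["the quick brown fox"])
ids = tokenize("quick fox", vocab)
assert detokenize(ids, vocab) == "quick fox"  # round-trip holds
```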
🧠 Parser Endpoints
- `/parse/reasoning` - Parse reasoning outputs
- `/parse/function_call` - Parse function call responses
- GLM-4 function call parser - Contributed directly by the GLM team for latest GLM models
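As a conceptual sketch of what reasoning parsing does: it splits the model's think block from the final answer. The `<think>` delimiters and result field names below are illustrative assumptions, not necessarily the gateway's wire format:

```python
import re

def parse_reasoning(text: str) -> dict:
    """Split a <think>...</think> reasoning block from the final answer.
    Delimiters and field names are illustrative assumptions, not the
    gateway's actual response schema."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    content = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return {"reasoning": reasoning, "content": content}

out = parse_reasoning("<think>compare options</think>Use option B.")
assert out == {"reasoning": "compare options", "content": "Use option B."}
```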
📊 Embeddings Support
Native embeddings endpoint for gRPC router - expand beyond text generation to embedding workloads.
🔐 Server-Side TLS Support
Secure your gateway deployments with native TLS support.
🌐 Go Implementation, contributed by the iFlytek MaaS team
Complete Go SGLang Model Gateway with OpenAI-compatible API server - bringing SGLang to the Go ecosystem!
⚡ Major Enhancements
Control Plane - Workflow Engine
Intelligent lifecycle orchestration with:
- DAG-based parallel execution with pre-computed dependency graphs
- Concurrent event processing for maximum throughput
- Modular add/remove/update workflows
Performance Optimization
- Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
- Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
- Optimized router management: Improved selection algorithms and state management
Resilience & Reliability
- Retry and circuit breaker support for OpenAI and gRPC routers
- Enhanced circuit breaker with better state management
- Graceful shutdown for TLS and non-TLS servers
- Unified error responses with error codes and X-SMG-Error-Code headers
Infrastructure
- Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
- Custom Prometheus duration buckets
- Improved logging across all modules
🐛 Bug Fixes & Stability
- Fixed cache-aware routing in gRPC mode
- Resolved load metric tracking and double-decrease issues for cache-aware load balancing
- Improved backward compatibility for GET endpoints
- Fixed gRPC scheduler launcher issues
- Fixed token bucket negative duration panics
- Resolved MCP server initialization issues
📚 Documentation
Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.
⚠️ Migration checklist:
- Update Prometheus dashboards for new metrics
- Update worker API integrations for UUID-based management
- Review new error response format
⚡ Built for speed. Engineered for scale. Production-proven.
Gateway Changes (108 commits)
- [model-gateway] release smg 0.3.0 (#15781) by @slin1237 in #15781
- [model-gateway] Fix logging module name, parse endpoint context, and tokenizer factory (#15782) by @slin1237 in #15782
- [model-gateway] Implement Zero-Copy Vision Tensor Access (#15750) by @ppraneth in #15750
- [model-gateway] Fix IGW routing and optimize RouterManager (#15741) by @slin1237 in #15741
- Fix smg_http_requests_total semantics (#15655) by @fzyzcjy in #15655
- [model-gateway]Enable IGW mode with gRPC router and auto enable IGW when service discovery is turned on (#15459) by @YouNeedCryDear in #15459
- [docs] major SGL Model Gateway documentation update (#15715) by @slin1237 in #15715
- [model-gateway] add back router worker health metric and fix init state (#15622) by @fzyzcjy in #15622
- [mode;-gateway] add back fixes of incorrect metrics after worker removal (#15624) by @fzyzcjy in #15624
- [model-gateway] Add tokenize/detokenize HTTP endpoints and tokenizer management (#15702) by @slin1237 in #15702
- [model-gateway] Fix tokenizer caching and improve error handling (#15695) by @slin1237 in #15695
- [model-gateway]: add gRPC router embeddings endpoint implementation (#15273) by @Ratish1 in #15273
- [model-gateway] Optimize router selection with lock-free snapshots (#15672) by @ppraneth in #15672
- [model-gateway] Replace tokenizer with tokenizer registry for dynamic tokenizer loading in gRPC router (#12968) by @YouNeedCryDear in #12968
- Improve engine customization interface (#15635) by @merrymercy in #15635
- Tiny add back missing router per attempt response metric (#15621) by @fzyzcjy in #15621
- Fix router gRPC mode launch error caused by async loading (#15368) by @fzyzcjy in #15368
- [model-gateway] return 503 when all workers are circuit-broken (#15611) by @slin1237 in #15611
- [model-gateway] add retry support to OpenAI router chat endpoint (#15589) by @slin1237 in #15589
- Optimize Rust CI builds with proper sccache configuration (#15581) by @slin1237 in #15581
- [model-gateway] add retry and circuit breaker support to gRPC routers (#15585) by @slin1237 in #15585
- [model-gateway] refactor WorkerManager with fan_out helper and thin handlers (#15583) by @slin1237 in #15583
- [model-gateway] add WorkerService abstraction for worker business logic (#15580) by @slin1237 in #15580
- [model-gateway] minor code clean up (#15578) by @slin1237 in #15578
- [model-gateway] Use UUIDs for router-managed worker resources (#15540) by @alphabetc1 in #15540
- [model-gateway] /parse/easoning and parse/function_call for sgl-model-gateway (#15568) by @UbeCc in #15568
- [model-gateway]: Tool parser for glm47 (#15520) by @UbeCc in #15520
- [model-gateway] bugfix: backward compatibility for GET endpoints (#15413) by @alphabetc1 in #15413
- [model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching (#15515) by @ppraneth in #15515
- [model-gateway] add model gateway multi-arch docker build, test and document docker image (#15544) by @slin1237 in #15544
- [model-gateway] Implement RAII load guard with response body attachment (#15507) by @slin1237 in #15507
- [router] bugfix: cache_aware in grpc inbalance forward (#15473) by @llfl in #15473
- [model-gateway] simplify workflow engine backoff and reduce duplicate reads (#15505) by @slin1237 in #15505
- [model-gateway] Run workflow event subscribers concurrently (#15504) by @slin1237 in #15504
- [model-gateway] Optimize workflow engine with pre-computed dependency graph (#15503) by @slin1237 in #15503
- [model-gateway] Improve logging across core modules (#15497) by @slin1237 in #15497
- [model-gateway] Improve logging in policies module (#15496) by @slin1237 in #15496
- [model-gateway] Improve logging in data_connector module (#15495) by @slin1237 in #15495
- [model-gateway] refactor: extract common graceful shutdown code before TLS branch (#15494) by @slin1237 in #15494
- [model-gateway] fix graceful shutdown for TLS/Non-TLS server (#15491) by @slin1237 in #15491
- [model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups (#15361) by @slin1237 in #15361
- [model-gateway] optimize worker registry and reduce lock contention in grpc client fetch (#15336) by @slin1237 in #15336
- [model-gateway] reduce cpu overhead (#15316) by @slin1237 in #15316
- Super tin...
Release Gateway-v0.2.4
🚀 SGLang Model Gateway v0.2.4 Released!
We're excited to announce SGLang Model Gateway v0.2.4 – a massive release focused on performance, security, and production-ready observability!
✨ Headline Features
⚡ Major Performance Optimizations
We've invested heavily in performance across the entire stack:
- Optimized radix tree for cache-aware load balancing – Smarter routing decisions with lower overhead
- Tokenizer optimization – Dramatically reduced CPU and memory footprint during tokenization
- Core module optimization – HTTP and gRPC routers now run leaner and faster
- Efficient OTEL implementation – Production-grade observability with minimal performance impact
🔌 Industry-First WASM Middleware Support
Programmable middleware using WebAssembly! Extend your gateway with safe, isolated plugins. Build custom routing logic, transform requests/responses, or integrate proprietary systems – all without touching core code. Your gateway, your rules.
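The middleware model described above can be pictured as a chain of request transforms. This minimal Python sketch shows only the pattern; the actual feature runs sandboxed WASM modules with their own ABI, so every name below is an illustrative assumption:

```python
# Minimal middleware-chain pattern, illustrating what request-transforming
# plugins do conceptually. The real feature executes WASM modules; all
# function and field names here are illustrative assumptions.

def add_trace_header(request: dict) -> dict:
    """Example transform: attach a tracing header to the request."""
    request.setdefault("headers", {})["x-trace-id"] = "trace-123"
    return request

def block_large_bodies(request: dict) -> dict:
    """Example guard: reject oversized request bodies."""
    if len(request.get("body", "")) > 1024:
        raise ValueError("request body too large")
    return request

def run_middleware(request: dict, chain) -> dict:
    """Apply each middleware in order; any one may transform or reject."""
    for middleware in chain:
        request = middleware(request)
    return request

req = run_middleware({"body": "hello"}, [add_trace_header, block_large_bodies])
assert req["headers"]["x-trace-id"] == "trace-123"
```

The sandboxing is the point of using WASM here: a plugin can transform traffic without being able to touch the gateway's process state.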
📊 Production-Grade Observability
Full OpenTelemetry integration with distributed tracing for both HTTP and gRPC. Track requests across your entire inference stack with native trace context propagation. Finally, real visibility into your LLM infrastructure.
⚡ Built for speed. Hardened for security. Ready for production.
Gateway Changes (98 commits)
- [model-gateway] release gateway 0.2.4 (#14763) by @slin1237 in #14763
- [Perf] Optimize radix tree for cache-aware load balancin (#14758) by @slin1237 in #14758
- [SMG] perf: optimize tokenizer for reduced CPU and memory overhead (#14752) by @slin1237 in #14752
- [model-gateway] optimize core modules (#14751) by @slin1237 in #14751
- Tiny extract select_worker_min_load (#14648) by @fzyzcjy in #14648
- [ci][smg] fix docker release ci and add it to pr test (#14683) by @slin1237 in #14683
- Tiny support sgl-router http response status code metrics (#14689) by @fzyzcjy in #14689
- [SMG]feat: implement TokenGuardBody for managing token return (#14653) by @jimmy-evo in #14653
- [model-gateway] add OTEL integration to grpc router (#14671) by @slin1237 in #14671
- Fix cache-aware router should pick min load instead of min tenant size (#14650) by @fzyzcjy in #14650
- [model-gateway] Optimize memory usage in HTTP router (#14667) by @slin1237 in #14667
- [model-gateway] fix WASM arbitrary file read security vol (#14664) by @slin1237 in #14664
- [model-gateway] reduce cpu overhead in grpc router (#14663) by @slin1237 in #14663
- [model-gateway] reducing cpu overhead in various of places (#14658) by @slin1237 in #14658
- Fix dp-aware incompatible with service-discovery (#14629) by @fzyzcjy in #14629
- Super tiny fix unused code in router (#14618) by @fzyzcjy in #14618
- [model-gateway] fix WASM unbounded request/response body read vuln (#14612) by @slin1237 in #14612
- Super tiny remove unused select_worker_pair (#14609) by @fzyzcjy in #14609
- [model-gateway] refactor otel to be more efficient (#14604) by @slin1237 in #14604
- Tiny fix missing policy decision recording (#14605) by @fzyzcjy in #14605
- [model-gateway] fix WASM memory limit per module (#14600) by @slin1237 in #14600
- [model-gateway] reorganize metrics, logging, and otel to its own module (#14590) by @slin1237 in #14590
- [model-gateway] Fixed WASM Security Vulnerability - Execution Timeout (#14588) by @slin1237 in #14588
- [model-gateway] extra accumulator and tool handler in oai router (#14587) by @slin1237 in #14587
- [Bug fix] Add /model_info endpoint to mini_lb (#14535) by @alisonshao in #14535
- [model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) (#13897) by @sufeng-buaa in #13897
- [model-gateway] fix left over sgl-router names in wasm (#14514) by @slin1237 in #14514
- [model-gateway] fix logs in smg workflow (#14513) by @slin1237 in #14513
- [model-gateway] fix left over sgl-router names to sgl-model-gateway (#14512) by @slin1237 in #14512
- [model-gateway] change sgl-router to sgl-model-gateway (#14312) by @slin1237 in #14312
- [model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT (#14405) by @xuwenyihust in #14405
- Fix removing worker will make it healthy forever in prometheus metrics (#14420) by @fzyzcjy in #14420
- [model-gateway] fix server info comment (#14508) by @slin1237 in #14508
- [model-gateway] reorganized conversation handler (#14507) by @slin1237 in #14507
- [model-gateway] Add WASM support for middleware (#12471) by @tonyluj in #12471
- [model-gateway] move conversation to first class routing (#14506) by @slin1237 in #14506
- [misc] add model arch and type to server info and use it for harmony (#14456) by @slin1237 in #14456
- [model-gateway] grpc to leverage event type (#14450) by @slin1237 in #14450
- [model-gateway] add mistral 3 image processor (#14445) by @slin1237 in #14445
- [model-gateway] move all responses api event from oai to proto (#14446) by @slin1237 in #14446
- [model-gateway] move oai header util to router header util (#14441) by @slin1237 in #14441
- [model-gateway] extract conversation out of oai router (#14440) by @slin1237 in #14440
- [model-gateway] add llama4 vision image processor (#14438) by @slin1237 in #14438
- [model-gateway] introduce request ctx for oai router (#14434) by @slin1237 in #14434
- [model-gateway] add phi4 vision image processor (#14430) by @slin1237 in #14430
- Add Mistral Large 3 support. (#14213) by @dcampora in #14213
- [model-gateway] introduce provider in openai router (#14394) by @slin1237 in #14394
- [model-gateway] add phi3 vision image processor (#14381) by @slin1237 in #14381
- [model-gateway][doc] Add STDIO Explicitly to Example in README (#14393) by @xuwenyihust in #14393
- Fix sgl-router silently parse selector wrongly causing OME fail to discover pods (#14359) by @fzyzcjy in #14359
- [model-gateway] add qwen3_vl model image processor (#14377) by @slin1237 in #14377
- [model-gateway] use worker crate in openai router (#14330) by @slin1237 in #14330
- [model-gateway] add qwen2.5_vl model image processor (#14375) by @slin1237 in #14375
- [model-gateway] add qwen2_vl model image processor and tests (#14374) by @slin1237 in #14374
- [model-gateway] add llava model image processor and tests (#14371) by @slin1237 in #14371
- [model-gateway] add image processor and transformer structure (#14344) by @slin1237 in #14344
- [model-gateway] multimodality initialization (#13350) by @slin1237 in #13350
- [model-gateway] add workflow for external model providers (#14323) by @slin1237 in #14323
- [model-gateway] change rust package name to sgl-model-gateway instead (#14283) by @slin1237 in #14283
- [model-gateway] fix version output (#14276) by @slin1237 in #14276
- [model-gateway] include smg version command in py binding (#14274) by @slin1237 in #14274
- [model-gateway] add audio and moderation in model card (#14263) by @slin1237 in #14263
- [model-gateway] Add e2e tests of streaming events and tool choice for response api (#13880) by @XinyueZhang369 in #13880
- [model-gateway] Migrate Worker trait to model-aware methods (#14250) by @slin1237 in #14250
- [model-gateway] add ModelCard support to WorkerMetadata (#14243) by @slin1237 in #14243
- [...
Release v0.5.6
Highlights
- Support for DeepSeek V3.2/V3.2 Speciale #14249
- Blockwise diffusion language model support #12588
- Support for new diffusion models (Flux2 #14000, Z-image #14067)
- Introduce JIT Kernels #13453
- Upgrade to Torch 2.9 #12969
- Kimi-K2-Thinking model enhancement #12882
- Memory management/Overlap spec compatibility #12224 #12839
- More performance optimizations: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2...
- CI/CD Enhancement
What's Changed
- [router][grpc] Add more mcp test cases to responses api by @CatherineSue in #12749
- [Intel]Add 'intel_xpu' attention backend for llama4 by @gaopengff in #11051
- [Intel XPU]Update pytorch xpu to 2.9 by @gaopengff in #12363
- [Docs] fix dead links in multiple documentation pages by @mattheliu in #12764
- [mem pool] bugfix: wrong position for self.device in Mamba by @stmatengss in #12684
- [Fix]HTTP Stream raise exception by @jimmy-evo in #11904
- [CPU] Fix TP padding case with weight block size by @jianan-gu in #8243
- [docs] Remove redundant --disable-radix-cache option from by @rchalamala in #12717
- Pin uvloop to 0.21.0 by @yeahdongcn in #12279
- [fix] Only enable flashinfer all reduce fusion by default for single-node servers by @leejnau in #12724
- chore: update CODEOWNERS by @zhyncs in #12795
- Fix hang in deepgemm compilation with symmetric memory enabled by @nvcastet in #12715
- Add bot-bump-kernel-version-to-sglang workflow by @alisonshao in #12794
- ignore the deepgemm check when the model weight with nvfp4 and moe ba… by @rainj-me in #12782
- [AMD] Update wave-lang to 3.8.2 by @xintin in #12576
- [DeepSeek-V3.2][NSA] Enable MHA Pathway for Short Sequence Prefill on B200 (SM100) by @YAMY1234 in #12788
- [hotfix]: Resolve ModuleNotFoundError in PD deployment for is_in_ci() by @hzh0425 in #12772
- [HotFix]: Add missing SGLANG_EPLB_HEATMAP_COLLECTION_INTERVAL env var by @hzh0425 in #12776
- Add PP support for dots_vlm by @gty111 in #12763
- fixes hardcoded "cuda" device references in unit tests to use a dynamic device selection by @kalyank007 in #12761
- fix multimodal gen issues by @yhyang201 in #12765
- [Test] Add DeepSeekV3.2 NSA Indexer Test Suite by @Johnsonms in #12520
- [Bugfix] Fix illegal memory access by @elvischenv in #12758
- [MoE] Add Comprehensive MoE Integration Tests by @Jonahcb in #12090
- [Deepseek V3.2] Only skip Indexer logits computation when is_extend_without_speculative by @hlu1 in #12816
- Fix missing dp_max_padding argument in set_dp_buffer_len by @Chen-0210 in #12812
- optm(checkpoint-engine): disable multi-thread loading when update weights by @BraveY in #12374
- Fix piecewise cuda graph ci test by @ispobock in #12836
- update multimodal_gen readme by @mickqian in #12825
- [router] Support structured model output for openai and grpc router by @key4ng in #12431
- Fix data parallel controller launch for num nodes > 2 by @merrymercy in #12822
- remove the fa4 page_size hardcode to 128 restriction on mla model arch by @rainj-me in #12801
- sglang diffusion announcement by @wisclmy0611 in #12856
- add back flashinfer jit cache to dev docker by @b8zhong in #12851
- [router][grpc] Refactor: Add builders for chat and responses by @CatherineSue in #12852
- [router][grpc] Move all error logs to their call sites by @CatherineSue in #12859
- [router] Switch MCP tests from DeepWiki to self-hosted Brave search server by @key4ng in #12849
- Add nightly performance test for GPT-OSS 4GPU models by @alisonshao in #12805
- [sgl-kernel][Deepseek V3.2] Add row_starts to topk kernel by @hlu1 in #12582
- [CI] Fix huggingface access for test_flash_attention_4.py by @Fridge003 in #12846
- [Auto Sync] Update activation.py, logits_processor.py, rota... (20251107) by @merrymercy in #12853
- [Docs][DeepseekV3.2] Update deepseekv3.2 docs for mha short seq prefill by @YAMY1234 in #12868
- Support capturing aux_hidden_states for minimax m2. by @pyc96 in #12798
- [CI] Tiny adjust CI esitmation time by @hnyls2002 in #12886
- [DP-Attn] Clarify MLP sync / idle batch preparation logic by @hnyls2002 in #12843
- Fix sending all requests to the first rank in DP attention by @fzyzcjy in #12832
- Apply moe_reduce_sum kernel for fused_marlin_moe by @ispobock in #12888
- use fast stream instead of torch.cuda.current_stream in llama 4 shared experts overlap by @b8zhong in #12811
- [Fix] Fix trtllm-mla backend when chunked prefix cache is disabled by @Fridge003 in #12361
- Refs/heads/add nightly test multi gpu configs by @alisonshao in #12870
- chore: bump sgl-kernel version to 0.3.16.post6 by @sglang-bot in #12889
- Update CODEOWNERS by @ispobock in #12897
- Tiny simplify `can_run_dp_cuda_graph` gather logic by @hnyls2002 in #12891
- Fix spec decoding acc length for dpsk-r1-fp4 tp8 by @Qiaolin-Yu in #12896
- Revert "Fix spec decoding acc length for dpsk-r1-fp4 tp8" by @Qiaolin-Yu in #12900
- Add Deepseek models into nightly tests by @Kangyan-Zhou in #12865
- Fix empty server args in marlin moe test by @ispobock in #12904
- Fix duplicate nightly test name by @Kangyan-Zhou in #12905
- Add HF cleanup logic in ci_install_dependency.sh by @Kangyan-Zhou in #12895
- fallback to triton mm_persistent kernel when deepGemm fail by @zminglei in #12911
- Add kimi k2 thinking to ci by @ispobock in #12907
- Fix Deepseek nightly tests by @Kangyan-Zhou in #12906
- Add Jet-Nemotron by @futrime in #12448
- [CI] increase ut buckets & adjust estimation time. by @hnyls2002 in #12919
- [PD] feat: refactor custom mem pool and add barex pd support by @stmatengss in #12332
- [CI] Fix `matrix.part` in pr-test. by @hnyls2002 in #12920
- Adjust server launch time in ci by @ispobock in #12917
- feat: basic support for server-level multimodal cache by @mickqian in #10775
- Refactor / Unify event loop across PD-Disagg, Overlap, DP-Attn cases by @hnyls2002 in #12839
- [lint] tiny fix unimported packages. by @hnyls2002 in #12927
- ci: try to fix gpg error during kernel build by @ishandhanani in #12928
- Support piecewise cuda graph for MLA by @ispobock in #11812
- diffusion: skip full CI suite for multimodal_gen changes by @mickqian in #12940
- Minor code cleanup / improvement for `PREBUILT_EXTEND` mode by @hnyls2002 in #12948
- Bugfix: LMCache Connector with Sglang by @MMuzzammil1 in #12946
- [Docs] Add docs for Qwen3-VL image and video support by @adarshxs in #12554
- [Refactor] rename set_index_k_and_scale_buffer to set_index_k_scale_b… by @edwingao28 in #12956
- Refactor KTransformers heterogeneous compute with unified GPU-quantization backend by @Atream in #12834
- diffusion: fix detected file changes rule in CI by @mickqian in #12943
- c...
Release Gateway-v0.2.3
🚀 SGLang Model Gateway - New Release!
We're excited to announce another powerful update to SGLang Model Gateway with performance improvements and expanded database support!
✨ Headline Features
⚡ Bucket Mode Routing - 20-30% Performance Boost
Introducing our new bucket-based routing algorithm that dramatically improves performance in PD mode: up to 20-30% improvements in TTFT (Time To First Token) and overall throughput.
💾 PostgreSQL Support for Chat History Management
Flexibility in data storage! We now support PostgreSQL alongside OracleDB and in-memory storage for chat history management.
🛠️ Enhanced Model Tool & Structured Output Support
- MiniMax M2 model support!
- Structured model output for OpenAI and gRPC router
- Streaming parsing with Tool Choice in chat completions API
- Tool_choice support for Responses API
- OutputItemDone events with output item array storage for better observability
🐛 Stability & Quality Improvements
Multiple bug fixes for model validation, streaming logic, reasoning content indexing, and CI stability enhancements.
🔧 Code Quality Enhancements
Refactored builders for chat and responses, restructured modules for better maintainability, and consolidated error handling.
Try the latest version: `pip install sglang-router --upgrade`
What's Changed in Gateway
Gateway Changes (45 commits)
- [model-gateway] smg release 0.2.3 (#13312) by @slin1237 in #13312
- [router]Replace requests lib with openai in e2e_response_api (#13293) by @XinyueZhang369 in #13293
- fix outdated router doc (#13255) by @fzyzcjy in #13255
- [router][grpc] Refine docs in minimax_m2 to match other parsers (#13218) by @CatherineSue in #13218
- fix: display served_model_name in /v1/models (#13155) by @Sunhaihua1 in #13155
- [router] minmax-m2 xml tool parser (#13148) by @slin1237 in #13148
- [router] remove worker url requirement (#13172) by @slin1237 in #13172
- [router] Fix Flaky test_circuit_breaker_opens_and_recovers (#13164) by @XinyueZhang369 in #13164
- [router] Add comprehensive validation to Responses API (#13127) by @key4ng in #13127
- bugfix: multi-model routing for /generate api (#12979) by @SYChen123 in #12979
- [router][grpc] Support vllm backend for grpc router (#13120) by @CatherineSue in #13120
- [router] add minmax m2 reasoning parser (#13137) by @slin1237 in #13137
- [router] Support complex assistant and tool messages in /chat/completions (#12860) by @hellodanylo in #12860
- [router] move radix tree to policy crate and addreses some code styles (#13131) by @slin1237 in #13131
- [Router] use call_id instead of id for matching function calls in Responses API for Harmony (#13056) by @zhaowenzi in #13056
- Revert "fix: display served_model_name in /v1/models" (#13093) by @CatherineSue in #13093
- fix: display served_model_name in /v1/models (#13063) by @Sunhaihua1 in #13063
- [router] add postgres databases data connector (#12218) by @lengrongfu in #12218
- [router][ci] Quick Improvement to make CI more stable (#12869) by @key4ng in #12869
- [router][ci] Fix maturin build (#13012) by @key4ng in #13012
- [router] bucket policy (#11719) by @syy-hw in #11719
- [router] Switch MCP tests from DeepWiki to self-hosted Brave search server (#12849) by @key4ng in #12849
- [router][grpc] Move all error logs to their call sites (#12859) by @CatherineSue in #12859
- [router][grpc] Refactor: Add builders for chat and responses (#12852) by @CatherineSue in #12852
- [router] Support structured model output for openai and grpc router (#12431) by @key4ng in #12431
- [router][grpc] Add more mcp test cases to responses api (#12749) by @CatherineSue in #12749
- fix ci (#12760) by @key4ng in #12760
- Add timing metrics for requests (#12646) by @cicirori in #12646
- [router][ci] Disable cache (#12752) by @key4ng in #12752
- [router][grpc] Support mixin tool calls in Responses API (#12736) by @CatherineSue in #12736
- Revert "[router] web_search_preview tool basic implementation" (#12716) by @key4ng in #12716
- [router] add basic ci tests for gpt-oss model support (#12651) by @key4ng in #12651
- [router][quick fix] Add minimal option for reasoning effort in spec (#12711) by @key4ng in #12711
- [router][grpc] Make harmony parser checks recipient first before channel (#12713) by @CatherineSue in #12713
- [router][ci] speed up python binding to 1.5 min (#12673) by @key4ng in #12673
- [router] fix: validate HTTP status codes in health check (#12631) by @wyx-0203 in #12631
- [router][grpc] Support streaming parsing with Tool Choice in chat completions API (#12677) by @CatherineSue in #12677
- [router][grpc] Implement tool_choice support for Responses API (#12668) by @CatherineSue in #12668
- [router][grpc] Emit OutputItemDone event and store output item array (#12656) by @CatherineSue in #12656
- [router][grpc] Fix index issues in reasoning content and missing streaming events (#12650) by @CatherineSue in #12650
- [router][grpc] Fix model validation, tool call check, streaming logic and misc in responses (#12616) by @CatherineSue in #12616
- Support aggregating engine metrics in sgl-router (#11456) by @fzyzcjy in #11456
- [router][grpc] Restructure modules and code clean up (#12598) by @CatherineSue in #12598
- [router][grpc] Consolidate error messages build in error.rs (#12301) by @CatherineSue in #12301
- [ci] install released version router (#12410) by @key4ng in #12410
New Contributors
- @XinyueZhang369 made their first contribution in 2cdde3d46
- @Sunhaihua1 made their first contribution in a06c44f90
- @zhaowenzi made their first contribution in 7b877ab83
- @cicirori made their first contribution in 58095cb00
- @wyx-0203 made their first contribution in 3651cfbf6
- @syy-hw made their first contribution in 611a4fd08
- @SYChen123 made their first contribution in 4ef439054
- @hellodanylo made their first contribution in d28caaf60
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.2...gateway-v0.2.3
Release v0.5.5
Highlights
- Day 0 support for Kimi-K2-Thinking https://huggingface.co/moonshotai/Kimi-K2-Thinking
- Day 0 support for Minimax-M2 https://huggingface.co/MiniMaxAI/MiniMax-M2
- Video and image generation support https://lmsys.org/blog/2025-11-07-sglang-diffusion/
- Q4 Roadmap: #12780
- Blackwell kernel optimizations and MoE runner backend refactor
- Overlap spec and prefill cuda graph support more models
What's Changed
- [8/n] decouple quantization impl from vllm dependency - gguf srt by @FlamingoPg in #11964
- lang: support direct video inference by @mickqian in #9936
- Enable Llama 4 + TRTLLM MHA by @b8zhong in #12003
- Refactor Triton-kernel MoE runner integration by @Jonahcb in #11795
- use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk by @b8zhong in #11816
- Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 by @thelongestusernameofall in #11909
- Revert "lang: support direct video inference" by @merrymercy in #12038
- support more model in piecewise cuda graph by @narutolhy in #11745
- [Fix] Fix lint to pass CI by @Fridge003 in #12037
- Revert "[Fix] Fix lint to pass CI" by @Fridge003 in #12042
- fix: fix MMMU loading issue by @ZailiWang in #11759
- Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once by @xu-yfei in #10953
- Add gguf dependency for cpu/xpu by @ZailiWang in #12041
- fix: the hardcode hf repo name comparison for deepseek-ocr by @rainj-me in #12031
- Install numactl in Dockerfile for GH200/GB200/GB300 by @fzyzcjy in #11853
- [router] Add mTLS Support for Router-to-Worker Communication by @slin1237 in #12019
- Tiny cleanup send_single by @fzyzcjy in #12056
- Refactoring GLM-4.5 and GLM-4.5V related implementations by @zRzRzRzRzRzRzR in #11800
- [Fix] fix missing `ipc_name` of `__getitem__` in some IO structs by @whybeyoung in #12053
- fix: bench_serving ITL calculation when using spec-decoding by @JustinTong0323 in #12064
- Fix dpsk-r1-fp4 launching crash by @Qiaolin-Yu in #12063
- Revise POINTSV15Chat model by @yuan-luo in #12049
- Add 'gguf' to project dependencies by @Muqi1029 in #12046
- [Profiler] expand '~' by @Muqi1029 in #11999
- [b200] fix piecewise cuda graph launch bug by @BBuf in #12067
- Fix multi processing serializer bug by @fzyzcjy in #11958
- [Fix]: HiCache hasher failed when EAGLE mode enabled by @leavelet in #12025
- adjust dynamic vs static outputs comparison in test_lora_update.py by @glenliu21 in #11884
- [router] implement response api get input item function and refactor input/output store by @key4ng in #11924
- fix(compile_utils, ep_moe): update environment variable and dtype check by @ishandhanani in #12034
- [router] fix ut router config init to use build pattern by @slin1237 in #12084
- docs(server-arguments): add allowed options for each argument by @Jonahcb in #11560
- [router] migrate app context to builder pattern 1/n by @slin1237 in #12086
- [router] migrate app context to builder pattern 2/n by @slin1237 in #12089
- [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files by @CatherineSue in #12091
- [1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU by @zminglei in #12000
- Fix: Update blog link by @LucaLow in #12071
- perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph by @cicirori in #12093
- [2/N]Support DeepSeek-R1 w4a8 low latency deepep by @ayrnb in #8464
- Enhance tests in deterministic kernels by @fzyzcjy in #12070
- [Doc] Add documentation for DeepSeek V3.2 by @Fridge003 in #11877
- [10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE by @ch-wan in #12054
- Support true on-policy by @fzyzcjy in #12058
- [Docs] update sgl-kernel readme by @FlamingoPg in #11379
- Fix 'KeyError' for per_token expert distribution recorder by @vipwangerxiao in #9501
- Fix kernel version bump file by @Kangyan-Zhou in #12087
- [Fix] Set global args in cpu test by @Fridge003 in #12105
- chore: bump sgl-kernel version to 0.3.16.post4 by @sglang-bot in #12103
- [Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) by @merrymercy in #12083
- [router] Refactor data connector architecture with unified storage modules by @key4ng in #12096
- fix: release workflow should work on both archs by @ishandhanani in #12110
- [bugs] docker file name should be .Dockerfile so it can properly render by @slin1237 in #11869
- Clean up server args & Add CI scripts by @merrymercy in #12124
- [Misc] Improve the error message of failed import by @DarkSharpness in #12119
- [CI] Add ci monitor balance workflow by @BBuf in #11962
- Skip TestLlama4LoRA in CI by @lifuhuang in #12098
- clean up github tokens by @merrymercy in #12126
- Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation by @YAMY1234 in #12115
- Fix token for CI monitor by @merrymercy in #12127
- Reenable b200 tests by @Kangyan-Zhou in #11814
- Update document index for DeepSeek-v32 docs by @Fridge003 in #12101
- Update sgl-kernel version to 0.3.16.post4 by @Fridge003 in #12125
- [Doc] Fix format for deepseek v3.2 document by @Fridge003 in #12130
- Accelerate deepseek fp4 b200 ci by @Qiaolin-Yu in #11993
- Clean up server launch code and multi tokenizer by @merrymercy in #12132
- [Test] Add dsv3.2 nsa backend testing by @Johnsonms in #11936
- [docs] upd docker files names everywhere by @vincentzed in #12133
- Make bmm batch invariant injection optional by @fzyzcjy in #12118
- [Doc] Small update of DeepSeek v3.2 document by @Fridge003 in #12138
- docs: update README by @zhyncs in #12139
- [router] MCP Manager - Support Connection Pooling, Tool Inventory and Proxy by @slin1237 in #12097
- [NVIDIA] Change default quant method for model_opt by @kaixih in #11991
- [router] update smg code owners for each component by @slin1237 in #12141
- [router] cleaned up all the redundant comments in the config module by @CatherineSue in #12147
- Clean up attention backend selection code & Other minor rename by @merrymercy in #12136
- [log] Make forward iter count optional by @hnyls2002 in #12116
- [misc] depdencies & enviroment flag by @hnyls2002 in #12113
- [quantization] AWQ Marlin doesn't work when dtype is bfloat16 by @kevin85421 in #11494
- [HiCache]Page head layout IO kernel by @huangtingwei9988 in #11615
- Do not use `MagicMock` to mock `server_args` in tests by @hnyls2002 in #12154
- [router][grpc] Fix tool call id in `parse_json_schema_response` by @catheri...
Release Gateway-v0.2.2
🚀 SGLang Model Gateway v0.2.2 Released!
✨ Features
🎯 Industry-First Responses API for All Models
We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for Llama, DeepSeek, Qwen, and more – with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models.
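Multi-turn conversations work by chaining response IDs rather than replaying history. As a minimal sketch (the model name, response id, and helper below are illustrative, not part of the gateway's documented schema), a follow-up request carries `previous_response_id` so the gateway can reconstruct chat history server-side:

```python
# Hypothetical sketch of multi-turn Responses API request bodies.
# The chaining via `previous_response_id` follows OpenAI's Responses
# API spec; the model name and id values are placeholder assumptions.
def build_responses_request(model, user_input, previous_response_id=None):
    """Build a /v1/responses request body for one conversation turn."""
    body = {"model": model, "input": user_input, "store": True}
    if previous_response_id is not None:
        # Ties this turn to the stored history of an earlier response
        body["previous_response_id"] = previous_response_id
    return body

first = build_responses_request("deepseek-v3", "What is SGLang?")
# Suppose the gateway answered with id "resp_123"; the next turn sends
# only the new input plus that id -- no manual history replay needed.
follow_up = build_responses_request(
    "deepseek-v3",
    "Summarize that in one sentence.",
    previous_response_id="resp_123",
)
```

Each body would then be POSTed to the gateway's `/v1/responses` endpoint with any OpenAI-compatible client.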
☸️ Production-Ready Kubernetes Operations
Taking large-scale deployments seriously! We now support native gRPC health check endpoints, making it effortless to deploy and operate SGLang at scale on Kubernetes with proper health monitoring and orchestration.
🔐 Your Network, Your Control
- mTLS Support: Secure gateway-to-SGLang communication whether you're running on edge, remote cloud, multi-cloud, or hybrid environments – we've got you covered
- MCP Proxy Enhancements: Configure proxies globally or per individual MCP server – complete network control in your hands
🤖 Harmony Pipeline
Introducing our unified OpenAI-native architecture with GPT OSS model support for both Responses API and Chat Completion – fully integrated with MCP and intelligent storage management.
🌍 Universal Platform Support
A major leap in accessibility! SGLang Model Gateway now runs on nearly every operating system and architecture: Linux, Windows, Mac, x86, and ARM. Even better – we support all Python versions from 3.8 to 3.14 in a single wheel file, while reducing wheel size by more than 40%. Deploy anywhere, on any Python version, with unprecedented efficiency!
⚡ Additional Enhancements
- Multi-worker URL support for better load distribution
- Connection pooling and tool inventory for MCP
- Native OpenAI web search tool support and function calling for OpenAI router
🐛 Stability Improvements
We've squashed numerous bugs including background task handling, tool call IDs, conversation management, and installation dependencies.
Try it now: pip install sglang-router==0.2.2
What's Changed in Gateway
Gateway Changes (48 commits)
- [router] 0.2.2 release (#12399) by @slin1237 in #12399
- [router] web_search_preview tool basic implementation (#12290) by @key4ng in #12290
- [router] Function call support for openai router Responses API (#12386) by @key4ng in #12386
- [router] Fix safety_identifier missing (#12404) by @key4ng in #12404
- [router] use safety_identifier replace user on chat history storage (#12185) by @lengrongfu in #12185
- [router] harmony responses api streaming support (#12395) by @slin1237 in #12395
- [router] Harmony Pipeline: Chat Completion & Responses API with MCP Support (#12153) by @slin1237 in #12153
- [bug] fix router installation to include additional dependency (#12348) by @slin1237 in #12348
- [router] refactor mcp to use LRU and fix pooling bug (#12346) by @CatherineSue in #12346
- [bug] fix router pypi license file (#12345) by @slin1237 in #12345
- [router] fix router release workflow and add build test in PR (#12315) by @CatherineSue in #12315
- [Bug fix] trace: fix import error in mini_lb if sgl-router image does not install sglang (#12338) by @sufeng-buaa in #12338
- [router][grpc] Fix inconsistent behavior of conversation_id not found (#12299) by @CatherineSue in #12299
- [router] support arm, windows, mac, linux, reduce wheel size and number (#12285) by @slin1237 in #12285
- [rust][ci] Add end-to-end tests for Oracle history backend (#12233) by @key4ng in #12233
- [router] upgrade grpc dependency and py 3.13 3.14 support (#12284) by @slin1237 in #12284
- [router] Fix type unmatch during validation (#12257) by @key4ng in #12257
- [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 (#10804) by @sufeng-buaa in #10804
- [router] configure workflow retries and timeout based on routerConfig (#12252) by @slin1237 in #12252
- [router] use mcp struct from sdk and clean up code across codebase (#12249) by @slin1237 in #12249
- [router] remove code duplication (#12245) by @slin1237 in #12245
- [sgl-route] Optimize the use of constant slices and retain to simplif… (#12159) by @lengrongfu in #12159
- [router] Remove SharedXxxStorage type aliases to make Arc explicit (#12171) by @CatherineSue in #12171
- [router][grpc] Add `ResponsesContext` and fix error propagation in responses api (#12164) by @CatherineSue in #12164
- [misc][grpc] Remove duplicate log (#12168) by @CatherineSue in #12168
- [router] centralize mcp tool args handling (#12155) by @slin1237 in #12155
- [router][grpc] Fix tool call id in `parse_json_schema_response` (#12152) by @CatherineSue in #12152
- [router] cleaned up all the redundant comments in the config module (#12147) by @CatherineSue in #12147
- [router] MCP Manager Refactoring - Flat Architecture with Connection Pooling (#12097) by @slin1237 in #12097
- [router] Refactor data connector architecture with unified storage modules (#12096) by @key4ng in #12096
- [router][grpc] Remove gpt_oss parsers and remove _parser suffix in tool parser files (#12091) by @CatherineSue in #12091
- [router] migrate app context to builder pattern 2/n (#12089) by @slin1237 in #12089
- [router] migrate app context to builder pattern 1/n (#12086) by @slin1237 in #12086
- [router] fix ut router config init to use build pattern (#12084) by @slin1237 in #12084
- [router] implement response api get input item function and refactor input/output store (#11924) by @key4ng in #11924
- [router] Add mTLS Support for Router-to-Worker Communication (#12019) by @slin1237 in #12019
- [router] Add builder pattern for RouterConfig with zero duplication (#12030) by @slin1237 in #12030
- [router][CI] Clean up imports and prints statements in sgl-router/py_test (#12024) by @CatherineSue in #12024
- [router] change ci names and update log level in ci (#12021) by @slin1237 in #12021
- [Router] Consolidate ConnectionMode enum to core module (#11937) by @YouNeedCryDear in #11937
- [router] Add comprehensive E2E tests for Response API (#11988) by @key4ng in #11988
- [grpc] Support gRPC standard health check (#11955) by @CatherineSue in #11955
- [router] create worker removal step and clean up worker manager (#11921) by @slin1237 in #11921
- [router] Support multiple worker URLs for OpenAI router (#11723) by @key4ng in #11723
- [router][grpc] Fix background tasks stored with wrong id (#11945) by @CatherineSue in #11945
- [router] Add gRPC E2E test suite (#11790) by @key4ng in #11790
- [router][grpc] Support `v1/responses` API (#11926) by @CatherineSue in #11926
- Fix openai input_text type compatibility (#11935) by @key4ng in #11935
New Contributors
- @lengrongfu made their first contribution in 09af0a7b5
- @sufeng-buaa made their first contribution in ea9610600
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.1...gateway-v0.2.2
Release v0.5.4
Highlights
- AMD AI Dev Day 2025 SGLang (slide), PyTorch Conference 2025 SGLang (slide)
- Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
- [beta] Overlap scheduler for speculative decoding: #11762
- [beta] Piecewise CUDA graph for prefill: #11490
- Prefix cache for qwen3 next and GDN/mamba models: #11214
- Fullset optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
- Various Blackwell kernel optimizations
- DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
- KTransformer integration: https://lmsys.org/blog/2025-10-22-KTransformers/
- New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
- Native ModelOpt quantization support
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
- Remove env var warnings for release by @merrymercy in #11262
- Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
- [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
- disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
- docker: add manifest to versioned docker releases by @ishandhanani in #11268
- [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
- [router][grpc] Refine streaming processes by @CatherineSue in #11277
- Fix code sync scripts by @merrymercy in #11276
- [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
- Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
- Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
- Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
- fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
- docs: update sgl-kernel README by @zhyncs in #11286
- chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
- convert test_deterministic into unit tests by @skyzh in #11095
- Feature/longbench v2 evaluation utils by @alhridoy in #10949
- [ci] fix pp test by @hnyls2002 in #11294
- EAGLE cache fix for SWARadixCache by @ispobock in #11231
- Remove overlap thread by @hnyls2002 in #11210
- [router] add reasoning and tool parser argument in router by @slin1237 in #11290
- Remove sampling info events and overlap thread file by @hnyls2002 in #11300
- Introduce future indices by @hnyls2002 in #11301
- [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
- [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
- [router] add get server info and get model info in grpc server by @slin1237 in #11303
- [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
- [Doc] HiCache Design Documents by @ykwd in #11027
- [Doc]: Best Practice for HICache by @hzh0425 in #11001
- [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
- [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
- Update tool parser and related documentation by @JustinTong0323 in #11223
- [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
- [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
- [router] support Openai router conversation API CRUD by @key4ng in #11297
- [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
- [router] cleanup worker health check to return early by @slin1237 in #11310
- [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
- Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
- ci: unify the model launch method of nightly ci by @mickqian in #11230
- [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
- update sampling_params documentation with defaults by @JustinTong0323 in #11315
- Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
- Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
- [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
- [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
- Skip weight loading in deepgemm compilation by @ch-wan in #11312
- [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
- [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
- fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
- Support LoRA in bench_serving oai interface by @lifuhuang in #11318
- benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
- [CI] improve disaggregation CI. by @hnyls2002 in #11264
- model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
- [router] refactor generate to use new pipeline arch by @slin1237 in #11323
- [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
- [router] Fix all unused_qualifications by @CatherineSue in #11341
- [router] Support history management using conversation by @key4ng in #11339
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
- fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
- [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
- [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
- [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
- [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
- [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
- [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
- [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in https://github.c...
Release Gateway-v0.2.1
🚀 SGLang Model Gateway v0.2.1 Released!
This release focuses on stability, cleanup, and two big new performance features.
🧾 Docs & CI
- Updated router documentation to reflect recent feature additions
🧹 Code Cleanup
- Refactored StopSequenceDecoder for cleaner incremental decoding
- Added spec.rs test harness under spec/ for structured unit tests
🐞 Bug Fixes
- Fixed UTF-8 boundary in stop-sequence decoding
- Fixed gRPC timeout configuration
- Fixed worker filtering, tool-choice normalization, and bootstrap-port handling
- Additional gRPC server warm-up and concurrency fixes
🌟 New Features
- Two-Level Tokenizer Caching (L0 + L1)
- L0: exact-match cache for repeated prompts
- L1: prefix-aware cache at special-token boundaries
- OpenAI-Style Classification API → new /v1/classifications endpoint, shout out to yanbo for the contribution
- Worker Management Workflow Engine → improved async registration, worker self-discovery, and health orchestration
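The two cache levels above can be sketched as follows. This is an illustrative model of the idea only, not the router's Rust implementation: L0 memoizes whole prompts, while L1 reuses the token ids of the longest previously seen prefix ending at a special-token boundary (the `<|im_end|>` marker and the char-level tokenizer in the demo are assumptions for the sketch), so only the tail is re-tokenized.

```python
# Hedged sketch of a two-level (L0 exact-match + L1 prefix) tokenizer
# cache. Cutting the L1 prefix only at a special-token boundary keeps
# "prefix ids + tail ids" identical to tokenizing the full string.
SPECIAL = "<|im_end|>"

class TwoLevelCache:
    def __init__(self, tokenize):
        self.tokenize = tokenize   # underlying tokenizer function
        self.l0 = {}               # L0: exact prompt -> token ids
        self.l1 = {}               # L1: special-token prefix -> token ids

    def encode(self, text):
        if text in self.l0:                       # L0 hit: repeated prompt
            return self.l0[text]
        cut = text.rfind(SPECIAL)
        prefix = text[: cut + len(SPECIAL)] if cut != -1 else None
        ids = None
        if prefix is not None and prefix in self.l1:
            # L1 hit: reuse prefix ids, tokenize only the new tail
            ids = self.l1[prefix] + self.tokenize(text[len(prefix):])
        if ids is None:
            ids = self.tokenize(text)             # cold path: full encode
        if prefix is not None and prefix not in self.l1:
            self.l1[prefix] = self.tokenize(prefix)
        self.l0[text] = ids
        return ids

cache = TwoLevelCache(list)                  # char-level tokenizer for the demo
ids1 = cache.encode("sys<|im_end|>hello")    # cold: full tokenization
ids2 = cache.encode("sys<|im_end|>world")    # warm: prefix served from L1
```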
What's Changed in Gateway
Gateway Changes (26 commits)
- [router] release router 0.2.1 (#11885) by @slin1237 in #11885
- [router][grpc] Fix wram-up random token ids for small models (#11887) by @CatherineSue in #11887
- [router] clean up workflow logs to debug for implementation details logs (#11886) by @slin1237 in #11886
- fix(sql-router): fix conflict port in test (#11826) by @htiennv in #11826
- [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` (#11882) by @CatherineSue in #11882
- [router] remove encoding header for oai router (#11881) by @slin1237 in #11881
- [router] Worker Management Workflow Engine (#11868) by @slin1237 in #11868
- [2/2] [feature] support openai like classification api in router (#11670) by @whybeyoung in #11670
- [router] Add Configurable L0 and L1 Tokenizer Caching (#11688) by @slin1237 in #11688
- [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client (#11798) by @CatherineSue in #11798
- [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files (#11685) by @CatherineSue in #11685
- [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log (#11775) by @CatherineSue in #11775
- [doc] update router document (#11767) by @key4ng in #11767
- [router] fix grpc client time out to 1h (#11768) by @slin1237 in #11768
- [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder (#11766) by @slin1237 in #11766
- Revert "[router] fix get_models endpoint for openai router (#11687)" (#11740) by @key4ng in #11687
- [router] Add rustfmt and set group imports by default (#11732) by @CatherineSue in #11732
- [router] add spec.rs to enables tests under spec folder (#11734) by @key4ng in #11734
- [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut (#11731) by @CatherineSue in #11731
- [router][grpc] add dissag info to warm up in grpc server (#11727) by @slin1237 in #11727
- [router] fix p and d worker filtering and bootstrap port handling (#11729) by @slin1237 in #11729
- [Router] Refactor protocol definitions: split spec.rs into modular files (#11677) by @key4ng in #11677
- [router] fix get_models endpoint for openai router (#11687) by @key4ng in #11687
- [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676) by @slin1237 in #11676
- [router][grpc] Simplify model_id determination (#11684) by @CatherineSue in #11684
- [router] Fix response api related spec (#11621) by @key4ng in #11621
Paths Included
- sgl-router
- python/sglang/srt/grpc
- python/sglang/srt/entrypoints/grpc_server.py
Full Changelog: gateway-v0.2.0...gateway-v0.2.1
Release Gateway-v0.2.0
🚀 Release: SGLang Model Gateway v0.2.0 (formerly “SGLang Router”)
🔥 What’s new
🧠 Multi-Model Inference Gateway (IGW) Mode
IGW turns one router into many — letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface.
You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway handle routing, health checks, and load balancing.
Whether you’re mixing Llama, Mistral, and DeepSeek, or orchestrating per-tenant routing in enterprise setups, IGW gives you total control.
Your fleet, your rules. ⚡
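The registration flow above can be sketched as request payloads. This is a hedged illustration only: the exact field names (`labels`, `policy`, `tier`) and the worker URLs are assumptions about the payload shape, not the documented `/workers` schema.

```python
# Hypothetical sketch: registering two model workers with the
# multi-model gateway via its /workers endpoint. Field names and
# values below are illustrative assumptions, not the real schema.
def worker_registration(url, model_id, policy, tier):
    return {
        "url": url,            # worker base URL the gateway routes to
        "model_id": model_id,  # model served by this worker
        # per-model routing metadata the gateway can match on
        "labels": {"policy": policy, "tier": tier},
    }

payloads = [
    worker_registration("http://10.0.0.1:30000", "llama-3.1-8b",
                        "cache_aware", "standard"),
    worker_registration("http://10.0.0.2:30000", "deepseek-v3",
                        "power_of_two", "premium"),
]
# Each payload would then be POSTed to the gateway, e.g.:
#   requests.post(f"{gateway_url}/workers", json=payloads[0])
```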
⚡ gRPC Mode: Rust-Powered, Built for Throughput
This is the heart of 0.2.0. The new gRPC data plane runs entirely in Rust — tokenizer, reasoning parser, and tool parser included — giving you native-speed performance and lower latency.
You can connect to gRPC-based SGLang workers, stream tokens in real time, and even handle OpenAI-compatible APIs like
🌐 OpenAI-Compatible Gateway
Seamlessly proxy requests to OpenAI, while keeping data control local.
Conversation history, responses, and background jobs all flow through the gateway — same API, enterprise privacy.
💾 Pluggable History Storage
Choose between `memory`, `none`, or `oracle` for conversation and /v1/responses data.
- `memory`: Fastest for ephemeral runs.
- `none`: Zero persistence, zero latency overhead.
- `oracle`: Full persistence via Oracle ATP with connection pooling and credentials support.
🧩 Pluggable MCP Integration
The gateway now natively speaks MCP across all transports (STDIO, HTTP, SSE, Streamable), so your tools can plug directly into reasoning and response loops — perfect for agentic workflows and cross-model orchestration.
🛡️ Reliability & Observability Upgrades
Built-in:
- Retries with exponential backoff + jitter
- Per-worker circuit breakers
- Token-bucket rate limiting & FIFO queuing
- Prometheus metrics for latency, load, queue depth, PD pipelines, tokenizer speed, and MCP activity
- Structured tracing & request-ID propagation
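The retry policy named above can be sketched as exponential backoff with full jitter: the delay ceiling grows as base × 2^attempt up to a cap, and each actual sleep is drawn uniformly from [0, ceiling] so synchronized clients do not retry in lockstep. This is an illustrative model of the technique, not the gateway's Rust implementation; the parameter values are assumptions.

```python
# Hedged sketch of retries with exponential backoff + full jitter.
# Ceiling doubles per attempt (capped); actual delay is uniform in
# [0, ceiling], which spreads out retry storms across clients.
import random

def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Return the jittered sleep duration for each retry attempt."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng() * ceiling)             # full jitter: U[0, ceiling]
    return delays

# With jitter disabled (rng always returns 1.0) the raw ceilings show:
print(backoff_delays(4, rng=lambda: 1.0))  # [0.1, 0.2, 0.4, 0.8]
```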
✨ SGLang Model Gateway v0.2.0 — built in Rust, designed for scale, ready for reasoning.
What's Changed in Gateway
Gateway Changes (238 commits)
- [router] upgrade to 0.2.0 (#11642) by @slin1237 in #11642
- [router] add worker self discovery for metadata (#11638) by @slin1237 in #11638
- [router][grpc] add warm up to grpc server (#11627) by @slin1237 in #11627
- [router] update router readme to latest features (#11619) by @slin1237 in #11619
- [router] add chang and keyang to sgl router author (#11620) by @slin1237 in #11620
- [router] cleanup app context and move to startup (#11617) by @slin1237 in #11617
- [router] add py binding and readme for openai router and history backend (#11453) by @key4ng in #11453
- [router] when given both local tokenizer and chat template, log all (#11601) by @slin1237 in #11601
- [router] allow router launch server to use grpc mode (#11600) by @slin1237 in #11600
- [router] delete useless table content comment in spec (#11597) by @slin1237 in #11597
- [router] change worker api to async instead of sync (#11566) by @slin1237 in #11566
- [router] update generate spec to align with sgl io struct (#11591) by @slin1237 in #11591
- [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588) by @CatherineSue in #11588
- [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564) by @CatherineSue in #11564
- [router][grpc] Add error handling to `generate_tool_constraints` (#11562) by @CatherineSue in #11562
- [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483) by @Jonahcb in #11483
- [router] allow user to specify chat template path (#11549) by @slin1237 in #11549
- [router][grpc] Further delegate non-stream processing to `processing.rs` (#11553) by @CatherineSue in #11553
- [router][Fix] Include grpc reflection runtime dependency (#11419) by @ai-jz in #11419
- [router] allow tokenizer path to be dir (#11530) by @slin1237 in #11530
- [router] openai router: support grok model (#11511) by @key4ng in #11511
- Fix the GPT function calling regex to allow dash in the name (#10577) by @antoine-roux in #10577
- [Router]: Small Typo in a comment within tree.rs (#11489) by @xuwenyihust in #11489
- Super tiny delete unused openai router in sgl-router (#11448) by @fzyzcjy in #11448
- [router][grpc] Consolidate parser checks for chat completions (#11439) by @CatherineSue in #11439
- [router] leverage RAII to actively cancel request during client disconnect (#11399) by @slin1237 in #11399
- [router] disable rate limiter by default (#11435) by @slin1237 in #11435
- [router] Fix ci nvcc not found error (#11411) by @key4ng in #11411
- move more files under srt/utils (#11285) by @merrymercy in #11285
- [router] conversation item API: create, retrieve and delete (#11369) by @key4ng in #11369
- [router] change grpc client from mutable to clone (#11394) by @slin1237 in #11394
- [router][grpc] Replace fake health check with correct ones (#11387) by @CatherineSue in #11387
- [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (#11373) by @CatherineSue in #11373
- [router][lint] Add unused_qualifications to cargo lint warnings (#11366) by @CatherineSue in #11366
- [router] Refactor OpenAI router: split monolithic file and move location (#11359) by @key4ng in #11359
- [router][grpc] disable health check generation and increase timeout (#11353) by @slin1237 in #11353
- [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (#11342) by @CatherineSue in #11342
- [router] Support history management using conversation (#11339) by @key4ng in #11339
- [router] Fix all unused_qualifications (#11341) by @CatherineSue in #11341
- [router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340) by @CatherineSue in #11340
- [router] improve reasoning parser lock and reduce req cloning (#11336) by @slin1237 in #11336
- [router] refactor generate to use new pipeline arch (#11323) by @slin1237 in #11323
- [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (#11314) by @CatherineSue in #11314
- [router] cleanup worker health check to return early (#11310) by @slin1237 in #11310
- [router] support Openai router conversation API CRUD (#11297) by @key4ng in #11297
- [router][grpc] Fix error message format in grpc chat handler (#11307) by @CatherineSue in #11307
- [router][grpc] Fix sampling_params.stop_strs is None (#11306) by @CatherineSue in #11306
- [router] fix grpc connection conversion and add optimization (#11305) by @slin1237 in #11305
- [router][grpc] Refactor chat template content format detection (#11288) by @CatherineSue in #11288
- [router] add get server info and get model info in grpc server (#11303) by @slin1237 in #11303
- [router] add reasoning and tool parser argument in router (#11290) by @slin1237 in #11290
- [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (#11283) by @CatherineSue in #11283
- [router][grpc] Refine streaming processes (#11277) by @CatherineSue in #11277
- [router][tool call] Clean up redundant
detect_formatandhas_tool_markers(#11270) by @CatherineSue in #11270 - [router] add ipv6 support across all components (#11219) by @slin1237 in #11219
- [router] add grpc router pd mode for chat and generate (#11140) by @slin1237 in #11140
- [router] fix get load response parsin...