
Releases: sgl-project/sglang

v0.5.8

23 Jan 22:09


Highlights

New Model Support

DeepSeek V3.2 Optimization

  • Context Parallelism Optimization with support for fused MoE, multi-batch, and FP8 KV cache: #13959

Flash Attention 4

  • Support for Flash Attention 4 decoding kernels: #16034

SGLang-Diffusion

  • Run sglang-diffusion with the diffusers backend
  • Features: Multi-LoRA inference, SLA attention backends, a warmup switch in the CLI, and a ComfyUI plugin
  • Performance improvements for all models

Dependencies

  • sgl-kernel updated to 0.3.21: #17075
  • Cutedsl updated to 4.3.4: #17075
  • Added dependencies for tvm-ffi and quack-kernels: #17075
  • Flashinfer updated to 0.6.1: #15551
  • Mooncake transfer engine updated to 0.3.8.post1: #16792

Security

  • Fixed urllib and gpgv vulnerabilities: #17439

What's Changed


Release Gateway-v0.3.1

09 Jan 06:18
7460240


🚀 SMG v0.3.1 Released!

We're excited to announce SGLang Model Gateway (SMG) v0.3.1 – a game-changing release with a 10-12x performance improvement and ~99% memory reduction in cache-aware routing, plus enterprise-grade security!

🌲 Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory ⚡

Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains:

Performance Improvements

  • Our cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900), with latency dropping from 52.9 microseconds to just 4.6 microseconds per operation.
  • For prefix matching across 10,000 tree entries, throughput jumped from 41,000 to 124,000 operations per second.
  • Under concurrent load with 64 threads, the system processes 474,000 operations per second – a 7.9x improvement over the previous 59,000 ops/sec.

Data Processing

  • INSERT operations now process 440 MB/s (up from 38 MB/s).
  • MATCH operations handle 253 MB/s (up from 83 MB/s).

Memory Improvements

  • ~99% memory reduction per tree node:
      • Before: ~180 KB per node (DashMap default configuration on 170-core machines)
      • After: ~1.4 KB per node

Result: Deploy 100x more cache entries in the same memory footprint! For a typical deployment with 10,000 cached prefixes, memory usage drops from ~1.8 GB to just ~14 MB – freeing up resources for actual inference workloads.

Impact: Cache-aware routing is now 10-12x faster and uses 99% less memory. This is critical for large-scale multi-tenant deployments.
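
For context, the structure being benchmarked is a prefix tree keyed on request token streams; the production implementation is a concurrent radix tree inside the Rust router, but a minimal Python sketch of the insert and prefix-match operations looks roughly like this (class and method names are ours, simplified for explanation only):

```python
# Minimal, illustrative sketch of cache-aware prefix matching.
# The real router implements this as a concurrent radix tree in Rust;
# names and structure here are simplified assumptions for explanation.

class Node:
    __slots__ = ("children", "worker")

    def __init__(self):
        self.children = {}   # next token -> Node
        self.worker = None   # worker believed to hold the KV cache for this prefix

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens, worker):
        """Record that `worker` has cached the KV state for every prefix of `tokens`."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.worker = worker  # last writer wins; the real tree is richer than this

    def match_longest_prefix(self, tokens):
        """Return (matched_length, worker) for the longest cached prefix of `tokens`."""
        node, best = self.root, (0, None)
        for i, t in enumerate(tokens):
            node = node.children.get(t)
            if node is None:
                break
            if node.worker is not None:
                best = (i + 1, node.worker)
        return best

tree = PrefixTree()
tree.insert([1, 2, 3, 4], worker="worker-a")
print(tree.match_longest_prefix([1, 2, 3, 9]))  # (3, 'worker-a')
```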

🔐 JWT/OIDC Authentication

Production-grade security for control plane APIs with native support for industry-standard OIDC providers: Google, Azure, Oracle, GitHub, and more. Protect tokenizer management, worker registration, and admin endpoints with enterprise authentication infrastructure you already use. Critical for enterprise deployments – seamlessly integrate SMG into your existing identity and access management systems.
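
As a rough illustration of what the new authentication looks like from a client's point of view, the sketch below sends a JWT issued by your OIDC provider as a bearer token to a control-plane endpoint. The endpoint path and gateway address are illustrative assumptions, not the documented API:

```python
# Hedged sketch: calling a protected control-plane endpoint with a JWT
# obtained from your OIDC provider. The endpoint path and address are
# illustrative assumptions, not the gateway's documented API.
import requests

GATEWAY = "http://localhost:30000"              # assumed gateway address
ACCESS_TOKEN = "<jwt-from-your-oidc-provider>"  # e.g. issued by Google or Azure

resp = requests.get(
    f"{GATEWAY}/workers",                       # hypothetical admin endpoint
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```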

📊 Classification API Support

Native support for classification workloads! Deploy and serve classification models alongside your existing inference fleet with dedicated pipeline stages and protocol types.
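
A hedged sketch of what a classification request might look like; the route, payload, and response shape below are assumptions for illustration, so consult the gateway documentation for the actual protocol:

```python
# Illustrative only: the endpoint path and request fields are assumptions.
import requests

resp = requests.post(
    "http://localhost:30000/v1/classify",       # hypothetical classification route
    json={
        "model": "my-classifier",                # hypothetical model name
        "input": "This product exceeded my expectations!",
    },
    timeout=30,
)
print(resp.json())                               # e.g. label/score pairs
```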

✨ Additional Features

  • PrefixHash Load Balancing: New KV cache-aware load balancing policy using prefix hashing for improved cache hit rates in multi-tenant environments (see the sketch after this list).
  • Nemotron Nano V3 Parser
  • In-Flight Request Age Metrics: Track the age of in-flight requests for better observability and SLA monitoring.
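
To make the prefix-hashing idea concrete, here is a minimal sketch of hashing the leading portion of a prompt to pick a worker deterministically, so requests that share a prefix land on the same worker and reuse its KV cache. The prefix length, hash function, and worker list are illustrative assumptions, not the policy's actual parameters:

```python
# Simplified illustration of prefix-hash load balancing: requests that share
# a prefix hash to the same worker, improving KV-cache reuse. The prefix
# length and hashing details are assumptions for illustration.
import hashlib

WORKERS = ["worker-0", "worker-1", "worker-2"]
PREFIX_CHARS = 256  # assumed: only the leading chunk determines placement

def pick_worker(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(WORKERS)
    return WORKERS[index]

shared_system_prompt = "You are a helpful assistant for tenant A. " * 10
print(pick_worker(shared_system_prompt + "Question 1"))
print(pick_worker(shared_system_prompt + "Question 2"))  # same worker as above
```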

🛠️ Enhancements

Developer Experience:

  • Organized CLI arguments into logical groups
  • Shortened logging targets (sgl_model_gateway → smg)
  • Comprehensive embedding correctness tests against HuggingFace
  • Auto-generate protobuf files during wheel build

Reliability:

  • Fixed IGW routing for external OpenAI workers
  • Worked around orphan process problems
  • Prevented potential hangs in subprocess handling
  • Now returns 504 Gateway Timeout for upstream timeouts (proper HTTP semantics)

🐛 Bug Fixes

  • Fixed embedding worker health check crash
  • Fixed tokenizer to match transformers special token handling
  • Fixed age bucket rendering issue
  • Fixed non-PD router HTTP header whitelist
  • Fixed duplicate classify prefix in response ID
  • Fixed WASM test errors on machines with many cores

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (120 commits)


v0.5.7

01 Jan 10:01
232982a


Highlights

What's Changed


Release Gateway-v0.3.0

24 Dec 22:00
5454d2a


🚀 SGLang Model Gateway v0.3.0 Released!

We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!

⚠️ Breaking Changes

📊 Metrics Architecture Redesigned

Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.

🔧 UUID-Based Worker Resource Management

Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.
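
As a hedged sketch of what the migration can look like for scripts that previously addressed workers by endpoint URL; the routes, payloads, and response fields below are assumptions for illustration, not the documented worker API:

```python
# Hypothetical sketch of UUID-based worker management. Routes and field
# names are illustrative assumptions, not the documented API.
import requests

GATEWAY = "http://localhost:30000"

# Register a worker; the gateway responds with a UUID identifying it.
created = requests.post(
    f"{GATEWAY}/workers",
    json={"url": "http://10.0.0.5:8000"},
    timeout=10,
).json()
worker_id = created["id"]                      # assumed response field

# Subsequent operations reference the UUID instead of the endpoint URL.
requests.delete(f"{GATEWAY}/workers/{worker_id}", timeout=10)
```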

✨ New Features

🌐 Unified Inference Gateway Mode (IGW)

Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:

  • gRPC router (PD and regular mode)
  • HTTP router (PD and regular mode)
  • OpenAI router

Auto-enabled with service discovery. Deploy once, route everything – handle all traffic patterns across your entire inference fleet from a single gateway instance.

🔤 Tokenize/Detokenize HTTP Endpoints

  • Direct HTTP endpoints for tokenization operations
  • Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
  • TokenizerRegistry for efficient dynamic loading
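
A hedged sketch of the round trip these endpoints enable; the exact paths and request fields may differ from the shipped API, so treat this as an illustration of the feature's shape rather than its specification:

```python
# Illustrative tokenize/detokenize round trip. Paths and payload fields
# are assumptions based on the feature description, not the exact API.
import requests

GATEWAY = "http://localhost:30000"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"      # assumed model name

tokens = requests.post(
    f"{GATEWAY}/tokenize",
    json={"model": MODEL, "prompt": "Hello, world!"},
    timeout=10,
).json()

text = requests.post(
    f"{GATEWAY}/detokenize",
    json={"model": MODEL, "tokens": tokens["tokens"]},
    timeout=10,
).json()
print(tokens, text)
```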

🧠 Parser Endpoints

  • /parse/reasoning - Parse reasoning outputs
  • /parse/function_call - Parse function call responses
  • GLM-4 function call parser - Contributed directly by the GLM team for latest GLM models
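
The parser endpoints take raw model output and return its structured pieces. A hedged sketch follows; only the routes come from the release notes, while the request and response fields are assumptions:

```python
# Hedged sketch of the /parse/* endpoints. Only the routes are taken from
# the release notes; payload and response fields are assumptions.
import requests

GATEWAY = "http://localhost:30000"

reasoning = requests.post(
    f"{GATEWAY}/parse/reasoning",
    json={"model": "glm-4", "text": "<think>step 1...</think>The answer is 42."},
    timeout=10,
).json()

tool_call = requests.post(
    f"{GATEWAY}/parse/function_call",
    json={"model": "glm-4",
          "text": '{"name": "get_weather", "arguments": {"city": "Paris"}}'},
    timeout=10,
).json()
print(reasoning, tool_call)
```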

📊 Embeddings Support

Native embeddings endpoint for gRPC router - expand beyond text generation to embedding workloads.
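
Since the gateway exposes an OpenAI-compatible surface, an embeddings request can be issued with the standard OpenAI client pointed at the gateway. A minimal sketch, where the base URL, API key handling, and model name are assumptions:

```python
# Minimal embeddings request through the gateway's OpenAI-compatible API.
# Base URL, api_key handling, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",          # assumed embedding model
    input=["SGLang Model Gateway", "gRPC embeddings support"],
)
print(len(resp.data), len(resp.data[0].embedding))
```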

🔐 Server-Side TLS Support

Secure your gateway deployments with native TLS support.

🌐 Go Implementation (contributed by the iFlytek MaaS team)

Complete Go SGLang Model Gateway with OpenAI-compatible API server - bringing SGLang to the Go ecosystem!

⚡ Major Enhancements

Control Plane - Workflow Engine

Intelligent lifecycle orchestration with:

  • DAG-based parallel execution with pre-computed dependency graphs
  • Concurrent event processing for maximum throughput
  • Modular add/remove/update workflows
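
To illustrate the scheduling model, here is a generic Python sketch of DAG-based parallel execution over a pre-computed dependency graph: each step runs as soon as all of its dependencies have finished. This is an illustration of the idea only, not the gateway's actual workflow engine:

```python
# Generic sketch of DAG-based parallel execution: steps run as soon as all
# of their dependencies have finished. Illustration only; not the gateway's
# actual Rust workflow engine.
import asyncio

# step -> set of steps it depends on (pre-computed dependency graph)
DEPS = {
    "validate": set(),
    "fetch_metadata": {"validate"},
    "health_check": {"validate"},
    "register": {"fetch_metadata", "health_check"},
}

async def run_step(name: str) -> None:
    await asyncio.sleep(0.1)      # stand-in for real work
    print("done:", name)

async def run_workflow() -> None:
    done: set = set()
    pending = dict(DEPS)
    while pending:
        # All steps whose dependencies are satisfied run concurrently.
        ready = [s for s, deps in pending.items() if deps <= done]
        await asyncio.gather(*(run_step(s) for s in ready))
        done.update(ready)
        for s in ready:
            del pending[s]

asyncio.run(run_workflow())
```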

Performance Optimization

  • Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
  • Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
  • Optimized router management: Improved selection algorithms and state management

Resilience & Reliability:

  • Retry and circuit breaker support for OpenAI and gRPC routers
  • Enhanced circuit breaker with better state management
  • Graceful shutdown for TLS and non-TLS servers
  • Unified error responses with error codes and X-SMG-Error-Code headers

Infrastructure:

  • Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
  • Custom Prometheus duration buckets
  • Improved logging across all modules

🐛 Bug Fixes & Stability

  • Fixed cache-aware routing in gRPC mode
  • Resolved load metric tracking and double-decrease issues for cache-aware load balancing
  • Improved backward compatibility for GET endpoints
  • Fixed gRPC scheduler launcher issues
  • Fixed token bucket negative duration panics
  • Resolved MCP server initialization issues

📚 Documentation

Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.

⚠️ Migration checklist:

  • Update Prometheus dashboards for new metrics
  • Update worker API integrations for UUID-based management
  • Review new error response format

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (108 commits)


Release Gateway-v0.2.4

10 Dec 01:09
390406c


🚀 SGLang Model Gateway v0.2.4 Released!

We're excited to announce SGLang Model Gateway v0.2.4 – a massive release focused on performance, security, and production-ready observability!

✨ Headline Features

⚡ Major Performance Optimizations

We've invested heavily in performance across the entire stack:

  • Optimized radix tree for cache-aware load balancing – Smarter routing decisions with lower overhead
  • Tokenizer optimization – Dramatically reduced CPU and memory footprint during tokenization
  • Core module optimization – HTTP and gRPC routers now run leaner and faster
  • Efficient OTEL implementation – Production-grade observability with minimal performance impact

🔌 Industry-First WASM Middleware Support

Programmable middleware using WebAssembly! Extend your gateway with safe, isolated plugins. Build custom routing logic, transform requests/responses, or integrate proprietary systems – all without touching core code. Your gateway, your rules.

📊 Production-Grade Observability

Full OpenTelemetry integration with distributed tracing for both HTTP and gRPC. Track requests across your entire inference stack with native trace context propagation. Finally, real visibility into your LLM infrastructure.
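
Because trace context propagation follows the W3C standard, a client can link its own spans to the gateway's traces simply by forwarding a traceparent header. A hedged sketch, where the gateway address and model name are assumptions and the traceparent value is hand-written for illustration (in practice it comes from your OpenTelemetry SDK):

```python
# Hedged sketch: propagating W3C trace context so the gateway's spans join
# your existing trace. The traceparent below is a hand-written example; in
# practice your OpenTelemetry SDK injects it automatically.
import requests

traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",   # assumed gateway address
    headers={"traceparent": traceparent},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=30,
)
print(resp.status_code)
```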

⚡ Built for speed. Hardened for security. Ready for production.

Gateway Changes (98 commits)


Release v0.5.6

03 Dec 05:11
7ae368e


Highlights

  • Support for DeepSeek V3.2/V3.2 Speciale #14249
  • Blockwise diffusion language model support #12588
  • Support for new diffusion models (Flux2 #14000, Z-image #14067)
  • Introduce JIT Kernels #13453
  • Upgrade to Torch 2.9 #12969
  • Kimi-K2-Thinking model enhancement #12882
  • Memory management/Overlap spec compatibility #12224 #12839
  • More performance optimization: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2...
  • CI/CD Enhancement

What's Changed


Release Gateway-v0.2.3

17 Nov 11:23
172c71a


🚀 SGLang Model Gateway - New Release!

We're excited to announce another powerful update to SGLang Model Gateway with performance improvements and expanded database support!

Headline Features

⚡ Bucket Mode Routing - 20-30% Performance Boost
Introducing our new bucket-based routing algorithm that dramatically improves performance in PD mode. See up to 20-30% improvements in TTFT (Time To First Token) and overall throughput.

💾 PostgreSQL Support for Chat History Management
Flexibility in data storage! We now support PostgreSQL alongside OracleDB and in-memory storage for chat history management.

🛠️ Enhanced Model Tool & Structured Output Support

  • MiniMax M2 model support!
  • Structured model output for OpenAI and gRPC router
  • Streaming parsing with tool choice in the Chat Completions API (see the sketch after this list)
  • tool_choice support for the Responses API
  • OutputItemDone events with output item array storage for better observability
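
A hedged sketch of exercising tool_choice through the gateway's OpenAI-compatible Chat Completions endpoint. The base URL and model name are assumptions; the request fields themselves are standard OpenAI parameters:

```python
# Hedged sketch: forcing a tool call via tool_choice on the gateway's
# OpenAI-compatible endpoint. Base URL and model name are assumptions;
# the request fields are standard OpenAI Chat Completions parameters.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="MiniMax-M2",                          # assumed model name
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(resp.choices[0].message.tool_calls)
```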

🐛 Stability & Quality Improvements

Multiple bug fixes covering model validation, streaming logic, and reasoning content indexing, plus CI stability enhancements.

🔧 Code Quality Enhancements

Refactored builders for chat and responses, restructured modules for better maintainability, and consolidated error handling.

Try the latest version: pip install sglang-router --upgrade

What's Changed in Gateway

Gateway Changes (45 commits)

New Contributors

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.2...gateway-v0.2.3

Release v0.5.5

06 Nov 17:54
0c006b8


Highlights

What's Changed


Release Gateway-v0.2.2

17 Nov 11:19
6237754


🚀 SGLang Model Gateway v0.2.2 Released!

Features

🎯 Industry-First Responses API for All Models
We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for Llama, DeepSeek, Qwen, and more – with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models.
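
A hedged sketch of a multi-turn exchange through the Responses API using the standard OpenAI Python client pointed at the gateway. The base URL and model name are assumptions; the calls themselves, including chaining turns via previous_response_id, follow the OpenAI SDK's Responses API:

```python
# Hedged sketch: multi-turn conversation via the Responses API, pointed at
# the gateway. Base URL and model name are assumptions; the client calls
# are the standard OpenAI Python SDK Responses API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

first = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",    # assumed model name
    input="Remember the number 17.",
)

second = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    input="What number did I ask you to remember?",
    previous_response_id=first.id,               # built-in chat history management
)
print(second.output_text)
```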

☸️ Production-Ready Kubernetes Operations
Taking large-scale deployments seriously! We now support native gRPC health check endpoints, making it effortless to deploy and operate SGLang at scale on Kubernetes with proper health monitoring and orchestration.

🔐 Your Network, Your Control

  • mTLS Support: Secure gateway-to-SGLang communication whether you're running on edge, remote cloud, multi-cloud, or hybrid environments – we've got you covered
  • MCP Proxy Enhancements: Configure proxies globally or per-individual MCP server – complete network control in your hands

🤖 Harmony Pipeline
Introducing our unified OpenAI-native architecture with GPT OSS model support for both Responses API and Chat Completion – fully integrated with MCP and intelligent storage management.

🌍 Universal Platform Support
A major leap in accessibility! SGLang Model Gateway now runs on nearly every operating system and architecture: Linux, Windows, Mac, x86, and ARM. Even better – we support all Python versions from 3.8 to 3.14 in a single wheel file, while reducing wheel size by more than 40%. Deploy anywhere, on any Python version, with unprecedented efficiency!

⚡ Additional Enhancements

  • Multi-worker URL support for better load distribution
  • Connection pooling and tool inventory for MCP
  • Native OpenAI web search tool support and function calling for OpenAI router

🐛 Stability Improvements

We've squashed numerous bugs in background task handling, tool call IDs, conversation management, and installation dependencies.

Try it now: pip install sglang-router==0.2.2


What's Changed in Gateway

Gateway Changes (48 commits)

New Contributors

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.1...gateway-v0.2.2

Release v0.5.4

26 Oct 02:37
1053e1b


Highlights

What's Changed
