
Releases: sgl-project/sglang

v0.5.7

01 Jan 10:01
232982a


Highlights

What's Changed


Release Gateway-v0.3.0

24 Dec 22:00
5454d2a


🚀 SGLang Model Gateway v0.3.0 Released!

We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!

⚠️ Breaking Changes

📊 Metrics Architecture Redesigned

A complete overhaul introduces a new six-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics, with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.

🔧 UUID-Based Worker Resource Management

Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.
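
For a rough idea of what UUID-based management could look like from a client, assuming the /workers endpoint described in the v0.2.0 notes returns each worker's UUID; the routes and JSON fields below are illustrative, not the documented schema.

    # Illustrative only: the /workers routes and JSON fields are assumptions,
    # not the documented gateway API. Check the gateway docs for the real schema.
    import requests

    GATEWAY = "http://localhost:30000"  # assumed gateway address

    # List registered workers and pick one by its UUID instead of its endpoint URL.
    workers = requests.get(f"{GATEWAY}/workers").json()
    worker_id = workers[0]["id"]  # hypothetical field holding the worker UUID

    # Remove the worker by UUID (hypothetical route shape).
    requests.delete(f"{GATEWAY}/workers/{worker_id}").raise_for_status()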

✨ New Features

🌐 Unified Inference Gateway Mode (IGW)

Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:

  • gRPC router (PD and regular mode)
  • HTTP router (PD and regular mode)
  • OpenAI router

IGW is auto-enabled with service discovery. Deploy once, route everything, and handle all traffic patterns across your entire inference fleet from a single gateway instance.

🔤 Tokenize/Detokenize HTTP Endpoints

  • Direct HTTP endpoints for tokenization operations
  • Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
  • TokenizerRegistry for efficient dynamic loading
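
As a rough sketch of a tokenize round trip over HTTP; the /tokenize and /detokenize paths and payload fields here are assumptions made for illustration, not the documented schema.

    # Hypothetical tokenize/detokenize calls; endpoint paths and field names are
    # assumptions for illustration, not the gateway's documented schema.
    import requests

    GATEWAY = "http://localhost:30000"  # assumed gateway address

    resp = requests.post(f"{GATEWAY}/tokenize",
                         json={"model": "my-model", "text": "Hello, world!"})
    token_ids = resp.json()["tokens"]  # hypothetical response field

    resp = requests.post(f"{GATEWAY}/detokenize",
                         json={"model": "my-model", "tokens": token_ids})
    print(resp.json()["text"])  # should round-trip back to the original string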

🧠 Parser Endpoints

  • /parse/reasoning - Parse reasoning outputs
  • /parse/function_call - Parse function call responses
  • GLM-4 function call parser - contributed directly by the GLM team for the latest GLM models
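
A sketch of calling the function-call parser: the /parse/function_call route comes from these notes, but the request payload and response shape are assumptions for illustration.

    # /parse/function_call is named in the release notes; the request and
    # response fields below are assumptions for illustration only.
    import requests

    raw_output = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
    resp = requests.post(
        "http://localhost:30000/parse/function_call",  # assumed gateway address
        json={"model": "glm-4", "text": raw_output},    # hypothetical payload shape
    )
    print(resp.json())  # expected: structured tool-call objects extracted from the text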

📊 Embeddings Support

Native embeddings endpoint for the gRPC router, expanding beyond text generation to embedding workloads.
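
Assuming this is exposed through the familiar OpenAI-compatible /v1/embeddings route, a request could look like the sketch below; treat the path and payload as illustrative rather than confirmed.

    # Assumes an OpenAI-compatible /v1/embeddings route on the gateway; this is
    # an illustrative sketch, not a confirmed endpoint shape.
    import requests

    resp = requests.post(
        "http://localhost:30000/v1/embeddings",
        json={"model": "my-embedding-model", "input": ["SGLang Model Gateway"]},
    )
    vector = resp.json()["data"][0]["embedding"]
    print(len(vector))  # embedding dimensionality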

🔐 Server-Side TLS Support

Secure your gateway deployments with native TLS support.

🌐 Go Implementation (contributed by the iFlytek MaaS team)

Complete Go SGLang Model Gateway with OpenAI-compatible API server - bringing SGLang to the Go ecosystem!

⚡ Major Enhancements

Control Plane - Workflow Engine

Intelligent lifecycle orchestration with:

  • DAG-based parallel execution with pre-computed dependency graphs
  • Concurrent event processing for maximum throughput
  • Modular add/remove/update workflows

Performance Optimization

  • Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
  • Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
  • Optimized router management: Improved selection algorithms and state management

Resilience & Reliability:

  • Retry and circuit breaker support for OpenAI and gRPC routers
  • Enhanced circuit breaker with better state management
  • Graceful shutdown for TLS and non-TLS servers
  • Unified error responses with error codes and X-SMG-Error-Code headers
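
For example, a client can surface the unified error code from the X-SMG-Error-Code response header named above; the request path and payload in this sketch are placeholders.

    # Reads the X-SMG-Error-Code header named in these notes; the request path
    # and payload are placeholders for illustration.
    import requests

    resp = requests.post("http://localhost:30000/v1/chat/completions",
                         json={"model": "missing-model", "messages": []})
    if not resp.ok:
        print("gateway error code:", resp.headers.get("X-SMG-Error-Code"))
        print("body:", resp.text)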

Infrastructure:

  • Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
  • Custom Prometheus duration buckets
  • Improved logging across all modules

🐛 Bug Fixes & Stability

  • Fixed cache-aware routing in gRPC mode
  • Resolved load-metric tracking and double-decrease issues in cache-aware load balancing
  • Improved backward compatibility for GET endpoints
  • Fixed gRPC scheduler launcher issues
  • Fixed token bucket negative duration panics
  • Resolved MCP server initialization issues

📚 Documentation

Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.

⚠️ Migration checklist:

  • Update Prometheus dashboards for new metrics
  • Update worker API integrations for UUID-based management
  • Review new error response format

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (108 commits)


Release Gateway-v0.2.4

10 Dec 01:09
390406c


🚀 SGLang Model Gateway v0.2.4 Released!

We're excited to announce SGLang Model Gateway v0.2.4 – a massive release focused on performance, security, and production-ready observability!

✨ Headline Features

⚡ Major Performance Optimizations

We've invested heavily in performance across the entire stack:

  • Optimized radix tree for cache-aware load balancing – Smarter routing decisions with lower overhead
  • Tokenizer optimization – Dramatically reduced CPU and memory footprint during tokenization
  • Core module optimization – HTTP and gRPC routers now run leaner and faster
  • Efficient OTEL implementation – Production-grade observability with minimal performance impact

🔌 Industry-First WASM Middleware Support

Programmable middleware using WebAssembly! Extend your gateway with safe, isolated plugins. Build custom routing logic, transform requests/responses, or integrate proprietary systems – all without touching core code. Your gateway, your rules.

📊 Production-Grade Observability

Full OpenTelemetry integration with distributed tracing for both HTTP and gRPC. Track requests across your entire inference stack with native trace context propagation. Finally, real visibility into your LLM infrastructure.
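
As an illustration of trace-context propagation from the client side, a request can carry a standard W3C traceparent header for the gateway to join the trace; the endpoint and header value here are placeholders.

    # Sends a W3C trace-context header so the gateway can join the trace; the
    # traceparent value and endpoint are placeholders for illustration.
    import requests

    headers = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    resp = requests.post(
        "http://localhost:30000/v1/chat/completions",
        headers=headers,
        json={"model": "my-model", "messages": [{"role": "user", "content": "hi"}]},
    )
    print(resp.status_code)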

⚡ Built for speed. Hardened for security. Ready for production.

Gateway Changes (98 commits)


Release v0.5.6

03 Dec 05:11
7ae368e


Highlights

  • Support for DeepSeek V3.2/V3.2 Speciale #14249
  • Blockwise diffusion language model support #12588
  • Support for new diffusion models (Flux2 #14000, Z-image #14067)
  • Introduce JIT Kernels #13453
  • Upgrade to Torch 2.9 #12969
  • Kimi-K2-Thinking model enhancement #12882
  • Memory management/Overlap spec compatibility #12224 #12839
  • More performance optimizations: DeepSeek-v3-fp4/GLM-4.6/Kimi-K2/DeepSeek-V3.2...
  • CI/CD enhancements

What's Changed


Release Gateway-v0.2.3

17 Nov 11:23
172c71a


🚀 SGLang Model Gateway - New Release!

We're excited to announce another powerful update to SGLang Model Gateway with performance improvements and expanded database support!

Headline Features

⚡ Bucket Mode Routing - 20-30% Performance Boost
Introducing our new bucket-based routing algorithm, which dramatically improves performance in PD mode. Expect up to 20-30% improvements in TTFT (Time To First Token) and overall throughput.

💾 PostgreSQL Support for Chat History Management
Flexibility in data storage! We now support PostgreSQL alongside OracleDB and in-memory storage for chat history management.

🛠️ Enhanced Model Tool & Structured Output Support

  • MiniMax M2 model support!
  • Structured model output for the OpenAI and gRPC routers
  • Streaming parsing with tool choice in the chat completions API (see the example after this list)
  • tool_choice support for the Responses API
  • OutputItemDone events with output item array storage for better observability
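
For instance, forcing a specific tool via tool_choice in an OpenAI-compatible chat completions request might look like this; the gateway address, model name, and tool definition are placeholders.

    # OpenAI-style tool_choice request against the gateway's chat completions
    # API; the address, model name, and tool definition are placeholders.
    import requests

    payload = {
        "model": "my-model",
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
    }
    resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
    print(resp.json()["choices"][0]["message"].get("tool_calls"))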

🐛 Stability & Quality Improvements

Multiple bug fixes covering model validation, streaming logic, and reasoning-content indexing, plus CI stability enhancements.

🔧 Code Quality Enhancements

Refactored builders for chat and responses, restructured modules for better maintainability, and consolidated error handling.

Try the latest version: pip install sglang-router --upgrade

What's Changed in Gateway

Gateway Changes (45 commits)

New Contributors

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.2...gateway-v0.2.3

Release v0.5.5

06 Nov 17:54
0c006b8


Highlights

What's Changed


Release Gateway-v0.2.2

17 Nov 11:19
6237754


🚀 SGLang Model Gateway v0.2.2 Released!

Features

🎯 Industry-First Responses API for All Models
We're bringing OpenAI's Responses API to the entire open-source ecosystem! Now enjoy native support for Llama, DeepSeek, Qwen, and more – with built-in chat history management, multi-turn conversations, and seamless MCP integration. This is the first solution to democratize advanced conversation management across all OSS models.
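
A minimal Responses API call through the gateway, following the OpenAI request shape; the address and model name are placeholders, and continuing a conversation via previous_response_id is shown as the standard pattern.

    # OpenAI-style /v1/responses requests against the gateway; the address and
    # model name are placeholders for illustration.
    import requests

    GATEWAY = "http://localhost:30000"
    first = requests.post(f"{GATEWAY}/v1/responses",
                          json={"model": "my-model",
                                "input": "Name three LLM serving engines."}).json()

    # Continue the conversation using the stored history from the first response.
    follow_up = requests.post(f"{GATEWAY}/v1/responses",
                              json={"model": "my-model",
                                    "input": "Which one is fastest?",
                                    "previous_response_id": first["id"]}).json()
    print(follow_up["output"])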

☸️ Production-Ready Kubernetes Operations
Taking large-scale deployments seriously! We now support native gRPC health check endpoints, making it effortless to deploy and operate SGLang at scale on Kubernetes with proper health monitoring and orchestration.
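
A Kubernetes-style probe can be reproduced from Python with the standard gRPC health-checking protocol; the address, and whether the gateway registers the default empty service name, are assumptions in this sketch.

    # Probes a gRPC health endpoint using the standard grpc.health.v1 protocol
    # (pip install grpcio grpcio-health-checking); the address and service name
    # are assumptions for illustration.
    import grpc
    from grpc_health.v1 import health_pb2, health_pb2_grpc

    channel = grpc.insecure_channel("localhost:30000")
    stub = health_pb2_grpc.HealthStub(channel)
    status = stub.Check(health_pb2.HealthCheckRequest(service="")).status
    print(health_pb2.HealthCheckResponse.ServingStatus.Name(status))  # e.g. SERVING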

🔐 Your Network, Your Control

  • mTLS Support: Secure gateway-to-SGLang communication whether you're running on edge, remote cloud, multi-cloud, or hybrid environments – we've got you covered
  • MCP Proxy Enhancements: Configure proxies globally or per-individual MCP server – complete network control in your hands

🤖 Harmony Pipeline
Introducing our unified OpenAI-native architecture with GPT OSS model support for both Responses API and Chat Completion – fully integrated with MCP and intelligent storage management.

🌍 Universal Platform Support
A major leap in accessibility! SGLang Model Gateway now runs on nearly every operating system and architecture: Linux, Windows, Mac, x86, and ARM. Even better – we support all Python versions from 3.8 to 3.14 in a single wheel file, while reducing wheel size by more than 40%. Deploy anywhere, on any Python version, with unprecedented efficiency!

⚡ Additional Enhancements

  • Multi-worker URL support for better load distribution
  • Connection pooling and tool inventory for MCP
  • Native OpenAI web search tool support and function calling for OpenAI router

🐛 Stability Improvements

We've squashed numerous bugs, including issues with background task handling, tool call IDs, conversation management, and installation dependencies.

Try it now: pip install sglang-router==0.2.2


What's Changed in Gateway

Gateway Changes (48 commits)

New Contributors

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.1...gateway-v0.2.2

Release v0.5.4

26 Oct 02:37
1053e1b


Highlights

What's Changed


Release Gateway-v0.2.1

17 Nov 11:13
8a801ee


🚀 SGLang Model Gateway v0.2.1 Released!

This release focuses on stability, code cleanup, and a few significant new features.

🧾 Docs & CI

  • Updated router documentation to reflect recent feature additions

🧹 Code Cleanup

  • Refactored StopSequenceDecoder for cleaner incremental decoding
  • Added spec.rs test harness under spec/ for structured unit tests

🐞 Bug Fixes

  • Fixed UTF-8 boundary handling in stop-sequence decoding
  • Fixed gRPC timeout configuration
  • Fixed worker filtering, tool-choice normalization, and bootstrap-port handling
  • Additional gRPC server warm-up and concurrency fixes

🌟 New Features

  • Two-Level Tokenizer Caching (L0 + L1)
      ◦ L0: exact-match cache for repeated prompts
      ◦ L1: prefix-aware cache at special-token boundaries
  • OpenAI-Style Classification API → new /v1/classifications endpoint (sketched below); shout-out to yanbo for the contribution
  • Worker Management Workflow Engine → improved async registration, worker self-discovery, and health orchestration
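
A sketch of calling the new /v1/classifications endpoint; the route comes from these notes, while the payload fields and response shape are assumptions for illustration.

    # /v1/classifications is named in the release notes; the payload and
    # response fields below are assumptions for illustration.
    import requests

    resp = requests.post(
        "http://localhost:30000/v1/classifications",
        json={"model": "my-classifier", "input": "This release is fantastic!"},
    )
    print(resp.json())  # expected: predicted label(s) and scores for the input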

What's Changed in Gateway

Gateway Changes (26 commits)

Paths Included

  • sgl-router
  • python/sglang/srt/grpc
  • python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.2.0...gateway-v0.2.1

Release Gateway-v0.2.0

17 Nov 11:03
74737b2


🚀 Release: SGLang Model Gateway v0.2.0 (formerly “SGLang Router”)

🔥 What’s new

🧠 Multi-Model Inference Gateway (IGW) Mode

IGW turns one router into many — letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface.
You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway handle routing, health checks, and load balancing.
Whether you’re mixing Llama, Mistral, and DeepSeek, or orchestrating per-tenant routing in enterprise setups, IGW gives you total control.
Your fleet, your rules. ⚡
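
For example, registering a worker and attaching labels through /workers might look like the following; the endpoint is named above, but the JSON fields and label names are illustrative, not the documented schema.

    # Registers a worker via the gateway's /workers endpoint (named in these
    # notes); the JSON fields and label names are assumptions for illustration.
    import requests

    requests.post(
        "http://localhost:30000/workers",
        json={
            "url": "http://10.0.0.12:30001",      # worker endpoint
            "model": "llama-3.1-8b-instruct",     # hypothetical model id
            "labels": {"tier": "premium", "policy": "cache_aware"},
        },
    ).raise_for_status()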

⚡ gRPC Mode: Rust-Powered, Built for Throughput

This is the heart of 0.2.0. The new gRPC data plane runs entirely in Rust (tokenizer, reasoning parser, and tool parser included), giving you native-speed performance and lower latency.
You can connect to gRPC-based SGLang workers, stream tokens in real time, and serve OpenAI-compatible APIs.

🌐 OpenAI-Compatible Gateway

Seamlessly proxy requests to OpenAI, while keeping data control local.
Conversation history, responses, and background jobs all flow through the gateway — same API, enterprise privacy.
💾 Pluggable History Storage

Choose between memory, none, or oracle for conversation and /v1/responses data:

  • memory: Fastest for ephemeral runs.
  • none: Zero persistence, zero latency overhead.
  • oracle: Full persistence via Oracle ATP with connection pooling and credentials support.

🧩 Pluggable MCP Integration

The gateway now natively speaks MCP across all transports (STDIO, HTTP, SSE, Streamable), so your tools can plug directly into reasoning and response loops, perfect for agentic workflows and cross-model orchestration.

🛡️ Reliability & Observability Upgrades

Built-in:

  • Retries with exponential backoff + jitter
  • Per-worker circuit breakers
  • Token-bucket rate limiting & FIFO queuing
  • Prometheus metrics for latency, load, queue depth, PD pipelines, tokenizer speed, and MCP activity
  • Structured tracing & request-ID propagation
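
For readers unfamiliar with the retry strategy, here is a generic illustration of exponential backoff with jitter; it mirrors the idea named above, not the gateway's actual Rust implementation.

    # Generic illustration of retries with exponential backoff plus jitter;
    # conceptual only, not the gateway's code.
    import random
    import time

    def retry_with_backoff(call, max_attempts=5, base=0.1, cap=5.0):
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the capped exponential delay.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))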

✨ SGLang Model Gateway v0.2.0 — built in Rust, designed for scale, ready for reasoning.

What's Changed in Gateway

Gateway Changes (238 commits)
