Highlights
 - Talks: SGLang at AMD AI Dev Day 2025 (slides), SGLang at PyTorch Conference 2025 (slides)
 - Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
 - [beta] Overlap scheduler for speculative decoding: #11762
 - [beta] Piecewise CUDA graph for prefill: #11490
 - Prefix cache for Qwen3-Next and GDN/Mamba models: #11214
 - Full set of optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
 - Various Blackwell kernel optimizations
 - DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
 - KTransformer integration: https://lmsys.org/blog/2025-10-22-KTransformers/
 - New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
 - Native ModelOpt quantization support
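Several highlights above (the model gateway, speculative decoding, DeepSeek-V3.2 support) are exercised through the server's OpenAI-compatible API. A minimal client sketch, assuming a server was launched separately (e.g. `python -m sglang.launch_server --model-path <model> --port 30000`); the model name and port here are placeholders, and the payload follows the standard OpenAI Chat Completions schema:

```python
# Minimal client sketch for an SGLang server's OpenAI-compatible chat endpoint.
# Assumes a server is already running locally on port 30000 (placeholder).
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:30000") -> dict:
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request("deepseek-ai/DeepSeek-V3.2-Exp", "Hello!")
# send_chat_request(payload)  # requires a running server
```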
 
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
 - Remove env var warnings for release by @merrymercy in #11262
 - Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
 - [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
 - disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
 - docker: add manifest to versioned docker releases by @ishandhanani in #11268
 - [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
 - [router][grpc] Refine streaming processes by @CatherineSue in #11277
 - Fix code sync scripts by @merrymercy in #11276
 - [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
 - Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
 - Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
 - Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
 - fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
 - docs: update sgl-kernel README by @zhyncs in #11286
 - chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
 - [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
 - convert test_deterministic into unit tests by @skyzh in #11095
 - Feature/longbench v2 evaluation utils by @alhridoy in #10949
 - [ci] fix pp test by @hnyls2002 in #11294
 - EAGLE cache fix for SWARadixCache by @ispobock in #11231
 - Remove overlap thread by @hnyls2002 in #11210
 - [router] add reasoning and tool parser argument in router by @slin1237 in #11290
 - Remove sampling info events and overlap thread file by @hnyls2002 in #11300
 - Introduce future indices by @hnyls2002 in #11301
 - [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
 - [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
 - [router] add get server info and get model info in grpc server by @slin1237 in #11303
 - [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
 - [Doc] HiCache Design Documents by @ykwd in #11027
 - [Doc]: Best Practice for HICache by @hzh0425 in #11001
 - [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
 - [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
 - Update tool parser and related documentation by @JustinTong0323 in #11223
 - [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
 - [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
 - [router] support Openai router conversation API CRUD by @key4ng in #11297
 - [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
 - [router] cleanup worker health check to return early by @slin1237 in #11310
 - [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
 - Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
 - ci: unify the model launch method of nightly ci by @mickqian in #11230
 - [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
 - update sampling_params documentation with defaults by @JustinTong0323 in #11315
 - Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
 - Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
 - [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
 - [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
 - [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
 - Skip weight loading in deepgemm compilation by @ch-wan in #11312
 - [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
 - [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
 - fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
 - Support LoRA in bench_serving oai interface by @lifuhuang in #11318
 - benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
 - [CI] improve disaggregation CI. by @hnyls2002 in #11264
 - model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
 - [router] refactor generate to use new pipeline arch by @slin1237 in #11323
 - [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
 - [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
 - [router] Fix all unused_qualifications by @CatherineSue in #11341
 - [router] Support history management using conversation by @key4ng in #11339
 - [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
 - fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
 - [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
 - [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
 - [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
 - [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
 - [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
 - [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
 - [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in #11373
 - add code pp support for nixl by @shaharmor98 in #11375
 - fix bench_serving mishandling of internal states by @shaharmor98 in #11376
 - [router][grpc] Replace fake health check with correct ones by @CatherineSue in #11387
 - [router] change grpc client from mutable to clone by @slin1237 in #11394
 - chore: upgrade flashinfer 0.4.0 by @zhyncs in #11364
 - [router] conversation item API: create, retrieve and delete by @key4ng in #11369
 - chore: bump SGLang version to 0.5.3.post1 by @sglang-bot in #11324
 - move more files under srt/utils by @merrymercy in #11285
 - [grammar] Avoid server crash when grammar backend is None by @JustinTong0323 in #11401
 - fix: fix gpu-proc affinity set incorrectly when pp_size > 1 by @acelyc111 in #11389
 - [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded by @glenliu21 in #11365
 - [CI] Refactor PD disaggregation test suite by @ShangmingCai in #11363
 - Replace pad with cat for better performance by @yuan-luo in #11388
 - fix: reinstall torch in deps install by @zhyncs in #11414
 - feat(hicache): Support passing prefix keys for l3 store. by @hzh0425 in #9045
 - fix file and object naming scheme in HiCacheNixl to avoid data corruption by @ziruiliu in #10969
 - Dedicated toml files for CPU/XPU by @ZailiWang in #10734
 - Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11144
 - chore: update pyproject by @zhyncs in #11420
 - fix: fix video input for qwen3-vl by @mickqian in #11361
 - perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11381
 - [HiCache] feat: add multi tenant with prefix tag by @stmatengss in #9256
 - [CI] Merge build-dev into workflow matrix by @csahithi in #11345
 - Revert "perf: optimize qwen-vl with symm mem allreduce" by @ch-wan in #11436
 - Revert "fix: fix video input for qwen3-vl" by @merrymercy in #11437
 - Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" by @scottjlee in #11433
 - [router] Fix ci nvcc not found error by @key4ng in #11411
 - feat(mooncake): support GB suffix for global_segment_size by @xiaguan in #10745
 - Separate allocation logic from scheduler by @cctry in #11313
 - [router] disable rate limiter by default by @slin1237 in #11435
 - [router] leverage RAII to actively cancel request during client disconnect by @slin1237 in #11399
 - [router][grpc] Consolidate parser checks for chat completions by @CatherineSue in #11439
 - Reorder PD disagg CI tests by @merrymercy in #11438
 - fix: Change dsv32 hack temporary path to use system temp directory by @wxsms in #11445
 - Fix batch invariant ops by @hebiao064 in #11368
 - [BugFix] test_mla_fp8.py fails on Cublas 12.9 by @Liu-congo in #11360
 - [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton by @byjiang1996 in #11450
 - Remove tilelang dependency in Dockerfile by @Fridge003 in #11455
 - Enable native ModelOpt quantization support (2/3) by @Edwardf0t1 in #9991
 - Reland [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #10312
 - Super tiny delete unused openai router in sgl-router by @fzyzcjy in #11448
 - Adjust logits metadata init for target verify by @hnyls2002 in #11467
 - [Documentation][Configuration] Server args and documentation of PD-Multiplexing. by @ykcombat in #11427
 - Fix enable_v2 in int8 quant by @fzyzcjy in #11470
 - [Fix] Fix split prefill with fa3. by @ykcombat in #11428
 - fix stop when stream by @whybeyoung in #11462
 - Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. by @lulor in #8919
 - [7/n] decouple quantization impl from vllm dependency - gguf kernel by @FlamingoPg in #11019
 - fix Xeon CI by @ZailiWang in #11454
 - [CI] Add nightly builds to dockerhub by @csahithi in #9804
 - [Feature] support regex strings as a stopping condition by @glenliu21 in #10635
 - Beta spec-overlap for EAGLE by @hnyls2002 in #11398
 - Piecewise CUDA Graph Support & Torch Compile Backend by @Oasis-Git in #10062
 - [Router]: Small Typo in a comment within tree.rs by @xuwenyihust in #11489
 - chore: bump sgl-kernel version to 0.3.16 by @sglang-bot in #11476
 - [smol] [perf] Qwen3-VL in place op. by @vincentzed in #11481
 - [chore][1/N] Avoid using default mutable parameters by @kevin85421 in #11478
 - [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends by @MahmoudAshraf97 in #10172
 - [ perf ] Replace json-> orjson in hot path by @vincentzed in #11221
 - [chore][2/N] Avoid using default mutable parameters by @kevin85421 in #11479
 - Fix the GPT function calling regex to allow dash in the name by @antoine-roux in #10577
 - bailingMoE: Fix Key error of deepep_mode by @QiuMike in #11465
 - Fix CI break by express-laned PRs. by @hnyls2002 in #11499
 - Move args from `global_config` to `environ` by @hnyls2002 in #11332
 - move fla env check position by @yizhang2077 in #11500
 - Temporarily remove b200 tests by @merrymercy in #11501
 - Fix port conflicts in CI by @merrymercy in #11497
 - temporarily remove b200 tests by @merrymercy in #11502
 - Fix unit tests by @merrymercy in #11503
 - Bugfix: Fix Type consistency for KV indices in SWARadixCache by @hzh0425 in #11452
 - doc: add doc for adding new models into nightly-ci by @mickqian in #11443
 - [CI] fix lint by @hnyls2002 in #11509
 - Deprecate `global_server_args_dict` by @hnyls2002 in #11331
 - chore: remove flashinfer cleanup cache by @zhyncs in #11514
 - fix: revert temporarily remove b200 tests by @zhyncs in #11515
 - [Fix] Improve longbench prompt and other logics by @byjiang1996 in #11474
 - Sync changes on io_struct.py and deterministic ops by @merrymercy in #11498
 - [lint] Fix the lint issue by @ch-wan in #11516
 - Revert "Deprecate `global_server_args_dict`" by @ch-wan in #11520
 - Improve dp attention port assignment scheme by @jokerwyt in #5889
 - [router] openai router: support grok model by @key4ng in #11511
 - docs(router): add token-bucket rate limiting to the docs by @Jonahcb in #11485
 - [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11432
 - Update DeepSeek-R1-FP4 default config on blackwell by @Qiaolin-Yu in #11512
 - [Fix]: add missing device attribute to ChunkCache by @leavelet in #11493
 - [Feature] Support mamba radix cache v0 by @yizhang2077 in #11214
 - ci: improve nightly-ci by @mickqian in #11385
 - [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering by @BBuf in #11505
 - [HICache]: Support 3FS-Store with page_first_direct layout by @hzh0425 in #11460
 - Tiny fix test run estimated time by @ShangmingCai in #11544
 - [Reland] perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11457
 - Deprecate `global_server_args_dict` by @hnyls2002 in #11528
 - [Fix] Add per_channel_quant parameter to MoE config functions by @mmangkad in #11201
 - [router][ci] Add Nightly Release Workflow for SGLang Router by @slin1237 in #11527
 - [router] add tokenizer path to be dir by @slin1237 in #11530
 - Remove `tp_worker.worker` by @hnyls2002 in #11548
 - fix: fix video input for qwen3-vl by @mickqian in #11442
 - [NVIDIA] BUMP FA3 by @johnnynunez in #11444
 - [Fix] Include grpc reflection runtime dependency by @ai-jz in #11419
 - Adjust overlap event loop by @hnyls2002 in #11507
 - Move deep gemm related arguments to `sglang.srt.environ` by @hnyls2002 in #11547
 - [router][grpc] Further delegate non-stream processing to `processing.rs` by @CatherineSue in #11553
 - [router] allow user to specify chat template path by @slin1237 in #11549
 - Minor: improve sampler & remove unused fields from model_config.py by @merrymercy in #11531
 - [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter by @Jonahcb in #11483
 - Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11441
 - Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) by @trevor-m in #11557
 - [CI] Add Basic Test for DeepSeek V3.2 by @Fridge003 in #11308
 - [router][grpc] Add error handling to `generate_tool_constraints` by @CatherineSue in #11562
 - [NVIDIA] update pyproject.toml to support cu130 option by @johnnynunez in #11521
 - [CI Monitor] Ci monitor only deal with main branch in default by @BBuf in #11538
 - Tiny cleanup fp4 gemm calls by @fzyzcjy in #11537
 - [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck by @CatherineSue in #11564
 - [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds by @YouNeedCryDear in #11571
 - [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11534
 - chore: bump sgl-kernel version to 0.3.16.post1 by @sglang-bot in #11573
 - Fix accept rate in speculative decoding metrics by @Qiaolin-Yu in #11572
 - Compilation Folder Reset by @Oasis-Git in #11539
 - [FEATURE] Add Profile Trace Merger for Distributed Traces by @neelabhsinha in #11413
 - [DSv32] Use torch.compile for _get_logits_head_gate by @trevor-m in #11565
 - Make DeepEP combine recv do not overlap by @fzyzcjy in #11535
 - bench_serving support PD Disaggregation by @BBuf in #11542
 - Implement LRU eviction policy for LoRA adapters by @ConnorLi96 in #11041
 - Revert "[NVIDIA] BUMP FA3 (#11444)" by @zhyncs in #11582
 - chore: bump sgl-kernel version to 0.3.16.post2 by @sglang-bot in #11583
 - [Auto Sync] Update model_config.py (20251014) by @merrymercy in #11580
 - Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json by @Qiaolin-Yu in #11587
 - [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint by @CatherineSue in #11588
 - [router] update generate spec to align with sgl io struct by @slin1237 in #11591
 - [router] change worker api to async instead of sync by @slin1237 in #11566
 - Update news section in README.md by @merrymercy in #11598
 - [router] delete useless table content comment in spec by @slin1237 in #11597
 - [router] allow router launch server to use grpc mode by @slin1237 in #11600
 - [Docs] [Router]: Update sg-router doc on circuit breaker by @xuwenyihust in #11449
 - [router] when given both local tokenizer and chat template, log all by @slin1237 in #11601
 - [AMD CI] Add image and weights caching. by @saienduri in #11593
 - Update release-docker-dev.yml by @sglang-bot in #11603
 - Optimize Triton Draft Backend by @hnyls2002 in #11556
 - Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function by @scottjlee in #11586
 - make radix cache deterministic by @skyzh in #10721
 - move eagle draft post process to cuda graph by @cicirori in #11434
 - Reduce one step decode for draft model. by @hnyls2002 in #11561
 - [router] add py binding and readme for openai router and history backend by @key4ng in #11453
 - [router] cleanup app context and move to startup by @slin1237 in #11617
 - [router] add chang and keyang to sgl router author by @slin1237 in #11620
 - use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. by @strgrb in #11605
 - [router] update router readme to latest features by @slin1237 in #11619
 - Fix log for chunked prefix cache by @Fridge003 in #11624
 - [Auto Sync] Update scheduler.py, server_args.py (20251014) by @merrymercy in #11623
 - [Auto Sync] Update collector.py (20251014) by @merrymercy in #11625
 - [Minor] Update xgrammar dependency by @DarkSharpness in #11622
 - Update install.md by @merrymercy in #11631
 - fix: Update SGL_KERNEL_VERSION to 0.3.15 by @zhyncs in #11633
 - [router][grpc] add warm up to grpc server by @slin1237 in #11627
 - Refactor kv cache free by @cctry in #11351
 - [router] update router doc to latest features by @slin1237 in #11639
 - fix: upgrade transformers to 4.57.1 by @csahithi in #11628
 - [router] add worker self discovery for metadata by @slin1237 in #11638
 - [router] upgrade to 0.2.0 by @slin1237 in #11642
 - [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP by @UNIDY2002 in #10423
 - [1/N]Support DeepSeek-R1 w4a8 normal deepep by @ayrnb in #8247
 - [Fix] Fix accuracy bug in CSGMV kernel caching key. by @lifuhuang in #11579
 - feat: add add_chunked_prefix_cache_attention_backend by @zhyncs in #11636
 - Super tiny improve FA3 import error message by @fzyzcjy in #11590
 - [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl by @ZhengWG in #11458
 - [Doc] Update support matrix for attn and hybrid attn by @b8zhong in #11293
 - Clean up some Qwen3-Next and deterministic code by @hebiao064 in #11585
 - docs: update sglang installation guide by @zhyncs in #11659
 - Tiny cleanup some eagle unused codes by @hnyls2002 in #11660
 - Fix 1-step draft model forward by @ShangmingCai in #11653
 - [tool call] Fix prev_tool_call_arr management in base_format_detector.py by @CatherineSue in #11367
 - [router] Fix response api related spec by @key4ng in #11621
 - Fix missing json imports in serving_responses.py by @CatherineSue in #11681
 - [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11674
 - [sgl-kernel] Optimize gguf test by @FlamingoPg in #11667
 - [router][grpc] Simplify model_id determination by @CatherineSue in #11684
 - [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding by @slin1237 in #11676
 - chore: bump SGLang version to 0.5.3.post2 by @sglang-bot in #11680
 - [CI][XPU]enable sglang CI on Intel XPU by @DiweiSun in #9493
 - enable rmsnorm on XPU by @huaiyuzh in #10248
 - Sync code and test CI; rename some env vars by @merrymercy in #11686
 - docs: Add Contributor Covenant Code of Conduct by @zhyncs in #11689
 - [Mamba] Increase default mamba_full_memory_ratio to 0.9 by @hanming-lu in #11679
 - [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) by @ShangmingCai in #10912
 - [sgl-kernel] support hadamard by @FlamingoPg in #11663
 - Fix missing a2a backend init of GLM4.5 MoE Block by @ShangmingCai in #11692
 - Split test_intel_amx_attention_backend.py to pass CI of timeout by @yanbing-j in #11370
 - Set csgmv as default lora backend. by @lifuhuang in #11488
 - [Bugfix] Fix Qwen3/DSV3/DSV3.2 model support by @iforgetmyname in #11510
 - [CI] Add GLM4MoE model test by @ShangmingCai in #11706
 - [router] fix get_models endpoint for openai router by @key4ng in #11687
 - [ci]use H20 to run disaggregation test by @HanHan009527 in #11543
 - chore: bump SGLang version to 0.5.3.post3 by @sglang-bot in #11693
 - model: qwen3-omni (thinker-only) by @mickqian in #10911
 - [Router] Refactor protocol definitions: split spec.rs into modular files by @key4ng in #11677
 - [router] fix p and d worker filtering and bootstrap port handling by @slin1237 in #11729
 - [router][grpc] add dissag info to warm up in grpc server by @slin1237 in #11727
 - [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut by @CatherineSue in #11731
 - Revert "make radix cache deterministic" by @Fridge003 in #11728
 - Reduce the image processing latency in VLM by @zhooooong in #11541
 - [router] add spec.rs to enables tests under spec folder by @key4ng in #11734
 - [router] Add rustfmt and set group imports by default by @CatherineSue in #11732
 - Revert "[router] fix get_models endpoint for openai router (#11687)" by @key4ng in #11740
 - [router][CI] Clean up deprecated fields in `pr-test-pd-router.yml` by @CatherineSue in #11739
 - [CI] Fix broken event loop creation by @hnyls2002 in #11746
 - [overlap-spec] Make plan stream an option by @hnyls2002 in #11724
 - ci: reduce and refactor vlm ut and combine test files by @mickqian in #11062
 - Abstraction for spec worker and code cleanup by @hnyls2002 in #11643
 - add tuned fuse moe kernel for qwen3 235b fp8 on h200 by @pdasgup in #11730
 - Revert "Set csgmv as default lora backend. (#11488)" by @zhyncs in #11735
 - [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder by @slin1237 in #11766
 - [router] fix grpc client time out to 1h by @slin1237 in #11768
 - [doc] update router document by @key4ng in #11767
 - [Feature] Reuse flashinfer workspace for PD-Multiplexing. by @ykcombat in #11540
 - Turn on shm_allreduce and shm_allgather for fp16 by @chunyuan-w in #10725
 - [Auto Sync] Update scheduler.py (20251017) by @zhyncs in #11738
 - [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log by @CatherineSue in #11775
 - Cleaning indexer for DeepSeek V3.2 by @Fridge003 in #11682
 - [minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests by @merrymercy in #11777
 - [Auto Sync] Update common.py (20251017) by @merrymercy in #11782
 - [Fix] Skip visual layers when applying LoRA to Qwen2VL modules by @anvdn in #11519
 - [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files by @CatherineSue in #11685
 - Super tiny fix missing input throughput by @fzyzcjy in #11607
 - Support shared experts overlap in cutlass moe by @fzyzcjy in #11611
 - Support casting bf16 NextN moe to fp8 by @fzyzcjy in #11613
 - Manually flip deepep_mode for cuda_graph by @zhuzilin in #11666
 - Set CUDA_VISIBLE_DEVICES to achieve one GPU per process by @merrymercy in #9170
 - Super tiny fix CI by @fzyzcjy in #11788
 - Make single-batch overlap compatible with offloading by @fzyzcjy in #11614
 - completely remove mixed mode deterministic test as prefix mode could cover it by @zminglei in #11783
 - [Refactor] move `deep_gemm_wrapper` out of `quantization` by @ch-wan in #11784
 - Enable lint on main by @fzyzcjy in #11794
 - [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client by @CatherineSue in #11798
 - Try add back no-commit-to-branch by @fzyzcjy in #11799
 - fix(glm45): disable reduce scatter by @jinmingyi1998 in #11665
 - fix command line usage of profiling by @Qiaolin-Yu in #11793
 - [RL] support weight update with DP attention by @zhuzilin in #11669
 - [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded by @zhuzilin in #10152
 - set default attention backend for deterministic inference by @zminglei in #11801
 - Eager Compiler for Torch Compile by @Oasis-Git in #11803
 - Fix install instructions and pyproject.tomls by @merrymercy in #11781
 - Bump torch_memory_saver to avoid installing pre-release versions by @fzyzcjy in #11797
 - [HiCache] feat: add more eviction policy by @stmatengss in #11506
 - [overlap-spec] support page size > 1 by @hnyls2002 in #11772
 - support server arg override KV cache to bf16 to avoid slow cases by @b8zhong in #11749
 - feat(example/fastapi): support --startup-timeout using Qwen3-Next-80B-A3B-Instruct as example by @Kindyaa in #11710
 - ci: update `lmms-eval` to speed up multimodal CI by @b8zhong in #11000
 - Use cutlass fp4 gemm by default by @Qiaolin-Yu in #11813
 - Fix Dockerfile not installing correct version of DeepEP for arm build by @kyleliang-nv in #11773
 - [router] Add Configurable L0 and L1 Tokenizer Caching by @slin1237 in #11688
 - [2/2] [feature] support openai like classification api in router by @whybeyoung in #11670
 - [1/2][feature] support openai like classification api by @whybeyoung in #11618
 - make sure logit bias is applied during eagle spec decoding verification by @petricevich in #11555
 - fix: do not wrap invalid grammar objects during constrained generation by @tazjin in #11328
 - Improve `send_one` script by @hnyls2002 in #11817
 - Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in #10788
 - Update CODEOWNERS for layer quantization path by @merrymercy in #11818
 - support tokenized batch request by @narutolhy in #11091
 - Tiny add hints when users send requests to wrong place by @fzyzcjy in #11808
 - Make single-batch overlap compatible with NextN by @fzyzcjy in #11804
 - Support not officially supported high sgl-kernel version with low srt version by @fzyzcjy in #11786
 - Avoid generation gets hanging when user specifies multiple event loops by @fzyzcjy in #5162
 - Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 by @fzyzcjy in #11805
 - Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" by @hnyls2002 in #11827
 - [overlap-spec] fix stop condition and trimming by @hnyls2002 in #11819
 - [Spec Decoding] Support MTP for dsv3.2 by @Paiiiiiiiiiiiiii in #11652
 - [CI] always print back trace in `retry()` by @hnyls2002 in #11834
 - [Test] Add basic matched stop for beta eagle by @hnyls2002 in #11833
 - Deterministic Mode: Add 1-stage triton kernel for prefill by @hebiao064 in #11147
 - [logprobs] Enable local deterministic logprobs testing with strict threshold by @PrinsYin in #10994
 - [CI] Add CI test for DeepSeek V3.2 MTP by @Fridge003 in #11835
 - [NVIDIA] FA3/FA4 Fix by @johnnynunez in #11606
 - [DeepseekV32] Add fast_topk_transform_ragged_fused kernel by @hlu1 in #11815
 - Fix triton_kernels import error on some hardwares by @fzyzcjy in #11831
 - Tiny bump DeepEP version in ARM blackwell by @fzyzcjy in #11810
 - [BugFix] replace the input_to_float8 used in dsv2 by @Liu-congo in #11612
 - [Doc] Update documents for FA4 by @Fridge003 in #11778
 - fix(ci): Fix CI Monitor limit parameter and add CI Analysis to summary by @BBuf in #11832
 - Fix version bump script to handle TOML files with outdated versions by @Kangyan-Zhou in #11787
 - Improve Kernel Build Time by @Kangyan-Zhou in #11508
 - check master server for mooncake store by @huangtingwei9988 in #10510
 - chore: bump sgl-kernel version to 0.3.16.post3 by @sglang-bot in #11733
 - Recapture cuda graph after model weight update to resolve IMA error by @harrisonlimh in #11780
 - [Feature] Use current greenctx stream to communicate in PD-Multiplexing. by @ykcombat in #11594
 - Support mrope triton kernel and add unit test by @yuan-luo in #11722
 - [PD] Improve eagle acceptance rate by transferring draft model hidden states by @ZeldaHuang in #10801
 - Tiny clean up for PD module and doc by @ShangmingCai in #11747
 - Revert "[CI Monitor] Ci monitor only deal with main branch in default" by @BBuf in #11846
 - [Model] Add Olmo 3 model support by @2015aroras in #11396
 - Update amd gpu install docs. by @saienduri in #11849
 - [AMD CI] Populate image cache in nightly docker release. by @saienduri in #11822
 - fix(server_args): handle tokenizer init conflicts by @ishandhanani in #11776
 - [Feature] New structural tag support by @DarkSharpness in #10691
 - Tiny fix main lint by @hnyls2002 in #11862
 - [9/N] MoE Refactor: cleanup dispatcher interfaces by @ch-wan in #11847
 - Fix acc len and gen throughput metrics when enabling overlap-spec by @Qiaolin-Yu in #11823
 - Replace function call with set literal by @penguin-wwy in #11867
 - Support mixing cutedsl and deepgemm backend by @fzyzcjy in #11807
 - [router] Worker Management Workflow Engine by @slin1237 in #11868
 - [router] remove encoding header for oai router by @slin1237 in #11881
 - [Auto Sync] Update scheduler.py, server_args.py (20251020) by @merrymercy in #11875
 - [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` by @CatherineSue in #11882
 - fix(sgl-router): fix conflict port in test by @htiennv in #11826
 - [router] clean up workflow logs to debug for implementation details logs by @slin1237 in #11886
 - [code move] move pp into a separate mixin by @merrymercy in #11838
 - [router][grpc] Fix warm-up random token ids for small models by @CatherineSue in #11887
 - Revise MRotaryEmbedding's forward by @yuan-luo in #11859
 - piecewise cuda graph support qwen3-moe by @BBuf in #11845
 - Fix RotaryEmbedding for fp32 input by @zhangdonghao-zdh in #11843
 - Init attention backend for Intel XPU by @airMeng in #10656
 - Use trtllm_mla decode kernel for draft extend in speculative decoding by @Qiaolin-Yu in #11664
 - [router] release router 0.2.1 by @slin1237 in #11885
 - [AMD] Update wave-lang to 3.8.0 by @xintin in #11878
 - init support for KTransformers Heterogeneous Computing by @Atream in #11487
 - [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection by @neelabhsinha in #11570
 - [fix] fix ci uv install dependency by @HanHan009527 in #11895
 - Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] by @whybeyoung in #11416
 - Simplify multi-tokenizer by @zhengkezhou1 in #11295
 - [CI] disable glm4.1v and fix the flashinfer installation by @ShangmingCai in #11902
 - vlm: enforce pybase64 for image and str encode/decode by @b8zhong in #10700
 - [smol] [perf] Inverse perm improvement by @vincentzed in #11482
 - [quantization][MoE] fix the check for `tp_size`/`moe_ep_size`/`moe_intermediate_size`/`weight_block_size_n` by @kevin85421 in #11702
 - [CI] Fix b200 flashinfer installation by @ShangmingCai in #11915
 - Fix flush cache API for spec v2 by @hnyls2002 in #11918
 - [NVIDIA] Add new SMs support for Spark & Thor by @Kh4L in #11287
 - Update sgl-kernel and remove fast hadamard dependency by @Fridge003 in #11844
 - Rename flashmla kernel options of nsa backend for better readability by @Fridge003 in #11876
 - chore: upgrade flashinfer 0.4.1 by @zhyncs in #11933
 - [BugFix][Qwen3-VL]: add metadata for video in qwen3-vl by @ZhengWG in #11377
 - [Auto Sync] Update forward_batch_info.py (20251021) by @zhyncs in #11934
 - Fix openai input_text type compatibility by @key4ng in #11935
 - fix: resolve flashinfer 0.4.1 import by @zhyncs in #11940
 - [router][grpc] Support `v1/responses` API by @CatherineSue in #11926
 - [router] Add gRPC E2E test suite by @key4ng in #11790
 - [router][grpc] Fix background tasks stored with wrong id by @CatherineSue in #11945
 - [lint] improve ruff check by @hnyls2002 in #11922
 - [sgl-kernel] support flashmla libtorch by @FlamingoPg in #11717
 - [NVIDIA] upstream FA4 and fix cccl path by @johnnynunez in #11929
 - Enable native ModelOpt quantization support (3/3) by @Edwardf0t1 in #10154
 - Fix mooncake dispatcher by @UNIDY2002 in #11908
 - [2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank by @HanHan009527 in #10606
 - [model] Support POINTSV15Chat model by @josephydu in #9651
 - Fix flaky hicache test with mooncake backend by @ShangmingCai in #11953
 - [Fix] Remove unused import from triton_kernels_moe.py by @FlamingoPg in #11967
 - [router] Support multiple worker URLs for OpenAI router by @key4ng in #11723
 - [Documentation] add doc for deterministic inference by @zminglei in #11956
 - [6/n]decouple quantization implementation from vLLM dependency by @Hongbosherlock in #10750
 - [BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… by @yuho8818 in #11977
 - Revert "Recapture cuda graph after model weight update to resolve IMA error" by @merrymercy in #11980
 - [NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel by @jiahanc in #11563
 - [router] create worker removal step and clean up worker manager by @slin1237 in #11921
 - Implement BGE-M3 Sparse Embeddings in SGLang by @approximated-intelligence in #10869
 - [Doc] Update deterministic inference flag in server_arguments.md by @Fridge003 in #11978
 - [grpc] Support gRPC standard health check by @CatherineSue in #11955
 - [AMD] Support a new flag to disable quant on parallelLinear layer if required by @yichiche in #11811
 - [ROCm] Remove vLLM rope dependency & use AITER impl by @b8zhong in #11322
 - [NVIDIA] Build CUDA 13 by @johnnynunez in #11299
 - Bump grace blackwell DeepEP version by @fzyzcjy in #11990
 - [CPU] misc updates by @ZailiWang in #11906
 - fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 by @zheng1 in #11965
 - [CPU] Optimize FP16 decode_attention_cpu by @blzheng in #10652
 - Allow to disable batch decoding. by @LorrinWWW in #11944
 - Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend by @cicirori in #11985
 - aiter update to v0.1.6.post1 by @HaiShaw in #12004
 - Support overlap-spec-v2 with trtllm_mla attention backend by @Qiaolin-Yu in #11821
 - Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 by @netanel-haber in #11866
 - [router] Add comprehensive E2E tests for Response API by @key4ng in #11988
 - [Router] Consolidate ConnectionMode enum to core module by @YouNeedCryDear in #11937
 - Move memory runtime checker to mixin class by @hnyls2002 in #12014
 - Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" by @hnyls2002 in #12015
 - [Fix] memory leak by overlap + retract by @cctry in #11981
 - [Feature] Support loading weights from ckpt engine worker by @stmatengss in #11755
 - [router] change ci names and update log level in ci by @slin1237 in #12021
 - Feature/nano v2 offline modelopt fp8 and nvfp4 by @netanel-haber in #12018
 - [Auto Sync] Update test_deterministic_utils.py (20251023) by @merrymercy in #12022
 - ci: fix night-ci with push retry mechanism by @mickqian in #11765
 - [router][CI] Clean up imports and prints statements in sgl-router/py_test by @CatherineSue in #12024
 - Add AWQ quantization support for NPU. by @ErvinXie in #10158
 - model: support deepseek-ocr by @mickqian in #11891
 - Log iteration # for prefill and decode by @nvcastet in #9366
 - Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" by @b8zhong in #12028
 - Fix mamba radix cache eviction logic in `alloc_req_slots` by @rogeryoungh in #11616
 - Update Github action title for kernel build by @Kangyan-Zhou in #12029
 - [router] Add builder pattern for RouterConfig with zero duplication by @slin1237 in #12030
 - Fixed aarch64 flash-mla by @nvjullin in #12009
 - chore: bump SGLang version to 0.5.4 by @sglang-bot in #12027
 
New Contributors
- @xuwenyihust made their first contribution in #11302
 - @ziruiliu made their first contribution in #10969
 - @scottjlee made their first contribution in #11144
 - @Liu-congo made their first contribution in #11360
 - @lulor made their first contribution in #8919
 - @antoine-roux made their first contribution in #10577
 - @QiuMike made their first contribution in #11465
 - @ai-jz made their first contribution in #11419
 - @neelabhsinha made their first contribution in #11413
 - @UNIDY2002 made their first contribution in #10423
 - @zhooooong made their first contribution in #11541
 - @pdasgup made their first contribution in #11730
 - @anvdn made their first contribution in #11519
 - @Kindyaa made their first contribution in #11710
 - @petricevich made their first contribution in #11555
 - @tazjin made their first contribution in #11328
 - @Paiiiiiiiiiiiiii made their first contribution in #11652
 - @2015aroras made their first contribution in #11396
 - @zhangdonghao-zdh made their first contribution in #11843
 - @xintin made their first contribution in #11878
 - @zhengkezhou1 made their first contribution in #11295
 - @Kh4L made their first contribution in #11287
 - @yuho8818 made their first contribution in #11977
 - @jiahanc made their first contribution in #11563
 - @approximated-intelligence made their first contribution in #10869
 - @zheng1 made their first contribution in #11965
 - @ErvinXie made their first contribution in #10158
 - @rogeryoungh made their first contribution in #11616
 - @nvjullin made their first contribution in #12009
 
Full Changelog: v0.5.3...v0.5.4