Highlights
 - Talks: SGLang at AMD AI Dev Day 2025 (slides), SGLang at PyTorch Conference 2025 (slides)
 - Model gateway v0.2 release: https://docs.sglang.ai/advanced_features/router.html
 - [beta] Overlap scheduler for speculative decoding: #11762
 - [beta] Piecewise CUDA graph for prefill: #11490
 - Prefix cache for Qwen3-Next and GDN/Mamba models: #11214
 - Full set of optimizations for DeepSeek-V3.2 (MTP, PD-Disagg, Function Calling) (https://docs.sglang.ai/basic_usage/deepseek_v32.html, #11989)
 - Various Blackwell kernel optimizations
 - DGX Spark Support: https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/
 - KTransformer integration: https://lmsys.org/blog/2025-10-22-KTransformers/
 - New model support: Nemotron, DeepSeek OCR, Qwen3-Omni, Olmo 3
 - Native ModelOpt quantization support
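Several highlights above (the model gateway, speculative decoding, DeepSeek-V3.2 support) are exercised through the server's OpenAI-compatible API. A minimal client sketch, assuming a server was launched separately (e.g. `python -m sglang.launch_server --model-path <model> --port 30000`); the model name and port here are placeholders, and the payload follows the standard OpenAI Chat Completions schema:

```python
# Minimal client sketch for an SGLang server's OpenAI-compatible chat endpoint.
# Assumes a server is already running locally on port 30000 (placeholder).
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:30000") -> dict:
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_request("deepseek-ai/DeepSeek-V3.2-Exp", "Hello!")
# send_chat_request(payload)  # requires a running server
```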
 
What's Changed
- [router] add ipv6 support across all components by @slin1237 in #11219
 - Remove env var warnings for release by @merrymercy in #11262
 - Enable native ModelOpt quantization support (1/3) by @Edwardf0t1 in #7149
 - [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` by @CatherineSue in #11270
 - disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 by @gongwei-130 in #11274
 - docker: add manifest to versioned docker releases by @ishandhanani in #11268
 - [Bug] Fix incorrect assertion in FA4 and add UT. by @lifuhuang in #11182
 - [router][grpc] Refine streaming processes by @CatherineSue in #11277
 - Fix code sync scripts by @merrymercy in #11276
 - [Auto Sync] Update test_utils.py (20251006) by @merrymercy in #11280
 - Rename max_micro_batch_size -> pp_max_micro_batch_size by @merrymercy in #11279
 - Reverse the AMD CI test back to 1200s and split the 8-gpu deepseek job into two. by @sunxxuns in #11238
 - Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components by @ConnorLi96 in #11261
 - fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration by @JustinTong0323 in #11282
 - docs: update sgl-kernel README by @zhyncs in #11286
 - chore: bump sgl-kernel version to 0.3.15 by @sglang-bot in #11281
 - [router][grpc] Fix proto3 default value mismatches and cleanup unused fields by @CatherineSue in #11283
 - convert test_deterministic into unit tests by @skyzh in #11095
 - Feature/longbench v2 evaluation utils by @alhridoy in #10949
 - [ci] fix pp test by @hnyls2002 in #11294
 - EAGLE cache fix for SWARadixCache by @ispobock in #11231
 - Remove overlap thread by @hnyls2002 in #11210
 - [router] add reasoning and tool parser argument in router by @slin1237 in #11290
 - Remove sampling info events and overlap thread file by @hnyls2002 in #11300
 - Introduce future indices by @hnyls2002 in #11301
 - [sgl-kernel] Support float64 moe_sum_reduce cuda kernel by @yuan-luo in #11068
 - [Docs] [Router] Update Observability and Common Issues Section by @xuwenyihust in #11302
 - [router] add get server info and get model info in grpc server by @slin1237 in #11303
 - [router][grpc] Refactor chat template content format detection by @CatherineSue in #11288
 - [Doc] HiCache Design Documents by @ykwd in #11027
 - [Doc]: Best Practice for HICache by @hzh0425 in #11001
 - [router] fix grpc connection conversion and add optimization by @slin1237 in #11305
 - [router][grpc] Fix sampling_params.stop_strs is None by @CatherineSue in #11306
 - Update tool parser and related documentation by @JustinTong0323 in #11223
 - [router][grpc] Fix error message format in grpc chat handler by @CatherineSue in #11307
 - [quantization] Properly ignore quantization for layers excluded in quant_config by @BowenBao in #11205
 - [router] support Openai router conversation API CRUD by @key4ng in #11297
 - [router][grpc] Fix request_id extraction when n > 1 by @CatherineSue in #11311
 - [router] cleanup worker health check to return early by @slin1237 in #11310
 - [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults by @CatherineSue in #11304
 - Clean match_prefix and prepare_for_extend for mem cache V2 by @cctry in #11200
 - ci: unify the model launch method of nightly ci by @mickqian in #11230
 - [Chore] Update xgrammar 0.1.24 -> 0.1.25 by @DarkSharpness in #10710
 - update sampling_params documentation with defaults by @JustinTong0323 in #11315
 - Optimize copy_kv_cache for spec decoding by @YAMY1234 in #11126
 - Rename `ngram_utils` -> `ngram_info` by @hnyls2002 in #11316
 - [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator by @CatherineSue in #11314
 - [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints by @adarshxs in #9545
 - [8/N] MoE Refactor: deprecate `EPMoE` by @ch-wan in #11211
 - Skip weight loading in deepgemm compilation by @ch-wan in #11312
 - [2/2] Support MHA prefill with FlashAttention 4. by @lifuhuang in #10937
 - [Doc] Update mooncake nvlink transport doc for PD disaggregation by @ShangmingCai in #11321
 - fix(decode): adjust ServerArgs import to explicit module path by @xiaguan in #11007
 - Support LoRA in bench_serving oai interface by @lifuhuang in #11318
 - benchmark: enhance configurable multimodal benchmarking in bench_serving by @AlienKevin in #9812
 - [CI] improve disaggregation CI. by @hnyls2002 in #11264
 - model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) by @netanel-haber in #10909
 - [router] refactor generate to use new pipeline arch by @slin1237 in #11323
 - [router] improve reasoning parser lock and reduce req cloning by @slin1237 in #11336
 - [router][grpc] Cleanup debug logs in grpc_server and grpc_router by @CatherineSue in #11340
 - [router] Fix all unused_qualifications by @CatherineSue in #11341
 - [router] Support history management using conversation by @key4ng in #11339
 - [router][grpc] Add dependencies in Cargo.toml to support chat template rendering by @CatherineSue in #11342
 - fix: fix revision for sgl-flash-attn in sgl-kernel by @mickqian in #11327
 - [Auto Sync] Update scheduler.py (20251009) by @zhyncs in #11350
 - [Generative Score API] Multi-Item scoring with custom attention mask. by @sundar24295s in #10979
 - [router][grpc] disable health check generation and increase timeout by @slin1237 in #11353
 - [router] Refactor OpenAI router: split monolithic file and move location by @key4ng in #11359
 - [router][lint] Add unused_qualifications to cargo lint warnings by @CatherineSue in #11366
 - [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size by @trevor-m in #11309
 - [router][grpc] Fix tool call streaming bugs: empty tool names, state pollution, and panics by @CatherineSue in #11373
 - add code pp support for nixl by @shaharmor98 in #11375
 - fix bench_serving mishandling of internal states by @shaharmor98 in #11376
 - [router][grpc] Replace fake health check with correct ones by @CatherineSue in #11387
 - [router] change grpc client from mutable to clone by @slin1237 in #11394
 - chore: upgrade flashinfer 0.4.0 by @zhyncs in #11364
 - [router] conversation item API: create, retrieve and delete by @key4ng in #11369
 - chore: bump SGLang version to 0.5.3.post1 by @sglang-bot in #11324
 - move more files under srt/utils by @merrymercy in #11285
 - [grammar] Avoid server crash when grammar backend is None by @JustinTong0323 in #11401
 - fix: fix gpu-proc affinity set incorrectly when pp_size > 1 by @acelyc111 in #11389
 - [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded by @glenliu21 in #11365
 - [CI] Refactor PD disaggregation test suite by @ShangmingCai in #11363
 - Replace pad with cat for better performance by @yuan-luo in #11388
 - fix: reinstall torch in deps install by @zhyncs in #11414
 - feat(hicache): Support passing prefix keys for l3 store. by @hzh0425 in #9045
 - fix file and object naming scheme in HiCacheNixl to avoid data corruption by @ziruiliu in #10969
 - Dedicated toml files for CPU/XPU by @ZailiWang in #10734
 - Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11144
 - chore: update pyproject by @zhyncs in #11420
 - fix: fix video input for qwen3-vl by @mickqian in #11361
 - perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11381
 - [HiCache] feat: add multi tenant with prefix tag by @stmatengss in #9256
 - [CI] Merge build-dev into workflow matrix by @csahithi in #11345
 - Revert "perf: optimize qwen-vl with symm mem allreduce" by @ch-wan in #11436
 - Revert "fix: fix video input for qwen3-vl" by @merrymercy in #11437
 - Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" by @scottjlee in #11433
 - [router] Fix ci nvcc not found error by @key4ng in #11411
 - feat(mooncake): support GB suffix for global_segment_size by @xiaguan in #10745
 - Separate allocation logic from scheduler by @cctry in #11313
 - [router] disable rate limiter by default by @slin1237 in #11435
 - [router] leverage RAII to actively cancel request during client disconnect by @slin1237 in #11399
 - [router][grpc] Consolidate parser checks for chat completions by @CatherineSue in #11439
 - Reorder PD disagg CI tests by @merrymercy in #11438
 - fix: Change dsv32 hack temporary path to use system temp directory by @wxsms in #11445
 - Fix batch invariant ops by @hebiao064 in #11368
 - [BugFix] test_mla_fp8.py fails on Cublas 12.9 by @Liu-congo in #11360
 - [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton by @byjiang1996 in #11450
 - Remove tilelang dependency in Dockerfile by @Fridge003 in #11455
 - Enable native ModelOpt quantization support (2/3) by @Edwardf0t1 in #9991
 - Reland [1/2] Optimizations and refactors about quant kernel by @fzyzcjy in #10312
 - Super tiny delete unused openai router in sgl-router by @fzyzcjy in #11448
 - Adjust logits metadata init for target verify by @hnyls2002 in #11467
 - [Documentation][Configuration] Server args and documentation of PD-Multiplexing. by @ykcombat in #11427
 - Fix enable_v2 in int8 quant by @fzyzcjy in #11470
 - [Fix] Fix split prefill with fa3. by @ykcombat in #11428
 - fix stop when stream by @whybeyoung in #11462
 - Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. by @lulor in #8919
 - [7/n] decouple quantization impl from vllm dependency - gguf kernel by @FlamingoPg in #11019
 - fix Xeon CI by @ZailiWang in #11454
 - [CI] Add nightly builds to dockerhub by @csahithi in #9804
 - [Feature] support regex strings as a stopping condition by @glenliu21 in #10635
 - Beta spec-overlap for EAGLE by @hnyls2002 in #11398
 - Piecewise CUDA Graph Support & Torch Compile Backend by @Oasis-Git in #10062
 - [Router]: Small Typo in a comment within tree.rs by @xuwenyihust in #11489
 - chore: bump sgl-kernel version to 0.3.16 by @sglang-bot in #11476
 - [smol] [perf] Qwen3-VL in place op. by @vincentzed in #11481
 - [chore][1/N] Avoid using default mutable parameters by @kevin85421 in #11478
 - [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends by @MahmoudAshraf97 in #10172
 - [ perf ] Replace json-> orjson in hot path by @vincentzed in #11221
 - [chore][2/N] Avoid using default mutable parameters by @kevin85421 in #11479
 - Fix the GPT function calling regex to allow dash in the name by @antoine-roux in #10577
 - bailingMoE: Fix Key error of deepep_mode by @QiuMike in #11465
 - Fix CI break by express-laned PRs. by @hnyls2002 in #11499
 - Move args from `global_config` to `environ` by @hnyls2002 in #11332
 - move fla env check position by @yizhang2077 in #11500
 - Temporarily remove b200 tests by @merrymercy in #11501
 - Fix port conflicts in CI by @merrymercy in #11497
 - temporarily remove b200 tests by @merrymercy in #11502
 - Fix unit tests by @merrymercy in #11503
 - Bugfix: Fix Type consistency for KV indices in SWARadixCache by @hzh0425 in #11452
 - doc: add doc for adding new models into nightly-ci by @mickqian in #11443
 - [CI] fix lint by @hnyls2002 in #11509
 - Deprecate `global_server_args_dict` by @hnyls2002 in #11331
 - chore: remove flashinfer cleanup cache by @zhyncs in #11514
 - fix: revert temporarily remove b200 tests by @zhyncs in #11515
 - [Fix] Improve longbench prompt and other logics by @byjiang1996 in #11474
 - Sync changes on io_struct.py and deterministic ops by @merrymercy in #11498
 - [lint] Fix the lint issue by @ch-wan in #11516
 - Revert "Deprecate `global_server_args_dict`" by @ch-wan in #11520
 - Improve dp attention port assignment scheme by @jokerwyt in #5889
 - [router] openai router: support grok model by @key4ng in #11511
 - docs(router): add token-bucket rate limiting to the docs by @Jonahcb in #11485
 - [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11432
 - Update DeepSeek-R1-FP4 default config on blackwell by @Qiaolin-Yu in #11512
 - [Fix]: add missing device attribute to ChunkCache by @leavelet in #11493
 - [Feature] Support mamba radix cache v0 by @yizhang2077 in #11214
 - ci: improve nightly-ci by @mickqian in #11385
 - [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering by @BBuf in #11505
 - [HICache]: Support 3FS-Store with page_first_direct layout by @hzh0425 in #11460
 - Tiny fix test run estimated time by @ShangmingCai in #11544
 - [Reland] perf: optimize qwen-vl with symm mem allreduce by @yuan-luo in #11457
 - Deprecate `global_server_args_dict` by @hnyls2002 in #11528
 - [Fix] Add per_channel_quant parameter to MoE config functions by @mmangkad in #11201
 - [router][ci] Add Nightly Release Workflow for SGLang Router by @slin1237 in #11527
 - [router] add tokenizer path to be dir by @slin1237 in #11530
 - Remove `tp_worker.worker` by @hnyls2002 in #11548
 - fix: fix video input for qwen3-vl by @mickqian in #11442
 - [NVIDIA] BUMP FA3 by @johnnynunez in #11444
 - [Fix] Include grpc reflection runtime dependency by @ai-jz in #11419
 - Adjust overlap event loop by @hnyls2002 in #11507
 - Move deep gemm related arguments to `sglang.srt.environ` by @hnyls2002 in #11547
 - [router][grpc] Further delegate non-stream processing to `processing.rs` by @CatherineSue in #11553
 - [router] allow user to specify chat template path by @slin1237 in #11549
 - Minor: improve sampler & remove unused fields from model_config.py by @merrymercy in #11531
 - [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter by @Jonahcb in #11483
 - Add metrics for speculative decoding (acceptance rate, average acceptance length) by @scottjlee in #11441
 - Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) by @trevor-m in #11557
 - [CI] Add Basic Test for DeepSeek V3.2 by @Fridge003 in #11308
 - [router][grpc] Add error handling to `generate_tool_constraints` by @CatherineSue in #11562
 - [NVIDIA] update pyproject.toml to support cu130 option by @johnnynunez in #11521
 - [CI Monitor] Ci monitor only deal with main branch in default by @BBuf in #11538
 - Tiny cleanup fp4 gemm calls by @fzyzcjy in #11537
 - [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck by @CatherineSue in #11564
 - [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds by @YouNeedCryDear in #11571
 - [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11534
 - chore: bump sgl-kernel version to 0.3.16.post1 by @sglang-bot in #11573
 - Fix accept rate in speculative decoding metrics by @Qiaolin-Yu in #11572
 - Compilation Folder Reset by @Oasis-Git in #11539
 - [FEATURE] Add Profile Trace Merger for Distributed Traces by @neelabhsinha in #11413
 - [DSv32] Use torch.compile for _get_logits_head_gate by @trevor-m in #11565
 - Make DeepEP combine recv do not overlap by @fzyzcjy in #11535
 - bench_serving support PD Disaggregation by @BBuf in #11542
 - Implement LRU eviction policy for LoRA adapters by @ConnorLi96 in #11041
 - Revert "[NVIDIA] BUMP FA3 (#11444)" by @zhyncs in #11582
 - chore: bump sgl-kernel version to 0.3.16.post2 by @sglang-bot in #11583
 - [Auto Sync] Update model_config.py (20251014) by @merrymercy in #11580
 - Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json by @Qiaolin-Yu in #11587
 - [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint by @CatherineSue in #11588
 - [router] update generate spec to align with sgl io struct by @slin1237 in #11591
 - [router] change worker api to async instead of sync by @slin1237 in #11566
 - Update news section in README.md by @merrymercy in #11598
 - [router] delete useless table content comment in spec by @slin1237 in #11597
 - [router] allow router launch server to use grpc mode by @slin1237 in #11600
 - [Docs] [Router]: Update sg-router doc on circuit breaker by @xuwenyihust in #11449
 - [router] when given both local tokenizer and chat template, log all by @slin1237 in #11601
 - [AMD CI] Add image and weights caching. by @saienduri in #11593
 - Update release-docker-dev.yml by @sglang-bot in #11603
 - Optimize Triton Draft Backend by @hnyls2002 in #11556
 - Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function by @scottjlee in #11586
 - make radix cache deterministic by @skyzh in #10721
 - move eagle draft post process to cuda graph by @cicirori in #11434
 - Reduce one step decode for draft model. by @hnyls2002 in #11561
 - [router] add py binding and readme for openai router and history backend by @key4ng in #11453
 - [router] cleanup app context and move to startup by @slin1237 in #11617
 - [router] add chang and keyang to sgl router author by @slin1237 in #11620
 - use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. by @strgrb in #11605
 - [router] update router readme to latest features by @slin1237 in #11619
 - Fix log for chunked prefix cache by @Fridge003 in #11624
 - [Auto Sync] Update scheduler.py, server_args.py (20251014) by @merrymercy in #11623
 - [Auto Sync] Update collector.py (20251014) by @merrymercy in #11625
 - [Minor] Update xgrammar dependency by @DarkSharpness in #11622
 - Update install.md by @merrymercy in #11631
 - fix: Update SGL_KERNEL_VERSION to 0.3.15 by @zhyncs in #11633
 - [router][grpc] add warm up to grpc server by @slin1237 in #11627
 - Refactor kv cache free by @cctry in #11351
 - [router] update router doc to latest features by @slin1237 in #11639
 - fix: upgrade transformers to 4.57.1 by @csahithi in #11628
 - [router] add worker self discovery for metadata by @slin1237 in #11638
 - [router] upgrade to 0.2.0 by @slin1237 in #11642
 - [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP by @UNIDY2002 in #10423
 - [1/N]Support DeepSeek-R1 w4a8 normal deepep by @ayrnb in #8247
 - [Fix] Fix accuracy bug in CSGMV kernel caching key. by @lifuhuang in #11579
 - feat: add add_chunked_prefix_cache_attention_backend by @zhyncs in #11636
 - Super tiny improve FA3 import error message by @fzyzcjy in #11590
 - [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl by @ZhengWG in #11458
 - [Doc] Update support matrix for attn and hybrid attn by @b8zhong in #11293
 - Clean up some Qwen3-Next and deterministic code by @hebiao064 in #11585
 - docs: update sglang installation guide by @zhyncs in #11659
 - Tiny cleanup some eagle unused codes by @hnyls2002 in #11660
 - Fix 1-step draft model forward by @ShangmingCai in #11653
 - [tool call] Fix prev_tool_call_arr management in base_format_detector.py by @CatherineSue in #11367
 - [router] Fix response api related spec by @key4ng in #11621
 - Fix missing json imports in serving_responses.py by @CatherineSue in #11681
 - [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM by @HydraQYH in #11674
 - [sgl-kernel] Optimize gguf test by @FlamingoPg in #11667
 - [router][grpc] Simplify model_id determination by @CatherineSue in #11684
 - [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding by @slin1237 in #11676
 - chore: bump SGLang version to 0.5.3.post2 by @sglang-bot in #11680
 - [CI][XPU]enable sglang CI on Intel XPU by @DiweiSun in #9493
 - enable rmsnorm on XPU by @huaiyuzh in #10248
 - Sync code and test CI; rename some env vars by @merrymercy in #11686
 - docs: Add Contributor Covenant Code of Conduct by @zhyncs in #11689
 - [Mamba] Increase default mamba_full_memory_ratio to 0.9 by @hanming-lu in #11679
 - [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) by @ShangmingCai in #10912
 - [sgl-kernel] support hadamard by @FlamingoPg in #11663
 - Fix missing a2a backend init of GLM4.5 MoE Block by @ShangmingCai in #11692
 - Split test_intel_amx_attention_backend.py to pass CI of timeout by @yanbing-j in #11370
 - Set csgmv as default lora backend. by @lifuhuang in #11488
 - [Bugfix] Fix Qwen3/DSV3/DSV3.2 model support by @iforgetmyname in #11510
 - [CI] Add GLM4MoE model test by @ShangmingCai in #11706
 - [router] fix get_models endpoint for openai router by @key4ng in #11687
 - [ci]use H20 to run disaggregation test by @HanHan009527 in #11543
 - chore: bump SGLang version to 0.5.3.post3 by @sglang-bot in #11693
 - model: qwen3-omni (thinker-only) by @mickqian in #10911
 - [Router] Refactor protocol definitions: split spec.rs into modular files by @key4ng in #11677
 - [router] fix p and d worker filtering and bootstrap port handling by @slin1237 in #11729
 - [router][grpc] add dissag info to warm up in grpc server by @slin1237 in #11727
 - [router] Fix tool_choice normalization in ChatCompletionRequest and fix ut by @CatherineSue in #11731
 - Revert "make radix cache deterministic" by @Fridge003 in #11728
 - Reduce the image processing latency in VLM by @zhooooong in #11541
 - [router] add spec.rs to enables tests under spec folder by @key4ng in #11734
 - [router] Add rustfmt and set group imports by default by @CatherineSue in #11732
 - Revert "[router] fix get_models endpoint for openai router (#11687)" by @key4ng in #11740
 - [router][CI] Clean up deprecated fields in `pr-test-pd-router.yml` by @CatherineSue in #11739
 - [CI] Fix broken event loop creation by @hnyls2002 in #11746
 - [overlap-spec] Make plan stream an option by @hnyls2002 in #11724
 - ci: reduce and refactor vlm ut and combine test files by @mickqian in #11062
 - Abstraction for spec worker and code cleanup by @hnyls2002 in #11643
 - add tuned fuse moe kernel for qwen3 235b fp8 on h200 by @pdasgup in #11730
 - Revert "Set csgmv as default lora backend. (#11488)" by @zhyncs in #11735
 - [router] Fix UTF-8 Boundary Panic in Stop Sequence Decoder by @slin1237 in #11766
 - [router] fix grpc client time out to 1h by @slin1237 in #11768
 - [doc] update router document by @key4ng in #11767
 - [Feature] Reuse flashinfer workspace for PD-Multiplexing. by @ykcombat in #11540
 - Turn on shm_allreduce and shm_allgather for fp16 by @chunyuan-w in #10725
 - [Auto Sync] Update scheduler.py (20251017) by @zhyncs in #11738
 - [router][grpc] Remove timeout for connections and remove `max_tokens` deprecation warning log by @CatherineSue in #11775
 - Cleaning indexer for DeepSeek V3.2 by @Fridge003 in #11682
 - [minor] sync code on python/sglang/test/test_deterministic.py and improve ci tests by @merrymercy in #11777
 - [Auto Sync] Update common.py (20251017) by @merrymercy in #11782
 - [Fix] Skip visual layers when applying LoRA to Qwen2VL modules by @anvdn in #11519
 - [Lint] Add `python/sglang` to ruff F401 checks and remove unused imports in files by @CatherineSue in #11685
 - Super tiny fix missing input throughput by @fzyzcjy in #11607
 - Support shared experts overlap in cutlass moe by @fzyzcjy in #11611
 - Support casting bf16 NextN moe to fp8 by @fzyzcjy in #11613
 - Manually flip deepep_mode for cuda_graph by @zhuzilin in #11666
 - Set CUDA_VISIBLE_DEVICES to achieve one GPU per process by @merrymercy in #9170
 - Super tiny fix CI by @fzyzcjy in #11788
 - Make single-batch overlap compatible with offloading by @fzyzcjy in #11614
 - completely remove mixed mode deterministic test as prefix mode could cover it by @zminglei in #11783
 - [Refactor] move `deep_gemm_wrapper` out of `quantization` by @ch-wan in #11784
 - Enable lint on main by @fzyzcjy in #11794
 - [router][grpc] Support parallel queue puts in grpc_request_manager and remove mutex for grpc_client by @CatherineSue in #11798
 - Try add back no-commit-to-branch by @fzyzcjy in #11799
 - fix(glm45): disable reduce scatter by @jinmingyi1998 in #11665
 - fix command line usage of profiling by @Qiaolin-Yu in #11793
 - [RL] support weight update with DP attention by @zhuzilin in #11669
 - [RL] use cpu group to prepare_mlp_sync_batch_raw when the server is offloaded by @zhuzilin in #10152
 - set default attention backend for deterministic inference by @zminglei in #11801
 - Eager Compiler for Torch Compile by @Oasis-Git in #11803
 - Fix install instructions and pyproject.tomls by @merrymercy in #11781
 - Bump torch_memory_saver to avoid installing pre-release versions by @fzyzcjy in #11797
 - [HiCache] feat: add more eviction policy by @stmatengss in #11506
 - [overlap-spec] support page size > 1 by @hnyls2002 in #11772
 - support server arg override KV cache to bf16 to avoid slow cases by @b8zhong in #11749
 - feat(example/fastapi): support --startup-timeout using Qwen3-Next-80B-A3B-Instruct as example by @Kindyaa in #11710
 - ci: update `lmms-eval` to speed up multimodal CI by @b8zhong in #11000
 - Use cutlass fp4 gemm by default by @Qiaolin-Yu in #11813
 - Fix Dockerfile not installing correct version of DeepEP for arm build by @kyleliang-nv in #11773
 - [router] Add Configurable L0 and L1 Tokenizer Caching by @slin1237 in #11688
 - [2/2] [feature] support openai like classification api in router by @whybeyoung in #11670
 - [1/2][feature] support openai like classification api by @whybeyoung in #11618
 - make sure logit bias is applied during eagle spec decoding verification by @petricevich in #11555
 - fix: do not wrap invalid grammar objects during constrained generation by @tazjin in #11328
 - Improve `send_one` script by @hnyls2002 in #11817
 - Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads by @YAMY1234 in #10788
 - Update CODEOWNERS for layer quantization path by @merrymercy in #11818
 - support tokenized batch request by @narutolhy in #11091
 - Tiny add hints when users send requests to wrong place by @fzyzcjy in #11808
 - Make single-batch overlap compatible with NextN by @fzyzcjy in #11804
 - Support not officially supported high sgl-kernel version with low srt version by @fzyzcjy in #11786
 - Avoid generation gets hanging when user specifies multiple event loops by @fzyzcjy in #5162
 - Change bf16 to fp8 for some gemms in attention for DeepSeek ckpt v2 by @fzyzcjy in #11805
 - Revert "Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads" by @hnyls2002 in #11827
 - [overlap-spec] fix stop condition and trimming by @hnyls2002 in #11819
 - [Spec Decoding] Support MTP for dsv3.2 by @Paiiiiiiiiiiiiii in #11652
 - [CI] always print back trace in `retry()` by @hnyls2002 in #11834
 - [Test] Add basic matched stop for beta eagle by @hnyls2002 in #11833
 - Deterministic Mode: Add 1-stage triton kernel for prefill by @hebiao064 in #11147
 - [logprobs] Enable local deterministic logprobs testing with strict threshold by @PrinsYin in #10994
 - [CI] Add CI test for DeepSeek V3.2 MTP by @Fridge003 in #11835
 - [NVIDIA] FA3/FA4 Fix by @johnnynunez in #11606
 - [DeepseekV32] Add fast_topk_transform_ragged_fused kernel by @hlu1 in #11815
 - Fix triton_kernels import error on some hardwares by @fzyzcjy in #11831
 - Tiny bump DeepEP version in ARM blackwell by @fzyzcjy in #11810
 - [BugFix] replace the input_to_float8 used in dsv2 by @Liu-congo in #11612
 - [Doc] Update documents for FA4 by @Fridge003 in #11778
 - fix(ci): Fix CI Monitor limit parameter and add CI Analysis to summary by @BBuf in #11832
 - Fix version bump script to handle TOML files with outdated versions by @Kangyan-Zhou in #11787
 - Improve Kernel Build Time by @Kangyan-Zhou in #11508
 - check master server for mooncake store by @huangtingwei9988 in #10510
 - chore: bump sgl-kernel version to 0.3.16.post3 by @sglang-bot in #11733
 - Recapture cuda graph after model weight update to resolve IMA error by @harrisonlimh in #11780
 - [Feature] Use current greenctx stream to communicate in PD-Multiplexing. by @ykcombat in #11594
 - Support mrope triton kernel and add unit test by @yuan-luo in #11722
 - [PD] Improve eagle acceptance rate by transferring draft model hidden states by @ZeldaHuang in #10801
 - Tiny clean up for PD module and doc by @ShangmingCai in #11747
 - Revert "[CI Monitor] Ci monitor only deal with main branch in default" by @BBuf in #11846
 - [Model] Add Olmo 3 model support by @2015aroras in #11396
 - Update amd gpu install docs. by @saienduri in #11849
 - [AMD CI] Populate image cache in nightly docker release. by @saienduri in #11822
 - fix(server_args): handle tokenizer init conflicts by @ishandhanani in #11776
 - [Feature] New structural tag support by @DarkSharpness in #10691
 - Tiny fix main lint by @hnyls2002 in #11862
 - [9/N] MoE Refactor: cleanup dispatcher interfaces by @ch-wan in #11847
 - Fix acc len and gen throughput metrics when enabling overlap-spec by @Qiaolin-Yu in #11823
 - Replace function call with set literal by @penguin-wwy in #11867
 - Support mixing cutedsl and deepgemm backend by @fzyzcjy in #11807
 - [router] Worker Management Workflow Engine by @slin1237 in #11868
 - [router] remove encoding header for oai router by @slin1237 in #11881
 - [Auto Sync] Update scheduler.py, server_args.py (20251020) by @merrymercy in #11875
 - [router][grpc] Remove `continue_final_message` in `ChatTemplateParams` and add `minijinja-contrib` by @CatherineSue in #11882
 - fix(sgl-router): fix conflict port in test by @htiennv in #11826
 - [router] clean up workflow logs to debug for implementation details logs by @slin1237 in #11886
 - [code move] move pp into a separate mixin by @merrymercy in #11838
 - [router][grpc] Fix warm-up random token ids for small models by @CatherineSue in #11887
 - Revise MRotaryEmbedding's forward by @yuan-luo in #11859
 - piecewise cuda graph support qwen3-moe by @BBuf in #11845
 - Fix RotaryEmbedding for fp32 input by @zhangdonghao-zdh in #11843
 - Init attention backend for Intel XPU by @airMeng in #10656
 - Use trtllm_mla decode kernel for draft extend in speculative decoding by @Qiaolin-Yu in #11664
 - [router] release router 0.2.1 by @slin1237 in #11885
 - [AMD] Update wave-lang to 3.8.0 by @xintin in #11878
 - init support for KTransformers Heterogeneous Computing by @Atream in #11487
 - [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection by @neelabhsinha in #11570
 - [fix] fix ci uv install dependency by @HanHan009527 in #11895
 - Support Thinking Budget (via custom_logit_processor for OpenAI API) [Fix #6572] by @whybeyoung in #11416
 - Simplify multi-tokenizer by @zhengkezhou1 in #11295
 - [CI] disable glm4.1v and fix the flashinfer installation by @ShangmingCai in #11902
 - vlm: enforce pybase64 for image and str encode/decode by @b8zhong in #10700
 - [smol] [perf] Inverse perm improvement by @vincentzed in #11482
 - [quantization][MoE] fix the check for `tp_size`/`moe_ep_size`/`moe_intermediate_size`/`weight_block_size_n` by @kevin85421 in #11702
 - [CI] Fix b200 flashinfer installation by @ShangmingCai in #11915
 - Fix flush cache API for spec v2 by @hnyls2002 in #11918
 - [NVIDIA] Add new SMs support for Spark & Thor by @Kh4L in #11287
 - Update sgl-kernel and remove fast hadamard dependency by @Fridge003 in #11844
 - Rename flashmla kernel options of nsa backend for better readability by @Fridge003 in #11876
 - chore: upgrade flashinfer 0.4.1 by @zhyncs in #11933
 - [BugFix][Qwen3-VL]: add metadata for video in qwen3-vl by @ZhengWG in #11377
 - [Auto Sync] Update forward_batch_info.py (20251021) by @zhyncs in #11934
 - Fix openai input_text type compatibility by @key4ng in #11935
 - fix: resolve flashinfer 0.4.1 import by @zhyncs in #11940
 - [router][grpc] Support `v1/responses` API by @CatherineSue in #11926
 - [router] Add gRPC E2E test suite by @key4ng in #11790
 - [router][grpc] Fix background tasks stored with wrong id by @CatherineSue in #11945
 - [lint] improve ruff check by @hnyls2002 in #11922
 - [sgl-kernel] support flashmla libtorch by @FlamingoPg in #11717
 - [NVIDIA] upstream FA4 and fix cccl path by @johnnynunez in #11929
 - Enable native ModelOpt quantization support (3/3) by @Edwardf0t1 in #10154
 - Fix mooncake dispatcher by @UNIDY2002 in #11908
 - [2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank by @HanHan009527 in #10606
 - [model] Support POINTSV15Chat model by @josephydu in #9651
 - Fix flaky hicache test with mooncake backend by @ShangmingCai in #11953
 - [Fix] Remove unused import from triton_kernels_moe.py by @FlamingoPg in #11967
 - [router] Support multiple worker URLs for OpenAI router by @key4ng in #11723
 - [Documentation] add doc for deterministic inference by @zminglei in #11956
 - [6/n]decouple quantization implementation from vLLM dependency by @Hongbosherlock in #10750
 - [BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… by @yuho8818 in #11977
 - Revert "Recapture cuda graph after model weight update to resolve IMA error" by @merrymercy in #11980
 - [NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel by @jiahanc in #11563
 - [router] create worker removal step and clean up worker manager by @slin1237 in #11921
 - Implement BGE-M3 Sparse Embeddings in SGLang by @approximated-intelligence in #10869
 - [Doc] Update deterministic inference flag in server_arguments.md by @Fridge003 in #11978
 - [grpc] Support gRPC standard health check by @CatherineSue in #11955
 - [AMD] Support a new flag to disable quant on parallelLinear layer if required by @yichiche in #11811
 - [ROCm] Remove vLLM rope dependency & use AITER impl by @b8zhong in #11322
 - [NVIDIA] Build CUDA 13 by @johnnynunez in #11299
 - Bump grace blackwell DeepEP version by @fzyzcjy in #11990
 - [CPU] misc updates by @ZailiWang in #11906
 - fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 by @zheng1 in #11965
 - [CPU] Optimize FP16 decode_attention_cpu by @blzheng in #10652
 - Allow to disable batch decoding. by @LorrinWWW in #11944
 - Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend by @cicirori in #11985
 - aiter update to v0.1.6.post1 by @HaiShaw in #12004
 - Support overlap-spec-v2 with trtllm_mla attention backend by @Qiaolin-Yu in #11821
 - Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 by @netanel-haber in #11866
 - [router] Add comprehensive E2E tests for Response API by @key4ng in #11988
 - [Router] Consolidate ConnectionMode enum to core module by @YouNeedCryDear in #11937
 - Move memory runtime checker to mixin class by @hnyls2002 in #12014
 - Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" by @hnyls2002 in #12015
 - [Fix] memory leak by overlap + retract by @cctry in #11981
 - [Feature] Support loading weights from ckpt engine worker by @stmatengss in #11755
 - [router] change ci names and update log level in ci by @slin1237 in #12021
 - Feature/nano v2 offline modelopt fp8 and nvfp4 by @netanel-haber in #12018
 - [Auto Sync] Update test_deterministic_utils.py (20251023) by @merrymercy in #12022
 - ci: fix night-ci with push retry mechanism by @mickqian in #11765
 - [router][CI] Clean up imports and prints statements in sgl-router/py_test by @CatherineSue in #12024
 - Add AWQ quantization support for NPU. by @ErvinXie in #10158
 - model: support deepseek-ocr by @mickqian in #11891
 - Log iteration # for prefill and decode by @nvcastet in #9366
 - Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" by @b8zhong in #12028
 - Fix mamba radix cache eviction logic in `alloc_req_slots` by @rogeryoungh in #11616
 - Update Github action title for kernel build by @Kangyan-Zhou in #12029
 - [router] Add builder pattern for RouterConfig with zero duplication by @slin1237 in #12030
 - Fixed aarch64 flash-mla by @nvjullin in #12009
 - chore: bump SGLang version to 0.5.4 by @sglang-bot in #12027
 
New Contributors
- @xuwenyihust made their first contribution in #11302
 - @ziruiliu made their first contribution in #10969
 - @scottjlee made their first contribution in #11144
 - @Liu-congo made their first contribution in #11360
 - @lulor made their first contribution in #8919
 - @antoine-roux made their first contribution in #10577
 - @QiuMike made their first contribution in #11465
 - @ai-jz made their first contribution in #11419
 - @neelabhsinha made their first contribution in #11413
 - @UNIDY2002 made their first contribution in #10423
 - @zhooooong made their first contribution in #11541
 - @pdasgup made their first contribution in #11730
 - @anvdn made their first contribution in #11519
 - @Kindyaa made their first contribution in #11710
 - @petricevich made their first contribution in #11555
 - @tazjin made their first contribution in #11328
 - @Paiiiiiiiiiiiiii made their first contribution in #11652
 - @2015aroras made their first contribution in #11396
 - @zhangdonghao-zdh made their first contribution in #11843
 - @xintin made their first contribution in #11878
 - @zhengkezhou1 made their first contribution in #11295
 - @Kh4L made their first contribution in #11287
 - @yuho8818 made their first contribution in #11977
 - @jiahanc made their first contribution in #11563
 - @approximated-intelligence made their first contribution in #10869
 - @zheng1 made their first contribution in #11965
 - @ErvinXie made their first contribution in #10158
 - @rogeryoungh made their first contribution in #11616
 - @nvjullin made their first contribution in #12009
 
Full Changelog: v0.5.3...v0.5.4