v0.10.0rc2
Pre-release
What's Changed
 - [Model] use AutoWeightsLoader for bart by @calvin0327 in #18299
 - [Model] Support VLMs with transformers backend by @zucchini-nlp in #20543
 - [bugfix] fix syntax warning caused by backslash by @1195343015 in #21251
 - [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in #21243
 - [Docs] Add RFC Meeting to Issue Template by @simon-mo in #21279
 - Add the instruction to run e2e validation manually before release by @huydhn in #21023
 - [Bugfix] Fix missing placeholder in logger debug by @DarkLight1337 in #21280
 - [Model][1/N] Support multiple poolers at model level by @DarkLight1337 in #21227
 - [Docs] Fix hardcoded links in docs by @hmellor in #21287
 - [Docs] Make tables more space efficient in `supported_models.md` by @hmellor in #21291
 - [Misc] unify variable for LLM instance by @andyxning in #20996
 - Add Nvidia ModelOpt config adaptation by @Edwardf0t1 in #19815
 - [Misc] Add sliding window to flashinfer test by @WoosukKwon in #21282
 - [CPU] Enable shared-memory based pipeline parallel for CPU backend by @bigPYJ1151 in #21289
 - [BugFix] make utils.current_stream thread-safe (#21252) by @simpx in #21253
 - [Misc] Add dummy maverick test by @minosfuture in #21199
 - [Attention] Clean up iRoPE in V1 by @LucasWilkinson in #21188
 - [DP] Fix Prometheus Logging by @robertgshaw2-redhat in #21257
 - Fix bad lm-eval fork by @mgoin in #21318
 - [perf] Speed up align sum kernels by @hj-mistral in #21079
 - [v1][sampler] Inplace logprobs comparison to get the token rank by @houseroad in #21283
 - [XPU] Enable external_launcher to serve as an executor via torchrun by @chaojun-zhang in #21021
 - [Doc] Fix CPU doc format by @bigPYJ1151 in #21316
 - [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU by @ratnampa in #21338
 - Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)" by @minosfuture in #21334
 - [Core] Minimize number of dict lookup in _maybe_evict_cached_block by @Jialin in #21281
 - [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible by @tdoublep in #21300
 - [Refactor] Fix Compile Warning #1444-D by @yewentao256 in #21208
 - Fix kv_cache_dtype handling for out-of-tree HPU plugin by @kzawora-intel in #21302
 - [Misc] DeepEPHighThroughtput - Enable Inductor pass by @varun-sundar-rabindranath in #21311
 - [Bug] DeepGemm: Fix Cuda Init Error by @yewentao256 in #21312
 - Update fp4 quantize API by @wenscarl in #21327
 - [Feature][eplb] add verify ep or tp or dp by @lengrongfu in #21102
 - Add arcee model by @alyosha-swamy in #21296
 - [Bugfix] Fix eviction cached blocked logic by @simon-mo in #21357
 - [Misc] Remove deprecated args in v0.10 by @kebe7jun in #21349
 - [Core] Optimize update checks in LogitsProcessor by @Jialin in #21245
 - [benchmark] Port benchmark request sent optimization to benchmark_serving by @Jialin in #21209
 - [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool by @Jialin in #21222
 - [Misc] unify variable for LLM instance v2 by @andyxning in #21356
 - [perf] Add fused MLA QKV + strided layernorm by @mickaelseznec in #21116
 - [feat]: add SM100 support for cutlass FP8 groupGEMM by @djmmoss in #20447
 - [Perf] Cuda Kernel for Per Token Group Quant by @yewentao256 in #21083
 - Adds parallel model weight loading for runai_streamer by @bbartels in #21330
 - [feat] Enable mm caching for transformers backend by @zucchini-nlp in #21358
 - Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" by @yewentao256 in #21384
 - Add tokenization_kwargs to encode for embedding model truncation by @Receiling in #21033
 - [Bugfix] Decode Tokenized IDs to Strings for `hf_processor` in `llm.chat()` with `model_impl=transformers` by @ariG23498 in #21353
 - [CI/Build] Fix test failure due to updated model repo by @DarkLight1337 in #21375
 - Fix Flashinfer Allreduce+Norm enable/disable calculation based on `fi_allreduce_fusion_max_token_num` by @xinli-git in #21325
 - [Model] Add Qwen3CoderToolParser by @ranpox in #21396
 - [Misc] Copy HF_TOKEN env var to Ray workers by @ruisearch42 in #21406
 - [BugFix] Fix ray import error mem cleanup bug by @joerunde in #21381
 - [CI/Build] Fix model executor tests by @DarkLight1337 in #21387
 - [Bugfix][ROCm][Build] Fix build regression on ROCm by @gshtras in #21393
 - Simplify weight loading in Transformers backend by @hmellor in #21382
 - [BugFix] Update python to python3 calls for image; fix prefix & input calculations. by @ericehanley in #21391
 - [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update by @xuechendi in #21414
 - [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported by @elvischenv in #21420
 - Changing "amdproduction" allocation. by @Alexei-V-Ivanov-AMD in #21409
 - [Bugfix] Fix nightly transformers CI failure by @Isotr0py in #21427
 - [Core] Add basic unit test for maybe_evict_cached_block by @Jialin in #21400
 - [Cleanup] Only log MoE DP setup warning if DP is enabled by @mgoin in #21315
 - add clear messages for deprecated models by @youkaichao in #21424
 - [Bugfix] ensure `tool_choice` is popped when `tool_choice: null` is passed in JSON payload by @gcalmettes in #19679
 - Fixed typo in profiling logs by @sergiopaniego in #21441
 - [Docs] Fix bullets and grammars in tool_calling.md by @windsonsea in #21440
 - [Sampler] Introduce logprobs mode for logging by @houseroad in #21398
 - Mamba V2 Test not Asserting Failures. by @fabianlim in #21379
 - [Misc] fixed nvfp4_moe test failures due to invalid kwargs by @chenyang78 in #21246
 - [Docs] Clean up v1/metrics.md by @windsonsea in #21449
 - [Model] add Hunyuan V1 Dense Model support. by @kzjeef in #21368
 - [V1] Check all pooling tasks during profiling by @DarkLight1337 in #21299
 - [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. by @sighingnow in #21364
 - [Tests] Add tests for headless internal DP LB by @njhill in #21450
 - [Core][Model] PrithviMAE Enablement on vLLM v1 engine by @christian-pinto in #20577
 - Add test case for compiling multiple graphs by @sarckk in #21044
 - [TPU][TEST] Fix the downloading issue in TPU v1 test 11. by @QiliangCui in #21418
 - [Core] Add `reload_weights` RPC method by @22quinn in #20096
 - [V1] Fix local chunked attention always disabled by @sarckk in #21419
 - [V0 Deprecation] Remove Prompt Adapters by @mgoin in #20588
 - [Core] Freeze gc during cuda graph capture to speed up init by @mgoin in #21146
 - feat(gguf_loader): accept HF repo paths & URLs for GGUF by @hardikkgupta in #20793
 - [Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding by @deven-labovitch in #21374
 - [Misc] Add dummy maverick test to CI by @minosfuture in #21324
 - [XPU][UT] increase intel xpu CI test scope by @Liangliang-Ma in #21492
 - [Bugfix] Fix casing warning by @MatthewBonanni in #21468
 - [Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process by @david6666666 in #21437
 - [BugFix]: Batch generation from prompt_embeds fails for long prompts by @KazusatoOoko in #21390
 - [BugFix] Fix KVConnector TP worker aggregation by @njhill in #21473
 - [DP] Internal Load Balancing Per Node [`one-pod-per-node`] by @robertgshaw2-redhat in #21238
 - Dump input metadata on crash for async scheduling by @WoosukKwon in #21258
 - [BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses by @yinghai in #21211
 - Add think chunk by @juliendenize in #21333
 
New Contributors
 - @chaojun-zhang made their first contribution in #21021
 - @alyosha-swamy made their first contribution in #21296
 - @bbartels made their first contribution in #21330
 - @Receiling made their first contribution in #21033
 - @ariG23498 made their first contribution in #21353
 - @xinli-git made their first contribution in #21325
 - @ranpox made their first contribution in #21396
 - @ericehanley made their first contribution in #21391
 - @sergiopaniego made their first contribution in #21441
 - @hardikkgupta made their first contribution in #20793
 - @deven-labovitch made their first contribution in #21374
 - @MatthewBonanni made their first contribution in #21468
 - @david6666666 made their first contribution in #21437
 - @KazusatoOoko made their first contribution in #21390
 
Full Changelog: v0.10.0rc1...v0.10.0rc2
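
To try this release candidate locally, a minimal install-and-verify sketch (this assumes the `0.10.0rc2` tag is published to PyPI as a pre-release; pin the exact version, since `pip` skips pre-releases by default):

```shell
# Install the release candidate explicitly (pre-releases are not
# resolved by a bare `pip install vllm`).
pip install vllm==0.10.0rc2

# Smoke test: confirm the installed version matches the tag.
python -c "import vllm; print(vllm.__version__)"
```

If the reported version is not `0.10.0rc2`, check that no previously installed vLLM wheel is shadowing the install in the active environment.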