Commit cdd11a6

Authored by amd-callumm, jikunshang, mmangkad, NickLucche, and njhill
Merge upstream vLLM code into gfx11 (#983)
* [XPU] add gptq(int4) support (#37844) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [UX] Add a persistent cache for FlashInfer autotuning (#42537) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * add cutedsl dsv4 indexer fp8 kernel (#42899) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289) Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> * [ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671) Signed-off-by: junyanxu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Docs] Add SVG images for pooling models. 
(#42626) Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [XPU] Use custom op collective behavior (#41354) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Aligning tokwise pooler heads for consistency (#43041) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Docs] Reorganize online serving docs. (#41907) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] Consolidate beam search by BeamSearchMixin. 
(#42946) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117) Signed-off-by: hao-aaron <ahao@anyscale.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KVConnector][DSV4] HMA support for Mooncake store connector (#42828) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Misc][MM] Remove redundant code in CLIPAttention (#43046) Signed-off-by: shen-shanshan <467638484@qq.com> * [CI] Add MTP + PD disagg test for Qwen3.5 (#42677) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887) Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * [Perf][4/n] Eliminate various GPU<->CPU syncs (#42347) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [XPU] update xpu graph usage (#43043) Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> * [Model] Openvla support (#42654) Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> * [Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Misc] add humming to dependencies (#42540) Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> * [feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * [Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976) 
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> * [CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Move language models tests (hybrid) back to L4 (#43129) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Model] Support post-norm architecture for EAGLE-3 supeculators (#42764) Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> * Fix error in Dynamic NTK scaling (#41277) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [CPU][DOC] Fix installation commands for Arm CPUs (#43115) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> * [bug] fix WeightTransferConfig.backend to allow for all strings (#43121) Signed-off-by: ahao-anyscale <ahao@anyscale.com> * [MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Cohere] Enable Cohere MoE (#43143) Signed-off-by: Terrencezzj <terrence@cohere.ai> * [Perf][Bugfix] Update dflash aux layer indexing (#40727) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * add enqueue all option to throughput benchmark (#42975) Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Perf] Avoid forward scan for async output placeholders (#42938) * [CI] Add DSV4-Flash to gsm8k 
moe-refactor/config-b200.txt (#42111) Signed-off-by: mgoin <mgoin64@gmail.com> * [KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ci] Revert model executor test back to L4 (#43188) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099) Signed-off-by: NickLucche <nlucches@redhat.com> * [Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097) Signed-off-by: NickLucche <nlucches@redhat.com> * [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * Enable mermaid diagrams in the docs (#43192) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [GDN] Enable FI Blackwell GDN prefill kernel (#40717) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [XPU][CI] Add 2 server model test files in Intel GPU CI (#42499) Signed-off-by: zengxian <xiangdong.zeng@intel.com> * [Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330) Signed-off-by: hallerite <git@hallerite.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Doc] Sync CLI guide with actual help modes and launch subcommand (#40326) Signed-off-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> * [Feature] Support manually enabling the cumem allocator (#33648) Signed-off-by: Kebe <mail@kebe7jun.com> * [Spec Decode] Support non-MTP speculation for NemotronH (#43130) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * Remove additional dead code as a follow-up to #42889 (#43144) Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> * [Bug][Structured Outputs] Fix bug 
that leads to unconstrained generations with structural tags (#42452) Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215) Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [R3] Add routed experts to openai entrypoint (#38939) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: NickLucche <nlucches@redhat.com> * Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082) Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * [Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> * [Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add dllehr-amd to CODEOWNERS and committers list (#42772) Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> * [Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135) Signed-off-by: mgoin <mgoin64@gmail.com> * [Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230) 
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> * [ROCm] Add QuickReduce min-size override and codec threshold (#41675) Signed-off-by: <> * [CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255) Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * update GPU json file based on h200 recipes (#43262) Signed-off-by: louie-tsai <louie.tsai@intel.com> * [Minor] Bigger overlap for FI AR (#43103) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329) Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082) Signed-off-by: haosdent <haosdent@gmail.com> * [Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973) Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: <> Signed-off-by: sejung-son <sejung.son@nhn.com> 
Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [Frontend][RFC] Rust front-end integration (#40848) Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905) Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> * [Benchmark] Add num-warmup to vllm bench throughput (#43245) Signed-off-by: Yifan Zong <yzong@redhat.com> * [Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [CI] De-flake test_models for bigscience/bloom-560m (#43197) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU] add setuptools-rust for xpu dependency (#43287) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * Update KDA chunk prefill decay to use exp2 semantics (#43195) Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> * Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223) Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> * [Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: 
Isotr0py <Isotr0py@outlook.com> * [CI] Pin protoc binary in rust-build stages (#43292) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266) Signed-off-by: zengxian <xiangdong.zeng@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943) Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172) Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CI] Fix CPU tests failing on `tl.exp2` import (#43311) Signed-off-by: haosdent <haosdent@gmail.com> * [Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079) Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> * [Perf] `zeros` -> `empty` to remove additional fill (#42988) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Core] Add native ModelExpress load format (#43105) Signed-off-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * Disable build isolation to bypass CUDA related deps for 
vllm-tpu (#43038) Signed-off-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [Frontend] Rework fastokens integration (#43168) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968) Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BugFix] Use correct logprobs for `logprob_token_ids` (#43125) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873) Signed-off-by: Lanze Liu <lanzetech@gmail.com> * [Rust Frontend] Move code from `vllm-frontend-rs` (#43283) Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Eric Curtin <eric.curtin@docker.com> Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Will.hou <willamhou@ceresman.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Eric Curtin <eric.curtin@docker.com> Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Will.hou <willamhou@ceresman.com> Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history. 
* [CI] Fix dockerfile dependency graph failure for pre-commit (#43378) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855) Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> * [ROCm] Add XGMI backend for MoRI Connector (#41753) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [ROCm][CI] add warmup to mem_util test before measurement (#43236) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [Frontend] Add truncation side to OpenAI endpoints (#43260) Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> * [Frontend] DP Supervisor (#40841) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Make CuMemAllocator free callback stream-aware (#43020) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [XPU] Enable multiple key kernels for sparse attention (#37888) Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064) Signed-off-by: haosdent <haosdent@gmail.com> * [Model] Use `AutoWeightsLoader` for Voyage (#42972) Signed-off-by: Furkan Fidan <dev@yufufi.com> * [Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213) Signed-off-by: tc-mb <tianchi_cai@icloud.com> * [CI] Fix 
test_lora_with_spec_decode on V2 model runner (#43314) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296) Signed-off-by: haosdent <haosdent@gmail.com> * Correcting the mock classes for MM GC tests (#43321) Signed-off-by: Weida Hong <wdhongtw@google.com> * [BugFix] Fix setuptools-rust dep in requirements files (#43377) Signed-off-by: Nick Hill <nickhill123@gmail.com> * Fix the docker build failure in tpu-inference (#43360) Signed-off-by: mrjunwan-lang <mrjunwan@google.com> * [Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [CPU] Experimentally enable Triton and MRV2 (#43225) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Attention] Mamba attention module refactor (#41126) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Replace assert with proper exceptions for security and validation in pooling (#43286) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001) Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> * [BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118) Signed-off-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Keyi Li <likey6688@gmail.com> * [LoRA] Reduce memory of 2D weights when EP is set (#42737) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [EPLB] Change default EPLB communicator 
(#43110) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> * [CI] Fix AMD docker build tests (#43329) Signed-off-by: haosdent <haosdent@gmail.com> * Add NVFP4 MOE support for Deepseek V4. (#42209) Signed-off-by: Shiyang Chen <shiychen@nvidia.com> * [Multimodal] Simplify ViT CUDA graph interfaces (#41234) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650) Signed-off-by: zhanda <zhandazhu@gmail.com> Co-authored-by: Shang Wang <shangw@nvidia.com> * [KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149) * [Frontend] Simplify AuthenticationMiddleware path extraction (#43426) Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [RFC][EPLB][#32028] Remove dead torch.accelerator.synchronize() from sync path (#40733) Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> * [Bugfix] Detect wrong libcute_dsl_runtime.so variant in FlashInfer GDN (#43427) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [Bugfix] Clear error message for FP8 torchao quantization on unsupported GPUs (#36854) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * mhc_post - remove sts & add vectorized copies (#43437) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Quantization][ModelOpt] W4A16 NVFP4 
fused MoE + mixed-precision dispatch (#42566) Signed-off-by: Juhi Mittal <juhim@nvidia.com> * [Model Runner V2] Support sharing kv cache layers (#35045) Signed-off-by: Nick Hill <nickhill123@gmail.com> * DSv4 fused Q-norm kernel grid refactor (#42353) * [Perf] Optimize hidden state extraction logic (#37374) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [XPU]fix: add XPU platform guards to DeepSeek-V4 ops (#42950) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * elastic_ep: stage/commit MoE quant method on reconfigure (#40881) Signed-off-by: Itay Alroy <ialroy@nvidia.com> * [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend (#38822) * Add `model` to `WeightTransferEngine.__init__` (#42922) Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSV4] More multi-stream enablement for c4a (#42925) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Stabilize runner teardown between sampler tests (#43023) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm][CI] Stabilize Granite tool-use and test URL construction (#43017) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix] Auto-raise max_num_batched_tokens for prefix-LM multimodal models (#43051) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> * [ROCm][CI] Fix ROCm LoRA Transformers fallback with full CUDA graphs (#41577) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU]feat: enable FP8 block-scaled quantization on XPU (#42952) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [XPU] reudce host overhead of XPU MOE (#42915) Signed-off-by: mayuyuace 
<qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Misc] Added missing return type annotations to improve mypy and IDE tooling (#43383) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Bugfix] Fix native Triton top-k/top-p kernel assumes contiguous logi… (#42739) Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> * [ModelOpt] Support Qwen3.5/3.6 VLM quantized prefix mapping (#42546) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * Keep scheduler alive for delayed KV connector frees (#43433) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix(eagle3): read norm_before_fc from eagle_config for NVIDIA checkpoint (#42143) Signed-off-by: FERRARIZHENG <popkart06@gmail.com> * [Kernel] Batch invariant NVFP4 linear using cutlass (#39912) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Remove benchmarks test group and shard long test groups (#41669) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Frontend] Fix input_audio parsing when uuid is present (#43414) Signed-off-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix 
stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. 
(#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> 
Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values 
(#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) 
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. 
Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> 
* [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) 
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] build-rocm-wheels.yml: reduce MAX_JOBS to prevent OOM Signed-off-by: <callumm@amd.com> --------- Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: george <george@inferact.ai> Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Signed-off-by: junyanxu <junyanxu5513@gmail.com> Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: ZhanqiuHu <zhu@redhat.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Signed-off-by: Dom Brown 
<3886319+DomBrown@users.noreply.github.com> Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Terrencezzj <terrence@cohere.ai> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> Signed-off-by: zengxian <xiangdong.zeng@intel.com> Signed-off-by: hallerite <git@hallerite.com> Signed-off-by: Rui Wang <raygorous@gmail.com> Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: <> Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> 
Signed-off-by: louie-tsai <louie.tsai@intel.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: sejung-son <sejung.son@nhn.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Zheng Luo <zheluo@nvidia.com> Signed-off-by: Ylang Tsou <ylangt@google.com> Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: zixi-qi 
<zixi@inferact.ai> Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Signed-off-by: Furkan Fidan <dev@yufufi.com> Signed-off-by: tc-mb <tianchi_cai@icloud.com> Signed-off-by: Weida Hong <wdhongtw@google.com> Signed-off-by: mrjunwan-lang <mrjunwan@google.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Ma Jian <jian1.ma@intel.com> Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> Signed-off-by: Keyi Li <likey6688@gmail.com> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: Shiyang Chen <shiychen@nvidia.com> Signed-off-by: zhanda <zhandazhu@gmail.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: FERRARIZHENG <popkart06@gmail.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: ffggs <314137448@qq.com> Signed-off-by: Hua Huang <huah@nvidia.com> Signed-off-by: holegots <ikun3.1415927@gmail.com> Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Signed-off-by: Rotem 
Shavitt <rshavitt@gmail.com> Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Yihuki <wangbovbvb@gmail.com> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> Signed-off-by: Hank <hcc.mayday@gmail.com> Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> Signed-off-by: ThibaultCastells <thib.castells@icloud.com> Signed-off-by: linzm1007 <linzm1007@126.com> Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Kevin Luu <kevin@inferact.ai> Signed-off-by: Zhewen Li <zhewen@inferact.ai> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Angela Yi <yiangela7@gmail.com> Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> Signed-off-by: Aditya Singh <adisin650@gmail.com> 
Signed-off-by: Injae Ryou <injaeryou@gmail.com> Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> Signed-off-by: jatseng-ai <jatseng@amd.com> Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Signed-off-by: <callumm@amd.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: gnovack <gnovack@amazon.com> Co-authored-by: george <george@inferact.ai> Co-authored-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Junyan Xu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com> Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com> Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com> Co-authored-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Jinzhen Lin 
<jinzhen.ljz@antgroup.com> Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Dao007forever <dao007forever@gmail.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> Co-authored-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com> Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: Izik Golan <47969623+izikgo@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Artem Perevedentsev <aperevedents@nvidia.com> Co-authored-by: xiangdong <40376367+zxd1997066@users.noreply.github.com> Co-authored-by: hallerite <git@hallerite.com> Co-authored-by: Ray Wang <roguerui6@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: haosdent <haosdent@gmail.com> Co-authored-by: meena-at-work <80416898+meena-at-work@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Douglas Lehr 
<91553416+dllehr-amd@users.noreply.github.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: akii96 <aakif.nawaz@amd.com> Co-authored-by: Ace Eldeib <alexeldeib@gmail.com> Co-authored-by: Louie Tsai <louie.tsai@intel.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: sonusflow <git@sonusflow.pl> Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: 손세정 <maze0717@g.skku.edu> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Co-authored-by: yzong-rh <yzong@redhat.com> Co-authored-by: Yiyang "Ian" Liu <yiyangliu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: zexplorerhj <zhjoneson@163.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: ylangtsou <149562838+ylangtsou@users.noreply.github.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: fangyuchu <fangyuchu@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: Lanze 
Liu <86434077+liulanze@users.noreply.github.com> Co-authored-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: ruizhang <rza21.bc@gmail.com> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Furkan F <id+git@yufufi.com> Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com> Co-authored-by: Weida Hong <wdhongtw@google.com> Co-authored-by: mrjunwan-lang <mrjunwan@google.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Tobias Wasner <wasnertobias@users.noreply.github.com> Co-authored-by: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com> Co-authored-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: sychen52 <41452870+sychen52@users.noreply.github.com> Co-authored-by: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com> Co-authored-by: Shang Wang <shangw@nvidia.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: Juhi Mittal <39641197+juhi10071998@users.noreply.github.com> Co-authored-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com> Co-authored-by: Duncan Moss <djm.moss@gmail.com> Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: Ashwin 
Giridharan <ashwing@users.noreply.github.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com> Co-authored-by: Xiaogang Zhou <zhou16386@163.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com> Co-authored-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: GuangYaoZheng <popkart06@gmail.com> Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Hua Huang <huangh1994@outlook.com> Co-authored-by: Holegots <fuergaosi@gmail.com> Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: danisereb <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> Co-authored-by: Rotem Shavitt <rshavitt@gmail.com> Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: Zhewen Li <zhewenli@meta.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com> Co-authored-by: Sting Lin <sting.lin@cienet.com> Co-authored-by: Jie Fang <jief@nvidia.com> Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Ethan Feng <ethan.fengch@gmail.com> Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com> Co-authored-by: linzm1007 
<96732179+linzm1007@users.noreply.github.com> Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Angela Yi <yiangela7@gmail.com> Co-authored-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Nico Holmberg <nico.holmberg@amd.com> Co-authored-by: zhangtao2-1 <478679312@qq.com> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com> Co-authored-by: jatseng-ai <jatseng@amd.com> Co-authored-by: Minh Vu <vuhoangminh97@gmail.com>
1 parent 19107d0 commit cdd11a6

2,106 files changed

Lines changed: 275949 additions & 63846 deletions


.buildkite/ci_config.yaml

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ run_all_patterns:
   - "CMakeLists.txt"
   - "requirements/common.txt"
   - "requirements/cuda.txt"
+  - "requirements/kv_connectors.txt"
   - "requirements/build/cuda.txt"
   - "requirements/test/cuda.txt"
   - "setup.py"

.buildkite/hardware_tests/cpu.yaml

Lines changed: 37 additions & 7 deletions
@@ -12,15 +12,19 @@ steps:
   - vllm/_custom_ops.py
   - tests/kernels/attention/test_cpu_attn.py
   - tests/kernels/moe/test_cpu_fused_moe.py
+  - tests/kernels/moe/test_cpu_quant_fused_moe.py
   - tests/kernels/test_onednn.py
   - tests/kernels/test_awq_int4_to_int8.py
+  - tests/kernels/quantization/test_cpu_fp8_scaled_mm.py
   commands:
   - |
-    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
+    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
       pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
       pytest -x -v -s tests/kernels/moe/test_cpu_fused_moe.py
+      pytest -x -v -s tests/kernels/moe/test_cpu_quant_fused_moe.py
       pytest -x -v -s tests/kernels/test_onednn.py
-      pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py"
+      pytest -x -v -s tests/kernels/test_awq_int4_to_int8.py
+      pytest -x -v -s tests/kernels/quantization/test_cpu_fp8_scaled_mm.py"

 - label: CPU-Compatibility Tests
   depends_on: []
@@ -50,30 +54,45 @@ steps:
       pytest -x -v -s tests/models/language/generation -m cpu_model
       pytest -x -v -s tests/models/language/pooling -m cpu_model"

+- label: CPU-ModelRunnerV2 Tests
+  depends_on: []
+  device: intel_cpu
+  no_plugin: true
+  soft_fail: true
+  source_file_dependencies:
+  - vllm/v1/worker/cpu/
+  - vllm/v1/worker/gpu/
+  commands:
+  - |
+    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
+      uv pip install git+https://github.com/triton-lang/triton-cpu.git@270e696d
+      VLLM_USE_V2_MODEL_RUNNER=1 pytest -x -v -s tests/models/language/generation/test_granite.py -m cpu_model"
+
 - label: CPU-Quantization Model Tests
   depends_on: []
   device: intel_cpu
   no_plugin: true
   source_file_dependencies:
   - csrc/cpu/
   - vllm/model_executor/layers/quantization/cpu_wna16.py
-  - vllm/model_executor/layers/quantization/gptq_marlin.py
+  - vllm/model_executor/layers/quantization/auto_gptq.py
   - vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_int8.py
   - vllm/model_executor/layers/quantization/kernels/scaled_mm/cpu.py
   - vllm/model_executor/layers/quantization/kernels/mixed_precision/cpu.py
+  - vllm/model_executor/layers/fused_moe/experts/cpu_moe.py
   - tests/quantization/test_compressed_tensors.py
   - tests/quantization/test_cpu_wna16.py
   commands:
   - |
-    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 20m "
+    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 30m "
       pytest -x -v -s tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs
       pytest -x -v -s tests/quantization/test_cpu_wna16.py"

-- label: CPU-Distributed Tests
+- label: CPU-Distributed Tests (PP+TP)
   depends_on: []
   device: intel_cpu
   no_plugin: true
-  source_file_dependencies:
+  source_file_dependencies: &cpu_distributed_deps
   - csrc/cpu/shm.cpp
   - vllm/v1/worker/cpu_worker.py
   - vllm/v1/worker/gpu_worker.py
@@ -82,10 +101,21 @@ steps:
   - vllm/platforms/cpu.py
   - vllm/distributed/parallel_state.py
   - vllm/distributed/device_communicators/cpu_communicator.py
+  - .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh
+  commands:
+  - |
+    bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
+      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh tp_pp"
+
+- label: CPU-Distributed Tests (DP+TP)
+  depends_on: []
+  device: intel_cpu
+  no_plugin: true
+  source_file_dependencies: *cpu_distributed_deps
   commands:
   - |
     bash .buildkite/scripts/hardware_ci/run-cpu-test.sh 10m "
-      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh"
+      bash .buildkite/scripts/hardware_ci/run-cpu-distributed-smoke-test.sh dp_tp"

 - label: CPU-Multi-Modal Model Tests %N
   depends_on: []

.buildkite/hardware_tests/intel.yaml

Lines changed: 0 additions & 7 deletions
@@ -8,10 +8,3 @@ steps:
  commands:
  - bash .buildkite/scripts/hardware_ci/run-hpu-test.sh

- - label: "Intel GPU Test"
- depends_on: []
- soft_fail: true
- device: intel_gpu
- no_plugin: true
- commands:
- - bash .buildkite/scripts/hardware_ci/run-xpu-test.sh

.buildkite/image_build/image_build.sh

Lines changed: 1 addition & 0 deletions
@@ -192,6 +192,7 @@ export BUILDKITE_COMMIT
  export PARENT_COMMIT
  export IMAGE_TAG
  export IMAGE_TAG_LATEST
+ export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"
  export CACHE_FROM
  export CACHE_FROM_BASE_BRANCH
  export CACHE_FROM_MAIN
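
The added `export COMMIT=...` line relies on the shell's `${VAR:-default}` expansion: a value of `COMMIT` supplied by the caller is kept, and `BUILDKITE_COMMIT` is used only when `COMMIT` is unset or empty. A minimal sketch of that behavior, assuming a hypothetical helper `resolve_commit` that is not part of the commit:

```shell
# Hypothetical helper for illustration only; image_build.sh does this
# inline with `export COMMIT="${COMMIT:-${BUILDKITE_COMMIT}}"`.
resolve_commit() {
  # ${COMMIT:-...} falls back when COMMIT is unset OR set to the empty string.
  printf '%s\n' "${COMMIT:-${BUILDKITE_COMMIT}}"
}

# Usage (values are made up):
#   BUILDKITE_COMMIT=abc123 resolve_commit                  # prints abc123
#   COMMIT=deadbeef BUILDKITE_COMMIT=abc123 resolve_commit  # prints deadbeef
```

Using `:-` rather than `-` means an empty `COMMIT` is also replaced by the fallback, which is likely the safer choice when the variable may be exported but blank.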

.buildkite/image_build/image_build.yaml

Lines changed: 60 additions & 0 deletions
@@ -6,6 +6,48 @@ steps:
  timeout_in_minutes: 600
  commands:
  - if [[ "$BUILDKITE_BRANCH" == "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $IMAGE_TAG $IMAGE_TAG_LATEST; else .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $IMAGE_TAG; fi
+ # Non-root smoke 1: the default (root) image must still be importable
+ # under a non-root UID via `--user 2000:0`. Validates the `vllm` passwd
+ # entry + group-0-writable /home/vllm + uv path cleanup from #31959.
+ # Uses `import vllm` rather than `vllm serve --help` because the latter
+ # instantiates `VllmConfig` which requires a GPU attached to the
+ # container.
+ - docker run --rm --user 2000:0 --entrypoint python3 "$IMAGE_TAG" -c "import vllm; print(vllm.__version__)"
+ # Non-root smoke 2: assert the non-root enabling invariants are baked
+ # into the image. Runs as UID 2000:0 via a shell so we can verify
+ # filesystem perms + passwd/group file state + wrapper presence without
+ # triggering vLLM's GPU-requiring config-init path. The opt-in
+ # `vllm-openai-nonroot` target adds only `USER vllm`, `WORKDIR
+ # /home/vllm`, and an `ENTRYPOINT` override on top of these invariants;
+ # its build correctness is reviewed at the Dockerfile level. Wrapper
+ # logic is covered separately by the pre-commit hook
+ # `test-nonroot-entrypoint` (see .pre-commit-config.yaml).
+ - |
+ docker run --rm --user 2000:0 --entrypoint /bin/sh "$IMAGE_TAG" -ec '
+ if ! getent passwd 2000 | grep -q ^vllm:; then
+ echo FAIL: UID 2000 != vllm
+ exit 1
+ fi
+ if ! id -gn 2>/dev/null | grep -qx root; then
+ echo FAIL: GID 0 not root group
+ exit 1
+ fi
+ touch /home/vllm/.smoke && rm /home/vllm/.smoke
+ touch /opt/uv/cache/.smoke && rm /opt/uv/cache/.smoke
+ if ! test -x /usr/local/bin/vllm-nonroot-entrypoint.sh; then
+ echo FAIL: wrapper missing
+ exit 1
+ fi
+ if ! test -w /etc/passwd; then
+ echo FAIL: /etc/passwd not group-writable
+ exit 1
+ fi
+ if ! test -w /etc/group; then
+ echo FAIL: /etc/group not group-writable
+ exit 1
+ fi
+ echo non-root invariants OK
+ '
  retry:
  automatic:
  - exit_status: -1 # Agent was lost
@@ -56,3 +98,21 @@ steps:
  limit: 2
  - exit_status: -10 # Agent was lost
  limit: 2
+
+ - label: ":docker: Build arm64 image"
+ key: arm64-image-build
+ depends_on: []
+ source_file_dependencies:
+ - ".buildkite/image_build/image_build.yaml"
+ - ".buildkite/image_build/image_build_arm64.sh"
+ - "docker/Dockerfile"
+ commands:
+ - .buildkite/image_build/image_build_arm64.sh $REGISTRY $REPO $BUILDKITE_COMMIT
+ env:
+ DOCKER_BUILDKIT: "1"
+ retry:
+ automatic:
+ - exit_status: -1 # Agent was lost
+ limit: 2
+ - exit_status: -10 # Agent was lost
+ limit: 2

.buildkite/image_build/image_build_arm64.sh

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+ #!/bin/bash
+ set -e
+
+ if [[ $# -lt 3 ]]; then
+ echo "Usage: $0 <registry> <repo> <commit>"
+ exit 1
+ fi
+
+ REGISTRY=$1
+ REPO=$2
+ BUILDKITE_COMMIT=$3
+
+ # authenticate with AWS ECR
+ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true
+
+ # skip build if image already exists
+ if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-arm64) ]]; then
+ echo "Image not found, proceeding with build..."
+ else
+ echo "Image found"
+ exit 0
+ fi
+
+ # build (Grace/GH200 is the arm64 GPU target; sm_90)
+ docker build --file docker/Dockerfile \
+ --platform linux/arm64 \
+ --build-arg max_jobs=16 \
+ --build-arg nvcc_threads=4 \
+ --build-arg torch_cuda_arch_list="9.0" \
+ --build-arg USE_SCCACHE=1 \
+ --build-arg buildkite_commit="$BUILDKITE_COMMIT" \
+ --tag "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-arm64 \
+ --target test \
+ --progress plain .
+
+ # push
+ docker push "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-arm64
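
The "skip build if image already exists" step in this script keys on whether `docker manifest inspect` prints anything (`[[ -z $(...) ]]`), whereas `image_build_xpu.sh` in this same commit keys on the command's exit status (`if ! docker manifest inspect ... &> /dev/null`). A minimal sketch of the two styles, using hypothetical helper names (`needs_build`, `needs_build_by_status`) that do not appear in the commit:

```shell
# Hypothetical helpers for illustration; not part of the commit.

# Output-based check: "build needed" when the command prints nothing on
# stdout, mirroring `[[ -z $(docker manifest inspect ...) ]]`.
needs_build() {
  output=$("$@" 2>/dev/null) || true
  [ -z "$output" ]
}

# Status-based check: "build needed" when the command fails, mirroring
# `if ! docker manifest inspect ... &> /dev/null`.
needs_build_by_status() {
  ! "$@" >/dev/null 2>&1
}

# Usage (assumes docker is installed and the variables are set):
#   if needs_build docker manifest inspect "$REGISTRY/$REPO:$BUILDKITE_COMMIT-arm64"; then
#     echo "Image not found, proceeding with build..."
#   fi
```

The two agree whenever a successful inspect prints the manifest and a failed one prints nothing on stdout; they differ only if the command succeeds silently or fails while still writing to stdout.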

.buildkite/image_build/image_build_hpu.sh

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ REPO=$2
  BUILDKITE_COMMIT=$3

  # authenticate with AWS ECR
- aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
+ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true

  # skip build if image already exists
  if [[ -z $(docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-hpu) ]]; then

.buildkite/image_build/image_build_xpu.sh

Lines changed: 2 additions & 2 deletions
@@ -11,8 +11,8 @@ REPO=$2
  BUILDKITE_COMMIT=$3

  # authenticate with AWS ECR
- aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY"
- aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com
+ aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin "$REGISTRY" || true
+ aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com || true

  # skip build if image already exists
  if ! docker manifest inspect "$REGISTRY"/"$REPO":"$BUILDKITE_COMMIT"-xpu &> /dev/null; then
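
The `|| true` appended to the login lines matters if these scripts run under `set -e`, as the new arm64 script does; whether the HPU and XPU scripts also set it is an assumption here, since their headers are not shown in this diff. Under `set -e` a failed login would abort the script before the image check; with the guard, the failure is ignored and execution continues. A minimal sketch, where `false` stands in for a failing `docker login`:

```shell
# `false` is a stand-in for a failing login; each case runs in a child shell
# so that set -e cannot affect the caller.

# Without the guard: the failing step aborts the child script under set -e,
# so "continued" is never printed.
strict=$(sh -c 'set -e; false; echo continued' || true)

# With the guard: the failure is swallowed and the script carries on.
tolerant=$(sh -c 'set -e; false || true; echo continued')

echo "strict=[$strict] tolerant=[$tolerant]"
```

One consequence worth keeping in mind: with the guard in place, a genuine authentication problem no longer stops the script at the login step and would presumably surface later, for example at `docker push`.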

.buildkite/intel_jobs/lora_intel.yaml

Lines changed: 14 additions & 9 deletions
@@ -18,17 +18,18 @@ steps:
  - >-
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'cd tests &&
+ export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  pytest -v -s lora/test_layers.py &&
  pytest -v -s lora/test_lora_checkpoints.py &&
- (pytest -v -s lora/test_lora_functions.py --deselect="tests/lora/test_lora_functions.py::test_lora_functions_sync" --deselect="tests/lora/test_lora_functions.py::test_lora_functions_async" || true) &&
+ pytest -v -s lora/test_lora_functions.py &&
  pytest -v -s lora/test_lora_huggingface.py &&
  pytest -v -s lora/test_lora_manager.py &&
  pytest -v -s lora/test_lora_utils.py &&
  pytest -v -s lora/test_peft_helper.py &&
  pytest -v -s lora/test_resolver.py &&
  pytest -v -s lora/test_utils.py &&
- (pytest -v -s lora/test_add_lora.py --deselect="tests/lora/test_add_lora.py::test_add_lora" || true) &&
- (pytest -v -s lora/test_worker.py --deselect="tests/lora/test_worker.py::test_worker_apply_lora" || true)'
+ pytest -v -s lora/test_add_lora.py &&
+ pytest -v -s lora/test_worker.py'

  - label: LoRA Fused/MoE Kernels
  timeout_in_minutes: 45
@@ -46,8 +47,9 @@ steps:
  - >-
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'cd tests &&
+ export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  pytest -v -s lora/test_fused_moe_lora_kernel.py &&
- pytest -v -s lora/test_moe_lora_align_sum.py'
+ pytest -v -s lora/test_moe_lora_align_sum.py --deselect="tests/lora/test_moe_lora_align_sum.py::test_moe_lora_align_block_size_mixed_base_and_lora[1]"'

  - label: LoRA Punica Kernels
  timeout_in_minutes: 45
@@ -65,8 +67,9 @@ steps:
  - >-
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'cd tests &&
+ export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  set -o pipefail &&
- pytest -v -s lora/test_punica_ops.py --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-2-2049-64-32-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype1-2-64000-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-1-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-1-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-8-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype0-3-2049-128-8-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-8-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype1-1-2049-256-128-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-3-64256-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-2-29696-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-3-49408-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-2-16384-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-2-51328-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype1-1-102656-32-4-4]"'
+ pytest -v -s lora/test_punica_ops.py --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-3-43264-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype1-1-2049-64-128-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-1-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-1-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-256-8-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype0-3-2049-128-8-16]" --deselect="tests/lora/test_punica_ops.py::test_kernels[shrink-0-xpu:0-dtype0-1-2049-128-8-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels[expand-0-xpu:0-dtype1-1-2049-256-128-32]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-3-64256-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-2-29696-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype1-3-49408-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[shrink-0-xpu:0-dtype0-2-16384-32-4-4]" --deselect="tests/lora/test_punica_ops.py::test_kernels_hidden_size[expand-0-xpu:0-dtype0-2-51328-32-4-4]"'

  - label: LoRA Punica FP8/XPU Ops
  timeout_in_minutes: 45
@@ -84,6 +87,7 @@ steps:
  - >-
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'cd tests &&
+ export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  pytest -v -s lora/test_punica_ops_fp8.py &&
  pytest -v -s lora/test_punica_xpu_ops.py'

@@ -103,10 +107,12 @@ steps:
  - >-
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'cd tests &&
+ export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  (pytest -v -s lora/test_mixtral.py --deselect="tests/lora/test_mixtral.py::test_mixtral_lora[4]" || true) &&
  pytest -v -s lora/test_quant_model.py --deselect="tests/lora/test_quant_model.py::test_quant_model_lora[model0]" --deselect="tests/lora/test_quant_model.py::test_quant_model_lora[model1]" --deselect="tests/lora/test_quant_model.py::test_quant_model_tp_equality[model0]" &&
- pytest -v -s lora/test_qwen35_densemodel_lora.py &&
- pytest -v -s lora/test_transformers_model.py'
+ pytest -v -s lora/test_transformers_model.py &&
+ pytest -v -s lora/test_chatglm3_tp.py &&
+ pytest -s -v lora/test_minicpmv_tp.py'

  - label: LoRA Multimodal
  timeout_in_minutes: 45
@@ -124,7 +130,6 @@ steps:
  - >-
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'cd tests &&
+ export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  pytest -v -s lora/test_default_mm_loras.py &&
- (pytest -v -s lora/test_qwen3_unembed.py || true) &&
- (pytest -v -s lora/test_qwenvl.py || true) &&
  pytest -v -s lora/test_whisper.py'

.buildkite/intel_jobs/misc_intel.yaml

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ steps:
  bash .buildkite/scripts/hardware_ci/run-intel-test.sh
  'export VLLM_WORKER_MULTIPROC_METHOD=spawn &&
  cd tests &&
- pytest -v -s v1/logits_processors &&
+ pytest -v -s v1/logits_processors --ignore=v1/logits_processors/test_custom_online.py --ignore=v1/logits_processors/test_custom_offline.py &&
  pytest -v -s v1/test_oracle.py &&
  pytest -v -s v1/test_request.py &&
  pytest -v -s v1/test_outputs.py'
