Commit cdd11a6
Merge upstream vLLM code into gfx11 (#983)
* [XPU] add gptq(int4) support (#37844)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* [UX] Add a persistent cache for FlashInfer autotuning (#42537)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
* [Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* add cutedsl dsv4 indexer fp8 kernel (#42899)
Signed-off-by: george <george@inferact.ai>
Co-authored-by: george <george@inferact.ai>
* [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289)
Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist <noreply@google.com>
* [ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671)
Signed-off-by: junyanxu <junyanxu5513@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [Docs] Add SVG images for pooling models. (#42626)
Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [XPU] Use custom op collective behavior (#41354)
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* [Misc] Aligning tokwise pooler heads for consistency (#43041)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
* [Docs] Reorganize online serving docs. (#41907)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Frontend] Consolidate beam search by BeamSearchMixin. (#42946)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
* [Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* [bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117)
Signed-off-by: hao-aaron <ahao@anyscale.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [KVConnector][DSV4] HMA support for Mooncake store connector (#42828)
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
* [Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* [Misc][MM] Remove redundant code in CLIPAttention (#43046)
Signed-off-by: shen-shanshan <467638484@qq.com>
* [CI] Add MTP + PD disagg test for Qwen3.5 (#42677)
Signed-off-by: ZhanqiuHu <zhu@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
* [Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887)
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
* [Perf][4/n] Eliminate various GPU<->CPU syncs (#42347)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [XPU] update xpu graph usage (#43043)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
* [Model] OpenVLA support (#42654)
Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
* [Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
* [Misc] add humming to dependencies (#42540)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
* [feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* [Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* [Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
* [CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [ci] Move language models tests (hybrid) back to L4 (#43129)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
* [Model] Support post-norm architecture for EAGLE-3 speculators (#42764)
Signed-off-by: Doğaç Eldenk <dogacel@gmail.com>
* Fix error in Dynamic NTK scaling (#41277)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
* [CPU][DOC] Fix installation commands for Arm CPUs (#43115)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
* [bug] fix WeightTransferConfig.backend to allow for all strings (#43121)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
* [MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Cohere] Enable Cohere MoE (#43143)
Signed-off-by: Terrencezzj <terrence@cohere.ai>
* [Perf][Bugfix] Update dflash aux layer indexing (#40727)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
* add enqueue all option to throughput benchmark (#42975)
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [Perf] Avoid forward scan for async output placeholders (#42938)
* [CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111)
Signed-off-by: mgoin <mgoin64@gmail.com>
* [KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076)
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
* [ci] Revert model executor test back to L4 (#43188)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
* [Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
* Enable mermaid diagrams in the docs (#43192)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [GDN] Enable FI Blackwell GDN prefill kernel (#40717)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
* [XPU][CI] Add 2 server model test files in Intel GPU CI (#42499)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
* [Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330)
Signed-off-by: hallerite <git@hallerite.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [Doc] Sync CLI guide with actual help modes and launch subcommand (#40326)
Signed-off-by: Rui Wang <raygorous@gmail.com>
Co-authored-by: Rui Wang <raygorous@gmail.com>
* [Feature] Support manually enabling the cumem allocator (#33648)
Signed-off-by: Kebe <mail@kebe7jun.com>
* [Spec Decode] Support non-MTP speculation for NemotronH (#43130)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
* Remove additional dead code as a follow-up to #42889 (#43144)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
* [Bug][Structured Outputs] Fix bug that leads to unconstrained generations with structural tags (#42452)
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* [Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215)
Signed-off-by: j9smith <j.smith9103@outlook.com>
Signed-off-by: Joel Smith <j.smith9103@outlook.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
* [Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
* [MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
* [R3] Add routed experts to openai entrypoint (#38939)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186)
Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
* Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082)
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* [Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
* [Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
* Add dllehr-amd to CODEOWNERS and committers list (#42772)
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com>
* [Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135)
Signed-off-by: mgoin <mgoin64@gmail.com>
* [Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
* [ROCm] Add QuickReduce min-size override and codec threshold (#41675)
* [CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255)
Signed-off-by: Ace Eldeib <aeldeib@coreweave.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
* [Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
* update GPU json file based on h200 recipes (#43262)
Signed-off-by: louie-tsai <louie.tsai@intel.com>
* [Minor] Bigger overlap for FI AR (#43103)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
* [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329)
Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169)
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
* [CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082)
Signed-off-by: haosdent <haosdent@gmail.com>
* [Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
* [ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973)
Signed-off-by: AAISSJ <maze0717@g.skku.edu>
Signed-off-by: sejung-son <sejung.son@nhn.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local>
Co-authored-by: sejung-son <sejung.son@nhn.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
* [Frontend][RFC] Rust front-end integration (#40848)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
* [Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905)
Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
* [Benchmark] Add num-warmup to vllm bench throughput (#43245)
Signed-off-by: Yifan Zong <yzong@redhat.com>
* [Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601)
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
* [CI] De-flake test_models for bigscience/bloom-560m (#43197)
Signed-off-by: haosdent <haosdent@gmail.com>
* [XPU] add setuptools-rust for xpu dependency (#43287)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
* Update KDA chunk prefill decay to use exp2 semantics (#43195)
Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
* Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223)
Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
* [Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
* [CI] Pin protoc binary in rust-build stages (#43292)
Signed-off-by: haosdent <haosdent@gmail.com>
* [XPU][CI] Fix Docker image pull-to-run race in Intel GPU CI (#43266)
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943)
Signed-off-by: velonica0 <like@mail.nankai.edu.cn>
Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
* [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172)
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [CI] Fix CPU tests failing on `tl.exp2` import (#43311)
Signed-off-by: haosdent <haosdent@gmail.com>
* [Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079)
Signed-off-by: anish <anishesg@users.noreply.github.com>
Signed-off-by: Your Name <ak8686@princeton.edu>
Signed-off-by: anish <145943060+anishesg@users.noreply.github.com>
Co-authored-by: anish <anishesg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
* [Perf] `zeros` -> `empty` to remove additional fill (#42988)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [Core] Add native ModelExpress load format (#43105)
Signed-off-by: Zheng Luo <zheluo@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* Disable build isolation to bypass CUDA related deps for vllm-tpu (#43038)
Signed-off-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
* [Frontend] Rework fastokens integration (#43168)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968)
Signed-off-by: fangyuchu <fangyuchu@qq.com>
Signed-off-by: zWaNg3 <389750525@qq.com>
Co-authored-by: zWaNg3 <389750525@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [BugFix] Use correct logprobs for `logprob_token_ids` (#43125)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873)
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
* [Rust Frontend] Move code from `vllm-frontend-rs` (#43283)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com>
Signed-off-by: Will.hou <1205157517@qq.com>
Signed-off-by: Will.hou <willamhou@ceresman.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Eric Curtin <eric.curtin@docker.com>
Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com>
Co-authored-by: Will.hou <1205157517@qq.com>
Co-authored-by: Will.hou <willamhou@ceresman.com>
Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history.
* [CI] Fix dockerfile dependency graph failure for pre-commit (#43378)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855)
Signed-off-by: Chengze Fan <chengze@meta.com>
Signed-off-by: Chengze Fan <fancz2002@gmail.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
* [ROCm] Add XGMI backend for MoRI Connector (#41753)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* [ROCm][CI] add warmup to mem_util test before measurement (#43236)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
* [Frontend] Add truncation side to OpenAI endpoints (#43260)
Signed-off-by: Rui Zhang <rza21.bc@gmail.com>
Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net>
Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net>
* [Frontend] DP Supervisor (#40841)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
* [Bugfix] Make CuMemAllocator free callback stream-aware (#43020)
Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>
* [XPU] Enable multiple key kernels for sparse attention (#37888)
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* [CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064)
Signed-off-by: haosdent <haosdent@gmail.com>
* [Model] Use `AutoWeightsLoader` for Voyage (#42972)
Signed-off-by: Furkan Fidan <dev@yufufi.com>
* [Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213)
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* [CI] Fix test_lora_with_spec_decode on V2 model runner (#43314)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296)
Signed-off-by: haosdent <haosdent@gmail.com>
* Correcting the mock classes for MM GC tests (#43321)
Signed-off-by: Weida Hong <wdhongtw@google.com>
* [BugFix] Fix setuptools-rust dep in requirements files (#43377)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* Fix the docker build failure in tpu-inference (#43360)
Signed-off-by: mrjunwan-lang <mrjunwan@google.com>
* [Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [CPU] Experimentally enable Triton and MRV2 (#43225)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
* [Attention] Mamba attention module refactor (#41126)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
* [XPU] feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951)
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* [Misc] Replace assert with proper exceptions for security and validation in pooling (#43286)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
* [Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001)
Signed-off-by: Tobias Wasner <wasnertobias@gmail.com>
* [BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118)
Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
* [LoRA] Reduce memory of 2D weights when EP is set (#42737)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
* [EPLB] Change default EPLB communicator (#43110)
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
* [CI] Fix AMD docker build tests (#43329)
Signed-off-by: haosdent <haosdent@gmail.com>
* Add NVFP4 MOE support for Deepseek V4. (#42209)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
* [Multimodal] Simplify ViT CUDA graph interfaces (#41234)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
* [Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650)
Signed-off-by: zhanda <zhandazhu@gmail.com>
Co-authored-by: Shang Wang <shangw@nvidia.com>
* [KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149)
* [Frontend] Simplify AuthenticationMiddleware path extraction (#43426)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [RFC][EPLB][#32028] Remove dead torch.accelerator.synchronize() from sync path (#40733)
Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com>
Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com>
* [Bugfix] Detect wrong libcute_dsl_runtime.so variant in FlashInfer GDN (#43427)
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
* [Bugfix] Clear error message for FP8 torchao quantization on unsupported GPUs (#36854)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* mhc_post - remove sts & add vectorized copies (#43437)
Signed-off-by: george <george@inferact.ai>
Co-authored-by: george <george@inferact.ai>
* [Quantization][ModelOpt] W4A16 NVFP4 fused MoE + mixed-precision dispatch (#42566)
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
* [Model Runner V2] Support sharing kv cache layers (#35045)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
* DSv4 fused Q-norm kernel grid refactor (#42353)
* [Perf] Optimize hidden state extraction logic (#37374)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [XPU] fix: add XPU platform guards to DeepSeek-V4 ops (#42950)
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* elastic_ep: stage/commit MoE quant method on reconfigure (#40881)
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
* [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend (#38822)
* Add `model` to `WeightTransferEngine.__init__` (#42922)
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [DSV4] More multi-stream enablement for c4a (#42925)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
* [ROCm][CI] Stabilize runner teardown between sampler tests (#43023)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [ROCm][CI] Stabilize Granite tool-use and test URL construction (#43017)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Bugfix] Auto-raise max_num_batched_tokens for prefix-LM multimodal models (#43051)
Signed-off-by: Ashwin Giridharan <girida@amazon.com>
Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>
* [ROCm][CI] Fix ROCm LoRA Transformers fallback with full CUDA graphs (#41577)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [XPU] feat: enable FP8 block-scaled quantization on XPU (#42952)
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* [XPU] reduce host overhead of XPU MOE (#42915)
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
* [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
* [Misc] Added missing return type annotations to improve mypy and IDE tooling (#43383)
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
* [Bugfix] Fix native Triton top-k/top-p kernel assumes contiguous logi… (#42739)
Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com>
Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com>
* [ModelOpt] Support Qwen3.5/3.6 VLM quantized prefix mapping (#42546)
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
* Keep scheduler alive for delayed KV connector frees (#43433)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
* fix(eagle3): read norm_before_fc from eagle_config for NVIDIA checkpoint (#42143)
Signed-off-by: FERRARIZHENG <popkart06@gmail.com>
* [Kernel] Batch invariant NVFP4 linear using cutlass (#39912)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
* [ROCm][CI] Remove benchmarks test group and shard long test groups (#41669)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Bugfix][Frontend] Fix input_audio parsing when uuid is present (#43414)
Signed-off-by: ffggs <314137448@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787)
Signed-off-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
* [Docs] Fix stale version number in token_embed.md (#43488)
Signed-off-by: holegots <ikun3.1415927@gmail.com>
* [Docs] Fix stale version number in token_classify.md (#43489)
Signed-off-by: holegots <ikun3.1415927@gmail.com>
* [MoE] Migrate W4A8 CT to oracle kernel setup (#42680)
Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
* [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392)
* [ROCm][Critical] Fix the GDN import bug (#43486)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* Revert "[Misc] add humming to dependencies" (#43492)
* [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
* [Model Runner v2] Force v1 runner for tests (#43233)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
* [KV Connector] Keep MooncakeStore full hits block-aligned (#43494)
Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* [kv_offload]: Add DSv4 support (#43142)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
* [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
* Tuning script and configs for Triton Mamba SSU kernel (#43083)
Signed-off-by: Banani Ghosh <bg2502@nyu.edu>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Co-authored-by: Banani Ghosh <bg2502@nyu.edu>
* File system secondary tier implemented in python (#41735)
Signed-off-by: Rotem Shavitt <rshavitt@gmail.com>
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
* [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
* fix: MoE model using shared routed experts crashes on AMD GPUs (#42373)
Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io>
* [Docs] Reorganize offline inference docs. (#43552)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275)
Signed-off-by: TheDuyIT <nduy250299@gmail.com>
Signed-off-by: dtnguyen <dtnguyen@nvidia.com>
Co-authored-by: Claude <noreply@anthropic.com>
* [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296)
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
* [Doc] Add section on escalating stalled contributions (#43568)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
* Reduce memory usage for granite_speech. (#42933)
Signed-off-by: Yihuki <wangbovbvb@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [KV Connector] Handle Mooncake finish after preemption (#43281)
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
* [Misc] Print accuracy value for PD tests even on success (#43583)
Signed-off-by: NickLucche <nlucches@redhat.com>
* [Kernel] Remove NormGateLinear (#43554)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
* [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028)
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [LoRA] Add one shot triton kernel For MoE LoRA (#42290)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516)
Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* [KV Connector] Propagate MooncakeStore load failures (#42788)
Signed-off-by: Dao Le <Dao007forever@gmail.com>
* [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194)
Signed-off-by: Yan Ma <yan.ma@intel.com>
* [Frontend] Split the offline inference APIs and utils. (#43553)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579)
Signed-off-by: QingZhou-YangHY <3868850350@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* [GDN] GDN Prefill kernel for SM100 (#43273)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
* [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032)
Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com>
* Upgrade tpu-inference to v0.20.0 (#43394)
* Add CuTe DSL sparse compressor support (#43584)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
* [chores][log] change registry log from `warning` to `debug` (#43045)
Signed-off-by: Hank <hcc.mayday@gmail.com>
* [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482)
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* [KV Transfer] Enable HMA by default for connectors that support it (#41847)
Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
* [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
* [Misc] Support interleaved custom image benchmark datasets (#43636)
Signed-off-by: ThibaultCastells <thib.castells@icloud.com>
* [Reasoning] [Bugfix] Reject invalid thinking_token_budget values (#43402)
Signed-off-by: linzm1007 <linzm1007@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [Model] Use AutoWeightsLoader for InternLM2 (#38278)
Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com>
Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646)
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
* Fix CuPy runtime deps and restore humming (#43530)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603)
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [ROCm][CI] Extend ROCm quick reduce coverage (#40990)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162)
* [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [MoE Refactor] W4a8 int8 oracle (#42789)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
* [ROCm] Remove MegaMoE integration in deepseek v4 (#43629)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* Add LM head quantization support for ModelOpt (#42124)
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
* [Doc] Add line limit to AGENTS.md (#43635)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
* [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* [CI] Soft-fail AMD entrypoints mirror tests (#43709)
Signed-off-by: Kevin Luu <kevin@inferact.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410)
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
* [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627)
Signed-off-by: Zhewen Li <zhewen@inferact.ai>
Co-authored-by: Zhewen Li <zhewen@inferact.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [ci] Add arm64 ci image (#41303)
Signed-off-by: khluu <khluu000@gmail.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543)
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
* [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
* [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
* [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677)
Signed-off-by: Xin Yang <xyangx@amazon.com>
* [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
* [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
* [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [DSv4] Refactor compressor & Fix ROCm compatibility (#43710)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
* Fix test_aot_compile for torch 2.12 (#43695)
Signed-off-by: Angela Yi <yiangela7@gmail.com>
* [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694)
Signed-off-by: aoshen524 <aoshen524@gmail.com>
Signed-off-by: Ao Shen <aoshen@inferact.ai>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177)
Signed-off-by: nholmber <nholmber@users.noreply.github.com>
* [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [Frontend] Add MiniCPM5 XML tool call parser (#43175)
Signed-off-by: zhangtao <zhangtao2@modelbest.cn>
Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn>
Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
* [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833)
Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
* [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550)
Signed-off-by: Aditya Singh <adisin650@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
* [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662)
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
* [Docs] Fix MLA prefill backend default docs (#43697)
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
* [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731)
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
* [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401)
Signed-off-by: Ashwin Giridharan <girida@amazon.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
* [misc] Bump cutedsl version to 4.5.2 (#43745)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
* [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155)
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
* [Docs] Fix the duplicate doc icon issue (#43546)
Signed-off-by: chunyang.wen <chunyang.wen@gmail.com>
* Fix early CUDA init (#43791)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751)
Signed-off-by: jatseng-ai <jatseng@amd.com>
* [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
* [Quantization] Fix Humming RoutedExperts import (#43540)
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
* [CI] build-rocm-wheels.yml: reduce MAX_JOBS to prevent OOM
Signed-off-by: <callumm@amd.com>
---------
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: george <george@inferact.ai>
Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Signed-off-by: junyanxu <junyanxu5513@gmail.com>
Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: ZhanqiuHu <zhu@redhat.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: Doğaç Eldenk <dogacel@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Terrencezzj <terrence@cohere.ai>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Philip Maybank <pmaybank@amd.com>
Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
Signed-off-by: hallerite <git@hallerite.com>
Signed-off-by: Rui Wang <raygorous@gmail.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
Signed-off-by: j9smith <j.smith9103@outlook.com>
Signed-off-by: Joel Smith <j.smith9103@outlook.com>
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: haosdent <haosdent@gmail.com>
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: <>
Signed-off-by: Ace Eldeib <aeldeib@coreweave.com>
Signed-off-by: louie-tsai <louie.tsai@intel.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: AAISSJ <maze0717@g.skku.edu>
Signed-off-by: sejung-son <sejung.son@nhn.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Signed-off-by: Isotr0py <Isotr0py@outlook.com>
Signed-off-by: velonica0 <like@mail.nankai.edu.cn>
Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Signed-off-by: anish <anishesg@users.noreply.github.com>
Signed-off-by: Your Name <ak8686@princeton.edu>
Signed-off-by: anish <145943060+anishesg@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Zheng Luo <zheluo@nvidia.com>
Signed-off-by: Ylang Tsou <ylangt@google.com>
Signed-off-by: fangyuchu <fangyuchu@qq.com>
Signed-off-by: zWaNg3 <389750525@qq.com>
Signed-off-by: Lanze Liu <lanzetech@gmail.com>
Signed-off-by: Chengze Fan <chengze@meta.com>
Signed-off-by: Chengze Fan <fancz2002@gmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Rui Zhang <rza21.bc@gmail.com>
Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Furkan Fidan <dev@yufufi.com>
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Signed-off-by: Weida Hong <wdhongtw@google.com>
Signed-off-by: mrjunwan-lang <mrjunwan@google.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Signed-off-by: Tobias Wasner <wasnertobias@gmail.com>
Signed-off-by: Keyi Li <likey6688@gmail.com>
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
Signed-off-by: zhanda <zhandazhu@gmail.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com>
Signed-off-by: Juhi Mittal <juhim@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Ashwin Giridharan <girida@amazon.com>
Signed-off-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Signed-off-by: FERRARIZHENG <popkart06@gmail.com>
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Signed-off-by: ffggs <314137448@qq.com>
Signed-off-by: Hua Huang <huah@nvidia.com>
Signed-off-by: holegots <ikun3.1415927@gmail.com>
Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Banani Ghosh <bg2502@nyu.edu>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Rotem Shavitt <rshavitt@gmail.com>
Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io>
Signed-off-by: TheDuyIT <nduy250299@gmail.com>
Signed-off-by: dtnguyen <dtnguyen@nvidia.com>
Signed-off-by: esmeetu <jasonailu87@gmail.com>
Signed-off-by: Yihuki <wangbovbvb@gmail.com>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: QingZhou-YangHY <3868850350@qq.com>
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com>
Signed-off-by: Hank <hcc.mayday@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
Signed-off-by: ThibaultCastells <thib.castells@icloud.com>
Signed-off-by: linzm1007 <linzm1007@126.com>
Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com>
Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com>
Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Kevin Luu <kevin@inferact.ai>
Signed-off-by: Zhewen Li <zhewen@inferact.ai>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Signed-off-by: khluu <khluu000@gmail.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Angela Yi <yiangela7@gmail.com>
Signed-off-by: aoshen524 <aoshen524@gmail.com>
Signed-off-by: Ao Shen <aoshen@inferact.ai>
Signed-off-by: nholmber <nholmber@users.noreply.github.com>
Signed-off-by: zhangtao <zhangtao2@modelbest.cn>
Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn>
Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
Signed-off-by: Aditya Singh <adisin650@gmail.com>
Signed-off-by: Injae Ryou <injaeryou@gmail.com>
Signed-off-by: chunyang.wen <chunyang.wen@gmail.com>
Signed-off-by: jatseng-ai <jatseng@amd.com>
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
Signed-off-by: <callumm@amd.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: gnovack <gnovack@amazon.com>
Co-authored-by: george <george@inferact.ai>
Co-authored-by: Qiuyang Yue <yueqiuyang1389@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist <noreply@google.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: Junyan Xu <junyanxu5513@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com>
Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com>
Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Aaron Hao <ahao@anyscale.com>
Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com>
Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Co-authored-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Dao007forever <dao007forever@gmail.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: pmaybank <113125070+pmaybank@users.noreply.github.com>
Co-authored-by: Izik Golan <47969623+izikgo@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Artem Perevedentsev <aperevedents@nvidia.com>
Co-authored-by: xiangdong <40376367+zxd1997066@users.noreply.github.com>
Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Ray Wang <roguerui6@gmail.com>
Co-authored-by: Rui Wang <raygorous@gmail.com>
Co-authored-by: Kebe <mail@kebe7jun.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Joel Smith <j.smith9103@outlook.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: haosdent <haosdent@gmail.com>
Co-authored-by: meena-at-work <80416898+meena-at-work@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: akii96 <aakif.nawaz@amd.com>
Co-authored-by: Ace Eldeib <alexeldeib@gmail.com>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: sonusflow <git@sonusflow.pl>
Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: 손세정 <maze0717@g.skku.edu>
Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local>
Co-authored-by: sejung-son <sejung.son@nhn.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com>
Co-authored-by: yzong-rh <yzong@redhat.com>
Co-authored-by: Yiyang "Ian" Liu <yiyangliu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: zexplorerhj <zhjoneson@163.com>
Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com>
Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Co-authored-by: Isotr0py <Isotr0py@outlook.com>
Co-authored-by: velonica0 <47554626+velonica0@users.noreply.github.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Francesco Fusco <ffu@zurich.ibm.com>
Co-authored-by: anish <145943060+anishesg@users.noreply.github.com>
Co-authored-by: anish <anishesg@users.noreply.github.com>
Co-authored-by: Zheng Luo <zheluo@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: ylangtsou <149562838+ylangtsou@users.noreply.github.com>
Co-authored-by: Ylang Tsou <ylangt@google.com>
Co-authored-by: fangyuchu <fangyuchu@qq.com>
Co-authored-by: zWaNg3 <389750525@qq.com>
Co-authored-by: Lanze Liu <86434077+liulanze@users.noreply.github.com>
Co-authored-by: Chengze Fan <fancz2002@gmail.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: ruizhang <rza21.bc@gmail.com>
Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net>
Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com>
Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com>
Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Furkan F <id+git@yufufi.com>
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com>
Co-authored-by: Weida Hong <wdhongtw@google.com>
Co-authored-by: mrjunwan-lang <mrjunwan@google.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Tobias Wasner <wasnertobias@users.noreply.github.com>
Co-authored-by: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: sychen52 <41452870+sychen52@users.noreply.github.com>
Co-authored-by: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com>
Co-authored-by: Shang Wang <shangw@nvidia.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com>
Co-authored-by: Juhi Mittal <39641197+juhi10071998@users.noreply.github.com>
Co-authored-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com>
Co-authored-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Ashwin Giridharan <ashwing@users.noreply.github.com>
Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com>
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com>
Co-authored-by: Xiaogang Zhou <zhou16386@163.com>
Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: GuangYaoZheng <popkart06@gmail.com>
Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Co-authored-by: ffggs <314137448@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Hua Huang <huangh1994@outlook.com>
Co-authored-by: Holegots <fuergaosi@gmail.com>
Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: danisereb <daserebrenik@nvidia.com>
Co-authored-by: Banani Ghosh <bg2502@nyu.edu>
Co-authored-by: Rotem Shavitt <rshavitt@gmail.com>
Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com>
Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com>
Co-authored-by: Roy Wang <jasonailu87@gmail.com>
Co-authored-by: Yihuki <wangbovbvb@gmail.com>
Co-authored-by: Zhewen Li <zhewenli@meta.com>
Co-authored-by: Zhewen Li <zhewenli@inferact.ai>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com>
Co-authored-by: Sting Lin <sting.lin@cienet.com>
Co-authored-by: Jie Fang <jief@nvidia.com>
Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com>
Co-authored-by: Yubo Wang <yubowang2019@gmail.com>
Co-authored-by: Ethan Feng <ethan.fengch@gmail.com>
Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com>
Co-authored-by: linzm1007 <96732179+linzm1007@users.noreply.github.com>
Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Zhewen Li <zhewen@inferact.ai>
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Angela Yi <yiangela7@gmail.com>
Co-authored-by: aoshen02 <aoshen@inferact.ai>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Nico Holmberg <nico.holmberg@amd.com>
Co-authored-by: zhangtao2-1 <478679312@qq.com>
Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com>
Co-authored-by: Injae Ryou <injaeryou@gmail.com>
Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com>
Co-authored-by: jatseng-ai <jatseng@amd.com>
Co-authored-by: Minh Vu <vuhoangminh97@gmail.com>
1 parent 19107d0 commit cdd11a6
2,106 files changed: 275,949 additions & 63,846 deletions
File tree
- .buildkite
- hardware_tests
- image_build
- intel_jobs
- lm-eval-harness
- performance-benchmarks/tests
- scripts
- hardware_ci
- lib
- scheduled_integration_test
- tool_call
- test_areas
- .github
- workflows
- scripts
- benchmarks
- attention_benchmarks
- configs
- kernels
- cpu
- multi_turn
- cmake
- external_projects
- csrc
- attention
- core
- cpu
- sgl-kernels
- cutlass_extensions
- libtorch_stable
- attention
- mla
- cutlass_sm100_mla
- device
- kernel
- mamba
- quantization
- awq
- fp4
- fused_kernels
- gguf
- gptq_allspark
- gptq
- hadamard/hadacore
- w8a8
- cutlass
- c3x
- fp8
- int8
- moe
- marlin_moe_wna16
- quantization
- fused_kernels
- marlin
- w8a8/fp8
- docker
- entrypoints
- docs
- assets
- contributing
- models/pooling_models
- training
- benchmarking
- cli
- launch
- configuration
- contributing
- ci
- model
- deployment
- frameworks
- integrations
- design
- examples
- features
- quantization
- speculative_decoding
- getting_started
- installation
- governance
- mkdocs/hooks
- models
- extensions
- hardware_supported_models
- pooling_models
- serving
- integrations
- online_serving
- training
- weight_transfer
- usage
- examples
- applications
- chatbot
- rag
- deployment
- chart-helm
- templates
- tests
- disaggregated
- disaggregated_encoder
- disaggregated_serving
- ec_both_encoder
- example_connector
- flexkv_connector
- kv_load_failure_recovery_offline
- lmcache
- disagg_prefill_lmcache_v1
- configs
- mooncake_connector
- p2p_nccl_xpyd
- features
- automatic_prefix_caching
- batch_invariance
- context_extension
- data_parallel
- kv_events
- logits_processor
- lora
- openai_batch
- pause_resume
- profiling
- prompt_embed
- reset_kv
- sharded_state
- speculative_decoding
- structured_outputs
- torchrun
- generate
- multimodal
- qwen2_5_omni
- qwen3_omni
- observability/dashboards
- online_serving
- pooling
- embed/openai_embedding_long_text
- score/template
- ray_serving
- elastic_ep
- reasoning
- rl
- speech_to_text
- lid
- openai
- realtime
- tool_calling
- requirements
- build
- test
- rust
- proto
- src
- chat
- examples
- src
- backend
- multimodal
- output
- default
- harmony
- parser
- reasoning
- tool
- renderer
- deepseek_v32
- fixtures
- deepseek_v4
- fixtures
- hf
- tests
- templates
- vllm_examples
- cmd
- examples
- src
- cli
- engine-core-client
- examples
- src
- client
- coordinator
- protocol
- logprobs
- tests
- llm
- examples
- src
- tests
- managed-engine
- src
- metrics
- src
- reasoning-parser
- src
- server
- examples
- src
- grpc
- middleware
- routes
- inference
- generate
- openai
- chat_completions
- completions
- utils
- text
- src
- backend
- hf
- output
- tokenizer
- benches
- src
- tool-parser
- benches
- utils
- src
- deepseek_dsml
- deepseek_json
- glm_xml
- json
- tests
- benchmarks
- compile
- correctness_e2e
- fullgraph
- fusions_e2e
- h100
- passes
- distributed
- ir
- config
- distributed
- engine
- entrypoints
- llm
- openai
- chat_completion
- completion
- parser
- responses
- pooling
- scoring
- serve
- disagg
- instrumentator
- tokenize
- speech_to_text
- correctness
- realtime
- transcription
- translation
- weight_transfer
- evals
- gpt_oss/configs
- gsm8k/configs
- moe-refactor
- mrcr
- configs
- ir
- kernels
- attention
- core
- helion
- ir
- mamba
- moe
- modular_kernel_tools
- quantization
- lora
- model_executor
- layers
- model_loader
- tensorizer_loader
- models
- language
- generation
- pooling
- multimodal
- generation
- vlm_utils
- pooling
- processing
- quantization
- multimodal
- parser
- plugins_tests
- quantization
- reasoning
- renderers
- spec_decode
- tokenizers_
- fixtures/deepseek_v4
- tool_parsers
- tool_use
- tools
- utils_
- v1
- attention
- core
- cudagraph
- determinism
- distributed
- e2e
- general
- spec_decode
- ec_connector/integration
- engine
- kv_connector
- nixl_integration
- unit
- offloading_connector
- kv_offload
- cpu
- logits_processors
- metrics
- sample
- simple_kv_offload
- spec_decode
- streaming_input
- structured_output
- worker
- weight_loading
- tools
- pre_commit
- vllm-tpu
- vllm
- assets
- benchmarks
- datasets
- lib
- compilation
- passes
- fusion
- ir
- utility
- config
- cute_utils
- device_allocator
- distributed
- device_communicators
- elastic_ep
- eplb
- kv_transfer
- kv_connector
- v1
- mooncake
- store
- moriio
- nixl
- offloading
- p2p
- weight_transfer
- engine
- entrypoints
- anthropic
- cli
- generate
- beam_search
- openai
- chat_completion
- completion
- engine
- generate
- models
- parser
- responses
- speech_to_text
- pooling
- base
- classify
- embed
- pooling
- scoring
- serve
- disagg
- lora
- render
- rlhf
- speech_to_text
- base
- realtime
- transcription
- translation
- inputs
- ir
- ops
- kernels
- helion
- configs/silu_mul_fp8
- ops
- triton
- lora
- layers
- ops/triton_ops
- punica_wrapper
- model_executor
- kernels
- linear
- mixed_precision
- mxfp4
- nvfp4
- scaled_mm
- mhc
- layers
- attention
- fla/ops
- fused_moe
- configs
- experts
- oracle
- prepare_finalize
- router
- runner
- mamba
- gdn
- ops
- configs/selective_state_update
- cpu
- gdn_chunk_cutedsl
- minimax_rms_norm
- pooler
- seqwise
- tokwise
- quantization
- compressed_tensors
- compressed_tensors_moe
- schemes
- online
- quark
- schemes
- turboquant
- utils
- rotary_embedding
- model_loader
- reload
- models
- transformers
- warmup
- models
- deepseek_v4
- amd
- common
- ops
- nvidia
- ops
- multimodal
- media
- processing
- parser
- platforms
- reasoning
- renderers
- tokenizers
- tool_parsers
- transformers_utils
- chat_templates
- configs
- speculators
- processors
- triton_utils
- utils
- v1
- attention
- backends
- mla
- prefill
- ops
- core
- sched
- engine
- executor
- kv_offload
- cpu
- policies
- tiering
- example
- fs
- worker
- metrics
- pool
- sample
- logits_processor
- ops
- simple_kv_offload
- spec_decode
- structured_output
- worker
- cpu
- gpu
- model_states
- sample
- spec_decode
- eagle