
Commit d48df58

sfc-gh-mhidayetoglu, b8zhong, gau-nernst, WoosukKwon, and chun37 authored
v0.8.1_ulysses_shapeshifter -> v0.8.4 ulysses (#13)
* [V1] Fix: make sure `k_index` is int64 for `apply_top_k_only` (vllm-project#15907) Signed-off-by: Brayden Zhong <[email protected]>
* [Bugfix] Fix imports for MoE on CPU (vllm-project#15841) Signed-off-by: Thien Tran <[email protected]>
* [V1][Minor] Enhance SpecDecoding Metrics Log in V1 (vllm-project#15902) Signed-off-by: Woosuk Kwon <[email protected]>
* [Doc] Update rocm.inc.md (vllm-project#15917) Signed-off-by: chun37 <[email protected]>
* [V1][Bugfix] Fix typo in MoE TPU checking (vllm-project#15927) Signed-off-by: Roger Wang <[email protected]>
* [Benchmark]Fix error message (vllm-project#15866) Signed-off-by: wangli <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Misc] Replace print with logger (vllm-project#15923) Signed-off-by: chaunceyjiang <[email protected]>
* [CI/Build] Further clean up LoRA tests (vllm-project#15920) Signed-off-by: Jee Jee Li <[email protected]>
* [Bugfix] Fix cache block size calculation for CPU MLA (vllm-project#15848) Signed-off-by: Thien Tran <[email protected]>
* [Build/CI] Update lm-eval to 0.4.8 (vllm-project#15912) Signed-off-by: Chris Thi <[email protected]>
* [Kernel] Add more dtype support for GGUF dequantization (vllm-project#15879) Signed-off-by: lukas.bluebaum <[email protected]>
* [core] Add tags parameter to wake_up() (vllm-project#15500) Signed-off-by: Eric <[email protected]>
* [V1] Fix json_object support with xgrammar (vllm-project#15488) Signed-off-by: Russell Bryant <[email protected]>
* Add minimum version for `huggingface_hub` to enable Xet downloads (vllm-project#15873) Signed-off-by: Harry Mellor <[email protected]>
* [Bugfix][Benchmarks] Ensure `async_request_deepspeed_mii` uses the OpenAI choices key (vllm-project#15926) Signed-off-by: Brayden Zhong <[email protected]>
* [CI] Remove duplicate entrypoints-test (vllm-project#15940) Signed-off-by: Kay Yan <[email protected]>
* [Bugfix] Fix the issue where the model name is empty string, causing no response with the model name. (vllm-project#15938) Signed-off-by: chaunceyjiang <[email protected]>
* [Metrics] Hide deprecated metrics (vllm-project#15458) Signed-off-by: Mark McLoughlin <[email protected]>
* [Frontend] Implement Tool Calling with `tool_choice='required'` (vllm-project#13483) Signed-off-by: Liangfu Chen <[email protected]> Signed-off-by: Matt, Matthias <[email protected]> Co-authored-by: Liangfu Chen <[email protected]> Co-authored-by: mgoin <[email protected]>
* [CPU][Bugfix] Using custom allreduce for CPU backend (vllm-project#15934) Signed-off-by: jiang1.li <[email protected]>
* [Model] use AutoWeightsLoader in model load_weights (vllm-project#15770) Signed-off-by: rongfu.leng <[email protected]>
* [Misc] V1 LoRA support CPU offload (vllm-project#15843) Signed-off-by: Jee Jee Li <[email protected]>
* Restricted cmake to be less than version 4 as 4.x breaks the build of… (vllm-project#15859) Signed-off-by: Nishidha Panpaliya <[email protected]>
* [misc] instruct pytorch to use nvml-based cuda check (vllm-project#15951) Signed-off-by: youkaichao <[email protected]>
* [V1] Support Mistral3 in V1 (vllm-project#15950) Signed-off-by: mgoin <[email protected]>
* Fix `huggingface-cli[hf-xet]` -> `huggingface-cli[hf_xet]` (vllm-project#15969) Signed-off-by: Harry Mellor <[email protected]>
* [V1][TPU] TPU-optimized top-p implementation (avoids scattering). (vllm-project#15736) Signed-off-by: Hyesoo Yang <[email protected]> Co-authored-by: root <root@t1v-n-822696b7-w-0.us-central2-b.c.tpu-prod-env-large-adhoc.internal>
* [TPU] optimize the all-reduce performance (vllm-project#15903) Signed-off-by: Chengji Yao <[email protected]>
* [V1][TPU] Do not compile sampling more than needed (vllm-project#15883) Signed-off-by: NickLucche <[email protected]>
* [ROCM][KERNEL] Paged attention for V1 (vllm-project#15720) Signed-off-by: Aleksandr Malyshev <[email protected]> Signed-off-by: root <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]> Co-authored-by: root <[email protected]>
* fix: better error message for get_config close vllm-project#13889 (vllm-project#15943) Signed-off-by: yihong0618 <[email protected]>
* [bugfix] add seed in torchrun_example.py (vllm-project#15980) Signed-off-by: youkaichao <[email protected]>
* [ROCM][V0] PA kennel selection when no sliding window provided (vllm-project#15982) Signed-off-by: Aleksandr Malyshev <[email protected]> Co-authored-by: Aleksandr Malyshev <[email protected]>
* [Benchmark] Add AIMO Dataset to Benchmark (vllm-project#15955) Signed-off-by: Ziji Shi <[email protected]> Signed-off-by: StevenShi-23 <[email protected]>
* [misc] improve error message for "Failed to infer device type" (vllm-project#15994) Signed-off-by: youkaichao <[email protected]>
* [Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process (vllm-project#15367) Signed-off-by: wwl2755 <[email protected]>
* [doc] update contribution link (vllm-project#15922) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* fix: tiny fix make format.sh excutable (vllm-project#16015) Signed-off-by: yihong0618 <[email protected]>
* [SupportsQuant] Bert, Blip, Blip2, Bloom (vllm-project#15573) Signed-off-by: Kyle Sayers <[email protected]>
* [SupportsQuant] Chameleon, Chatglm, Commandr (vllm-project#15952) Signed-off-by: Kyle Sayers <[email protected]>
* [Neuron][kernel] Fuse kv cache into a single tensor (vllm-project#15911) Signed-off-by: Liangfu Chen <[email protected]>
* [Minor] Fused experts refactor (vllm-project#15914) Signed-off-by: Bill Nell <[email protected]>
* [Misc][Performance] Advance tpu.txt to the most recent nightly torch … (vllm-project#16024)
* Re-enable the AMD Testing for the passing tests. (vllm-project#15586) Signed-off-by: Alexei V. Ivanov <[email protected]>
* [TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. (vllm-project#15732) Signed-off-by: Xiongfei Wei <[email protected]>
* [TPU] Switch Test to Non-Sliding Window (vllm-project#15981) Signed-off-by: Robert Shaw <[email protected]> Co-authored-by: Robert Shaw <[email protected]>
* [Bugfix] Fix function names in test_block_fp8.py (vllm-project#16033) Signed-off-by: Bill Nell <[email protected]>
* [ROCm] Tweak the benchmark script to run on ROCm (vllm-project#14252)
* [Misc] improve gguf check (vllm-project#15974) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [TPU][V1] Remove ragged attention kernel parameter hard coding (vllm-project#16041) Signed-off-by: Chengji Yao <[email protected]>
* doc: add info for macos clang errors (vllm-project#16049) Signed-off-by: yihong0618 <[email protected]>
* [V1][Spec Decode] Avoid logging useless nan metrics (vllm-project#16023) Signed-off-by: Mark McLoughlin <[email protected]>
* [Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt (vllm-project#15939) Signed-off-by: Jonghyun Choe <[email protected]>
* [Hardware][Gaudi][BugFix] fix arguments of hpu fused moe (vllm-project#15945) Signed-off-by: zhenwei <[email protected]>
* [Bugfix][kernels] Fix half2float conversion in gguf kernels (vllm-project#15995) Signed-off-by: Isotr0py <[email protected]>
* [Benchmark][Doc] Update throughput benchmark and README (vllm-project#15998) Signed-off-by: StevenShi-23 <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [CPU] Change default block_size for CPU backend (vllm-project#16002) Signed-off-by: jiang1.li <[email protected]>
* [Distributed] [ROCM] Fix custom allreduce enable checks (vllm-project#16010) Signed-off-by: ilmarkov <[email protected]> Co-authored-by: ilmarkov <[email protected]>
* [ROCm][Bugfix] Use platform specific FP8 dtype (vllm-project#15717) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [ROCm][Bugfix] Bring back fallback to eager mode removed in vllm-project#14917, but for ROCm only (vllm-project#15413) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [Bugfix] Fix default behavior/fallback for pp in v1 (vllm-project#16057) Signed-off-by: mgoin <[email protected]>
* [CI] Reorganize .buildkite directory (vllm-project#16001) Signed-off-by: kevin <[email protected]>
* [V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue (vllm-project#15906) Signed-off-by: Nick Hill <[email protected]>
* [V1] Scatter and gather placeholders in the model runner (vllm-project#15712) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* Revert "[V1] Scatter and gather placeholders in the model runner" (vllm-project#16075)
* [Kernel][Minor] Re-fuse triton moe weight application (vllm-project#16071) Signed-off-by: Bill Nell <[email protected]>
* [Bugfix][TPU] Fix V1 TPU worker for sliding window (vllm-project#16059) Signed-off-by: Michael Goin <[email protected]>
* [V1][Spec Decode] Update N-gram Proposer Interface (vllm-project#15750) Signed-off-by: Woosuk Kwon <[email protected]>
* [Misc] Auto detect bitsandbytes pre-quantized models (vllm-project#16027) Signed-off-by: Tristan Leclercq <[email protected]>
* [CI] Fix benchmark script level (vllm-project#16089)
* fix: support clang17 for macos and fix the real libomp (vllm-project#16086) Signed-off-by: yihong0618 <[email protected]>
* [doc] fix 404 (vllm-project#16082) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* Revert "doc: add info for macos clang errors (vllm-project#16049)" (vllm-project#16091) Signed-off-by: yihong0618 <[email protected]>
* Fix some capitalisations in generated examples doc titles (vllm-project#16094) Signed-off-by: Harry Mellor <[email protected]>
* [Misc] format output for encoder_decoder.py (vllm-project#16095) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Misc] Remove redundant code (vllm-project#16098) Signed-off-by: chaunceyjiang <[email protected]>
* [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine (vllm-project#15946) Signed-off-by: Jinzhen Lin <[email protected]>
* [Model] use AutoWeightsLoader for phi, gemma, deepseek (vllm-project#16088) Signed-off-by: Jonghyun Choe <[email protected]>
* [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 (vllm-project#16112) Signed-off-by: Lu Fang <[email protected]>
* [Benchmark] Add sampling parameters to benchmark_serving. (vllm-project#16022) Signed-off-by: Hyesoo Yang <[email protected]>
* [Frontend] Fix typo in tool chat templates for llama3.2 and toolace (vllm-project#14501) Signed-off-by: Ben Jackson <[email protected]>
* [CI][V1] Fix passing `tokenizer` as kwarg to `validate_guidance_grammar` (vllm-project#16117) Signed-off-by: Roger Wang <[email protected]>
* [Misc] refactor example eagle (vllm-project#16100) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Doc][Bugfix] Add missing EOF in k8s deploy doc (vllm-project#16025)
* [Misc] Improve model redirect to accept json dictionary (vllm-project#16119) Signed-off-by: Isotr0py <[email protected]>
* [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 (vllm-project#16103) Signed-off-by: rongfu.leng <[email protected]>
* [Bugfix] LoRA : Fix the order in which the kernels process LoRAs (vllm-project#16040) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
* [Bugfix] add hf_token to EngineArgs (vllm-project#16093) Signed-off-by: paolovic <[email protected]> Co-authored-by: paolovic <[email protected]>
* [Misc] update requires-python in pyproject.toml (vllm-project#16116) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [TPU] Update PyTorch/XLA (vllm-project#16130) Signed-off-by: Chengji Yao <[email protected]>
* [V1][Minor] Optimize get_cached_block (vllm-project#16135)
* Fix requires-python (vllm-project#16132)
* [Metrics] Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` (vllm-project#15202) Signed-off-by: Kay Yan <[email protected]>
* [V1][Minor] Minor simplification for get_computed_blocks (vllm-project#16139) Signed-off-by: Woosuk Kwon <[email protected]>
* [Misc] Update Mistral-3.1 example (vllm-project#16147) Signed-off-by: DarkLight1337 <[email protected]>
* [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings (vllm-project#16129) Signed-off-by: Isotr0py <[email protected]>
* [CI] Set max transformers version for Ultravox model test (vllm-project#16149) Signed-off-by: Roger Wang <[email protected]>
* doc: fix some typos in doc (vllm-project#16154) Signed-off-by: yihong0618 <[email protected]>
* [VLM] Florence-2 supports online serving (vllm-project#16164) Signed-off-by: Isotr0py <[email protected]>
* [V1][Structured Output] Add `supports_structured_output()` method to Platform (vllm-project#16148) Signed-off-by: shen-shanshan <[email protected]>
* [Model] Add Qwen3 and Qwen3MoE (vllm-project#15289) Signed-off-by: YamPengLi <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Misc] improve example mlpspeculator and llm_engine_example (vllm-project#16175) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Doc]Update image to latest version (vllm-project#16186) Signed-off-by: WangErXiao <[email protected]>
* Upstream Llama4 Support to Main (vllm-project#16113) Signed-off-by: Aston Zhang <[email protected]> Signed-off-by: Chris Thi <[email protected]> Signed-off-by: drisspg <[email protected]> Signed-off-by: Jon Swenson <[email protected]> Signed-off-by: Keyun Tong <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Xiaodong Wang <[email protected]> Signed-off-by: Yang Chen <[email protected]> Signed-off-by: Ye (Charlotte) Qi <[email protected]> Signed-off-by: Yong Hoon Shin <[email protected]> Signed-off-by: Zijing Liu <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Lucia Fang <[email protected]> Signed-off-by: Roger Wang <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: Lu Fang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [Bugfix] Re-enable support for `ChatGLMForConditionalGeneration` (vllm-project#16187) Signed-off-by: DarkLight1337 <[email protected]>
* [V1] Revert the default `max_num_seqs` to V0 values for most hardware (vllm-project#16158) Signed-off-by: DarkLight1337 <[email protected]>
* Print the warning only once (vllm-project#16193) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [Misc] Human-readable `max-model-len` cli arg (vllm-project#16181) Signed-off-by: NickLucche <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Misc] Move Llama 4 projector call into encoder execution (vllm-project#16201)
* [Bugfix] Fix guidance backend for Qwen models (vllm-project#16210) Signed-off-by: Benjamin Chislett <[email protected]>
* [V1][BugFix] Exit properly if engine core fails during startup (vllm-project#16137) Signed-off-by: Nick Hill <[email protected]>
* [Misc] add description attribute in CLI (vllm-project#15921) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Bugfix][V0] XGrammar structured output supports Enum (vllm-project#15878) Signed-off-by: Leon Seidel <[email protected]>
* Torchao (vllm-project#14231) Signed-off-by: drisspg <[email protected]>
* [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping (vllm-project#16031) Signed-off-by: mgoin <[email protected]>
* [core] do not send error across process (vllm-project#16174) Signed-off-by: youkaichao <[email protected]>
* [Misc] Update compressed-tensors to version 0.9.3 (vllm-project#16196) Signed-off-by: Miles Williams <[email protected]>
* Update BASE_IMAGE to 2.22 release of Neuron (vllm-project#16218)
* [V1] Scatter and gather placeholders in the model runner (vllm-project#16076) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: mgoin <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Jennifer Zhao <[email protected]>
* [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 (vllm-project#16161)
* Add warning for Attention backends that do not support irope yet (vllm-project#16212)
* [Bugfix] Do not skip "empty" parts of chats that are parsable (vllm-project#16219) Signed-off-by: mgoin <[email protected]>
* [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version (vllm-project#16194) Signed-off-by: Isotr0py <[email protected]>
* [torch.compile][TPU] Make @support_torch_compile work for XLA backend (vllm-project#15782) Signed-off-by: Siyuan Liu <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]>
* [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill (vllm-project#15837) Signed-off-by: mgoin <[email protected]>
* [Misc] Merge the logs of pp layers partitions (vllm-project#16225) Signed-off-by: Kebe <[email protected]>
* [Docs] Add Slides from Singapore Meetup (vllm-project#16213) Signed-off-by: simon-mo <[email protected]>
* [Misc] format and refactor some examples (vllm-project#16252) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Misc] Add warning for multimodal data in LLM.beam_search (vllm-project#16241) Signed-off-by: Alex-Brooks <[email protected]>
* [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe (vllm-project#16203) Signed-off-by: rongfu.leng <[email protected]>
* [BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm (vllm-project#16247) Signed-off-by: Tianyuan Wu <[email protected]>
* [Bugfix] Remove triton do_bench fast_flush arg (vllm-project#16256) Signed-off-by: Kebe <[email protected]>
* Update to transformers==4.51.1 (vllm-project#16257) Signed-off-by: Harry Mellor <[email protected]>
* [New Model]: jinaai/jina-embeddings-v3 (vllm-project#16120)
* [Misc] Avoid stripping meaningful whitespace from `nvidia-smi topo -m` output in collect_env.py (vllm-project#16272) Signed-off-by: imkero <[email protected]>
* [Bugfix] Proper input validation for multi-modal encoder-decoder models (vllm-project#16156) Signed-off-by: DarkLight1337 <[email protected]>
* [Bugfix] Handle `process_weights_after_loading` for `QKVCrossParallelLinear` (vllm-project#15328) Signed-off-by: Isotr0py <[email protected]>
* Add warning that content below line in template will be removed (vllm-project#16276) Signed-off-by: Harry Mellor <[email protected]>
* [BugFix] Fix Llama4 - Index Error When Single Request Near Max Context (vllm-project#16209) Signed-off-by: Lucas Wilkinson <[email protected]>
* [Bugfix] fix deepseek fp16 scale bug (vllm-project#14809) Signed-off-by: Jinzhen Lin <[email protected]> Co-authored-by: mgoin <[email protected]>
* [V1] Update structured output offline inference example (vllm-project#15721) Signed-off-by: Russell Bryant <[email protected]>
* [CI/Build] Fix CI LoRA failure (vllm-project#16270) Signed-off-by: Jee Jee Li <[email protected]>
* Add support to modelopt quantization of Mixtral model (vllm-project#15961) Signed-off-by: Yue <[email protected]>
* [Model] Add smolvlm support (vllm-project#16017) Signed-off-by: chaunceyjiang <[email protected]>
* [Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs (vllm-project#16198) Signed-off-by: tjtanaa <[email protected]> Signed-off-by: kliuae <[email protected]> Co-authored-by: Hongxia Yang <[email protected]> Co-authored-by: kliuae <[email protected]>
* [Bugfix] fix gettid method is not define (vllm-project#16084) Signed-off-by: rongfu.leng <[email protected]>
* [Feature] Estimate max-model-len use available KV cache memory (vllm-project#16168) Signed-off-by: rongfu.leng <[email protected]>
* [Core] Upgrade to xgrammar 0.1.18, add cache size limit (vllm-project#16283) Signed-off-by: Russell Bryant <[email protected]>
* [CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding (vllm-project#16221) Signed-off-by: mgoin <[email protected]>
* [TPU] Update PyTorch/XLA (vllm-project#16288) Signed-off-by: Chengji Yao <[email protected]>
* [BugFix] Fix fusion test and add them to CI (vllm-project#16287) Signed-off-by: luka <[email protected]>
* [Misc] Fix test_sharded_state_loader.py(vllm-project#16004) (vllm-project#16005) Signed-off-by: lvfei.lv <[email protected]>
* [Bugfix] Avoid transferring cached multi-modal items from P0 to P1 (vllm-project#16273) Signed-off-by: DarkLight1337 <[email protected]>
* Update label-tpu mergify and remove removal bot (vllm-project#16298)
* [BugFix] logger is not callable (vllm-project#16312) Signed-off-by: yihong0618 <[email protected]>
* [BugFix] llama4 qknorm should be not shared across head (vllm-project#16311) Signed-off-by: Lu Fang <[email protected]>
* update neuron config (vllm-project#16289) Signed-off-by: Ajay Vohra <[email protected]>
* [BugFix] fix some typos found by typos. (vllm-project#16314) Signed-off-by: yihong0618 <[email protected]>
* [Model] Add `SupportsMultiModal.get_language_model` interface (vllm-project#16007) Signed-off-by: NickLucche <[email protected]>
* [Bugfix][Frontend] respect provided default guided decoding backend (vllm-project#15476) Signed-off-by: Guillaume Calmettes <[email protected]>
* Revert "Update label-tpu mergify and remove removal bot" (vllm-project#16350)
* [Bugfix] Fix profiling.py (vllm-project#16202) Signed-off-by: zh Wang <[email protected]>
* [Bugfix] catch AssertionError in MistralTokenizer as ValueError (vllm-project#16344) Signed-off-by: Guillaume Calmettes <[email protected]>
* [CI]Fix hpu docker and numpy version for CI (vllm-project#16355) Signed-off-by: Chendi Xue <[email protected]>
* Fix `benchmark_throughput.py --backend=hf` (vllm-project#16352) Signed-off-by: mgoin <[email protected]>
* [Build/CI] Add tracing deps to vllm container image (vllm-project#15224) Signed-off-by: Russell Bryant <[email protected]>
* [Hardware] add platform-specific request validation api (vllm-project#16291) Signed-off-by: Joe Runde <[email protected]>
* [Misc] refactor Structured Outputs example (vllm-project#16322) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues (vllm-project#16275) Signed-off-by: Chengji Yao <[email protected]>
* Add GLM-4-0414 support (vllm-project#16338) Signed-off-by: lvfei.lv <[email protected]> Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: yihong0618 <[email protected]> Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Ajay Vohra <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: Guillaume Calmettes <[email protected]> Co-authored-by: Accelerator1996 <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: yihong <[email protected]> Co-authored-by: Lucia Fang <[email protected]> Co-authored-by: ajayvohra2005 <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: Guillaume Calmettes <[email protected]>
* [Bugfix]: do not shutdown server if `skip_special_use=False` for MistralTokenizer (vllm-project#14094) Signed-off-by: Guillaume Calmettes <[email protected]>
* [Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral (vllm-project#16325) Signed-off-by: Aaron Ang <[email protected]>
* [TPU] Fix dummy loading OOM (vllm-project#16372) Signed-off-by: Chengji Yao <[email protected]>
* [bugfix] Avoid the time consumption caused by creating dummy videos. (vllm-project#16371)
* [CI][Bugfix] Pin triton version for CPU (vllm-project#16384) Signed-off-by: Roger Wang <[email protected]>
* [misc] use tqdm.auto where appropriate (vllm-project#16290) Signed-off-by: Benjamin Kitor <[email protected]>
* [Bugfix][TPU] Fix TPU validate_request (vllm-project#16369) Signed-off-by: Michael Goin <[email protected]>
* fix sonnet dataset sample when prefix len is very small (vllm-project#16379) Signed-off-by: Chenyaaang <[email protected]>
* [Model] use AutoWeightsLoader for deepseek_v2, internlm2 (vllm-project#16383) Signed-off-by: Aaron Ang <[email protected]>
* [Misc] Update transformers version limits of multi-modal tests (vllm-project#16381) Signed-off-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix validation error for text-only Mllama 3.2 (vllm-project#16377) Signed-off-by: DarkLight1337 <[email protected]>
* [Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models (vllm-project#16038) Signed-off-by: mgoin <[email protected]>
* [doc] add download model tips (vllm-project#16389) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* Update Numba to 0.61.2 (vllm-project#16376) Signed-off-by: cyy <[email protected]>
* [Model] Remove image mm limit for LLaMa4 (vllm-project#16365) Signed-off-by: Ye (Charlotte) Qi <[email protected]>
* [doc] update the wrong link (vllm-project#16401) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [CI] Add auto update workflow for Dockerfile graph (vllm-project#11879) Signed-off-by: wineandchord <[email protected]>
* Fix the torch version parsing logic (vllm-project#15857)
* [VLM] Remove `BaseProcessingInfo.get_mm_max_tokens_per_item` (vllm-project#16408) Signed-off-by: DarkLight1337 <[email protected]>
* [TPU][V1] Use `language_model` interface for getting text backbone in MM (vllm-project#16410) Signed-off-by: NickLucche <[email protected]>
* Improve configs - `ParallelConfig` (vllm-project#16332) Signed-off-by: Harry Mellor <[email protected]>
* [V1] Set structured output backend to `auto` by default (vllm-project#15724) Signed-off-by: Russell Bryant <[email protected]>
* [V1][Spec Decode] Eagle Model loading (vllm-project#16035) Signed-off-by: LiuXiaoxuanPKU <[email protected]>
* [Bugfix] Fix bug when dataset is json (vllm-project#15899) Signed-off-by: Chenyaaang <[email protected]>
* [Model] Reduce redundant computations in mamba2 blocks for Bamba-9B (vllm-project#15423) Signed-off-by: Chih-Chieh-Yang <[email protected]> Co-authored-by: Yu Chin Fabian Lim <[email protected]>
* [V1] Zero-copy tensor/ndarray serialization/transmission (vllm-project#13790) Signed-off-by: Nick Hill <[email protected]>
* [VLM] Avoid unnecessary dummy multimodal data during processing (vllm-project#16416) Signed-off-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix output token length check logic (vllm-project#16419) Signed-off-by: look <[email protected]>
* [TPU][V1] Disable per-request seed/Generator (vllm-project#16172) Signed-off-by: NickLucche <[email protected]>
* Fix range_ratio Bug in RandomDataset (vllm-project#16126) Signed-off-by: jadewang21 <[email protected]>
* check input length of sonnet samples (vllm-project#16423) Signed-off-by: alexey-belyakov <[email protected]>
* update benchmark_serving_structured_output to include auto backend (vllm-project#16438) Signed-off-by: Chenyaaang <[email protected]>
* [Llama4] Enable attention temperature tuning by default for long context (>32k) (vllm-project#16439) Signed-off-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Ye (Charlotte) Qi <[email protected]>
* Update supported_hardware.md for TPU INT8 (vllm-project#16437)
* [Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test (vllm-project#16424) Signed-off-by: Isotr0py <[email protected]>
* [CPU][Bugfix] Fix CPU docker issues (vllm-project#16454) Signed-off-by: jiang.li <[email protected]>
* [Bugfix] Don't set an upper bound on repetition penalty (vllm-project#16403) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" (vllm-project#16453)
* [Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner (vllm-project#15990) Signed-off-by: Jee Jee Li <[email protected]>
* Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (vllm-project#16447) Signed-off-by: mgoin <[email protected]>
* [Misc] Raise error for V1 not supporting Long LoRA. (vllm-project#16415) Signed-off-by: Jee Jee Li <[email protected]>
* [Misc] update api_client example (vllm-project#16459) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* Don't install triton on `ppc64le` platform (vllm-project#16470) Signed-off-by: Harry Mellor <[email protected]>
* [Kernel] support merge_attn_states CUDA kernel, 3x speedup (vllm-project#16173) Signed-off-by: DefTruth <[email protected]>
* [Bugfix] Fix bugs of running Quark quantized models (vllm-project#16236) Signed-off-by: chaow <[email protected]>
* [Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU (vllm-project#12779) Signed-off-by: Tomasz Zielinski <[email protected]>
* Fix erroneous "model doesn't support compile" warning (vllm-project#16486) Signed-off-by: rzou <[email protected]>
* [TPU][V1] Make `--disable_chunked_mm_input` mandatory for serving MM models (vllm-project#16483) Signed-off-by: NickLucche <[email protected]>
* [Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (vllm-project#16366) Signed-off-by: mgoin <[email protected]>
* [Doc] Document InternVL3 support (vllm-project#16495) Signed-off-by: Isotr0py <[email protected]>
* [Bugfix] handle alignment of encoder_seq_lens in mllama.py (vllm-project#14784) Signed-off-by: Travis Johnson <[email protected]>
* Improve configs - `LoadConfig` (vllm-project#16422) Signed-off-by: Harry Mellor <[email protected]>
* [Frontend] Added chat templates for LLaMa4 pythonic tool calling (vllm-project#16463) Signed-off-by: Ye (Charlotte) Qi <[email protected]> Co-authored-by: Kai Wu <[email protected]>
* [Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (vllm-project#16488)
* Update openai_compatible_server.md (vllm-project#16507) Signed-off-by: Christian Sears <[email protected]>
* [Bugfix] clean up duplicated code (vllm-project#16485) Signed-off-by: Gogs <[email protected]> Co-authored-by: Gogs <[email protected]>
* Bugfix for PixtralHF models without spatial_merge_size (vllm-project#16513) Signed-off-by: mgoin <[email protected]>
* [Doc] Fix link to vLLM blog (vllm-project#16519) Signed-off-by: Yuan Tang <[email protected]>
* [CI][Bugfix] Add mistral_tool_use to Ci (vllm-project#16517) Signed-off-by: mgoin <[email protected]>
* [BugFix] Handle non-contiguous tensors properly when serializing (vllm-project#16492) Signed-off-by: Nick Hill <[email protected]>
* [Doc] Update Llama4 Model Names in Supported Models (vllm-project#16509) Signed-off-by: Ye (Charlotte) Qi <[email protected]>
* Optimized topk for topk=1 (Llama-4) (vllm-project#16512) Signed-off-by: mgoin <[email protected]>
* [Feature][V1] Add xgrammar to support minLength, maxLength with test (vllm-project#16516) Signed-off-by: Leon Seidel <[email protected]>
* [Frontend] support matryoshka representation / support embedding API dimensions (vllm-project#16331)
* fix: spelling (vllm-project#16466) Signed-off-by: Tianer Zhou <[email protected]>
* [Misc] Update chat utils tests (vllm-project#16520) Signed-off-by: DarkLight1337 <[email protected]>
* [Misc] Openai transcription client example use same Whisper model (vllm-project#16487) Signed-off-by: NickLucche <[email protected]>
* [V1] Enable multi-input by default (vllm-project#15799) Signed-off-by: DarkLight1337 <[email protected]>
* [MISC] Make GroupCoordinator compatible with out-of-tree devices (vllm-project#16464) Signed-off-by: [email protected] <[email protected]>
* [Misc] Delete redundant code (vllm-project#16530) Signed-off-by: Jee Jee Li <[email protected]> Co-authored-by: Isotr0py <[email protected]>
* Fix syntaxWarning: invalid escape sequence '\s' (vllm-project#16532) Signed-off-by: Jie Fu <[email protected]>
* [Perf] Optimize Preparing Inputs for GPU Model Runner (vllm-project#16484) Signed-off-by: snowcharm <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [Bugfix] Validate logit biases to prevent out of vocab ids crashing engine (vllm-project#16529) Signed-off-by: Ryan McConville <[email protected]>
* [V1][Spec Decode] KV cache slots for eagle heads (vllm-project#16370) Signed-off-by: LiuXiaoxuanPKU <[email protected]>
* Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (vllm-project#16537) Signed-off-by: mgoin <[email protected]>
* [Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py (vllm-project#16556)
* [Core][V0] Enable regex support with xgrammar (vllm-project#13228) Signed-off-by: Russell Bryant <[email protected]>
* capture only SP * batch_size <= max_batch_size case to cover small max_batch_size

---------

Signed-off-by: Brayden Zhong <[email protected]>
Signed-off-by: Thien Tran <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: chun37 <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Chris Thi <[email protected]>
Signed-off-by: lukas.bluebaum <[email protected]>
Signed-off-by: Eric <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Kay Yan <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Liangfu Chen <[email protected]>
Signed-off-by: Matt, Matthias <[email protected]>
Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: Nishidha Panpaliya <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Hyesoo Yang <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: Aleksandr Malyshev <[email protected]>
Signed-off-by: root <[email protected]>
Signed-off-by: yihong0618 <[email protected]>
Signed-off-by: Ziji Shi <[email protected]>
Signed-off-by: StevenShi-23 <[email protected]>
Signed-off-by: wwl2755 <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Bill Nell <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: Xiongfei Wei <[email protected]>
Signed-off-by: Robert Shaw <[email protected]>
Signed-off-by: Jonghyun Choe <[email protected]>
Signed-off-by: zhenwei <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: ilmarkov <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: kevin <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Michael Goin <[email protected]>
Signed-off-by: Tristan Leclercq <[email protected]>
Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Ben Jackson <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: paolovic <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: YamPengLi <[email protected]>
Signed-off-by: WangErXiao <[email protected]>
Signed-off-by: Aston Zhang <[email protected]>
Signed-off-by: drisspg <[email protected]>
Signed-off-by: Jon Swenson <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Xiaodong Wang <[email protected]>
Signed-off-by: Yang Chen <[email protected]>
Signed-off-by: Ye (Charlotte) Qi <[email protected]>
Signed-off-by: Yong Hoon Shin <[email protected]>
Signed-off-by: Zijing Liu <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lucia Fang <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Leon Seidel <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: Miles Williams <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Kebe <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Tianyuan Wu <[email protected]>
Signed-off-by: imkero <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Yue <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: kliuae <[email protected]>
Signed-off-by: luka <[email protected]>
Signed-off-by: lvfei.lv <[email protected]>
Signed-off-by: Ajay Vohra <[email protected]>
Signed-off-by: Guillaume Calmettes <[email protected]>
Signed-off-by: zh Wang <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: Aaron Ang <[email protected]>
Signed-off-by: Benjamin Kitor <[email protected]>
Signed-off-by: Chenyaaang <[email protected]>
Signed-off-by: cyy <[email protected]>
Signed-off-by: wineandchord <[email protected]>
Signed-off-by: LiuXiaoxuanPKU <[email protected]>
Signed-off-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: look <[email protected]>
Signed-off-by: jadewang21 <[email protected]>
Signed-off-by: alexey-belyakov <[email protected]>
Signed-off-by: jiang.li <[email protected]>
Signed-off-by: DefTruth <[email protected]>
Signed-off-by: chaow <[email protected]>
Signed-off-by: Tomasz Zielinski <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Christian Sears <[email protected]>
Signed-off-by: Gogs <[email protected]>
Signed-off-by: Yuan Tang <[email protected]>
Signed-off-by: Tianer Zhou <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Jie Fu <[email protected]>
Signed-off-by: snowcharm <[email protected]>
Signed-off-by: Ryan McConville <[email protected]>
Co-authored-by: Brayden Zhong <[email protected]>
Co-authored-by: Thien Tran <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: chun <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Li Wang <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Chris Thi <[email protected]>
Co-authored-by: LukasBluebaum <[email protected]>
Co-authored-by: Eric Tang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Kay Yan <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Matthias Matt <[email protected]>
Co-authored-by: Liangfu Chen <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: rongfu.leng <[email protected]>
Co-authored-by: Nishidha <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Hyesoo Yang <[email protected]>
Co-authored-by: root <root@t1v-n-822696b7-w-0.us-central2-b.c.tpu-prod-env-large-adhoc.internal>
Co-authored-by: Chengji Yao <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: Aleksandr Malyshev <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: yihong <[email protected]>
Co-authored-by: Ziji Shi (Steven) <[email protected]>
Co-authored-by: wwl2755 <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: yarongmu-google <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: iefgnoix <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: Jonghyun Choe <[email protected]>
Co-authored-by: liuzhenwei <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Ilya Markov <[email protected]>
Co-authored-by: ilmarkov <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Tristan Leclercq <[email protected]>
Co-authored-by: Jinzhen Lin <[email protected]>
Co-authored-by: Lucia Fang <[email protected]>
Co-authored-by: Ben Jackson <[email protected]>
Co-authored-by: Paul Schweigert <[email protected]>
Co-authored-by: rongfu.leng <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: paolovic <[email protected]>
Co-authored-by: paolovic <[email protected]>
Co-authored-by: Martin Hoyer <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: YamPengLi <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Robin <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: leon-seidel <[email protected]>
Co-authored-by: Driss Guessous <[email protected]>
Co-authored-by: Miles Williams <[email protected]>
Co-authored-by: Satyajith Chilappagari <[email protected]>
Co-authored-by: Jennifer Zhao <[email protected]>
Co-authored-by: zxfan-cpu <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: Kebe <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: TY-AMD <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Kero Liang <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: yueshen2016 <[email protected]>
Co-authored-by: TJian <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: kliuae <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Accelerator1996 <[email protected]>
Co-authored-by: ajayvohra2005 <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: zh Wang <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Yuxuan Zhang <[email protected]>
Co-authored-by: Aaron Ang <[email protected]>
Co-authored-by: Jintao <[email protected]>
Co-authored-by: Benjamin Kitor <[email protected]>
Co-authored-by: Chenyaaang <[email protected]>
Co-authored-by: cyyever <[email protected]>
Co-authored-by: Ye (Charlotte) Qi <[email protected]>
Co-authored-by: wineandchord <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Chih-Chieh Yang <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: look <[email protected]>
Co-authored-by: WWW <[email protected]>
Co-authored-by: Alexey Belyakov <[email protected]>
Co-authored-by: DefTruth <[email protected]>
Co-authored-by: chaow-amd <[email protected]>
Co-authored-by: Tomasz Zielinski <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Kai Wu <[email protected]>
Co-authored-by: Christian Sears <[email protected]>
Co-authored-by: Gogs <[email protected]>
Co-authored-by: Yuan Tang <[email protected]>
Co-authored-by: Tianer Zhou <[email protected]>
Co-authored-by: Huazhong Ji <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: SnowCharm <[email protected]>
Co-authored-by: Ryan McConville <[email protected]>
1 parent b56ba3f commit d48df58

File tree

779 files changed: +46075 −15651 lines

.buildkite/lm-eval-harness/configs/Qwen1.5-MoE-W4A16-compressed-tensors.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
+model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.31
+  - name: "exact_match,flexible-extract"
+    value: 0.47
+limit: 1319
+num_fewshot: 5
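The comment on the new config's first line records how this GSM8K baseline was produced. Spelled out, the invocation looks like the sketch below; the flag meanings are assumptions inferred from the script name and the config fields (limit: 1319, num_fewshot: 5), not documented in this diff:

# Sketch: reproduce the recorded lm-eval baseline.
# Assumed flags: -m model, -b batch size, -l sample limit,
# -f few-shot count, -t tensor parallel size.
bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh \
  -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 \
  -b auto -l 1319 -f 5 -t 1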

.buildkite/lm-eval-harness/configs/models-small.txt

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
-Minitron-4B-Base-FP8.yaml
+Qwen1.5-MoE-W4A16-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
 Qwen2-1.5B-Instruct-FP8W8.yaml
 Meta-Llama-3-8B-QQQ.yaml

.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh

Lines changed: 22 additions & 7 deletions
@@ -10,15 +10,24 @@ set -x
 set -o pipefail
 
 check_gpus() {
-  # check the number of GPUs and GPU type.
-  declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
+  if command -v nvidia-smi; then
+    # check the number of GPUs and GPU type.
+    declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
+  elif command -v amd-smi; then
+    declare -g gpu_count=$(amd-smi list | grep 'GPU' | wc -l)
+  fi
+
   if [[ $gpu_count -gt 0 ]]; then
     echo "GPU found."
   else
     echo "Need at least 1 GPU to run benchmarking."
     exit 1
   fi
-  declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
+  if command -v nvidia-smi; then
+    declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
+  elif command -v amd-smi; then
+    declare -g gpu_type=$(amd-smi static -g 0 -a | grep 'MARKET_NAME' | awk '{print $2}')
+  fi
   echo "GPU type is $gpu_type"
 }
 
@@ -90,9 +99,15 @@ kill_gpu_processes() {
 
 
   # wait until GPU memory usage smaller than 1GB
-  while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
-    sleep 1
-  done
+  if command -v nvidia-smi; then
+    while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
+      sleep 1
+    done
+  elif command -v amd-smi; then
+    while [ "$(amd-smi metric -g 0 | grep 'USED_VRAM' | awk '{print $2}')" -ge 1000 ]; do
+      sleep 1
+    done
+  fi
 
   # remove vllm config file
   rm -rf ~/.config/vllm
@@ -361,7 +376,7 @@ main() {
   # get the current IP address, required by benchmark_serving.py
   export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
   # turn of the reporting of the status of each request, to clean up the terminal output
-  export VLLM_LOG_LEVEL="WARNING"
+  export VLLM_LOGGING_LEVEL="WARNING"
 
   # prepare for benchmarking
   cd benchmarks || exit 1
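These hunks make the benchmark harness vendor-agnostic by probing for the management CLI with `command -v` before calling it. A minimal standalone sketch of the same pattern (quiet redirects added for illustration; the `nvidia-smi`/`amd-smi` queries are taken from the hunks above):

detect_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_count=$(nvidia-smi --list-gpus | wc -l)
    gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
  elif command -v amd-smi >/dev/null 2>&1; then
    gpu_count=$(amd-smi list | grep -c 'GPU')
    gpu_type=$(amd-smi static -g 0 -a | grep 'MARKET_NAME' | awk '{print $2}')
  else
    echo "No supported GPU management CLI found." >&2
    return 1
  fi
  echo "Found $gpu_count $gpu_type GPU(s)"
}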

.buildkite/nightly-benchmarks/tests/serving-tests.json

Lines changed: 6 additions & 4 deletions
@@ -63,10 +63,12 @@
     "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
     "disable_log_requests": "",
     "tensor_parallel_size": 4,
-    "swap_space": 16,
-    "speculative_model": "turboderp/Qwama-0.5B-Instruct",
-    "num_speculative_tokens": 4,
-    "speculative_draft_tensor_parallel_size": 1
+    "swap_space": 16,
+    "speculative_config": {
+      "model": "turboderp/Qwama-0.5B-Instruct",
+      "num_speculative_tokens": 4,
+      "draft_tensor_parallel_size": 1
+    }
   },
   "client_parameters": {
     "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",

.buildkite/release-pipeline.yaml

Lines changed: 9 additions & 9 deletions
@@ -3,21 +3,21 @@ steps:
   agents:
     queue: cpu_queue_postmerge
   commands:
-    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain ."
+    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
     - "mkdir artifacts"
     - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
-    - "bash .buildkite/upload-wheels.sh"
+    - "bash .buildkite/scripts/upload-wheels.sh"
   env:
     DOCKER_BUILDKIT: "1"
 
 - label: "Build wheel - CUDA 12.1"
   agents:
     queue: cpu_queue_postmerge
   commands:
-    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
+    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
     - "mkdir artifacts"
     - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
-    - "bash .buildkite/upload-wheels.sh"
+    - "bash .buildkite/scripts/upload-wheels.sh"
   env:
     DOCKER_BUILDKIT: "1"
 
@@ -31,10 +31,10 @@
   agents:
     queue: cpu_queue_postmerge
   commands:
-    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
+    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
     - "mkdir artifacts"
     - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
-    - "bash .buildkite/upload-wheels.sh"
+    - "bash .buildkite/scripts/upload-wheels.sh"
   env:
     DOCKER_BUILDKIT: "1"
 
@@ -48,7 +48,7 @@
     queue: cpu_queue_postmerge
   commands:
     - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
+    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
     - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
 
 - label: "Build and publish TPU release image"
@@ -57,7 +57,7 @@
   agents:
     queue: tpu_queue_postmerge
   commands:
-    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f Dockerfile.tpu ."
+    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f docker/Dockerfile.tpu ."
     - "docker push vllm/vllm-tpu:nightly"
     - "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
   plugins:
@@ -82,7 +82,7 @@
     queue: cpu_queue_postmerge
   commands:
     - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
+    - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
     - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
   env:
     DOCKER_BUILDKIT: "1"
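The common thread in these hunks is that the Dockerfiles moved from the repository root into a docker/ directory, so builds that previously relied on the implicit root Dockerfile now name the file explicitly. As a before/after sketch (tags abbreviated from the commands above):

# before: the Dockerfile at the repository root is picked up implicitly
DOCKER_BUILDKIT=1 docker build --tag vllm-ci:build-image --target build .
# after: the relocated file must be named explicitly
DOCKER_BUILDKIT=1 docker build --tag vllm-ci:build-image --target build -f docker/Dockerfile .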

.buildkite/run-openvino-test.sh

Lines changed: 0 additions & 16 deletions
This file was deleted.

.buildkite/run-tpu-v1-test.sh

Lines changed: 0 additions & 36 deletions
This file was deleted.

.buildkite/run-amd-test.sh renamed to .buildkite/scripts/hardware_ci/run-amd-test.sh

Lines changed: 23 additions & 7 deletions
@@ -105,19 +105,33 @@ fi
 if [[ $commands == *" entrypoints/openai "* ]]; then
   commands=${commands//" entrypoints/openai "/" entrypoints/openai \
   --ignore=entrypoints/openai/test_audio.py \
-  --ignore=entrypoints/openai/test_chat.py \
   --ignore=entrypoints/openai/test_shutdown.py \
   --ignore=entrypoints/openai/test_completion.py \
   --ignore=entrypoints/openai/test_sleep.py \
   --ignore=entrypoints/openai/test_models.py \
+  --ignore=entrypoints/openai/test_lora_adapters.py \
+  --ignore=entrypoints/openai/test_return_tokens_as_ids.py \
+  --ignore=entrypoints/openai/test_root_path.py \
+  --ignore=entrypoints/openai/test_tokenization.py \
   --ignore=entrypoints/openai/test_prompt_validation.py "}
 fi
 
 #ignore certain Entrypoints/llm tests
-if [[ $commands == *" && pytest -v -s entrypoints/llm/test_guided_generate.py"* ]]; then
-  commands=${commands//" && pytest -v -s entrypoints/llm/test_guided_generate.py"/" "}
+if [[ $commands == *" entrypoints/llm "* ]]; then
+  commands=${commands//" entrypoints/llm "/" entrypoints/llm \
+  --ignore=entrypoints/llm/test_chat.py \
+  --ignore=entrypoints/llm/test_accuracy.py \
+  --ignore=entrypoints/llm/test_init.py \
+  --ignore=entrypoints/llm/test_generate_multiple_loras.py \
+  --ignore=entrypoints/llm/test_prompt_validation.py "}
 fi
 
+#Obsolete currently
+##ignore certain Entrypoints/llm tests
+#if [[ $commands == *" && pytest -v -s entrypoints/llm/test_guided_generate.py"* ]]; then
+#  commands=${commands//" && pytest -v -s entrypoints/llm/test_guided_generate.py"/" "}
+#fi
+
 # --ignore=entrypoints/openai/test_encoder_decoder.py \
 # --ignore=entrypoints/openai/test_embedding.py \
 # --ignore=entrypoints/openai/test_oot_registration.py
@@ -134,9 +148,10 @@ if [[ $commands == *"--shard-id="* ]]; then
     # assign shard-id for each shard
     commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
     echo "Shard ${GPU} commands:$commands_gpu"
+    echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
     docker run \
-        --device /dev/kfd --device /dev/dri \
-        --network host \
+        --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
+        --network=host \
         --shm-size=16gb \
         --rm \
         -e HIP_VISIBLE_DEVICES="${GPU}" \
@@ -163,9 +178,10 @@ if [[ $commands == *"--shard-id="* ]]; then
     fi
   done
 else
+  echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
   docker run \
-      --device /dev/kfd --device /dev/dri \
-      --network host \
+      --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
+      --network=host \
       --shm-size=16gb \
      --rm \
      -e HIP_VISIBLE_DEVICES=0 \
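Instead of exposing every DRI node with `--device /dev/dri`, the run commands now splice in render devices from Buildkite agent metadata, so each shard can be pinned to specific render nodes. The variable's value is not shown in this commit; hypothetically it would expand to extra `--device` flags, something like:

# Hypothetical value injected by the Buildkite agent (the exact
# format is an assumption, not taken from this diff):
export BUILDKITE_AGENT_META_DATA_RENDER_DEVICES="--device /dev/dri/renderD128 --device /dev/dri/renderD129"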

.buildkite/run-cpu-test-ppc64le.sh renamed to .buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh

Lines changed: 1 addition & 1 deletion
@@ -10,5 +10,5 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Try building the docker image
-docker build -t cpu-test -f Dockerfile.ppc64le .
+docker build -t cpu-test -f docker/Dockerfile.ppc64le .

.buildkite/run-cpu-test.sh renamed to .buildkite/scripts/hardware_ci/run-cpu-test.sh

Lines changed: 11 additions & 7 deletions
@@ -8,15 +8,19 @@ set -ex
 CORE_RANGE=${CORE_RANGE:-48-95}
 NUMA_NODE=${NUMA_NODE:-1}
 
-# Try building the docker image
-numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
-numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .
-
 # Setup cleanup
-remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
+remove_docker_container() {
+  set -e;
+  docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true;
+  docker image rm cpu-test-"$BUILDKITE_BUILD_NUMBER" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 || true;
+}
 trap remove_docker_container EXIT
 remove_docker_container
 
+# Try building the docker image
+numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$BUILDKITE_BUILD_NUMBER" --target vllm-test -f docker/Dockerfile.cpu .
+numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 --target vllm-test -f docker/Dockerfile.cpu .
+
 # Run the image, setting --shm-size=4g for tensor parallel.
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
   --cpuset-mems="$NUMA_NODE" --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
@@ -36,8 +40,8 @@ function cpu_tests() {
   # Run basic model test
   docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
-    pip install -r vllm/requirements/test.txt
-    pip install -r vllm/requirements/cpu.txt
+    pytest -v -s tests/kernels/test_cache.py -m cpu_model
+    pytest -v -s tests/kernels/test_mla_decode_cpu.py -m cpu_model
     pytest -v -s tests/models/decoder_only/language -m cpu_model
     pytest -v -s tests/models/embedding/language -m cpu_model
     pytest -v -s tests/models/encoder_decoder/language -m cpu_model
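Note the reordering in the first hunk: the cleanup handler is now registered with `trap` before the images are built, and it also removes the built images, so a failed build no longer leaks containers or stale image tags onto the CI host. A minimal generic sketch of this trap-before-create pattern:

# Register cleanup before creating anything; EXIT fires on success,
# failure, or interruption alike.
cleanup() { docker image rm my-image >/dev/null 2>&1 || true; }
trap cleanup EXIT
docker build -t my-image .   # cleanup still runs if this fails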

.buildkite/run-gh200-test.sh renamed to .buildkite/scripts/hardware_ci/run-gh200-test.sh

Lines changed: 3 additions & 1 deletion
@@ -9,11 +9,13 @@ python3 use_existing_torch.py
 
 # Try building the docker image
 DOCKER_BUILDKIT=1 docker build . \
+  --file docker/Dockerfile \
   --target vllm-openai \
   --platform "linux/arm64" \
   -t gh200-test \
   --build-arg max_jobs=66 \
   --build-arg nvcc_threads=2 \
+  --build-arg RUN_WHEEL_CHECK=false \
   --build-arg torch_cuda_arch_list="9.0+PTX" \
   --build-arg vllm_fa_cmake_gpu_arches="90-real"
 
@@ -23,6 +25,6 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and test offline inference
-docker run -e HF_TOKEN -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
+docker run -e HF_TOKEN -e VLLM_WORKER_MULTIPROC_METHOD=spawn -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
   python3 examples/offline_inference/basic/generate.py --model meta-llama/Llama-3.2-1B
 '

.buildkite/run-hpu-test.sh renamed to .buildkite/scripts/hardware_ci/run-hpu-test.sh

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 set -ex
 
 # Try building the docker image
-docker build -t hpu-test-env -f Dockerfile.hpu .
+docker build -t hpu-test-env -f docker/Dockerfile.hpu .
 
 # Setup cleanup
 # certain versions of HPU software stack have a bug that can

.buildkite/run-neuron-test.sh renamed to .buildkite/scripts/hardware_ci/run-neuron-test.sh

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ else
   date "+%s" > /tmp/neuron-docker-build-timestamp
 fi
 
-docker build -t "${image_name}" -f Dockerfile.neuron .
+docker build -t "${image_name}" -f docker/Dockerfile.neuron .
 
 # Setup cleanup
 remove_docker_container() {
