
Releases: vllm-project/vllm-omni

v0.19.0rc1

04 Apr 07:54
191b9a8


Pre-release

Highlights

This release features 71 commits since v0.18.0.

vLLM-Omni v0.19.0rc1 is a rebase-and-production-readiness release candidate aligned with upstream vLLM v0.19.0. It strengthens the runtime and serving stack, expands speech/TTS and diffusion/video capabilities, improves production behavior for Bagel and Wan pipelines, and broadens deployment coverage across new platforms and distributed execution modes.

Key Improvements

  • Rebased to upstream vLLM v0.19.0, while continuing runtime cleanup and stage execution refactors that improve orchestration and production robustness. (#2475, #2006)
  • Expanded speech and TTS serving, including new OmniVoice two-stage support, CosyVoice3 online serving, and multiple Qwen3-TTS / Fish Speech quality and latency fixes. (#2463, #2431, #2108, #2446, #2378, #2358)
  • Improved diffusion and video generation workflows across Bagel, Wan2.2, FLUX.2-dev, and LTX-2, with lower latency, better forwarding behavior, and stronger production correctness. (#2398, #2422, #2397, #2381, #2459, #2393, #2433, #2260)
  • Broadened deployment coverage, adding MUSA platform support, improving XPU readiness, and extending distributed diffusion features such as HSDP and CFG parallelism. (#2337, #2428, #2029, #2021, #1751)

Core Architecture & Runtime

  • Rebased the project to upstream vLLM v0.19.0, keeping vLLM-Omni aligned with the latest upstream runtime behavior and APIs. (#2475)
  • Continued the stage/runtime refactor by moving stage-side inference into dedicated subprocess-based clients and procs, simplifying orchestration and improving isolation for both AR and diffusion stages. (#2006)
  • Added session-based streaming audio input with a realtime WebSocket path for Qwen3-Omni-style workflows, enabling incremental audio input and streamed transcription/output flows. (#2208)
  • Added a nightly wheel release index, making it easier to validate and consume nightly builds in testing and pre-release workflows. (#2345)
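The session-based streaming path sends incremental audio over a WebSocket (#2208). The sketch below shows one plausible way to frame such a session as JSON messages; the event names, field names, and endpoint are illustrative assumptions, not the actual vLLM-Omni wire protocol, so consult the realtime serving docs for the real schema.

```python
import base64
import json

# Hypothetical message framing for a session-based realtime audio stream.
# Event and field names below are illustrative only.

def session_start(session_id: str, sample_rate: int = 16000) -> str:
    """First frame of a session: declares the audio format up front."""
    return json.dumps({
        "type": "session.start",          # hypothetical event name
        "session_id": session_id,
        "audio_format": {"encoding": "pcm16", "sample_rate": sample_rate},
    })

def audio_chunk(session_id: str, pcm_bytes: bytes) -> str:
    """Incremental audio: raw PCM is base64-encoded into a JSON frame."""
    return json.dumps({
        "type": "input_audio.chunk",      # hypothetical event name
        "session_id": session_id,
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def session_end(session_id: str) -> str:
    """Final frame: tells the server no more audio is coming."""
    return json.dumps({"type": "session.end", "session_id": session_id})

frames = [
    session_start("demo-1"),
    audio_chunk("demo-1", b"\x00\x01" * 160),  # 10 ms of fake 16 kHz PCM
    session_end("demo-1"),
]
print(len(frames))  # 3 frames, ready to send over any WebSocket client
```

Any WebSocket client (e.g. the `websockets` package) can then send these frames in order while concurrently reading streamed transcription/output events back.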

Model Support

  • Added OmniVoice two-stage TTS serving support, bringing zero-shot multilingual speech generation into the vLLM-Omni serving stack. (#2463)
  • Added and stabilized CosyVoice3 online serving through /v1/audio/speech, including stage config fixes and CI coverage. (#2431)
  • Added LTX-2 distilled two-stage inference for both text-to-video and image-to-video production workflows. (#2260)
  • Added Wan 2.1 VACE support for conditional video generation workflows, including multiple conditioning modes. (#1885)
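The `/v1/audio/speech` route used by CosyVoice3 online serving follows the OpenAI-style speech API shape. Below is a minimal stdlib sketch of such a request; the server address, model id, and voice name are placeholders, and exact supported fields may differ, so treat this as an assumption-laden illustration rather than the canonical client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local vLLM-Omni server address

def speech_request(text: str, voice: str = "default") -> dict:
    """Build the JSON body for the OpenAI-style speech endpoint."""
    return {
        "model": "FunAudioLLM/CosyVoice3-0.5B",  # placeholder model id
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }

def synthesize(text: str, out_path: str = "out.wav") -> None:
    """POST the request and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(speech_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

body = speech_request("Hello from vLLM-Omni")
print(sorted(body))
```

Calling `synthesize(...)` against a running server would save the synthesized audio; the request-building step runs standalone.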

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-TTS repeated custom-voice serving by introducing an in-memory voice cache for reference-audio artifacts, reducing warm-request latency for repeated voices. (#2108)
  • Fixed a Fish Speech structured voice-clone conditioning regression so cloned voice quality is restored in the prefill path. (#2446)
  • Fixed Qwen3-TTS chunk-boundary handling, case-insensitive speaker lookup, and demo-serving issues to make TTS behavior more reliable in real deployments. (#2378, #2358, #2372)
  • Added better benchmark support for Qwen3-TTS Base and VoiceDesign models so serving and HF benchmark paths correctly reflect task-specific request formats. (#2411)
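The warm-request win from #2108 comes from keying processed reference-audio artifacts so repeated custom-voice requests skip re-extraction. The following is a minimal sketch of that caching idea with hypothetical names, not the actual vLLM-Omni implementation:

```python
import hashlib

# Sketch of an in-memory voice cache: key the expensive reference-audio
# artifacts by a content hash so repeat requests for the same voice hit cache.

class VoiceCache:
    def __init__(self):
        self._cache: dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(ref_audio: bytes) -> str:
        return hashlib.sha256(ref_audio).hexdigest()

    def get_or_build(self, ref_audio: bytes, build):
        """Return cached artifacts for this reference audio, building once."""
        key = self._key(ref_audio)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = build(ref_audio)  # expensive path runs once
        return self._cache[key]

cache = VoiceCache()
expensive_calls = []

def extract_artifacts(audio: bytes):
    expensive_calls.append(audio)  # stand-in for codec/embedding extraction
    return {"n_bytes": len(audio)}

for _ in range(3):  # three requests reusing the same custom voice
    art = cache.get_or_build(b"ref-wav-bytes", extract_artifacts)
print(cache.hits, cache.misses, len(expensive_calls))  # → 2 1 1
```

Only the first request pays the extraction cost; the two warm requests reuse the cached artifacts, which is the latency reduction the change targets.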

Diffusion, Image & Video Generation

  • Improved Wan2.2 runtime efficiency by optimizing rotary embedding behavior and skipping unnecessary cross-attention Ulysses SP paths where appropriate. (#2393, #2459)
  • Strengthened Bagel production behavior with earlier KV-ready forwarding, fixes for delayed decoding in AR/DiT workflows, proper single-stage img2img routing, and a dedicated single-stage config. (#2398, #2422, #2397, #2381)
  • Added Bagel thinking mode in multi-stage serving, expanding interactive and reasoning-style generation workflows. (#2447)
  • Fixed FLUX.2-dev guidance handling so guidance scale is applied correctly during generation. (#2433)
  • Added a synchronous /v1/videos/sync endpoint for latency-sensitive benchmarking and direct-response video generation workflows. (#2049)
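Unlike the async job-based video API, `/v1/videos/sync` (#2049) returns the result directly in the response. A stdlib sketch of such a call is below; the server address, model id, and request field names are assumptions for illustration only.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local vLLM-Omni server address

def video_request(prompt: str) -> dict:
    """Build a minimal text-to-video request body (field names assumed)."""
    return {
        "model": "Wan-AI/Wan2.2-T2V-A14B",  # placeholder model id
        "prompt": prompt,
        "size": "832x480",                  # assumed resolution parameter
    }

def generate_video_sync(prompt: str, out_path: str = "out.mp4") -> None:
    """POST and block until the video bytes come back in the response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/videos/sync",
        data=json.dumps(video_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

body = video_request("a red fox running through snow")
print(sorted(body))
```

Blocking on the response makes this path convenient for latency benchmarking, since end-to-end time is one request round trip rather than a submit/poll loop.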

Quantization & Memory Efficiency

  • Added offline AutoRound W4A16 support for diffusion models, improving deployability for memory-constrained setups. (#1777)
  • Fixed layer-wise offload incompatibility with HSDP, improving compatibility between memory-saving and distributed execution paths. (#2021)

Platforms, Distributed Execution & Hardware Coverage

  • Added MUSA platform support for Moore Threads GPUs, expanding vLLM-Omni beyond the existing CUDA/ROCm/NPU/XPU coverage. (#2337)
  • Improved XPU readiness for speech serving by removing CUDA-only assumptions in Voxtral TTS components and adding an XPU stage config. (#2428)
  • Expanded distributed diffusion support with HSDP for Qwen-Image-series, Z-Image, and GLM-Image, and added CFG parallel support for HunyuanImage3.0. (#2029, #1751)
  • Fixed distributed gather behavior for non-contiguous tensors, improving correctness in CFG-parallel and related distributed paths. (#2367)

CI, Benchmarks & Documentation

  • Refreshed the diffusion documentation structure around feature compatibility, parallelism, cache acceleration, quantization, and serving examples, making the diffusion stack easier to navigate and adopt.
  • Expanded CI and E2E coverage for speech, diffusion, and video-serving scenarios, especially around CosyVoice3, Qwen3-TTS benchmarking, and Wan-family validation. (#2431, #2411, #2262)

Note

  • v0.19.0rc1 is a release candidate focused on validating the upstream rebase, the refreshed runtime architecture, and the expanded speech/diffusion/platform support before the final v0.19.0 release.
  • Some low-signal CI and documentation maintenance changes were intentionally merged into broader themes instead of being listed one-by-one, following the project’s recent release-note style.

What's Changed

  • [Bugfix][HunyuanImage3.0] Fix default guidance_scale from 1.0 to 4.0 and port GPU MoE ForwardContext fix from NPU by @nussejzz in #2142
  • [Feat] support quantization for Flux Kontext by @RuixiangMa in #2184
  • [Tests][Qwen3-Omni] Add performance test cases by @amy-why-3459 in #2011
  • [Docs] Modify the documentation description for streaming output by @amy-why-3459 in #2300
  • Fix: Enable /v1/models endpoint for pure diffusion mode by @majiayu000 in #805
  • [skip ci] [Docs]: add CI Failures troubleshooting guide for contributors by @lishunyang12 in #1259
  • [Qwen3-Omni][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor by @LJH-LBJ in #2291
  • [Feature] [HunyuanImage3] Add TeaCache support for inference acceleration by @nussejzz in #1927
  • [Misc] Make gradio an optional dependency and upgrade to >=6.7.0 by @Lidang-Jiang in #2221
  • [ROCm] [CI] Migrate to use amd docker hub for ci by @tjtanaa in #2303
  • [Feat] add helios fp8 quantization by @lengrongfu in #1916
  • [Bugfix] fix: handle Qwen-Image-Layered layered RGBA output for jpeg edits by @david6666666 in #2297
  • [Doc] Add transformers version requirement in GLM-Image example doc by @chickeyton in #2265
  • [Bugfix] Fix Qwen3TTSConfig init order to be compatible with newer Transformers (5.x) by @RuixiangMa in #2306
  • [Test] Add Qwen-tts test cases and unify the style of existing test cases by @yenuo26 in #2195
  • [skip ci][Doc] Refine the Diffusion Features User Guide by @wtomin in #1928
  • [Bugfix] fix: return 400 for unsupported multi-image edits such as Qwen-Image-Layered by @david6666666 in #2298
  • [Bugfix] fix: validate layered image layers range by @david6666666 in #2334
  • [skip ci][Docs] reorganize multiple L4 test guidelines by @fhfuih in #2119
  • [Diffusion] Refactor CFG parallel for extensibility and performance by @TKONIY in #2063
  • Fix Qwen3-TTS Base on NPU running failed by @OrangePure in #2353
  • [Test] Fix 4 broken Qwen3-TTS async chunk unit tests by @linyueqian in #2351
  • [Test] Add qwen3-omni tests for audio_in_video and one word prompt by @yenuo26 in #2097
  • [CI] fix test: use minimum supported layered output count by @david6666666 in #2350
  • [CI]test: add wan22 i2v video similarity e2e by @david6666666 in #2262
  • [Bugfix] Fix case-sensitivity in Qwen3 TTS speaker name lookup by @reidliu41 in #2358
  • Fix Qwen3-TTS gradio demo by @noobHappylife in #2372
  • [skip ci] update release 0.18.0 by @hsliuustc0106 in #2380
  • [Bugfix] Update Whisper model loading to support multi-GPU ...

v0.18.0

28 Mar 03:30
f55ea28


Highlights

This release features 324 commits from 83 contributors, including 38 new contributors.

vLLM-Omni v0.18.0 is a major rebase and systems release that aligns the project with upstream vLLM v0.18.0, strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.

Key Improvements

  • Rebased to upstream vLLM v0.18.0, with follow-up updates to docs and dockerfiles, plus cleanup of patches that were no longer needed after the rebase. (#2037, #2038, #2062, #2271)
  • Refactored the serving entrypoint architecture, making the stack cleaner and easier to extend, while also laying groundwork for PD disaggregation, multimodal output decoupling, coordinator-based orchestration, and pipeline config cleanup. (#1908, #1863, #1816, #1465, #1115)
  • Strengthened audio, speech, and omni production serving, especially for Qwen3-TTS, Qwen3-Omni, MiMo-Audio, Fish Speech S2 Pro, and Voxtral TTS, with lower latency, better concurrency, more robust streaming, and improved online serving stability. (#1583, #1617, #1797, #1913, #1985, #1852, #1656, #1963, #2009, #2019, #2239, #1688, #1752, #1964, #2225, #1859, #2145, #2151, #2156, #2158)
  • Delivered substantial diffusion optimization, with scheduler/executor refactoring, faster startup, better cache-dit / TeaCache integration, broader TP/SP/HSDP support, and multiple correctness fixes for online and offline serving. (#1625, #1504, #1715, #1834, #1848, #1234, #2163, #1979, #2101, #2176)
  • Expanded model support across omni, speech, image, and video, including Helios, Helios-Mid / Distilled, MammothModa2, Fun CosyVoice3-0.5B-2512, FLUX.2-dev, FLUX.1-Kontext-dev, Hunyuan Image3 AR, Fish Speech S2 Pro, Voxtral TTS, DreamID-Omni, LTX-2, and HunyuanVideo-1.5. (#1604, #1648, #336, #498, #1629, #561, #759, #1798, #1803, #1855, #841, #1516)
  • Introduced a unified quantization framework and expanded quantization support across diffusion and image workloads, including INT8, FP8, and GGUF-related enablement. (#1764, #1470, #1640, #1755, #1473, #2180)
  • Improved RL and custom pipeline readiness through close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training, including collective RPC support at the entrypoint, custom input/output support, async batching for Qwen-Image, and dedicated E2E coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)

Core Architecture & Runtime

  • Reworked the core serving architecture through the vLLM-Omni Entrypoint Refactoring, while also adding PD disaggregation scaffolding, coordinator support, multimodal output decoupling foundations, and cleaner model/pipeline configuration handling. (#1908, #1863, #1465, #1816, #1115, #1958, #2105)
  • Continued cleanup of runtime internals with stage/step pipeline refactors, dead-code cleanup, and improvements to async engine robustness and scheduler state handling. (#1368, #1579, #2153, #2028, #1893)

Model Support

  • Omni / speech / audio models: added or expanded support for MammothModa2, Fun CosyVoice3-0.5B-2512, Fish Speech S2 Pro, and Voxtral TTS. (#336, #498, #1798, #1803)
  • Image / diffusion models: added or expanded support for Hunyuan Image-3.0, FLUX.2-dev, FLUX.1-Kontext-dev, and continued improvements for Qwen-Image, Qwen-Image-Edit, Qwen-Image-Layered, LongCat-Image, GLM-Image, Bagel, and OmniGen2. (#759, #1629, #561, #1682, #2085, #1970, #2035, #1918, #1578, #1669, #1903, #1711, #1934)
  • Video models: added or expanded support for Helios, Helios-Mid / Distilled, DreamID-Omni, LTX-2, HunyuanVideo-1.5, and updated supported video-generation coverage for Wan2.1-T2V. (#1604, #1648, #1855, #841, #1516, #1920)

Audio, Speech & Omni Production Optimization

  • Qwen3-TTS received major optimization work, including lower TTFA (time-to-first-audio), better high-concurrency throughput, improved Code Predictor / Code2Wav execution, websocket streaming audio output, async scheduling by default, voice upload support, optional ref_text, and long ref_audio handling fixes. (#1583, #1617, #1797, #1913, #1985, #1852, #1719, #1853, #1201, #1879, #2046, #2104)
  • Qwen3-Omni gained lower inter-packet latency, speaker-switching support, decode-alignment fixes, and multiple correctness fixes for answer quality and online serving stability. (#1656, #1963, #2009, #2019, #2239)
  • MiMo-Audio improved compatibility and production robustness with TP fixes, broader attention backend support, configurable chunk sizing, and documentation to prevent noise-only outputs under unsupported attention setups. (#1688, #1752, #1964, #2225, #2205)
  • Fish Speech S2 Pro and Voxtral TTS were productionized further with online serving, voice cloning, better TTFP / inference performance, multilingual demo support, lighter flow matching, and voice-embedding fixes. (#1798, #1859, #2145, #1803, #2045, #2056, #2067, #2151, #2156, #2158, #2023)
  • Added or improved speech-serving interfaces, including speech batch entrypoint, speaker embedding support for speech and voices APIs, proper HTTP status handling, and streaming wav response support. (#1701, #1227, #1687, #1819)

Diffusion, Image & Video Generation

  • Runtime refactor & benchmarking: Refactored the diffusion runtime with cleaner scheduler/executor boundaries, better request-state flow, unified profiling, and stronger benchmarking infrastructure. (#1625, #2099, #1757, #1917, #1995)
  • Performance & startup gains: Improved diffusion performance through multi-threaded weight loading for Wan2.2, reduced IPC overhead for single-stage serving, cache-dit upgrades, TeaCache support, and nightly performance improvements for Qwen-Image. (#1504, #1715, #1834, #1234, #1314, #1805, #2111)
  • Distributed scaling: Expanded distributed diffusion execution with broader TP/SP/HSDP support across Flux, GLM-Image, Hunyuan, and Bagel. (#1250, #1900, #1918, #2163, #1903)
  • Serving UX & API ergonomics: Improved serving usability with a progress bar for diffusion models, richer image-edit parameters such as layers and resolution, and extra request-body support for video APIs. (#1652, #2053, #1955)
  • Correctness & stability fixes: Fixed a wide range of diffusion correctness issues, including config misalignment between offline and online inference, TP/no-seed broken-image issues, GLM-Image stage/device bugs, and TeaCache incompatibilities. (#1979, #2176, #2137, #2101, #1894, #2025)

Quantization & Memory Efficiency

  • Added the Unified Quantization Framework as a core infrastructure upgrade for more consistent quantized execution across model families. (#1764)
  • Expanded quantization support for diffusion/image workloads, including INT8 for DiT (Z-Image and Qwen-Image), FP8 for Flux transformers, and GGUF adapter support for Qwen-Image. (#1470, #1640, #1755)
  • Improved compatibility between quantization and runtime features such as CPU offload, tensor parallelism, and Flux-family execution. (#1473, #1723, #1978, #2180)

RL, Serving & Integrations

  • verl collaboration & Qwen-Image E2E RL: Expanded RL-oriented serving in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training with collective RPC support, custom input/output, async batching for Qwen-Image, and dedicated E2E CI coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
  • Rollout scaling for visual RL: Added rollout building blocks referenced by verl’s Qwen-Image integration plan, including async batching for Qwen-Image plus tensor-parallel and data-parallel support for diffusion serving. (#1593, #1713, #1706)
  • Deployment & ecosystem integrations: Improved deployment and ecosystem integration with a Helm chart for Kubernetes, ComfyUI video & LoRA support, and a rewritten async video API lifecycle. (#1337, #1596, #1665)

Platforms, Distributed Execution & Hardware Coverage

  • Continued improving portability across CUDA, ROCm, NPU, and XPU/Intel GPU environments, including rebase follow-ups, ROCm CI setup, Intel CI dispatch, Intel GPU docs, and NPU docker/docs refreshes. (#2017, #1984, #1721, #2154, #2271, #2091)
  • Expanded distributed execution coverage with T5 tensor parallelism, more model-level TP/SP/HSDP support, and better handling of visible GPUs and stage-device initialization. (#1881, #1250, #1900, #1918, #2163, #2025)

CI, Benchmarks & Documentation

  • Strengthened release engineering and CI with a release pipeline, richer nightly benchmark/report generation, L3/L4/L5 test layering, expanded model E2E coverage, and stronger diffusion test coverage. (#1726, #1831, #1995, #1514, #1799, #2086, #1869, #2085, #2087, #2132, #2129, #2023)
  • Improved benchmarking with Qwen3-TTS benchmark scripts, nightly Qwen3-TTS and Qwen-Image performance tracking, diffusion timing, random benchmark datasets, and T2I/I2I accuracy benchmark integration. (#1573, #1700, #1805, #2111, #1757, #1657, #1917)
  • Refreshed project docs across installation, omni/TTS docs, diffusion serving parameters, UAA documentation, developer guides, and governance. (#1762, #1693, #2051, #2130, #2148, #1889)

Note

  • GLM-Image requires manually upgrading the transformers version to >= 5.0.

What's Changed

  • 0.16.0 release by @ywang96 in #1576
  • [Refactor]: Phase1 for rebasing_addit...

v0.18.0rc1

21 Mar 14:29
6838533


Pre-release

Highlights

This release features approximately 120 commits across 120+ pull requests from 50+ contributors, including 13 new contributors.

Expanded Model Support

This release continues to grow the multimodal model ecosystem with several major additions:

  • Added FLUX.2-dev image generation model (#1629).
  • Added Bagel multistage img2img support (#1669).
  • Added HunyuanVideo-1.5 text-to-video and image-to-video support (#1516).
  • Added Voxtral TTS model (#1803, #2026, #2056).
  • Added Fish Speech S2 Pro with online serving and voice cloning (#1798).
  • Added Dreamid-Omni from ByteDance (#1855).
  • Extended NPU support for HunyuanImage3 diffusion model (#1689).
  • Added OmniGen2 transformer config loading for HF models (#1934).

Performance Improvements

Multiple optimizations improve throughput, latency, and runtime efficiency:

  • Qwen3-Omni code predictor re-prefill + SDPA to eliminate decode hot-path CPU round-trips (#2012).
  • Qwen3-TTS high-concurrency throughput & latency boost (#1852).
  • Qwen3-TTS Code2Wav triton SnakeBeta kernel and CUDA Graph support (#1797).
  • Qwen3-TTS CodePredictor torch.compile with reduce-overhead and dynamic=False (#1913).
  • Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls (#1985).
  • Simple dynamic TTFA based on Code2Wav load for Qwen3-TTS (#1714).
  • Enabled async_scheduling by default for Qwen3-TTS (#1853).
  • Fish Speech S2 Pro inference performance improvements (#1859).
  • Fix slow hasattr in CUDAGraphWrapper.getattr (#1982).
  • Diffusion timing profiling improvements (#1757).

Inference Infrastructure & Parallelism

New infrastructure capabilities improve scalability and production readiness:

  • Model Pipeline Configuration System refactor (Part 1) (#1115).
  • vLLM-Omni entrypoint refactoring for cleaner startup flow (#1908).
  • Expert parallel for diffusion MoE layers (#1323).
  • Sequence parallelism (SP) support for FLUX.2-klein (#1250) and HSDP for Flux family (#1900).
  • T5 Tensor Parallelism support (#1881).
  • LongCat Sequence Parallelism refactored to use SP Plan (#1772).
  • PD disaggregation scaffolding (Split #1303 Part 1) (#1863).
  • Coordinator module with unit tests (#1465).
  • Refactored pipeline stage/step pipeline (#1368).
  • Helm Chart to deploy vLLM-Omni on Kubernetes (#1337).

Text-to-Speech Improvements

Major TTS pipeline improvements for streaming, quality, and new models:

  • Streaming audio output via WebSocket for Qwen3-TTS (#1719).
  • Gradio demo for Qwen3-TTS online serving (#1231).
  • Added wav response_format when stream is true in /v1/audio/speech (#1819).
  • Fixed Base voice clone streaming quality and stop-token crash (#1945).
  • Fixed streaming initial chunk — removed dynamic initial chunk, compute only on initial request (#1930).
  • Preserved ref_code decoder context for Base ICL in Qwen3-TTS (#1731).
  • Restored voice upload API and profiler endpoints reverted by #1719 (#1879).
  • BugFix for CodePredictor CudaGraph Pool (#2059).

Quantization & Hardware Support

  • Int8 quantization support for DiT (Z-Image & Qwen-Image) (#1470).
  • Added cache-dit support for HunyuanImage3 (#1848) and Flux.2-dev (#1814).
  • Enabled CPU offloading and Cache-DiT together on diffusion models (#1723).
  • Upgraded cache-dit from 1.2.0 to 1.3.0 (#1834).
  • NPU upgrade to v0.17.0 (#1890).
  • Updated Bagel modeling to remove CUDA hardcode and added XPU stage_config (#1931).
  • Updated GpuMemoryMonitor to DeviceMemoryMonitor for all hardware (#1526).
  • ROCm bugfix for device environment issues and CI setup (#1984, #2017).
  • Intel CI dispatch in Buildkite folder (#1721).

Frontend & Serving

  • ComfyUI video & LoRA support (#1596).
  • Rewrote video API for async job lifecycle (#1665).
  • Fix /chat/completion not reading extra_body for diffusion models (#2042).
  • Fix online server returning multiple images (#2007).
  • Fix Ovis Image crash when guidance_scale is set without negative_prompt (#1956).
  • Fix config misalignment between offline and online diffusion inference (#1979).
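The `/chat/completion` fix above (#2042) concerns diffusion parameters that the OpenAI Python client passes via `extra_body`, which the client merges into the top level of the JSON request. The sketch below shows that merge as a raw body; the model id and the specific diffusion fields (`guidance_scale`, `size`) are illustrative assumptions.

```python
# Sketch: how extra_body-style diffusion parameters end up in the chat
# completions JSON body. Field names here are assumptions for illustration.

def chat_image_body(prompt: str) -> dict:
    base = {
        "model": "Qwen/Qwen-Image",  # placeholder diffusion model id
        "messages": [{"role": "user", "content": prompt}],
    }
    extra_body = {                   # what the client's extra_body carries
        "guidance_scale": 4.0,
        "size": "1024x1024",
    }
    return {**base, **extra_body}    # client merges extras at the top level

body = chat_image_body("a watercolor lighthouse at dusk")
print(sorted(body))
```

With the fix, the server reads these top-level extras for diffusion models instead of silently dropping them.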

Reliability, Tooling & Developer Experience

  • OmniStage.try_collect() patched with process alive checks (#1560) and Ray alive checks (#1561).
  • Nightly Buildkite Pytest test case statistics with HTML report by email (#1674).
  • Nightly Benchmark HTML generator and updated EXCEL generator (#1831).
  • Added multimodal processing correctness tests for Omni models (#1445).
  • Added Qwen3-TTS nightly performance benchmark (#1700) and benchmark scripts (#1573).
  • Added Governance section (#1889).
  • Rebase to vllm v0.18.0 (#2037, #2038).
  • Numerous bug fixes across models, configuration, parallelism, and CI pipelines.

What's Changed

  • [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
  • [Bagel]: Support multistage img2img by @princepride in #1669
  • [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
  • [Doc] CLI Args Naming Style Correction by @wtomin in #1750
  • [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
  • [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
  • Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint by @ekagra-ranjan in #1255
  • [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
  • [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
  • [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
  • Revert "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint" by @hsliuustc0106 in #1789
  • [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
  • [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
  • [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
  • [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
  • [Docs] update async chunk performance diagram by @R2-Y in #1741
  • [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
  • [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
  • [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
  • update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
  • Add coordinator module and corresponding unit test by @NumberWan in #1465
  • [Model]: add FLUX.2-dev model by @nuclearwu in #1629
  • [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
  • [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
  • [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
  • [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
  • [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
  • [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
  • [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
  • Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
  • [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
  • Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
  • [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
  • [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
  • [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
  • [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
  • [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever...

v0.17.0rc1

09 Mar 11:28
155856f


Pre-release

Highlights

This release features approximately 70 commits across 72 pull requests from 30+ contributors, including 12 new contributors.

Expanded Model Support

This release significantly expands the supported multimodal model ecosystem:

  • Added support for Helios models and Helios-Mid / Distilled variants (#1604, #1648).
  • Added Hunyuan Image3 AR generation support (#759).
  • Added LTX-2 text-to-video and image-to-video support (#841).
  • Added support for MammothModa2 (#336) and CosyVoice3-0.5B (#498).
  • Improved compatibility and fixes for Qwen3-Omni and LongCat models (#1602, #1485, #1631).

Performance Improvements

Multiple optimizations improve startup time, streaming latency, and runtime efficiency:

  • Accelerated diffusion model startup with multi-threaded weight loading (#1504).
  • Reduced inter-packet latency in async chunking for Qwen3-Omni streaming (#1656).
  • Reduced TTFA (time-to-first-audio) for Qwen3-TTS via flexible initial phases (#1583).
  • Optimized TTS code predictor execution by removing GPU synchronization bottlenecks (#1614).
  • Enabled torch.compile + CUDA Graph for TTS pipelines (#1617).
  • Reduced IPC overhead in single-stage diffusion serving for Wan2.2 (#1715).

Inference Infrastructure & Parallelism

New infrastructure improvements improve scalability and flexibility for multimodal serving:

  • Added CFG KV-cache transfer support for multi-stage pipelines (#1422).
  • Added CFG parallel mode for Bagel diffusion models (#1578, #1695).
  • Refactored tile/patch parallelism to simplify support for additional models (#1366).
  • Added VAE patch parallel CLI option for online diffusion serving (#1716).
  • Enabled async chunking for offline inference and configurable chunk parameters (#1415, #1423).
  • Added collective RPC API entrypoint and custom I/O support for RL workloads (#1646).

Text-to-Speech Improvements

Major improvements to the stability and flexibility of the TTS pipeline:

  • Added voice upload API for Qwen3-TTS (#1201).
  • Added flexible task_type configuration for Qwen3-TTS models (#1197).
  • Added non-async chunk mode and improved offline batching support (#1678, #1417).
  • Fixed several stability issues including predictor crashes, all-silence output, and Transformers 5.x compatibility (#1619, #1664, #1536).

Quantization & Hardware Support

  • Added FP8 quantization support for Flux transformers (#1640).
  • Improved NPU support, including MindIE-SD AdaLN compatibility (#1537).
  • Improved device abstraction by replacing hard-coded CUDA generators with platform-aware detection (#1677).
  • Updated XPU container configuration (#1545).

Reliability, Tooling & Developer Experience

  • Added progress bar support for diffusion models (#1652).
  • Introduced benchmark collection and reporting scripts in CI (#1307).
  • Added TTS developer guide and testing documentation (#1693, #1376).
  • Improved API robustness with better error handling and request validation (#1641, #1687).
  • Numerous bug fixes across models, kernels, and configuration handling (#1391, #1566, #1609, #1661).

What's Changed

  • 0.16.0 release by @ywang96 in #1576
  • [Refactor]: Phase1 for rebasing_additional_info by @divyanshsinghvi in #1394
  • [Feature]: Support cfg kv-cache transfer in multi-stage by @princepride in #1422
  • [BugFix] Fix load_weights error when loading HunyuanImage3.0 by @Semmer2 in #1598
  • [Bugfix] fix kernel error for qwen3-omni by @R2-Y in #1602
  • [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio by @qibaoyuan in #1570
  • [Bugfix] Import InputPreprocessor into Renderer by @lengrongfu in #1566
  • [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading by @SamitHuang in #1504
  • [Bugfix][Model] Fix LongCat Image Config Handling / Layer Creation by @alex-jw-brooks in #1485
  • [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context by @ZhanqiuHu in #1619
  • [Debug] Enable curl retry aligned with openai by @tzhouam in #1539
  • [Doc] Fix links in the configuration doc by @yuanheng-zhao in #1615
  • [CI] Add scripts for benchmark collection and email distribution. by @congw729 in #1307
  • [FEATURE] Tile/Patch parallelism refactor for easily support other models by @Bounty-hunter in #1366
  • [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation by @yuanheng-zhao in #1609
  • Make chunk_size and left_context_size configurable via YAML for async chunking by @LJH-LBJ in #1423
  • [Bugfix] Fix transformers 5.x compat issues in online TTS serving by @linyueqian in #1536
  • [Refactor] lora: reuse load_weights packed mapping by @dongbo910220 in #991
  • [Model]: support Helios from ByteDance by @princepride in #1604
  • [chore] add _repeated_blocks for regional compilation support by @RuixiangMa in #1642
  • [Bugfix] Add TTS request validation to prevent engine crashes by @linyueqian in #1641
  • [CI] Fix ASCII codes. by @congw729 in #1647
  • [Misc] update wechat by @david6666666 in #1649
  • docs: Announce vllm-omni-skills community project by @hsliuustc0106 in #1651
  • [Model] Add Hunyuan Image3 AR Support by @usberkeley in #759
  • [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases by @amy-why-3459 in #1628
  • [Bugfix] Fix Dtype Parsing by @alex-jw-brooks in #1391
  • [XPU] fix UMD version in docker file by @yma11 in #1545
  • add support for MammothModa2 model by @HonestDeng in #336
  • [Model] Fun cosy voice3-0.5-b-2512 by @divyanshsinghvi in #498
  • [Bugfix] Enable torch.compile for low noise model (transformer_2) by @lishunyang12 in #1541
  • [NPU] [Features] [Bugfix] Support mindiesd adaln by @jiangmengyu18 in #1537
  • [FP8 Quantization] Add FP8 quantization support for Flux transformer by @zzhuoxin1508 in #1640
  • Replace hard-coded cuda generator with current_omni_platform.device_type by @pi314ever in #1677
  • [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup by @alex-jw-brooks in #1631
  • [Misc] remove logits_processor_pattern this field, because vllm have … by @lengrongfu in #1675
  • [CI] Remove high concurrency tests before issue #1374 fixed. by @congw729 in #1683
  • [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk by @ZeldaHuang in #1656
  • [Feat][Qwen3TTS] reduce TTFA with flexible initial phase by @JuanPZuluaga in #1583
  • [Model] support LTX-2 text-to-video image-to-video by @david6666666 in #841
  • [BugFix] Return proper HTTP status for ErrorResponse in create_speech by @Lidang-Jiang in #1687
  • [Doc] Add the test guide document. [skip ci] by @yenuo26 in #1376
  • [UX] Add progress bar for diffusion models by @gcanlin in #1652
  • [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder by @ZhanqiuHu in #1664
  • [Feature] Support flexible task_type configuration for Qwen3-TTS models by @JackLeeHal in #1197
  • [Cleanup] Move cosyvoice3 tests to model subdirectory by @linyueqian in #1666
  • [Feature][Bagel] Add CFG parallel mode by @nussejzz in #1578
  • perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor by @dubin555 in #1614
  • [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph by @Sy0307 in #1617
  • [MiMo-Audio] Bugfix tp lg than 1 by @qibaoyuan in #1688
  • Add non-async chunk support for Qwen3-TTS by @linyueqian in #1678
  • [1/N][Refactor] Clean up dead code in output processor by @gcanl...

v0.16.0

28 Feb 08:33
3d9fa8d


Highlights

This release features approximately 121 commits (merged PRs) from ~60 contributors (24 new contributors).

vLLM-Omni v0.16.0 is a major alignment + capability release that rebases the project onto upstream vLLM v0.16.0 and significantly expands performance, distributed execution, and production readiness across Qwen3-Omni / Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image and the Diffusion (DiT) image/video stack—while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.

Key Improvements

  • Rebase to upstream vLLM v0.16.0: Tracks the latest vLLM runtime behavior and APIs while keeping Omni’s error handling aligned with upstream expectations. (#1357, #1122, plus follow-up fixes like #1401)
  • Qwen3-Omni performance + correctness: Performance optimizations (CUDA graph, async-chunk, streaming output) reduce TTFP by 90% and bring RTF to 0.22–0.45, plus precision and E2E metric correctness fixes. (#1378, #1352, #1288, #1018, #1292)
  • MiMo-Audio production support: The same classes of optimization (CUDA graph, async-chunk, streaming output) bring RTF to ~0.2, 11x faster than baseline. (#750)
  • Qwen3-TTS production upgrades: Disaggregated inference pipeline support, streaming output, batched Code2Wav decoding, and CUDA Graph support for speech tokenizer decoding—plus multiple robustness fixes across task type handling and voice cloning. (#1161, #1438, #1426, #1205, #1317, #1554)
  • Bagel acceleration & scalability: Adds TP support, introduces CFG capabilities, and accelerates multi-branch CFG by merging branches into a single batch; includes KV transfer stability fixes. (#1293, #1310, #1429, #1437)
  • Diffusion distributed execution expansion: Adds/extends TP/SP/HSDP and reduces redundant communication overhead; improves pipeline parallelism options (e.g., VAE patch parallel) and correctness across multiple diffusion families. (#964, #1275, #1339, #756, #1428)
  • Quantization for DiT: Introduces FP8 quantization support and native GGUF quantization support for diffusion transformers, with code-path cleanups. (#1034, #1285, #1533)
  • Broader model coverage (audio + image): Adds MiMo-Audio-7B-Instruct support and performance improvements for GLM-Image pipelines. (#750, #920)

Diffusion, Image & Video Generation

  • New/expanded model coverage

    • HunyuanImage3 support and v0.16.0 follow-ups removing CUDA hardcoding + MOE fixes. (#1085, #1402, #1401)
    • OmniGen2 support. (#513)
    • nextstep_1 diffusion model (T2I-only). (#612)
  • Distributed & parallel execution

    • TP support additions/expansions for diffusion models (e.g., Wan 2.2, SD 3.5). (#964, #1336)
    • HSDP for diffusion models for improved scalability. (#1339)
    • VAE patch parallelism support (and enablement for SD3.5). (#756, #1428)
    • Sequence-parallel comm reduction by refining SP hook design. (#1275)
  • Performance & memory efficiency

    • Flux caching features (e.g., cache_dit) and CFG-parallel improvements for Flux.1-dev. (#1145, #1269)
    • Process-level memory calculation hooks for diffusion workloads. (#1276)
    • Platform-wide enablement of layerwise offload. (#1492)
  • Correctness & stability

    • Multiple pipeline stability and correctness fixes (seed handling, attention mask dtype/shape, tokenizer padding issues, init/download safety, model detect robustness, etc.). (e.g., #1249, #1248, #1349, #1241, #1213, #1254, #1562)

Audio, Speech & Omni (Qwen3-TTS / MiMo-Audio)

  • Qwen3-TTS feature set maturation

    • Disaggregated inference pipeline support for stage-based / split deployment. (#1161)
    • Streaming output for v1/audio/speech-style workflows. (#1438)
    • Code2Wav batched decoding and async-chunk batch inference enhancements. (#1426, #1246)
    • CUDA Graph support for the speech tokenizer decoder. (#1205)
  • Stability & quality

    • Fixes for task_type handling, snapshot/download behavior, configuration options, and voice clone corruption edge cases. (#1317, #1318, #1177, #1554, #1455)
    • More robust handling of multimodal outputs that attach audio payloads and related server-side audio data processing. (#1203, #1222)

Multimodal Model Improvements

  • Bagel

    • TP support for scaling across devices. (#1293)
    • CFG enablement and multi-branch CFG merged into a single batch to improve throughput and reduce per-branch overhead. (#1310, #1429)
    • KV transfer and stability fixes. (#1437)
  • GLM-Image

    • Performance improvements for GLM-Image workloads. (#920)
    • Additional image-serving hardening that benefits GLM-Image deployments (endpoint/pipeline validation and crash fixes in edge cases). (e.g., #1141, #1265, #1248)

Serving, APIs & Integrations

  • OpenAI-compatible video serving

    • Adds Wan2.2 T2V and I2V online serving via OpenAI /v1/videos API. (#1073)
    • Supports irregular output shapes for Wan2.2. (#1279)
  • Online serving robustness & usability

    • Unify CLI argument naming style and forward serve parameters more consistently to models. (#1309, #985)
    • Per-request generator_device for online image generation/edit flows. (#1183)
    • Fixes for image edit endpoint validation and RoPE crashes on explicit H/W. (#1141, #1265)
  • Ecosystem integration

    • ComfyUI integration for improved workflow adoption. (#1113)
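
Since the /v1/videos endpoint follows the OpenAI style, a client request can be built like the sketch below. This is a minimal illustration only: the model identifier and payload field names are assumptions, not the verified schema, and the server address is a placeholder.

```python
# Sketch of a request to the OpenAI-compatible /v1/videos endpoint added
# for Wan2.2 T2V/I2V online serving (#1073). Field names and the model id
# below are assumed for illustration.
import json
import urllib.request

def build_video_request(prompt: str,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    payload = {
        "model": "Wan-AI/Wan2.2-T2V-A14B",  # assumed model identifier
        "prompt": prompt,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_video_request("a red fox running through fresh snow")
```

Sending `req` with `urllib.request.urlopen` (against a running server) would return the generated video response.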

Performance, Scheduling & Memory Accounting

  • Async chunk enhancements

    • Overlap chunk I/O and compute via async scheduling to reduce idle time in chunked pipelines. (#951)
    • Async-chunk refactors and shape mismatch fixes for stability. (#1151, #1195)
  • Metrics & benchmarking

    • Metrics structure optimization and multiple fixes for token/stream stats and E2E correctness (including Qwen3-Omni async-chunk E2E metric correctness). (#891, #1292, #1301, #1018)
    • Adds benchmarks for audio speech non-streaming and omni performance benchmark tests. (#1408, #1321)
  • Memory accounting

    • Process-scoped GPU memory accounting and diffusion-side process-level tracking improvements. (#1204, #1276)
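
The async-chunk idea above can be sketched with plain asyncio: while one chunk is being computed, the next chunk's I/O is already in flight, so the pipeline avoids idle waits. This is a toy illustration with simulated sleeps, not vLLM-Omni's actual scheduler.

```python
# Toy sketch of async-chunk overlap (#951): prefetch the next chunk's I/O
# while computing the current chunk. Sleeps stand in for real I/O/compute.
import asyncio

async def load_chunk(i):
    await asyncio.sleep(0.01)  # simulated chunk I/O
    return f"chunk-{i}"

async def compute(chunk):
    await asyncio.sleep(0.01)  # simulated compute
    return chunk.upper()

async def pipeline(n):
    results = []
    next_load = asyncio.create_task(load_chunk(0))
    for i in range(n):
        chunk = await next_load
        if i + 1 < n:
            # kick off the next load before computing the current chunk
            next_load = asyncio.create_task(load_chunk(i + 1))
        results.append(await compute(chunk))
    return results

out = asyncio.run(pipeline(3))
```

With n chunks, total wall time approaches n compute steps plus one load step, instead of n of each.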

Platform, Hardware Backends & Deployment

  • XPU / NPU / ROCm coverage improvements

    • XPU Dockerfile + docs, enable FLASH_ATTN on XPU, fix XPU UT coverage; disable diffusion compile on XPU where needed. (#1162, #1332, #1164, #1148)
    • NPU upgrade to v0.16.0 and recovery fixes for Qwen3-TTS. (#1375, #1564)
    • ROCm CI/docker updates to track vLLM v0.16.0 stable. (#1380, #1500)
  • Deployment & connectivity

    • Stage-based deployment CLI and UDS-based ZMQ address handling for stage serving. (#939, #1522)
    • RDMA connector support for high-performance interconnect scenarios. (#1019)
    • Platform-dependent package installation improvements. (#1046)

CI, Testing, Docs & Developer Experience

  • CI quality + coverage

    • Expanded test stratification design (L2/L3), nightly (L4) test runs, branch coverage fixes, and CI performance tuning. (#1272, #1333, #1120, #1283)
    • Improved CI stability (timeouts, reduced H100 usage, clearer logs). (#1460, #1543, #1463)
  • Docs & tutorials

    • Tutorials on models/pipelines/features, diffusion tutorial refinements, Qwen3-TTS docs consistency, quantization Q&A updates, and installation instructions for vLLM 0.16.0. (#1196, #1305, #1226, #1257, #1505)
    • Improved examples (e.g., image-to-video download steps). (#1258)
  • Tooling

    • Online profiling support and other developer ergonomics improvements. (#1136)

Stability & Bug Fixes (Across the Stack)

This release includes broad correctness and robustness fixes spanning:

  • Diffusion pipelines (dtype/shape, init crashes, model detection, seed and config handling)
  • Image edit / generation endpoints (format validation, RoPE crash, argument typing, seed handling)
  • Distributed execution (process group mapping accuracy, scheduler race conditions, kv transfer correctness)
  • General runtime hygiene (removing unnecessary ZMQ init, CLI naming normalization, upstream-aligned error handling)

What's Changed


v0.16.0rc1

13 Feb 11:25
75770c9

v0.16.0rc1 Pre-release

This pre-release is an alignment with the upstream vLLM v0.16.0.

Highlights

  • Rebase to Upstream vLLM v0.16.0: vLLM-Omni is now fully aligned with the latest vLLM v0.16.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1357).
  • Tensor Parallelism for Bagel & SD 3.5: Added Tensor Parallelism (TP) support for the Bagel model and Stable Diffusion 3.5, improving inference scalability for these diffusion workloads (#1293, #1336).
  • CFG Parallel Expansion: Extended Classifier-Free Guidance (CFG) parallel support to Bagel and FLUX.1-dev models, enabling faster guided generation (#1310, #1269).
  • Async Scheduling for Chunk IO Overlap: Introduced async scheduling to overlap chunk IO and computation across stages, reducing idle time and improving end-to-end throughput (#951).
  • Diffusion Sequence Parallelism Optimization: Removed redundant communication cost by refining the SP hook design, improving diffusion parallelism efficiency (#1275).
  • ComfyUI Integration: Added a full ComfyUI integration (ComfyUI-vLLM-Omni) as an official app, supporting image generation, multimodal comprehension, and TTS workflows via vLLM-Omni's online serving API (multiple files under apps/ComfyUI-vLLM-Omni/). (#1113)
  • Qwen3-Omni Cudagraph by Default: Enabled cudagraph for Qwen3-Omni by default for improved inference performance (#1352).
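
The CFG-parallel work above distributes the conditional and unconditional branches across ranks; only the cheap elementwise combine step needs both outputs together. The combine is the standard classifier-free guidance formula, shown here as a plain-Python sketch rather than the actual vLLM-Omni kernel.

```python
# Standard classifier-free guidance combine step, the piece that remains
# after the two branches run in parallel (#1310, #1269):
#   guided = uncond + s * (cond - uncond)
def cfg_combine(cond, uncond, guidance_scale):
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 1.0], guidance_scale=3.0)
```

Because the branches are independent until this point, they can be batched together (as in the Bagel multi-branch merge) or placed on separate devices.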

What's Changed

Features & Optimizations

Alignment & Integration

  • Unifying CLI Argument Naming Style by @wtomin in #1309
  • fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
  • [Debug] Add trigger to concurrent stage init by @tzhouam in #1274

Bug Fixes

  • [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
  • [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
  • [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
  • [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
  • [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
  • [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
  • Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
  • [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295

Infrastructure (CI/CD) & Documentation

  • [CI] Run nightly tests. by @congw729 in #1333
  • [CI] Add env variable check for nightly CI by @congw729 in #1281
  • [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
  • [CI] Add CI branch coverage calculation, fix statement coverage results by @yenuo26 in #1120
  • [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
  • [Test] Add example test cases for omni online by @yenuo26 in #1086
  • [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
  • [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
  • [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
  • [DOC] Doc for CI test - Details about five level structure and some other files. by @congw729 in #1167
  • [Bugfix] Fix Doc link Error by @lishunyang12 in #1263
  • update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304

Remaining notes

  • nvidia-cublas-cu12 is pinned to 12.9.1.4 via force-reinstall in Dockerfile.ci, pending updates from the vLLM main repo and PyTorch. pytorch/pytorch#174949
  • Qwen2.5-omni with mixed_modalities input uses only the first frame of the video; this originates from the vLLM main repo: vllm-project/vllm#34506

New Contributors

Full Changelog: v0.15.0rc1...v0.16.0rc1

v0.15.0rc1

03 Feb 09:23
d6f93b0

v0.15.0rc1 Pre-release

This pre-release is an alignment with the upstream vLLM v0.15.0.

Highlights

  • Rebase to Upstream vLLM v0.15.0: vLLM-Omni is now fully aligned with the latest vLLM v0.15.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1159).
  • Tensor Parallelism for LongCat-Image: We have added Tensor Parallelism (TP) support for LongCat-Image and LongCat-Image-Edit models, significantly improving the inference speed and scalability of these vision-language models (#926).
  • TeaCache Optimization: Introduced Coefficient Estimation for TeaCache, further refining the efficiency of our caching mechanisms for optimized generation (#940).
  • Alignment & Stability:
    • Enhanced error handling logic to maintain consistency with upstream vLLM v0.14.0/v0.15.0 standards (#1122).
    • Integrated "Bagel" E2E Smoke Tests and refactored sequence parallel tests to ensure robust CI/CD and accurate performance benchmarking (#1074, #1165).
  • Paper link: An initial arXiv paper introducing our design and some performance test results (#1169).

What's Changed

Features & Optimizations

Alignment & Integration

Infrastructure (CI/CD) & Documentation

New Contributors

Full Changelog: v0.14.0...v0.15.0rc1

v0.14.0

31 Jan 07:31
ed89c8b


Highlights

This release features approximately 180 commits from over 70 contributors (23 new contributors).

vLLM-Omni v0.14.0 is a feature-heavy release that expands Omni’s diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability.

Key Improvements:

  • Async chunk ([#727]): chunk pipeline overlap across stages to reduce idle time and improve end-to-end throughput/latency for staged execution.
  • Stage-based deployment for the Bagel model ([#726]): Multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage), aligning it with the vLLM-Omni architecture.
  • Qwen3-TTS model family support ([#895]): Expands text-to-audio generation and supports online serving.
  • Diffusion LoRA Adapter Support (PEFT-compatible) ([#758]): Adds LoRA fine-tuning/adaptation for diffusion workflows with a PEFT-aligned interface.
  • DiT layerwise (blockwise) CPU offloading ([#858]): Fine-grained offloading to increase memory headroom for larger diffusion runs.
  • Hardware platforms + plugin system ([#774]): Establishes a more extensible platform capability layer for cleaner multi-backend development.
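
The LoRA adapter support above follows the usual low-rank scheme: the frozen weight W is augmented with a low-rank update B @ A, so the forward pass becomes y = W x + scale * B (A x). The pure-Python toy below illustrates that math only; it is not the actual PEFT-compatible adapter interface.

```python
# Minimal LoRA forward sketch (the idea behind #758), pure Python.
def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    base = matvec(W, x)                # frozen base projection W x
    update = matvec(B, matvec(A, x))   # low-rank adapter path B (A x)
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 identity base weight (frozen)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2, trainable)
B = [[0.5], [0.5]]             # rank-1 up-projection (2x1, trainable)
y = lora_forward([2.0, 4.0], W, A, B)
```

Because only A and B are trained, adapters stay small and can be loaded or swapped without touching the base diffusion weights.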

Diffusion & Image/Video Generation

  • Sequence Parallelism (SP) foundations + expansion: Adds a non-intrusive SP abstraction for diffusion models ([#779]), SP support in LongCatImageTransformer ([#721]), and SP support for Wan2.2 diffusion ([#966]).
  • CFG improvements and parallelization: CFG parallel support for Qwen-Image ([#444]), CFG parallel abstraction ([#851]), and online-serving CFG parameter support ([#824]).
  • Acceleration & execution plumbing: Torch compile support for diffusion ([#684]), GPU diffusion runner ([#822]), and diffusion executor ([#865]).
  • Caching and memory efficiency: TeaCache for Z-Image ([#817]) and TeaCache for Bagel ([#848]); plus CPU offloading for diffusion ([#497]) and DiT tensor parallel enablement for diffusion pipeline (Z-Image) ([#735]).
  • Model coverage expansion: Adds GLM-Image support ([#847]), FLUX family additions (e.g., FLUX.1-dev [#853], FLUX.2-klein [#809]) and related TP support ([#973]).
  • Quality/stability fixes for pipelines: Multiple diffusion pipeline correctness fixes (e.g., CFG parsing failure fix [#922], SD3 compatibility fix [#772], video saving bug under certain fps [#893], noisy output without a seed in Qwen Image [#1043]).

Audio & Speech (TTS / Text-to-Audio)

  • Text-to-audio model support: Stable Audio Open support for text-to-audio generation ([#331]).
  • Qwen3-TTS stack maturation: Model series support ([#895]), online serving support ([#968]), plus stabilization fixes such as profile-run hang resolution ([#1082]) and dependency additions for Qwen3-TTS support ([#981]).
  • Interoperability & correctness: Fixes and improvements across audio outputs and model input validation (e.g., StableAudio output standardization [#842], speaker/voices loading from config [#1079]).

Serving, APIs, and Frontend

  • Diffusion-mode service endpoints & compatibility: Adds /health and /v1/models endpoints for diffusion mode and fixes streaming compatibility ([#454]).
  • New/expanded image APIs: /v1/images/edit interface ([#1101]).
  • Online serving usability improvements: Enables tensor_parallel_size argument with online serving command ([#761]) and supports CFG parameters in online serving ([#824]).
  • Batching & request handling: Frontend/model support for batch requests (OmniDiffusionReq refinement) ([#797]).

Performance & Efficiency

  • Qwen3-Omni performance work: SharedFusedMoE integration ([#560]), fused QKV & projection optimizations (e.g., fuse QKV linear and gate_up proj [#734], Talker MTP optimization [#1005]).
  • Attention and kernel/backend tuning: Flash Attention attention-mask support ([#760]), FA3 backend defaults when supported ([#783]), and ROCm performance additions like AITER Flash Attention ([#941]).
  • Memory-aware optimizations: Conditional transformer loading for Wan2.2 to reduce memory usage ([#980]).

Hardware / Backends / CI Coverage

  • Broader backend support: XPU backend support ([#191]) plus the platform/plugin system groundwork ([#774]).
  • NPU & ROCm updates: NPU upgrade alignment ([#820], [#1114]) and ROCm CI expansion / optimization ([#542], [#885], [#1039]).
  • Test reliability / coverage: CI split to avoid timeouts ([#883]) and additional end-to-end / precision tests (e.g., chunk e2e tests [#956]).

Reliability, Correctness, and Developer Experience

  • Stability fixes across staged execution and serving: Fixes for stage config loading issues ([#860]), stage output mismatch in online batching ([#691]), and server readiness wait-time increase for slow model loads ([#1089]).
  • Profiling & benchmarking improvements: Diffusion profiler support ([#709]) plus benchmark additions (e.g., online benchmark [#780]).
  • Documentation refresh: Multiple diffusion docs refactors and new guides (e.g., profiling guide [#738], torch profiler guide [#570], diffusion docs refactor [#753], ROCm instructions updates [#678], [#905]).

What's Changed

  • [Docs] Fix diffusion module design doc by @SamitHuang in #645
  • [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
  • [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
  • Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
  • [Misc] Merge diffusion forward context by @iwzbi in #582
  • [Doc] User guide for torch profiler by @lishunyang12 in #570
  • [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
  • [BugFix] token2wav code out of range by @Bounty-hunter in #655
  • [Doc] Update version 0.12.0 by @ywang96 in #662
  • [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
  • [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
  • [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
  • [BUGFIX] Delete the CUDA context in the stage process. by @fake0fan in #661
  • [Misc] Fix docs display problem of streaming mode and other related issues by @Gaohan123 in #667
  • [Model] Add Stable Audio Open support for text-to-audio generation by @linyueqian in #331
  • [Doc] Update ROCm getting started instruction by @tjtanaa in #678
  • [Bugfix] Fix f-string formatting in image generation pipelines by @ApsarasX in #689
  • [Bugfix] Solve Ulysses-SP sequence length not divisible by SP degree (using padding and attention mask) by @wtomin in #672
  • omni entrypoint support tokenizer arg by @divyanshsinghvi in #572
  • [Bug fix] fix e2e_total_tokens and e2e_total_time_ms by @LJH-LBJ in #648
  • [BugFix] Explicitly release file locks during stage worker init by @yuanheng-zhao in #703
  • [BugFix] Fix stage engine outputs mismatch bug in online batching by @ZeldaHuang in #691
  • [core] add torch compile for diffusion by @ZJY0516 in #684
  • [BugFix] Remove duplicate width assignment in SD3 pipeline by @dongbo910220 in #708
  • [Feature] Support Qwen3 Omni talker cudagraph by @ZeldaHuang in #669
  • [Benchmark] DiT Model Benchmark under Mixed Workloads by @asukaqaq-s in #529
  • update design doc by @hsliuustc0106 in #711
  • [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni by @gcanlin in #560
  • [Doc]: update vllm serve param and base64 data truncation by @nuclearwu in #718
  • [BugFix] Fix assuming all stage model have talker by @princepride in #730
  • [Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj by @gcanlin in #734
  • [Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) by @dongbo910220 in #735
  • [Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker by @LJH-LBJ in #697
  • [ROCm] [CI] Add More Tests by @tjtanaa in #542
  • [Docs] update design doc templated in RFC by @hsliuustc0106 in #746
  • Add description of code version for bug report by @yenuo26 in #745
  • [misc] fix rfc template by @hsliuustc0106 in https://github.com...

v0.14.0rc1

22 Jan 15:22
a9012a1

v0.14.0rc1 Pre-release

Highlights (vllm-omni v0.14.0rc1)

This release candidate includes approximately 90 commits from 35 contributors (12 new contributors).

This release candidate focuses on diffusion runtime maturity, Qwen-Omni performance, and expanded multimodal model support, alongside substantial improvements to serving ergonomics, profiling, ROCm/NPU enablement, and CI/docs quality. In addition, this is the first vllm-omni rc version with Day-0 alignment with vLLM upstream.

Model Support

  • TTS: Added support for the Qwen3-TTS (Day-0) model series. (#895)
  • Diffusion / image families: Added Flux.2-klein (Day-0) and GLM-Image (Day-0), plus multiple Qwen-Image family correctness/perf improvements. (#809, #868, #847)
  • Bagel ecosystem: Added Bagel model support and Cache-DiT support. (#726, #736)
  • Text-to-audio: Added Stable Audio Open support for text-to-audio generation. (#331)

Key Improvements

  • Qwen-Omni performance & serving enhancements

    • Improved Qwen3-Omni throughput with vLLM SharedFusedMoE, plus additional kernel/graph optimizations:

      • SharedFusedMoE integration (#560)
      • QKV linear + gate_up projection fusion (#734)
      • Talker cudagraph support and MTP batch inference for Qwen3-Omni talker (#669, #722)
      • Optimized thinker-to-talker projection path (#825)
    • Improved online serving configurability:

      • omni entrypoint tokenizer argument support (#572)
      • Enable tensor_parallel_size for online serving command (#761)
      • Grouped omni arguments into OmniConfig for cleaner UX (#744)
  • Diffusion runtime & acceleration upgrades

    • Added sleep / wake_up / load_weights lifecycle controls for Omni Diffusion, improving operational flexibility for long-running services. (#376)
    • Introduced torch.compile support for diffusion to improve execution efficiency on supported setups. (#684)
    • Added a GPU Diffusion Runner and Diffusion executor, strengthening the core execution stack for diffusion workloads. (#822, #865)
    • Enabled TeaCache acceleration for Z-Image diffusion pipelines. (#817)
    • Defaulted to FA3 (FlashAttention v3) when supported, and extended FlashAttention to support attention masks. (#783, #760)
    • Added CPU offloading support for diffusion to broaden deployment options under memory pressure. (#497)
  • Parallelism and scaling for diffusion pipelines

    • Added CFG parallel support for Qwen-Image and introduced CFG parameter support in online serving. (#444, #824)
    • Enabled DiT tensor parallel for Z-Image diffusion pipeline and extended TP support for qwen-image with test refactors. (#735, #830)
    • Implemented Sequence Parallelism (SP) abstractions for diffusion, including SP support in LongCatImageTransformer. (#779, #721)
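
The sequence-parallel split above can be sketched in a few lines: the sequence is padded so its length divides the SP degree (the case fixed for Ulysses-SP in #672), then each rank takes one contiguous shard. This is toy code illustrating the sharding rule, not the actual SP hook implementation.

```python
# Sketch of sequence-parallel sharding with padding (idea behind #779/#672).
def sp_shard(tokens, sp_degree, pad_token=0):
    remainder = len(tokens) % sp_degree
    if remainder:
        # pad so the sequence length divides the SP degree
        tokens = tokens + [pad_token] * (sp_degree - remainder)
    shard_len = len(tokens) // sp_degree
    return [tokens[i * shard_len:(i + 1) * shard_len]
            for i in range(sp_degree)]

shards = sp_shard([1, 2, 3, 4, 5], sp_degree=2)
```

In the real pipeline an attention mask accompanies the padding so the pad positions do not affect the result.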

Stability, Tooling, and Platform

  • Correctness & robustness fixes across diffusion and staged execution:

    • Fixed diffusion model load failure when stage config is present (#860)
    • Fixed stage engine outputs mismatch under online batching (#691)
    • Fixed CUDA-context lifecycle issues and file-lock handling in stage workers (#661, #703)
    • Multiple model/pipeline fixes (e.g., SD3 compatibility, Wan2.2 warmup/scheduler, Qwen2.5-Omni stop behavior). (#772, #791, #804, #773)
  • Profiling & developer experience

    • Added Diffusion Profiler support, plus user guides for diffusion profiling and torch profiler usage. (#709, #738, #570)
  • ROCm / NPU / CI

    • Enhanced ROCm CI coverage, optimized ROCm Dockerfile build time, and refreshed ROCm getting-started documentation. (#542, #885, #678)
    • CI reliability improvements (pytest markers, split tests to avoid timeouts). (#719, #883)

Note: The NPU AR functionality is currently unavailable and will be supported in the official v0.14.0 release.

What's Changed

  • [Docs] Fix diffusion module design doc by @SamitHuang in #645
  • [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
  • [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
  • Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
  • [Misc] Merge diffusion forward context by @iwzbi in #582
  • [Doc] User guide for torch profiler by @lishunyang12 in #570
  • [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
  • [BugFix] token2wav code out of range by @Bounty-hunter in #655
  • [Doc] Update version 0.12.0 by @ywang96 in #662
  • [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
  • [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
  • [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
  • [BUGFIX] Delete the CUDA context in the sta...

v0.12.0rc1

05 Jan 11:17
e7eeb54

v0.12.0rc1 Pre-release

vLLM-Omni v0.12.0rc1 Pre-Release Notes

Highlights

This release features 187 commits from 45 contributors (34 new contributors)!

vLLM-Omni v0.12.0rc1 is a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm). It also rebases on vLLM v0.12.0 for better alignment with upstream (#335).

Breaking / Notable Changes

  • Unified diffusion stage naming & structure: cleaned up legacy Diffusion* paths and aligned on Generation*-style stages to reduce duplication (#211, #163).
  • Safer serialization: switched OmniSerializer from pickle to MsgPack (#310).
  • Dependency & packaging updates: e.g., bumped diffusers to 0.36.0 (#313) and refreshed Python/formatting baselines for the v0.12 release (#126).

Diffusion Engine: Architecture + Performance Upgrades

  • Core refactors for extensibility: diffusion model registry refactored to reuse vLLM’s ModelRegistry (#200), improved diffusion weight loading and stage abstraction (#157, #391).

  • Acceleration & parallelism features:

    • Cache-DiT with a unified cache backend interface (#250)
    • TeaCache integration and registry refactors (#179, #304, #416)
    • New/extended attention & parallelism options: Sage Attention (#243), Ulysses Sequence Parallelism (#189), Ring Attention (#273)
    • torch.compile optimizations for DiT and RoPE kernels (#317)

Serving: Stronger OpenAI Compatibility & Online Readiness

  • DALL·E-compatible image generation endpoint (/v1/images/generations) (#292), plus online serving fixes for image generation (#499).
  • Added OpenAI create speech endpoint (#305).
  • Per-request modality control (output modality selection) (#298) with API usage examples (#411).
  • Early support for streaming output (#367), request abort (#486), and request-id propagation in responses (#301).
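
Because the image endpoint is DALL·E-compatible, a request body can follow the familiar OpenAI images schema. The sketch below builds the payload only; the model id is a placeholder and the exact set of supported fields is an assumption based on the OpenAI API shape, not a verified vLLM-Omni contract.

```python
# Sketch of a DALL·E-style request body for /v1/images/generations (#292).
# Model id and response_format are illustrative placeholders.
import json

def images_request_body(prompt, n=1, size="1024x1024"):
    return json.dumps({
        "model": "Qwen/Qwen-Image",      # placeholder model id
        "prompt": prompt,
        "n": n,                          # number of images
        "size": size,                    # "WIDTHxHEIGHT"
        "response_format": "b64_json",   # inline base64 image payloads
    })

body = images_request_body("an astronaut riding a horse")
```

POSTing this body to a running vLLM-Omni server's /v1/images/generations endpoint should work with unmodified OpenAI-style clients.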

Omni Pipeline: Multi-stage Orchestration & Observability

  • Improved inter-stage plumbing: customizable processing between stages and reduced coupling on request_ids in model forward paths (#458).
  • Better observability and debugging: torch profiler across omni stages (#553), improved traceback reporting from background workers (#385), and logging refactors (#466).

Expanded Model Support (Selected)

  • Qwen-Omni / Qwen-Image family:

    • Qwen-Omni offline inference with local files (#167)
    • Qwen-Image-2512 support (#547)
    • Qwen-Image-Edit support, including multi-image input variants and newer releases (Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Edit-2511) (#196, #330, #321)
    • Qwen-Image-Layered model support (#381)
    • Multiple fixes for Qwen2.5/Qwen3-Omni batching, examples, and OpenAI sampling parameter compatibility (#451, #450, #249)
  • Diffusion / video ecosystem:

    • Z-Image support and kernel fusions (#149, #226)
    • Stable Diffusion 3 support (#439)
    • Wan2.2 T2V plus I2V/TI2V pipelines (#202, #329)
    • LongCat-Image and LongCat-Image-Edit support (#291, #392)
    • Ovis Image model addition (#263)
    • Bagel (diffusion-only) and image-edit support (#319, #588)

Platform & CI Coverage

  • ROCm / AMD: documented ROCm setup (#144) and added ROCm Dockerfile + AMD CI (#280).
  • NPU: added NPU CI workflow (#231) and expanded NPU support for key Omni models (e.g., Qwen3-Omni, Qwen-Image series) (#484, #463, #485), with ongoing cleanup of NPU-specific paths (#597).
  • CI and packaging improvements: diffusion CI, wheel compilation, and broader UT/E2E coverage (#174, #288, #216, #168).

What's Changed
