Releases: vllm-project/vllm-omni
v0.19.0rc1
Highlights
This release features 71 commits since v0.18.0.
vLLM-Omni v0.19.0rc1 is a rebase-and-production-readiness release candidate aligned with upstream vLLM v0.19.0. It strengthens the runtime and serving stack, expands speech/TTS and diffusion/video capabilities, improves production behavior for Bagel and Wan pipelines, and broadens deployment coverage across new platforms and distributed execution modes.
Key Improvements
- Rebased to upstream vLLM v0.19.0, while continuing runtime cleanup and stage execution refactors that improve orchestration and production robustness. (#2475, #2006)
- Expanded speech and TTS serving, including new OmniVoice two-stage support, CosyVoice3 online serving, and multiple Qwen3-TTS / Fish Speech quality and latency fixes. (#2463, #2431, #2108, #2446, #2378, #2358)
- Improved diffusion and video generation workflows across Bagel, Wan2.2, FLUX.2-dev, and LTX-2, with lower latency, better forwarding behavior, and stronger production correctness. (#2398, #2422, #2397, #2381, #2459, #2393, #2433, #2260)
- Broadened deployment coverage, adding MUSA platform support, improving XPU readiness, and extending distributed diffusion features such as HSDP and CFG parallelism. (#2337, #2428, #2029, #2021, #1751)
Core Architecture & Runtime
- Rebased the project to upstream vLLM v0.19.0, keeping vLLM-Omni aligned with the latest upstream runtime behavior and APIs. (#2475)
- Continued the stage/runtime refactor by moving stage-side inference into dedicated subprocess-based clients and procs, simplifying orchestration and improving isolation for both AR and diffusion stages. (#2006)
- Added session-based streaming audio input with a realtime WebSocket path for Qwen3-Omni-style workflows, enabling incremental audio input and streamed transcription/output flows. (#2208)
- Added a nightly wheel release index, making it easier to validate and consume nightly builds in testing and pre-release workflows. (#2345)
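The realtime WebSocket path accepts audio incrementally within a session. As a rough sketch of the client-side chunking this enables (the message shape below is a hypothetical illustration, not the exact vLLM-Omni wire format):

```python
import base64
import json


def pcm_chunks_to_messages(pcm: bytes, chunk_size: int = 3200):
    """Yield JSON messages that append base64-encoded audio chunks to a
    realtime session. The "input_audio_buffer.append" message type is an
    assumption modeled on common realtime audio APIs."""
    for i in range(0, len(pcm), chunk_size):
        chunk = pcm[i:i + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded message would be sent over the open WebSocket, letting the server begin transcription before the full utterance arrives.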
Model Support
- Added OmniVoice two-stage TTS serving support, bringing zero-shot multilingual speech generation into the vLLM-Omni serving stack. (#2463)
- Added and stabilized CosyVoice3 online serving through /v1/audio/speech, including stage config fixes and CI coverage. (#2431)
- Added LTX-2 distilled two-stage inference for both text-to-video and image-to-video production workflows. (#2260)
- Added Wan 2.1 VACE support for conditional video generation workflows, including multiple conditioning modes. (#1885)
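TTS models such as CosyVoice3 are served through the OpenAI-compatible /v1/audio/speech route. A minimal client sketch (the model name, voice, and server URL below are placeholder assumptions):

```python
import json
import urllib.request


def build_speech_request(model: str, text: str, voice: str = "default",
                         response_format: str = "wav") -> dict:
    # Field names follow the OpenAI audio.speech request shape;
    # vLLM-Omni-specific extensions may add further fields.
    return {"model": model, "input": text, "voice": voice,
            "response_format": response_format}


def synthesize(base_url: str, body: dict) -> bytes:
    # POST the request body and return the raw audio bytes.
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


if __name__ == "__main__":
    body = build_speech_request("CosyVoice3", "Hello from vLLM-Omni!")
    # audio = synthesize("http://localhost:8000", body)  # needs a running server
```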
Audio, Speech & Omni Production Optimization
- Improved Qwen3-TTS repeated custom-voice serving by introducing an in-memory voice cache for reference-audio artifacts, reducing warm-request latency for repeated voices. (#2108)
- Fixed a Fish Speech structured voice-clone conditioning regression so cloned voice quality is restored in the prefill path. (#2446)
- Fixed Qwen3-TTS chunk-boundary handling, case-insensitive speaker lookup, and demo-serving issues to make TTS behavior more reliable in real deployments. (#2378, #2358, #2372)
- Added better benchmark support for Qwen3-TTS Base and VoiceDesign models so serving and HF benchmark paths correctly reflect task-specific request formats. (#2411)
Diffusion, Image & Video Generation
- Improved Wan2.2 runtime efficiency by optimizing rotary embedding behavior and skipping unnecessary cross-attention Ulysses SP paths where appropriate. (#2393, #2459)
- Strengthened Bagel production behavior with earlier KV-ready forwarding, fixes for delayed decoding in AR/DiT workflows, proper single-stage img2img routing, and a dedicated single-stage config. (#2398, #2422, #2397, #2381)
- Added Bagel thinking mode in multi-stage serving, expanding interactive and reasoning-style generation workflows. (#2447)
- Fixed FLUX.2-dev guidance handling so guidance scale is applied correctly during generation. (#2433)
- Added a synchronous /v1/videos/sync endpoint for latency-sensitive benchmarking and direct-response video generation workflows. (#2049)
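Unlike an async job-based video API, the synchronous route returns the generated video directly in the response. A hedged request-body sketch (parameter names here are assumptions, not the server's documented schema):

```python
def build_sync_video_request(model: str, prompt: str,
                             num_frames: int = 81, fps: int = 16) -> dict:
    """Build a hypothetical body for POST /v1/videos/sync; the accepted
    fields may differ (e.g. size, guidance_scale, seed)."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return {"model": model, "prompt": prompt,
            "num_frames": num_frames, "fps": fps}
```

Because the response is the video itself rather than a job id, this path suits benchmarking where end-to-end latency is the metric of interest.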
Quantization & Memory Efficiency
- Added offline AutoRound W4A16 support for diffusion models, improving deployability for memory-constrained setups. (#1777)
- Fixed layer-wise offload incompatibility with HSDP, improving compatibility between memory-saving and distributed execution paths. (#2021)
Platforms, Distributed Execution & Hardware Coverage
- Added MUSA platform support for Moore Threads GPUs, expanding vLLM-Omni beyond the existing CUDA/ROCm/NPU/XPU coverage. (#2337)
- Improved XPU readiness for speech serving by removing CUDA-only assumptions in Voxtral TTS components and adding an XPU stage config. (#2428)
- Expanded distributed diffusion support with HSDP for Qwen-Image-series, Z-Image, and GLM-Image, and added CFG parallel support for HunyuanImage3.0. (#2029, #1751)
- Fixed distributed gather behavior for non-contiguous tensors, improving correctness in CFG-parallel and related distributed paths. (#2367)
CI, Benchmarks & Documentation
- Refreshed the diffusion documentation structure around feature compatibility, parallelism, cache acceleration, quantization, and serving examples, making the diffusion stack easier to navigate and adopt.
- Expanded CI and E2E coverage for speech, diffusion, and video-serving scenarios, especially around CosyVoice3, Qwen3-TTS benchmarking, and Wan-family validation. (#2431, #2411, #2262)
Note
- v0.19.0rc1 is a release candidate focused on validating the upstream rebase, the refreshed runtime architecture, and the expanded speech/diffusion/platform support before the final v0.19.0 release.
- Some low-signal CI and documentation maintenance changes were intentionally merged into broader themes instead of being listed one-by-one, following the project's recent release-note style.
What's Changed
- [Bugfix][HunyuanImage3.0] Fix default guidance_scale from 1.0 to 4.0 and port GPU MoE ForwardContext fix from NPU by @nussejzz in #2142
- [Feat] support quantization for Flux Kontext by @RuixiangMa in #2184
- [Tests][Qwen3-Omni] Add performance test cases by @amy-why-3459 in #2011
- [Docs] Modify the documentation description for streaming output by @amy-why-3459 in #2300
- Fix: Enable /v1/models endpoint for pure diffusion mode by @majiayu000 in #805
- [skip ci] [Docs]: add CI Failures troubleshooting guide for contributors by @lishunyang12 in #1259
- [Qwen3-Omni][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor by @LJH-LBJ in #2291
- [Feature] [HunyuanImage3] Add TeaCache support for inference acceleration by @nussejzz in #1927
- [Misc] Make gradio an optional dependency and upgrade to >=6.7.0 by @Lidang-Jiang in #2221
- [ROCm] [CI] Migrate to use amd docker hub for ci by @tjtanaa in #2303
- [Feat] add helios fp8 quantization by @lengrongfu in #1916
- [Bugfix] fix: handle Qwen-Image-Layered layered RGBA output for jpeg edits by @david6666666 in #2297
- [Doc] Add transformers version requirement in GLM-Image example doc by @chickeyton in #2265
- [Bugfix] Fix Qwen3TTSConfig init order to be compatible with newer Transformers (5.x) by @RuixiangMa in #2306
- [Test] Add Qwen-tts test cases and unify the style of existing test cases by @yenuo26 in #2195
- [skip ci][Doc] Refine the Diffusion Features User Guide by @wtomin in #1928
- [Bugfix] fix: return 400 for unsupported multi-image edits such as Qwen-Image-Layered by @david6666666 in #2298
- [Bugfix] fix: validate layered image layers range by @david6666666 in #2334
- [skip ci][Docs] reorganize multiple L4 test guidelines by @fhfuih in #2119
- [Diffusion] Refactor CFG parallel for extensibility and performance by @TKONIY in #2063
- Fix Qwen3-TTS Base on NPU running failed by @OrangePure in #2353
- [Test] Fix 4 broken Qwen3-TTS async chunk unit tests by @linyueqian in #2351
- [Test] Add qwen3-omni tests for audio_in_video and one word prompt by @yenuo26 in #2097
- [CI] fix test: use minimum supported layered output count by @david6666666 in #2350
- [CI]test: add wan22 i2v video similarity e2e by @david6666666 in #2262
- [Bugfix] Fix case-sensitivity in Qwen3 TTS speaker name lookup by @reidliu41 in #2358
- Fix Qwen3-TTS gradio demo by @noobHappylife in #2372
- [skip ci] update release 0.18.0 by @hsliuustc0106 in #2380
- [Bugfix] Update Whisper model loading to support multi-GPU ...
v0.18.0
Highlights
This release features 324 commits from 83 contributors, including 38 new contributors.
vLLM-Omni v0.18.0 is a major rebase and systems release that aligns the project with upstream vLLM v0.18.0, strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.
Key Improvements
- Rebased to upstream vLLM v0.18.0, with follow-up updates to docs and dockerfiles, plus cleanup of patches that were no longer needed after the rebase. (#2037, #2038, #2062, #2271)
- Refactored the serving entrypoint architecture, making the stack cleaner and easier to extend, while also laying groundwork for PD disaggregation, multimodal output decoupling, coordinator-based orchestration, and pipeline config cleanup. (#1908, #1863, #1816, #1465, #1115)
- Strengthened audio, speech, and omni production serving, especially for Qwen3-TTS, Qwen3-Omni, MiMo-Audio, Fish Speech S2 Pro, and Voxtral TTS, with lower latency, better concurrency, more robust streaming, and improved online serving stability. (#1583, #1617, #1797, #1913, #1985, #1852, #1656, #1963, #2009, #2019, #2239, #1688, #1752, #1964, #2225, #1859, #2145, #2151, #2156, #2158)
- Delivered substantial diffusion optimization, with scheduler/executor refactoring, faster startup, better cache-dit / TeaCache integration, broader TP/SP/HSDP support, and multiple correctness fixes for online and offline serving. (#1625, #1504, #1715, #1834, #1848, #1234, #2163, #1979, #2101, #2176)
- Expanded model support across omni, speech, image, and video, including Helios, Helios-Mid / Distilled, MammothModa2, Fun CosyVoice3-0.5B-2512, FLUX.2-dev, FLUX.1-Kontext-dev, Hunyuan Image3 AR, Fish Speech S2 Pro, Voxtral TTS, DreamID-Omni, LTX-2, and HunyuanVideo-1.5. (#1604, #1648, #336, #498, #1629, #561, #759, #1798, #1803, #1855, #841, #1516)
- Introduced a unified quantization framework and expanded quantization support across diffusion and image workloads, including INT8, FP8, and GGUF-related enablement. (#1764, #1470, #1640, #1755, #1473, #2180)
- Improved RL and custom pipeline readiness in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training, including collective RPC support at the entrypoint, custom input/output support, async batching for Qwen-Image, and dedicated E2E coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
Core Architecture & Runtime
- Reworked the core serving architecture through the vLLM-Omni Entrypoint Refactoring, while also adding PD disaggregation scaffolding, coordinator support, multimodal output decoupling foundations, and cleaner model/pipeline configuration handling. (#1908, #1863, #1465, #1816, #1115, #1958, #2105)
- Continued cleanup of runtime internals with stage/step pipeline refactors, dead-code cleanup, and improvements to async engine robustness and scheduler state handling. (#1368, #1579, #2153, #2028, #1893)
Model Support
- Omni / speech / audio models: added or expanded support for MammothModa2, Fun CosyVoice3-0.5B-2512, Fish Speech S2 Pro, and Voxtral TTS. (#336, #498, #1798, #1803)
- Image / diffusion models: added or expanded support for Hunyuan Image-3.0, FLUX.2-dev, FLUX.1-Kontext-dev, and continued improvements for Qwen-Image, Qwen-Image-Edit, Qwen-Image-Layered, LongCat-Image, GLM-Image, Bagel, and OmniGen2. (#759, #1629, #561, #1682, #2085, #1970, #2035, #1918, #1578, #1669, #1903, #1711, #1934)
- Video models: added or expanded support for Helios, Helios-Mid / Distilled, DreamID-Omni, LTX-2, HunyuanVideo-1.5, and updated supported video-generation coverage for Wan2.1-T2V. (#1604, #1648, #1855, #841, #1516, #1920)
Audio, Speech & Omni Production Optimization
- Qwen3-TTS received major optimization work, including lower TTFA, better high-concurrency throughput, improved Code Predictor / Code2Wav execution, websocket streaming audio output, async scheduling by default, voice upload support, optional ref_text, and long ref_audio handling fixes. (#1583, #1617, #1797, #1913, #1985, #1852, #1719, #1853, #1201, #1879, #2046, #2104)
- Qwen3-Omni gained lower inter-packet latency, speaker-switching support, decode-alignment fixes, and multiple correctness fixes for answer quality and online serving stability. (#1656, #1963, #2009, #2019, #2239)
- MiMo-Audio improved compatibility and production robustness with TP fixes, broader attention backend support, configurable chunk sizing, and documentation to prevent noise-only outputs under unsupported attention setups. (#1688, #1752, #1964, #2225, #2205)
- Fish Speech S2 Pro and Voxtral TTS were productionized further with online serving, voice cloning, better TTFP / inference performance, multilingual demo support, lighter flow matching, and voice-embedding fixes. (#1798, #1859, #2145, #1803, #2045, #2056, #2067, #2151, #2156, #2158, #2023)
- Added or improved speech-serving interfaces, including a speech batch entrypoint, speaker embedding support for the speech and voices APIs, proper HTTP status handling, and streaming wav response support. (#1701, #1227, #1687, #1819)
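Streaming a wav response means the RIFF header must be emitted before the total payload length is known; a common trick is to write placeholder sizes up front. A sketch of that idea (illustrative only, not vLLM-Omni's actual implementation):

```python
import struct


def wav_stream_header(sample_rate: int = 24000, channels: int = 1,
                      bits: int = 16) -> bytes:
    """Build a 44-byte PCM WAV header with placeholder (0xFFFFFFFF) chunk
    sizes, so audio data can be streamed before its length is known."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (
        b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"      # RIFF chunk
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels,   # PCM format
                                sample_rate, byte_rate, block_align, bits)
        + b"data" + struct.pack("<I", 0xFFFFFFFF)              # data chunk
    )
```

Most players tolerate the placeholder sizes, which is what makes chunked wav responses over HTTP practical.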
Diffusion, Image & Video Generation
- Runtime refactor & benchmarking: Refactored the diffusion runtime with cleaner scheduler/executor boundaries, better request-state flow, unified profiling, and stronger benchmarking infrastructure. (#1625, #2099, #1757, #1917, #1995)
- Performance & startup gains: Improved diffusion performance through multi-threaded weight loading for Wan2.2, reduced IPC overhead for single-stage serving, cache-dit upgrades, TeaCache support, and nightly performance improvements for Qwen-Image. (#1504, #1715, #1834, #1234, #1314, #1805, #2111)
- Distributed scaling: Expanded distributed diffusion execution with broader TP/SP/HSDP support across Flux, GLM-Image, Hunyuan, and Bagel. (#1250, #1900, #1918, #2163, #1903)
- Serving UX & API ergonomics: Improved serving usability with a progress bar for diffusion models, richer image-edit parameters such as layers and resolution, and extra request-body support for video APIs. (#1652, #2053, #1955)
- Correctness & stability fixes: Fixed a wide range of diffusion correctness issues, including config misalignment between offline and online inference, TP/no-seed broken-image issues, GLM-Image stage/device bugs, and TeaCache incompatibilities. (#1979, #2176, #2137, #2101, #1894, #2025)
Quantization & Memory Efficiency
- Added the Unified Quantization Framework as a core infrastructure upgrade for more consistent quantized execution across model families. (#1764)
- Expanded quantization support for diffusion/image workloads, including INT8 for DiT (Z-Image and Qwen-Image), FP8 for Flux transformers, and GGUF adapter support for Qwen-Image. (#1470, #1640, #1755)
- Improved compatibility between quantization and runtime features such as CPU offload, tensor parallelism, and Flux-family execution. (#1473, #1723, #1978, #2180)
RL, Serving & Integrations
- verl collaboration & Qwen-Image E2E RL: Expanded RL-oriented serving in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training with collective RPC support, custom input/output, async batching for Qwen-Image, and dedicated E2E CI coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
- Rollout scaling for visual RL: Added rollout building blocks referenced by verl’s Qwen-Image integration plan, including async batching for Qwen-Image plus tensor-parallel and data-parallel support for diffusion serving. (#1593, #1713, #1706)
- Deployment & ecosystem integrations: Improved deployment and ecosystem integration with a Helm chart for Kubernetes, ComfyUI video & LoRA support, and a rewritten async video API lifecycle. (#1337, #1596, #1665)
Platforms, Distributed Execution & Hardware Coverage
- Continued improving portability across CUDA, ROCm, NPU, and XPU/Intel GPU environments, including rebase follow-ups, ROCm CI setup, Intel CI dispatch, Intel GPU docs, and NPU docker/docs refreshes. (#2017, #1984, #1721, #2154, #2271, #2091)
- Expanded distributed execution coverage with T5 tensor parallelism, more model-level TP/SP/HSDP support, and better handling of visible GPUs and stage-device initialization. (#1881, #1250, #1900, #1918, #2163, #2025)
CI, Benchmarks & Documentation
- Strengthened release engineering and CI with a release pipeline, richer nightly benchmark/report generation, L3/L4/L5 test layering, expanded model E2E coverage, and stronger diffusion test coverage. (#1726, #1831, #1995, #1514, #1799, #2086, #1869, #2085, #2087, #2132, #2129, #2023)
- Improved benchmarking with Qwen3-TTS benchmark scripts, nightly Qwen3-TTS and Qwen-Image performance tracking, diffusion timing, random benchmark datasets, and T2I/I2I accuracy benchmark integration. (#1573, #1700, #1805, #2111, #1757, #1657, #1917)
- Refreshed project docs across installation, omni/TTS docs, diffusion serving parameters, UAA documentation, developer guides, and governance. (#1762, #1693, #2051, #2130, #2148, #1889)
Note
- GLM-Image requires manually upgrading the transformers version to >= 5.0.
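Since the upgrade is manual, a small runtime guard can catch a stale install early (a helper sketch, not part of vLLM-Omni):

```python
from importlib.metadata import PackageNotFoundError, version


def transformers_major_ok(min_major: int = 5) -> bool:
    """Return True if the installed transformers major version is
    at least min_major, False if it is older or not installed."""
    try:
        return int(version("transformers").split(".")[0]) >= min_major
    except (PackageNotFoundError, ValueError):
        return False
```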
What's Changed
v0.18.0rc1
Highlights
This release features approximately 120 commits across 120+ pull requests from 50+ contributors, including 13 new contributors.
Expanded Model Support
This release continues to grow the multimodal model ecosystem with several major additions:
- Added FLUX.2-dev image generation model (#1629).
- Added Bagel multistage img2img support (#1669).
- Added HunyuanVideo-1.5 text-to-video and image-to-video support (#1516).
- Added Voxtral TTS model (#1803, #2026, #2056).
- Added Fish Speech S2 Pro with online serving and voice cloning (#1798).
- Added Dreamid-Omni from ByteDance (#1855).
- Extended NPU support for HunyuanImage3 diffusion model (#1689).
- Added OmniGen2 transformer config loading for HF models (#1934).
Performance Improvements
Multiple optimizations improve throughput, latency, and runtime efficiency:
- Qwen3-Omni code predictor re-prefill + SDPA to eliminate decode hot-path CPU round-trips (#2012).
- Qwen3-TTS high-concurrency throughput & latency boost (#1852).
- Qwen3-TTS Code2Wav triton SnakeBeta kernel and CUDA Graph support (#1797).
- Qwen3-TTS CodePredictor torch.compile with reduce-overhead and dynamic=False (#1913).
- Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls (#1985).
- Simple dynamic TTFA based on Code2Wav load for Qwen3-TTS (#1714).
- Enabled async_scheduling by default for Qwen3-TTS (#1853).
- Fish Speech S2 Pro inference performance improvements (#1859).
- Fix slow hasattr in CUDAGraphWrapper.getattr (#1982).
- Diffusion timing profiling improvements (#1757).
Inference Infrastructure & Parallelism
New infrastructure capabilities improve scalability and production readiness:
- Model Pipeline Configuration System refactor (Part 1) (#1115).
- vLLM-Omni entrypoint refactoring for cleaner startup flow (#1908).
- Expert parallel for diffusion MoE layers (#1323).
- Sequence parallelism (SP) support for FLUX.2-klein (#1250) and HSDP for Flux family (#1900).
- T5 Tensor Parallelism support (#1881).
- LongCat Sequence Parallelism refactored to use SP Plan (#1772).
- PD disaggregation scaffolding (Split #1303 Part 1) (#1863).
- Coordinator module with unit tests (#1465).
- Refactored pipeline stage/step pipeline (#1368).
- Helm Chart to deploy vLLM-Omni on Kubernetes (#1337).
Text-to-Speech Improvements
Major TTS pipeline improvements for streaming, quality, and new models:
- Streaming audio output via WebSocket for Qwen3-TTS (#1719).
- Gradio demo for Qwen3-TTS online serving (#1231).
- Added wav response_format when stream is true in /v1/audio/speech (#1819).
- Fixed Base voice clone streaming quality and stop-token crash (#1945).
- Fixed streaming initial chunk — removed dynamic initial chunk, compute only on initial request (#1930).
- Preserved ref_code decoder context for Base ICL in Qwen3-TTS (#1731).
- Restored voice upload API and profiler endpoints reverted by #1719 (#1879).
- BugFix for CodePredictor CudaGraph Pool (#2059).
Quantization & Hardware Support
- Int8 quantization support for DiT (Z-Image & Qwen-Image) (#1470).
- Added cache-dit support for HunyuanImage3 (#1848) and Flux.2-dev (#1814).
- Enabled CPU offloading and Cache-DiT together on diffusion models (#1723).
- Upgraded cache-dit from 1.2.0 to 1.3.0 (#1834).
- NPU upgrade to v0.17.0 (#1890).
- Updated Bagel modeling to remove CUDA hardcode and added XPU stage_config (#1931).
- Updated GpuMemoryMonitor to DeviceMemoryMonitor for all hardware (#1526).
- ROCm bugfix for device environment issues and CI setup (#1984, #2017).
- Intel CI dispatch in Buildkite folder (#1721).
Frontend & Serving
- ComfyUI video & LoRA support (#1596).
- Rewrote video API for async job lifecycle (#1665).
- Fix /chat/completion not reading extra_body for diffusion models (#2042).
- Fix online server returning multiple images (#2007).
- Fix Ovis Image crash when guidance_scale is set without negative_prompt (#1956).
- Fix config misalignment between offline and online diffusion inference (#1979).
Reliability, Tooling & Developer Experience
- OmniStage.try_collect() patched with process alive checks (#1560) and Ray alive checks (#1561).
- Nightly Buildkite Pytest test case statistics with HTML report by email (#1674).
- Nightly Benchmark HTML generator and updated EXCEL generator (#1831).
- Added multimodal processing correctness tests for Omni models (#1445).
- Added Qwen3-TTS nightly performance benchmark (#1700) and benchmark scripts (#1573).
- Added Governance section (#1889).
- Rebase to vllm v0.18.0 (#2037, #2038).
- Numerous bug fixes across models, configuration, parallelism, and CI pipelines.
What's Changed
- [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
- [Bagel]: Support multistage img2img by @princepride in #1669
- [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
- [Doc] CLI Args Naming Style Correction by @wtomin in #1750
- [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
- [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
- Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint by @ekagra-ranjan in #1255
- [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
- [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
- [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
- Revert "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint" by @hsliuustc0106 in #1789
- [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
- [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
- [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
- [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
- [Docs] update async chunk performance diagram by @R2-Y in #1741
- [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
- [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
- [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
- update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
- Add coordinator module and corresponding unit test by @NumberWan in #1465
- [Model]: add FLUX.2-dev model by @nuclearwu in #1629
- [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
- [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
- [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
- [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
- [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
- [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
- [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
- Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
- [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
- Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
- [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
- [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
- [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
- [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
- [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever...
v0.17.0rc1
Highlights
This release features approximately 70 commits across 72 pull requests from 30+ contributors, including 12 new contributors.
Expanded Model Support
This release significantly expands the supported multimodal model ecosystem:
- Added support for Helios models and Helios-Mid / Distilled variants (#1604, #1648).
- Added Hunyuan Image3 AR generation support (#759).
- Added LTX-2 text-to-video and image-to-video support (#841).
- Added support for MammothModa2 (#336) and CosyVoice3-0.5B (#498).
- Improved compatibility and fixes for Qwen3-Omni and LongCat models (#1602, #1485, #1631).
Performance Improvements
Multiple optimizations improve startup time, streaming latency, and runtime efficiency:
- Accelerated diffusion model startup with multi-threaded weight loading (#1504).
- Reduced inter-packet latency in async chunking for Qwen3-Omni streaming (#1656).
- Reduced TTFA (time-to-first-audio) for Qwen3-TTS via flexible initial phases (#1583).
- Optimized TTS code predictor execution by removing GPU synchronization bottlenecks (#1614).
- Enabled torch.compile + CUDA Graph for TTS pipelines (#1617).
- Reduced IPC overhead in single-stage diffusion serving for Wan2.2 (#1715).
Inference Infrastructure & Parallelism
New infrastructure improvements improve scalability and flexibility for multimodal serving:
- Added CFG KV-cache transfer support for multi-stage pipelines (#1422).
- Added CFG parallel mode for Bagel diffusion models (#1578, #1695).
- Refactored tile/patch parallelism to simplify support for additional models (#1366).
- Added VAE patch parallel CLI option for online diffusion serving (#1716).
- Enabled async chunking for offline inference and configurable chunk parameters (#1415, #1423).
- Added collective RPC API entrypoint and custom I/O support for RL workloads (#1646).
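Chunk parameters for async chunking became YAML-configurable (#1423). A hypothetical stage-config fragment (the nesting is an assumption; the key names chunk_size and left_context_size come from the PR title):

```yaml
# Illustrative stage config; exact structure may differ per model.
async_chunk:
  chunk_size: 50           # tokens per emitted audio chunk
  left_context_size: 25    # tokens of left context carried between chunks
```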
Text-to-Speech Improvements
Major improvements to the stability and flexibility of the TTS pipeline:
- Added voice upload API for Qwen3-TTS (#1201).
- Added flexible task_type configuration for Qwen3-TTS models (#1197).
- Added non-async chunk mode and improved offline batching support (#1678, #1417).
- Fixed several stability issues including predictor crashes, all-silence output, and Transformers 5.x compatibility (#1619, #1664, #1536).
Quantization & Hardware Support
- Added FP8 quantization support for Flux transformers (#1640).
- Improved NPU support, including MindIE-SD AdaLN compatibility (#1537).
- Improved device abstraction by replacing hard-coded CUDA generators with platform-aware detection (#1677).
- Updated XPU container configuration (#1545).
Reliability, Tooling & Developer Experience
- Added progress bar support for diffusion models (#1652).
- Introduced benchmark collection and reporting scripts in CI (#1307).
- Added TTS developer guide and testing documentation (#1693, #1376).
- Improved API robustness with better error handling and request validation (#1641, #1687).
- Numerous bug fixes across models, kernels, and configuration handling (#1391, #1566, #1609, #1661).
What's Changed
- 0.16.0 release by @ywang96 in #1576
- [Refactor]: Phase1 for rebasing_additional_info by @divyanshsinghvi in #1394
- [Feature]: Support cfg kv-cache transfer in multi-stage by @princepride in #1422
- [BugFix] Fix load_weights error when loading HunyuanImage3.0 by @Semmer2 in #1598
- [Bugfix] fix kernel error for qwen3-omni by @R2-Y in #1602
- [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio by @qibaoyuan in #1570
- [Bugfix] Import InputPreprocessor into Renderer by @lengrongfu in #1566
- [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading by @SamitHuang in #1504
- [Bugfix][Model] Fix LongCat Image Config Handling / Layer Creation by @alex-jw-brooks in #1485
- [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context by @ZhanqiuHu in #1619
- [Debug] Enable curl retry aligned with openai by @tzhouam in #1539
- [Doc] Fix links in the configuration doc by @yuanheng-zhao in #1615
- [CI] Add scripts for benchmark collection and email distribution. by @congw729 in #1307
- [FEATURE] Tile/Patch parallelism refactor for easily support other models by @Bounty-hunter in #1366
- [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation by @yuanheng-zhao in #1609
- Make chunk_size and left_context_size configurable via YAML for async chunking by @LJH-LBJ in #1423
- [Bugfix] Fix transformers 5.x compat issues in online TTS serving by @linyueqian in #1536
- [Refactor] lora: reuse load_weights packed mapping by @dongbo910220 in #991
- [Model]: support Helios from ByteDance by @princepride in #1604
- [chore] add _repeated_blocks for regional compilation support by @RuixiangMa in #1642
- [Bugfix] Add TTS request validation to prevent engine crashes by @linyueqian in #1641
- [CI] Fix ASCII codes. by @congw729 in #1647
- [Misc] update wechat by @david6666666 in #1649
- docs: Announce vllm-omni-skills community project by @hsliuustc0106 in #1651
- [Model] Add Hunyuan Image3 AR Support by @usberkeley in #759
- [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases by @amy-why-3459 in #1628
- [Bugfix] Fix Dtype Parsing by @alex-jw-brooks in #1391
- [XPU] fix UMD version in docker file by @yma11 in #1545
- add support for MammothModa2 model by @HonestDeng in #336
- [Model] Fun cosy voice3-0.5-b-2512 by @divyanshsinghvi in #498
- [Bugfix] Enable torch.compile for low noise model (transformer_2) by @lishunyang12 in #1541
- [NPU] [Features] [Bugfix] Support mindiesd adaln by @jiangmengyu18 in #1537
- [FP8 Quantization] Add FP8 quantization support for Flux transformer by @zzhuoxin1508 in #1640
- Replace hard-coded cuda generator with current_omni_platform.device_type by @pi314ever in #1677
- [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup by @alex-jw-brooks in #1631
- [Misc] remove logits_processor_pattern this field, because vllm have … by @lengrongfu in #1675
- [CI] Remove high concurrency tests before issue #1374 fixed. by @congw729 in #1683
- [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk by @ZeldaHuang in #1656
- [Feat][Qwen3TTS] reduce TTFA with flexible initial phase by @JuanPZuluaga in #1583
- [Model] support LTX-2 text-to-video image-to-video by @david6666666 in #841
- [BugFix] Return proper HTTP status for ErrorResponse in create_speech by @Lidang-Jiang in #1687
- [Doc] Add the test guide document. [skip ci] by @yenuo26 in #1376
- [UX] Add progress bar for diffusion models by @gcanlin in #1652
- [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder by @ZhanqiuHu in #1664
- [Feature] Support flexible task_type configuration for Qwen3-TTS models by @JackLeeHal in #1197
- [Cleanup] Move cosyvoice3 tests to model subdirectory by @linyueqian in #1666
- [Feature][Bagel] Add CFG parallel mode by @nussejzz in #1578
- perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor by @dubin555 in #1614
- [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph by @Sy0307 in #1617
- [MiMo-Audio] Bugfix tp lg than 1 by @qibaoyuan in #1688
- Add non-async chunk support for Qwen3-TTS by @linyueqian in #1678
- [1/N][Refactor] Clean up dead code in output processor by @gcanl...
v0.16.0
Highlights
This release features approximately 121 commits (merged PRs) from ~60 contributors (24 new contributors).
vLLM-Omni v0.16.0 is a major alignment + capability release that rebases the project onto upstream vLLM v0.16.0 and significantly expands performance, distributed execution, and production readiness across Qwen3-Omni / Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image and the Diffusion (DiT) image/video stack—while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.
Key Improvements
- Rebase to upstream vLLM v0.16.0: Tracks the latest vLLM runtime behavior and APIs while keeping Omni’s error handling aligned with upstream expectations. (#1357, #1122, plus follow-up fixes like #1401)
- Qwen3-Omni performance + correctness: Performance optimizations (CUDA graph, async chunk, streaming output) reduce TTFP by ~90% and bring RTF down to 0.22–0.45, plus precision and E2E metric correctness fixes. (#1378, #1352, #1288, #1018, #1292)
- MiMo-Audio production support: Performance optimizations (CUDA graph, async chunk, streaming output) improve RTF to ~0.2, about 11x faster than baseline. (#750)
- Qwen3-TTS production upgrades: Disaggregated inference pipeline support, streaming output, batched Code2Wav decoding, and CUDA Graph support for speech tokenizer decoding—plus multiple robustness fixes across task type handling and voice cloning. (#1161, #1438, #1426, #1205, #1317, #1554)
- Bagel acceleration & scalability: Adds TP support, introduces CFG capabilities, and accelerates multi-branch CFG by merging branches into a single batch; includes KV transfer stability fixes. (#1293, #1310, #1429, #1437)
- Diffusion distributed execution expansion: Adds/extends TP/SP/HSDP and reduces redundant communication overhead; improves pipeline parallelism options (e.g., VAE patch parallel) and correctness across multiple diffusion families. (#964, #1275, #1339, #756, #1428)
- Quantization for DiT: Introduces FP8 quantization support and native GGUF quantization support for diffusion transformers, with code-path cleanups. (#1034, #1285, #1533)
- Broader model coverage (audio + image): Adds MiMo-Audio-7B-Instruct support and performance improvements for GLM-Image pipelines. (#750, #920)
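Several bullets above quote TTFP (time to first packet) and RTF (real-time factor). As a reference for how such metrics are conventionally computed (a hedged sketch: these helpers illustrate the standard definitions and are not vLLM-Omni APIs):

```python
# Illustrative computation of streaming-TTS latency metrics.
# These helpers show the conventional definitions only; they are
# not part of vLLM-Omni.

def ttfp_ms(request_start: float, first_packet_at: float) -> float:
    """Time-to-first-packet: delay before the first audio chunk arrives."""
    return (first_packet_at - request_start) * 1000.0

def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: generation time divided by audio duration.
    RTF < 1.0 means synthesis runs faster than real time."""
    return generation_seconds / audio_seconds

# Example: 4.5 s of compute producing 10 s of audio -> RTF 0.45
print(rtf(4.5, 10.0))        # 0.45
print(ttfp_ms(0.0, 0.125))   # 125.0
```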
Diffusion, Image & Video Generation
- New/expanded model coverage
- Distributed & parallel execution
- Performance & memory efficiency
- Correctness & stability
Audio, Speech & Omni (Qwen3-TTS / MiMo-Audio)
- Qwen3-TTS feature set maturation
- Stability & quality
Multimodal Model Improvements
- Bagel
- GLM-Image
Serving, APIs & Integrations
- OpenAI-compatible video serving
- Online serving robustness & usability
- Ecosystem integration: ComfyUI integration for improved workflow adoption. (#1113)
Performance, Scheduling & Memory Accounting
- Async chunk enhancements
- Metrics & benchmarking
- Memory accounting
Platform, Hardware Backends & Deployment
- XPU / NPU / ROCm coverage improvements
- Deployment & connectivity
CI, Testing, Docs & Developer Experience
- CI quality + coverage
- Docs & tutorials
- Tooling: Online profiling support and other developer ergonomics improvements. (#1136)
Stability & Bug Fixes (Across the Stack)
This release includes broad correctness and robustness fixes spanning:
- Diffusion pipelines (dtype/shape, init crashes, model detection, seed and config handling)
- Image edit / generation endpoints (format validation, RoPE crash, argument typing, seed handling)
- Distributed execution (process group mapping accuracy, scheduler race conditions, kv transfer correctness)
- General runtime hygiene (removing unnecessary ZMQ init, CLI naming normalization, upstream-aligned error handling)
What's Changed
- [TeaCache]: Add Coefficient Estimation by @princepride in #940
- [CI]: Bagel E2E Smoked Test by @princepride in #1074
- [Misc] Bump version to 0.14.0 by @ywang96 in #1128
- [Doc] First stable release of vLLM-Omni by @ywang96 in #1129
- [Misc] Align error handling with upstream vLLM v0.14.0 by @ceanna93 in #1122
- [Feature] add Tensor Parallelism to LongCat-Image(-Edit) by @hadipash in #926
- [CI] Temporarily remove slow tests. by @congw729 in #1143
- [CI] Refactor test_sequence_parallel.py and add a warmup run for more accurate performance stat by @mxuax in #1165
- Dev/rebase v0.15.0 by @tzhouam in #1159
- Docs update paper link by @hsliuustc0106 in #1169
- [Debug] Clear Dockerfile.ci to accelerate build image by @tzhouam in #1172
- [Debug] Correct Unreasonable Long Timeout by @tzhouam in #1175
- [Doc]Fix - Align with repo. by @congw729 in #1176
- [Bugfix][Qwen-Image-Edit] Add a warning log for none negative_prompt by @gcanlin in #1170
- [Bugfix] fix qwen image oom by @ZJY0516 in https://github.c...
v0.16.0rc1
This pre-release is an alignment with the upstream vLLM v0.16.0.
Highlights
- Rebase to Upstream vLLM v0.16.0: vLLM-Omni is now fully aligned with the latest vLLM v0.16.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1357).
- Tensor Parallelism for Bagel & SD 3.5: Added Tensor Parallelism (TP) support for the Bagel model and Stable Diffusion 3.5, improving inference scalability for these diffusion workloads (#1293, #1336).
- CFG Parallel Expansion: Extended Classifier-Free Guidance (CFG) parallel support to Bagel and FLUX.1-dev models, enabling faster guided generation (#1310, #1269).
- Async Scheduling for Chunk IO Overlap: Introduced async scheduling to overlap chunk IO and computation across stages, reducing idle time and improving end-to-end throughput (#951).
- Diffusion Sequence Parallelism Optimization: Removed redundant communication cost by refining the SP hook design, improving diffusion parallelism efficiency (#1275).
- ComfyUI Integration: Added a full ComfyUI integration (`ComfyUI-vLLM-Omni`) as an official app, supporting image generation, multimodal comprehension, and TTS workflows via vLLM-Omni's online serving API (multiple files under `apps/ComfyUI-vLLM-Omni/`). (#1113)
- Qwen3-Omni CUDA Graph by Default: Enabled CUDA graph for Qwen3-Omni by default for improved inference performance (#1352).
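For context on the CFG-parallel items above: classifier-free guidance runs a conditional and an unconditional branch at each denoising step and blends their predictions; CFG parallel places the two branches on separate ranks. A minimal single-process sketch of the blending rule (plain Python, illustrative math only, not the vLLM-Omni implementation):

```python
# Classifier-free guidance blend: the unconditional prediction is pushed
# toward (or past) the conditional one by a guidance scale. CFG parallel
# evaluates the two branches on different ranks and blends after a
# gather; this sketch shows only the element-wise math on plain lists.

def cfg_blend(uncond, cond, guidance_scale: float):
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# scale 1.0 reproduces the conditional branch exactly
print(cfg_blend([0.0, 2.0], [1.0, 4.0], 1.0))  # [1.0, 4.0]
# larger scales extrapolate past it
print(cfg_blend([0.0, 2.0], [1.0, 4.0], 3.0))  # [3.0, 8.0]
```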
What's Changed
Features & Optimizations
- [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker by @knlnguyen1802 in #764
- Refactor CPU Offloading Backend Pattern by @yuanheng-zhao in #1223
Alignment & Integration
- Unifying CLI Argument Naming Style by @wtomin in #1309
- fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
- [Debug] Add trigger to concurrent stage init by @tzhouam in #1274
Bug Fixes
- [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
- [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
- [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
- [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
- [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
- [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
- Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
- [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295
Infrastructure (CI/CD) & Documentation
- [CI] Run nightly tests. by @congw729 in #1333
- [CI] Add env variable check for nightly CI by @congw729 in #1281
- [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
- [CI] Add CI branch coverage calculation, fix statement coverage results by @yenuo26 in #1120
- [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
- [Test] Add example test cases for omni online by @yenuo26 in #1086
- [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
- [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
- [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
- [DOC] Doc for CI test - Details about five level structure and some other files. by @congw729 in #1167
- [Bugfix] Fix Doc link Error by @lishunyang12 in #1263
- update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304
Remaining notes
- nvidia-cublas-cu12 is pinned to 12.9.1.4 via force-reinstall in Dockerfile.ci, waiting for updates from vLLM main repo and PyTorch. pytorch/pytorch#174949
- Qwen2.5-omni with mixed_modalities input only uses first frame of video, which originates from vLLM main repo: vllm-project/vllm#34506
New Contributors
- @ekagra-ranjan made their first contribution in #1317
- @zhenwei-intel made their first contribution in #1295
- @Shirley125 made their first contribution in #951
Full Changelog: v0.15.0rc1...v0.16.0rc1
v0.15.0rc1
This pre-release is an alignment with the upstream vLLM v0.15.0.
Highlights
- Rebase to Upstream vLLM v0.15.0: vLLM-Omni is now fully aligned with the latest vLLM v0.15.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1159).
- Tensor Parallelism for LongCat-Image: We have added Tensor Parallelism (TP) support for the `LongCat-Image` and `LongCat-Image-Edit` models, significantly improving the inference speed and scalability of these vision-language models (#926).
- TeaCache Optimization: Introduced Coefficient Estimation for TeaCache, further refining the efficiency of our caching mechanisms for optimized generation (#940).
- Alignment & Stability:
- Paper link update: An initial paper on arXiv introducing our design along with some performance test results (#1169).
What's Changed
Features & Optimizations
- [TeaCache]: Add Coefficient Estimation by @princepride in #940
- [Feature] add Tensor Parallelism to LongCat-Image(-Edit) by @hadipash in #926
Alignment & Integration
- Dev/rebase v0.15.0 by @tzhouam in #1159
- [Misc] Align error handling with upstream vLLM v0.14.0 by @ceanna93 in #1122
- [Misc] Bump version to 0.14.0 by @ywang96 in #1128
Infrastructure (CI/CD) & Documentation
- [Doc] First stable release of vLLM-Omni by @ywang96 in #1129
- [CI]: Bagel E2E Smoked Test by @princepride in #1074
- [CI] Refactor test_sequence_parallel.py and add a warmup run by @mxuax in #1165
- [CI] Temporarily remove slow tests. by @congw729 in #1143
- [Debug] Clear Dockerfile.ci to accelerate build image by @tzhouam in #1172
- [Debug] Correct Unreasonable Long Timeout by @tzhouam in #1175
- [Docs] Update paper link by @hsliuustc0106 in #1169
New Contributors
Full Changelog: v0.14.0...v0.15.0rc1
v0.14.0
Highlights
This release features approximately 180 commits from over 70 contributors (23 new contributors).
vLLM-Omni v0.14.0 is a feature-heavy release that expands Omni’s diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability.
Key Improvements:
- Async chunk ([#727]): chunk pipeline overlap across stages to reduce idle time and improve end-to-end throughput/latency for staged execution.
- Stage-based deployment for the Bagel model ([#726]): Multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage), aligning it with the vLLM-Omni architecture.
- Qwen3-TTS model family support ([#895]): Expands text-to-audio generation and supports online serving.
- Diffusion LoRA Adapter Support (PEFT-compatible) ([#758]): Adds LoRA fine-tuning/adaptation for diffusion workflows with a PEFT-aligned interface.
- DiT layerwise (blockwise) CPU offloading ([#858]): Fine-grained offloading to increase memory headroom for larger diffusion runs.
- Hardware platforms + plugin system ([#774]): Establishes a more extensible platform capability layer for cleaner multi-backend development.
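The async-chunk idea in the first bullet — letting a downstream stage start consuming chunks before the upstream stage has finished — can be sketched as two asyncio stages connected by a bounded queue (an illustrative sketch only, not the vLLM-Omni implementation):

```python
import asyncio

# Two pipeline stages connected by a bounded queue: the downstream stage
# consumes chunks as soon as they are produced, overlapping its work with
# the upstream stage instead of waiting for the full sequence.
# Illustrative sketch only.

async def ar_stage(queue: asyncio.Queue, n_chunks: int):
    for i in range(n_chunks):
        await asyncio.sleep(0)          # stand-in for AR decode work
        await queue.put(f"chunk-{i}")   # hand off as soon as it's ready
    await queue.put(None)               # end-of-stream sentinel

async def decode_stage(queue: asyncio.Queue):
    outputs = []
    while (chunk := await queue.get()) is not None:
        await asyncio.sleep(0)          # stand-in for vocoder/DiT work
        outputs.append(chunk.upper())
    return outputs

async def main():
    queue = asyncio.Queue(maxsize=2)    # small buffer bounds memory use
    _, outputs = await asyncio.gather(ar_stage(queue, 3), decode_stage(queue))
    return outputs

print(asyncio.run(main()))  # ['CHUNK-0', 'CHUNK-1', 'CHUNK-2']
```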
Diffusion & Image/Video Generation
- Sequence Parallelism (SP) foundations + expansion: Adds a non-intrusive SP abstraction for diffusion models ([#779]), SP support in LongCatImageTransformer ([#721]), and SP support for Wan2.2 diffusion ([#966]).
- CFG improvements and parallelization: CFG parallel support for Qwen-Image ([#444]), CFG parallel abstraction ([#851]), and online-serving CFG parameter support ([#824]).
- Acceleration & execution plumbing: Torch compile support for diffusion ([#684]), GPU diffusion runner ([#822]), and diffusion executor ([#865]).
- Caching and memory efficiency: TeaCache for Z-Image ([#817]) and TeaCache for Bagel ([#848]); plus CPU offloading for diffusion ([#497]) and DiT tensor parallel enablement for diffusion pipeline (Z-Image) ([#735]).
- Model coverage expansion: Adds GLM-Image support ([#847]), FLUX family additions (e.g., FLUX.1-dev [#853], FLUX.2-klein [#809]) and related TP support ([#973]).
- Quality/stability fixes for pipelines: Multiple diffusion pipeline correctness fixes (e.g., CFG parsing failure fix [#922], SD3 compatibility fix [#772], video saving bug under certain fps [#893], noisy output without a seed in Qwen Image [#1043]).
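Sequence parallelism shards the token dimension across ranks, which requires the sequence length to be divisible by the SP degree; the usual remedy is padding plus an attention mask over the padded positions. A minimal sketch of that shape bookkeeping (plain Python, illustrative only; real implementations pad tensors):

```python
# Pad a token sequence so its length is a multiple of the SP degree,
# then shard it evenly across ranks. The mask records which positions
# are real so attention can ignore the padding. Illustrative sketch.

def pad_and_shard(tokens, sp_degree, pad_token=0):
    remainder = len(tokens) % sp_degree
    pad_len = (sp_degree - remainder) % sp_degree
    padded = tokens + [pad_token] * pad_len
    per_rank = len(padded) // sp_degree
    shards = [padded[r * per_rank:(r + 1) * per_rank]
              for r in range(sp_degree)]
    mask = [True] * len(tokens) + [False] * pad_len  # True = real token
    return shards, mask

shards, mask = pad_and_shard([1, 2, 3, 4, 5], sp_degree=4)
print([len(s) for s in shards])  # [2, 2, 2, 2]
print(mask.count(False))         # 3
```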
Audio & Speech (TTS / Text-to-Audio)
- Text-to-audio model support: Stable Audio Open support for text-to-audio generation ([#331]).
- Qwen3-TTS stack maturation: Model series support ([#895]), online serving support ([#968]), plus stabilization fixes such as profile-run hang resolution ([#1082]) and dependency additions for Qwen3-TTS support ([#981]).
- Interoperability & correctness: Fixes and improvements across audio outputs and model input validation (e.g., StableAudio output standardization [#842], speaker/voices loading from config [#1079]).
Serving, APIs, and Frontend
- Diffusion-mode service endpoints & compatibility: Adds /health and /v1/models endpoints for diffusion mode and fixes streaming compatibility ([#454]).
- New/expanded image APIs: /v1/images/edit interface ([#1101]).
- Online serving usability improvements: Enables tensor_parallel_size argument with online serving command ([#761]) and supports CFG parameters in online serving ([#824]).
- Batching & request handling: Frontend/model support for batch requests (OmniDiffusionReq refinement) ([#797]).
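As an illustration of the `/v1/images/edit` style endpoint above, an OpenAI-images-like request body can be assembled and posted with the standard library (hedged sketch: field names follow the OpenAI images convention and may differ from vLLM-Omni's exact schema; the URL and model id are placeholders):

```python
import json
import urllib.request

# Build an OpenAI-images-style edit request. Field names follow the
# OpenAI convention; vLLM-Omni's exact schema may differ (see #1101).
payload = {
    "model": "Qwen/Qwen-Image-Edit",           # placeholder model id
    "prompt": "replace the sky with a sunset",
    "size": "1024x1024",
    "n": 1,
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/v1/images/edit",    # placeholder server URL
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; skipped here since no
# running server is assumed.
print(json.loads(body)["size"])  # 1024x1024
```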
Performance & Efficiency
- Qwen3-Omni performance work: SharedFusedMoE integration ([#560]), fused QKV & projection optimizations (e.g., fuse QKV linear and gate_up proj [#734], Talker MTP optimization [#1005]).
- Attention and kernel/backend tuning: Flash Attention attention-mask support ([#760]), FA3 backend defaults when supported ([#783]), and ROCm performance additions like AITER Flash Attention ([#941]).
- Memory-aware optimizations: Conditional transformer loading for Wan2.2 to reduce memory usage ([#980]).
Hardware / Backends / CI Coverage
- Broader backend support: XPU backend support ([#191]) plus the platform/plugin system groundwork ([#774]).
- NPU & ROCm updates: NPU upgrade alignment ([#820], [#1114]) and ROCm CI expansion / optimization ([#542], [#885], [#1039]).
- Test reliability / coverage: CI split to avoid timeouts ([#883]) and additional end-to-end / precision tests (e.g., chunk e2e tests [#956]).
Reliability, Correctness, and Developer Experience
- Stability fixes across staged execution and serving: Fixes for stage config loading issues ([#860]), stage output mismatch in online batching ([#691]), and server readiness wait-time increase for slow model loads ([#1089]).
- Profiling & benchmarking improvements: Diffusion profiler support ([#709]) plus benchmark additions (e.g., online benchmark [#780]).
- Documentation refresh: Multiple diffusion docs refactors and new guides (e.g., profiling guide [#738], torch profiler guide [#570], diffusion docs refactor [#753], ROCm instructions updates [#678], [#905]).
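The benchmarking additions above report latency summaries; for reference, percentiles over collected per-request latencies can be computed with the standard library (illustrative only, not the benchmark script's actual code):

```python
import statistics

# Summarize per-request latencies (ms) the way online benchmarks
# typically report them. Illustrative only; the online benchmark
# (#780) has its own reporting.
latencies_ms = [110, 95, 130, 120, 105, 500, 98, 102]

p50 = statistics.median(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile
mean = statistics.fmean(latencies_ms)

# Note how a single 500 ms outlier skews the mean far above the median.
print(p50)   # 107.5
print(mean)  # 157.5
```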
What's Changed
- [Docs] Fix diffusion module design doc by @SamitHuang in #645
- [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
- [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
- Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
- [Misc] Merge diffusion forward context by @iwzbi in #582
- [Doc] User guide for torch profiler by @lishunyang12 in #570
- [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
- [BugFix] token2wav code out of range by @Bounty-hunter in #655
- [Doc] Update version 0.12.0 by @ywang96 in #662
- [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
- [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
- [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
- [BUGFIX] Delete the CUDA context in the stage process. by @fake0fan in #661
- [Misc] Fix docs display problem of streaming mode and other related issues by @Gaohan123 in #667
- [Model] Add Stable Audio Open support for text-to-audio generation by @linyueqian in #331
- [Doc] Update ROCm getting started instruction by @tjtanaa in #678
- [Bugfix] Fix f-string formatting in image generation pipelines by @ApsarasX in #689
- [Bugfix] Solve Ulysses-SP sequence length not divisible by SP degree (using padding and attention mask) by @wtomin in #672
- omni entrypoint support tokenizer arg by @divyanshsinghvi in #572
- [Bug fix] fix e2e_total_tokens and e2e_total_time_ms by @LJH-LBJ in #648
- [BugFix] Explicitly release file locks during stage worker init by @yuanheng-zhao in #703
- [BugFix] Fix stage engine outputs mismatch bug in online batching by @ZeldaHuang in #691
- [core] add torch compile for diffusion by @ZJY0516 in #684
- [BugFix] Remove duplicate width assignment in SD3 pipeline by @dongbo910220 in #708
- [Feature] Support Qwen3 Omni talker cudagraph by @ZeldaHuang in #669
- [Benchmark] DiT Model Benchmark under Mixed Workloads by @asukaqaq-s in #529
- update design doc by @hsliuustc0106 in #711
- [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni by @gcanlin in #560
- [Doc]: update vllm serve param and base64 data truncation by @nuclearwu in #718
- [BugFix] Fix assuming all stage model have talker by @princepride in #730
- [Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj by @gcanlin in #734
- [Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) by @dongbo910220 in #735
- [Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker by @LJH-LBJ in #697
- [ROCm] [CI] Add More Tests by @tjtanaa in #542
- [Docs] update design doc templated in RFC by @hsliuustc0106 in #746
- Add description of code version for bug report by @yenuo26 in #745
- [misc] fix rfc template by @hsliuustc0106 in https://github.com...
v0.14.0rc1
Highlights (vllm-omni v0.14.0rc1)
This release candidate includes approximately 90 commits from 35 contributors (12 new contributors).
This release candidate focuses on diffusion runtime maturity, Qwen-Omni performance, and expanded multimodal model support, alongside substantial improvements to serving ergonomics, profiling, ROCm/NPU enablement, and CI/docs quality. In addition, this is the first vllm-omni rc version with Day-0 alignment with vLLM upstream.
Model Support
- TTS: Added support for the Qwen3-TTS (Day-0) model series. (#895)
- Diffusion / image families: Added FLUX.2-klein (Day-0) and GLM-Image (Day-0), plus multiple Qwen-Image family correctness/perf improvements. (#809, #868, #847)
- Bagel ecosystem: Added Bagel model support and Cache-DiT support. (#726, #736)
- Text-to-audio: Added Stable Audio Open support for text-to-audio generation. (#331)
Key Improvements
Qwen-Omni performance & serving enhancements
- Improved Qwen3-Omni throughput with vLLM SharedFusedMoE, plus additional kernel/graph optimizations
- Improved online serving configurability

Diffusion runtime & acceleration upgrades
- Added sleep / wake_up / load_weights lifecycle controls for Omni Diffusion, improving operational flexibility for long-running services. (#376)
- Introduced torch.compile support for diffusion to improve execution efficiency on supported setups. (#684)
- Added a GPU Diffusion Runner and Diffusion executor, strengthening the core execution stack for diffusion workloads. (#822, #865)
- Enabled TeaCache acceleration for Z-Image diffusion pipelines. (#817)
- Defaulted to FA3 (FlashAttention v3) when supported, and extended FlashAttention to support attention masks. (#783, #760)
- Added CPU offloading support for diffusion to broaden deployment options under memory pressure. (#497)
Parallelism and scaling for diffusion pipelines
- Added CFG parallel support for Qwen-Image and introduced CFG parameter support in online serving. (#444, #824)
- Enabled DiT tensor parallel for Z-Image diffusion pipeline and extended TP support for qwen-image with test refactors. (#735, #830)
- Implemented Sequence Parallelism (SP) abstractions for diffusion, including SP support in LongCatImageTransformer. (#779, #721)
Stability, Tooling, and Platform
Correctness & robustness fixes across diffusion and staged execution:
- Fixed diffusion model load failure when stage config is present (#860)
- Fixed stage engine outputs mismatch under online batching (#691)
- Fixed CUDA-context lifecycle issues and file-lock handling in stage workers (#661, #703)
- Multiple model/pipeline fixes (e.g., SD3 compatibility, Wan2.2 warmup/scheduler, Qwen2.5-Omni stop behavior). (#772, #791, #804, #773)
Profiling & developer experience

ROCm / NPU / CI
Note: The NPU AR functionality is currently unavailable and will be supported in the official v0.14.0 release.
What's Changed
- [Docs] Fix diffusion module design doc by @SamitHuang in #645
- [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
- [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
- Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
- [Misc] Merge diffusion forward context by @iwzbi in #582
- [Doc] User guide for torch profiler by @lishunyang12 in #570
- [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
- [BugFix] token2wav code out of range by @Bounty-hunter in #655
- [Doc] Update version 0.12.0 by @ywang96 in #662
- [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
- [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
- [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
- [BUGFIX] Delete the CUDA context in the sta...
v0.12.0rc1
vLLM-Omni v0.12.0rc1 Pre-Release Notes
Highlights
This release features 187 commits from 45 contributors (34 new contributors)!
vLLM-Omni v0.12.0rc1 is a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm). It also rebases on vLLM v0.12.0 for better alignment with upstream (#335).
Breaking / Notable Changes
- Unified diffusion stage naming & structure: cleaned up legacy
Diffusion*paths and aligned onGeneration*-style stages to reduce duplication (#211, #163). - Safer serialization: switched
OmniSerializerfrompickleto MsgPack (#310). - Dependency & packaging updates: e.g., bumped
diffusersto 0.36.0 (#313) and refreshed Python/formatting baselines for the v0.12 release (#126).
Diffusion Engine: Architecture + Performance Upgrades
- Core refactors for extensibility: diffusion model registry refactored to reuse vLLM's `ModelRegistry` (#200), improved diffusion weight loading and stage abstraction (#157, #391).
- Acceleration & parallelism features:
- Cache-DiT with a unified cache backend interface (#250)
- TeaCache integration and registry refactors (#179, #304, #416)
- New/extended attention & parallelism options: Sage Attention (#243), Ulysses Sequence Parallelism (#189), Ring Attention (#273)
- torch.compile optimizations for DiT and RoPE kernels (#317)
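One of the torch.compile targets above is the RoPE kernel; for reference, rotary position embedding rotates each pair of feature dimensions by a position-dependent angle. A plain-Python sketch of the per-pair rotation (illustrative math only, not the compiled kernel):

```python
import math

# Rotate one (x1, x2) feature pair by angle theta = pos * inv_freq.
# This is the core arithmetic RoPE applies to each dimension pair;
# real kernels perform it vectorized over whole Q/K tensors.
def rope_rotate_pair(x1, x2, pos, inv_freq):
    theta = pos * inv_freq
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return x1 * cos_t - x2 * sin_t, x1 * sin_t + x2 * cos_t

# position 0 leaves features unchanged (theta = 0)
print(rope_rotate_pair(1.0, 2.0, pos=0, inv_freq=0.5))  # (1.0, 2.0)
```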
Serving: Stronger OpenAI Compatibility & Online Readiness
- DALL·E-compatible image generation endpoint (`/v1/images/generations`) (#292), plus online serving fixes for image generation (#499).
- Added OpenAI create speech endpoint (#305).
- Per-request modality control (output modality selection) (#298) with API usage examples (#411).
- Early support for streaming output (#367), request abort (#486), and request-id propagation in responses (#301).
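The early streaming-output support above implies clients consume incremental chunks; OpenAI-style streaming responses arrive as server-sent events whose lines look like `data: {json}`. A minimal parser sketch (illustrative: the exact wire format of vLLM-Omni's streaming responses may differ):

```python
import json

# Parse OpenAI-style SSE stream lines into JSON events. Illustrative
# sketch; vLLM-Omni's exact streaming wire format may differ.
def parse_sse_lines(lines):
    events = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip blanks/keep-alive comments
        payload = line[len("data: "):]
        if payload == "[DONE]":           # OpenAI end-of-stream marker
            break
        events.append(json.loads(payload))
    return events

stream = [
    'data: {"id": "req-1", "delta": "Hel"}',
    'data: {"id": "req-1", "delta": "lo"}',
    "data: [DONE]",
]
events = parse_sse_lines(stream)
print("".join(e["delta"] for e in events))  # Hello
```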
Omni Pipeline: Multi-stage Orchestration & Observability
- Improved inter-stage plumbing: customizable processing between stages and reduced coupling on `request_ids` in model forward paths (#458).
- Better observability and debugging: torch profiler across omni stages (#553), improved traceback reporting from background workers (#385), and logging refactors (#466).
Expanded Model Support (Selected)
Qwen-Omni / Qwen-Image family:
- Qwen-Omni offline inference with local files (#167)
- Qwen-Image-2512 support (#547)
- Qwen-Image-Edit support (including multi-image input variants and newer releases: Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Edit-2511) (#196, #330, #321)
- Qwen-Image-Layered model support (#381)
- Multiple fixes for Qwen2.5/Qwen3-Omni batching, examples, and OpenAI sampling parameter compatibility (#451, #450, #249)
Diffusion / video ecosystem:
Platform & CI Coverage
- ROCm / AMD: documented ROCm setup (#144) and added ROCm Dockerfile + AMD CI (#280).
- NPU: added NPU CI workflow (#231) and expanded NPU support for key Omni models (e.g., Qwen3-Omni, Qwen-Image series) (#484, #463, #485), with ongoing cleanup of NPU-specific paths (#597).
- CI and packaging improvements: diffusion CI, wheel compilation, and broader UT/E2E coverage (#174, #288, #216, #168).
What's Changed
- [Misc] Update link in issue template by @ywang96 in #155
- [Misc] Qwen-Omni support offline inference with local files by @SamitHuang in #167
- [diffusion] z-image support by @ZJY0516 in #149
- [Doc] Fix wrong examples URLs by @wjcwjc77 in #166
- [Doc] Update Security Advisory link by @DarkLight1337 in #173
- [Doc] change `vllm_omni` to `vllm-omni` by @princepride in #177
- [Docs] Supplement volunteers and faq docs by @Gaohan123 in #182
- [Bugfix] Init early toch cuda by @knlnguyen1802 in #185
- [Docs] remove Ascend word to make docs general by @gcanlin in #190
- [Doc] Add installation part for pre built docker. by @congw729 in #141
- [CI] add diffusion ci by @ZJY0516 in #174
- [Misc] Add stage config for Qwen3-Omni-30B-A3B-Thinking by @linyueqian in #172
- [Doc]Fixed some spelling errors by @princepride in #199
- [Chore]: Refactor diffusion model registry to reuse vLLM's ModelRegistry by @Isotr0py in #200
- [FixBug]online serving fails for high-resolution videos by @princepride in #198
- [Engine] Remove Diffusion_XX which duplicates with Generation_XX by @tzhouam in #163
- [bugfix] qwen2.5 omni does not support chunked prefill now by @fake0fan in #193
- [NPU][Refactor] Rename Diffusion* to Generation* by @gcanlin in #211
- [Diffusion] Init Attention Backends and Selector for Diffusion by @ZJY0516 in #115
- [E2E] Add Qwen2.5-Omni model test with OmniRunner by @gcanlin in #168
- [Docs]Fix doc wrong link by @princepride in #223
- [Diffusion] Refactor diffusion models weights loading by @Isotr0py in #157
- Fix: Safe handling for multimodal_config to avoid 'NoneType' object h… by @qibaoyuan in #227
- [Bugfix] Fix ci bug for qwen2.5-omni by @Gaohan123 in #230
- [Core] add clean up method for diffusion engine by @ZJY0516 in #219
- [BugFix] Fix qwen3omni thinker batching. by @yinpeiqi in #207
- [Bugfix] Support passing vllm cli args to online serving in vLLM-Omni by @Gaohan123 in #206
- [Docs] Add basic usage examples for diffusion by @SamitHuang in #222
- [Model] Add Qwen-Image-Edit by @SamitHuang in #196
- update docs/readme.md and design folder by @hsliuustc0106 in #234
- [CI] Add Qwen3-omni offline UT by @R2-Y in #216
- [typo] fix doc readme by @hsliuustc0106 in #242
- [Model] Fuse Z-Image's `qkv_proj` and `gate_up_proj` by @Isotr0py in #226
- [bugfix] Fix QwenImageEditPipeline transformer init by @dougbtv in #245
- [Bugfix] Qwen2.5-omni Qwen3-omni online gradio.py example fix by @david6666666 in #249
- [Bugfix] fix issue251, qwen3 omni does not support chunked prefill now by @david6666666 in #256
- [Bugfix]multi-GPU tp scenarios, devices: "0,1" uses physical IDs instead of logical IDs by @david6666666 in #253
- [Bugfix] Remove debug code in AsyncOmni.del to fix resource leak by @princepride in #260
- update arch overview by @hsliuustc0106 in #258
- [Feature] Omni Connector + ray supported by @natureofnature in #215
- [Misc] fix stage config describe and yaml format by @david6666666 in #265
- update desgin docs by @hsliuustc0106 in #269
- [Model] Add Wan2.2 text-to-video support by @linyueqian in #202
- [Doc] [ROCm]: Document the steps to run vLLM Omni on ROCm by @tjtanaa in #144
- [Entrypoints] Minor optimization in the orchestrator's final stage determination logic by @RuixiangMa in #275
- [Doc] update offline inference doc and offline_inference examples by @david6666666 in #274
- [Feature] teacache integration by @LawJarp-A in #179
- [CI] Qwen3-Omni online test by @R2-Y in #257
- [Doc] fix docs Feature Design and Module Design by @hsliuustc0106 in #283
- [CI] Test ready label by @ywang96 in #299
- [Doc] fix offline inference and online serving describe by @david6666666 in #285
- [CI] Adjust folder by @congw729 in #300
- [Diffusion][Attention] sage attention backend by @ZJY0516 in https://github.com/vllm-project/vllm-omni/...