Highlights
This release features approximately 180 commits from over 70 contributors (23 new contributors).
vLLM-Omni v0.14.0 is a feature-heavy release that expands Omni’s diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability.
Key Improvements:
- Async chunking ([#727]): overlaps chunked computation and communication across pipeline stages to reduce idle time and improve end-to-end throughput and latency for staged execution.
- Stage-based deployment for the Bagel model ([#726]): multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage), aligning it with the vLLM-Omni architecture.
- Qwen3-TTS model family support ([#895]): Expands text-to-audio generation and supports online serving.
- Diffusion LoRA Adapter Support (PEFT-compatible) ([#758]): Adds LoRA fine-tuning/adaptation for diffusion workflows with a PEFT-aligned interface.
- DiT layerwise (blockwise) CPU offloading ([#858]): Fine-grained offloading to increase memory headroom for larger diffusion runs.
- Hardware platforms + plugin system ([#774]): Establishes a more extensible platform capability layer for cleaner multi-backend development.
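The DiT layerwise (blockwise) CPU offloading above ([#858]) follows a common pattern: only one block's weights are resident on the accelerator at a time, trading transfer latency for memory headroom. A minimal, framework-agnostic sketch of that scheduling idea (an illustrative mock, not the vLLM-Omni implementation; `Block` and `run_offloaded` are hypothetical names):

```python
# Layerwise (blockwise) CPU offloading sketch: each block's weights are
# paged onto the accelerator just before its forward pass and evicted
# right after, so peak device memory holds only one block at a time.
# Device moves are simulated with a string attribute instead of real
# tensor transfers.

class Block:
    def __init__(self, name):
        self.name = name
        self.device = "cpu"  # all weights start offloaded to host memory

    def to(self, device):
        self.device = device  # stands in for a real weight transfer
        return self

    def forward(self, x):
        assert self.device == "gpu", "block must be resident before compute"
        return x + 1  # placeholder compute

def run_offloaded(blocks, x, log):
    for block in blocks:
        block.to("gpu")              # page this block's weights in
        log.append((block.name, "gpu"))
        x = block.forward(x)
        block.to("cpu")              # evict before the next block loads
        log.append((block.name, "cpu"))
    return x

log = []
blocks = [Block("b0"), Block("b1"), Block("b2")]
out = run_offloaded(blocks, 0, log)
print(out)  # 3: each of the three blocks ran exactly once
```

Real implementations typically overlap the next block's host-to-device copy with the current block's compute on a separate stream; the sequential version above shows only the residency invariant.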
Diffusion & Image/Video Generation
- Sequence Parallelism (SP) foundations + expansion: Adds a non-intrusive SP abstraction for diffusion models ([#779]), SP support in LongCatImageTransformer ([#721]), and SP support for Wan2.2 diffusion ([#966]).
- CFG improvements and parallelization: CFG parallel support for Qwen-Image ([#444]), CFG parallel abstraction ([#851]), and online-serving CFG parameter support ([#824]).
- Acceleration & execution plumbing: Torch compile support for diffusion ([#684]), GPU diffusion runner ([#822]), and diffusion executor ([#865]).
- Caching and memory efficiency: TeaCache for Z-Image ([#817]) and TeaCache for Bagel ([#848]); plus CPU offloading for diffusion ([#497]) and DiT tensor parallel enablement for diffusion pipeline (Z-Image) ([#735]).
- Model coverage expansion: Adds GLM-Image support ([#847]), FLUX family additions (e.g., FLUX.1-dev [#853], FLUX.2-klein [#809]) and related TP support ([#973]).
- Quality/stability fixes for pipelines: Multiple diffusion pipeline correctness fixes (e.g., CFG parsing failure fix [#922], SD3 compatibility fix [#772], video saving bug under certain fps [#893], noisy output without a seed in Qwen Image [#1043]).
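The classifier-free guidance (CFG) work above combines a conditional and an unconditional denoiser output per step; since the two branches are independent, they can be evaluated on different ranks and merged afterwards, which is the intuition behind CFG parallelism. A hedged sketch of the combine step, with scalar stand-ins for the model outputs:

```python
# Classifier-free guidance combine: guided = uncond + s * (cond - uncond).
# s = 1.0 recovers the conditional prediction; s > 1.0 amplifies the
# prompt's influence. The two branch outputs here are toy scalars, not
# real denoiser predictions.

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Elementwise over a "latent" represented as a plain list.
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 1.0]
guided = [cfg_combine(u, c, guidance_scale=3.0) for u, c in zip(uncond, cond)]
print(guided)  # [3.0, 1.0, -1.0]
```

In a CFG-parallel setup, `uncond` and `cond` would arrive from two ranks that each ran half the batch, and only this cheap combine runs after the exchange.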
Audio & Speech (TTS / Text-to-Audio)
- Text-to-audio model support: Stable Audio Open support for text-to-audio generation ([#331]).
- Qwen3-TTS stack maturation: Model series support ([#895]), online serving support ([#968]), plus stabilization fixes such as profile-run hang resolution ([#1082]) and dependency additions for Qwen3-TTS support ([#981]).
- Interoperability & correctness: Fixes and improvements across audio outputs and model input validation (e.g., StableAudio output standardization [#842], speaker/voices loading from config [#1079]).
Serving, APIs, and Frontend
- Diffusion-mode service endpoints & compatibility: Adds /health and /v1/models endpoints for diffusion mode and fixes streaming compatibility ([#454]).
- New/expanded image APIs: /v1/images/edit interface ([#1101]).
- Online serving usability improvements: Enables tensor_parallel_size argument with online serving command ([#761]) and supports CFG parameters in online serving ([#824]).
- Batching & request handling: Frontend/model support for batch requests (OmniDiffusionReq refinement) ([#797]).
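The endpoint paths `/health`, `/v1/models`, and `/v1/images/edit` come from the PRs above; the request fields below are assumptions modeled on the OpenAI images API and may differ from vLLM-Omni's actual schema. A client-side payload sketch:

```python
import json

# Hedged sketch of a request body for the new /v1/images/edit endpoint
# ([#1101]). The field names (model, prompt, image) are hypothetical;
# consult the project's serving docs for the real schema.

ENDPOINT = "/v1/images/edit"

payload = {
    "model": "Qwen-Image-Edit",                # hypothetical model id
    "prompt": "replace the sky with a sunset",
    "image": "<base64-encoded source image>",  # placeholder, not real data
}

body = json.dumps(payload)
print(ENDPOINT, len(body) > 0)
```

The JSON body would be POSTed to the serving process, with `/health` and `/v1/models` available for readiness probes and model discovery in diffusion mode.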
Performance & Efficiency
- Qwen3-Omni performance work: SharedFusedMoE integration ([#560]), fused QKV linear and gate_up projections ([#734]), and Talker MTP optimization ([#1005]).
- Attention and kernel/backend tuning: Flash Attention attention-mask support ([#760]), FA3 backend defaults when supported ([#783]), and ROCm performance additions like AITER Flash Attention ([#941]).
- Memory-aware optimizations: Conditional transformer loading for Wan2.2 to reduce memory usage ([#980]).
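The fused-QKV optimization above replaces three separate projections with one: the Q, K, and V weight matrices are row-concatenated so a single matmul produces all three outputs, which are then split. A pure-Python matvec stand-in for the idea (real code would issue one GEMM on the device):

```python
# Fused QKV sketch: one projection with concatenated weights, then a
# split, instead of three separate projections. Toy 2x2 weights.

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w_q = [[1, 0], [0, 1]]  # identity
w_k = [[2, 0], [0, 2]]  # 2 * identity
w_v = [[0, 1], [1, 0]]  # swap

x = [3, 4]

# Unfused: three separate projections.
q, k, v = matvec(w_q, x), matvec(w_k, x), matvec(w_v, x)

# Fused: row-concatenate the weights, project once, split the output.
w_qkv = w_q + w_k + w_v
fused = matvec(w_qkv, x)
d = len(x)
q2, k2, v2 = fused[:d], fused[d:2 * d], fused[2 * d:]

print(q2, k2, v2)  # [3, 4] [6, 8] [4, 3]
```

The fused form launches one kernel instead of three and reads the activation once, which is where the speedup comes from; the same pattern applies to fusing gate and up projections in MLPs.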
Hardware / Backends / CI Coverage
- Broader backend support: XPU backend support ([#191]) plus the platform/plugin system groundwork ([#774]).
- NPU & ROCm updates: NPU upgrade alignment ([#820], [#1114]) and ROCm CI expansion / optimization ([#542], [#885], [#1039]).
- Test reliability / coverage: CI split to avoid timeouts ([#883]) and additional end-to-end / precision tests (e.g., chunk e2e tests [#956]).
Reliability, Correctness, and Developer Experience
- Stability fixes across staged execution and serving: Fixes for stage config loading issues ([#860]), stage output mismatch in online batching ([#691]), and server readiness wait-time increase for slow model loads ([#1089]).
- Profiling & benchmarking improvements: Diffusion profiler support ([#709]) plus benchmark additions (e.g., online benchmark [#780]).
- Documentation refresh: Multiple diffusion docs refactors and new guides (e.g., profiling guide [#738], torch profiler guide [#570], diffusion docs refactor [#753], ROCm instructions updates [#678], [#905]).
What's Changed
- [Docs] Fix diffusion module design doc by @SamitHuang in #645
- [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
- [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
- Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
- [Misc] Merge diffusion forward context by @iwzbi in #582
- [Doc] User guide for torch profiler by @lishunyang12 in #570
- [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
- [BugFix] token2wav code out of range by @Bounty-hunter in #655
- [Doc] Update version 0.12.0 by @ywang96 in #662
- [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
- [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
- [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
- [BUGFIX] Delete the CUDA context in the stage process. by @fake0fan in #661
- [Misc] Fix docs display problem of streaming mode and other related issues by @Gaohan123 in #667
- [Model] Add Stable Audio Open support for text-to-audio generation by @linyueqian in #331
- [Doc] Update ROCm getting started instruction by @tjtanaa in #678
- [Bugfix] Fix f-string formatting in image generation pipelines by @ApsarasX in #689
- [Bugfix] Solve Ulysses-SP sequence length not divisible by SP degree (using padding and attention mask) by @wtomin in #672
- omni entrypoint support tokenizer arg by @divyanshsinghvi in #572
- [Bug fix] fix e2e_total_tokens and e2e_total_time_ms by @LJH-LBJ in #648
- [BugFix] Explicitly release file locks during stage worker init by @yuanheng-zhao in #703
- [BugFix] Fix stage engine outputs mismatch bug in online batching by @ZeldaHuang in #691
- [core] add torch compile for diffusion by @ZJY0516 in #684
- [BugFix] Remove duplicate width assignment in SD3 pipeline by @dongbo910220 in #708
- [Feature] Support Qwen3 Omni talker cudagraph by @ZeldaHuang in #669
- [Benchmark] DiT Model Benchmark under Mixed Workloads by @asukaqaq-s in #529
- update design doc by @hsliuustc0106 in #711
- [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni by @gcanlin in #560
- [Doc]: update vllm serve param and base64 data truncation by @nuclearwu in #718
- [BugFix] Fix assuming all stage model have talker by @princepride in #730
- [Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj by @gcanlin in #734
- [Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) by @dongbo910220 in #735
- [Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker by @LJH-LBJ in #697
- [ROCm] [CI] Add More Tests by @tjtanaa in #542
- [Docs] update design doc templated in RFC by @hsliuustc0106 in #746
- Add description of code version for bug report by @yenuo26 in #745
- [misc] fix rfc template by @hsliuustc0106 in #748
- fix:#issue 432 by @GG-li in #517
- [Diffusion][Feature] Implement SP support in LongCatImageTransformer by @mxuax in #721
- [Debug] Clean code in Qwen 3 Omni and add warning for talker temperature. by @tzhouam in #688
- [feature] cpu offloading support for diffusion by @LawJarp-A in #497
- [Misc] Group omni arguments into OmniConfig section by @fake0fan in #744
- [Misc] Enable tensor_parallel_size argument with online serving cmd by @JustQJ in #761
- [Bugfix] Raise ValueError when joint_strategy='rear' and causal=True in Ring Attention by @mxuax in #767
- [Feat] add vllm-omni version collection by @sihyeonn in #740
- [Doc] refactor diffusion doc by @ZJY0516 in #753
- [Bugfix] Fix stable diffusion3 compatibility error by @iwzbi in #772
- [Feature] Support Qwen3 Omni talker mtp batch inference by @ZeldaHuang in #722
- [BugFix]Remove duplicate error handling for request results by @liuyuhanalex in #781
- [CI] Add pytest markers in config files. by @congw729 in #719
- [Doc] Fix mkdocs. by @congw729 in #785
- [Bugfix] Fix generation artifacts of Qwen-Image-Edit-2511 and update pipeline DiT param parsing by @SamitHuang in #776
- [bugfix] Fix Wan2.2 I2V warmup failure by adding support_image_input attribute by @linyueqian in #791
- [Misc] add wechat group and star history on README by @david6666666 in #801
- [BugFix] Fix incorrect mrope positions under cuda graph by @ZeldaHuang in #803
- [BugFix] Qwen2.5-omni supress end token and won't stop by @yinpeiqi in #773
- [Feature] Flash Attention to Support Attention Mask by @wtomin in #760
- [Model] add flux2 klein by @david6666666 in #809
- [bugfix] use unipc scheduler for Wan 2.2 by @linyueqian in #804
- [Test] Add full test for Qwen3-Omni-30B-A3B-Instruct by @yenuo26 in #720
- [Bagel] Support Cache-Dit by @princepride in #736
- [Perf] Optimize the Qwen2.5-Omni Model thinker-to-talker-proj with nn.Linear by @kechengliu97 in #825
- [Core]Add GPU Diffusion Runner by @princepride in #822
- [Feature]: Add CFG param to online serving by @gDINESH13 in #824
- [diffusion] add tp support for qwen-image and refactor some tests by @ZJY0516 in #830
- [Core] Implement Diffusion Profiler Support by @lishunyang12 in #709
- [Bugfix] Diffusion model fails to load when stage config is present by @fhfuih in #860
- chore: Bump up cache-dit and fix docs links by @DefTruth in #863
- [Misc] Change benchmark default port by @NickLucche in #872
- Dev/rebase 0.14.0 and Support GLM-Image by @tzhouam in #847
- [Doc] Add user guide for diffusion model profiling by @lishunyang12 in #738
- [Doc] Update rebase doc by @tzhouam in #878
- [Test] Add full test for Qwen3-Omni-30B-A3B-Instruct for image and audio single modal by @yenuo26 in #827
- [Diffusion] Non-Intrusive Sequence Parallelism (SP) Model Support Abstraction for vLLM-Omni Framework by @mxuax in #779
- [Bugfix] Remove the duplicate api registration in vllm-omni by @fake0fan in #880
- [CI] split tests to avoid timeout by @ZJY0516 in #883
- [Diffusion][Acceleration] Support TeaCache for Z-Image by @gcanlin in #817
- [Perf] Fuse Q/K/V Linear with QKVParallelLinear in Qwen2.5Omni DiTAttention by @kechengliu97 in #884
- [bugfix] support text + audio mixed output by @GG-li in #843
- [Misc] fix qwen image family redundant computation by @ZJY0516 in #868
- [ROCm] [CI] Optimize Dockerfilerocm and reduce build time on CI by @tjtanaa in #885
- [Core]Add Diffusion executor by @natureofnature in #865
- [Bugfix] Fix video saving bug under certain fps by @SamitHuang in #893
- [diffusion] use fa3 by default when device supports it by @ZJY0516 in #783
- [Model] Support Qwen3-TTS model series by @Gaohan123 in #895
- Support Bagel Model by @princepride in #726
- debug Qwen TTS by @tzhouam in #902
- [ROCm] [Doc] Add instructions to install ROCm dependencies to run Qwen3 TTS model by @tjtanaa in #905
- Bump version to 0.14.0rc1 by @ywang96 in #910
- [Doc] Fix the version in the documentation to v0.14.0rc1; fix dockerfile.rocm entrypoint by @tjtanaa in #913
- [Misc] update wechat image by @david6666666 in #914
- [Test] Add precision test cases for Qwen3-Omni-30B-A3B-Instruct in CI by @yenuo26 in #828
- [Misc] Fix error log for the diffusion stage timeout by @SamitHuang in #915
- [bugfix] qwen3-tts check_model_inputs by @qibaoyuan in #924
- [Misc] Fix t2i online serving example by @ZJY0516 in #928
- [examples] add --enable-cpu-offload args by @david6666666 in #930
- [BugFix] Standardize StableAudio audio output by @LudovicoYIN in #842
- [Bagel] Support TeaCache by @princepride in #848
- [Doc] Add Bagel model support to TeaCache documentation by @nussejzz in #943
- [Bugfix] raise error in diffusion engine and fix offload test by @ZJY0516 in #933
- [Feature]Support async computation and communication across stages by chunks by @amy-why-3459 in #727
- [Bugfix] Fix diffusion pipeline CFG (guidance_scale parsing failure bug) by @SamitHuang in #922
- [doc]Add Text-To-Audio Readme documentation by @zzhuoxin1508 in #958
- [Feature] Diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment by @AndyZhou952 in #758
- [Diffusion][Feature] Non-Intrusive Sequence Parallelism (SP) Support for Wan2.2 by @mxuax in #966
- [doc] add some additional information for the diffusion model support by @ZJY0516 in #952
- [Perf] avoid cpu op in QwenImageCrossAttention by @ZJY0516 in #942
- Support Qwen3 tts online serving by @linyueqian in #968
- [test] fix test_image_generation_lora CI timeout by @AndyZhou952 in #975
- [diffusion] add tp for FLUX.2-klein by @ZJY0516 in #973
- [test] revert test flash attn file by @AndyZhou952 in #972
- [NPU] Upgrade to v0.14.0 by @gcanlin in #820
- [ROCm] [Perf] Add AITER Flash Attention by @tjtanaa in #941
- [Test] Add chunk e2e test case for CI by @yenuo26 in #956
- [Feature] Implement YuanrongConnector based on OmniConnectorBase by @yangsonglin13 in #716
- [NFC] Remove redundant torch.no_grad in models and pipelines by @yuanheng-zhao in #854
- [BugFix] Modify the method of obtaining external_request_id by @amy-why-3459 in #961
- Fix TTS speaker typo and add supported languages by @linyueqian in #990
- [Fix] make images LoRA e2e less flaky by @dongbo910220 in #978
- [Docs] Fix GLM-Image docstring indentation to resolve CI failure by @dongbo910220 in #992
- [Perf] replace torch's conv with vLLM's conv to fix torch 2.9 performance regression by @ZJY0516 in #982
- [BugFix] Add /health and /v1/models endpoints for diffusion mode, fix streaming compatibility by @majiayu000 in #454
- [Bugfix] Add missing dependencies (onnxruntime, sox) for Qwen3-TTS support by @zzhuoxin1508 in #981
- [Frontend][Model] Support batch request with refined OmniDiffusionReq… by @fhfuih in #797
- [Model]: add FLUX.1-dev model by @nuclearwu in #853
- [BugFix] ignore mm data from stages to async omni by @Bounty-hunter in #954
- Revert "[BugFix] ignore mm data from stages to async omni" by @hsliuustc0106 in #1023
- [Bugfix] Modify output to model_runner_output by @gcanlin in #1026
- [Feature] Support cache-dit for Wan 2.2 inference by @SamitHuang in #1021
- [Doc]Format profiling doc by @lishunyang12 in #993
- [Hardware] Support platforms and plugin system by @gcanlin in #774
- [Core]: KV Cache Transfer Encapsulation by @princepride in #979
- [Test]Delete skip mark for amd ci test and fix CI failure by @yenuo26 in #927
- [Bugfix][Doc]Specify Qwen3-TTS model name for each task type by @kylehh in #1036
- [Misc] pin version of fa3-fwd by @ZJY0516 in #1051
- [CI] [ROCm] Add more AMD CI tests by @tjtanaa in #1039
- [Bugfix] fix qwen image layerd in dummy run by @ZJY0516 in #1027
- [BugFix] Fix noisy output without setting a seed in Qwen Image by @natureofnature in #1043
- [bugfix] remove vllm speech route by @linyueqian in #1060
- [Debug] Update GLM-Image Pipeline by @tzhouam in #1049
- [Diffusion][Bugfix] Fix the flash_attn backends selection logic by @mxuax in #983
- [BugFix] Fix the accuracy issue of multimodal input. by @amy-why-3459 in #1020
- [Bugfix] Set VaeImageProcessor do_convert_rgb=True by @gcanlin in #1032
- [feat]: adapt batch request for flux by @nuclearwu in #1028
- [CI] Change Qwen3 Omni stage placement strategy by @ZeldaHuang in #1072
- [BugFix] Fix to use correct attn backend by @divyanshsinghvi in #1038
- [Perf] Qwen3 Omni talker mtp optimization by @ZeldaHuang in #1005
- [Wan2.2] Optimize memory usage with conditional transformer loading by @faaany in #980
- [Feat] Support XPU Backend in vLLM-Omni by @faaany in #191
- [Fix] stabilize diffusion images LoRA E2E across CI drift by @dongbo910220 in #1075
- [Bugfix][Test] Re-enable the log simple tests by @gcanlin in #1065
- [Bugfix] pr conflict fix, bugfix ignore mm data from stages to async omni by @Bounty-hunter in #1025
- [Doc][Bagel] Add BAGEL-7B-MoT documentation and edit the default stage configuration by @nussejzz in #987
- [Fix] Increase max wait time for server readiness to accommodate model loading by @AndyZhou952 in #1089
- [Benchmark] Add vLLM-Omni Omni model online benchmark by @yenuo26 in #780
- [Bugfix] Remove Mooncake/Yuanrong connector import warning by @natureofnature in #1091
- fix: UnboundLocalError for role in streaming audio/image responses by @PierreLeGuen in #784
- [Misc] update wechat image by @david6666666 in #1096
- [Feature] Support DiT Layerwise (Blockwise) CPU Offloading by @yuanheng-zhao in #858
- [BugFix] Modify max_tokens and modify the log and fix #1103 by @amy-why-3459 in #1097
- [BugFix] Fix modulate_index shape error in Qwen-Image-Edit Task by @mxuax in #1100
- [Platform] Add supports_torch_inductor interface by @gcanlin in #1108
- [BugFix] Fix Qwen3 Omni talker mtp torch.compile startup error by @ZeldaHuang in #1104
- [Bugfix] fix request_id of image generation in api server by @ZJY0516 in #1112
- [Perf]: CFG parallel abstraction by @wtomin in #851
- [BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) by @marksverdhei in #1082
- [CI] [ROCm] Quick fix amd ci by @tjtanaa in #1116
- [Bugfix] fix benchmark audio timing error and add benchmark test by @yenuo26 in #1109
- [Bugfix][Qwen3TTS] Load speaker_id/voices from model configuration by @JuanPZuluaga in #1079
- [NPU] Align with GPUModelRunner by @gcanlin in #1114
- [FEATURE] /v1/images/edit interface by @Bounty-hunter in #1101
- [Bugfix] Fix NPU SDPA attention mask shape and semantics by @gcanlin in #1031
New Contributors
- @ApsarasX made their first contribution in #689
- @dongbo910220 made their first contribution in #708
- @asukaqaq-s made their first contribution in #529
- @nuclearwu made their first contribution in #718
- @yenuo26 made their first contribution in #745
- @GG-li made their first contribution in #517
- @JustQJ made their first contribution in #761
- @sihyeonn made their first contribution in #740
- @liuyuhanalex made their first contribution in #781
- @kechengliu97 made their first contribution in #825
- @gDINESH13 made their first contribution in #824
- @NickLucche made their first contribution in #872
- @LudovicoYIN made their first contribution in #842
- @nussejzz made their first contribution in #943
- @amy-why-3459 made their first contribution in #727
- @zzhuoxin1508 made their first contribution in #958
- @AndyZhou952 made their first contribution in #758
- @yangsonglin13 made their first contribution in #716
- @majiayu000 made their first contribution in #454
- @kylehh made their first contribution in #1036
- @PierreLeGuen made their first contribution in #784
- @marksverdhei made their first contribution in #1082
- @JuanPZuluaga made their first contribution in #1079
Full Changelog: v0.12.0rc1...v0.14.0