
v0.14.0rc1

Pre-release


@david6666666 released this 22 Jan 15:22
a9012a1

Highlights (vllm-omni v0.14.0rc1)

This release candidate includes approximately 90 commits from 35 contributors (12 new contributors).

This release candidate focuses on diffusion runtime maturity, Qwen-Omni performance, and expanded multimodal model support, alongside substantial improvements to serving ergonomics, profiling, ROCm/NPU enablement, and CI/docs quality. It is also the first vllm-omni release candidate with Day-0 alignment to upstream vLLM.

Model Support

  • TTS: Added support for the Qwen3-TTS (Day-0) model series. (#895)
  • Diffusion / image families: Added Flux.2-klein (Day-0) and GLM-Image (Day-0), plus multiple qwen-image family correctness/perf improvements. (#809, #868, #847)
  • Bagel ecosystem: Added Bagel model support and Cache-DiT support. (#726, #736)
  • Text-to-audio: Added Stable Audio Open support for text-to-audio generation. (#331)

Key Improvements

  • Qwen-Omni performance & serving enhancements

    • Improved Qwen3-Omni throughput with vLLM SharedFusedMoE, plus additional kernel/graph optimizations:

      • SharedFusedMoE integration (#560)
      • QKV linear + gate_up projection fusion (#734)
      • Talker cudagraph support and MTP batch inference for Qwen3-Omni talker (#669, #722)
      • Optimized thinker-to-talker projection path (#825)
    • Improved online serving configurability:

      • omni entrypoint tokenizer argument support (#572)
      • Enable tensor_parallel_size for online serving command (#761)
      • Grouped omni arguments into OmniConfig for cleaner UX (#744)
  • Diffusion runtime & acceleration upgrades

    • Added sleep / wake_up / load_weights lifecycle controls for Omni Diffusion, improving operational flexibility for long-running services. (#376)
    • Introduced torch.compile support for diffusion to improve execution efficiency on supported setups. (#684)
    • Added a GPU Diffusion Runner and Diffusion executor, strengthening the core execution stack for diffusion workloads. (#822, #865)
    • Enabled TeaCache acceleration for Z-Image diffusion pipelines. (#817)
    • Defaulted to FA3 (FlashAttention v3) when supported, and extended FlashAttention to support attention masks. (#783, #760)
    • Added CPU offloading support for diffusion to broaden deployment options under memory pressure. (#497)
  • Parallelism and scaling for diffusion pipelines

    • Added CFG parallel support for Qwen-Image and introduced CFG parameter support in online serving. (#444, #824)
    • Enabled DiT tensor parallel for Z-Image diffusion pipeline and extended TP support for qwen-image with test refactors. (#735, #830)
    • Implemented Sequence Parallelism (SP) abstractions for diffusion, including SP support in LongCatImageTransformer. (#779, #721)
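The CFG parallel work above (#444, #824) distributes the conditional and unconditional denoising branches of classifier-free guidance across devices; the two predictions are then recombined with the standard CFG formula. A minimal single-process sketch of that combination step in plain Python (all names here are illustrative, not the vllm-omni API):

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Standard classifier-free guidance combination:
    out = uncond + scale * (cond - uncond).
    Moves the prediction away from the unconditional branch,
    toward the conditional one, by guidance_scale.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# In CFG-parallel execution the two branches would come from different
# ranks; here they are just two precomputed "noise predictions".
uncond_pred = [0.0, 1.0, 2.0]
cond_pred = [1.0, 2.0, 4.0]
print(cfg_combine(uncond_pred, cond_pred, 2.0))  # -> [2.0, 3.0, 6.0]
```

With scale 1.0 the result is exactly the conditional prediction, and with scale 0.0 the unconditional one, which makes the formula easy to sanity-check.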

Stability, Tooling, and Platform

  • Correctness & robustness fixes across diffusion and staged execution:

    • Fixed diffusion model load failure when stage config is present (#860)
    • Fixed stage engine outputs mismatch under online batching (#691)
    • Fixed CUDA-context lifecycle issues and file-lock handling in stage workers (#661, #703)
    • Multiple model/pipeline fixes (e.g., SD3 compatibility, Wan2.2 warmup/scheduler, Qwen2.5-Omni stop behavior). (#772, #791, #804, #773)
  • Profiling & developer experience

    • Added Diffusion Profiler support, plus user guides for diffusion profiling and torch profiler usage. (#709, #738, #570)
  • ROCm / NPU / CI

    • Enhanced ROCm CI coverage, optimized ROCm Dockerfile build time, and refreshed ROCm getting-started documentation. (#542, #885, #678)
    • CI reliability improvements (pytest markers, split tests to avoid timeouts). (#719, #883)

Note: The NPU AR functionality is currently unavailable; it will be supported in the official v0.14.0 release.
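The sleep / wake_up / load_weights lifecycle controls highlighted above (#376) follow a common pattern for long-running services: release accelerator memory while idle, then restore weights on demand without restarting the process. A toy state-machine sketch of that pattern (the class and method bodies are illustrative only, not the vllm-omni implementation):

```python
class DiffusionEngineLifecycle:
    """Toy model of a sleep/wake lifecycle for a long-running engine.

    sleep() drops weights to free device memory; wake_up() restores
    them via load_weights(). A real engine would offload/reload tensors.
    """

    def __init__(self):
        self.weights = None
        self.asleep = True

    def load_weights(self, weights):
        # Stand-in for loading checkpoint tensors onto the device.
        self.weights = dict(weights)
        self.asleep = False

    def sleep(self):
        # Free device memory while the service stays registered.
        self.weights = None
        self.asleep = True

    def wake_up(self, weights):
        # Restore serving capability without a process restart.
        self.load_weights(weights)

    def generate(self, prompt):
        if self.asleep or self.weights is None:
            raise RuntimeError("engine is asleep; call wake_up() first")
        return f"generated({prompt})"

engine = DiffusionEngineLifecycle()
engine.load_weights({"unet": 1})
print(engine.generate("a cat"))  # -> generated(a cat)
engine.sleep()                   # generate() would now raise
engine.wake_up({"unet": 1})
print(engine.generate("a cat"))  # serving again after wake_up
```

The value of this pattern for serving is that a slept engine keeps its endpoint and configuration alive while costing almost no device memory.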

What's Changed

  • [Docs] Fix diffusion module design doc by @SamitHuang in #645
  • [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
  • [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
  • Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
  • [Misc] Merge diffusion forward context by @iwzbi in #582
  • [Doc] User guide for torch profiler by @lishunyang12 in #570
  • [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
  • [BugFix] token2wav code out of range by @Bounty-hunter in #655
  • [Doc] Update version 0.12.0 by @ywang96 in #662
  • [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
  • [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
  • [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
  • [BUGFIX] Delete the CUDA context in the stage process. by @fake0fan in #661
  • [Misc] Fix docs display problem of streaming mode and other related issues by @Gaohan123 in #667
  • [Model] Add Stable Audio Open support for text-to-audio generation by @linyueqian in #331
  • [Doc] Update ROCm getting started instruction by @tjtanaa in #678
  • [Bugfix] Fix f-string formatting in image generation pipelines by @ApsarasX in #689
  • [Bugfix] Solve Ulysses-SP sequence length not divisible by SP degree (using padding and attention mask) by @wtomin in #672
  • omni entrypoint support tokenizer arg by @divyanshsinghvi in #572
  • [Bug fix] fix e2e_total_tokens and e2e_total_time_ms by @LJH-LBJ in #648
  • [BugFix] Explicitly release file locks during stage worker init by @yuanheng-zhao in #703
  • [BugFix] Fix stage engine outputs mismatch bug in online batching by @ZeldaHuang in #691
  • [core] add torch compile for diffusion by @ZJY0516 in #684
  • [BugFix] Remove duplicate width assignment in SD3 pipeline by @dongbo910220 in #708
  • [Feature] Support Qwen3 Omni talker cudagraph by @ZeldaHuang in #669
  • [Benchmark] DiT Model Benchmark under Mixed Workloads by @asukaqaq-s in #529
  • update design doc by @hsliuustc0106 in #711
  • [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni by @gcanlin in #560
  • [Doc]: update vllm serve param and base64 data truncation by @nuclearwu in #718
  • [BugFix] Fix assuming all stage model have talker by @princepride in #730
  • [Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj by @gcanlin in #734
  • [Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) by @dongbo910220 in #735
  • [Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker by @LJH-LBJ in #697
  • [ROCm] [CI] Add More Tests by @tjtanaa in #542
  • [Docs] update design doc templated in RFC by @hsliuustc0106 in #746
  • Add description of code version for bug report by @yenuo26 in #745
  • [misc] fix rfc template by @hsliuustc0106 in #748
  • fix:#issue 432 by @GG-li in #517
  • [Diffusion][Feature] Implement SP support in LongCatImageTransformer by @mxuax in #721
  • [Debug] Clean code in Qwen 3 Omni and add warning for talker temperature. by @tzhouam in #688
  • [feature] cpu offloading support for diffusion by @LawJarp-A in #497
  • [Misc] Group omni arguments into OmniConfig section by @fake0fan in #744
  • [Misc] Enable tensor_parallel_size argument with online serving cmd by @JustQJ in #761
  • [Bugfix] Raise ValueError when joint_strategy='rear' and causal=True in Ring Attention by @mxuax in #767
  • [Feat] add vllm-omni version collection by @sihyeonn in #740
  • [Doc] refactor diffusion doc by @ZJY0516 in #753
  • [Bugfix] Fix stable diffusion3 compatibility error by @iwzbi in #772
  • [Feature] Support Qwen3 Omni talker mtp batch inference by @ZeldaHuang in #722
  • [BugFix]Remove duplicate error handling for request results by @liuyuhanalex in #781
  • [CI] Add pytest markers in config files. by @congw729 in #719
  • [Doc] Fix mkdocs. by @congw729 in #785
  • [Bugfix] Fix generation artifacts of Qwen-Image-Edit-2511 and update pipeline DiT param parsing by @SamitHuang in #776
  • [bugfix] Fix Wan2.2 I2V warmup failure by adding support_image_input attribute by @linyueqian in #791
  • [Misc] add wechat group and star history on README by @david6666666 in #801
  • [BugFix] Fix incorrect mrope positions under cuda graph by @ZeldaHuang in #803
  • [BugFix] Qwen2.5-omni supress end token and won't stop by @yinpeiqi in #773
  • [Feature] Flash Attention to Support Attention Mask by @wtomin in #760
  • [Model] add flux2 klein by @david6666666 in #809
  • [bugfix] use unipc scheduler for Wan 2.2 by @linyueqian in #804
  • [Test] Add full test for Qwen3-Omni-30B-A3B-Instruct by @yenuo26 in #720
  • [Bagel] Support Cache-Dit by @princepride in #736
  • [Perf] Optimize the Qwen2.5-Omni Model thinker-to-talker-proj with nn.Linear by @kechengliu97 in #825
  • [Core]Add GPU Diffusion Runner by @princepride in #822
  • [Feature]: Add CFG param to online serving by @gDINESH13 in #824
  • [diffusion] add tp support for qwen-image and refactor some tests by @ZJY0516 in #830
  • [Core] Implement Diffusion Profiler Support by @lishunyang12 in #709
  • [Bugfix] Diffusion model fails to load when stage config is present by @fhfuih in #860
  • chore: Bump up cache-dit and fix docs links by @DefTruth in #863
  • [Misc] Change benchmark default port by @NickLucche in #872
  • Dev/rebase 0.14.0 and Support GLM-Image by @tzhouam in #847
  • [Doc] Add user guide for diffusion model profiling by @lishunyang12 in #738
  • [Doc] Update rebase doc by @tzhouam in #878
  • [Test] Add full test for Qwen3-Omni-30B-A3B-Instruct for image and audio single modal by @yenuo26 in #827
  • [Diffusion] Non-Intrusive Sequence Parallelism (SP) Model Support Abstraction for vLLM-Omni Framework by @mxuax in #779
  • [Bugfix] Remove the duplicate api registration in vllm-omni by @fake0fan in #880
  • [CI] split tests to avoid timeout by @ZJY0516 in #883
  • [Diffusion][Acceleration] Support TeaCache for Z-Image by @gcanlin in #817
  • [Perf] Fuse Q/K/V Linear with QKVParallelLinear in Qwen2.5Omni DiTAttention by @kechengliu97 in #884
  • [bugfix] support text + audio mixed output by @GG-li in #843
  • [Misc] fix qwen image family redundant computation by @ZJY0516 in #868
  • [ROCm] [CI] Optimize Dockerfilerocm and reduce build time on CI by @tjtanaa in #885
  • [Core]Add Diffusion executor by @natureofnature in #865
  • [Bugfix] Fix video saving bug under certain fps by @SamitHuang in #893
  • [diffusion] use fa3 by default when device supports it by @ZJY0516 in #783
  • [Model] Support Qwen3-TTS model series by @Gaohan123 in #895
  • Support Bagel Model by @princepride in #726
  • Dev/debug qwen tts by @tzhouam in #903

New Contributors

Full Changelog: v0.12.0rc1...v0.14.0rc1