
Releases: vllm-project/vllm-omni

v0.19.0rc1

04 Apr 07:54
191b9a8


Pre-release

Highlights

This release features 71 commits since v0.18.0.

vLLM-Omni v0.19.0rc1 is a rebase-and-production-readiness release candidate aligned with upstream vLLM v0.19.0. It strengthens the runtime and serving stack, expands speech/TTS and diffusion/video capabilities, improves production behavior for Bagel and Wan pipelines, and broadens deployment coverage across new platforms and distributed execution modes.

Key Improvements

  • Rebased to upstream vLLM v0.19.0, while continuing runtime cleanup and stage execution refactors that improve orchestration and production robustness. (#2475, #2006)
  • Expanded speech and TTS serving, including new OmniVoice two-stage support, CosyVoice3 online serving, and multiple Qwen3-TTS / Fish Speech quality and latency fixes. (#2463, #2431, #2108, #2446, #2378, #2358)
  • Improved diffusion and video generation workflows across Bagel, Wan2.2, FLUX.2-dev, and LTX-2, with lower latency, better forwarding behavior, and stronger production correctness. (#2398, #2422, #2397, #2381, #2459, #2393, #2433, #2260)
  • Broadened deployment coverage, adding MUSA platform support, improving XPU readiness, and extending distributed diffusion features such as HSDP and CFG parallelism. (#2337, #2428, #2029, #2021, #1751)

Core Architecture & Runtime

  • Rebased the project to upstream vLLM v0.19.0, keeping vLLM-Omni aligned with the latest upstream runtime behavior and APIs. (#2475)
  • Continued the stage/runtime refactor by moving stage-side inference into dedicated subprocess-based clients and procs, simplifying orchestration and improving isolation for both AR and diffusion stages. (#2006)
  • Added session-based streaming audio input with a realtime WebSocket path for Qwen3-Omni-style workflows, enabling incremental audio input and streamed transcription/output flows. (#2208)
  • Added a nightly wheel release index, making it easier to validate and consume nightly builds in testing and pre-release workflows. (#2345)
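The session-based streaming path sends incremental audio over a WebSocket (#2208). The sketch below shows one plausible way to frame such a session as JSON messages; the event names, field names, and endpoint are illustrative assumptions, not the actual vLLM-Omni wire protocol, so consult the realtime serving docs for the real schema.

```python
import base64
import json

# Hypothetical message framing for a session-based realtime audio stream.
# Event and field names below are illustrative only.

def session_start(session_id: str, sample_rate: int = 16000) -> str:
    """First frame of a session: declares the audio format up front."""
    return json.dumps({
        "type": "session.start",          # hypothetical event name
        "session_id": session_id,
        "audio_format": {"encoding": "pcm16", "sample_rate": sample_rate},
    })

def audio_chunk(session_id: str, pcm_bytes: bytes) -> str:
    """Incremental audio: raw PCM is base64-encoded into a JSON frame."""
    return json.dumps({
        "type": "input_audio.chunk",      # hypothetical event name
        "session_id": session_id,
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def session_end(session_id: str) -> str:
    """Final frame: tells the server no more audio is coming."""
    return json.dumps({"type": "session.end", "session_id": session_id})

frames = [
    session_start("demo-1"),
    audio_chunk("demo-1", b"\x00\x01" * 160),  # 10 ms of fake 16 kHz PCM
    session_end("demo-1"),
]
print(len(frames))  # 3 frames, ready to send over any WebSocket client
```

Any WebSocket client (e.g. the `websockets` package) can then send these frames in order while concurrently reading streamed transcription/output events back.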

Model Support

  • Added OmniVoice two-stage TTS serving support, bringing zero-shot multilingual speech generation into the vLLM-Omni serving stack. (#2463)
  • Added and stabilized CosyVoice3 online serving through /v1/audio/speech, including stage config fixes and CI coverage. (#2431)
  • Added LTX-2 distilled two-stage inference for both text-to-video and image-to-video production workflows. (#2260)
  • Added Wan 2.1 VACE support for conditional video generation workflows, including multiple conditioning modes. (#1885)
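The `/v1/audio/speech` route used by CosyVoice3 online serving follows the OpenAI-style speech API shape. Below is a minimal stdlib sketch of such a request; the server address, model id, and voice name are placeholders, and exact supported fields may differ, so treat this as an assumption-laden illustration rather than the canonical client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local vLLM-Omni server address

def speech_request(text: str, voice: str = "default") -> dict:
    """Build the JSON body for the OpenAI-style speech endpoint."""
    return {
        "model": "FunAudioLLM/CosyVoice3-0.5B",  # placeholder model id
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }

def synthesize(text: str, out_path: str = "out.wav") -> None:
    """POST the request and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(speech_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

body = speech_request("Hello from vLLM-Omni")
print(sorted(body))
```

Calling `synthesize(...)` against a running server would save the synthesized audio; the request-building step runs standalone.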

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-TTS repeated custom-voice serving by introducing an in-memory voice cache for reference-audio artifacts, reducing warm-request latency for repeated voices. (#2108)
  • Fixed a Fish Speech structured voice-clone conditioning regression so cloned voice quality is restored in the prefill path. (#2446)
  • Fixed Qwen3-TTS chunk-boundary handling, case-insensitive speaker lookup, and demo-serving issues to make TTS behavior more reliable in real deployments. (#2378, #2358, #2372)
  • Added better benchmark support for Qwen3-TTS Base and VoiceDesign models so serving and HF benchmark paths correctly reflect task-specific request formats. (#2411)
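The warm-request win from #2108 comes from keying processed reference-audio artifacts so repeated custom-voice requests skip re-extraction. The following is a minimal sketch of that caching idea with hypothetical names, not the actual vLLM-Omni implementation:

```python
import hashlib

# Sketch of an in-memory voice cache: key the expensive reference-audio
# artifacts by a content hash so repeat requests for the same voice hit cache.

class VoiceCache:
    def __init__(self):
        self._cache: dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(ref_audio: bytes) -> str:
        return hashlib.sha256(ref_audio).hexdigest()

    def get_or_build(self, ref_audio: bytes, build):
        """Return cached artifacts for this reference audio, building once."""
        key = self._key(ref_audio)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = build(ref_audio)  # expensive path runs once
        return self._cache[key]

cache = VoiceCache()
expensive_calls = []

def extract_artifacts(audio: bytes):
    expensive_calls.append(audio)  # stand-in for codec/embedding extraction
    return {"n_bytes": len(audio)}

for _ in range(3):  # three requests reusing the same custom voice
    art = cache.get_or_build(b"ref-wav-bytes", extract_artifacts)
print(cache.hits, cache.misses, len(expensive_calls))  # → 2 1 1
```

Only the first request pays the extraction cost; the two warm requests reuse the cached artifacts, which is the latency reduction the change targets.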

Diffusion, Image & Video Generation

  • Improved Wan2.2 runtime efficiency by optimizing rotary embedding behavior and skipping unnecessary cross-attention Ulysses SP paths where appropriate. (#2393, #2459)
  • Strengthened Bagel production behavior with earlier KV-ready forwarding, fixes for delayed decoding in AR/DiT workflows, proper single-stage img2img routing, and a dedicated single-stage config. (#2398, #2422, #2397, #2381)
  • Added Bagel thinking mode in multi-stage serving, expanding interactive and reasoning-style generation workflows. (#2447)
  • Fixed FLUX.2-dev guidance handling so guidance scale is applied correctly during generation. (#2433)
  • Added a synchronous /v1/videos/sync endpoint for latency-sensitive benchmarking and direct-response video generation workflows. (#2049)
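Unlike the async job-based video API, `/v1/videos/sync` (#2049) returns the result directly in the response. A stdlib sketch of such a call is below; the server address, model id, and request field names are assumptions for illustration only.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local vLLM-Omni server address

def video_request(prompt: str) -> dict:
    """Build a minimal text-to-video request body (field names assumed)."""
    return {
        "model": "Wan-AI/Wan2.2-T2V-A14B",  # placeholder model id
        "prompt": prompt,
        "size": "832x480",                  # assumed resolution parameter
    }

def generate_video_sync(prompt: str, out_path: str = "out.mp4") -> None:
    """POST and block until the video bytes come back in the response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/videos/sync",
        data=json.dumps(video_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

body = video_request("a red fox running through snow")
print(sorted(body))
```

Blocking on the response makes this path convenient for latency benchmarking, since end-to-end time is one request round trip rather than a submit/poll loop.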

Quantization & Memory Efficiency

  • Added offline AutoRound W4A16 support for diffusion models, improving deployability for memory-constrained setups. (#1777)
  • Fixed layer-wise offload incompatibility with HSDP, improving compatibility between memory-saving and distributed execution paths. (#2021)

Platforms, Distributed Execution & Hardware Coverage

  • Added MUSA platform support for Moore Threads GPUs, expanding vLLM-Omni beyond the existing CUDA/ROCm/NPU/XPU coverage. (#2337)
  • Improved XPU readiness for speech serving by removing CUDA-only assumptions in Voxtral TTS components and adding an XPU stage config. (#2428)
  • Expanded distributed diffusion support with HSDP for Qwen-Image-series, Z-Image, and GLM-Image, and added CFG parallel support for HunyuanImage3.0. (#2029, #1751)
  • Fixed distributed gather behavior for non-contiguous tensors, improving correctness in CFG-parallel and related distributed paths. (#2367)

CI, Benchmarks & Documentation

  • Refreshed the diffusion documentation structure around feature compatibility, parallelism, cache acceleration, quantization, and serving examples, making the diffusion stack easier to navigate and adopt.
  • Expanded CI and E2E coverage for speech, diffusion, and video-serving scenarios, especially around CosyVoice3, Qwen3-TTS benchmarking, and Wan-family validation. (#2431, #2411, #2262)

Note

  • v0.19.0rc1 is a release candidate focused on validating the upstream rebase, the refreshed runtime architecture, and the expanded speech/diffusion/platform support before the final v0.19.0 release.
  • Some low-signal CI and documentation maintenance changes were intentionally merged into broader themes instead of being listed one-by-one, following the project’s recent release-note style.

What's Changed

  • [Bugfix][HunyuanImage3.0] Fix default guidance_scale from 1.0 to 4.0 and port GPU MoE ForwardContext fix from NPU by @nussejzz in #2142
  • [Feat] support quantization for Flux Kontext by @RuixiangMa in #2184
  • [Tests][Qwen3-Omni] Add performance test cases by @amy-why-3459 in #2011
  • [Docs] Modify the documentation description for streaming output by @amy-why-3459 in #2300
  • Fix: Enable /v1/models endpoint for pure diffusion mode by @majiayu000 in #805
  • [skip ci] [Docs]: add CI Failures troubleshooting guide for contributors by @lishunyang12 in #1259
  • [Qwen3-Omni][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor by @LJH-LBJ in #2291
  • [Feature] [HunyuanImage3] Add TeaCache support for inference acceleration by @nussejzz in #1927
  • [Misc] Make gradio an optional dependency and upgrade to >=6.7.0 by @Lidang-Jiang in #2221
  • [ROCm] [CI] Migrate to use amd docker hub for ci by @tjtanaa in #2303
  • [Feat] add helios fp8 quantization by @lengrongfu in #1916
  • [Bugfix] fix: handle Qwen-Image-Layered layered RGBA output for jpeg edits by @david6666666 in #2297
  • [Doc] Add transformers version requirement in GLM-Image example doc by @chickeyton in #2265
  • [Bugfix] Fix Qwen3TTSConfig init order to be compatible with newer Transformers (5.x) by @RuixiangMa in #2306
  • [Test] Add Qwen-tts test cases and unify the style of existing test cases by @yenuo26 in #2195
  • [skip ci][Doc] Refine the Diffusion Features User Guide by @wtomin in #1928
  • [Bugfix] fix: return 400 for unsupported multi-image edits such as Qwen-Image-Layered by @david6666666 in #2298
  • [Bugfix] fix: validate layered image layers range by @david6666666 in #2334
  • [skip ci][Docs] reorganize multiple L4 test guidelines by @fhfuih in #2119
  • [Diffusion] Refactor CFG parallel for extensibility and performance by @TKONIY in #2063
  • Fix Qwen3-TTS Base on NPU running failed by @OrangePure in #2353
  • [Test] Fix 4 broken Qwen3-TTS async chunk unit tests by @linyueqian in #2351
  • [Test] Add qwen3-omni tests for audio_in_video and one word prompt by @yenuo26 in #2097
  • [CI] fix test: use minimum supported layered output count by @david6666666 in #2350
  • [CI]test: add wan22 i2v video similarity e2e by @david6666666 in #2262
  • [Bugfix] Fix case-sensitivity in Qwen3 TTS speaker name lookup by @reidliu41 in #2358
  • Fix Qwen3-TTS gradio demo by @noobHappylife in #2372
  • [skip ci] update release 0.18.0 by @hsliuustc0106 in #2380
  • [Bugfix] Update Whisper model loading to support multi-GPU ...

v0.18.0

28 Mar 03:30
f55ea28


Highlights

This release features 324 commits from 83 contributors, including 38 new contributors.

vLLM-Omni v0.18.0 is a major rebase and systems release that aligns the project with upstream vLLM v0.18.0, strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.

Key Improvements

  • Rebased to upstream vLLM v0.18.0, with follow-up updates to docs and dockerfiles, plus cleanup of patches that were no longer needed after the rebase. (#2037, #2038, #2062, #2271)
  • Refactored the serving entrypoint architecture, making the stack cleaner and easier to extend, while also laying groundwork for PD disaggregation, multimodal output decoupling, coordinator-based orchestration, and pipeline config cleanup. (#1908, #1863, #1816, #1465, #1115)
  • Strengthened audio, speech, and omni production serving, especially for Qwen3-TTS, Qwen3-Omni, MiMo-Audio, Fish Speech S2 Pro, and Voxtral TTS, with lower latency, better concurrency, more robust streaming, and improved online serving stability. (#1583, #1617, #1797, #1913, #1985, #1852, #1656, #1963, #2009, #2019, #2239, #1688, #1752, #1964, #2225, #1859, #2145, #2151, #2156, #2158)
  • Delivered substantial diffusion optimization, with scheduler/executor refactoring, faster startup, better cache-dit / TeaCache integration, broader TP/SP/HSDP support, and multiple correctness fixes for online and offline serving. (#1625, #1504, #1715, #1834, #1848, #1234, #2163, #1979, #2101, #2176)
  • Expanded model support across omni, speech, image, and video, including Helios, Helios-Mid / Distilled, MammothModa2, Fun CosyVoice3-0.5B-2512, FLUX.2-dev, FLUX.1-Kontext-dev, Hunyuan Image3 AR, Fish Speech S2 Pro, Voxtral TTS, DreamID-Omni, LTX-2, and HunyuanVideo-1.5. (#1604, #1648, #336, #498, #1629, #561, #759, #1798, #1803, #1855, #841, #1516)
  • Introduced a unified quantization framework and expanded quantization support across diffusion and image workloads, including INT8, FP8, and GGUF-related enablement. (#1764, #1470, #1640, #1755, #1473, #2180)
  • Improved RL and custom pipeline readiness through close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training, including collective RPC support at the entrypoint, custom input/output support, async batching for Qwen-Image, and dedicated E2E coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)

Core Architecture & Runtime

  • Reworked the core serving architecture through the vLLM-Omni Entrypoint Refactoring, while also adding PD disaggregation scaffolding, coordinator support, multimodal output decoupling foundations, and cleaner model/pipeline configuration handling. (#1908, #1863, #1465, #1816, #1115, #1958, #2105)
  • Continued cleanup of runtime internals with stage/step pipeline refactors, dead-code cleanup, and improvements to async engine robustness and scheduler state handling. (#1368, #1579, #2153, #2028, #1893)

Model Support

  • Omni / speech / audio models: added or expanded support for MammothModa2, Fun CosyVoice3-0.5B-2512, Fish Speech S2 Pro, and Voxtral TTS. (#336, #498, #1798, #1803)
  • Image / diffusion models: added or expanded support for Hunyuan Image-3.0, FLUX.2-dev, FLUX.1-Kontext-dev, and continued improvements for Qwen-Image, Qwen-Image-Edit, Qwen-Image-Layered, LongCat-Image, GLM-Image, Bagel, and OmniGen2. (#759, #1629, #561, #1682, #2085, #1970, #2035, #1918, #1578, #1669, #1903, #1711, #1934)
  • Video models: added or expanded support for Helios, Helios-Mid / Distilled, DreamID-Omni, LTX-2, HunyuanVideo-1.5, and updated supported video-generation coverage for Wan2.1-T2V. (#1604, #1648, #1855, #841, #1516, #1920)

Audio, Speech & Omni Production Optimization

  • Qwen3-TTS received major optimization work, including lower TTFA (time-to-first-audio), better high-concurrency throughput, improved Code Predictor / Code2Wav execution, websocket streaming audio output, async scheduling by default, voice upload support, optional ref_text, and long ref_audio handling fixes. (#1583, #1617, #1797, #1913, #1985, #1852, #1719, #1853, #1201, #1879, #2046, #2104)
  • Qwen3-Omni gained lower inter-packet latency, speaker-switching support, decode-alignment fixes, and multiple correctness fixes for answer quality and online serving stability. (#1656, #1963, #2009, #2019, #2239)
  • MiMo-Audio improved compatibility and production robustness with TP fixes, broader attention backend support, configurable chunk sizing, and documentation to prevent noise-only outputs under unsupported attention setups. (#1688, #1752, #1964, #2225, #2205)
  • Fish Speech S2 Pro and Voxtral TTS were productionized further with online serving, voice cloning, better TTFP / inference performance, multilingual demo support, lighter flow matching, and voice-embedding fixes. (#1798, #1859, #2145, #1803, #2045, #2056, #2067, #2151, #2156, #2158, #2023)
  • Added or improved speech-serving interfaces, including speech batch entrypoint, speaker embedding support for speech and voices APIs, proper HTTP status handling, and streaming wav response support. (#1701, #1227, #1687, #1819)

Diffusion, Image & Video Generation

  • Runtime refactor & benchmarking: Refactored the diffusion runtime with cleaner scheduler/executor boundaries, better request-state flow, unified profiling, and stronger benchmarking infrastructure. (#1625, #2099, #1757, #1917, #1995)
  • Performance & startup gains: Improved diffusion performance through multi-threaded weight loading for Wan2.2, reduced IPC overhead for single-stage serving, cache-dit upgrades, TeaCache support, and nightly performance improvements for Qwen-Image. (#1504, #1715, #1834, #1234, #1314, #1805, #2111)
  • Distributed scaling: Expanded distributed diffusion execution with broader TP/SP/HSDP support across Flux, GLM-Image, Hunyuan, and Bagel. (#1250, #1900, #1918, #2163, #1903)
  • Serving UX & API ergonomics: Improved serving usability with a progress bar for diffusion models, richer image-edit parameters such as layers and resolution, and extra request-body support for video APIs. (#1652, #2053, #1955)
  • Correctness & stability fixes: Fixed a wide range of diffusion correctness issues, including config misalignment between offline and online inference, TP/no-seed broken-image issues, GLM-Image stage/device bugs, and TeaCache incompatibilities. (#1979, #2176, #2137, #2101, #1894, #2025)

Quantization & Memory Efficiency

  • Added the Unified Quantization Framework as a core infrastructure upgrade for more consistent quantized execution across model families. (#1764)
  • Expanded quantization support for diffusion/image workloads, including INT8 for DiT (Z-Image and Qwen-Image), FP8 for Flux transformers, and GGUF adapter support for Qwen-Image. (#1470, #1640, #1755)
  • Improved compatibility between quantization and runtime features such as CPU offload, tensor parallelism, and Flux-family execution. (#1473, #1723, #1978, #2180)

RL, Serving & Integrations

  • verl collaboration & Qwen-Image E2E RL: Expanded RL-oriented serving in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training with collective RPC support, custom input/output, async batching for Qwen-Image, and dedicated E2E CI coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
  • Rollout scaling for visual RL: Added rollout building blocks referenced by verl’s Qwen-Image integration plan, including async batching for Qwen-Image plus tensor-parallel and data-parallel support for diffusion serving. (#1593, #1713, #1706)
  • Deployment & ecosystem integrations: Improved deployment and ecosystem integration with a Helm chart for Kubernetes, ComfyUI video & LoRA support, and a rewritten async video API lifecycle. (#1337, #1596, #1665)

Platforms, Distributed Execution & Hardware Coverage

  • Continued improving portability across CUDA, ROCm, NPU, and XPU/Intel GPU environments, including rebase follow-ups, ROCm CI setup, Intel CI dispatch, Intel GPU docs, and NPU docker/docs refreshes. (#2017, #1984, #1721, #2154, #2271, #2091)
  • Expanded distributed execution coverage with T5 tensor parallelism, more model-level TP/SP/HSDP support, and better handling of visible GPUs and stage-device initialization. (#1881, #1250, #1900, #1918, #2163, #2025)

CI, Benchmarks & Documentation

  • Strengthened release engineering and CI with a release pipeline, richer nightly benchmark/report generation, L3/L4/L5 test layering, expanded model E2E coverage, and stronger diffusion test coverage. (#1726, #1831, #1995, #1514, #1799, #2086, #1869, #2085, #2087, #2132, #2129, #2023)
  • Improved benchmarking with Qwen3-TTS benchmark scripts, nightly Qwen3-TTS and Qwen-Image performance tracking, diffusion timing, random benchmark datasets, and T2I/I2I accuracy benchmark integration. (#1573, #1700, #1805, #2111, #1757, #1657, #1917)
  • Refreshed project docs across installation, omni/TTS docs, diffusion serving parameters, UAA documentation, developer guides, and governance. (#1762, #1693, #2051, #2130, #2148, #1889)

Note

  • GLM-Image requires manually upgrading the transformers version to >= 5.0.

What's Changed

  • 0.16.0 release by @ywang96 in #1576
  • [Refactor]: Phase1 for rebasing_addit...

v0.18.0rc1

21 Mar 14:29
6838533


Pre-release

Highlights

This release features approximately 120 commits across 120+ pull requests from 50+ contributors, including 13 new contributors.

Expanded Model Support

This release continues to grow the multimodal model ecosystem with several major additions:

  • Added FLUX.2-dev image generation model (#1629).
  • Added Bagel multistage img2img support (#1669).
  • Added HunyuanVideo-1.5 text-to-video and image-to-video support (#1516).
  • Added Voxtral TTS model (#1803, #2026, #2056).
  • Added Fish Speech S2 Pro with online serving and voice cloning (#1798).
  • Added Dreamid-Omni from ByteDance (#1855).
  • Extended NPU support for HunyuanImage3 diffusion model (#1689).
  • Added OmniGen2 transformer config loading for HF models (#1934).

Performance Improvements

Multiple optimizations improve throughput, latency, and runtime efficiency:

  • Qwen3-Omni code predictor re-prefill + SDPA to eliminate decode hot-path CPU round-trips (#2012).
  • Qwen3-TTS high-concurrency throughput & latency boost (#1852).
  • Qwen3-TTS Code2Wav triton SnakeBeta kernel and CUDA Graph support (#1797).
  • Qwen3-TTS CodePredictor torch.compile with reduce-overhead and dynamic=False (#1913).
  • Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls (#1985).
  • Simple dynamic TTFA based on Code2Wav load for Qwen3-TTS (#1714).
  • Enabled async_scheduling by default for Qwen3-TTS (#1853).
  • Fish Speech S2 Pro inference performance improvements (#1859).
  • Fix slow hasattr in CUDAGraphWrapper.getattr (#1982).
  • Diffusion timing profiling improvements (#1757).

Inference Infrastructure & Parallelism

New infrastructure capabilities improve scalability and production readiness:

  • Model Pipeline Configuration System refactor (Part 1) (#1115).
  • vLLM-Omni entrypoint refactoring for cleaner startup flow (#1908).
  • Expert parallel for diffusion MoE layers (#1323).
  • Sequence parallelism (SP) support for FLUX.2-klein (#1250) and HSDP for Flux family (#1900).
  • T5 Tensor Parallelism support (#1881).
  • LongCat Sequence Parallelism refactored to use SP Plan (#1772).
  • PD disaggregation scaffolding (Split #1303 Part 1) (#1863).
  • Coordinator module with unit tests (#1465).
  • Refactored pipeline stage/step pipeline (#1368).
  • Helm Chart to deploy vLLM-Omni on Kubernetes (#1337).

Text-to-Speech Improvements

Major TTS pipeline improvements for streaming, quality, and new models:

  • Streaming audio output via WebSocket for Qwen3-TTS (#1719).
  • Gradio demo for Qwen3-TTS online serving (#1231).
  • Added wav response_format when stream is true in /v1/audio/speech (#1819).
  • Fixed Base voice clone streaming quality and stop-token crash (#1945).
  • Fixed streaming initial chunk — removed dynamic initial chunk, compute only on initial request (#1930).
  • Preserved ref_code decoder context for Base ICL in Qwen3-TTS (#1731).
  • Restored voice upload API and profiler endpoints reverted by #1719 (#1879).
  • BugFix for CodePredictor CudaGraph Pool (#2059).

Quantization & Hardware Support

  • Int8 quantization support for DiT (Z-Image & Qwen-Image) (#1470).
  • Added cache-dit support for HunyuanImage3 (#1848) and Flux.2-dev (#1814).
  • Enabled CPU offloading and Cache-DiT together on diffusion models (#1723).
  • Upgraded cache-dit from 1.2.0 to 1.3.0 (#1834).
  • NPU upgrade to v0.17.0 (#1890).
  • Updated Bagel modeling to remove CUDA hardcode and added XPU stage_config (#1931).
  • Updated GpuMemoryMonitor to DeviceMemoryMonitor for all hardware (#1526).
  • ROCm bugfix for device environment issues and CI setup (#1984, #2017).
  • Intel CI dispatch in Buildkite folder (#1721).

Frontend & Serving

  • ComfyUI video & LoRA support (#1596).
  • Rewrote video API for async job lifecycle (#1665).
  • Fix /chat/completion not reading extra_body for diffusion models (#2042).
  • Fix online server returning multiple images (#2007).
  • Fix Ovis Image crash when guidance_scale is set without negative_prompt (#1956).
  • Fix config misalignment between offline and online diffusion inference (#1979).
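The `/chat/completion` fix above (#2042) concerns diffusion parameters that the OpenAI Python client passes via `extra_body`, which the client merges into the top level of the JSON request. The sketch below shows that merge as a raw body; the model id and the specific diffusion fields (`guidance_scale`, `size`) are illustrative assumptions.

```python
# Sketch: how extra_body-style diffusion parameters end up in the chat
# completions JSON body. Field names here are assumptions for illustration.

def chat_image_body(prompt: str) -> dict:
    base = {
        "model": "Qwen/Qwen-Image",  # placeholder diffusion model id
        "messages": [{"role": "user", "content": prompt}],
    }
    extra_body = {                   # what the client's extra_body carries
        "guidance_scale": 4.0,
        "size": "1024x1024",
    }
    return {**base, **extra_body}    # client merges extras at the top level

body = chat_image_body("a watercolor lighthouse at dusk")
print(sorted(body))
```

With the fix, the server reads these top-level extras for diffusion models instead of silently dropping them.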

Reliability, Tooling & Developer Experience

  • OmniStage.try_collect() patched with process alive checks (#1560) and Ray alive checks (#1561).
  • Nightly Buildkite Pytest test case statistics with HTML report by email (#1674).
  • Nightly Benchmark HTML generator and updated EXCEL generator (#1831).
  • Added multimodal processing correctness tests for Omni models (#1445).
  • Added Qwen3-TTS nightly performance benchmark (#1700) and benchmark scripts (#1573).
  • Added Governance section (#1889).
  • Rebase to vllm v0.18.0 (#2037, #2038).
  • Numerous bug fixes across models, configuration, parallelism, and CI pipelines.

What's Changed

  • [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
  • [Bagel]: Support multistage img2img by @princepride in #1669
  • [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
  • [Doc] CLI Args Naming Style Correction by @wtomin in #1750
  • [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
  • [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
  • Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint by @ekagra-ranjan in #1255
  • [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
  • [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
  • [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
  • Revert "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint" by @hsliuustc0106 in #1789
  • [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
  • [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
  • [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
  • [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
  • [Docs] update async chunk performance diagram by @R2-Y in #1741
  • [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
  • [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
  • [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
  • update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
  • Add coordinator module and corresponding unit test by @NumberWan in #1465
  • [Model]: add FLUX.2-dev model by @nuclearwu in #1629
  • [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
  • [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
  • [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
  • [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
  • [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
  • [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
  • [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
  • Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
  • [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
  • Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
  • [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
  • [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
  • [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
  • [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
  • [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever...

v0.17.0rc1

09 Mar 11:28
155856f


Pre-release

Highlights

This release features approximately 70 commits across 72 pull requests from 30+ contributors, including 12 new contributors.

Expanded Model Support

This release significantly expands the supported multimodal model ecosystem:

  • Added support for Helios models and Helios-Mid / Distilled variants (#1604, #1648).
  • Added Hunyuan Image3 AR generation support (#759).
  • Added LTX-2 text-to-video and image-to-video support (#841).
  • Added support for MammothModa2 (#336) and CosyVoice3-0.5B (#498).
  • Improved compatibility and fixes for Qwen3-Omni and LongCat models (#1602, #1485, #1631).

Performance Improvements

Multiple optimizations improve startup time, streaming latency, and runtime efficiency:

  • Accelerated diffusion model startup with multi-threaded weight loading (#1504).
  • Reduced inter-packet latency in async chunking for Qwen3-Omni streaming (#1656).
  • Reduced TTFA (time-to-first-audio) for Qwen3-TTS via flexible initial phases (#1583).
  • Optimized TTS code predictor execution by removing GPU synchronization bottlenecks (#1614).
  • Enabled torch.compile + CUDA Graph for TTS pipelines (#1617).
  • Reduced IPC overhead in single-stage diffusion serving for Wan2.2 (#1715).

Inference Infrastructure & Parallelism

New infrastructure improvements improve scalability and flexibility for multimodal serving:

  • Added CFG KV-cache transfer support for multi-stage pipelines (#1422).
  • Added CFG parallel mode for Bagel diffusion models (#1578, #1695).
  • Refactored tile/patch parallelism to simplify support for additional models (#1366).
  • Added VAE patch parallel CLI option for online diffusion serving (#1716).
  • Enabled async chunking for offline inference and configurable chunk parameters (#1415, #1423).
  • Added collective RPC API entrypoint and custom I/O support for RL workloads (#1646).

Text-to-Speech Improvements

Major improvements to the stability and flexibility of the TTS pipeline:

  • Added voice upload API for Qwen3-TTS (#1201).
  • Added flexible task_type configuration for Qwen3-TTS models (#1197).
  • Added non-async chunk mode and improved offline batching support (#1678, #1417).
  • Fixed several stability issues including predictor crashes, all-silence output, and Transformers 5.x compatibility (#1619, #1664, #1536).

Quantization & Hardware Support

  • Added FP8 quantization support for Flux transformers (#1640).
  • Improved NPU support, including MindIE-SD AdaLN compatibility (#1537).
  • Improved device abstraction by replacing hard-coded CUDA generators with platform-aware detection (#1677).
  • Updated XPU container configuration (#1545).

Reliability, Tooling & Developer Experience

  • Added progress bar support for diffusion models (#1652).
  • Introduced benchmark collection and reporting scripts in CI (#1307).
  • Added TTS developer guide and testing documentation (#1693, #1376).
  • Improved API robustness with better error handling and request validation (#1641, #1687).
  • Numerous bug fixes across models, kernels, and configuration handling (#1391, #1566, #1609, #1661).

What's Changed

  • 0.16.0 release by @ywang96 in #1576
  • [Refactor]: Phase1 for rebasing_additional_info by @divyanshsinghvi in #1394
  • [Feature]: Support cfg kv-cache transfer in multi-stage by @princepride in #1422
  • [BugFix] Fix load_weights error when loading HunyuanImage3.0 by @Semmer2 in #1598
  • [Bugfix] fix kernel error for qwen3-omni by @R2-Y in #1602
  • [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio by @qibaoyuan in #1570
  • [Bugfix] Import InputPreprocessor into Renderer by @lengrongfu in #1566
  • [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading by @SamitHuang in #1504
  • [Bugfix][Model] Fix LongCat Image Config Handling / Layer Creation by @alex-jw-brooks in #1485
  • [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context by @ZhanqiuHu in #1619
  • [Debug] Enable curl retry aligned with openai by @tzhouam in #1539
  • [Doc] Fix links in the configuration doc by @yuanheng-zhao in #1615
  • [CI] Add scripts for benchmark collection and email distribution. by @congw729 in #1307
  • [FEATURE] Tile/Patch parallelism refactor for easily support other models by @Bounty-hunter in #1366
  • [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation by @yuanheng-zhao in #1609
  • Make chunk_size and left_context_size configurable via YAML for async chunking by @LJH-LBJ in #1423
  • [Bugfix] Fix transformers 5.x compat issues in online TTS serving by @linyueqian in #1536
  • [Refactor] lora: reuse load_weights packed mapping by @dongbo910220 in #991
  • [Model]: support Helios from ByteDance by @princepride in #1604
  • [chore] add _repeated_blocks for regional compilation support by @RuixiangMa in #1642
  • [Bugfix] Add TTS request validation to prevent engine crashes by @linyueqian in #1641
  • [CI] Fix ASCII codes. by @congw729 in #1647
  • [Misc] update wechat by @david6666666 in #1649
  • docs: Announce vllm-omni-skills community project by @hsliuustc0106 in #1651
  • [Model] Add Hunyuan Image3 AR Support by @usberkeley in #759
  • [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases by @amy-why-3459 in #1628
  • [Bugfix] Fix Dtype Parsing by @alex-jw-brooks in #1391
  • [XPU] fix UMD version in docker file by @yma11 in #1545
  • add support for MammothModa2 model by @HonestDeng in #336
  • [Model] Fun cosy voice3-0.5-b-2512 by @divyanshsinghvi in #498
  • [Bugfix] Enable torch.compile for low noise model (transformer_2) by @lishunyang12 in #1541
  • [NPU] [Features] [Bugfix] Support mindiesd adaln by @jiangmengyu18 in #1537
  • [FP8 Quantization] Add FP8 quantization support for Flux transformer by @zzhuoxin1508 in #1640
  • Replace hard-coded cuda generator with current_omni_platform.device_type by @pi314ever in #1677
  • [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup by @alex-jw-brooks in #1631
  • [Misc] remove logits_processor_pattern this field, because vllm have … by @lengrongfu in #1675
  • [CI] Remove high concurrency tests before issue #1374 fixed. by @congw729 in #1683
  • [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk by @ZeldaHuang in #1656
  • [Feat][Qwen3TTS] reduce TTFA with flexible initial phase by @JuanPZuluaga in #1583
  • [Model] support LTX-2 text-to-video image-to-video by @david6666666 in #841
  • [BugFix] Return proper HTTP status for ErrorResponse in create_speech by @Lidang-Jiang in #1687
  • [Doc] Add the test guide document. [skip ci] by @yenuo26 in #1376
  • [UX] Add progress bar for diffusion models by @gcanlin in #1652
  • [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder by @ZhanqiuHu in #1664
  • [Feature] Support flexible task_type configuration for Qwen3-TTS models by @JackLeeHal in #1197
  • [Cleanup] Move cosyvoice3 tests to model subdirectory by @linyueqian in #1666
  • [Feature][Bagel] Add CFG parallel mode by @nussejzz in #1578
  • perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor by @dubin555 in #1614
  • [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph by @Sy0307 in #1617
  • [MiMo-Audio] Bugfix tp lg than 1 by @qibaoyuan in #1688
  • Add non-async chunk support for Qwen3-TTS by @linyueqian in #1678
  • [1/N][Refactor] Clean up dead code in output processor by @gcanl...

v0.16.0

28 Feb 08:33
3d9fa8d


Highlights

This release features approximately 121 commits (merged PRs) from ~60 contributors (24 new contributors).

vLLM-Omni v0.16.0 is a major alignment + capability release that rebases the project onto upstream vLLM v0.16.0 and significantly expands performance, distributed execution, and production readiness across Qwen3-Omni / Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image and the Diffusion (DiT) image/video stack—while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.

Key Improvements

  • Rebase to upstream vLLM v0.16.0: Tracks the latest vLLM runtime behavior and APIs while keeping Omni’s error handling aligned with upstream expectations. (#1357, #1122, plus follow-up fixes like #1401)
  • Qwen3-Omni performance + correctness: Performance optimizations (CUDA graph, async-chunk, streaming output) reduce TTFP by 90% and bring RTF to 0.22–0.45, plus precision and E2E metric correctness fixes. (#1378, #1352, #1288, #1018, #1292)
  • MiMo-Audio production support: The same classes of optimization (CUDA graph, async-chunk, streaming output) bring RTF to ~0.2, 11x faster than baseline. (#750)
  • Qwen3-TTS production upgrades: Disaggregated inference pipeline support, streaming output, batched Code2Wav decoding, and CUDA Graph support for speech tokenizer decoding—plus multiple robustness fixes across task type handling and voice cloning. (#1161, #1438, #1426, #1205, #1317, #1554)
  • Bagel acceleration & scalability: Adds TP support, introduces CFG capabilities, and accelerates multi-branch CFG by merging branches into a single batch; includes KV transfer stability fixes. (#1293, #1310, #1429, #1437)
  • Diffusion distributed execution expansion: Adds/extends TP/SP/HSDP and reduces redundant communication overhead; improves pipeline parallelism options (e.g., VAE patch parallel) and correctness across multiple diffusion families. (#964, #1275, #1339, #756, #1428)
  • Quantization for DiT: Introduces FP8 quantization support and native GGUF quantization support for diffusion transformers, with code-path cleanups. (#1034, #1285, #1533)
  • Broader model coverage (audio + image): Adds MiMo-Audio-7B-Instruct support and performance improvements for GLM-Image pipelines. (#750, #920)

Diffusion, Image & Video Generation

  • New/expanded model coverage

    • HunyuanImage3 support and v0.16.0 follow-ups removing CUDA hardcoding + MOE fixes. (#1085, #1402, #1401)
    • OmniGen2 support. (#513)
    • nextstep_1 diffusion model (T2I-only). (#612)
  • Distributed & parallel execution

    • TP support additions/expansions for diffusion models (e.g., Wan 2.2, SD 3.5). (#964, #1336)
    • HSDP for diffusion models for improved scalability. (#1339)
    • VAE patch parallelism support (and enablement for SD3.5). (#756, #1428)
    • Sequence-parallel comm reduction by refining SP hook design. (#1275)
  • Performance & memory efficiency

    • Flux caching features (e.g., cache_dit) and CFG-parallel improvements for Flux.1-dev. (#1145, #1269)
    • Process-level memory calculation hooks for diffusion workloads. (#1276)
    • Platform-wide enablement of layerwise offload. (#1492)
  • Correctness & stability

    • Multiple pipeline stability and correctness fixes (seed handling, attention mask dtype/shape, tokenizer padding issues, init/download safety, model detect robustness, etc.). (e.g., #1249, #1248, #1349, #1241, #1213, #1254, #1562)

Audio, Speech & Omni (Qwen3-TTS / MiMo-Audio)

  • Qwen3-TTS feature set maturation

    • Disaggregated inference pipeline support for stage-based / split deployment. (#1161)
    • Streaming output for v1/audio/speech-style workflows. (#1438)
    • Code2Wav batched decoding and async-chunk batch inference enhancements. (#1426, #1246)
    • CUDA Graph support for the speech tokenizer decoder. (#1205)
  • Stability & quality

    • Fixes for task_type handling, snapshot/download behavior, configuration options, and voice clone corruption edge cases. (#1317, #1318, #1177, #1554, #1455)
    • More robust handling of multimodal outputs that attach audio payloads and related server-side audio data processing. (#1203, #1222)

Multimodal Model Improvements

  • Bagel

    • TP support for scaling across devices. (#1293)
    • CFG enablement and multi-branch CFG merged into a single batch to improve throughput and reduce per-branch overhead. (#1310, #1429)
    • KV transfer and stability fixes. (#1437)
  • GLM-Image

    • Performance improvements for GLM-Image workloads. (#920)
    • Additional image-serving hardening that benefits GLM-Image deployments (endpoint/pipeline validation and crash fixes in edge cases). (e.g., #1141, #1265, #1248)

Serving, APIs & Integrations

  • OpenAI-compatible video serving

    • Adds Wan2.2 T2V and I2V online serving via OpenAI /v1/videos API. (#1073)
    • Supports irregular output shapes for Wan2.2. (#1279)
  • Online serving robustness & usability

    • Unify CLI argument naming style and forward serve parameters more consistently to models. (#1309, #985)
    • Per-request generator_device for online image generation/edit flows. (#1183)
    • Fixes for image edit endpoint validation and RoPE crashes on explicit H/W. (#1141, #1265)
  • Ecosystem integration

    • ComfyUI integration for improved workflow adoption. (#1113)
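
Since the /v1/videos endpoint follows the OpenAI style, a client request can be built like the sketch below. This is a minimal illustration only: the model identifier and payload field names are assumptions, not the verified schema, and the server address is a placeholder.

```python
# Sketch of a request to the OpenAI-compatible /v1/videos endpoint added
# for Wan2.2 T2V/I2V online serving (#1073). Field names and the model id
# below are assumed for illustration.
import json
import urllib.request

def build_video_request(prompt: str,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    payload = {
        "model": "Wan-AI/Wan2.2-T2V-A14B",  # assumed model identifier
        "prompt": prompt,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_video_request("a red fox running through fresh snow")
```

Sending `req` with `urllib.request.urlopen` (against a running server) would return the generated video response.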

Performance, Scheduling & Memory Accounting

  • Async chunk enhancements

    • Overlap chunk I/O and compute via async scheduling to reduce idle time in chunked pipelines. (#951)
    • Async-chunk refactors and shape mismatch fixes for stability. (#1151, #1195)
  • Metrics & benchmarking

    • Metrics structure optimization and multiple fixes for token/stream stats and E2E correctness (including Qwen3-Omni async-chunk E2E metric correctness). (#891, #1292, #1301, #1018)
    • Adds benchmarks for audio speech non-streaming and omni performance benchmark tests. (#1408, #1321)
  • Memory accounting

    • Process-scoped GPU memory accounting and diffusion-side process-level tracking improvements. (#1204, #1276)
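
The async-chunk idea above can be sketched with plain asyncio: while one chunk is being computed, the next chunk's I/O is already in flight, so the pipeline avoids idle waits. This is a toy illustration with simulated sleeps, not vLLM-Omni's actual scheduler.

```python
# Toy sketch of async-chunk overlap (#951): prefetch the next chunk's I/O
# while computing the current chunk. Sleeps stand in for real I/O/compute.
import asyncio

async def load_chunk(i):
    await asyncio.sleep(0.01)  # simulated chunk I/O
    return f"chunk-{i}"

async def compute(chunk):
    await asyncio.sleep(0.01)  # simulated compute
    return chunk.upper()

async def pipeline(n):
    results = []
    next_load = asyncio.create_task(load_chunk(0))
    for i in range(n):
        chunk = await next_load
        if i + 1 < n:
            # kick off the next load before computing the current chunk
            next_load = asyncio.create_task(load_chunk(i + 1))
        results.append(await compute(chunk))
    return results

out = asyncio.run(pipeline(3))
```

With n chunks, total wall time approaches n compute steps plus one load step, instead of n of each.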

Platform, Hardware Backends & Deployment

  • XPU / NPU / ROCm coverage improvements

    • XPU Dockerfile + docs, enable FLASH_ATTN on XPU, fix XPU UT coverage; disable diffusion compile on XPU where needed. (#1162, #1332, #1164, #1148)
    • NPU upgrade to v0.16.0 and recovery fixes for Qwen3-TTS. (#1375, #1564)
    • ROCm CI/docker updates to track vLLM v0.16.0 stable. (#1380, #1500)
  • Deployment & connectivity

    • Stage-based deployment CLI and UDS-based ZMQ address handling for stage serving. (#939, #1522)
    • RDMA connector support for high-performance interconnect scenarios. (#1019)
    • Platform-dependent package installation improvements. (#1046)

CI, Testing, Docs & Developer Experience

  • CI quality + coverage

    • Expanded test stratification design (L2/L3), nightly (L4) test runs, branch coverage fixes, and CI performance tuning. (#1272, #1333, #1120, #1283)
    • Improved CI stability (timeouts, reduced H100 usage, clearer logs). (#1460, #1543, #1463)
  • Docs & tutorials

    • Tutorials on models/pipelines/features, diffusion tutorial refinements, Qwen3-TTS docs consistency, quantization Q&A updates, and installation instructions for vLLM 0.16.0. (#1196, #1305, #1226, #1257, #1505)
    • Improved examples (e.g., image-to-video download steps). (#1258)
  • Tooling

    • Online profiling support and other developer ergonomics improvements. (#1136)

Stability & Bug Fixes (Across the Stack)

This release includes broad correctness and robustness fixes spanning:

  • Diffusion pipelines (dtype/shape, init crashes, model detection, seed and config handling)
  • Image edit / generation endpoints (format validation, RoPE crash, argument typing, seed handling)
  • Distributed execution (process group mapping accuracy, scheduler race conditions, kv transfer correctness)
  • General runtime hygiene (removing unnecessary ZMQ init, CLI naming normalization, upstream-aligned error handling)

What's Changed


v0.16.0rc1

13 Feb 11:25
75770c9

v0.16.0rc1 Pre-release

This pre-release is an alignment with the upstream vLLM v0.16.0.

Highlights

  • Rebase to Upstream vLLM v0.16.0: vLLM-Omni is now fully aligned with the latest vLLM v0.16.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1357).
  • Tensor Parallelism for Bagel & SD 3.5: Added Tensor Parallelism (TP) support for the Bagel model and Stable Diffusion 3.5, improving inference scalability for these diffusion workloads (#1293, #1336).
  • CFG Parallel Expansion: Extended Classifier-Free Guidance (CFG) parallel support to Bagel and FLUX.1-dev models, enabling faster guided generation (#1310, #1269).
  • Async Scheduling for Chunk IO Overlap: Introduced async scheduling to overlap chunk IO and computation across stages, reducing idle time and improving end-to-end throughput (#951).
  • Diffusion Sequence Parallelism Optimization: Removed redundant communication cost by refining the SP hook design, improving diffusion parallelism efficiency (#1275).
  • ComfyUI Integration: Added a full ComfyUI integration (ComfyUI-vLLM-Omni) as an official app, supporting image generation, multimodal comprehension, and TTS workflows via vLLM-Omni's online serving API (multiple files under apps/ComfyUI-vLLM-Omni/). (#1113)
  • Qwen3-Omni Cudagraph by Default: Enabled cudagraph for Qwen3-Omni by default for improved inference performance (#1352).
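
The CFG-parallel work above distributes the conditional and unconditional branches across ranks; only the cheap elementwise combine step needs both outputs together. The combine is the standard classifier-free guidance formula, shown here as a plain-Python sketch rather than the actual vLLM-Omni kernel.

```python
# Standard classifier-free guidance combine step, the piece that remains
# after the two branches run in parallel (#1310, #1269):
#   guided = uncond + s * (cond - uncond)
def cfg_combine(cond, uncond, guidance_scale):
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

guided = cfg_combine(cond=[1.0, 2.0], uncond=[0.0, 1.0], guidance_scale=3.0)
```

Because the branches are independent until this point, they can be batched together (as in the Bagel multi-branch merge) or placed on separate devices.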

What's Changed

Features & Optimizations

Alignment & Integration

  • Unifying CLI Argument Naming Style by @wtomin in #1309
  • fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
  • [Debug] Add trigger to concurrent stage init by @tzhouam in #1274

Bug Fixes

  • [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
  • [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
  • [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
  • [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
  • [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
  • [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
  • Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
  • [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295

Infrastructure (CI/CD) & Documentation

  • [CI] Run nightly tests. by @congw729 in #1333
  • [CI] Add env variable check for nightly CI by @congw729 in #1281
  • [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
  • [CI] Add CI branch coverage calculation, fix statement coverage results by @yenuo26 in #1120
  • [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
  • [Test] Add example test cases for omni online by @yenuo26 in #1086
  • [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
  • [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
  • [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
  • [DOC] Doc for CI test - Details about five level structure and some other files. by @congw729 in #1167
  • [Bugfix] Fix Doc link Error by @lishunyang12 in #1263
  • update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304

Remaining notes

  • nvidia-cublas-cu12 is pinned to 12.9.1.4 via force-reinstall in Dockerfile.ci, pending updates from the vLLM main repo and PyTorch. pytorch/pytorch#174949
  • Qwen2.5-omni with mixed_modalities input uses only the first frame of the video; this originates from the vLLM main repo: vllm-project/vllm#34506

New Contributors

Full Changelog: v0.15.0rc1...v0.16.0rc1

v0.15.0rc1

03 Feb 09:23
d6f93b0

v0.15.0rc1 Pre-release

This pre-release is an alignment with the upstream vLLM v0.15.0.

Highlights

  • Rebase to Upstream vLLM v0.15.0: vLLM-Omni is now fully aligned with the latest vLLM v0.15.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1159).
  • Tensor Parallelism for LongCat-Image: We have added Tensor Parallelism (TP) support for LongCat-Image and LongCat-Image-Edit models, significantly improving the inference speed and scalability of these vision-language models (#926).
  • TeaCache Optimization: Introduced Coefficient Estimation for TeaCache, further refining the efficiency of our caching mechanisms for optimized generation (#940).
  • Alignment & Stability:
    • Enhanced error handling logic to maintain consistency with upstream vLLM v0.14.0/v0.15.0 standards (#1122).
    • Integrated "Bagel" E2E Smoke Tests and refactored sequence parallel tests to ensure robust CI/CD and accurate performance benchmarking (#1074, #1165).
  • Paper link: An initial arXiv paper introducing our design and some performance test results (#1169).

What's Changed

Features & Optimizations

Alignment & Integration

Infrastructure (CI/CD) & Documentation

New Contributors

Full Changelog: v0.14.0...v0.15.0rc1

v0.14.0

31 Jan 07:31
ed89c8b


Highlights

This release features approximately 180 commits from over 70 contributors (23 new contributors).

vLLM-Omni v0.14.0 is a feature-heavy release that expands Omni’s diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability.

Key Improvements:

  • Async chunk ([#727]): chunk pipeline overlap across stages to reduce idle time and improve end-to-end throughput/latency for staged execution.
  • Stage-based deployment for the Bagel model ([#726]): Multi-stage pipeline (Thinker/AR stage + Diffusion/DiT stage), aligning it with the vLLM-Omni architecture.
  • Qwen3-TTS model family support ([#895]): Expands text-to-audio generation and supports online serving.
  • Diffusion LoRA Adapter Support (PEFT-compatible) ([#758]): Adds LoRA fine-tuning/adaptation for diffusion workflows with a PEFT-aligned interface.
  • DiT layerwise (blockwise) CPU offloading ([#858]): Fine-grained offloading to increase memory headroom for larger diffusion runs.
  • Hardware platforms + plugin system ([#774]): Establishes a more extensible platform capability layer for cleaner multi-backend development.
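
The LoRA adapter support above follows the usual low-rank scheme: the frozen weight W is augmented with a low-rank update B @ A, so the forward pass becomes y = W x + scale * B (A x). The pure-Python toy below illustrates that math only; it is not the actual PEFT-compatible adapter interface.

```python
# Minimal LoRA forward sketch (the idea behind #758), pure Python.
def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    base = matvec(W, x)                # frozen base projection W x
    update = matvec(B, matvec(A, x))   # low-rank adapter path B (A x)
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 identity base weight (frozen)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2, trainable)
B = [[0.5], [0.5]]             # rank-1 up-projection (2x1, trainable)
y = lora_forward([2.0, 4.0], W, A, B)
```

Because only A and B are trained, adapters stay small and can be loaded or swapped without touching the base diffusion weights.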

Diffusion & Image/Video Generation

  • Sequence Parallelism (SP) foundations + expansion: Adds a non-intrusive SP abstraction for diffusion models ([#779]), SP support in LongCatImageTransformer ([#721]), and SP support for Wan2.2 diffusion ([#966]).
  • CFG improvements and parallelization: CFG parallel support for Qwen-Image ([#444]), CFG parallel abstraction ([#851]), and online-serving CFG parameter support ([#824]).
  • Acceleration & execution plumbing: Torch compile support for diffusion ([#684]), GPU diffusion runner ([#822]), and diffusion executor ([#865]).
  • Caching and memory efficiency: TeaCache for Z-Image ([#817]) and TeaCache for Bagel ([#848]); plus CPU offloading for diffusion ([#497]) and DiT tensor parallel enablement for diffusion pipeline (Z-Image) ([#735]).
  • Model coverage expansion: Adds GLM-Image support ([#847]), FLUX family additions (e.g., FLUX.1-dev [#853], FLUX.2-klein [#809]) and related TP support ([#973]).
  • Quality/stability fixes for pipelines: Multiple diffusion pipeline correctness fixes (e.g., CFG parsing failure fix [#922], SD3 compatibility fix [#772], video saving bug under certain fps [#893], noisy output without a seed in Qwen Image [#1043]).

Audio & Speech (TTS / Text-to-Audio)

  • Text-to-audio model support: Stable Audio Open support for text-to-audio generation ([#331]).
  • Qwen3-TTS stack maturation: Model series support ([#895]), online serving support ([#968]), plus stabilization fixes such as profile-run hang resolution ([#1082]) and dependency additions for Qwen3-TTS support ([#981]).
  • Interoperability & correctness: Fixes and improvements across audio outputs and model input validation (e.g., StableAudio output standardization [#842], speaker/voices loading from config [#1079]).

Serving, APIs, and Frontend

  • Diffusion-mode service endpoints & compatibility: Adds /health and /v1/models endpoints for diffusion mode and fixes streaming compatibility ([#454]).
  • New/expanded image APIs: /v1/images/edit interface ([#1101]).
  • Online serving usability improvements: Enables tensor_parallel_size argument with online serving command ([#761]) and supports CFG parameters in online serving ([#824]).
  • Batching & request handling: Frontend/model support for batch requests (OmniDiffusionReq refinement) ([#797]).

Performance & Efficiency

  • Qwen3-Omni performance work: SharedFusedMoE integration ([#560]), fused QKV & projection optimizations (e.g., fuse QKV linear and gate_up proj [#734], Talker MTP optimization [#1005]).
  • Attention and kernel/backend tuning: Flash Attention attention-mask support ([#760]), FA3 backend defaults when supported ([#783]), and ROCm performance additions like AITER Flash Attention ([#941]).
  • Memory-aware optimizations: Conditional transformer loading for Wan2.2 to reduce memory usage ([#980]).

Hardware / Backends / CI Coverage

  • Broader backend support: XPU backend support ([#191]) plus the platform/plugin system groundwork ([#774]).
  • NPU & ROCm updates: NPU upgrade alignment ([#820], [#1114]) and ROCm CI expansion / optimization ([#542], [#885], [#1039]).
  • Test reliability / coverage: CI split to avoid timeouts ([#883]) and additional end-to-end / precision tests (e.g., chunk e2e tests [#956]).

Reliability, Correctness, and Developer Experience

  • Stability fixes across staged execution and serving: Fixes for stage config loading issues ([#860]), stage output mismatch in online batching ([#691]), and server readiness wait-time increase for slow model loads ([#1089]).
  • Profiling & benchmarking improvements: Diffusion profiler support ([#709]) plus benchmark additions (e.g., online benchmark [#780]).
  • Documentation refresh: Multiple diffusion docs refactors and new guides (e.g., profiling guide [#738], torch profiler guide [#570], diffusion docs refactor [#753], ROCm instructions updates [#678], [#905]).

What's Changed

  • [Docs] Fix diffusion module design doc by @SamitHuang in #645
  • [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
  • [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
  • Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
  • [Misc] Merge diffusion forward context by @iwzbi in #582
  • [Doc] User guide for torch profiler by @lishunyang12 in #570
  • [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
  • [BugFix] token2wav code out of range by @Bounty-hunter in #655
  • [Doc] Update version 0.12.0 by @ywang96 in #662
  • [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
  • [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
  • [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
  • [BUGFIX] Delete the CUDA context in the stage process. by @fake0fan in #661
  • [Misc] Fix docs display problem of streaming mode and other related issues by @Gaohan123 in #667
  • [Model] Add Stable Audio Open support for text-to-audio generation by @linyueqian in #331
  • [Doc] Update ROCm getting started instruction by @tjtanaa in #678
  • [Bugfix] Fix f-string formatting in image generation pipelines by @ApsarasX in #689
  • [Bugfix] Solve Ulysses-SP sequence length not divisible by SP degree (using padding and attention mask) by @wtomin in #672
  • omni entrypoint support tokenizer arg by @divyanshsinghvi in #572
  • [Bug fix] fix e2e_total_tokens and e2e_total_time_ms by @LJH-LBJ in #648
  • [BugFix] Explicitly release file locks during stage worker init by @yuanheng-zhao in #703
  • [BugFix] Fix stage engine outputs mismatch bug in online batching by @ZeldaHuang in #691
  • [core] add torch compile for diffusion by @ZJY0516 in #684
  • [BugFix] Remove duplicate width assignment in SD3 pipeline by @dongbo910220 in #708
  • [Feature] Support Qwen3 Omni talker cudagraph by @ZeldaHuang in #669
  • [Benchmark] DiT Model Benchmark under Mixed Workloads by @asukaqaq-s in #529
  • update design doc by @hsliuustc0106 in #711
  • [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni by @gcanlin in #560
  • [Doc]: update vllm serve param and base64 data truncation by @nuclearwu in #718
  • [BugFix] Fix assuming all stage model have talker by @princepride in #730
  • [Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj by @gcanlin in #734
  • [Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) by @dongbo910220 in #735
  • [Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker by @LJH-LBJ in #697
  • [ROCm] [CI] Add More Tests by @tjtanaa in #542
  • [Docs] update design doc templated in RFC by @hsliuustc0106 in #746
  • Add description of code version for bug report by @yenuo26 in #745
  • [misc] fix rfc template by @hsliuustc0106 in https://github.com...

v0.14.0rc1

22 Jan 15:22
a9012a1

v0.14.0rc1 Pre-release

Highlights (vllm-omni v0.14.0rc1)

This release candidate includes approximately 90 commits from 35 contributors (12 new contributors).

This release candidate focuses on diffusion runtime maturity, Qwen-Omni performance, and expanded multimodal model support, alongside substantial improvements to serving ergonomics, profiling, ROCm/NPU enablement, and CI/docs quality. In addition, this is the first vllm-omni rc version with Day-0 alignment with vLLM upstream.

Model Support

  • TTS: Added support for the Qwen3-TTS (Day-0) model series. (#895)
  • Diffusion / image families: Added Flux.2-klein (Day-0) and GLM-Image (Day-0), plus multiple Qwen-Image family correctness/perf improvements. (#809, #868, #847)
  • Bagel ecosystem: Added Bagel model support and Cache-DiT support. (#726, #736)
  • Text-to-audio: Added Stable Audio Open support for text-to-audio generation. (#331)

Key Improvements

  • Qwen-Omni performance & serving enhancements

    • Improved Qwen3-Omni throughput with vLLM SharedFusedMoE, plus additional kernel/graph optimizations:

      • SharedFusedMoE integration (#560)
      • QKV linear + gate_up projection fusion (#734)
      • Talker cudagraph support and MTP batch inference for Qwen3-Omni talker (#669, #722)
      • Optimized thinker-to-talker projection path (#825)
    • Improved online serving configurability:

      • omni entrypoint tokenizer argument support (#572)
      • Enable tensor_parallel_size for online serving command (#761)
      • Grouped omni arguments into OmniConfig for cleaner UX (#744)
  • Diffusion runtime & acceleration upgrades

    • Added sleep / wake_up / load_weights lifecycle controls for Omni Diffusion, improving operational flexibility for long-running services. (#376)
    • Introduced torch.compile support for diffusion to improve execution efficiency on supported setups. (#684)
    • Added a GPU Diffusion Runner and Diffusion executor, strengthening the core execution stack for diffusion workloads. (#822, #865)
    • Enabled TeaCache acceleration for Z-Image diffusion pipelines. (#817)
    • Defaulted to FA3 (FlashAttention v3) when supported, and extended FlashAttention to support attention masks. (#783, #760)
    • Added CPU offloading support for diffusion to broaden deployment options under memory pressure. (#497)
  • Parallelism and scaling for diffusion pipelines

    • Added CFG parallel support for Qwen-Image and introduced CFG parameter support in online serving. (#444, #824)
    • Enabled DiT tensor parallel for Z-Image diffusion pipeline and extended TP support for qwen-image with test refactors. (#735, #830)
    • Implemented Sequence Parallelism (SP) abstractions for diffusion, including SP support in LongCatImageTransformer. (#779, #721)
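
The sequence-parallel split above can be sketched in a few lines: the sequence is padded so its length divides the SP degree (the case fixed for Ulysses-SP in #672), then each rank takes one contiguous shard. This is toy code illustrating the sharding rule, not the actual SP hook implementation.

```python
# Sketch of sequence-parallel sharding with padding (idea behind #779/#672).
def sp_shard(tokens, sp_degree, pad_token=0):
    remainder = len(tokens) % sp_degree
    if remainder:
        # pad so the sequence length divides the SP degree
        tokens = tokens + [pad_token] * (sp_degree - remainder)
    shard_len = len(tokens) // sp_degree
    return [tokens[i * shard_len:(i + 1) * shard_len]
            for i in range(sp_degree)]

shards = sp_shard([1, 2, 3, 4, 5], sp_degree=2)
```

In the real pipeline an attention mask accompanies the padding so the pad positions do not affect the result.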

Stability, Tooling, and Platform

  • Correctness & robustness fixes across diffusion and staged execution:

    • Fixed diffusion model load failure when stage config is present (#860)
    • Fixed stage engine outputs mismatch under online batching (#691)
    • Fixed CUDA-context lifecycle issues and file-lock handling in stage workers (#661, #703)
    • Multiple model/pipeline fixes (e.g., SD3 compatibility, Wan2.2 warmup/scheduler, Qwen2.5-Omni stop behavior). (#772, #791, #804, #773)
  • Profiling & developer experience

    • Added Diffusion Profiler support, plus user guides for diffusion profiling and torch profiler usage. (#709, #738, #570)
  • ROCm / NPU / CI

    • Enhanced ROCm CI coverage, optimized ROCm Dockerfile build time, and refreshed ROCm getting-started documentation. (#542, #885, #678)
    • CI reliability improvements (pytest markers, split tests to avoid timeouts). (#719, #883)

Note: The NPU AR functionality is currently unavailable and will be supported in the official v0.14.0 release.

What's Changed

  • [Docs] Fix diffusion module design doc by @SamitHuang in #645
  • [Docs] Remove multi-request streaming design document and update ray-based execution documentation structure by @tzhouam in #641
  • [Bugfix] Fix TI2V-5B weight loading by loading transformer config from model by @linyueqian in #633
  • Support sleep, wake_up and load_weights for Omni Diffusion by @knlnguyen1802 in #376
  • [Misc] Merge diffusion forward context by @iwzbi in #582
  • [Doc] User guide for torch profiler by @lishunyang12 in #570
  • [Docs][NPU] Upgrade to v0.12.0 by @gcanlin in #656
  • [BugFix] token2wav code out of range by @Bounty-hunter in #655
  • [Doc] Update version 0.12.0 by @ywang96 in #662
  • [Docs] Update diffusion_acceleration.md by @SamitHuang in #659
  • [Docs] Guide for using sleep mode and enable sleep mode by @knlnguyen1802 in #660
  • [Diffusion][Feature] CFG parallel support for Qwen-Image by @wtomin in #444
  • [BUGFIX] Delete the CUDA context in the sta...

v0.12.0rc1

05 Jan 11:17
e7eeb54

v0.12.0rc1 Pre-release

vLLM-Omni v0.12.0rc1 Pre-Release Notes

Highlights

This release features 187 commits from 45 contributors (34 new contributors)!

vLLM-Omni v0.12.0rc1 is a major RC milestone focused on maturing the diffusion stack, strengthening OpenAI-compatible serving, expanding omni-model coverage, and improving stability across platforms (GPU/NPU/ROCm). It also rebases on vLLM v0.12.0 for better alignment with upstream (#335).

Breaking / Notable Changes

  • Unified diffusion stage naming & structure: cleaned up legacy Diffusion* paths and aligned on Generation*-style stages to reduce duplication (#211, #163).
  • Safer serialization: switched OmniSerializer from pickle to MsgPack (#310).
  • Dependency & packaging updates: e.g., bumped diffusers to 0.36.0 (#313) and refreshed Python/formatting baselines for the v0.12 release (#126).

Diffusion Engine: Architecture + Performance Upgrades

  • Core refactors for extensibility: diffusion model registry refactored to reuse vLLM’s ModelRegistry (#200), improved diffusion weight loading and stage abstraction (#157, #391).

  • Acceleration & parallelism features:

    • Cache-DiT with a unified cache backend interface (#250)
    • TeaCache integration and registry refactors (#179, #304, #416)
    • New/extended attention & parallelism options: Sage Attention (#243), Ulysses Sequence Parallelism (#189), Ring Attention (#273)
    • torch.compile optimizations for DiT and RoPE kernels (#317)

Serving: Stronger OpenAI Compatibility & Online Readiness

  • DALL·E-compatible image generation endpoint (/v1/images/generations) (#292), plus online serving fixes for image generation (#499).
  • Added OpenAI create speech endpoint (#305).
  • Per-request modality control (output modality selection) (#298) with API usage examples (#411).
  • Early support for streaming output (#367), request abort (#486), and request-id propagation in responses (#301).
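
Because the image endpoint is DALL·E-compatible, a request body can follow the familiar OpenAI images schema. The sketch below builds the payload only; the model id is a placeholder and the exact set of supported fields is an assumption based on the OpenAI API shape, not a verified vLLM-Omni contract.

```python
# Sketch of a DALL·E-style request body for /v1/images/generations (#292).
# Model id and response_format are illustrative placeholders.
import json

def images_request_body(prompt, n=1, size="1024x1024"):
    return json.dumps({
        "model": "Qwen/Qwen-Image",      # placeholder model id
        "prompt": prompt,
        "n": n,                          # number of images
        "size": size,                    # "WIDTHxHEIGHT"
        "response_format": "b64_json",   # inline base64 image payloads
    })

body = images_request_body("an astronaut riding a horse")
```

POSTing this body to a running vLLM-Omni server's /v1/images/generations endpoint should work with unmodified OpenAI-style clients.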

Omni Pipeline: Multi-stage Orchestration & Observability

  • Improved inter-stage plumbing: customizable processing between stages and reduced coupling on request_ids in model forward paths (#458).
  • Better observability and debugging: torch profiler across omni stages (#553), improved traceback reporting from background workers (#385), and logging refactors (#466).

Expanded Model Support (Selected)

  • Qwen-Omni / Qwen-Image family:

    • Qwen-Omni offline inference with local files (#167)
    • Qwen-Image-2512 support (#547)
    • Qwen-Image-Edit support, including multi-image input variants and newer releases (Qwen-Image-Edit, Qwen-Image-Edit-2509, Qwen-Image-Edit-2511) (#196, #330, #321)
    • Qwen-Image-Layered model support (#381)
    • Multiple fixes for Qwen2.5/Qwen3-Omni batching, examples, and OpenAI sampling parameter compatibility (#451, #450, #249)
  • Diffusion / video ecosystem:

    • Z-Image support and kernel fusions (#149, #226)
    • Stable Diffusion 3 support (#439)
    • Wan2.2 T2V plus I2V/TI2V pipelines (#202, #329)
    • LongCat-Image and LongCat-Image-Edit support (#291, #392)
    • Ovis Image model addition (#263)
    • Bagel (diffusion-only) and image-edit support (#319, #588)

Platform & CI Coverage

  • ROCm / AMD: documented ROCm setup (#144) and added ROCm Dockerfile + AMD CI (#280).
  • NPU: added NPU CI workflow (#231) and expanded NPU support for key Omni models (e.g., Qwen3-Omni, Qwen-Image series) (#484, #463, #485), with ongoing cleanup of NPU-specific paths (#597).
  • CI and packaging improvements: diffusion CI, wheel compilation, and broader UT/E2E coverage (#174, #288, #216, #168).

What's Changed
