Motivation.
This RFC tracks the status of existing diffusion-model support features, as well as other features planned for the future (under discussion).
Some items overlap with the tasks in [RFC]: vLLM-Omni 2026 Q1 Roadmap, which are more urgent.
An ongoing diffusion-model acceleration support plan is tracked in #1217. Help wanted! 🙋
Proposed Change.
P0: 🙋
- Single-card acceleration feature
- graph compilation
- torch.compile on repeated_blocks : [core] add torch compile for diffusion #684
- advanced attention
- sage attention: [Diffusion][Attention] sage attention backend #243
- Flash Attention 2 and 3: [diffusion] use fa3 by default when device supports it #783 ; [Feature] Flash Attention to Support Attention Mask #760
- quantization:
- diffusion distillation
- cache acceleration
- Cache-DiT: [Diffusion] Add cache-dit and unify diffusion cache backend interface #250 , [Feat] Enable cache-dit for stable diffusion3.5 #584
- TeaCache Refactor: [RFC]: TeaCache Refactoring and Bagel Support #833 [Diffusion][Acceleration] Support TeaCache for Z-Image #817 [Bagel] Support TeaCache #848 [TeaCache]: Add Coefficient Estimation #940
- Support CPU Offloading
- Standard module-wise offload (text encoder / DiT / VAE) [feature] cpu offloading support for diffusion #497
- Layerwise offload (LayerwiseOffloadManager) @yuanheng-zhao [RFC]: Layerwise CPU Offloading Support #754 [Feature] Support DiT Layerwise (Blockwise) CPU Offloading #858
- Scheduling & Serving
- Implement a Batch Scheduler. [Frontend][Model] Support batch request with refined OmniDiffusionReq… #797 [RFC]: Support batch request of diffusion models #427 [RFC]: Data class design & refactor—batched multimodal diffusion request #701 @fhfuih @asukaqaq-s
- ❗important❗Support ComfyUI web serving integration. [ComfyUI]: ComfyUI integration #1113 [RFC]: ComfyUI Integration Design #900 @fhfuih
- Static and Dynamic LoRA support. Add diffusion LoRA request path and worker cache #657 [Feature] Diffusion LoRA Adapter Support (PEFT compatible) for vLLM alignment #758
- avoid cpu sync
- [Perf] avoid cpu op in QwenImageCrossAttention #942
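The layerwise (blockwise) CPU offload item above (#754, #858) boils down to a sliding window over DiT blocks: only a few blocks are resident on the device at a time, and block i+1 is prefetched while block i runs. A minimal pure-Python sketch of that residency policy (the class name, window policy, and toy compute are illustrative assumptions, not the actual implementation, which overlaps H2D copies with compute on a side stream):

```python
class Layer:
    """Toy stand-in for a DiT block; tracks where its weights live."""
    def __init__(self, idx):
        self.idx = idx
        self.on_device = False

    def forward(self, x):
        assert self.on_device, f"layer {self.idx} must be resident before compute"
        return x + self.idx  # placeholder compute


class LayerwiseOffloadManager:
    """Keep at most `window` layers resident; prefetch the next one eagerly.

    Hypothetical sketch of the #754/#858 idea: model weight residency only,
    no real CUDA streams or async copies.
    """
    def __init__(self, layers, window=2):
        self.layers = layers
        self.window = window
        self.resident = []  # FIFO of on-device layers

    def _fetch(self, layer):
        if layer.on_device:
            return
        while len(self.resident) >= self.window:
            evicted = self.resident.pop(0)
            evicted.on_device = False  # "weights go back to CPU"
        layer.on_device = True  # "weights copied to GPU"
        self.resident.append(layer)

    def run(self, x):
        for i, layer in enumerate(self.layers):
            self._fetch(layer)
            if i + 1 < len(self.layers):
                self._fetch(self.layers[i + 1])  # prefetch the next block
            x = layer.forward(x)
        return x


layers = [Layer(i) for i in range(6)]
mgr = LayerwiseOffloadManager(layers, window=2)
out = mgr.run(0)
print(out)                               # 0+1+2+3+4+5 = 15
print(sum(l.on_device for l in layers))  # only `window` layers stay resident
```

Peak device memory is bounded by `window` blocks instead of the full stack, at the cost of the copy traffic the real design hides behind compute.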
- Multi-card acceleration feature
- CFG parallel:
- QwenImage: [Diffusion][Feature] CFG parallel support for Qwen-Image #444
- CFG Parallel refactor: allow minimally intrusive edits to the existing single-card pipeline: [RFC]: CFG Parallelism Abstraction #850, [Perf]: CFG parallel abstraction #851
- Sequence Parallel:
- QwenImage series models support Ulysses SP and Ring Attention: [Diffusion]: Diffusion Ulysses-Sequence-Parallelism support #189; [Diffusion]: Diffusion Ring Attention support #273;
- LongCatImage model supports Ulysses SP and Ring Attention: [Diffusion][Feature] Implement SP support in LongCatImageTransformer #721;
- stable diffusion3.5 supports Ulysses SP and Ring Attention: [Feat] Enable sequence parallel for stable diffusion3.5 #654
- [Diffusion] Non-Intrusive Sequence Parallelism (SP) Model Support Abstraction for vLLM-Omni Framework #779
- Patch VAE Parallel:
- Patch VAE tiling in distributed ranks (lossy) : [Feat]: support VAE patch parallelism #756
- ❗important❗A unified interface to support Patch VAE Parallelism methods for multiple models
- Tensor Parallel:
- Z-Image [Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) #735
- Qwen-Image [diffusion] add tp support for qwen-image and refactor some tests #830
- HunyuanImage 3.0: [New model] Support HY-Image3.0 DiT #794
- LongCat-Image [Feature] add Tensor Parallelism to LongCat-Image(-Edit) #926
- Ovis-Image
- Stable-Diffusion-3 @ZANMANGLOOPYE
- Expert Parallelism (EP)
- @Semmer2 (waiting for the Hunyuan Image model to be merged first)
- Compile and Parallel [Bug]: The speed of compile mode does not perform well during parallel inference #819
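The CFG parallel items above exploit the fact that the conditional and unconditional denoising passes are independent: two ranks can run them concurrently, and only the final guidance combine needs the peer's prediction. A toy single-process sketch of the split and combine (the rank assignment and the toy `denoise` are illustrative assumptions; a real implementation exchanges predictions with a collective):

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: eps = uncond + g * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def denoise(latent, conditioned):
    """Toy stand-in for one DiT forward pass."""
    return [(x * 2 if conditioned else x) for x in latent]


latent = [0.5, -1.0, 2.0]

# Single-GPU baseline: one (batched) forward covers both branches.
baseline = cfg_combine(denoise(latent, False), denoise(latent, True), 4.0)

# CFG parallel: rank 0 runs the unconditional branch, rank 1 the conditional
# branch; the exchange of the two predictions (an all-gather in practice)
# is simulated with a dict here.
rank_outputs = {0: denoise(latent, False), 1: denoise(latent, True)}
parallel = cfg_combine(rank_outputs[0], rank_outputs[1], 4.0)

assert parallel == baseline
print(parallel)
```

Because each rank's forward pass is unchanged, this is what makes the minimally intrusive abstraction in #850/#851 feasible: only the combine step needs to know about the second rank.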
- Model E2E performance acceleration:
- Z-Image family
- Qwen-Image family @GG-li
- Stable-Diffusion family @ZANMANGLOOPYE
- Flux family
- WAN family
- Online serving:
- i2i /v1/images/edit [RFC]: OpenAI Image Edit API Interface for ComfyUI #510 [FEATURE] /v1/images/edit interface #1101
- ❗important❗t2v, i2v [Bug]: Diffusion chat completion failed: 'numpy.ndarray' object has no attribute 'save' #793 [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API #1073
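The /v1/images/edit item above follows the OpenAI-style images API: the client posts a multipart form with the source image, a prompt, and generation parameters. A sketch of assembling such a request payload (field names mirror the OpenAI Image Edit API; whether vLLM-Omni accepts every field, and the model name used here, are assumptions of this sketch):

```python
def build_image_edit_request(image_path, prompt, model, size="1024x1024", n=1):
    """Assemble multipart fields for a POST to /v1/images/edit.

    Returns (data, files) in the shape `requests.post(url, data=..., files=...)`
    expects; the image bytes are a placeholder here.
    """
    data = {
        "prompt": prompt,
        "model": model,
        "n": str(n),
        "size": size,
        "response_format": "b64_json",
    }
    files = {"image": (image_path, b"<image bytes>", "image/png")}
    return data, files


data, files = build_image_edit_request(
    "cat.png", "make the cat wear a red hat", model="Qwen/Qwen-Image-Edit"
)
print(data["prompt"])
print(sorted(data))
```

Keeping the field names OpenAI-compatible means existing OpenAI client SDKs can target the vLLM-Omni endpoint unchanged, which is the point of #510/#1101.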
P1: 🙋
- Single-card acceleration
- graph compilation
- torch.compile key arguments optimization, e.g., fullgraph=True
- advanced attention
- torch_sdpa advanced attention backends, such as torch.backends.cuda.enable_mem_efficient_sdp().
- Video Sparse Attention: from FastVideo
- SageAttention with quantization support, e.g., sageattn_qk_int8_pv_fp16_cuda
- SpargeAttn Sparse Attention Backend: [RFC]: Add SpargeAttn Sparse Attention Backend #765
- diffusion distillation
- cache acceleration
- TeaCache supports LongCat-Image and LongCat-Image-Edit
- TeaCache supports Z-Image: [Diffusion][Acceleration] Support TeaCache for Z-Image #817
- TeaCache supports Stable-Diffusion3.5
- TeaCache supports Wan2.2
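The TeaCache items above all rest on the same skip decision: accumulate the relative change of the (timestep-modulated) transformer input across steps, and reuse the cached residual until the accumulated change crosses a threshold; the coefficient-estimation work in #940 fits the rescaling of that distance. A toy sketch of the schedule (the threshold, the scalar inputs, and the plain L1 distance are illustrative assumptions; real TeaCache rescales the distance with fitted polynomial coefficients):

```python
def teacache_schedule(inputs, threshold):
    """Decide per-step whether to recompute the DiT or reuse the cached residual.

    TeaCache-style policy sketch: accumulate relative input change across
    skipped steps and force a recompute once it exceeds `threshold`.
    """
    decisions, acc, prev = [], 0.0, None
    for x in inputs:
        if prev is None:
            decisions.append("compute")  # first step always computes
        else:
            acc += abs(x - prev) / max(abs(prev), 1e-8)  # relative change
            if acc >= threshold:
                decisions.append("compute")
                acc = 0.0  # reset after a fresh forward pass
            else:
                decisions.append("reuse")  # reuse cached residual
        prev = x
    return decisions


steps = [1.0, 1.01, 1.02, 1.5, 1.51, 2.0]
print(teacache_schedule(steps, threshold=0.1))
```

Small consecutive changes are skipped even when they add up slowly, while any large jump (here 1.02 → 1.5) immediately triggers a recompute, which is what keeps the quality loss bounded.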
- Multi-card acceleration
- CFG parallel:
- LongCatImage, Ovis-Image, Stable-Diffusion3.5, Wan2.2 [Perf]: CFG parallel abstraction #851
- Z-Image
- Sequence Parallel
- Ovis-Image
- Z-Image
- Wan2.2 [Diffusion][Feature] Non-Intrusive Sequence Parallelism (SP) Support for Wan2.2 #966
- Patch VAE Parallel
- Wan2.2
- Pipeline Parallelism
- PipeFusion RFC: [RFC]: PipeFusion Implementation in vLLM-Omni #647
- Data Parallelism
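The Ulysses SP items in both priority tiers share one layout trick: outside attention each rank holds a contiguous sequence shard for all heads, and an all-to-all before attention re-shards by head, so each rank sees the full sequence for its subset of heads. A pure-Python layout sketch (no torch.distributed; the all-to-all is simulated with a list transpose, and the 2-rank / 4-head sizes are illustrative):

```python
def all_to_all(per_rank_chunks):
    """Simulated all-to-all: rank r sends chunk c to rank c, receives chunk r.

    per_rank_chunks[r][c] is the piece rank r holds destined for rank c.
    """
    world = len(per_rank_chunks)
    return [[per_rank_chunks[src][dst] for src in range(world)]
            for dst in range(world)]


world_size = 2
seq = [f"t{i}" for i in range(8)]  # full token sequence, 4 heads total

# Outside attention: each rank holds a contiguous sequence shard, all heads.
seq_shards = [seq[r * 4:(r + 1) * 4] for r in range(world_size)]

# Before attention: split each rank's shard into head groups and all-to-all,
# so each rank ends up with ALL tokens for its half of the heads.
send = [[(f"h{g * 2}-{g * 2 + 1}", shard) for g in range(world_size)]
        for shard in seq_shards]
recv = all_to_all(send)

# Rank 0 now owns heads 0-1 over the full sequence:
rank0_heads = {head for head, _ in recv[0]}
rank0_tokens = [t for _, chunk in recv[0] for t in chunk]
print(rank0_heads, rank0_tokens)
```

Because attention is exact over the full sequence per head, Ulysses SP is lossless; a mirror all-to-all after attention restores the sequence sharding for the MLP, which is what the non-intrusive abstraction in #779 wraps around existing models.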
Feedback Period.
No response
CC List.
@hsliuustc0106 @ZJY0516 @SamitHuang @david6666666 @mxuax @lishunyang12 @xiaolin8 @gcanlin @dongbo910220
Any Other Things.
No response