pip3 install torch==2.9.1 transformers accelerate torchao==0.14.1 bitsandbytes torchvision
pip3 install opencv-python-headless einops imageio-ffmpeg ftfy
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latestpython3 -m cache_dit.generate list # list all available examples
[generate.py:47] Available examples:
[generate.py:53] - ✅ flux_nunchaku - Defalut: nunchaku-tech/nunchaku-flux.1-dev
[generate.py:53] - ✅ flux - Defalut: black-forest-labs/FLUX.1-dev
[generate.py:53] - ✅ flux_fill - Defalut: black-forest-labs/FLUX.1-Fill-dev
[generate.py:53] - ✅ flux2 - Defalut: black-forest-labs/FLUX.2-dev
[generate.py:53] - ✅ flux2_klein_base_9b - Defalut: black-forest-labs/FLUX.2-klein-base-9B
[generate.py:53] - ✅ flux2_klein_base_4b - Defalut: black-forest-labs/FLUX.2-klein-base-4B
[generate.py:53] - ✅ flux2_klein_9b - Defalut: black-forest-labs/FLUX.2-klein-9B
[generate.py:53] - ✅ flux2_klein_4b - Defalut: black-forest-labs/FLUX.2-klein-4B
[generate.py:53] - ✅ qwen_image_lightning - Defalut: lightx2v/Qwen-Image-Lightning
[generate.py:53] - ✅ qwen_image_2512 - Defalut: Qwen/Qwen-Image-2512
[generate.py:53] - ✅ qwen_image - Defalut: Qwen/Qwen-Image
[generate.py:53] - ✅ qwen_image_edit_2511_lightning - Defalut: lightx2v/Qwen-Image-Edit-2511-Lightning
[generate.py:53] - ✅ qwen_image_edit_2511 - Defalut: Qwen/Qwen-Image-Edit-2511
[generate.py:53] - ✅ qwen_image_edit_lightning - Defalut: lightx2v/Qwen-Image-Lightning
[generate.py:53] - ✅ qwen_image_edit - Defalut: Qwen/Qwen-Image-Edit-2509
[generate.py:53] - ✅ qwen_image_controlnet - Defalut: InstantX/Qwen-Image-ControlNet-Inpainting
[generate.py:53] - ✅ qwen_image_layered - Defalut: Qwen/Qwen-Image-Layered
[generate.py:53] - ✅ ltx2_t2v - Defalut: Lightricks/LTX-2
[generate.py:53] - ✅ ltx2_i2v - Defalut: Lightricks/LTX-2
[generate.py:53] - ✅ skyreels_v2 - Defalut: Skywork/SkyReels-V2-T2V-14B-720P-Diffusers
[generate.py:53] - ✅ wan2.2_t2v - Defalut: Wan-AI/Wan2.2-T2V-A14B-Diffusers
[generate.py:53] - ✅ wan2.1_t2v - Defalut: Wan-AI/Wan2.1-T2V-1.3B-Diffusers
[generate.py:53] - ✅ wan2.2_i2v - Defalut: Wan-AI/Wan2.2-I2V-A14B-Diffusers
[generate.py:53] - ✅ wan2.1_i2v - Defalut: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
[generate.py:53] - ✅ wan2.2_vace - Defalut: linoyts/Wan2.2-VACE-Fun-14B-diffusers
[generate.py:53] - ✅ wan2.1_vace - Defalut: Wan-AI/Wan2.1-VACE-1.3B-diffusers
[generate.py:53] - ✅ ovis_image - Defalut: AIDC-AI/Ovis-Image-7B
[generate.py:53] - ✅ zimage_nunchaku - Defalut: nunchaku/nunchaku-z-image-turbo
[generate.py:53] - ✅ zimage - Defalut: Tongyi-MAI/Z-Image-Turbo
[generate.py:53] - ✅ zimage_controlnet_2.0 - Defalut: alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0
[generate.py:53] - ✅ zimage_controlnet_2.1 - Defalut: alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.1
[generate.py:53] - ✅ longcat_image - Defalut: meituan-longcat/LongCat-Image
[generate.py:53] - ✅ longcat_image_edit - Defalut: meituan-longcat/LongCat-Image-EditThe easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single GPU inference. For examples:
# baseline
# use default model path, e.g, "black-forest-labs/FLUX.1-dev"
python3 -m cache_dit.generate flux
python3 -m cache_dit.generate flux_nunchaku # need nunchaku library
python3 -m cache_dit.generate flux2
python3 -m cache_dit.generate ovis_image
python3 -m cache_dit.generate qwen_image_edit_lightning
python3 -m cache_dit.generate qwen_image
python3 -m cache_dit.generate ltx2_t2v --cache --cpu-offload
python3 -m cache_dit.generate ltx2_i2v --cache --cpu-offload
python3 -m cache_dit.generate skyreels_v2
python3 -m cache_dit.generate wan2.2
python3 -m cache_dit.generate zimage
python3 -m cache_dit.generate zimage_nunchaku
python3 -m cache_dit.generate zimage_controlnet_2.1
python3 -m cache_dit.generate generate longcat_image
python3 -m cache_dit.generate generate longcat_image_edit
# w/ cache acceleration
python3 -m cache_dit.generate flux --cache
python3 -m cache_dit.generate flux --cache --taylorseer
python3 -m cache_dit.generate flux_nunchaku --cache
python3 -m cache_dit.generate qwen_image --cache
python3 -m cache_dit.generate zimage --cache --rdt 0.6 --scm fast
python3 -m cache_dit.generate zimage_controlnet_2.1 --cache --rdt 0.6 --scm fast
# enable cpu offload or vae tiling if your encounter an OOM error
python3 -m cache_dit.generate qwen_image --cache --cpu-offload
python3 -m cache_dit.generate qwen_image --cache --cpu-offload --vae-tiling
python3 -m cache_dit.generate qwen_image_edit_lightning --cpu-offload --steps 4
python3 -m cache_dit.generate qwen_image_edit_lightning --cpu-offload --steps 8
# or, enable sequential cpu offload for extremly low VRAM device
python3 -m cache_dit.generate flux2 --sequential-cpu-offload # FLUX2 56B total
# use `--summary` option to show the cache acceleration stats
python3 -m cache_dit.generate zimage --cache --rdt 0.6 --scm fast --summaryThe default model path are the official model names on HuggingFace Hub. Users can set custom local model path by settig --model-path. For examples:
python3 -m cache_dit.generate flux --model-path /PATH/TO/FLUX.1-dev
python3 -m cache_dit.generate zimage --model-path /PATH/TO/Z-Image-Turbo
python3 -m cache_dit.generate qwem_image --model-path /PATH/TO/Qwen-Imagecache-dit is designed to work seamlessly with CPU or Sequential Offloading, 🔥Context Parallelism, 🔥Tensor Parallelism. For examples:
# context parallelism or tensor parallelism
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ring
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel usp # USP: Ulysses + Ring
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel tp
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses
torchrun --nproc_per_node=4 -m cache_dit.generate zimage_controlnet_2.1 --parallel ulysses
# ulysses anything attention
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --ulysses-anything
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image_edit_lightning --parallel ulysses --ulysses-anything
# text encoder parallelism, enable it by add: `--parallel-text-encoder`
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel tp --parallel-text-encoder
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image_edit_lightning --parallel ulysses --ulysses-anything --parallel-text-encoder
# Hint: set `--local-ranks-filter=0` to torchrun -> only show logs on rank 0
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate flux --parallel ulysses
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_t2v --parallel ulysses --parallel-vae --parallel-text-encoder --cache --ulysses-anything
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_t2v --parallel tp --parallel-vae --parallel-text-encoder --cache
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_i2v --parallel ulysses --parallel-vae --parallel-text-encoder --cache --ulysses-anything
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_i2v --parallel tp --parallel-vae --parallel-text-encoder --cachecache-dit is designed to work seamlessly with torch.compile, Quantization (🔥torchao, 🔥nunchaku), For examples:
# please also enable torch.compile if the quantation is using.
python3 -m cache_dit.generate flux --cache --quantize-type float8 --compile
python3 -m cache_dit.generate flux --cache --quantize-type int8 --compile
python3 -m cache_dit.generate flux --cache --quantize-type float8_weight_only --compile
python3 -m cache_dit.generate flux --cache --quantize-type int8_weight_only --compile
python3 -m cache_dit.generate flux --cache --quantize-type bnb_4bit --compile # w4a16
python3 -m cache_dit.generate flux_nunchaku --cache --compile # w4a4 SVDQHere are some examples for hybrid cache acceleration + parallelism for popular DiTs with cache-dit.
# DBCache + SCM + Taylorseer
python3 -m cache_dit.generate flux --cache --scm fast --taylorsees --taylorseer-order 1
# DBCache + SCM + Taylorseer + Context Parallelism + Text Encoder Parallelism + Compile
# + FP8 quantization + FP8 All2All comm + CUDNN Attention (--attn _sdpa_cudnn)
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --ulysses-float8 \
--attn _sdpa_cudnn --parallel-text-encoder --cache --scm fast --taylorseer \
--taylorseer-order 1 --quantize-type float8 --warmup 2 --repeat 5 --compile
# DBCache + SCM + Taylorseer + Context Parallelism + Text Encoder Parallelism + Compile
# + FP8 quantization + FP8 All2All comm + FP8 SageAttention (--attn sage)
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --ulysses-float8 \
--attn sage --parallel-text-encoder --cache --scm fast --taylorseer \
--taylorseer-order 1 --quantize-type float8 --warmup 2 --repeat 5 --compile
# Case: Hybrid Acceleration for Qwen-Image-Edit-Lightning, tracking memory usage.
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate qwen_image_edit_lightning \
--parallel ulysses --ulysses-anything --parallel-text-encoder \
--quantize-type float8_weight_only --steps 4 --track-memory --compile
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate qwen_image_edit_lightning \
--parallel tp --parallel-text-encoder --quantize-type float8_weight_only \
--steps 4 --track-memory --compile
# Case: Hybrid Acceleration + Context Parallelism + ControlNet Parallelism, e.g, Z-Image-ControlNet
torchrun --nproc_per_node=4 -m cache_dit.generate zimage_controlnet_2.1 --parallel ulysses \
--parallel-controlnet --cache --rdt 0.6 --scm fast
torchrun --nproc_per_node=4 -m cache_dit.generate zimage_controlnet_2.1 --parallel ulysses \
--parallel-controlnet --cache --scm fast --rdt 0.6 --compile \
--compile-controlnet --ulysses-float8 --attn _sdpa_cudnn \
--warmup 2 --repeat 4 # NO Cache Acceleration: 8.27s
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate flux --parallel ulysses
INFO 12-17 09:02:31 [base.py:151] Example Input Summary:
INFO 12-17 09:02:31 [base.py:151] - prompt: A cat holding a sign that says hello world
INFO 12-17 09:02:31 [base.py:151] - height: 1024
INFO 12-17 09:02:31 [base.py:151] - width: 1024
INFO 12-17 09:02:31 [base.py:151] - num_inference_steps: 28
INFO 12-17 09:02:31 [base.py:214] Example Output Summary:
INFO 12-17 09:02:31 [base.py:225] - Model: flux
INFO 12-17 09:02:31 [base.py:225] - Optimization: C0_Q0_NONE_Ulysses4
INFO 12-17 09:02:31 [base.py:225] - Load Time: 0.79s
INFO 12-17 09:02:31 [base.py:225] - Warmup Time: 21.09s
INFO 12-17 09:02:31 [base.py:225] - Inference Time: 8.27s
INFO 12-17 09:02:32 [base.py:182] Image saved to flux.1024x1024.C0_Q0_NONE_Ulysses4.png
# Enabled Cache Acceleration: 4.23s
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate flux --parallel ulysses --cache --scm fast
INFO 12-17 09:10:09 [base.py:151] Example Input Summary:
INFO 12-17 09:10:09 [base.py:151] - prompt: A cat holding a sign that says hello world
INFO 12-17 09:10:09 [base.py:151] - height: 1024
INFO 12-17 09:10:09 [base.py:151] - width: 1024
INFO 12-17 09:10:09 [base.py:151] - num_inference_steps: 28
INFO 12-17 09:10:09 [base.py:214] Example Output Summary:
INFO 12-17 09:10:09 [base.py:225] - Model: flux
INFO 12-17 09:10:09 [base.py:225] - Optimization: C0_Q0_DBCache_F1B0_W8I1M0MC3_R0.24_CFG0_T0O0_Ulysses4_S15
INFO 12-17 09:10:09 [base.py:225] - Load Time: 0.78s
INFO 12-17 09:10:09 [base.py:225] - Warmup Time: 18.49s
INFO 12-17 09:10:09 [base.py:225] - Inference Time: 4.23s
INFO 12-17 09:10:09 [base.py:182] Image saved to flux.1024x1024.C0_Q0_DBCache_F1B0_W8I1M0MC3_R0.24_CFG0_T0O0_Ulysses4_S15.png| NO Cache Acceleration: 8.27s | w/ Cache Acceleration: 4.23s |
|---|---|
![]() |
![]() |
It is very easy to add a new example. Please refer to the specific implementation in examples.py. For example:
@ExampleRegister.register("flux")
def flux_example(args: argparse.Namespace, **kwargs) -> Example:
from diffusers import FluxPipeline
return Example(
args=args,
init_config=ExampleInitConfig(
task_type=ExampleType.T2I, # Text to Image
model_name_or_path=_path("black-forest-labs/FLUX.1-dev"),
pipeline_class=FluxPipeline,
# `text_encoder_2` will be quantized when `--quantize-type`
# is set to `bnb_4bit`.
bnb_4bit_components=["text_encoder_2"],
),
input_data=ExampleInputData(
prompt="A cat holding a sign that says hello world",
height=1024,
width=1024,
num_inference_steps=28,
),
)
# NOTE: DON'T forget to add `flux_example` into helpers.pypython3 -m cache_dit.generate --help
positional arguments:
{generate,list,flux_nunchaku,flux,flux_fill,flux2,flux2_klein_base_9b,flux2_klein_base_4b,flux2_klein_9b,flux2_klein_4b,qwen_image_lightning,qwen_image_2512,qwen_image,qwen_image_edit_2511_lightning,qwen_image_edit_2511,qwen_image_edit_lightning,qwen_image_edit,qwen_image_controlnet,qwen_image_layered,skyreels_v2,ltx2_t2v,ltx2_i2v,wan2.2_t2v,wan2.1_t2v,wan2.2_i2v,wan2.1_i2v,wan2.2_vace,wan2.1_vace,ovis_image,zimage_nunchaku,zimage,zimage_controlnet_2.0,zimage_controlnet_2.1,longcat_image,longcat_image_edit}
The task to perform or example name to run. Use 'list' to list all available examples, or specify an example name directly (defaults to 'generate' task).
{None,flux_nunchaku,flux,flux_fill,flux2,flux2_klein_base_9b,flux2_klein_base_4b,flux2_klein_9b,flux2_klein_4b,qwen_image_lightning,qwen_image_2512,qwen_image,qwen_image_edit_2511_lightning,qwen_image_edit_2511,qwen_image_edit_lightning,qwen_image_edit,qwen_image_controlnet,qwen_image_layered,skyreels_v2,ltx2_t2v,ltx2_i2v,wan2.2_t2v,wan2.1_t2v,wan2.2_i2v,wan2.1_i2v,wan2.2_vace,wan2.1_vace,ovis_image,zimage_nunchaku,zimage,zimage_controlnet_2.0,zimage_controlnet_2.1,longcat_image,longcat_image_edit}
Names of the examples to run. If not specified, skip running example.
options:
-h, --help show this help message and exit
--model-path MODEL_PATH
Override model path if provided
--controlnet-path CONTROLNET_PATH
Override controlnet model path if provided
--lora-path LORA_PATH
Override lora model path if provided
--transformer-path TRANSFORMER_PATH
Override transformer model path if provided
--image-path IMAGE_PATH
Override image path if provided
--mask-image-path MASK_IMAGE_PATH
Override mask image path if provided
--config-path CONFIG_PATH, --config CONFIG_PATH
Path to CacheDiT configuration YAML file
--prompt PROMPT Override default prompt if provided
--negative-prompt NEGATIVE_PROMPT
Override default negative prompt if provided
--num_inference_steps NUM_INFERENCE_STEPS, --steps NUM_INFERENCE_STEPS
Number of inference steps
--warmup WARMUP Number of warmup steps before measuring performance
--warmup-num-inference-steps WARMUP_NUM_INFERENCE_STEPS, --warmup-steps WARMUP_NUM_INFERENCE_STEPS
Number of warmup inference steps per warmup before measuring performance
--repeat REPEAT Number of times to repeat the inference for performance measurement
--height HEIGHT Height of the generated image
--width WIDTH Width of the generated image
--seed SEED Random seed for reproducibility
--num-frames NUM_FRAMES, --frames NUM_FRAMES
Number of frames to generate for video
--save-path SAVE_PATH
Path to save the generated output, e.g., output.png or output.mp4
--cache Enable Cache Acceleration
--cache-summary, --summary
Enable Cache Summary logging
--Fn-compute-blocks FN_COMPUTE_BLOCKS, --Fn FN_COMPUTE_BLOCKS
CacheDiT Fn_compute_blocks parameter
--Bn-compute-blocks BN_COMPUTE_BLOCKS, --Bn BN_COMPUTE_BLOCKS
CacheDiT Bn_compute_blocks parameter
--residual-diff-threshold RESIDUAL_DIFF_THRESHOLD, --rdt RESIDUAL_DIFF_THRESHOLD
CacheDiT residual diff threshold
--max-warmup-steps MAX_WARMUP_STEPS, --ws MAX_WARMUP_STEPS
Maximum warmup steps for CacheDiT
--warmup-interval WARMUP_INTERVAL, --wi WARMUP_INTERVAL
Warmup interval for CacheDiT
--max-cached-steps MAX_CACHED_STEPS, --mc MAX_CACHED_STEPS
Maximum cached steps for CacheDiT
--max-continuous-cached-steps MAX_CONTINUOUS_CACHED_STEPS, --mcc MAX_CONTINUOUS_CACHED_STEPS
Maximum continuous cached steps for CacheDiT
--taylorseer Enable TaylorSeer for CacheDiT
--taylorseer-order TAYLORSEER_ORDER, -order TAYLORSEER_ORDER
TaylorSeer order
--steps-mask Enable steps mask for CacheDiT
--mask-policy {None,slow,s,medium,m,fast,f,ultra,u}, --scm {None,slow,s,medium,m,fast,f,ultra,u}
Pre-defined steps computation mask policy
--quantize, --q Enable quantization for transformer
--quantize-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}, --q-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}
--quantize-text-encoder, --q-text
Enable quantization for text encoder
--quantize-text-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}, --q-text-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}
--quantize-controlnet, --q-controlnet
Enable quantization for text encoder
--quantize-controlnet-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}, --q-controlnet-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}
--parallel-type {None,tp,ulysses,ring}, --parallel {None,tp,ulysses,ring}
--parallel-vae Enable VAE parallelism if applicable.
--parallel-text-encoder, --parallel-text
Enable text encoder parallelism if applicable.
--parallel-controlnet
Enable ControlNet parallelism if applicable.
--attn {None,flash,_flash_3,native,_native_cudnn,_sdpa_cudnn,sage,_native_npu}
--ulysses-anything, --uaa
Enable Ulysses Anything Attention for context parallelism
--ulysses-float8, --ufp8
Enable Ulysses Attention/UAA Float8 for context parallelism
--ulysses-async, --uaqkv
Enabled experimental Async QKV Projection with Ulysses for context parallelism
--cpu-offload, --cpu-offload-model
Enable CPU offload for model if applicable.
--sequential-cpu-offload
Enable sequential GPU offload for model if applicable.
--device-map-balance, --device-map
Enable automatic device map balancing model if multiple GPUs are available.
--vae-tiling Enable VAE tiling for low memory device.
--vae-slicing Enable VAE slicing for low memory device.
--compile Enable compile for transformer
--compile-repeated-blocks
Enable compile for repeated blocks in transformer
--compile-vae Enable compile for VAE
--compile-text-encoder, --compile-text
Enable compile for text encoder
--compile-controlnet Enable compile for ControlNet
--max-autotune Enable max-autotune mode for torch.compile
--track-memory Track and report peak GPU memory usage
--profile Enable profiling with torch.profiler
--profile-name PROFILE_NAME
Name for the profiling session
--profile-dir PROFILE_DIR
Directory to save profiling results
--profile-activities {CPU,GPU,MEM} [{CPU,GPU,MEM} ...]
Activities to profile (CPU, GPU, MEM)
--profile-with-stack profile with stack for better traceability
--profile-record-shapes
profile record shapes for better analysis
--disable-fuse-lora DISABLE_FUSE_LORA
Disable fuse_lora even if lora weights are provided.
--generator-device GENERATOR_DEVICE, --gen-device GENERATOR_DEVICE
Device for torch.Generator, e.g., 'cuda' or 'cpu'. If not set, use 'cpu' for better reproducibility across different hardware.








