Examples for Cache-DiT

Z-Image-ControlNet	Context Parallel: Ulysses 2	Context Parallel: Ulysses 4	+ ControlNet Parallel
Base L20x1: 22s	15.7s	12.7s	🚀7.71s

+ Hybrid Cache	+ Torch Compile	+ Async Ulyess CP	+ FP8 All2All + CUDNN ATTN
🚀6.85s	6.45s	6.38s	🚀6.19s, 5.47s

Installation

pip3 install torch==2.9.1 transformers accelerate torchao==0.14.1 bitsandbytes torchvision 
pip3 install opencv-python-headless einops imageio-ffmpeg ftfy 
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latest

Available Examples

python3 -m cache_dit.generate list  # list all available examples

[generate.py:47] Available examples:
[generate.py:53] - ✅ flux_nunchaku                  - Defalut: nunchaku-tech/nunchaku-flux.1-dev
[generate.py:53] - ✅ flux                           - Defalut: black-forest-labs/FLUX.1-dev
[generate.py:53] - ✅ flux_fill                      - Defalut: black-forest-labs/FLUX.1-Fill-dev
[generate.py:53] - ✅ flux2                          - Defalut: black-forest-labs/FLUX.2-dev
[generate.py:53] - ✅ flux2_klein_base_9b            - Defalut: black-forest-labs/FLUX.2-klein-base-9B
[generate.py:53] - ✅ flux2_klein_base_4b            - Defalut: black-forest-labs/FLUX.2-klein-base-4B
[generate.py:53] - ✅ flux2_klein_9b                 - Defalut: black-forest-labs/FLUX.2-klein-9B
[generate.py:53] - ✅ flux2_klein_4b                 - Defalut: black-forest-labs/FLUX.2-klein-4B
[generate.py:53] - ✅ qwen_image_lightning           - Defalut: lightx2v/Qwen-Image-Lightning
[generate.py:53] - ✅ qwen_image_2512                - Defalut: Qwen/Qwen-Image-2512
[generate.py:53] - ✅ qwen_image                     - Defalut: Qwen/Qwen-Image
[generate.py:53] - ✅ qwen_image_edit_2511_lightning - Defalut: lightx2v/Qwen-Image-Edit-2511-Lightning
[generate.py:53] - ✅ qwen_image_edit_2511           - Defalut: Qwen/Qwen-Image-Edit-2511
[generate.py:53] - ✅ qwen_image_edit_lightning      - Defalut: lightx2v/Qwen-Image-Lightning
[generate.py:53] - ✅ qwen_image_edit                - Defalut: Qwen/Qwen-Image-Edit-2509
[generate.py:53] - ✅ qwen_image_controlnet          - Defalut: InstantX/Qwen-Image-ControlNet-Inpainting
[generate.py:53] - ✅ qwen_image_layered             - Defalut: Qwen/Qwen-Image-Layered
[generate.py:53] - ✅ ltx2_t2v                       - Defalut: Lightricks/LTX-2
[generate.py:53] - ✅ ltx2_i2v                       - Defalut: Lightricks/LTX-2
[generate.py:53] - ✅ skyreels_v2                    - Defalut: Skywork/SkyReels-V2-T2V-14B-720P-Diffusers
[generate.py:53] - ✅ wan2.2_t2v                     - Defalut: Wan-AI/Wan2.2-T2V-A14B-Diffusers
[generate.py:53] - ✅ wan2.1_t2v                     - Defalut: Wan-AI/Wan2.1-T2V-1.3B-Diffusers
[generate.py:53] - ✅ wan2.2_i2v                     - Defalut: Wan-AI/Wan2.2-I2V-A14B-Diffusers
[generate.py:53] - ✅ wan2.1_i2v                     - Defalut: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
[generate.py:53] - ✅ wan2.2_vace                    - Defalut: linoyts/Wan2.2-VACE-Fun-14B-diffusers
[generate.py:53] - ✅ wan2.1_vace                    - Defalut: Wan-AI/Wan2.1-VACE-1.3B-diffusers
[generate.py:53] - ✅ ovis_image                     - Defalut: AIDC-AI/Ovis-Image-7B
[generate.py:53] - ✅ zimage_nunchaku                - Defalut: nunchaku/nunchaku-z-image-turbo
[generate.py:53] - ✅ zimage                         - Defalut: Tongyi-MAI/Z-Image-Turbo
[generate.py:53] - ✅ zimage_controlnet_2.0          - Defalut: alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.0
[generate.py:53] - ✅ zimage_controlnet_2.1          - Defalut: alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.1
[generate.py:53] - ✅ longcat_image                  - Defalut: meituan-longcat/LongCat-Image
[generate.py:53] - ✅ longcat_image_edit             - Defalut: meituan-longcat/LongCat-Image-Edit

Single GPU Inference

The easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single GPU inference. For examples:

# baseline
# use default model path, e.g, "black-forest-labs/FLUX.1-dev"
python3 -m cache_dit.generate flux 
python3 -m cache_dit.generate flux_nunchaku # need nunchaku library
python3 -m cache_dit.generate flux2
python3 -m cache_dit.generate ovis_image
python3 -m cache_dit.generate qwen_image_edit_lightning
python3 -m cache_dit.generate qwen_image
python3 -m cache_dit.generate ltx2_t2v --cache --cpu-offload
python3 -m cache_dit.generate ltx2_i2v --cache --cpu-offload
python3 -m cache_dit.generate skyreels_v2
python3 -m cache_dit.generate wan2.2
python3 -m cache_dit.generate zimage 
python3 -m cache_dit.generate zimage_nunchaku 
python3 -m cache_dit.generate zimage_controlnet_2.1 
python3 -m cache_dit.generate generate longcat_image
python3 -m cache_dit.generate generate longcat_image_edit
# w/ cache acceleration
python3 -m cache_dit.generate flux --cache
python3 -m cache_dit.generate flux --cache --taylorseer
python3 -m cache_dit.generate flux_nunchaku --cache
python3 -m cache_dit.generate qwen_image --cache
python3 -m cache_dit.generate zimage --cache --rdt 0.6 --scm fast
python3 -m cache_dit.generate zimage_controlnet_2.1 --cache --rdt 0.6 --scm fast
# enable cpu offload or vae tiling if your encounter an OOM error
python3 -m cache_dit.generate qwen_image --cache --cpu-offload
python3 -m cache_dit.generate qwen_image --cache --cpu-offload --vae-tiling
python3 -m cache_dit.generate qwen_image_edit_lightning --cpu-offload --steps 4
python3 -m cache_dit.generate qwen_image_edit_lightning --cpu-offload --steps 8
# or, enable sequential cpu offload for extremly low VRAM device
python3 -m cache_dit.generate flux2 --sequential-cpu-offload # FLUX2 56B total
# use `--summary` option to show the cache acceleration stats
python3 -m cache_dit.generate zimage --cache --rdt 0.6 --scm fast --summary

Custom Model Path

The default model path are the official model names on HuggingFace Hub. Users can set custom local model path by settig --model-path. For examples:

python3 -m cache_dit.generate flux --model-path /PATH/TO/FLUX.1-dev
python3 -m cache_dit.generate zimage --model-path /PATH/TO/Z-Image-Turbo
python3 -m cache_dit.generate qwem_image --model-path /PATH/TO/Qwen-Image

Multi-GPU Inference

cache-dit is designed to work seamlessly with CPU or Sequential Offloading, 🔥Context Parallelism, 🔥Tensor Parallelism. For examples:

# context parallelism or tensor parallelism
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses 
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ring 
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel usp # USP: Ulysses + Ring
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel tp
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses 
torchrun --nproc_per_node=4 -m cache_dit.generate zimage_controlnet_2.1 --parallel ulysses 
# ulysses anything attention
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --ulysses-anything
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image_edit_lightning --parallel ulysses --ulysses-anything
# text encoder parallelism, enable it by add: `--parallel-text-encoder`
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel tp --parallel-text-encoder
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image_edit_lightning --parallel ulysses --ulysses-anything --parallel-text-encoder
# Hint: set `--local-ranks-filter=0` to torchrun -> only show logs on rank 0
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate flux --parallel ulysses 
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_t2v --parallel ulysses --parallel-vae --parallel-text-encoder --cache --ulysses-anything
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_t2v --parallel tp --parallel-vae --parallel-text-encoder --cache
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_i2v --parallel ulysses --parallel-vae --parallel-text-encoder --cache --ulysses-anything
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate ltx2_i2v --parallel tp --parallel-vae --parallel-text-encoder --cache

Low-bits Quantization

cache-dit is designed to work seamlessly with torch.compile, Quantization (🔥torchao, 🔥nunchaku), For examples:

# please also enable torch.compile if the quantation is using.
python3 -m cache_dit.generate flux --cache --quantize-type float8 --compile
python3 -m cache_dit.generate flux --cache --quantize-type int8 --compile
python3 -m cache_dit.generate flux --cache --quantize-type float8_weight_only --compile
python3 -m cache_dit.generate flux --cache --quantize-type int8_weight_only --compile
python3 -m cache_dit.generate flux --cache --quantize-type bnb_4bit --compile # w4a16
python3 -m cache_dit.generate flux_nunchaku --cache --compile # w4a4 SVDQ

Hybrid Acceleration

Here are some examples for hybrid cache acceleration + parallelism for popular DiTs with cache-dit.

# DBCache + SCM + Taylorseer
python3 -m cache_dit.generate flux --cache --scm fast --taylorsees --taylorseer-order 1
# DBCache + SCM + Taylorseer + Context Parallelism + Text Encoder Parallelism + Compile 
# + FP8 quantization + FP8 All2All comm + CUDNN Attention (--attn _sdpa_cudnn)
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --ulysses-float8 \
         --attn _sdpa_cudnn --parallel-text-encoder --cache --scm fast --taylorseer \
         --taylorseer-order 1 --quantize-type float8 --warmup 2 --repeat 5 --compile 
# DBCache + SCM + Taylorseer + Context Parallelism + Text Encoder Parallelism + Compile 
# + FP8 quantization + FP8 All2All comm + FP8 SageAttention (--attn sage)
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --ulysses-float8 \
         --attn sage --parallel-text-encoder --cache --scm fast --taylorseer \
         --taylorseer-order 1 --quantize-type float8 --warmup 2 --repeat 5 --compile 
# Case: Hybrid Acceleration for Qwen-Image-Edit-Lightning, tracking memory usage.
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate qwen_image_edit_lightning \
         --parallel ulysses --ulysses-anything --parallel-text-encoder \
         --quantize-type float8_weight_only --steps 4 --track-memory --compile
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate qwen_image_edit_lightning \
         --parallel tp --parallel-text-encoder --quantize-type float8_weight_only \
         --steps 4 --track-memory --compile
# Case: Hybrid Acceleration + Context Parallelism + ControlNet Parallelism, e.g, Z-Image-ControlNet
torchrun --nproc_per_node=4 -m cache_dit.generate zimage_controlnet_2.1 --parallel ulysses \
         --parallel-controlnet --cache --rdt 0.6 --scm fast
torchrun --nproc_per_node=4 -m cache_dit.generate zimage_controlnet_2.1 --parallel ulysses \
         --parallel-controlnet --cache --scm fast --rdt 0.6 --compile \
         --compile-controlnet --ulysses-float8 --attn _sdpa_cudnn \
         --warmup 2 --repeat 4

End2End Examples

# NO Cache Acceleration: 8.27s
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate flux --parallel ulysses

INFO 12-17 09:02:31 [base.py:151] Example Input Summary:
INFO 12-17 09:02:31 [base.py:151] - prompt: A cat holding a sign that says hello world
INFO 12-17 09:02:31 [base.py:151] - height: 1024
INFO 12-17 09:02:31 [base.py:151] - width: 1024
INFO 12-17 09:02:31 [base.py:151] - num_inference_steps: 28
INFO 12-17 09:02:31 [base.py:214] Example Output Summary:
INFO 12-17 09:02:31 [base.py:225] - Model: flux
INFO 12-17 09:02:31 [base.py:225] - Optimization: C0_Q0_NONE_Ulysses4
INFO 12-17 09:02:31 [base.py:225] - Load Time: 0.79s
INFO 12-17 09:02:31 [base.py:225] - Warmup Time: 21.09s
INFO 12-17 09:02:31 [base.py:225] - Inference Time: 8.27s
INFO 12-17 09:02:32 [base.py:182] Image saved to flux.1024x1024.C0_Q0_NONE_Ulysses4.png

# Enabled Cache Acceleration: 4.23s
torchrun --nproc_per_node=4 --local-ranks-filter=0 -m cache_dit.generate flux --parallel ulysses --cache --scm fast

INFO 12-17 09:10:09 [base.py:151] Example Input Summary:
INFO 12-17 09:10:09 [base.py:151] - prompt: A cat holding a sign that says hello world
INFO 12-17 09:10:09 [base.py:151] - height: 1024
INFO 12-17 09:10:09 [base.py:151] - width: 1024
INFO 12-17 09:10:09 [base.py:151] - num_inference_steps: 28
INFO 12-17 09:10:09 [base.py:214] Example Output Summary:
INFO 12-17 09:10:09 [base.py:225] - Model: flux
INFO 12-17 09:10:09 [base.py:225] - Optimization: C0_Q0_DBCache_F1B0_W8I1M0MC3_R0.24_CFG0_T0O0_Ulysses4_S15
INFO 12-17 09:10:09 [base.py:225] - Load Time: 0.78s
INFO 12-17 09:10:09 [base.py:225] - Warmup Time: 18.49s
INFO 12-17 09:10:09 [base.py:225] - Inference Time: 4.23s
INFO 12-17 09:10:09 [base.py:182] Image saved to flux.1024x1024.C0_Q0_DBCache_F1B0_W8I1M0MC3_R0.24_CFG0_T0O0_Ulysses4_S15.png

NO Cache Acceleration: 8.27s	w/ Cache Acceleration: 4.23s

How to Add New Example

It is very easy to add a new example. Please refer to the specific implementation in examples.py. For example:

@ExampleRegister.register("flux")
def flux_example(args: argparse.Namespace, **kwargs) -> Example:
    from diffusers import FluxPipeline

    return Example(
        args=args,
        init_config=ExampleInitConfig(
            task_type=ExampleType.T2I,  # Text to Image
            model_name_or_path=_path("black-forest-labs/FLUX.1-dev"),
            pipeline_class=FluxPipeline,
            # `text_encoder_2` will be quantized when `--quantize-type` 
            # is set to `bnb_4bit`.
            bnb_4bit_components=["text_encoder_2"],
        ),
        input_data=ExampleInputData(
            prompt="A cat holding a sign that says hello world",
            height=1024,
            width=1024,
            num_inference_steps=28,
        ),
    )

# NOTE: DON'T forget to add `flux_example` into helpers.py

More Usages about Examples

python3 -m cache_dit.generate --help

positional arguments:
  {generate,list,flux_nunchaku,flux,flux_fill,flux2,flux2_klein_base_9b,flux2_klein_base_4b,flux2_klein_9b,flux2_klein_4b,qwen_image_lightning,qwen_image_2512,qwen_image,qwen_image_edit_2511_lightning,qwen_image_edit_2511,qwen_image_edit_lightning,qwen_image_edit,qwen_image_controlnet,qwen_image_layered,skyreels_v2,ltx2_t2v,ltx2_i2v,wan2.2_t2v,wan2.1_t2v,wan2.2_i2v,wan2.1_i2v,wan2.2_vace,wan2.1_vace,ovis_image,zimage_nunchaku,zimage,zimage_controlnet_2.0,zimage_controlnet_2.1,longcat_image,longcat_image_edit}
                        The task to perform or example name to run. Use 'list' to list all available examples, or specify an example name directly (defaults to 'generate' task).
  {None,flux_nunchaku,flux,flux_fill,flux2,flux2_klein_base_9b,flux2_klein_base_4b,flux2_klein_9b,flux2_klein_4b,qwen_image_lightning,qwen_image_2512,qwen_image,qwen_image_edit_2511_lightning,qwen_image_edit_2511,qwen_image_edit_lightning,qwen_image_edit,qwen_image_controlnet,qwen_image_layered,skyreels_v2,ltx2_t2v,ltx2_i2v,wan2.2_t2v,wan2.1_t2v,wan2.2_i2v,wan2.1_i2v,wan2.2_vace,wan2.1_vace,ovis_image,zimage_nunchaku,zimage,zimage_controlnet_2.0,zimage_controlnet_2.1,longcat_image,longcat_image_edit}
                        Names of the examples to run. If not specified, skip running example.

options:
  -h, --help            show this help message and exit
  --model-path MODEL_PATH
                        Override model path if provided
  --controlnet-path CONTROLNET_PATH
                        Override controlnet model path if provided
  --lora-path LORA_PATH
                        Override lora model path if provided
  --transformer-path TRANSFORMER_PATH
                        Override transformer model path if provided
  --image-path IMAGE_PATH
                        Override image path if provided
  --mask-image-path MASK_IMAGE_PATH
                        Override mask image path if provided
  --config-path CONFIG_PATH, --config CONFIG_PATH
                        Path to CacheDiT configuration YAML file
  --prompt PROMPT       Override default prompt if provided
  --negative-prompt NEGATIVE_PROMPT
                        Override default negative prompt if provided
  --num_inference_steps NUM_INFERENCE_STEPS, --steps NUM_INFERENCE_STEPS
                        Number of inference steps
  --warmup WARMUP       Number of warmup steps before measuring performance
  --warmup-num-inference-steps WARMUP_NUM_INFERENCE_STEPS, --warmup-steps WARMUP_NUM_INFERENCE_STEPS
                        Number of warmup inference steps per warmup before measuring performance
  --repeat REPEAT       Number of times to repeat the inference for performance measurement
  --height HEIGHT       Height of the generated image
  --width WIDTH         Width of the generated image
  --seed SEED           Random seed for reproducibility
  --num-frames NUM_FRAMES, --frames NUM_FRAMES
                        Number of frames to generate for video
  --save-path SAVE_PATH
                        Path to save the generated output, e.g., output.png or output.mp4
  --cache               Enable Cache Acceleration
  --cache-summary, --summary
                        Enable Cache Summary logging
  --Fn-compute-blocks FN_COMPUTE_BLOCKS, --Fn FN_COMPUTE_BLOCKS
                        CacheDiT Fn_compute_blocks parameter
  --Bn-compute-blocks BN_COMPUTE_BLOCKS, --Bn BN_COMPUTE_BLOCKS
                        CacheDiT Bn_compute_blocks parameter
  --residual-diff-threshold RESIDUAL_DIFF_THRESHOLD, --rdt RESIDUAL_DIFF_THRESHOLD
                        CacheDiT residual diff threshold
  --max-warmup-steps MAX_WARMUP_STEPS, --ws MAX_WARMUP_STEPS
                        Maximum warmup steps for CacheDiT
  --warmup-interval WARMUP_INTERVAL, --wi WARMUP_INTERVAL
                        Warmup interval for CacheDiT
  --max-cached-steps MAX_CACHED_STEPS, --mc MAX_CACHED_STEPS
                        Maximum cached steps for CacheDiT
  --max-continuous-cached-steps MAX_CONTINUOUS_CACHED_STEPS, --mcc MAX_CONTINUOUS_CACHED_STEPS
                        Maximum continuous cached steps for CacheDiT
  --taylorseer          Enable TaylorSeer for CacheDiT
  --taylorseer-order TAYLORSEER_ORDER, -order TAYLORSEER_ORDER
                        TaylorSeer order
  --steps-mask          Enable steps mask for CacheDiT
  --mask-policy {None,slow,s,medium,m,fast,f,ultra,u}, --scm {None,slow,s,medium,m,fast,f,ultra,u}
                        Pre-defined steps computation mask policy
  --quantize, --q       Enable quantization for transformer
  --quantize-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}, --q-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}
  --quantize-text-encoder, --q-text
                        Enable quantization for text encoder
  --quantize-text-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}, --q-text-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}
  --quantize-controlnet, --q-controlnet
                        Enable quantization for text encoder
  --quantize-controlnet-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}, --q-controlnet-type {None,float8,float8_weight_only,float8_wo,int8,int8_weight_only,int8_wo,int4,int4_weight_only,int4_wo,bitsandbytes_4bit,bnb_4bit}
  --parallel-type {None,tp,ulysses,ring}, --parallel {None,tp,ulysses,ring}
  --parallel-vae        Enable VAE parallelism if applicable.
  --parallel-text-encoder, --parallel-text
                        Enable text encoder parallelism if applicable.
  --parallel-controlnet
                        Enable ControlNet parallelism if applicable.
  --attn {None,flash,_flash_3,native,_native_cudnn,_sdpa_cudnn,sage,_native_npu}
  --ulysses-anything, --uaa
                        Enable Ulysses Anything Attention for context parallelism
  --ulysses-float8, --ufp8
                        Enable Ulysses Attention/UAA Float8 for context parallelism
  --ulysses-async, --uaqkv
                        Enabled experimental Async QKV Projection with Ulysses for context parallelism
  --cpu-offload, --cpu-offload-model
                        Enable CPU offload for model if applicable.
  --sequential-cpu-offload
                        Enable sequential GPU offload for model if applicable.
  --device-map-balance, --device-map
                        Enable automatic device map balancing model if multiple GPUs are available.
  --vae-tiling          Enable VAE tiling for low memory device.
  --vae-slicing         Enable VAE slicing for low memory device.
  --compile             Enable compile for transformer
  --compile-repeated-blocks
                        Enable compile for repeated blocks in transformer
  --compile-vae         Enable compile for VAE
  --compile-text-encoder, --compile-text
                        Enable compile for text encoder
  --compile-controlnet  Enable compile for ControlNet
  --max-autotune        Enable max-autotune mode for torch.compile
  --track-memory        Track and report peak GPU memory usage
  --profile             Enable profiling with torch.profiler
  --profile-name PROFILE_NAME
                        Name for the profiling session
  --profile-dir PROFILE_DIR
                        Directory to save profiling results
  --profile-activities {CPU,GPU,MEM} [{CPU,GPU,MEM} ...]
                        Activities to profile (CPU, GPU, MEM)
  --profile-with-stack  profile with stack for better traceability
  --profile-record-shapes
                        profile record shapes for better analysis
  --disable-fuse-lora DISABLE_FUSE_LORA
                        Disable fuse_lora even if lora weights are provided.
  --generator-device GENERATOR_DEVICE, --gen-device GENERATOR_DEVICE
                        Device for torch.Generator, e.g., 'cuda' or 'cpu'. If not set, use 'cpu' for better reproducibility across different hardware.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Examples for Cache-DiT

Installation

Available Examples

Single GPU Inference

Custom Model Path

Multi-GPU Inference

Low-bits Quantization

Hybrid Acceleration

End2End Examples

How to Add New Example

More Usages about Examples

FilesExpand file tree

EXAMPLES.md

Latest commit

History

EXAMPLES.md

File metadata and controls

Examples for Cache-DiT

Installation

Available Examples

Single GPU Inference

Custom Model Path

Multi-GPU Inference

Low-bits Quantization

Hybrid Acceleration

End2End Examples

How to Add New Example

More Usages about Examples