🔥 We are excited to announce that Cache-DiT now provides native support for Ascend NPU. In principle, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT's optimization technologies, including the following (see the usage sketch after this list):
- Hybrid Cache Acceleration (DBCache, DBPrune, TaylorSeer, SCM and more)
- Context Parallelism (w/ Extended Diffusers' CP APIs, UAA, Async Ulysses, ...)
- Tensor Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- Text Encoder Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- Auto Encoder (VAE) Parallelism (w/ Data or Tile Parallelism, avoid OOM)
- ControlNet Parallelism (w/ Context Parallelism for ControlNet module)
- Built-in HTTP serving deployment support with simple REST APIs
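For a quick taste of how these features are enabled in code, here is a minimal sketch of the one-line cache API on an Ascend NPU (assuming torch_npu is installed; the FLUX.1-dev model and prompt are illustrative, not required):

```python
# Minimal sketch: hybrid cache acceleration on an Ascend NPU.
# Assumes torch_npu is installed; model choice is illustrative.
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
import cache_dit
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

# One-line hybrid cache acceleration (DBCache and friends).
cache_dit.enable_cache(pipe)

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
image.save("cat.png")
```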
Please refer to the Ascend NPU Supported Matrix below for more details.
| Device | Hybrid Cache | Context Parallel | Tensor Parallel | Text Encoder Parallel | Auto Encoder (VAE) Parallel | Compilation |
|---|---|---|---|---|---|---|
| Atlas 800T A2 | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 |
| Atlas 800I A2 | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 |
🟡: Experimental Feature
Cache-DiT supports multiple attention backends for better performance. The attention backends supported on Ascend NPU are listed below:
| Backend | Details | Parallelism | attn_mask |
|---|---|---|---|
| native | Native SDPA Attention in PyTorch | ✅ | ✅ |
| _native_npu | Optimized Ascend NPU Attention | ✅ | ✅ |
We strongly recommend using the _native_npu backend to achieve better performance.
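With a recent Diffusers release, the backend can be selected through the attention dispatcher. A minimal sketch (assuming diffusers >= 0.36 with `set_attention_backend`, and torch_npu installed):

```python
# Sketch: select the optimized Ascend NPU attention backend.
# Assumes diffusers >= 0.36 (attention dispatcher) and torch_npu installed.
import torch
import torch_npu  # noqa: F401
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

# Backend names match the table above: "native" (SDPA) or "_native_npu".
pipe.transformer.set_attention_backend("_native_npu")
```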
There are two installation methods:
- Using pip: first prepare the environment manually or via a CANN image, then install cache-dit using pip.
- Using docker: use the Ascend NPU community vllm-ascend pre-built docker image as the base image for cache-dit directly. (Recommended; no need to install torch and torch_npu manually.)
This section describes how to set up the NPU environment manually.
Requirements:
- OS: Linux
- Python: >= 3.10, < 3.12
- Hardware: a machine with an Ascend NPU, usually the Atlas 800 A2 series
- Software:
| Software | Supported version | Note |
|---|---|---|
| Ascend HDK | Refer to here | Required for CANN |
| CANN | == 8.3.RC2 | Required for cache-dit and torch-npu |
| torch-npu | == 2.8.0 | Required for cache-dit |
| torch | == 2.8.0 | Required for torch-npu and cache-dit |
| NNAL | == 8.3.RC2 | Required for libatb.so, enables advanced tensor operations |
Before installation, make sure the firmware/driver and CANN are installed correctly; refer to the Ascend Environment Setup Guide for more details. To verify that the Ascend NPU firmware and driver are correctly installed, run:

```bash
npu-smi info
```
The easiest way to prepare your software environment is to use a CANN image directly. We recommend the Ascend NPU community vllm-ascend pre-built docker image as the base image for cache-dit. CANN images can be found on the Ascend official community website: here. The CANN pre-built image includes NNAL (Ascend Neural Network Acceleration Library), which provides libatb.so for advanced tensor operations, so no additional installation is required when using it.

```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the pre-built image
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
--name cache-dit-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

If installing torch with the pip commands below fails, you can get the torch-2.8.0*.whl file via Link and install it manually.

```bash
# torch: aarch64
pip3 install torch==2.8.0
# torch: x86
pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
```

We strongly recommend installing torch_npu by acquiring the torch_npu-2.8.0*.whl file via Link and installing it manually. For more details about Ascend PyTorch Adapter installation, please refer to https://gitcode.com/Ascend/pytorch.

```bash
pip install --no-deps torchvision==0.16.0
pip install einops sentencepiece accelerate
```

We recommend using the pre-built image from the Ascend NPU community (vllm-ascend) as the base image for cache-dit on Ascend NPU. You can simply pull the pre-built image from the image repository and run it with bash. For example:

```bash
# Download pre-built image for Ascend NPU
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
# Use the pre-built image for cache-dit
docker run \
--name cache-dit-ascend \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--net=host \
--shm-size=80g \
--privileged=true \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /data:/data \
-itd quay.io/ascend/vllm-ascend:v0.13.0rc1 bash
```

Inside the container, source the CANN environment and configure the NPU runtime:

```bash
# Make sure CANN_path is set to your CANN installation path
# e.g., export CANN_path=/usr/local/Ascend
source $CANN_path/ascend-toolkit/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Set NPU devices by ASCEND_RT_VISIBLE_DEVICES env
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```

Once this is done, you can start to set up cache-dit.
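Before setting up cache-dit, it is worth verifying that PyTorch can see the NPUs. A quick check, assuming torch and torch_npu are installed:

```python
# Quick sanity check that torch_npu works and NPUs are visible.
import torch
import torch_npu  # noqa: F401  (patches torch with the "npu" device)

print(torch.npu.is_available())   # expected: True
print(torch.npu.device_count())   # devices visible via ASCEND_RT_VISIBLE_DEVICES
```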
You can install the stable release of cache-dit from PyPI:

```bash
pip3 install -U cache-dit
```

Or you can install the latest development version from GitHub:

```bash
pip3 install git+https://github.com/vipshop/cache-dit.git
```

Please also install the latest main branch of diffusers for context parallelism:

```bash
pip3 install git+https://github.com/huggingface/diffusers.git # or >= 0.36.0
```

After the environment configuration is complete, refer to the Quick Examples, Ascend NPU Benchmark, and Ascend NPU Supported Matrix for more details. To run the examples, install the following dependencies:

```bash
pip3 install opencv-python-headless einops imageio-ffmpeg ftfy
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latest
```

The easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single-NPU inference. For example:

```bash
# use the default model path, e.g., "black-forest-labs/FLUX.1-dev"
python3 -m cache_dit.generate flux --attn _native_npu
python3 -m cache_dit.generate qwen_image --attn _native_npu
python3 -m cache_dit.generate flux --cache --attn _native_npu
python3 -m cache_dit.generate qwen_image --cache --attn _native_npu
```

cache-dit is designed to work with 🔥 Context Parallelism and 🔥 Tensor Parallelism. For example:

```bash
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --cache --attn _native_npu
```

MindIE-SD is an open-source acceleration framework for diffusion models on Ascend NPU. By providing a custom MindieSDBackend for torch.compile, it enables automatic operator fusion and optimization for better performance on Ascend hardware. For detailed documentation and examples, visit MindIE-SD. To build and install MindIE-SD from source:

```bash
git clone --branch master --single-branch https://gitcode.com/Ascend/MindIE-SD.git
cd MindIE-SD
pip install wheel
python3 setup.py bdist_wheel
pip install ./dist/*.whl --force-reinstall
```

Then run generation with compilation enabled:

```bash
python3 generate.py flux --attn _native_npu --compile
```

The performance of MindIE-SD compilation is as follows (device: 910B3):
| Model | Batch Size | Resolution | Compile | E2E Time (s) |
|---|---|---|---|---|
| flux.1-dev | 1 | 1024x1024 | ❌ | 14.09 |
| flux.1-dev | 1 | 1024x1024 | ✅ | 12.85 |
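For reference, here is a rough sketch of how a custom torch.compile backend like this is typically wired into a pipeline. The mindiesd import path and MindieSDBackend usage below are assumptions for illustration, not confirmed API; consult the MindIE-SD documentation for the exact usage.

```python
# Sketch only: wiring a custom torch.compile backend into a Diffusers pipeline.
# The import path and backend usage are ASSUMPTIONS for illustration; consult
# the MindIE-SD documentation for the actual API.
import torch
import torch_npu  # noqa: F401
from diffusers import FluxPipeline

from mindiesd import MindieSDBackend  # hypothetical import path

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

# torch.compile accepts a custom backend callable; MindIE-SD's backend fuses
# and optimizes operators in the captured graph for Ascend hardware.
pipe.transformer = torch.compile(pipe.transformer, backend=MindieSDBackend())
```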