🔥 We are excited to announce that Cache-DiT now provides native support for Ascend NPU. In principle, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT's optimization technologies, including the following (see the usage sketch after this list):
- Hybrid Cache Acceleration (DBCache, DBPrune, TaylorSeer, SCM and more)
- Context Parallelism (w/ Extended Diffusers' CP APIs, UAA, Async Ulysses, ...)
- Tensor Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- Text Encoder Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
- Auto Encoder (VAE) Parallelism (w/ Data or Tile Parallelism, avoid OOM)
- ControlNet Parallelism (w/ Context Parallelism for ControlNet module)
- Built-in HTTP serving deployment support with simple REST APIs
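For a quick taste of how these features are enabled in code, here is a minimal sketch of the one-line cache API on an Ascend NPU (assuming torch_npu is installed; the FLUX.1-dev model and prompt are illustrative, not required):

```python
# Minimal sketch: hybrid cache acceleration on an Ascend NPU.
# Assumes torch_npu is installed; model choice is illustrative.
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
import cache_dit
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

# One-line hybrid cache acceleration (DBCache and friends).
cache_dit.enable_cache(pipe)

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
image.save("cat.png")
```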
Please refer to the Ascend NPU Supported Matrix below for more details.
| Device | Hybrid Cache | Context Parallel | Tensor Parallel | Text Encoder Parallel | Auto Encoder (VAE) Parallel | Compilation |
|---|---|---|---|---|---|---|
| Atlas 800T A2 | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 |
| Atlas 800I A2 | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 |
🟡: Experimental Feature
Cache-DiT supports multiple attention backends for better performance. The attention backends supported on Ascend NPU are listed below:
| Backend | Details | Parallelism | attn_mask |
|---|---|---|---|
| native | Native SDPA Attention in PyTorch | ✅ | ✅ |
| _native_npu | Optimized Ascend NPU Attention | ✅ | ✅ |
We strongly recommend using the _native_npu backend to achieve better performance.
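With a recent Diffusers release, the backend can be selected through the attention dispatcher. A minimal sketch (assuming diffusers >= 0.36 with `set_attention_backend`, and torch_npu installed):

```python
# Sketch: select the optimized Ascend NPU attention backend.
# Assumes diffusers >= 0.36 (attention dispatcher) and torch_npu installed.
import torch
import torch_npu  # noqa: F401
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

# Backend names match the table above: "native" (SDPA) or "_native_npu".
pipe.transformer.set_attention_backend("_native_npu")
```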
There are two installation methods:
- Using pip: first prepare the environment manually or via a CANN image, then install cache-dit using pip.
- Using docker: use the Ascend NPU community vllm-ascend pre-built docker image as the base image for cache-dit directly. (Recommended; no need to install torch and torch_npu manually.)
This section describes how to set up the NPU environment manually.
Requirements:
- OS: Linux
- Python: >= 3.10, < 3.12
- Hardware: a machine with an Ascend NPU, usually the Atlas 800 A2 series
- Software:
| Software | Supported version | Note |
|---|---|---|
| Ascend HDK | Refer to here | Required for CANN |
| CANN | == 8.3.RC2 | Required for cache-dit and torch-npu |
| torch-npu | == 2.8.0 | Required for cache-dit |
| torch | == 2.8.0 | Required for torch-npu and cache-dit |
| NNAL | == 8.3.RC2 | Required for libatb.so, enables advanced tensor operations |
Before installation, make sure the firmware/driver and CANN are installed correctly; refer to the Ascend Environment Setup Guide for more details. To verify that the Ascend NPU firmware and driver are correctly installed, run:

```bash
npu-smi info
```
The easiest way to prepare your software environment is to use a CANN image directly. We recommend the Ascend NPU community vllm-ascend pre-built docker image as the base image for cache-dit. CANN images can be found on the Ascend official community website: here. The CANN pre-built image includes NNAL (Ascend Neural Network Acceleration Library), which provides libatb.so for advanced tensor operations, so no additional installation is required when using it.

```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the pre-built image
export IMAGE=quay.io/ascend/cann:|cann_image_tag|
docker run --rm \
--name cache-dit-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

If installing torch with the pip commands below fails, you can get the torch-2.8.0*.whl file via Link and install it manually.

```bash
# torch: aarch64
pip3 install torch==2.8.0
# torch: x86
pip3 install torch==2.8.0+cpu --index-url https://download.pytorch.org/whl/cpu
```

We strongly recommend installing torch_npu by acquiring the torch_npu-2.8.0*.whl file via Link and installing it manually. For more details about Ascend PyTorch Adapter installation, please refer to https://gitcode.com/Ascend/pytorch.

```bash
pip install --no-deps torchvision==0.16.0
pip install einops sentencepiece accelerate
```

We recommend using the pre-built image from the Ascend NPU community (vllm-ascend) as the base image for cache-dit on Ascend NPU. You can simply pull the pre-built image from the image repository and run it with bash. For example:

```bash
# Download pre-built image for Ascend NPU
docker pull quay.io/ascend/vllm-ascend:v0.13.0rc1
# Use the pre-built image for cache-dit
docker run \
--name cache-dit-ascend \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--net=host \
--shm-size=80g \
--privileged=true \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /data:/data \
-itd quay.io/ascend/vllm-ascend:v0.13.0rc1 bash
```

Inside the container, source the CANN environment and configure the NPU runtime:

```bash
# Make sure CANN_path is set to your CANN installation path
# e.g., export CANN_path=/usr/local/Ascend
source $CANN_path/ascend-toolkit/set_env.sh
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Set NPU devices by ASCEND_RT_VISIBLE_DEVICES env
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```

Once this is done, you can start to set up cache-dit.
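Before setting up cache-dit, it is worth verifying that PyTorch can see the NPUs. A quick check, assuming torch and torch_npu are installed:

```python
# Quick sanity check that torch_npu works and NPUs are visible.
import torch
import torch_npu  # noqa: F401  (patches torch with the "npu" device)

print(torch.npu.is_available())   # expected: True
print(torch.npu.device_count())   # devices visible via ASCEND_RT_VISIBLE_DEVICES
```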
You can install the stable release of cache-dit from PyPI:

```bash
pip3 install -U cache-dit
```

Or you can install the latest development version from GitHub:

```bash
pip3 install git+https://github.com/vipshop/cache-dit.git
```

Please also install the latest main branch of diffusers for context parallelism:

```bash
pip3 install git+https://github.com/huggingface/diffusers.git # or >= 0.36.0
```

After the environment configuration is complete, refer to the Quick Examples, Ascend NPU Benchmark, and Ascend NPU Supported Matrix for more details. To run the examples, install the following dependencies:

```bash
pip3 install opencv-python-headless einops imageio-ffmpeg ftfy
pip3 install git+https://github.com/huggingface/diffusers.git # latest or >= 0.36.0
pip3 install git+https://github.com/vipshop/cache-dit.git # latest
```

The easiest way to enable hybrid cache acceleration for DiTs with cache-dit is to start with single-NPU inference. For example:

```bash
# use the default model path, e.g., "black-forest-labs/FLUX.1-dev"
python3 -m cache_dit.generate flux --attn _native_npu
python3 -m cache_dit.generate qwen_image --attn _native_npu
python3 -m cache_dit.generate flux --cache --attn _native_npu
python3 -m cache_dit.generate qwen_image --cache --attn _native_npu
```

cache-dit is designed to work with 🔥 Context Parallelism and 🔥 Tensor Parallelism. For example:

```bash
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate flux --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate zimage --parallel ulysses --cache --attn _native_npu
torchrun --nproc_per_node=4 -m cache_dit.generate qwen_image --parallel ulysses --cache --attn _native_npu
```

MindIE-SD is an open-source acceleration framework for diffusion models on Ascend NPU. By providing a custom MindieSDBackend for torch.compile, it enables automatic operator fusion and optimization for better performance on Ascend hardware. For detailed documentation and examples, visit MindIE-SD. To build and install MindIE-SD from source:

```bash
git clone --branch master --single-branch https://gitcode.com/Ascend/MindIE-SD.git
cd MindIE-SD
pip install wheel
python3 setup.py bdist_wheel
pip install ./dist/*.whl --force-reinstall
```

Then run generation with compilation enabled:

```bash
python3 generate.py flux --attn _native_npu --compile
```

The performance of MindIE-SD compilation is as follows (device: 910B3):
| Model | Batch Size | Resolution | Compile | E2E Time (s) |
|---|---|---|---|---|
| flux.1-dev | 1 | 1024x1024 | ❌ | 14.09 |
| flux.1-dev | 1 | 1024x1024 | ✅ | 12.85 |
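For reference, here is a rough sketch of how a custom torch.compile backend like this is typically wired into a pipeline. The mindiesd import path and MindieSDBackend usage below are assumptions for illustration, not confirmed API; consult the MindIE-SD documentation for the exact usage.

```python
# Sketch only: wiring a custom torch.compile backend into a Diffusers pipeline.
# The import path and backend usage are ASSUMPTIONS for illustration; consult
# the MindIE-SD documentation for the actual API.
import torch
import torch_npu  # noqa: F401
from diffusers import FluxPipeline

from mindiesd import MindieSDBackend  # hypothetical import path

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("npu")

# torch.compile accepts a custom backend callable; MindIE-SD's backend fuses
# and optimizes operators in the captured graph for Ascend hardware.
pipe.transformer = torch.compile(pipe.transformer, backend=MindieSDBackend())
```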