burn-npu

Early development. Apple backend tested and working. Intel and Qualcomm implemented but need hardware validation. Contributions welcome.

NPU backend for Burn. Drop-in replacement for burn-wgpu or burn-ndarray that runs inference on hardware Neural Processing Units.

use burn::tensor::Tensor;
use burn_npu::{NpuBurnBackend, NpuBurnDevice};

type B = NpuBurnBackend;  // swap this line — that's it

let device = NpuBurnDevice::Default;
let a = Tensor::<B, 2>::from_floats([[1.0, 2.0], [3.0, 4.0]], &device);
let b = Tensor::<B, 2>::from_floats([[5.0, 6.0], [7.0, 8.0]], &device);
let c = a.matmul(b);

Any Burn model works. No code changes needed. Just change the backend type.

What NPUs can and can't do

NPUs are inference-only accelerators. They are not programmable like GPUs — you can't write custom kernels. Each vendor provides their own API (Apple MLTensor, Intel OpenVINO, Qualcomm QNN), and there is no universal standard like Vulkan or WebGPU.

burn-npu works by wrapping each vendor's API behind Burn's Backend trait. This means:

Works: matrix multiply, elementwise ops, reductions, attention, feedforward — the ops that make up transformer inference
Doesn't work: training (no autograd), custom kernels, ops the vendor API doesn't support
Fragmented: each platform is a separate implementation, not a single portable backend

For training, use burn-wgpu (Metal/Vulkan GPU) or burn-cuda (NVIDIA).

Benchmark

GPT-2 124M forward pass, seq=32, FP32, Apple M2 Pro.

Backend	Latency	Throughput
burn-npu (Apple NPU)	29 ms	34.8 tok/s
burn-wgpu (Metal GPU)	37 ms	27.1 tok/s
burn-ndarray (CPU)	107 ms	9.4 tok/s

cargo run --release --example bench --features apple

Installation

[dependencies]
burn-npu = { version = "0.3", features = ["apple"] }

Feature	Hardware	Status	Requires
`apple`	Apple Neural Engine (M1/M2/M3/M4)	tested, working	macOS 15+, Xcode
`intel`	Intel Core Ultra NPU	implemented, needs hardware validation	OpenVINO runtime
`qualcomm`	Qualcomm Hexagon (Snapdragon)	implemented, needs hardware + QNN SDK	QNN SDK

Enable one feature at a time. Without any feature, falls back to burn-ndarray (CPU).

How It Works

Each platform has a native tensor type that stays on the NPU between operations. No data copies between ops — only into_data() reads back to CPU.

Platform	Tensor type	Dispatch
Apple	MLTensor handle	ANE / GPU / CPU via Core ML
Intel	OpenVINO tensor	NPU / GPU / CPU via OpenVINO
Qualcomm	QNN tensor (planned)	CPU fallback until QNN SDK integrated

On Apple, 37 float ops run natively on the NPU. On Intel, matmul is NPU-accelerated via OpenVINO. Remaining ops and int/bool tensors delegate to burn-ndarray.

Background

This project was motivated by this discussion in the Burn repo, where NPU support was considered difficult because NPUs "are often not programmable chips" with no common API across vendors. That's true — there is no universal NPU API. burn-npu takes a different approach: per-vendor integration using each vendor's own tensor/inference API (MLTensor, OpenVINO, QNN), wrapped behind Burn's Backend trait.

Contributing

This is an early project. Help is welcome:

Intel hardware testing — run cargo test --features intel on a Core Ultra machine and open an issue with results
Qualcomm hardware testing — if you have a Snapdragon X Elite device or Rubik Pi 3, help integrate the QNN SDK
More NPU-accelerated ops — move remaining float ops from burn-ndarray delegation to native NPU execution on Intel/Qualcomm
Bug reports — if a Burn model doesn't work on NpuBurnBackend, open an issue

License

MIT OR Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
examples		examples
npu-sys/Sources		npu-sys/Sources
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
build.rs		build.rs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

burn-npu

What NPUs can and can't do

Benchmark

Installation

How It Works

Background

Contributing

License

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

burn-npu

What NPUs can and can't do

Benchmark

Installation

How It Works

Background

Contributing

License

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages