Skip to content

frane/burn-npu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

burn-npu

Early development. Apple backend tested and working. Intel and Qualcomm implemented but need hardware validation. Contributions welcome.

NPU backend for Burn. Drop-in replacement for burn-wgpu or burn-ndarray that runs inference on hardware Neural Processing Units.

use burn::tensor::Tensor;
use burn_npu::{NpuBurnBackend, NpuBurnDevice};

type B = NpuBurnBackend;  // swap this line — that's it

let device = NpuBurnDevice::Default;
let a = Tensor::<B, 2>::from_floats([[1.0, 2.0], [3.0, 4.0]], &device);
let b = Tensor::<B, 2>::from_floats([[5.0, 6.0], [7.0, 8.0]], &device);
let c = a.matmul(b);

Any Burn model works. No code changes needed. Just change the backend type.

What NPUs can and can't do

NPUs are inference-only accelerators. They are not programmable like GPUs — you can't write custom kernels. Each vendor provides their own API (Apple MLTensor, Intel OpenVINO, Qualcomm QNN), and there is no universal standard like Vulkan or WebGPU.

burn-npu works by wrapping each vendor's API behind Burn's Backend trait. This means:

  • Works: matrix multiply, elementwise ops, reductions, attention, feedforward — the ops that make up transformer inference
  • Doesn't work: training (no autograd), custom kernels, ops the vendor API doesn't support
  • Fragmented: each platform is a separate implementation, not a single portable backend

For training, use burn-wgpu (Metal/Vulkan GPU) or burn-cuda (NVIDIA).

Benchmark

GPT-2 124M forward pass, seq=32, FP32, Apple M2 Pro.

Backend Latency Throughput
burn-npu (Apple NPU) 29 ms 34.8 tok/s
burn-wgpu (Metal GPU) 37 ms 27.1 tok/s
burn-ndarray (CPU) 107 ms 9.4 tok/s
cargo run --release --example bench --features apple

Installation

[dependencies]
burn-npu = { version = "0.3", features = ["apple"] }
Feature Hardware Status Requires
apple Apple Neural Engine (M1/M2/M3/M4) tested, working macOS 15+, Xcode
intel Intel Core Ultra NPU implemented, needs hardware validation OpenVINO runtime
qualcomm Qualcomm Hexagon (Snapdragon) implemented, needs hardware + QNN SDK QNN SDK

Enable one feature at a time. Without any feature, falls back to burn-ndarray (CPU).

How It Works

Each platform has a native tensor type that stays on the NPU between operations. No data copies between ops — only into_data() reads back to CPU.

Platform Tensor type Dispatch
Apple MLTensor handle ANE / GPU / CPU via Core ML
Intel OpenVINO tensor NPU / GPU / CPU via OpenVINO
Qualcomm QNN tensor (planned) CPU fallback until QNN SDK integrated

On Apple, 37 float ops run natively on the NPU. On Intel, matmul is NPU-accelerated via OpenVINO. Remaining ops and int/bool tensors delegate to burn-ndarray.

Background

This project was motivated by this discussion in the Burn repo, where NPU support was considered difficult because NPUs "are often not programmable chips" with no common API across vendors. That's true — there is no universal NPU API. burn-npu takes a different approach: per-vendor integration using each vendor's own tensor/inference API (MLTensor, OpenVINO, QNN), wrapped behind Burn's Backend trait.

Contributing

This is an early project. Help is welcome:

  • Intel hardware testing — run cargo test --features intel on a Core Ultra machine and open an issue with results
  • Qualcomm hardware testing — if you have a Snapdragon X Elite device or Rubik Pi 3, help integrate the QNN SDK
  • More NPU-accelerated ops — move remaining float ops from burn-ndarray delegation to native NPU execution on Intel/Qualcomm
  • Bug reports — if a Burn model doesn't work on NpuBurnBackend, open an issue

License

MIT OR Apache-2.0

About

NPU backend for the Burn deep learning framework

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors