Add fused SwiGLU programming example for NPU2 by erwei-xilinx · Pull Request #1511 · Xilinx/mlir-air

erwei-xilinx · 2026-04-08T05:18:50Z

Summary

Add fused SwiGLU example: output = SiLU(x @ W_gate) * (x @ W_up) as a single launch on NPU2 (AIE2P)
Uses 6 herds named "herd_0" chained into one while_true loop body: gate GEMM → up GEMM → SiLU+mul → writeback
ONE B_L3L2 channel carries both gate and up weight data via FIFO ordering, staying within the hardware limit of 2 S2MM channels per compute tile
Takes 4 separate arguments (x[M,K], w_gate[K,N], w_up[K,N], out[M,N]) — no host-side weight preprocessing
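
As a point of reference for the formula above, here is a minimal host-side sketch of the computation the fused launch performs. It is plain Python on nested lists, not the bf16 path used on the device, and the helper names (`matmul`, `silu`, `swiglu_reference`) are illustrative rather than taken from swiglu_fused.py.

```python
import math

def matmul(a, b):
    """Multiply row-major nested-list matrices a[M][K] and b[K][N]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def silu(v):
    """SiLU (swish): v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)

def swiglu_reference(x, w_gate, w_up):
    """output = SiLU(x @ W_gate) * (x @ W_up), with an elementwise product."""
    gate = matmul(x, w_gate)   # x[M,K] @ w_gate[K,N]
    up = matmul(x, w_up)       # x[M,K] @ w_up[K,N]
    return [[silu(g) * u for g, u in zip(gate_row, up_row)]
            for gate_row, up_row in zip(gate, up)]

# Tiny 2x2 example with identity-like weights so the result is easy to check.
x = [[1.0, 2.0], [0.0, -1.0]]
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 2.0]]
out = swiglu_reference(x, w_gate, w_up)
```

The two weight matrices are passed separately, mirroring the 4-argument signature; nothing is concatenated or interleaved on the host.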

Architecture

Single segment K-loop of 2*k_tiles iterations creates one memtile BD chain, avoiding BD chain interleaving in while_true mode
Explicit shared L2→L1 channels: A_L2L1[herd_m,1] broadcast to [herd_m,herd_n], B_L2L1[1,herd_n] broadcast to [herd_m,herd_n]
dma_memcpy_nd for C writeback (avoids memtile DMA channel exhaustion)
8x8x8 bf16 mmul with BFP16 emulation on AIE2P
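
The single-channel FIFO idea can be illustrated with a toy model. The sketch below is not AIR code: it uses a `deque` as a stand-in for the B_L3L2 channel, and it assumes the 2*k_tiles loop sends all gate tiles first and then all up tiles, which matches the gate GEMM → up GEMM order but is not spelled out in this description. The function names are hypothetical.

```python
from collections import deque

def weight_tile_order(k_tiles):
    """Producer: one loop of 2*k_tiles iterations feeding a single channel.

    Assumption: iterations [0, k_tiles) carry gate tiles and
    [k_tiles, 2*k_tiles) carry up tiles.
    """
    return [("gate" if i < k_tiles else "up", i % k_tiles)
            for i in range(2 * k_tiles)]

def consume(channel, k_tiles):
    """Consumer: the gate GEMM pops k_tiles tiles, then the up GEMM pops k_tiles.

    No tag is needed on the data itself; FIFO order alone tells each
    phase which weight matrix a tile belongs to.
    """
    gate = [channel.popleft() for _ in range(k_tiles)]
    up = [channel.popleft() for _ in range(k_tiles)]
    return gate, up

k_tiles = 4
b_l3l2 = deque(weight_tile_order(k_tiles))   # stand-in for the B_L3L2 channel
gate_tiles, up_tiles = consume(b_l3l2, k_tiles)
```

Because producer and consumer agree on the order, one channel serves both weight streams, which is how the design keeps each compute tile at two incoming (S2MM) channels: one for A and one for B.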

Performance (NPU2 hardware)

Config	Size	Tiles	Peak GFLOPS
4x4 herd	512x512x512	16	691
4x4 herd	2048x2048x2048	16	1,059
8x4 herd	512x512x512	32	691
8x4 herd	2048x2048x2048	32	1,427
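
To read the table in terms of scaling, the short sketch below computes the speedup of the 8x4 herd over the 4x4 herd and divides it by the tile ratio. It uses only the numbers in the table; the `scaling` helper and the "scaling efficiency" metric are illustrative additions, not something the PR reports.

```python
# (herd config, problem size M=K=N, tiles, peak GFLOPS), copied from the table above
results = [
    ("4x4", 512, 16, 691),
    ("4x4", 2048, 16, 1059),
    ("8x4", 512, 32, 691),
    ("8x4", 2048, 32, 1427),
]

def scaling(results, size, base="4x4", scaled="8x4"):
    """Return (speedup, efficiency) of `scaled` over `base` at one problem size.

    efficiency = speedup / (tile ratio); 1.0 would mean perfect linear scaling.
    """
    by_cfg = {cfg: (tiles, gflops)
              for cfg, s, tiles, gflops in results if s == size}
    base_tiles, base_gflops = by_cfg[base]
    big_tiles, big_gflops = by_cfg[scaled]
    speedup = big_gflops / base_gflops
    efficiency = speedup / (big_tiles / base_tiles)
    return speedup, efficiency

for size in (512, 2048):
    speedup, efficiency = scaling(results, size)
    print(f"{size}^3: {speedup:.2f}x speedup, {efficiency:.0%} scaling efficiency")
```

At 512x512x512 the two configurations report the same peak, so doubling the tile count gives no gain at that size; the larger herd only pays off on the 2048x2048x2048 problem.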

Dependencies

PR #1508: Prevent air-fuse-channels from merging channels in sibling loops (air-fuse-channels sibling loop fix)
PR #1509: Add traceDeps for HierarchyInterface in air-dependency (air-dependency traceDeps for HierarchyInterface)
PR #1510: Fix BD repeat dim double-counting when size-1 dim creates 3D layout (BD repeat dim double-counting fix)

Test plan

Correctness verified on NPU2 hardware at 512x512x512 and 2048x2048x2048
Both 4x4 (16-tile) and 8x4 (32-tile, full device) herd configurations pass
LIT test: make run4x4 with COMPILE_MODE=compile-and-run
Profile mode: make profile

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds a new fused SwiGLU programming example targeting NPU2 (AIE2P), including an AIR/Python module, AIE tile kernels, and build/test harnesses.

Changes:

Introduces swiglu_fused.py implementing a single-launch fused SwiGLU pipeline using shared channels and a combined K-loop BD chain.
Adds AIE2P kernels in swiglu_fused.cc for matmul, SiLU, and elementwise multiply.
Provides a Makefile + LIT test, and a C++ ELF profiling harness.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File	Description
programming_examples/ffn_swiglu/fused/swiglu_fused.py	Builds the fused SwiGLU AIR module (channels, segment/herd schedule, reference checking, and profiling mode).
programming_examples/ffn_swiglu/fused/swiglu_fused.cc	Implements the linked AIE2P compute kernels used by the generated MLIR.
programming_examples/ffn_swiglu/fused/Makefile	Adds build/run targets for compiling kernels and running the example in different modes/configs.
programming_examples/ffn_swiglu/fused/run_makefile_peano.lit	Adds a LIT regression invoking `make run4x4` and checking for PASS.
programming_examples/ffn_swiglu/fused/test.cpp	Adds an ELF/XRT profiling harness for measuring end-to-end latency/GFLOPS.

Add a fused SwiGLU implementation: output = SiLU(x @ W_gate) * (x @ W_up) running as a single launch on NPU2 (AIE2P) with 8x8x8 bf16 mmul.

Architecture:
- Single launch with 6 herds named "herd_0" chained into one while_true loop body (gate GEMM → up GEMM → SiLU+mul → writeback)
- ONE B_L3L2 channel carries both gate and up weight data via FIFO ordering: 2 S2MM channels at compute tile (within hardware limit)
- ONE segment K-loop of 2*k_tiles iterations creates a single memtile BD chain, avoiding BD chain interleaving in while_true mode
- 4 function arguments: x[M,K], w_gate[K,N], w_up[K,N], out[M,N] (no host-side weight preprocessing required)
- Supports 4x4 (16-tile) and 8x4 (32-tile, full device) herd configs
- Includes compile-only, compile-and-run, and profile modes

Files:
- swiglu_fused.py: Python design with explicit L2→L1 channels
- swiglu_fused.cc: AIE2P kernel (zero, matmul, silu, elemwise_mul)
- Makefile: build targets (run, compile-only, profile, etc.)
- test.cpp: C++ ELF-based test harness
- run_makefile_peano.lit: LIT test for CI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add missing <cstring> include for std::memset in test.cpp
- Update test.cpp to 4-arg signature (x, w_gate, w_up, out) matching the Python module's separate weight arguments
- Pass PEANO_INSTALL_DIR and OUTPUT_FORMAT=elf in LIT test RUN line, matching the convention used by decode/ and prefill/ examples
- Fix CHECK pattern to match "PASS!" (with exclamation mark)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change the LIT test from run4x4 (4 cols x 4 rows = 16 tiles) to run with HERD_M=8 HERD_N=4 (8 cols x 4 rows = 32 tiles) to exercise the full NPU2 compute array. Update REQUIRES to ryzen_ai_npu2 since 8 columns are only available on NPU2 (Strix).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove omit_pingpong="all": the ping-pong transform handles the multi-phase 6-herd structure correctly and gives a significant speedup by double-buffering L1 inputs to overlap DMA with compute.

Performance improvement (4x4 herd, 2048x2048x2048):
  omit_pingpong: 1,102 GFLOPS → ping-pong enabled: 1,339 GFLOPS (+21%)

Performance improvement (8x4 herd, 2048x2048x2048):
  omit_pingpong: 1,427 GFLOPS → ping-pong enabled: 1,561 GFLOPS (+9%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

erwei-xilinx requested a review from jgmelber as a code owner April 8, 2026 05:18

Copilot AI review requested due to automatic review settings April 8, 2026 05:18

Copilot started reviewing on behalf of erwei-xilinx April 8, 2026 05:20

Copilot AI reviewed Apr 8, 2026

Comment thread programming_examples/ffn_swiglu/fused/test.cpp

Comment thread programming_examples/ffn_swiglu/fused/test.cpp Outdated

Comment thread programming_examples/ffn_swiglu/fused/test.cpp

Comment thread programming_examples/ffn_swiglu/fused/run_makefile_peano.lit Outdated

erwei-xilinx force-pushed the erwei/fused-swiglu-example branch from 9f60f5b to 7493226 Compare April 8, 2026 17:42

erwei-xilinx and others added 4 commits April 8, 2026 10:42

erwei-xilinx wants to merge 4 commits into Xilinx:main from erwei-xilinx:erwei/fused-swiglu-example
