
Add fused SwiGLU programming example for NPU2#1511

Open
erwei-xilinx wants to merge 4 commits into Xilinx:main from erwei-xilinx:erwei/fused-swiglu-example

Conversation

@erwei-xilinx
Collaborator

Summary

  • Add fused SwiGLU example: output = SiLU(x @ W_gate) * (x @ W_up) as a single launch on NPU2 (AIE2P)
  • Uses 6 herds named "herd_0" chained into one while_true loop body: gate GEMM → up GEMM → SiLU+mul → writeback
  • ONE B_L3L2 channel carries both gate and up weight data via FIFO ordering, staying within the hardware limit of 2 S2MM channels per compute tile
  • Takes 4 separate arguments (x[M,K], w_gate[K,N], w_up[K,N], out[M,N]) — no host-side weight preprocessing
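
For readers who want to check the math independently of the NPU, the formula in the summary can be written as a small host-side reference. This is a minimal sketch in plain Python, assuming float arithmetic on nested lists rather than bf16; the helper names (`silu`, `matmul`, `swiglu_reference`) are illustrative and are not taken from swiglu_fused.py.

```python
import math


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matmul(a, b):
    """Naive matmul on nested lists: a is [M][K], b is [K][N]."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


def swiglu_reference(x, w_gate, w_up):
    """output = SiLU(x @ W_gate) * (x @ W_up), elementwise product."""
    gate = matmul(x, w_gate)   # [M][N]
    up = matmul(x, w_up)       # [M][N]
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


# Same 4-argument shape convention as the PR: x[M,K], w_gate[K,N], w_up[K,N] -> out[M,N]
x = [[1.0, 2.0], [0.0, -1.0]]
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 2.0]]
out = swiglu_reference(x, w_gate, w_up)
```

With identity-like weights the gate equals `x` and the up projection equals `2 * x`, so each output element is `silu(x) * 2x`, which makes the result easy to verify by hand.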

Architecture

  • Single segment K-loop of 2*k_tiles iterations creates one memtile BD chain, avoiding BD chain interleaving in while_true mode
  • Explicit shared L2→L1 channels: A_L2L1[herd_m,1] is broadcast to [herd_m,herd_n], and B_L2L1[1,herd_n] is broadcast to [herd_m,herd_n]
  • dma_memcpy_nd for C writeback (avoids memtile DMA channel exhaustion)
  • 8x8x8 bf16 mmul with BFP16 emulation on AIE2P
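
The broadcast pattern of the shared L2→L1 channels can be illustrated with a small index mapping. This is a sketch of one reading of the bullet above, assuming tile (i, j) receives A from channel index (i, 0) and B from channel index (0, j); the function names are hypothetical and do not come from the AIR Python API.

```python
def l2l1_sources(herd_m, herd_n):
    """Map each compute tile (i, j) to the channel indices it is assumed to read."""
    return {
        (i, j): {"A_L2L1": (i, 0), "B_L2L1": (0, j)}
        for i in range(herd_m)
        for j in range(herd_n)
    }


def stream_counts(herd_m, herd_n):
    """Compare shared broadcast streams with a hypothetical per-tile layout."""
    shared = herd_m + herd_n        # one A stream per herd row, one B stream per herd column
    per_tile = 2 * herd_m * herd_n  # private A and B stream for every tile (not what the PR does)
    return shared, per_tile


sources = l2l1_sources(4, 4)
shared_4x4, private_4x4 = stream_counts(4, 4)
shared_8x4, private_8x4 = stream_counts(8, 4)
```

Under this reading every tile in a herd row shares one A stream and every tile in a herd column shares one B stream, so the number of L2→L1 streams grows with `herd_m + herd_n` instead of with the tile count.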

Performance (NPU2 hardware)

| Config | Size | Tiles | Peak GFLOPS |
|---|---|---|---|
| 4x4 herd | 512x512x512 | 16 | 691 |
| 4x4 herd | 2048x2048x2048 | 16 | 1,059 |
| 8x4 herd | 512x512x512 | 32 | 691 |
| 8x4 herd | 2048x2048x2048 | 32 | 1,427 |
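
The PR does not state how the profiling harness counts FLOPs, so the following is only a sketch of one common convention: count the two GEMMs at 2*M*K*N each and ignore the elementwise SiLU and multiply. Under that assumption, a GFLOPS figure can be converted to an implied latency and back. The function names are hypothetical.

```python
def swiglu_gemm_flops(m, k, n):
    """Assumed FLOP count: two GEMMs (gate and up), 2*M*K*N each."""
    return 2 * (2 * m * k * n)


def gflops(m, k, n, seconds):
    """Throughput in GFLOPS for one fused launch that took `seconds`."""
    return swiglu_gemm_flops(m, k, n) / seconds / 1e9


def implied_latency_ms(m, k, n, reported_gflops):
    """Latency in milliseconds implied by a reported GFLOPS figure."""
    return swiglu_gemm_flops(m, k, n) / (reported_gflops * 1e9) * 1e3


# Latency implied by the 8x4, 2048x2048x2048 row, under the assumed FLOP count.
latency_ms = implied_latency_ms(2048, 2048, 2048, 1427)
```

If the harness uses a different count (for example, including the activation), the implied latency changes proportionally.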

Dependencies

Test plan

  • Correctness verified on NPU2 hardware at 512x512x512 and 2048x2048x2048
  • Both 4x4 (16-tile) and 8x4 (32-tile, full device) herd configurations pass
  • LIT test: make run4x4 with COMPILE_MODE=compile-and-run
  • Profile mode: make profile

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx requested a review from jgmelber as a code owner April 8, 2026 05:18
Copilot AI review requested due to automatic review settings April 8, 2026 05:18
Contributor

Copilot AI left a comment


Pull request overview

Adds a new fused SwiGLU programming example targeting NPU2 (AIE2P), including an AIR/Python module, AIE tile kernels, and build/test harnesses.

Changes:

  • Introduces swiglu_fused.py implementing a single-launch fused SwiGLU pipeline using shared channels and a combined K-loop BD chain.
  • Adds AIE2P kernels in swiglu_fused.cc for matmul, SiLU, and elementwise multiply.
  • Provides a Makefile + LIT test, and a C++ ELF profiling harness.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| programming_examples/ffn_swiglu/fused/swiglu_fused.py | Builds the fused SwiGLU AIR module (channels, segment/herd schedule, reference checking, and profiling mode). |
| programming_examples/ffn_swiglu/fused/swiglu_fused.cc | Implements the linked AIE2P compute kernels used by the generated MLIR. |
| programming_examples/ffn_swiglu/fused/Makefile | Adds build/run targets for compiling kernels and running the example in different modes/configs. |
| programming_examples/ffn_swiglu/fused/run_makefile_peano.lit | Adds a LIT regression invoking make run4x4 and checking for PASS. |
| programming_examples/ffn_swiglu/fused/test.cpp | Adds an ELF/XRT profiling harness for measuring end-to-end latency/GFLOPS. |


Comment thread programming_examples/ffn_swiglu/fused/test.cpp
Comment thread programming_examples/ffn_swiglu/fused/test.cpp Outdated
Comment thread programming_examples/ffn_swiglu/fused/test.cpp
Comment thread programming_examples/ffn_swiglu/fused/run_makefile_peano.lit Outdated
@erwei-xilinx erwei-xilinx force-pushed the erwei/fused-swiglu-example branch from 9f60f5b to 7493226 Compare April 8, 2026 17:42
erwei-xilinx and others added 4 commits April 8, 2026 10:42
Add a fused SwiGLU implementation: output = SiLU(x @ W_gate) * (x @ W_up)
running as a single launch on NPU2 (AIE2P) with 8x8x8 bf16 mmul.

Architecture:
- Single launch with 6 herds named "herd_0" chained into one while_true
  loop body (gate GEMM → up GEMM → SiLU+mul → writeback)
- ONE B_L3L2 channel carries both gate and up weight data via FIFO
  ordering — 2 S2MM channels at compute tile (within hardware limit)
- ONE segment K-loop of 2*k_tiles iterations creates a single memtile
  BD chain, avoiding BD chain interleaving in while_true mode
- 4 function arguments: x[M,K], w_gate[K,N], w_up[K,N], out[M,N]
  (no host-side weight preprocessing required)
- Supports 4x4 (16-tile) and 8x4 (32-tile, full device) herd configs
- Includes compile-only, compile-and-run, and profile modes
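
The single-channel idea in the architecture notes can be mimicked on the host with a FIFO. This is a toy sketch, assuming the channel delivers all gate K-tiles first and all up K-tiles second (the PR says only that FIFO ordering is used and that the gate GEMM precedes the up GEMM); it computes one output element, and the function names are hypothetical.

```python
import math
from collections import deque


def stream_weights(gate_tiles, up_tiles):
    """Producer: one FIFO carries the gate K-tiles, then the up K-tiles."""
    fifo = deque()
    for tile in gate_tiles + up_tiles:  # 2 * k_tiles pushes in total
        fifo.append(tile)
    return fifo


def consume(fifo, x_tiles):
    """Consumer: a single loop of 2 * k_tiles iterations over the one FIFO."""
    k_tiles = len(x_tiles)
    acc = [0.0, 0.0]  # acc[0] accumulates the gate GEMM, acc[1] the up GEMM
    for it in range(2 * k_tiles):
        phase, k = divmod(it, k_tiles)  # phase 0 = gate, phase 1 = up
        w = fifo.popleft()              # arrival order alone decides which weight this is
        acc[phase] += sum(a * b for a, b in zip(x_tiles[k], w))
    gate, up = acc
    return gate / (1.0 + math.exp(-gate)) * up  # SiLU(gate) * up


x_tiles = [[1.0, 2.0], [3.0, 4.0]]     # one row of x, split into 2 K-tiles
gate_tiles = [[1.0, 0.0], [0.0, 1.0]]  # matching K-tiles of one W_gate column
up_tiles = [[0.5, 0.5], [0.5, 0.5]]    # matching K-tiles of one W_up column
fifo = stream_weights(gate_tiles, up_tiles)
result = consume(fifo, x_tiles)
```

Because the consumer derives the phase from the iteration index, no second weight channel or tag is needed; correctness depends entirely on the producer and consumer agreeing on the push order.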

Files:
- swiglu_fused.py: Python design with explicit L2→L1 channels
- swiglu_fused.cc: AIE2P kernel (zero, matmul, silu, elemwise_mul)
- Makefile: build targets (run, compile-only, profile, etc.)
- test.cpp: C++ ELF-based test harness
- run_makefile_peano.lit: LIT test for CI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add missing <cstring> include for std::memset in test.cpp
- Update test.cpp to 4-arg signature (x, w_gate, w_up, out) matching
  the Python module's separate weight arguments
- Pass PEANO_INSTALL_DIR and OUTPUT_FORMAT=elf in LIT test RUN line,
  matching the convention used by decode/ and prefill/ examples
- Fix CHECK pattern to match "PASS!" (with exclamation mark)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the LIT test from run4x4 (4 cols x 4 rows = 16 tiles) to
run with HERD_M=8 HERD_N=4 (8 cols x 4 rows = 32 tiles) to exercise
the full NPU2 compute array. Update REQUIRES to ryzen_ai_npu2 since
8 columns are only available on NPU2 (Strix).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove omit_pingpong="all" — the ping-pong transform handles the
multi-phase 6-herd structure correctly and gives a significant
speedup by double-buffering L1 inputs to overlap DMA with compute.

Performance improvement (4x4 herd, 2048x2048x2048):
  omit_pingpong: 1,102 GFLOPS → ping-pong enabled: 1,339 GFLOPS (+21%)

Performance improvement (8x4 herd, 2048x2048x2048):
  omit_pingpong: 1,427 GFLOPS → ping-pong enabled: 1,561 GFLOPS (+9%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
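
The quoted speedups can be reproduced from the GFLOPS figures in the commit message, and the benefit of double-buffering can be illustrated with a toy timing model. The model is an assumption for illustration only (uniform per-tile DMA and compute times, perfect overlap) and is not derived from the PR's measurements; the function names are hypothetical.

```python
def speedup_pct(before, after):
    """Relative throughput gain in percent."""
    return (after / before - 1.0) * 100.0


def loop_time(n_tiles, dma, compute, ping_pong):
    """Toy model of a tiled loop with per-tile DMA and compute times."""
    if not ping_pong:
        # Single buffer: each tile is fetched, then computed, strictly in sequence.
        return n_tiles * (dma + compute)
    # Double buffer: the fetch of tile i+1 overlaps the compute of tile i.
    return dma + (n_tiles - 1) * max(dma, compute) + compute


gain_4x4 = speedup_pct(1102, 1339)  # 4x4 herd, 2048x2048x2048
gain_8x4 = speedup_pct(1427, 1561)  # 8x4 herd, 2048x2048x2048

serial = loop_time(4, dma=1.0, compute=2.0, ping_pong=False)
overlapped = loop_time(4, dma=1.0, compute=2.0, ping_pong=True)
```

In this model the overlapped loop approaches `n_tiles * max(dma, compute)`, so the gain is largest when DMA and compute times are comparable and shrinks as one of them dominates.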