Add fused SwiGLU programming example for NPU2 #1511
Open
erwei-xilinx wants to merge 4 commits into
Pull request overview
Adds a new fused SwiGLU programming example targeting NPU2 (AIE2P), including an AIR/Python module, AIE tile kernels, and build/test harnesses.
Changes:
- Introduces `swiglu_fused.py`, implementing a single-launch fused SwiGLU pipeline using shared channels and a combined K-loop BD chain.
- Adds AIE2P kernels in `swiglu_fused.cc` for matmul, SiLU, and elementwise multiply.
- Provides a Makefile + LIT test, and a C++ ELF profiling harness.
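As a point of reference for the three kernel stages listed above (matmul, SiLU, elementwise multiply), here is a minimal pure-Python sketch of the math the fused pipeline is meant to reproduce. It is not taken from `swiglu_fused.py`; the names `silu`, `matmul`, and `swiglu_reference` are hypothetical, and it uses plain floats rather than bf16.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0.0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def matmul(a, b):
    """Naive matmul of a[M][K] by b[K][N] using nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def swiglu_reference(x, w_gate, w_up):
    """Hypothetical reference: SiLU(x @ w_gate) * (x @ w_up), elementwise."""
    gate = matmul(x, w_gate)  # stage 1: gate GEMM
    up = matmul(x, w_up)      # stage 2: up GEMM
    # stage 3: SiLU on the gate result, then elementwise multiply with up
    return [
        [silu(g) * u for g, u in zip(g_row, u_row)]
        for g_row, u_row in zip(gate, up)
    ]
```

A host-side check could compare the device output against such a reference within a bf16-appropriate tolerance; the actual checking logic in the PR may differ.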
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| programming_examples/ffn_swiglu/fused/swiglu_fused.py | Builds the fused SwiGLU AIR module (channels, segment/herd schedule, reference checking, and profiling mode). |
| programming_examples/ffn_swiglu/fused/swiglu_fused.cc | Implements the linked AIE2P compute kernels used by the generated MLIR. |
| programming_examples/ffn_swiglu/fused/Makefile | Adds build/run targets for compiling kernels and running the example in different modes/configs. |
| programming_examples/ffn_swiglu/fused/run_makefile_peano.lit | Adds a LIT regression invoking `make run4x4` and checking for PASS. |
| programming_examples/ffn_swiglu/fused/test.cpp | Adds an ELF/XRT profiling harness for measuring end-to-end latency/GFLOPS. |
9f60f5b to 7493226
Add a fused SwiGLU implementation:
  output = SiLU(x @ W_gate) * (x @ W_up)
running as a single launch on NPU2 (AIE2P) with 8x8x8 bf16 mmul.

Architecture:
- Single launch with 6 herds named "herd_0" chained into one while_true loop body (gate GEMM → up GEMM → SiLU+mul → writeback)
- ONE B_L3L2 channel carries both gate and up weight data via FIFO ordering, leaving 2 S2MM channels at the compute tile (within the hardware limit)
- ONE segment K-loop of 2*k_tiles iterations creates a single memtile BD chain, avoiding BD chain interleaving in while_true mode
- 4 function arguments: x[M,K], w_gate[K,N], w_up[K,N], out[M,N] (no host-side weight preprocessing required)
- Supports 4x4 (16-tile) and 8x4 (32-tile, full device) herd configs
- Includes compile-only, compile-and-run, and profile modes

Files:
- swiglu_fused.py: Python design with explicit L2→L1 channels
- swiglu_fused.cc: AIE2P kernel (zero, matmul, silu, elemwise_mul)
- Makefile: build targets (run, compile-only, profile, etc.)
- test.cpp: C++ ELF-based test harness
- run_makefile_peano.lit: LIT test for CI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
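The shared-channel idea in this commit message (one channel carrying gate and up weight tiles, one K-loop of `2*k_tiles` iterations) can be sketched as a plain FIFO. This is an illustration only, not the AIR channel API: the function names are hypothetical, and the ordering (all gate tiles first, then all up tiles) is an assumption inferred from the gate GEMM → up GEMM phase order.

```python
from collections import deque


def weight_tile_stream(k_tiles: int):
    """Yield (matrix_name, k_index) in the order tiles enter the shared channel.

    Assumption: the first k_tiles iterations of the combined loop carry
    w_gate tiles and the remaining k_tiles iterations carry w_up tiles.
    """
    for i in range(2 * k_tiles):  # one loop of 2*k_tiles iterations
        yield ("w_gate" if i < k_tiles else "w_up", i % k_tiles)


def consume(k_tiles: int):
    """Drain the FIFO the way a consumer with two sequential phases would."""
    channel = deque(weight_tile_stream(k_tiles))  # single FIFO channel
    gate_phase = [channel.popleft() for _ in range(k_tiles)]  # gate GEMM reads
    up_phase = [channel.popleft() for _ in range(k_tiles)]    # up GEMM reads
    return gate_phase, up_phase
```

Because the consumer pops in the same order the producer pushes, neither side needs a tag to tell gate tiles from up tiles; position in the stream is enough, which is what lets a single channel stand in for two.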
- Add missing <cstring> include for std::memset in test.cpp
- Update test.cpp to 4-arg signature (x, w_gate, w_up, out) matching the Python module's separate weight arguments
- Pass PEANO_INSTALL_DIR and OUTPUT_FORMAT=elf in LIT test RUN line, matching the convention used by decode/ and prefill/ examples
- Fix CHECK pattern to match "PASS!" (with exclamation mark)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change the LIT test from run4x4 (4 cols x 4 rows = 16 tiles) to run with HERD_M=8 HERD_N=4 (8 cols x 4 rows = 32 tiles) to exercise the full NPU2 compute array. Update REQUIRES to ryzen_ai_npu2 since 8 columns are only available on NPU2 (Strix).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove omit_pingpong="all": the ping-pong transform handles the multi-phase 6-herd structure correctly and gives a significant speedup by double-buffering L1 inputs to overlap DMA with compute.

Performance improvement (4x4 herd, 2048x2048x2048):
  omit_pingpong: 1,102 GFLOPS → ping-pong enabled: 1,339 GFLOPS (+21%)

Performance improvement (8x4 herd, 2048x2048x2048):
  omit_pingpong: 1,427 GFLOPS → ping-pong enabled: 1,561 GFLOPS (+9%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
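The quoted speedups can be checked with a few lines of arithmetic. The sketch below is illustrative: the helper names are hypothetical, and the FLOP count assumes that only the two GEMMs are counted (2*M*K*N each), which may not match how `test.cpp` computes GFLOPS.

```python
def speedup_percent(before_gflops: float, after_gflops: float) -> float:
    """Relative throughput gain, in percent."""
    return (after_gflops / before_gflops - 1.0) * 100.0


def swiglu_gemm_flops(m: int, k: int, n: int) -> int:
    """FLOPs for the gate and up GEMMs only.

    Assumption: SiLU and the elementwise multiply are ignored, so the
    total is two GEMMs at 2*M*K*N FLOPs each.
    """
    return 2 * (2 * m * k * n)


def latency_ms(flops: int, gflops: float) -> float:
    """Latency implied by a FLOP count and a GFLOPS throughput figure."""
    return flops / (gflops * 1e9) * 1e3


if __name__ == "__main__":
    for label, before, after in [("4x4", 1102.0, 1339.0), ("8x4", 1427.0, 1561.0)]:
        print(f"{label}: +{speedup_percent(before, after):.1f}%")
```

With the figures from the commit message this gives roughly 21.5% for the 4x4 herd and 9.4% for the 8x4 herd, consistent with the quoted +21% and +9%.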
Summary
- Fused SwiGLU, `output = SiLU(x @ W_gate) * (x @ W_up)`, as a single launch on NPU2 (AIE2P)
- 6 herds named `"herd_0"` chained into one `while_true` loop body: gate GEMM → up GEMM → SiLU+mul → writeback
- ONE `B_L3L2` channel carries both gate and up weight data via FIFO ordering, staying within the 2 S2MM hardware limit per compute tile
- 4 function arguments (`x[M,K]`, `w_gate[K,N]`, `w_up[K,N]`, `out[M,N]`); no host-side weight preprocessing

Architecture
- ONE segment K-loop of `2*k_tiles` iterations creates one memtile BD chain, avoiding BD chain interleaving in `while_true` mode
- `A_L2L1[herd_m,1]` broadcast `[herd_m,herd_n]`, `B_L2L1[1,herd_n]` broadcast `[herd_m,herd_n]`
- `dma_memcpy_nd` for C writeback (avoids memtile DMA channel exhaustion)

Performance (NPU2 hardware)

Dependencies

Test plan
- `make run4x4` with `COMPILE_MODE=compile-and-run`
- `make profile`

🤖 Generated with Claude Code