Add KV cache prefill flash attention example for AIE2P by erwei-xilinx · Pull Request #1500 · Xilinx/mlir-air

erwei-xilinx · 2026-04-06T17:13:02Z

Summary

- Add a new programming example demonstrating fused flash attention with KV cache write-back on AIE2P NPU
- K cache data is written back to DDR during attention computation using L1-to-L3 direct DMA paths, bypassing memtile to avoid channel congestion
- Uses a dedicated staging buffer to prevent DMA race conditions between K receive and write-back, and un-tiling DMA strides to convert the 8×8 blocked L1 layout back to row-major
- Supports GQA (grouped query attention) and causal masking, and includes a C++ test executable for ELF-based profiling
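The un-tiling step mentioned above can be pictured as a strided read over the blocked buffer. The following is a minimal host-side sketch in plain Python, not the DMA configuration from the PR: the helper names `tile_8x8` and `untile_8x8` are hypothetical, and it assumes tiles are stored one after another in row-major tile order with each 8×8 tile itself row-major. The real L1 layout may order tiles differently.

```python
TILE = 8  # tile edge length, matching the 8x8 blocking named in the summary


def tile_8x8(row_major, rows, cols):
    """Pack a flat row-major matrix into contiguous 8x8 tiles (assumed layout)."""
    blocked = []
    for tr in range(rows // TILE):          # tile row
        for tc in range(cols // TILE):      # tile column
            for r in range(TILE):           # row inside the tile
                for c in range(TILE):       # column inside the tile
                    blocked.append(row_major[(tr * TILE + r) * cols + tc * TILE + c])
    return blocked


def untile_8x8(blocked, rows, cols):
    """Read the blocked buffer back in row-major order.

    Equivalent to a 4-D strided access with sizes
    [rows/8, 8, cols/8, 8] and strides [(cols/8)*64, 8, 64, 1].
    """
    tiles_per_row = cols // TILE
    out = []
    for tr in range(rows // TILE):
        for r in range(TILE):
            for tc in range(tiles_per_row):
                for c in range(TILE):
                    out.append(blocked[tr * tiles_per_row * 64 + r * 8 + tc * 64 + c])
    return out
```

Round-tripping a matrix through `tile_8x8` and `untile_8x8` should return the original data, which is a cheap way to check a stride pattern on the host before committing it to a DMA descriptor.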

Test plan

- make run with K write-back (KWB) enabled: K cache 0/65536 mismatches, attention correlation=0.9994
- make run --no-k-writeback: attention correlation=0.9994 (regression-free)
- CI lit test run_npu2_makefile_peano_elf.lit on NPU2 hardware
- Test with --causal flag
- Test with GQA (NUM_KV_HEADS != NUM_HEADS)
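The two figures reported in the test plan (a mismatch count for the K cache and a correlation for the attention output) can be computed along the following lines. This is a sketch under assumptions: it treats "correlation" as the Pearson coefficient and compares K cache elements exactly by default. The function names and the `tol` parameter are hypothetical, and the checks in attn_npu2.py may use different definitions or thresholds.

```python
import math


def count_mismatches(actual, expected, tol=0.0):
    """Count elements whose absolute difference exceeds tol (assumed check)."""
    return sum(1 for a, e in zip(actual, expected) if abs(a - e) > tol)


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

An exact match is a reasonable expectation for written-back cache data because it is a copy, while the attention output goes through bf16 arithmetic and is better judged by a similarity score.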

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds a new AIE2P programming example that fuses flash attention with K-cache write-back (KV prefill) and includes Makefile/lit coverage plus an ELF-based C++ runner.

Changes:

- Introduces kv_cache_prefill example implementation (AIR/MLIR Python builder + AIE kernel) with optional K write-back and GQA/causal support.
- Adds build/run tooling (Makefile, lit test) and a C++ ELF runner for profiling.
- Adds a convenience run_test.sh wrapper script.
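For readers unfamiliar with the GQA and causal options, a minimal reference sketch follows. It assumes the common convention that consecutive query heads share one KV head, which may not match the mapping used in this example, and both function names are hypothetical.

```python
def kv_head_for_query_head(q_head, num_heads, num_kv_heads):
    """Map a query head to the KV head it reads under GQA (assumed convention)."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


def causal_mask(seq_len):
    """Boolean mask where entry [q][k] is True if query q may attend to key k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]
```

With 12 query heads and 4 KV heads, heads 0 to 2 read KV head 0, heads 3 to 5 read KV head 1, and so on. When NUM_KV_HEADS equals NUM_HEADS the mapping is the identity, which is ordinary multi-head attention.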

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| programming_examples/flash_attention/kv_cache_prefill/attn_npu2.py | Builds and runs the fused attention + K write-back module; includes host-side validation. |
| programming_examples/flash_attention/kv_cache_prefill/attn_npu2.cc | AIE2P bf16 attention/softmax kernels plus copy/mask helpers. |
| programming_examples/flash_attention/kv_cache_prefill/Makefile | Builds the kernel object, runs the Python driver, and optionally profiles via ELF runner. |
| programming_examples/flash_attention/kv_cache_prefill/test_elf_npu2.cpp | C++ executable to load/run an ELF kernel for profiling. |
| programming_examples/flash_attention/kv_cache_prefill/run_npu2_makefile_peano_elf.lit | Lit test invoking `make run` for the new example. |
| programming_examples/flash_attention/kv_cache_prefill/run_test.sh | Local wrapper script to set env and execute the Python driver. |

Add a new programming example that demonstrates fused flash attention with KV cache write-back on AIE2P NPU. This extends the existing kernel_fusion_based flash attention with K cache prefill capability, where RoPE'd K data is written back to DDR during attention computation.

Key design features:
- L1-to-L3 direct K write-back path bypassing memtile to avoid DMA channel congestion
- Dedicated staging buffer to prevent DMA race conditions between K receive and write-back
- Un-tiling DMA strides to convert 8x8 blocked L1 layout back to row-major for the K cache
- Support for GQA (grouped query attention) with configurable head counts
- Causal masking support
- C++ test executable for ELF-based profiling workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Extend the KV cache prefill design to write both K and V caches to DDR during flash attention computation. Uses a single CacheWB channel with an interleaved KV cache layout [K_c0, V_c0, K_c1, V_c1, ...] where both K and V data are staged through kwb_buf before DMA transfer.

Key design choices:
- Single CacheWB channel avoids shim S2MM channel exhaustion (no packet switching needed)
- Shared kwb_buf staging buffer prevents DMA race between CacheWB read and V2L1 write on the v buffer
- scf.for loop in launch body enables compiler BD folding, preventing BD exhaustion at large sequence lengths (tested up to 12h x 4096)

Compiler changes (AIRToAIEPass.cpp):
- Fix packet BD attribute lookup for L1-to-L3 dma_packet channels (getExistingPacketFlowOpFromDevice searches both flow maps)
- Place outbound MM2S lock acquire before channel put and release after channel put, enabling interleaved lock pattern for multiple puts sharing the same staging buffer

Performance: 12 heads x 4096 seq_len achieves 2460 peak GFLOPS with zero overhead vs K-only writeback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
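The interleaved layout [K_c0, V_c0, K_c1, V_c1, ...] described in this commit can be illustrated with a small host-side sketch. It models each chunk as an opaque item and is not the code from attn_npu2.py. The helper names `interleave_kv`, `k_slot` and `v_slot` are hypothetical.

```python
def interleave_kv(k_chunks, v_chunks):
    """Build the cache order [K_c0, V_c0, K_c1, V_c1, ...] from per-chunk lists."""
    if len(k_chunks) != len(v_chunks):
        raise ValueError("K and V must have the same number of chunks")
    cache = []
    for k, v in zip(k_chunks, v_chunks):
        cache.append(k)  # K chunk goes first in each pair
        cache.append(v)  # V chunk follows immediately
    return cache


def k_slot(chunk):
    """Slot index of the K data for a given chunk."""
    return 2 * chunk


def v_slot(chunk):
    """Slot index of the V data for a given chunk."""
    return 2 * chunk + 1
```

Because each chunk's K and V slots are adjacent, a single write-back stream can emit them in order, which is consistent with the commit's choice of one CacheWB channel.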

Update the interleaved KV cache layout to support dk_chunks > 1:
- Per chunk stores [K_dk0, ..., K_dk(N-1), V_dv_lz] with N = dk_chunks
- KV cache outer dimension combines (kv_head, dv_chunk) like V L3 layout
- Launch body scf.for iterates cache_slots_per_chunk = dk_chunks + 1
- Host test constructs expected data with per-dk_tile K slots

Currently dk=dv=128 fails at the aiecc level due to L1 memory exhaustion (kwb_buf staging buffer + extra Q saved buffer exceeds 64KB), not due to layout issues. The generalized layout is ready for when L1 capacity is freed (e.g., by eliminating the staging buffer via compiler lock fixes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
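The slot arithmetic implied by cache_slots_per_chunk = dk_chunks + 1 can be sketched as follows. This is an illustrative model, assuming the slots within a chunk are laid out as the dk_chunks K tiles followed by one V slot, as the commit message states. The function names and the string labels are hypothetical and stand in for real tensor data.

```python
def cache_slots_per_chunk(dk_chunks):
    """Number of cache slots per chunk: dk_chunks K tiles plus one V slot."""
    return dk_chunks + 1


def k_slot_index(chunk, dk_tile, dk_chunks):
    """Flat slot index of K tile dk_tile within the given chunk."""
    return chunk * cache_slots_per_chunk(dk_chunks) + dk_tile


def v_slot_index(chunk, dk_chunks):
    """Flat slot index of the V slot, which follows the chunk's K tiles."""
    return chunk * cache_slots_per_chunk(dk_chunks) + dk_chunks


def expected_layout(num_chunks, dk_chunks):
    """Label every slot in order, e.g. for building host-side reference data."""
    layout = []
    for chunk in range(num_chunks):
        for dk_tile in range(dk_chunks):
            layout.append(f"K_c{chunk}_dk{dk_tile}")
        layout.append(f"V_c{chunk}")
    return layout
```

With dk_chunks = 1 this reduces to the two-slot [K, V] pairing of the earlier layout, so the generalized indexing is backward compatible with the single-tile case.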

- Remove run_test.sh with hardcoded machine-specific paths
- Fix lit test CHECK pattern: OVERALL: PASSED (not PASS!)
- Fix misleading RoPE message in C++ profiler: host pre-rotation
- Add missing C++ standard headers (chrono, cstring, cstdlib)
- Document GQA duplicate-write behavior for gqa_group_size > unroll

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The lock placement fix for outbound puts sharing a staging buffer has been moved to a separate PR (Xilinx#1515). This PR now contains only the programming example changes. The example requires PR Xilinx#1515 to be merged first for V cache write-back to work correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Register flash_attention/kv_cache_prefill in the programming examples dashboard generator. Shows as NPU2-only (green) based on the lit test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

erwei-xilinx requested a review from jgmelber as a code owner April 6, 2026 17:13

Copilot AI review requested due to automatic review settings April 6, 2026 17:13

Copilot started reviewing on behalf of erwei-xilinx April 6, 2026 17:14

Copilot AI reviewed Apr 6, 2026

erwei-xilinx force-pushed the erwei/add-kv-cache-prefill-example branch from 09a78ec to df4f24d April 8, 2026 18:21

erwei-xilinx requested a review from fifield as a code owner April 8, 2026 18:21

erwei-xilinx and others added 5 commits April 8, 2026 14:34

erwei-xilinx force-pushed the erwei/add-kv-cache-prefill-example branch 2 times, most recently from 66d0aba to fdc78b2 April 8, 2026 21:41

Add KV cache prefill to operator dashboard

fdc78b2

erwei-xilinx wants to merge 6 commits into Xilinx:main from erwei-xilinx:erwei/add-kv-cache-prefill-example
