Add KV cache prefill flash attention example for AIE2P #1500

Open
erwei-xilinx wants to merge 6 commits into Xilinx:main from erwei-xilinx:erwei/add-kv-cache-prefill-example
@erwei-xilinx
Collaborator

@erwei-xilinx erwei-xilinx commented Apr 6, 2026

Summary

  • Add a new programming example demonstrating fused flash attention with KV cache write-back on AIE2P NPU
  • K cache data is written back to DDR during attention computation using L1-to-L3 direct DMA paths, bypassing memtile to avoid channel congestion
  • Uses a dedicated staging buffer to prevent DMA race conditions between K receive and write-back, and un-tiling DMA strides to convert 8×8 blocked L1 layout back to row-major
  • Supports GQA (grouped query attention), causal masking, and includes a C++ test executable for ELF-based profiling
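
For readers unfamiliar with the un-tiling step, here is a minimal sketch of converting an 8×8 blocked buffer back to row-major. It is an illustration only: the helper names `blocked_to_row_major` and `row_major_to_blocked` are hypothetical, and the assumed tile ordering (tiles stored in row-major tile order, elements row-major inside each tile) may differ from the actual L1 layout and DMA stride configuration in this PR.

```python
TILE = 8  # tile edge length, matching the 8x8 blocking mentioned in the summary


def blocked_to_row_major(blocked, rows, cols, tile=TILE):
    """Un-tile a flat blocked buffer into a flat row-major buffer.

    ASSUMPTION (not taken from the PR): tiles are stored one after another in
    row-major tile order, and elements inside each tile are row-major.
    """
    assert rows % tile == 0 and cols % tile == 0
    tiles_per_row = cols // tile
    out = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            tile_index = (r // tile) * tiles_per_row + (c // tile)
            offset_in_tile = (r % tile) * tile + (c % tile)
            out[r * cols + c] = blocked[tile_index * tile * tile + offset_in_tile]
    return out


def row_major_to_blocked(flat, rows, cols, tile=TILE):
    """Inverse transform, used here only to build test data."""
    out = []
    for tr in range(rows // tile):
        for tc in range(cols // tile):
            for r in range(tile):
                for c in range(tile):
                    out.append(flat[(tr * tile + r) * cols + tc * tile + c])
    return out


if __name__ == "__main__":
    rows, cols = 16, 24
    flat = list(range(rows * cols))
    blocked = row_major_to_blocked(flat, rows, cols)
    assert blocked_to_row_major(blocked, rows, cols) == flat
```

On hardware the same index arithmetic would be expressed as a multi-dimensional DMA access pattern rather than a host-side loop; the loop form is only meant to make the address mapping explicit.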

Test plan

  • make run with K write-back (KWB) enabled: K cache 0/65536 mismatches, attention correlation=0.9994
  • make run --no-k-writeback: attention correlation=0.9994 (regression-free)
  • CI lit test run_npu2_makefile_peano_elf.lit on NPU2 hardware
  • Test with --causal flag
  • Test with GQA (NUM_KV_HEADS != NUM_HEADS)
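
The two figures reported above (a mismatch count and a correlation) can be reproduced with checks along these lines. This is a sketch under assumptions: the PR does not state the exact metric, so Pearson correlation over flattened outputs and a tolerance-based element comparison are assumed, and `count_mismatches` and `pearson_correlation` are hypothetical names rather than functions from attn_npu2.py.

```python
import math


def count_mismatches(actual, expected, tol=0.0):
    """Number of positions where |actual - expected| exceeds tol."""
    assert len(actual) == len(expected)
    return sum(1 for a, e in zip(actual, expected) if abs(a - e) > tol)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    assert len(x) == len(y) and len(x) > 1
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    reference = [0.0, 1.0, 2.0, 3.0]
    device = [2.0 * v + 1.0 for v in reference]  # perfectly correlated stand-in data
    print(count_mismatches(reference, reference))
    print(pearson_correlation(reference, device))
```

An exact comparison is a natural fit for the K cache, since write-back is a copy, while a correlation threshold is more forgiving of bf16 rounding in the attention output.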

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx requested a review from jgmelber as a code owner April 6, 2026 17:13
Copilot AI review requested due to automatic review settings April 6, 2026 17:13
Contributor

Copilot AI left a comment

Pull request overview

Adds a new AIE2P programming example that fuses flash attention with K-cache write-back (KV prefill) and includes Makefile/lit coverage plus an ELF-based C++ runner.

Changes:

  • Introduces kv_cache_prefill example implementation (AIR/MLIR Python builder + AIE kernel) with optional K write-back and GQA/causal support.
  • Adds build/run tooling (Makefile, lit test) and a C++ ELF runner for profiling.
  • Adds a convenience run_test.sh wrapper script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| programming_examples/flash_attention/kv_cache_prefill/attn_npu2.py | Builds and runs the fused attention + K write-back module; includes host-side validation. |
| programming_examples/flash_attention/kv_cache_prefill/attn_npu2.cc | AIE2P bf16 attention/softmax kernels plus copy/mask helpers. |
| programming_examples/flash_attention/kv_cache_prefill/Makefile | Builds the kernel object, runs the Python driver, and optionally profiles via ELF runner. |
| programming_examples/flash_attention/kv_cache_prefill/test_elf_npu2.cpp | C++ executable to load/run an ELF kernel for profiling. |
| programming_examples/flash_attention/kv_cache_prefill/run_npu2_makefile_peano_elf.lit | Lit test invoking make run for the new example. |
| programming_examples/flash_attention/kv_cache_prefill/run_test.sh | Local wrapper script to set env and execute the Python driver. |

Comment thread programming_examples/flash_attention/kv_cache_prefill/run_test.sh Outdated
Comment thread programming_examples/flash_attention/kv_cache_prefill/attn_npu2.py
Comment thread programming_examples/flash_attention/kv_cache_prefill/attn_npu2.py
Comment thread programming_examples/flash_attention/kv_cache_prefill/attn_npu2.py Outdated
Comment thread programming_examples/flash_attention/kv_cache_prefill/test_elf_npu2.cpp Outdated
Comment thread programming_examples/flash_attention/kv_cache_prefill/attn_npu2.cc
@erwei-xilinx erwei-xilinx force-pushed the erwei/add-kv-cache-prefill-example branch from 09a78ec to df4f24d on April 8, 2026 18:21
@erwei-xilinx erwei-xilinx requested a review from fifield as a code owner April 8, 2026 18:21
erwei-xilinx and others added 5 commits April 8, 2026 14:34
Add a new programming example that demonstrates fused flash attention
with KV cache write-back on AIE2P NPU. This extends the existing
kernel_fusion_based flash attention with K cache prefill capability,
where RoPE'd K data is written back to DDR during attention computation.

Key design features:
- L1-to-L3 direct K write-back path bypassing memtile to avoid DMA
  channel congestion
- Dedicated staging buffer to prevent DMA race conditions between K
  receive and write-back
- Un-tiling DMA strides to convert 8x8 blocked L1 layout back to
  row-major for the K cache
- Support for GQA (grouped query attention) with configurable head counts
- Causal masking support
- C++ test executable for ELF-based profiling workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
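
As a reading aid for the GQA and causal masking features listed above, here is a minimal sketch of the two index rules involved. It assumes the conventional GQA grouping (consecutive query heads share one KV head) and a standard lower-triangular causal mask; the function names are hypothetical and the kernel in this PR may organize the work differently.

```python
def kv_head_for_query_head(q_head, num_heads, num_kv_heads):
    """Map a query head to the KV head it shares under GQA.

    ASSUMPTION: consecutive query heads form a group, so the group size is
    num_heads // num_kv_heads.
    """
    assert num_heads % num_kv_heads == 0
    assert 0 <= q_head < num_heads
    group_size = num_heads // num_kv_heads
    return q_head // group_size


def causal_mask(q_len, k_len):
    """mask[i][j] is True where query position i may attend to key position j."""
    return [[j <= i for j in range(k_len)] for i in range(q_len)]


if __name__ == "__main__":
    # 12 query heads sharing 4 KV heads gives groups of 3 query heads each.
    print([kv_head_for_query_head(h, 12, 4) for h in range(12)])
    for row in causal_mask(3, 3):
        print(row)
```

When num_kv_heads equals num_heads the mapping is the identity, which is ordinary multi-head attention.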
Extend the KV cache prefill design to write both K and V caches to DDR
during flash attention computation. Uses a single CacheWB channel with
an interleaved KV cache layout [K_c0, V_c0, K_c1, V_c1, ...] where
both K and V data are staged through kwb_buf before DMA transfer.

Key design choices:
- Single CacheWB channel avoids shim S2MM channel exhaustion (no packet
  switching needed)
- Shared kwb_buf staging buffer prevents DMA race between CacheWB read
  and V2L1 write on the v buffer
- scf.for loop in launch body enables compiler BD folding, preventing
  BD exhaustion at large sequence lengths (tested up to 12h x 4096)

Compiler changes (AIRToAIEPass.cpp):
- Fix packet BD attribute lookup for L1-to-L3 dma_packet channels
  (getExistingPacketFlowOpFromDevice searches both flow maps)
- Place outbound MM2S lock acquire before channel put and release after
  channel put, enabling interleaved lock pattern for multiple puts
  sharing the same staging buffer

Performance: 12 heads x 4096 seq_len achieves 2460 peak GFLOPS with
zero overhead vs K-only writeback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
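
The interleaved layout [K_c0, V_c0, K_c1, V_c1, ...] described above can be pictured with a small host-side sketch. The helper names `interleave_kv` and `split_kv` are hypothetical, and chunks are represented as opaque Python objects; the real cache is a DDR buffer filled by DMA, not a list.

```python
def interleave_kv(k_chunks, v_chunks):
    """Build the interleaved cache order [K_c0, V_c0, K_c1, V_c1, ...]."""
    assert len(k_chunks) == len(v_chunks)
    cache = []
    for k, v in zip(k_chunks, v_chunks):
        cache.append(k)
        cache.append(v)
    return cache


def split_kv(cache):
    """Recover the separate K and V chunk lists from an interleaved cache."""
    assert len(cache) % 2 == 0
    return cache[0::2], cache[1::2]


if __name__ == "__main__":
    cache = interleave_kv(["K_c0", "K_c1"], ["V_c0", "V_c1"])
    print(cache)
    print(split_kv(cache))
```

A host-side test can use the same de-interleaving to compare the K and V halves of the cache against their reference tensors separately.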
Update the interleaved KV cache layout to support dk_chunks > 1:
- Per chunk stores [K_dk0, ..., K_dk(N-1), V_dv_lz] with N = dk_chunks
- KV cache outer dimension combines (kv_head, dv_chunk) like V L3 layout
- Launch body scf.for iterates cache_slots_per_chunk = dk_chunks + 1
- Host test constructs expected data with per-dk_tile K slots

Currently dk=dv=128 fails at the aiecc level due to L1 memory exhaustion
(kwb_buf staging buffer + extra Q saved buffer exceeds 64KB), not due to
layout issues. The generalized layout is ready for when L1 capacity is
freed (e.g., by eliminating the staging buffer via compiler lock fixes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
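
The generalized layout above, with cache_slots_per_chunk = dk_chunks + 1, implies a simple slot-index rule. The sketch below is one plausible reading of it: the function `cache_slot` and its `dk_tile=None` convention for the V slot are hypothetical, and the (kv_head, dv_chunk) outer dimension is left out for brevity.

```python
def cache_slot(chunk, dk_chunks, dk_tile=None):
    """Slot index inside the cache for one sequence chunk.

    Each chunk holds dk_chunks K slots followed by one V slot.
    dk_tile=None selects the V slot (an illustrative convention, not from the PR).
    """
    assert dk_chunks >= 1
    cache_slots_per_chunk = dk_chunks + 1
    base = chunk * cache_slots_per_chunk
    if dk_tile is None:
        return base + dk_chunks  # V comes after all K slots of this chunk
    assert 0 <= dk_tile < dk_chunks
    return base + dk_tile


if __name__ == "__main__":
    # With dk_chunks = 2 each chunk occupies three slots: K_dk0, K_dk1, V.
    for chunk in range(2):
        print(chunk, cache_slot(chunk, 2, 0), cache_slot(chunk, 2, 1), cache_slot(chunk, 2))
```

With dk_chunks = 1 the rule reduces to the earlier two-slot interleaving, where K for chunk c sits at slot 2c and V at slot 2c + 1.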
- Remove run_test.sh with hardcoded machine-specific paths
- Fix lit test CHECK pattern: OVERALL: PASSED (not PASS!)
- Fix misleading RoPE message in C++ profiler: host pre-rotation
- Add missing C++ standard headers (chrono, cstring, cstdlib)
- Document GQA duplicate-write behavior for gqa_group_size > unroll

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The lock placement fix for outbound puts sharing a staging buffer has
been moved to a separate PR (Xilinx#1515). This PR now contains only the
programming example changes. The example requires PR Xilinx#1515 to be
merged first for V cache write-back to work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the erwei/add-kv-cache-prefill-example branch 2 times, most recently from 66d0aba to fdc78b2 on April 8, 2026 21:41
Register flash_attention/kv_cache_prefill in the programming examples
dashboard generator. Shows as NPU2-only (green) based on the lit test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 participants