Add KV cache prefill flash attention example for AIE2P #1500
Open
erwei-xilinx wants to merge 6 commits into
Pull request overview
Adds a new AIE2P programming example that fuses flash attention with K-cache write-back (KV prefill) and includes Makefile/lit coverage plus an ELF-based C++ runner.
Changes:
- Introduces `kv_cache_prefill` example implementation (AIR/MLIR Python builder + AIE kernel) with optional K write-back and GQA/causal support.
- Adds build/run tooling (Makefile, lit test) and a C++ ELF runner for profiling.
- Adds a convenience `run_test.sh` wrapper script.
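For readers unfamiliar with the GQA and causal options mentioned above, here is a minimal host-side sketch of what they mean. The function names are hypothetical and are not taken from `attn_npu2.py`. It assumes the common convention that consecutive query heads share one KV head; the example's actual mapping is not shown on this page.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head to the KV head it reads under GQA.

    Assumption: consecutive query heads form a group that shares one KV head.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def causal_mask(seq_len):
    """Return a seq_len x seq_len table; True where query i may attend to key j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# 12 query heads sharing 4 KV heads gives groups of 3 query heads.
print(kv_head_for_query_head(5, 12, 4))
print(causal_mask(3))
```

With a group size of 3, query heads 3, 4 and 5 all read KV head 1, which is why a K write-back issued once per query head could write the same K data more than once.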
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| programming_examples/flash_attention/kv_cache_prefill/attn_npu2.py | Builds and runs the fused attention + K write-back module; includes host-side validation. |
| programming_examples/flash_attention/kv_cache_prefill/attn_npu2.cc | AIE2P bf16 attention/softmax kernels plus copy/mask helpers. |
| programming_examples/flash_attention/kv_cache_prefill/Makefile | Builds the kernel object, runs the Python driver, and optionally profiles via ELF runner. |
| programming_examples/flash_attention/kv_cache_prefill/test_elf_npu2.cpp | C++ executable to load/run an ELF kernel for profiling. |
| programming_examples/flash_attention/kv_cache_prefill/run_npu2_makefile_peano_elf.lit | Lit test invoking make run for the new example. |
| programming_examples/flash_attention/kv_cache_prefill/run_test.sh | Local wrapper script to set env and execute the Python driver. |
09a78ec to
df4f24d
Compare
Add a new programming example that demonstrates fused flash attention with KV cache write-back on AIE2P NPU. This extends the existing kernel_fusion_based flash attention with K cache prefill capability, where RoPE'd K data is written back to DDR during attention computation.

Key design features:
- L1-to-L3 direct K write-back path bypassing memtile to avoid DMA channel congestion
- Dedicated staging buffer to prevent DMA race conditions between K receive and write-back
- Un-tiling DMA strides to convert 8x8 blocked L1 layout back to row-major for the K cache
- Support for GQA (grouped query attention) with configurable head counts
- Causal masking support
- C++ test executable for ELF-based profiling workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
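The un-tiling step in this commit can be illustrated on the host with plain index arithmetic. This is a sketch under stated assumptions, not the DMA configuration used in the example: it assumes tiles are stored in row-major tile order and that each 8x8 tile is itself row-major and contiguous. The helper names are hypothetical.

```python
TILE = 8  # tile edge length, matching the 8x8 blocking named in the commit


def blocked_offset(row, col, cols, tile=TILE):
    """Offset of element (row, col) in the assumed blocked layout."""
    tiles_per_row = cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    return tile_index * tile * tile + (row % tile) * tile + (col % tile)


def to_blocked(flat, rows, cols, tile=TILE):
    """Row-major -> blocked; used here only to build test data."""
    blocked = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            blocked[blocked_offset(r, c, cols, tile)] = flat[r * cols + c]
    return blocked


def untile(blocked, rows, cols, tile=TILE):
    """Blocked -> row-major, the direction a K cache write-back needs."""
    if rows % tile or cols % tile or len(blocked) != rows * cols:
        raise ValueError("shape must be a multiple of the tile size")
    return [blocked[blocked_offset(r, c, cols, tile)]
            for r in range(rows) for c in range(cols)]


row_major = list(range(16 * 16))
assert untile(to_blocked(row_major, 16, 16), 16, 16) == row_major
```

A DMA expresses the same gather as nested strides and wrap counts instead of per-element offsets. The round trip above only confirms that the index mapping is a permutation under the assumed layout.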
Extend the KV cache prefill design to write both K and V caches to DDR during flash attention computation. Uses a single CacheWB channel with an interleaved KV cache layout [K_c0, V_c0, K_c1, V_c1, ...] where both K and V data are staged through kwb_buf before DMA transfer.

Key design choices:
- Single CacheWB channel avoids shim S2MM channel exhaustion (no packet switching needed)
- Shared kwb_buf staging buffer prevents DMA race between CacheWB read and V2L1 write on the v buffer
- scf.for loop in launch body enables compiler BD folding, preventing BD exhaustion at large sequence lengths (tested up to 12h x 4096)

Compiler changes (AIRToAIEPass.cpp):
- Fix packet BD attribute lookup for L1-to-L3 dma_packet channels (getExistingPacketFlowOpFromDevice searches both flow maps)
- Place outbound MM2S lock acquire before channel put and release after channel put, enabling interleaved lock pattern for multiple puts sharing the same staging buffer

Performance: 12 heads x 4096 seq_len achieves 2460 peak GFLOPS with zero overhead vs K-only writeback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
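The interleaved layout [K_c0, V_c0, K_c1, V_c1, ...] can be mimicked on the host when building expected data. This is a minimal sketch with hypothetical helper names. Each chunk is treated as an opaque item, since the chunk shape is not stated in the commit message.

```python
def interleave_kv(k_chunks, v_chunks):
    """Build [K_c0, V_c0, K_c1, V_c1, ...] from per-chunk K and V data."""
    if len(k_chunks) != len(v_chunks):
        raise ValueError("K and V must have the same number of chunks")
    cache = []
    for k_c, v_c in zip(k_chunks, v_chunks):
        cache.append(k_c)
        cache.append(v_c)
    return cache


def split_kv(cache):
    """Inverse of interleave_kv: even slots hold K, odd slots hold V."""
    return cache[0::2], cache[1::2]


cache = interleave_kv(["K_c0", "K_c1"], ["V_c0", "V_c1"])
print(cache)  # ['K_c0', 'V_c0', 'K_c1', 'V_c1']
```

Because K and V alternate in a single buffer, one output stream can carry both, which is consistent with the single CacheWB channel described in the commit.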
Update the interleaved KV cache layout to support dk_chunks > 1:
- Per chunk stores [K_dk0, ..., K_dk(N-1), V_dv_lz] with N = dk_chunks
- KV cache outer dimension combines (kv_head, dv_chunk) like V L3 layout
- Launch body scf.for iterates cache_slots_per_chunk = dk_chunks + 1
- Host test constructs expected data with per-dk_tile K slots

Currently dk=dv=128 fails at the aiecc level due to L1 memory exhaustion (kwb_buf staging buffer + extra Q saved buffer exceeds 64KB), not due to layout issues. The generalized layout is ready for when L1 capacity is freed (e.g., by eliminating the staging buffer via compiler lock fixes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
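The slot arithmetic implied by cache_slots_per_chunk = dk_chunks + 1 can be sketched as follows. The function name and the "K"/"V" tags are hypothetical, and the sketch assumes the K slots precede the single V slot within each chunk, as the [K_dk0, ..., K_dk(N-1), V] ordering suggests.

```python
def cache_slot(chunk, kind, dk_tile=0, dk_chunks=1):
    """Slot index within the generalized interleaved KV cache.

    Each chunk owns dk_chunks K slots followed by one V slot.
    """
    slots_per_chunk = dk_chunks + 1
    base = chunk * slots_per_chunk
    if kind == "K":
        if not 0 <= dk_tile < dk_chunks:
            raise ValueError("dk_tile out of range")
        return base + dk_tile
    if kind == "V":
        return base + dk_chunks
    raise ValueError("kind must be 'K' or 'V'")


# With dk_chunks = 2, chunk 1 starts at slot 3: K slots 3 and 4, V slot 5.
print(cache_slot(1, "K", dk_tile=1, dk_chunks=2))
```

With dk_chunks = 1 this reduces to the earlier alternating layout, where K sits in even slots and V in odd slots.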
- Remove run_test.sh with hardcoded machine-specific paths
- Fix lit test CHECK pattern: OVERALL: PASSED (not PASS!)
- Fix misleading RoPE message in C++ profiler: host pre-rotation
- Add missing C++ standard headers (chrono, cstring, cstdlib)
- Document GQA duplicate-write behavior for gqa_group_size > unroll

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lock placement fix for outbound puts sharing a staging buffer has been moved to a separate PR (Xilinx#1515). This PR now contains only the programming example changes.

The example requires PR Xilinx#1515 to be merged first for V cache write-back to work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
66d0aba to
fdc78b2
Compare
Register flash_attention/kv_cache_prefill in the programming examples dashboard generator. Shows as NPU2-only (green) based on the lit test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Test plan
- `make run` with KWB enabled: K cache 0/65536 mismatches, attention correlation=0.9994
- `make run --no-k-writeback`: attention correlation=0.9994 (regression-free)
- `run_npu2_makefile_peano_elf.lit` on NPU2 hardware
- `--causal` flag

🤖 Generated with Claude Code
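The two metrics quoted in the test plan, a mismatch count for the K cache and a correlation for the attention output, can be computed along these lines. This is a sketch with hypothetical helper names. It assumes an exact element-wise comparison for the cache and a Pearson correlation for the attention output; the page does not show how `attn_npu2.py` computes either figure.

```python
import math


def count_mismatches(actual, expected):
    """Number of positions where the written-back cache differs from the reference."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    return sum(1 for a, e in zip(actual, expected) if a != e)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


mismatches = count_mismatches([1, 2, 3], [1, 2, 4])
print(f"K cache {mismatches}/3 mismatches")
```

An exact comparison suits the cache because write-back is a pure copy. The attention output passes through bf16 arithmetic, so a correlation threshold is the more forgiving check.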