Commit 66d0aba

erwei-xilinx and claude committed
Add KV cache prefill to operator dashboard
Register flash_attention/kv_cache_prefill in the programming examples dashboard generator. Shows as NPU2-only (green) based on the lit test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 15e9cf6 commit 66d0aba

2 files changed

Lines changed: 113 additions & 0 deletions

programming_examples/dashboard.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+<!-- This file is auto-generated by generate_readme.py. Do not edit manually. -->
+
+# MLIR-AIR Programming Examples
+
+These programming examples demonstrate how to leverage the AIR design flow with mlir-air Python bindings and the mlir-air intermediate representation (IR) to build applications targeting AI Engines on AMD NPUs.
+
+## Operator Dashboard
+
+| Category | Operation | Datatype(s) | NPU1 | NPU2 | Design Example |
+|:---------|:----------|:------------|:----:|:----:|:---------------|
+| Linear Algebra | [Matrix Multiplication](matrix_multiplication/) | bf16, i16, i8 | 🟢 | 🟢 | [matrix_multiplication/](matrix_multiplication/) |
+| Linear Algebra | [Vector-Matrix Multiplication](vector_matrix_multiplication/) | bf16 | 🟢 | 🟢 | [vector_matrix_multiplication/](vector_matrix_multiplication/) |
+| Linear Algebra | [Matrix-Vector Multiplication](matrix_vector_multiplication/bf16/) | bf16 || 🟢 | [matrix_vector_multiplication/bf16/](matrix_vector_multiplication/bf16/) |
+| Linear Algebra | [AXPY](axpy/) | bf16 | 🟢 | 🟢 | [axpy/](axpy/) |
+| Element-wise | [Element-wise Add](eltwise_add/) | f32 | 🟢 | 🟢 | [eltwise_add/](eltwise_add/) |
+| Element-wise | [Element-wise Add (with L2)](eltwise_add_with_l2/) | f32 | 🟢 | 🟢 | [eltwise_add_with_l2/](eltwise_add_with_l2/) |
+| Element-wise | [Element-wise Add (bf16)](primitives/vector_examples/vector_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_add/](primitives/vector_examples/vector_add/) |
+| Element-wise | [Element-wise Mul](primitives/vector_examples/vector_mul/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_mul/](primitives/vector_examples/vector_mul/) |
+| Activation/Math | [SiLU](silu/) | bf16 || 🟢 | [silu/](silu/) |
+| Activation/Math | [GELU](gelu/) | bf16 || 🟢 | [gelu/](gelu/) |
+| Activation/Math | [Softmax](softmax/) | bf16 | 🟢 | 🟢 | [softmax/](softmax/) |
+| Activation/Math | [Sine / Cosine](sine_cosine/) | bf16 | 🟢 || [sine_cosine/](sine_cosine/) |
+| Activation/Math | [RELU](relu/) | bf16 | 🟢 | 🟢 | [relu/](relu/) |
+| Activation/Math | [Leaky RELU](leaky_relu/) | bf16 | 🟢 | 🟢 | [leaky_relu/](leaky_relu/) |
+| Activation/Math | [Sigmoid](sigmoid/) | bf16 || 🟢 | [sigmoid/](sigmoid/) |
+| Activation/Math | [Tanh](primitives/vector_examples/vector_tanh/) | bf16 || 🟢 | [primitives/vector_examples/vector_tanh/](primitives/vector_examples/vector_tanh/) |
+| Normalization | [Layer Normalization](layer_norm/) | bf16 || 🟢 | [layer_norm/](layer_norm/) |
+| Normalization | [RMS Normalization](rms_norm/) | bf16 || 🟢 | [rms_norm/](rms_norm/) |
+| Normalization | [Weighted RMS Normalization](weighted_rms_norm/) | bf16 || 🟢 | [weighted_rms_norm/](weighted_rms_norm/) |
+| Aggregation | [Reduction (Add)](primitives/vector_examples/vector_reduce_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_add/](primitives/vector_examples/vector_reduce_add/) |
+| Pooling | [MaxPool](primitives/vector_examples/vector_reduce_max/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_max/](primitives/vector_examples/vector_reduce_max/) |
+| Pooling | [AveragePool](average_pool/) | bf16 | 🟢 | 🟢 | [average_pool/](average_pool/) |
+| LLM Kernels | [Multi-Head Attention (LLaMA2)](llama2_mha/) | bf16 | 🟢 || [llama2_mha/](llama2_mha/) |
+| LLM Kernels | [SwiGLU](swiglu/) | bf16 || 🟢 | [swiglu/](swiglu/) |
+| LLM Kernels | [FFN SwiGLU (Decode)](ffn_swiglu/decode/) | bf16 || 🟢 | [ffn_swiglu/decode/](ffn_swiglu/decode/) |
+| LLM Kernels | [FFN SwiGLU (Prefill)](ffn_swiglu/prefill/) | bf16 || 🟢 | [ffn_swiglu/prefill/](ffn_swiglu/prefill/) |
+| LLM Kernels | [RoPE (LUT-based)](rope_lut/) | bf16 || 🟢 | [rope_lut/](rope_lut/) |
+| LLM Kernels | [RoPE (On-chip Sin/Cos)](rope_sincos/) | bf16 | 🟢 | 🟢 | [rope_sincos/](rope_sincos/) |
+| Attention | [Flash Attention (Dataflow)](flash_attention/dataflow_based/) | bf16 | 🟢 | 🟢 | [flash_attention/dataflow_based/](flash_attention/dataflow_based/) |
+| Attention | [Flash Attention (Kernel Fusion)](flash_attention/kernel_fusion_based/) | bf16 | 🟢 | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) |
+| Attention | [Grouped Query Attention (GQA)](flash_attention/kernel_fusion_based/) | bf16 | 🟢 | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) |
+| Attention | [Flash Attention + KV Cache Prefill](flash_attention/kv_cache_prefill/) | bf16 || 🟢 | [flash_attention/kv_cache_prefill/](flash_attention/kv_cache_prefill/) |
+| Data Movement | [Passthrough (DMA)](passthrough/passthrough_dma/) | u8, i8, i16, u16, f32, bf16 | 🟢 | 🟢 | [passthrough/passthrough_dma/](passthrough/passthrough_dma/) |
+| Data Movement | [Passthrough (Channel)](passthrough/passthrough_channel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_channel/](passthrough/passthrough_channel/) |
+| Data Movement | [Passthrough (Kernel)](passthrough/passthrough_kernel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_kernel/](passthrough/passthrough_kernel/) |
+| Data Movement | [Shim DMA 2D](shim_dma_2d/) | i32 | 🟢 | 🟢 | [shim_dma_2d/](shim_dma_2d/) |
+| Data Movement | [Data Transfer Transpose](data_transfer_transpose/) | u32 | 🟢 | 🟢 | [data_transfer_transpose/](data_transfer_transpose/) |
+| Data Movement | [Transpose (bf16)](data_transfer_transpose/dma_bf16/) | bf16 || 🟢 | [data_transfer_transpose/dma_bf16/](data_transfer_transpose/dma_bf16/) |
+| Data Movement | [Matrix Scalar Add](matrix_scalar_add/) | i32 | 🟢 | 🟢 | [matrix_scalar_add/](matrix_scalar_add/) |
+| Communication | [Channel Examples](channel_examples/) | i32 | 🟢 | 🟢 | [channel_examples/](channel_examples/) |
+| Communication | [3D Channel with Segment Unroll](channel_examples/channel_3d_segment_unroll/) | i32 || 🟢 | [channel_examples/channel_3d_segment_unroll/](channel_examples/channel_3d_segment_unroll/) |
+| Communication | [Broadcast Selective Capture](channel_examples/broadcast_selective_capture/) | i32 | 🟢 | 🟢 | [channel_examples/broadcast_selective_capture/](channel_examples/broadcast_selective_capture/) |
+| Communication | [Multi-Segment Examples](multi_segment/) | i32 | 🟡 | 🟡 | [multi_segment/](multi_segment/) |
+| Communication | [Cascade Reduction](cascade_reduction/) | i32 | 🟢 | 🟢 | [cascade_reduction/](cascade_reduction/) |
+| Memory | [Segment Alloc](segment_alloc/) | i32 | 🟢 | 🟢 | [segment_alloc/](segment_alloc/) |
+| Spatial | [Segment Unroll](segment_unroll/) | i32 | 🟢 | 🟢 | [segment_unroll/](segment_unroll/) |
+| Dataflow | [Herd Dataflow](herd_dataflow/) | bf16 | 🟢 | 🟢 | [herd_dataflow/](herd_dataflow/) |
+| Control Flow | [Conditional Branching](conditional_branching/) | i32 | 🟢 | 🟢 | [conditional_branching/](conditional_branching/) |
+| CNN | [2D Convolution](conv2d/) | i32 | 🟢 | 🟢 | [conv2d/](conv2d/) |
+| CNN | [Bottleneck](bottleneck/) | bf16 | 🟢 | 🟢 | [bottleneck/](bottleneck/) |
+| ML Pipeline | [MNIST-FC (Broadcast Bias Add)](mnist_fc/broadcast_bias_add/) | f32 || 🟢 | [mnist_fc/broadcast_bias_add/](mnist_fc/broadcast_bias_add/) |
+| ML Pipeline | [MNIST-FC (ReLU 2D)](mnist_fc/relu/) | f32/bf16 || 🟢 | [mnist_fc/relu/](mnist_fc/relu/) |
+| ML Pipeline | [MNIST-FC (Argmax)](mnist_fc/argmax/) | f32→i32 || 🟢 | [mnist_fc/argmax/](mnist_fc/argmax/) |
+| ML Pipeline | [MNIST-FC (Integration)](mnist_fc/integration/) | f32 || 🟢 | [mnist_fc/integration/](mnist_fc/integration/) |
+| Memory | [Shared L1 Buffer](shared_l1/) | bf16 | 🟢 || [shared_l1/](shared_l1/) |
+| Quantization | [Dequant (AWQ int4→bf16)](dequant_awq/) | int4/bf16 || 🟢 | [dequant_awq/](dequant_awq/) |
+| Primitives | [Scalar/Vector Operations](primitives/) | various | 🟢 | 🟢 | [primitives/](primitives/) |
+
+### Status Legend
+
+- 🟢 Supported and tested
+- 🟡 Work in progress
+- ⚪ Not yet supported
+
+**NPU1** = AMD Ryzen AI (Phoenix, AIE2) &nbsp;&nbsp; **NPU2** = AMD Ryzen AI (Strix, AIE2P)
+
+## Getting Started
+
+See the top-level [README](../README.md) for environment setup and build instructions. Once your environment is configured:
+
+```bash
+# Example: run matrix multiplication (bf16, 4x4 herd, 512x512x512)
+cd matrix_multiplication/bf16
+make run4x4
+
+# Print generated MLIR without running
+make print
+```
+
+Most examples with a `Makefile` support `make run` (compile and execute on hardware) and `make print` (generate MLIR only). Examples without a Makefile can be run directly with Python:
+
+```bash
+python3 run.py # compile and run (XRTRunner)
+python3 run.py --print-module-only # print IR only
+```
+
+## Benchmarking
+
+The [matrix multiplication](matrix_multiplication/) examples include sweep infrastructure for measuring end-to-end latency across problem sizes:
+
+```bash
+cd matrix_multiplication/bf16
+make sweep4x4 # sweep problem sizes 256-2048 with a 4x4 herd
+make profile # profile a single 1024^3 problem on hardware
+```
+
+Sweep results are saved as CSV files for analysis. See the [bf16 README](matrix_multiplication/bf16/README.md) for details on tile size configuration and architecture selection.

programming_examples/generate_readme.py

Lines changed: 6 additions & 0 deletions
@@ -216,6 +216,12 @@
         "path": "flash_attention/kernel_fusion_based",
         "datatypes": "bf16",
     },
+    {
+        "category": "Attention",
+        "name": "Flash Attention + KV Cache Prefill",
+        "path": "flash_attention/kv_cache_prefill",
+        "datatypes": "bf16",
+    },
     {
         "category": "Data Movement",
         "name": "Passthrough (DMA)",

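For readers who want to see how an entry with these four fields could become the matching row in dashboard.md, here is a minimal sketch. It is not taken from generate_readme.py: `render_row` and its `npu1`/`npu2` flags are hypothetical, and the real generator may derive the status columns differently (the commit message says status comes from the lit test). Leaving the cell empty for an unsupported target is also an assumption.

```python
GREEN = "🟢"  # "Supported and tested" marker from the dashboard's status legend


def render_row(entry, npu1, npu2):
    """Render one dashboard entry as a Markdown table row.

    Hypothetical helper: `npu1` and `npu2` are booleans supplied by the
    caller, standing in for whatever status detection the real script does.
    """
    link = f"{entry['path']}/"  # the dashboard links to the example directory
    cells = [
        entry["category"],
        f"[{entry['name']}]({link})",
        entry["datatypes"],
        GREEN if npu1 else "",  # assumption: unsupported targets get an empty cell
        GREEN if npu2 else "",
        f"[{link}]({link})",
    ]
    return "| " + " | ".join(cells) + " |"


# The entry added by this commit, copied from the hunk above.
entry = {
    "category": "Attention",
    "name": "Flash Attention + KV Cache Prefill",
    "path": "flash_attention/kv_cache_prefill",
    "datatypes": "bf16",
}

row = render_row(entry, npu1=False, npu2=True)
print(row)
```

Passing `npu1=False, npu2=True` mirrors the "NPU2-only" status described in the commit message. The same `path` value feeds both the Operation link and the Design Example link, which is why the entry needs only one path field.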