Add KV cache prefill to operator dashboard

erwei-xilinx · claude · erwei-xilinx · commit 66d0aba7bd2e · 2026-04-08T14:41:28.000-07:00
Register flash_attention/kv_cache_prefill in the programming examples
dashboard generator. The new row is marked as supported on NPU2 only
(green), based on the example's lit test.
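
As a rough illustration of how a registered entry could map to the dashboard row in this commit, here is a minimal sketch. `render_row`, `STATUS_ICON`, and the explicit `npu1`/`npu2` status arguments are hypothetical names chosen for this example; the actual logic in generate_readme.py, including how it derives status from the lit test, is not shown in this diff and may differ.

```python
# Hypothetical sketch: render_row and STATUS_ICON are illustrative names,
# not the actual generate_readme.py API.
STATUS_ICON = {"supported": "🟢", "wip": "🟡", "unsupported": "⚪"}


def render_row(entry, npu1, npu2):
    """Render one dashboard table row from an entry dict and two status keys."""
    link = f"{entry['path']}/"  # dashboard links end with a trailing slash
    return (
        f"| {entry['category']} | [{entry['name']}]({link}) | {entry['datatypes']} "
        f"| {STATUS_ICON[npu1]} | {STATUS_ICON[npu2]} | [{link}]({link}) |"
    )


# The entry added by this commit, copied from the generate_readme.py hunk.
entry = {
    "category": "Attention",
    "name": "Flash Attention + KV Cache Prefill",
    "path": "flash_attention/kv_cache_prefill",
    "datatypes": "bf16",
}

# Status is passed in by hand here (an assumption for the sketch).
row = render_row(entry, npu1="unsupported", npu2="supported")
print(row)
```

With these inputs the sketch reproduces the row added to dashboard.md for this example (without the leading `+` of the diff).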

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/programming_examples/dashboard.md b/programming_examples/dashboard.md
@@ -0,0 +1,107 @@
+<!-- This file is auto-generated by generate_readme.py. Do not edit manually. -->
+
+# MLIR-AIR Programming Examples
+
+These programming examples demonstrate how to leverage the AIR design flow with mlir-air Python bindings and the mlir-air intermediate representation (IR) to build applications targeting AI Engines on AMD NPUs.
+
+## Operator Dashboard
+
+| Category | Operation | Datatype(s) | NPU1 | NPU2 | Design Example |
+|:---------|:----------|:------------|:----:|:----:|:---------------|
+| Linear Algebra | [Matrix Multiplication](matrix_multiplication/) | bf16, i16, i8 | 🟢 | 🟢 | [matrix_multiplication/](matrix_multiplication/) |
+| Linear Algebra | [Vector-Matrix Multiplication](vector_matrix_multiplication/) | bf16 | 🟢 | 🟢 | [vector_matrix_multiplication/](vector_matrix_multiplication/) |
+| Linear Algebra | [Matrix-Vector Multiplication](matrix_vector_multiplication/bf16/) | bf16 | ⚪ | 🟢 | [matrix_vector_multiplication/bf16/](matrix_vector_multiplication/bf16/) |
+| Linear Algebra | [AXPY](axpy/) | bf16 | 🟢 | 🟢 | [axpy/](axpy/) |
+| Element-wise | [Element-wise Add](eltwise_add/) | f32 | 🟢 | 🟢 | [eltwise_add/](eltwise_add/) |
+| Element-wise | [Element-wise Add (with L2)](eltwise_add_with_l2/) | f32 | 🟢 | 🟢 | [eltwise_add_with_l2/](eltwise_add_with_l2/) |
+| Element-wise | [Element-wise Add (bf16)](primitives/vector_examples/vector_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_add/](primitives/vector_examples/vector_add/) |
+| Element-wise | [Element-wise Mul](primitives/vector_examples/vector_mul/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_mul/](primitives/vector_examples/vector_mul/) |
+| Activation/Math | [SiLU](silu/) | bf16 | ⚪ | 🟢 | [silu/](silu/) |
+| Activation/Math | [GELU](gelu/) | bf16 | ⚪ | 🟢 | [gelu/](gelu/) |
+| Activation/Math | [Softmax](softmax/) | bf16 | 🟢 | 🟢 | [softmax/](softmax/) |
+| Activation/Math | [Sine / Cosine](sine_cosine/) | bf16 | 🟢 | ⚪ | [sine_cosine/](sine_cosine/) |
+| Activation/Math | [RELU](relu/) | bf16 | 🟢 | 🟢 | [relu/](relu/) |
+| Activation/Math | [Leaky RELU](leaky_relu/) | bf16 | 🟢 | 🟢 | [leaky_relu/](leaky_relu/) |
+| Activation/Math | [Sigmoid](sigmoid/) | bf16 | ⚪ | 🟢 | [sigmoid/](sigmoid/) |
+| Activation/Math | [Tanh](primitives/vector_examples/vector_tanh/) | bf16 | ⚪ | 🟢 | [primitives/vector_examples/vector_tanh/](primitives/vector_examples/vector_tanh/) |
+| Normalization | [Layer Normalization](layer_norm/) | bf16 | ⚪ | 🟢 | [layer_norm/](layer_norm/) |
+| Normalization | [RMS Normalization](rms_norm/) | bf16 | ⚪ | 🟢 | [rms_norm/](rms_norm/) |
+| Normalization | [Weighted RMS Normalization](weighted_rms_norm/) | bf16 | ⚪ | 🟢 | [weighted_rms_norm/](weighted_rms_norm/) |
+| Aggregation | [Reduction (Add)](primitives/vector_examples/vector_reduce_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_add/](primitives/vector_examples/vector_reduce_add/) |
+| Pooling | [MaxPool](primitives/vector_examples/vector_reduce_max/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_max/](primitives/vector_examples/vector_reduce_max/) |
+| Pooling | [AveragePool](average_pool/) | bf16 | 🟢 | 🟢 | [average_pool/](average_pool/) |
+| LLM Kernels | [Multi-Head Attention (LLaMA2)](llama2_mha/) | bf16 | 🟢 | ⚪ | [llama2_mha/](llama2_mha/) |
+| LLM Kernels | [SwiGLU](swiglu/) | bf16 | ⚪ | 🟢 | [swiglu/](swiglu/) |
+| LLM Kernels | [FFN SwiGLU (Decode)](ffn_swiglu/decode/) | bf16 | ⚪ | 🟢 | [ffn_swiglu/decode/](ffn_swiglu/decode/) |
+| LLM Kernels | [FFN SwiGLU (Prefill)](ffn_swiglu/prefill/) | bf16 | ⚪ | 🟢 | [ffn_swiglu/prefill/](ffn_swiglu/prefill/) |
+| LLM Kernels | [RoPE (LUT-based)](rope_lut/) | bf16 | ⚪ | 🟢 | [rope_lut/](rope_lut/) |
+| LLM Kernels | [RoPE (On-chip Sin/Cos)](rope_sincos/) | bf16 | 🟢 | 🟢 | [rope_sincos/](rope_sincos/) |
+| Attention | [Flash Attention (Dataflow)](flash_attention/dataflow_based/) | bf16 | 🟢 | 🟢 | [flash_attention/dataflow_based/](flash_attention/dataflow_based/) |
+| Attention | [Flash Attention (Kernel Fusion)](flash_attention/kernel_fusion_based/) | bf16 | 🟢 | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) |
+| Attention | [Grouped Query Attention (GQA)](flash_attention/kernel_fusion_based/) | bf16 | 🟢 | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) |
+| Attention | [Flash Attention + KV Cache Prefill](flash_attention/kv_cache_prefill/) | bf16 | ⚪ | 🟢 | [flash_attention/kv_cache_prefill/](flash_attention/kv_cache_prefill/) |
+| Data Movement | [Passthrough (DMA)](passthrough/passthrough_dma/) | u8, i8, i16, u16, f32, bf16 | 🟢 | 🟢 | [passthrough/passthrough_dma/](passthrough/passthrough_dma/) |
+| Data Movement | [Passthrough (Channel)](passthrough/passthrough_channel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_channel/](passthrough/passthrough_channel/) |
+| Data Movement | [Passthrough (Kernel)](passthrough/passthrough_kernel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_kernel/](passthrough/passthrough_kernel/) |
+| Data Movement | [Shim DMA 2D](shim_dma_2d/) | i32 | 🟢 | 🟢 | [shim_dma_2d/](shim_dma_2d/) |
+| Data Movement | [Data Transfer Transpose](data_transfer_transpose/) | u32 | 🟢 | 🟢 | [data_transfer_transpose/](data_transfer_transpose/) |
+| Data Movement | [Transpose (bf16)](data_transfer_transpose/dma_bf16/) | bf16 | ⚪ | 🟢 | [data_transfer_transpose/dma_bf16/](data_transfer_transpose/dma_bf16/) |
+| Data Movement | [Matrix Scalar Add](matrix_scalar_add/) | i32 | 🟢 | 🟢 | [matrix_scalar_add/](matrix_scalar_add/) |
+| Communication | [Channel Examples](channel_examples/) | i32 | 🟢 | 🟢 | [channel_examples/](channel_examples/) |
+| Communication | [3D Channel with Segment Unroll](channel_examples/channel_3d_segment_unroll/) | i32 | ⚪ | 🟢 | [channel_examples/channel_3d_segment_unroll/](channel_examples/channel_3d_segment_unroll/) |
+| Communication | [Broadcast Selective Capture](channel_examples/broadcast_selective_capture/) | i32 | 🟢 | 🟢 | [channel_examples/broadcast_selective_capture/](channel_examples/broadcast_selective_capture/) |
+| Communication | [Multi-Segment Examples](multi_segment/) | i32 | 🟡 | 🟡 | [multi_segment/](multi_segment/) |
+| Communication | [Cascade Reduction](cascade_reduction/) | i32 | 🟢 | 🟢 | [cascade_reduction/](cascade_reduction/) |
+| Memory | [Segment Alloc](segment_alloc/) | i32 | 🟢 | 🟢 | [segment_alloc/](segment_alloc/) |
+| Spatial | [Segment Unroll](segment_unroll/) | i32 | 🟢 | 🟢 | [segment_unroll/](segment_unroll/) |
+| Dataflow | [Herd Dataflow](herd_dataflow/) | bf16 | 🟢 | 🟢 | [herd_dataflow/](herd_dataflow/) |
+| Control Flow | [Conditional Branching](conditional_branching/) | i32 | 🟢 | 🟢 | [conditional_branching/](conditional_branching/) |
+| CNN | [2D Convolution](conv2d/) | i32 | 🟢 | 🟢 | [conv2d/](conv2d/) |
+| CNN | [Bottleneck](bottleneck/) | bf16 | 🟢 | 🟢 | [bottleneck/](bottleneck/) |
+| ML Pipeline | [MNIST-FC (Broadcast Bias Add)](mnist_fc/broadcast_bias_add/) | f32 | ⚪ | 🟢 | [mnist_fc/broadcast_bias_add/](mnist_fc/broadcast_bias_add/) |
+| ML Pipeline | [MNIST-FC (ReLU 2D)](mnist_fc/relu/) | f32/bf16 | ⚪ | 🟢 | [mnist_fc/relu/](mnist_fc/relu/) |
+| ML Pipeline | [MNIST-FC (Argmax)](mnist_fc/argmax/) | f32→i32 | ⚪ | 🟢 | [mnist_fc/argmax/](mnist_fc/argmax/) |
+| ML Pipeline | [MNIST-FC (Integration)](mnist_fc/integration/) | f32 | ⚪ | 🟢 | [mnist_fc/integration/](mnist_fc/integration/) |
+| Memory | [Shared L1 Buffer](shared_l1/) | bf16 | 🟢 | ⚪ | [shared_l1/](shared_l1/) |
+| Quantization | [Dequant (AWQ int4→bf16)](dequant_awq/) | int4/bf16 | ⚪ | 🟢 | [dequant_awq/](dequant_awq/) |
+| Primitives | [Scalar/Vector Operations](primitives/) | various | 🟢 | 🟢 | [primitives/](primitives/) |
+
+### Status Legend
+
+- 🟢 Supported and tested
+- 🟡 Work in progress
+- ⚪ Not yet supported
+
+**NPU1** = AMD Ryzen AI (Phoenix, AIE2) &nbsp;&nbsp; **NPU2** = AMD Ryzen AI (Strix, AIE2P)
+
+## Getting Started
+
+See the top-level [README](../README.md) for environment setup and build instructions. Once your environment is configured:
+
+```bash
+# Example: run matrix multiplication (bf16, 4x4 herd, 512x512x512)
+cd matrix_multiplication/bf16
+make run4x4
+
+# Print generated MLIR without running
+make print
+```
+
+Most examples with a `Makefile` support `make run` (compile and execute on hardware) and `make print` (generate MLIR only). Examples without a Makefile can be run directly with Python:
+
+```bash
+python3 run.py                    # compile and run (XRTRunner)
+python3 run.py --print-module-only  # print IR only
+```
+
+## Benchmarking
+
+The [matrix multiplication](matrix_multiplication/) examples include sweep infrastructure for measuring end-to-end latency across problem sizes:
+
+```bash
+cd matrix_multiplication/bf16
+make sweep4x4    # sweep problem sizes 256-2048 with a 4x4 herd
+make profile     # profile a single 1024^3 problem on hardware
+```
+
+Sweep results are saved as CSV files for analysis. See the [bf16 README](matrix_multiplication/bf16/README.md) for details on tile size configuration and architecture selection.
diff --git a/programming_examples/generate_readme.py b/programming_examples/generate_readme.py
@@ -216,6 +216,12 @@
         "path": "flash_attention/kernel_fusion_based",
         "datatypes": "bf16",
     },
+    {
+        "category": "Attention",
+        "name": "Flash Attention + KV Cache Prefill",
+        "path": "flash_attention/kv_cache_prefill",
+        "datatypes": "bf16",
+    },
     {
         "category": "Data Movement",
         "name": "Passthrough (DMA)",