| 1 | +<!-- This file is auto-generated by generate_readme.py. Do not edit manually. --> |
| 2 | + |
| 3 | +# MLIR-AIR Programming Examples |
| 4 | + |
| 5 | +These programming examples demonstrate how to leverage the AIR design flow with mlir-air Python bindings and the mlir-air intermediate representation (IR) to build applications targeting AI Engines on AMD NPUs. |
| 6 | + |
| 7 | +## Operator Dashboard |
| 8 | + |
| 9 | +| Category | Operation | Datatype(s) | NPU1 | NPU2 | Design Example | |
| 10 | +|:---------|:----------|:------------|:----:|:----:|:---------------| |
| 11 | +| Linear Algebra | [Matrix Multiplication](matrix_multiplication/) | bf16, i16, i8 | 🟢 | 🟢 | [matrix_multiplication/](matrix_multiplication/) | |
| 12 | +| Linear Algebra | [Vector-Matrix Multiplication](vector_matrix_multiplication/) | bf16 | 🟢 | 🟢 | [vector_matrix_multiplication/](vector_matrix_multiplication/) | |
| 13 | +| Linear Algebra | [Matrix-Vector Multiplication](matrix_vector_multiplication/bf16/) | bf16 | ⚪ | 🟢 | [matrix_vector_multiplication/bf16/](matrix_vector_multiplication/bf16/) | |
| 14 | +| Linear Algebra | [AXPY](axpy/) | bf16 | 🟢 | 🟢 | [axpy/](axpy/) | |
| 15 | +| Element-wise | [Element-wise Add](eltwise_add/) | f32 | 🟢 | 🟢 | [eltwise_add/](eltwise_add/) | |
| 16 | +| Element-wise | [Element-wise Add (with L2)](eltwise_add_with_l2/) | f32 | 🟢 | 🟢 | [eltwise_add_with_l2/](eltwise_add_with_l2/) | |
| 17 | +| Element-wise | [Element-wise Add (bf16)](primitives/vector_examples/vector_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_add/](primitives/vector_examples/vector_add/) | |
| 18 | +| Element-wise | [Element-wise Mul](primitives/vector_examples/vector_mul/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_mul/](primitives/vector_examples/vector_mul/) | |
| 19 | +| Activation/Math | [SiLU](silu/) | bf16 | ⚪ | 🟢 | [silu/](silu/) | |
| 20 | +| Activation/Math | [GELU](gelu/) | bf16 | ⚪ | 🟢 | [gelu/](gelu/) | |
| 21 | +| Activation/Math | [Softmax](softmax/) | bf16 | 🟢 | 🟢 | [softmax/](softmax/) | |
| 22 | +| Activation/Math | [Sine / Cosine](sine_cosine/) | bf16 | 🟢 | ⚪ | [sine_cosine/](sine_cosine/) | |
| 23 | +| Activation/Math | [ReLU](relu/) | bf16 | 🟢 | 🟢 | [relu/](relu/) |
| 24 | +| Activation/Math | [Leaky ReLU](leaky_relu/) | bf16 | 🟢 | 🟢 | [leaky_relu/](leaky_relu/) |
| 25 | +| Activation/Math | [Sigmoid](sigmoid/) | bf16 | ⚪ | 🟢 | [sigmoid/](sigmoid/) | |
| 26 | +| Activation/Math | [Tanh](primitives/vector_examples/vector_tanh/) | bf16 | ⚪ | 🟢 | [primitives/vector_examples/vector_tanh/](primitives/vector_examples/vector_tanh/) | |
| 27 | +| Normalization | [Layer Normalization](layer_norm/) | bf16 | ⚪ | 🟢 | [layer_norm/](layer_norm/) | |
| 28 | +| Normalization | [RMS Normalization](rms_norm/) | bf16 | ⚪ | 🟢 | [rms_norm/](rms_norm/) | |
| 29 | +| Normalization | [Weighted RMS Normalization](weighted_rms_norm/) | bf16 | ⚪ | 🟢 | [weighted_rms_norm/](weighted_rms_norm/) | |
| 30 | +| Aggregation | [Reduction (Add)](primitives/vector_examples/vector_reduce_add/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_add/](primitives/vector_examples/vector_reduce_add/) | |
| 31 | +| Pooling | [MaxPool](primitives/vector_examples/vector_reduce_max/) | bf16 | 🟢 | 🟢 | [primitives/vector_examples/vector_reduce_max/](primitives/vector_examples/vector_reduce_max/) | |
| 32 | +| Pooling | [AveragePool](average_pool/) | bf16 | 🟢 | 🟢 | [average_pool/](average_pool/) | |
| 33 | +| LLM Kernels | [Multi-Head Attention (LLaMA2)](llama2_mha/) | bf16 | 🟢 | ⚪ | [llama2_mha/](llama2_mha/) | |
| 34 | +| LLM Kernels | [SwiGLU](swiglu/) | bf16 | ⚪ | 🟢 | [swiglu/](swiglu/) | |
| 35 | +| LLM Kernels | [FFN SwiGLU (Decode)](ffn_swiglu/decode/) | bf16 | ⚪ | 🟢 | [ffn_swiglu/decode/](ffn_swiglu/decode/) | |
| 36 | +| LLM Kernels | [FFN SwiGLU (Prefill)](ffn_swiglu/prefill/) | bf16 | ⚪ | 🟢 | [ffn_swiglu/prefill/](ffn_swiglu/prefill/) | |
| 37 | +| LLM Kernels | [RoPE (LUT-based)](rope_lut/) | bf16 | ⚪ | 🟢 | [rope_lut/](rope_lut/) | |
| 38 | +| LLM Kernels | [RoPE (On-chip Sin/Cos)](rope_sincos/) | bf16 | 🟢 | 🟢 | [rope_sincos/](rope_sincos/) | |
| 39 | +| Attention | [Flash Attention (Dataflow)](flash_attention/dataflow_based/) | bf16 | 🟢 | 🟢 | [flash_attention/dataflow_based/](flash_attention/dataflow_based/) | |
| 40 | +| Attention | [Flash Attention (Kernel Fusion)](flash_attention/kernel_fusion_based/) | bf16 | 🟢 | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) | |
| 41 | +| Attention | [Grouped Query Attention (GQA)](flash_attention/kernel_fusion_based/) | bf16 | 🟢 | 🟢 | [flash_attention/kernel_fusion_based/](flash_attention/kernel_fusion_based/) | |
| 42 | +| Attention | [Flash Attention + KV Cache Prefill](flash_attention/kv_cache_prefill/) | bf16 | ⚪ | 🟢 | [flash_attention/kv_cache_prefill/](flash_attention/kv_cache_prefill/) | |
| 43 | +| Data Movement | [Passthrough (DMA)](passthrough/passthrough_dma/) | u8, i8, i16, u16, f32, bf16 | 🟢 | 🟢 | [passthrough/passthrough_dma/](passthrough/passthrough_dma/) | |
| 44 | +| Data Movement | [Passthrough (Channel)](passthrough/passthrough_channel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_channel/](passthrough/passthrough_channel/) | |
| 45 | +| Data Movement | [Passthrough (Kernel)](passthrough/passthrough_kernel/) | u8 | 🟢 | 🟢 | [passthrough/passthrough_kernel/](passthrough/passthrough_kernel/) | |
| 46 | +| Data Movement | [Shim DMA 2D](shim_dma_2d/) | i32 | 🟢 | 🟢 | [shim_dma_2d/](shim_dma_2d/) | |
| 47 | +| Data Movement | [Data Transfer Transpose](data_transfer_transpose/) | u32 | 🟢 | 🟢 | [data_transfer_transpose/](data_transfer_transpose/) | |
| 48 | +| Data Movement | [Transpose (bf16)](data_transfer_transpose/dma_bf16/) | bf16 | ⚪ | 🟢 | [data_transfer_transpose/dma_bf16/](data_transfer_transpose/dma_bf16/) | |
| 49 | +| Data Movement | [Matrix Scalar Add](matrix_scalar_add/) | i32 | 🟢 | 🟢 | [matrix_scalar_add/](matrix_scalar_add/) | |
| 50 | +| Communication | [Channel Examples](channel_examples/) | i32 | 🟢 | 🟢 | [channel_examples/](channel_examples/) | |
| 51 | +| Communication | [3D Channel with Segment Unroll](channel_examples/channel_3d_segment_unroll/) | i32 | ⚪ | 🟢 | [channel_examples/channel_3d_segment_unroll/](channel_examples/channel_3d_segment_unroll/) | |
| 52 | +| Communication | [Broadcast Selective Capture](channel_examples/broadcast_selective_capture/) | i32 | 🟢 | 🟢 | [channel_examples/broadcast_selective_capture/](channel_examples/broadcast_selective_capture/) | |
| 53 | +| Communication | [Multi-Segment Examples](multi_segment/) | i32 | 🟡 | 🟡 | [multi_segment/](multi_segment/) | |
| 54 | +| Communication | [Cascade Reduction](cascade_reduction/) | i32 | 🟢 | 🟢 | [cascade_reduction/](cascade_reduction/) | |
| 55 | +| Memory | [Segment Alloc](segment_alloc/) | i32 | 🟢 | 🟢 | [segment_alloc/](segment_alloc/) | |
| 56 | +| Spatial | [Segment Unroll](segment_unroll/) | i32 | 🟢 | 🟢 | [segment_unroll/](segment_unroll/) | |
| 57 | +| Dataflow | [Herd Dataflow](herd_dataflow/) | bf16 | 🟢 | 🟢 | [herd_dataflow/](herd_dataflow/) | |
| 58 | +| Control Flow | [Conditional Branching](conditional_branching/) | i32 | 🟢 | 🟢 | [conditional_branching/](conditional_branching/) | |
| 59 | +| CNN | [2D Convolution](conv2d/) | i32 | 🟢 | 🟢 | [conv2d/](conv2d/) | |
| 60 | +| CNN | [Bottleneck](bottleneck/) | bf16 | 🟢 | 🟢 | [bottleneck/](bottleneck/) | |
| 61 | +| ML Pipeline | [MNIST-FC (Broadcast Bias Add)](mnist_fc/broadcast_bias_add/) | f32 | ⚪ | 🟢 | [mnist_fc/broadcast_bias_add/](mnist_fc/broadcast_bias_add/) | |
| 62 | +| ML Pipeline | [MNIST-FC (ReLU 2D)](mnist_fc/relu/) | f32/bf16 | ⚪ | 🟢 | [mnist_fc/relu/](mnist_fc/relu/) | |
| 63 | +| ML Pipeline | [MNIST-FC (Argmax)](mnist_fc/argmax/) | f32→i32 | ⚪ | 🟢 | [mnist_fc/argmax/](mnist_fc/argmax/) | |
| 64 | +| ML Pipeline | [MNIST-FC (Integration)](mnist_fc/integration/) | f32 | ⚪ | 🟢 | [mnist_fc/integration/](mnist_fc/integration/) | |
| 65 | +| Memory | [Shared L1 Buffer](shared_l1/) | bf16 | 🟢 | ⚪ | [shared_l1/](shared_l1/) | |
| 66 | +| Quantization | [Dequant (AWQ int4→bf16)](dequant_awq/) | int4/bf16 | ⚪ | 🟢 | [dequant_awq/](dequant_awq/) | |
| 67 | +| Primitives | [Scalar/Vector Operations](primitives/) | various | 🟢 | 🟢 | [primitives/](primitives/) | |
| 68 | + |
| 69 | +### Status Legend |
| 70 | + |
| 71 | +- 🟢 Supported and tested |
| 72 | +- 🟡 Work in progress |
| 73 | +- ⚪ Not yet supported |
| 74 | + |
| 75 | +**NPU1** = AMD Ryzen AI (Phoenix, AIE2); **NPU2** = AMD Ryzen AI (Strix, AIE2P) |
| 76 | + |
| 77 | +## Getting Started |
| 78 | + |
| 79 | +See the top-level [README](../README.md) for environment setup and build instructions. Once your environment is configured: |
| 80 | + |
| 81 | +```bash |
| 82 | +# Example: run matrix multiplication (bf16, 4x4 herd, 512x512x512) |
| 83 | +cd matrix_multiplication/bf16 |
| 84 | +make run4x4 |
| 85 | + |
| 86 | +# Print generated MLIR without running |
| 87 | +make print |
| 88 | +``` |
| 89 | + |
| 90 | +Most examples with a `Makefile` support `make run` (compile and execute on hardware) and `make print` (generate MLIR only). Examples without a `Makefile` can be run directly with Python: |
| 91 | + |
| 92 | +```bash |
| 93 | +python3 run.py # compile and run (XRTRunner) |
| 94 | +python3 run.py --print-module-only # print IR only |
| 95 | +``` |
| 96 | + |
| 97 | +## Benchmarking |
| 98 | + |
| 99 | +The [matrix multiplication](matrix_multiplication/) examples include sweep infrastructure for measuring end-to-end latency across problem sizes: |
| 100 | + |
| 101 | +```bash |
| 102 | +cd matrix_multiplication/bf16 |
| 103 | +make sweep4x4 # sweep problem sizes 256-2048 with a 4x4 herd |
| 104 | +make profile # profile a single 1024^3 problem on hardware |
| 105 | +``` |
| 106 | + |
| 107 | +Sweep results are saved as CSV files for analysis. See the [bf16 README](matrix_multiplication/bf16/README.md) for details on tile size configuration and architecture selection. |
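
As a starting point for analyzing the sweep output, the following is a minimal sketch that averages latency per problem size and converts it to throughput. The CSV layout is not documented here, so the column names `M`, `K`, `N`, and `latency_us` are assumptions, as is the helper name `summarize_sweep`; adjust them to match the header of the files the sweep actually writes. The demo rows are synthetic placeholders, not measured results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize_sweep(csv_file):
    """Return {(M, K, N): (mean latency in us, GFLOP/s)} for a sweep CSV.

    Assumed (hypothetical) columns: M, K, N, latency_us.
    """
    # Group every latency sample by its problem shape.
    samples = defaultdict(list)
    for row in csv.DictReader(csv_file):
        shape = (int(row["M"]), int(row["K"]), int(row["N"]))
        samples[shape].append(float(row["latency_us"]))

    summary = {}
    for (m, k, n), latencies in samples.items():
        avg_us = mean(latencies)
        # A matmul performs 2*M*K*N floating-point operations;
        # ops / (us * 1e3) yields GFLOP/s.
        gflops = (2 * m * k * n) / (avg_us * 1e3)
        summary[(m, k, n)] = (avg_us, gflops)
    return summary


if __name__ == "__main__":
    # Synthetic rows for illustration only; replace with open("<sweep>.csv").
    demo = io.StringIO(
        "M,K,N,latency_us\n"
        "512,512,512,400.0\n"
        "512,512,512,600.0\n"
        "1024,1024,1024,4000.0\n"
    )
    for shape, (avg_us, gflops) in sorted(summarize_sweep(demo).items()):
        print(shape, f"{avg_us:.1f} us", f"{gflops:.1f} GFLOP/s")
```

To use it on real data, pass an open file handle instead of the in-memory demo, for example `summarize_sweep(open(path, newline=""))`, where `path` points at one of the generated CSV files.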