Description
Benchmark target numbers
First up, some assumptions and theoretical limits. To do a 512x512x512 matmul on a single core:
phoenix, bf16, with clock at 1.6e9 (see #1167 (comment) and Xilinx/mlir-aie#2017 (comment)):
512*512*512 / (4*4*8 * 1.6e9) = 655 microseconds
strix, i8, with clock at 1e9:
512*512*512 / (8*8*8 * 1.0e9) = 262 microseconds
To do the above 100 times (this is what the benchmark does; see here):
phoenix : 65'500 [us]
strix: 26'200 [us]
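As a cross-check, here is a minimal Python sketch of the arithmetic above (the MACs-per-cycle throughputs and clock rates are the assumptions stated above; the small difference from the rounded 65'500/26'200 figures comes from multiplying before rounding):

```python
# Theoretical lower bound for 100 runs of a 512x512x512 matmul on one AIE core.
M = N = K = 512
RUNS = 100

# (MACs per cycle, clock in Hz), per the assumptions above.
targets = {
    "phoenix, bf16": (4 * 4 * 8, 1.6e9),
    "strix, i8": (8 * 8 * 8, 1.0e9),
}

for name, (macs_per_cycle, clock_hz) in targets.items():
    t_us = (M * N * K) / (macs_per_cycle * clock_hz) * 1e6
    print(f"{name}: {t_us:.0f} us per matmul, {t_us * RUNS:,.0f} us for {RUNS} runs")
```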
Benchmark performance:
Looking at https://nod-ai.github.io/iree-amd-aie/results_history_npu1.html we see:
phoenix, direct codegen: 227'000 [us] : 29% of peak.
phoenix, direct codegen, with unroll and jam: 172'000 [us] : 38% of peak (see #1167)
phoenix, chess ukernel: 152'000 [us]: 43% of peak
strix, https://nod-ai.github.io/iree-amd-aie/results_history_npu4.html:
chess ukernel: 76'000 [us] : 34% of peak
peano ukernel: 97'000 [us] : 27% of peak
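The percentages are just the theoretical floor divided by the measured time; the same sketch extended to the measured numbers (times copied from the results-history pages linked above):

```python
# Percent of theoretical peak = (peak time for 100 matmuls) / (measured time).
peak_us = {"phoenix": 65_536, "strix": 26_214}

measured_us = [
    ("phoenix", "direct codegen", 227_000),
    ("phoenix", "direct codegen + unroll and jam", 172_000),
    ("phoenix", "chess ukernel", 152_000),
    ("strix", "chess ukernel", 76_000),
    ("strix", "peano ukernel", 97_000),
]

for target, variant, t_us in measured_us:
    print(f"{target}, {variant}: {100 * peak_us[target] / t_us:.0f}% of peak")
```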
Summary
- If my analysis is correct, microkernel performance needs improvement: bf16 on phoenix should achieve better than 43% of peak, and similarly for the other configurations.
- With unroll and jam on phoenix, direct codegen comes within ~13% of ukernel performance (172'000 vs 152'000 [us]); it should also be improved.