Matmul core performance for [strix, phoenix] x [peano codegen, peano ukernel, chess ukernel] #1198

@newling

Benchmark target numbers

First, some assumptions and theoretical limits. To do a 512x512x512 matmul on a single core:

phoenix, bf16, with the clock at 1.6 GHz (see #1167 (comment) and Xilinx/mlir-aie#2017 (comment)) and 4*4*8 MACs per cycle:
512*512*512 / (4*4*8 * 1.6e9) = 655 microseconds

strix, i8, with the clock at 1 GHz and 8*8*8 MACs per cycle:
512*512*512 / (8*8*8 * 1.0e9) = 262 microseconds

To do the above 100 times (this is what the benchmark does, see here):
phoenix: 65'500 [us]
strix: 26'200 [us]
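
A minimal Python sketch of this lower-bound arithmetic (the function name is just illustrative; the MACs-per-cycle figures and clock rates are the ones quoted above):

```python
# Sketch of the single-core lower-bound arithmetic above.
# MACs-per-cycle figures (4*4*8 for bf16 on phoenix, 8*8*8 for i8 on
# strix) and clock rates are the ones quoted in this issue.

def matmul_lower_bound_us(M, K, N, macs_per_cycle, clock_hz):
    """Minimum time in microseconds to perform M*K*N MACs at peak rate."""
    return M * K * N / (macs_per_cycle * clock_hz) * 1e6

# phoenix, bf16: 4*4*8 = 128 MACs/cycle at 1.6 GHz -> ~655 us
print(matmul_lower_bound_us(512, 512, 512, 4 * 4 * 8, 1.6e9))

# strix, i8: 8*8*8 = 512 MACs/cycle at 1.0 GHz -> ~262 us
print(matmul_lower_bound_us(512, 512, 512, 8 * 8 * 8, 1.0e9))
```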

Benchmark performance:

Looking at https://nod-ai.github.io/iree-amd-aie/results_history_npu1.html, we see:
phoenix, direct codegen: 227'000 [us] : 29% of peak
phoenix, direct codegen, with unroll-and-jam: 172'000 [us] : 38% of peak (see #1167)
phoenix, chess ukernel: 152'000 [us] : 43% of peak

For strix (https://nod-ai.github.io/iree-amd-aie/results_history_npu4.html):

chess ukernel: 76'000 [us] : 34% of peak
peano ukernel: 97'000 [us] : 27% of peak
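
The same arithmetic for the %-of-peak figures, as a sketch using the lower bounds and benchmark timings quoted above:

```python
# Sketch of the "% of peak" arithmetic: the single-core lower bounds
# (x100 repetitions) divided by the measured benchmark times above.

phoenix_peak_us = 655.36 * 100   # ~65'500 us for 100 matmuls
strix_peak_us = 262.144 * 100    # ~26'200 us for 100 matmuls

measured_us = {
    ("phoenix", "direct codegen"): 227_000,
    ("phoenix", "direct codegen + unroll-and-jam"): 172_000,
    ("phoenix", "chess ukernel"): 152_000,
    ("strix", "chess ukernel"): 76_000,
    ("strix", "peano ukernel"): 97_000,
}

for (device, config), us in measured_us.items():
    peak_us = phoenix_peak_us if device == "phoenix" else strix_peak_us
    print(f"{device}, {config}: {100 * peak_us / us:.0f}% of peak")
```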

Summary

  • If my analysis is correct, microkernel performance needs improvement: bf16 on phoenix should be able to do better than 43% of peak, for example.
  • With unroll-and-jam on phoenix, direct codegen is within 10% of ukernel performance; it should also be improved.
