[Arm] Enable Gather MatMul with KleidiAI Microkernels #34303
abhijain1204fujitsu wants to merge 3 commits into openvinotoolkit:master from
Conversation
```cpp
#else
    ov::element::Type getRuntimePrecision() const override;
    Algorithm algorithm = Algorithm::GatherMatmulDefault;
    size_t numExperts = 0;

    std::vector<ExecutorPtr> executor;
    std::vector<MemoryArgs> memArgsFC;

    MemoryPtr m_weightsMemory = nullptr;
    MemoryPtr m_tmpInpBuffer = nullptr;
    MemoryDescPtr m_tmpInputDesc = nullptr;
    MemoryDescPtr m_tmpOutputDesc = nullptr;

#endif
```
Some fields are clearly duplicated between if and else branches. Should we narrow the scope?
I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.
```cpp
        continue;
    }

    parallel_for(num_valid_rows, [&](size_t m) {
```
It's better to use the CpuParallel class in such contexts, to align the implementation with the x64 approach.
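The `parallel_for(num_valid_rows, ...)` call above distributes independent output rows across threads. A minimal self-contained sketch of the same pattern, using a hypothetical `parallel_for_rows` stand-in (OpenVINO's real `parallel_for` helper and the `CpuParallel` class the reviewer mentions live in the CPU plugin and are not reproduced here):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for a parallel_for(n, lambda) primitive: splits the index
// range [0, n) across hardware threads in a strided fashion.
// Hypothetical helper, not OpenVINO's actual implementation.
inline void parallel_for_rows(size_t n, const std::function<void(size_t)>& body) {
    size_t workers = std::max<size_t>(1, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w]() {
            for (size_t m = w; m < n; m += workers) {
                body(m);
            }
        });
    }
    for (auto& t : pool) {
        t.join();
    }
}

// Per-row matrix-vector product: out[m] = dot(src[m], w). Each output
// row is independent, so rows can be processed in parallel, as in the
// parallel_for(num_valid_rows, ...) loop from the snippet above.
inline std::vector<float> rowwise_matvec(const std::vector<std::vector<float>>& src,
                                         const std::vector<float>& w) {
    std::vector<float> out(src.size(), 0.0f);
    parallel_for_rows(src.size(), [&](size_t m) {
        float acc = 0.0f;
        for (size_t k = 0; k < w.size(); ++k) {
            acc += src[m][k] * w[k];
        }
        out[m] = acc;
    });
    return out;
}
```

The key property that makes either `parallel_for` or `CpuParallel` applicable is that no two rows share mutable state, so no synchronization is needed inside the loop body.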
Do we really need to keep `keep_dims=true` followed by a Squeeze operation under a define?
It looks like it wouldn't be a problem to support this variation in a generic fashion, just to reduce the code complexity.
Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?
Hi @maxnick. Thanks for the comment. I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file, "gathermatmul.cpp", and reused the existing x86 code. I will update this PR with the new refactored logic in the coming week once it is approved internally, and will move the relevant tests to the common scope as well.
[ About ]
Enable the GatherMatmul op-fusion transformation on ARM at the IR level.
Support a KleidiAI implementation of the GatherMatmul operation.
Fix a bug in the KleidiAI execution of MatMul in F32 precision.
[ Background ]
GatherMatmul is a specialized MatMul operation in which the weights are gathered before multiplication [ used in MoE architectures ].
In the OSS version, the implementation is based on oneDNN GEMV execution, which falls short in two aspects:
GEMV is less optimized than GEMM for the prefill phase, and for the decode phase when batch sizes are higher
Low-precision int8/int4 is not supported
We address both points with a KleidiAI-based GEMM implementation. Only F32 is supported in this PR; a subsequent PR will add low-precision support.
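The gather-then-matmul pattern described above can be sketched in plain C++. All names here are illustrative, not the OpenVINO node's actual API: for each input row, an expert id selects one weight matrix from a stack, and the row is then multiplied with the gathered matrix, as in an MoE layer:

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of GatherMatmul semantics (hypothetical signature).
//   acts:       [M x K] activation rows
//   experts:    [E x K x N] stack of per-expert weight matrices
//   expert_ids: [M] index of the expert matrix to gather for each row
// Returns the [M x N] result of multiplying each row by its gathered
// expert weights.
std::vector<std::vector<float>> gather_matmul(
        const std::vector<std::vector<float>>& acts,
        const std::vector<std::vector<std::vector<float>>>& experts,
        const std::vector<size_t>& expert_ids) {
    const size_t M = acts.size();
    const size_t K = acts.empty() ? 0 : acts[0].size();
    const size_t N = experts.empty() ? 0 : experts[0][0].size();
    std::vector<std::vector<float>> out(M, std::vector<float>(N, 0.0f));
    for (size_t m = 0; m < M; ++m) {
        const auto& W = experts[expert_ids[m]];  // gather step
        for (size_t n = 0; n < N; ++n) {
            for (size_t k = 0; k < K; ++k) {
                out[m][n] += acts[m][k] * W[k][n];  // matmul step
            }
        }
    }
    return out;
}
```

A GEMV formulation handles each gathered row-by-matrix product one row at a time; batching rows that share an expert into a single GEMM is what lets a KleidiAI microkernel amortize weight loads when batch sizes grow.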
[ Benchmark Results ]
[ Accuracy ]
This work is contributed by @ashwins990 and @abhijain1204fujitsu