[Arm] Enable Gather MatMul with KleidiAI Microkernels #34303

Open
abhijain1204fujitsu wants to merge 3 commits into openvinotoolkit:master from MonakaResearch:GatherMatmul-on-ARM-with-Kleidiai

@abhijain1204fujitsu commented Feb 25, 2026

[ About ]

  • Enable the GatherMatmul op-fusion transformation on ARM at the IR level.
  • Support a KleidiAI implementation of the GatherMatmul operation.
  • Fix a bug in KleidiAI execution of MatMul in F32 precision.

[ Background ]
GatherMatmul is a specialized matmul operation in which the weights are gathered before multiplication (used in MoE architectures).
In the OSS version, the implementation is based on oneDNN GEMV execution, which falls short in two respects:
  • GEMV is less optimized than GEMM for the prefill phase, and for the decode phase at larger batch sizes.
  • Low-precision (int8/int4) weights are not supported.
We address both points with a KleidiAI-based GEMM implementation. This PR supports only F32; a subsequent PR will add low-precision support.
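For readers unfamiliar with the operation, here is a minimal reference sketch of the idea, not the code in this PR. It assumes a simplified case: one expert per token, row-major buffers, activations of shape [T, K], and weights of shape [E, N, K]. The function name `gather_matmul_ref` and all parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using std::size_t;

// Hypothetical reference (assumed layout, not the OpenVINO API):
//   x          : [T, K]    one activation row per token
//   weights    : [E, N, K] one weight matrix per expert
//   expert_ids : [T]       expert selected for each token
// Returns out : [T, N], where out[t, :] = weights[expert_ids[t]] * x[t, :].
std::vector<float> gather_matmul_ref(const std::vector<float>& x,
                                     const std::vector<float>& weights,
                                     const std::vector<int>& expert_ids,
                                     size_t K, size_t N) {
    const size_t T = expert_ids.size();
    assert(x.size() == T * K);
    std::vector<float> out(T * N, 0.0f);
    for (size_t t = 0; t < T; ++t) {
        // "Gather": pick the weight matrix of the expert chosen for this token.
        const float* w = weights.data() + static_cast<size_t>(expert_ids[t]) * N * K;
        const float* row = x.data() + t * K;
        // GEMV: one matrix-vector product per token.
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                acc += row[k] * w[n * K + k];
            }
            out[t * N + n] = acc;
        }
    }
    return out;
}
```

Each token triggers its own matrix-vector product, which is why a per-token GEMV path leaves performance on the table once many tokens share an expert.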

[Benchmark Results]

[Image: benchmark results]
  • TTFT improves by more than 30%.
  • TPOT is expected to improve at larger batch sizes.

Because we gather and scatter rows in order to run GEMM, there is additional overhead compared with GEMV. From the TTFT results we infer that this approach performs better at larger batch sizes.
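The gather, GEMM, scatter flow described above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: one expert per token, row-major buffers, and a naive `gemm_ref` standing in for an optimized microkernel. All names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using std::size_t;

// Naive row-major GEMM stand-in: C[M, N] = A[M, K] * B[N, K]^T.
// A real implementation would call an optimized microkernel here.
static void gemm_ref(const float* A, const float* B, float* C,
                     size_t M, size_t N, size_t K) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) acc += A[m * K + k] * B[n * K + k];
            C[m * N + n] = acc;
        }
    }
}

// x: [T, K], weights: [E, N, K], expert_ids: [T]; returns out: [T, N].
std::vector<float> gather_gemm_scatter(const std::vector<float>& x,
                                       const std::vector<float>& weights,
                                       const std::vector<int>& expert_ids,
                                       size_t num_experts, size_t K, size_t N) {
    const size_t T = expert_ids.size();
    assert(x.size() == T * K && weights.size() == num_experts * N * K);
    std::vector<float> out(T * N, 0.0f);
    for (size_t e = 0; e < num_experts; ++e) {
        // Collect the token rows routed to expert e.
        std::vector<size_t> rows;
        for (size_t t = 0; t < T; ++t) {
            if (static_cast<size_t>(expert_ids[t]) == e) rows.push_back(t);
        }
        if (rows.empty()) continue;
        // Gather: pack those rows into one contiguous [M, K] buffer.
        std::vector<float> packed(rows.size() * K);
        for (size_t i = 0; i < rows.size(); ++i) {
            std::copy_n(x.data() + rows[i] * K, K, packed.data() + i * K);
        }
        // One GEMM per expert instead of one GEMV per token.
        std::vector<float> result(rows.size() * N);
        gemm_ref(packed.data(), weights.data() + e * N * K, result.data(),
                 rows.size(), N, K);
        // Scatter: copy each result row back to its token's position.
        for (size_t i = 0; i < rows.size(); ++i) {
            std::copy_n(result.data() + i * N, N, out.data() + rows[i] * N);
        }
    }
    return out;
}
```

The two copy loops are the extra cost relative to GEMV. With only a few rows per expert they can dominate, while with many rows per expert the batched GEMM is more likely to amortize them.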

[ Accuracy ]

  • [ f32 model ] -> tested okay
  • [ lowp model ] [ decompression path ]
      • [ Dynamic_quantization_group_size = UINT64_MAX ] -> tested okay
      • [ only inference precision set to F32 ] -> incorrect output

This work is contributed by @ashwins990 and @abhijain1204fujitsu

@abhijain1204fujitsu abhijain1204fujitsu requested review from a team as code owners February 25, 2026 03:14
@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Feb 25, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Feb 25, 2026
@alvoron alvoron self-assigned this Mar 4, 2026
@maxnick maxnick modified the milestones: 2026.0, 2026.1 Mar 4, 2026
@maxnick maxnick self-assigned this Mar 4, 2026
Comment on lines +80 to +94
#else

ov::element::Type getRuntimePrecision() const override;
Algorithm algorithm = Algorithm::GatherMatmulDefault;
size_t numExperts = 0;

std::vector<ExecutorPtr> executor;
std::vector<MemoryArgs> memArgsFC;

MemoryPtr m_weightsMemory = nullptr;
MemoryPtr m_tmpInpBuffer = nullptr;
MemoryDescPtr m_tmpInputDesc = nullptr;
MemoryDescPtr m_tmpOutputDesc = nullptr;

#endif

Some fields are clearly duplicated between the #if and #else branches. Should we narrow the scope of the conditional block?


I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.

continue;
}

parallel_for(num_valid_rows, [&](size_t m) {

It's better to use the CpuParallel class in such contexts, to align the implementation with the x64 approach.


Do we really need to keep the keep_dims=true variant followed by a Squeeze operation under a define?
It looks like it would not be a problem to support this variation generically, which would reduce code complexity.


Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?

@ashwins990

Hi @maxnick, thanks for the comments.

I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file, "gathermatmul.cpp", and reuse the existing x86 code.

I will update this PR with the refactored logic in the coming week, once it's approved internally.

I will move the relevant tests to the common scope as well.

