[Arm] Enable Gather MatMul with KleidiAI Microkernels by abhijain1204fujitsu · Pull Request #34303 · openvinotoolkit/openvino

abhijain1204fujitsu · 2026-02-25T03:14:52Z

[ About ]

Enable the GatherMatmul op-fusion transformation on ARM at the IR level.
Support a KleidiAI implementation of the GatherMatmul operation.
Fix a bug in the KleidiAI execution of matmul in F32 precision.

[ Background ]
GatherMatmul is a specialized matmul operation in which the weights are gathered before multiplication [ used in MoE architectures ]
In the OSS version, the implementation is based on oneDNN GEMV execution, which falls short in two respects:
GEMV is less optimized than GEMM for the prefill phase and for the decode phase when batch sizes are larger
lowp int8/int4 is not supported
We address both points with a KleidiAI-based GEMM implementation. This PR supports only F32; a subsequent PR will add lowp support.
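To make the operation concrete, here is a minimal reference sketch of the per-token (GEMV-style) formulation. It is an illustration only: the function name `gather_matmul_ref`, the flat row-major layouts, and the single expert index per token are assumptions made for this example and are not the OpenVINO node interface or the oneDNN code path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference, not the OpenVINO API.
// x:      [tokens, K]       row-major activations
// w:      [experts, N, K]   row-major weights, one [N, K] matrix per expert
// expert: [tokens]          expert index selected for each token
// result: [tokens, N]
std::vector<float> gather_matmul_ref(const std::vector<float>& x,
                                     const std::vector<float>& w,
                                     const std::vector<std::size_t>& expert,
                                     std::size_t K,
                                     std::size_t N) {
    const std::size_t tokens = expert.size();
    assert(x.size() == tokens * K);
    assert(w.size() % (N * K) == 0);
    const std::size_t num_experts = w.size() / (N * K);

    std::vector<float> y(tokens * N, 0.0f);
    for (std::size_t t = 0; t < tokens; ++t) {
        assert(expert[t] < num_experts);
        // "Gather": pick the weight matrix of the expert chosen for this token.
        const float* wt = w.data() + expert[t] * N * K;
        const float* xt = x.data() + t * K;
        // One GEMV per token: y[t, :] = wt * x[t, :]
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += xt[k] * wt[n * K + k];
            }
            y[t * N + n] = acc;
        }
    }
    return y;
}
```

Each token triggers its own matrix-vector product, so the expert weights are re-read for every token routed to that expert. This is the pattern that becomes less efficient as the number of tokens grows.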

[Benchmark Results]

For TTFT we are better by more than 30%
For TPOT we will be better at larger batch sizes

Because we gather and scatter around the GEMM, there is additional overhead compared with GEMV. From the TTFT results we infer that we will perform better at larger batch sizes.
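The gather/scatter overhead mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `gather_gemm_scatter` and `naive_gemm_nt` are hypothetical names, the plain triple loop only stands in for an optimized GEMM microkernel, and the flat row-major layouts with one expert index per token are chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an optimized GEMM microkernel (hypothetical helper).
// c[M, N] = a[M, K] * b[N, K]^T, all row-major.
static void naive_gemm_nt(const float* a, const float* b, float* c,
                          std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[m * K + k] * b[n * K + k];
            }
            c[m * N + n] = acc;
        }
    }
}

// x: [tokens, K], w: [experts, N, K], expert: [tokens] -> result: [tokens, N]
std::vector<float> gather_gemm_scatter(const std::vector<float>& x,
                                       const std::vector<float>& w,
                                       const std::vector<std::size_t>& expert,
                                       std::size_t K,
                                       std::size_t N) {
    const std::size_t tokens = expert.size();
    assert(x.size() == tokens * K);
    assert(w.size() % (N * K) == 0);
    const std::size_t num_experts = w.size() / (N * K);

    // Bucket token (row) indices by the expert they are routed to.
    std::vector<std::vector<std::size_t>> rows(num_experts);
    for (std::size_t t = 0; t < tokens; ++t) {
        assert(expert[t] < num_experts);
        rows[expert[t]].push_back(t);
    }

    std::vector<float> y(tokens * N, 0.0f);
    std::vector<float> in_buf;
    std::vector<float> out_buf;
    for (std::size_t e = 0; e < num_experts; ++e) {
        const std::size_t M = rows[e].size();
        if (M == 0) {
            continue;  // no tokens routed to this expert
        }
        in_buf.resize(M * K);
        out_buf.resize(M * N);
        // Gather: copy this expert's token rows into a contiguous buffer.
        for (std::size_t m = 0; m < M; ++m) {
            std::copy_n(x.data() + rows[e][m] * K, K, in_buf.data() + m * K);
        }
        // One dense GEMM per expert instead of one GEMV per token.
        naive_gemm_nt(in_buf.data(), w.data() + e * N * K, out_buf.data(), M, N, K);
        // Scatter: write each result row back to its original token position.
        for (std::size_t m = 0; m < M; ++m) {
            std::copy_n(out_buf.data() + m * N, N, y.data() + rows[e][m] * N);
        }
    }
    return y;
}
```

The two copy loops are the extra cost relative to GEMV. With only a few rows per expert they can outweigh the benefit of the GEMM, while with many rows per expert the copies are amortized over a larger matrix product.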

[ Accuracy ]

[f32 model] -> tested okay
[ lowp model ] [ decompression path ]

[ Dynamic_quantization_group_size = UINT64_MAX ] --> tested okay

[ only inference precision set to F32 ] --> incorrect output

This work is contributed by @ashwins990 and @abhijain1204fujitsu

maxnick · 2026-03-12T15:39:38Z

src/plugins/intel_cpu/src/nodes/gathermatmul.h

+#else
+
+    ov::element::Type getRuntimePrecision() const override;
+    Algorithm algorithm = Algorithm::GatherMatmulDefault;
+    size_t numExperts = 0;
+
+    std::vector<ExecutorPtr> executor;
+    std::vector<MemoryArgs> memArgsFC;
+
+    MemoryPtr m_weightsMemory = nullptr;
+    MemoryPtr m_tmpInpBuffer = nullptr;
+    MemoryDescPtr m_tmpInputDesc = nullptr;
+    MemoryDescPtr m_tmpOutputDesc = nullptr;
+
+#endif


Some fields are clearly duplicated between if and else branches. Should we narrow the scope?

maxnick · 2026-03-12T15:41:01Z

src/plugins/intel_cpu/src/nodes/gathermatmul_arm.cpp

I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.

maxnick · 2026-03-12T15:49:44Z

src/plugins/intel_cpu/src/nodes/gathermatmul_arm.cpp

+                continue;
+            }
+
+            parallel_for(num_valid_rows, [&](size_t m) {


It's better to use the CpuParallel class in such contexts to align the implementation with the x64 approach.

maxnick · 2026-03-12T16:07:57Z

src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/moe_matmuls_fusion.cpp

Do we really need to keep keep_dims=true followed by a Squeeze operation under a define?
It looks like it won't be a problem to support this variation in a generic fashion, just to reduce the code complexity.

maxnick · 2026-03-12T16:08:55Z

src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/arm/moe.cpp

Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?

ashwins990 · 2026-03-12T16:30:07Z

Hi @maxnick. Thanks for the comments.

I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file, "gathermatmul.cpp", and reused the existing x86 code.

I will update this PR with the refactored logic in the coming week, once it's approved internally.

I will move the relevant tests to the common scope as well.

ashwins990 added 3 commits February 24, 2026 13:49

initial commit

5d13b44

Added single inference condition; clang tidy fixes

1060e30

rebase fixes

c4f4a52

abhijain1204fujitsu requested review from a team as code owners February 25, 2026 03:14

github-actions bot added the category: CPU OpenVINO CPU plugin label Feb 25, 2026

sys-openvino-ci added the ExternalPR External contributor label Feb 25, 2026

alvoron self-assigned this Mar 4, 2026

maxnick modified the milestones: 2026.0, 2026.1 Mar 4, 2026

maxnick self-assigned this Mar 4, 2026

maxnick requested changes Mar 12, 2026

abhijain1204fujitsu wants to merge 3 commits into openvinotoolkit:master from MonakaResearch:GatherMatmul-on-ARM-with-Kleidiai


5 participants
