xe: grp_gemm: Add microkernel based implementation of grouped GEMM #4505

umar456 · 2026-01-07T20:04:14Z

This pull request introduces support for a new "grouped micro GEMM" (General Matrix Multiply) implementation, along with corresponding kernel code, and enhancements to the testing utilities to support grouped memory.

…gemm

Only supports power of 2 shapes and f16xf16:f32 types

- Adjust global work size calculation for accurate group partitioning
- Correct m_all calculation to divide by num_groups for per-group M
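
As a rough illustration of the two launch-config fixes listed above, the sketch below sizes a launch from a per-group M instead of the total row count. It is not the code from this PR: `launch_config_t`, `make_launch_config`, the tile sizes, the subgroup size, and the mapping of dimensions to N tiles, M tiles, and groups are all hypothetical. It also assumes `m_all` is the total number of rows across all groups and rounds the per-group value up.

```cpp
#include <array>
#include <cstddef>

// Hypothetical launch description; these names are not taken from the PR.
struct launch_config_t {
    std::array<std::size_t, 3> global; // global work size per dimension
    std::array<std::size_t, 3> local; // work-group size per dimension
};

// Integer division rounded up.
constexpr std::size_t div_up(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Assumption: m_all is the sum of rows over all groups, so the per-group M
// used for sizing is m_all / num_groups (rounded up here).
launch_config_t make_launch_config(std::size_t m_all, std::size_t n,
        std::size_t num_groups, std::size_t tile_m, std::size_t tile_n,
        std::size_t sg_size) {
    const std::size_t m_per_group = div_up(m_all, num_groups);

    launch_config_t cfg;
    cfg.local = {sg_size, 1, 1};
    // Dimension 0: tiles along N, one subgroup per tile.
    // Dimension 1: tiles along the per-group M.
    // Dimension 2: one slice per group.
    cfg.global = {div_up(n, tile_n) * sg_size, div_up(m_per_group, tile_m),
            num_groups};
    return cfg;
}
```

With `m_all = 1024`, `n = 256`, four groups, a 32x64 tile, and a subgroup size of 16, this gives a global size of {64, 8, 4}. Sizing dimension 1 from `m_all` directly would instead launch 32 tile rows per group, four times more than needed.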

mzhukova · 2026-01-07T20:27:00Z

tests/gtests/test_grouped_gemm.cpp

@@ -0,0 +1,168 @@
+
+#include "dnnl_test_common.hpp"


What is missing from benchdnn that makes this test necessary?

Nothing is missing. I want more control over how the data is initialized so that it's easier to test and debug new configurations. I will probably remove this in the future.

> I want more control over how the data is initialized

Could you clarify, so that I could possibly make it better/easier?

> I will probably remove this in the future

I agree, we would need this removed prior to merging if it is covered by benchdnn.

It's difficult to initialize the input data with known values to identify incorrect results. For example, I want to find out why I am getting an incorrect value at a certain location in the destination buffer. With my own tests I can set my inputs to known values to debug. I don't know of a way to do that with benchdnn.
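
The debugging approach described above can be sketched in plain C++, independent of oneDNN and of the test file in this PR. Every name below (`ref_grouped_gemm`, `first_mismatch`, `mismatch_t`) is hypothetical, and so is the assumed layout: source rows of all groups concatenated, with one `[K, N]` weight matrix per group. Inputs hold small integers so that results are exact in float and a wrong destination element can be traced to a group, row, and column.

```cpp
#include <cstddef>
#include <vector>

// Location and values of the first wrong destination element.
struct mismatch_t {
    std::size_t group, row, col;
    float expected, actual;
};

// Reference grouped GEMM on plain float buffers (assumed layout):
//   src: rows of all groups concatenated, [sum(group_rows), K], row-major
//   wei: one [K, N] row-major matrix per group, stored back to back
std::vector<float> ref_grouped_gemm(const std::vector<float> &src,
        const std::vector<float> &wei,
        const std::vector<std::size_t> &group_rows, std::size_t K,
        std::size_t N) {
    std::size_t total_rows = 0;
    for (std::size_t m : group_rows)
        total_rows += m;

    std::vector<float> dst(total_rows * N, 0.f);
    std::size_t row0 = 0; // first row of the current group
    for (std::size_t g = 0; g < group_rows.size(); ++g) {
        for (std::size_t m = 0; m < group_rows[g]; ++m)
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.f;
                for (std::size_t k = 0; k < K; ++k)
                    acc += src[(row0 + m) * K + k] * wei[(g * K + k) * N + n];
                dst[(row0 + m) * N + n] = acc;
            }
        row0 += group_rows[g];
    }
    return dst;
}

// Reports the first element where actual differs from expected.
// Exact comparison is fine here because inputs are small integers.
bool first_mismatch(const std::vector<float> &expected,
        const std::vector<float> &actual,
        const std::vector<std::size_t> &group_rows, std::size_t N,
        mismatch_t &out) {
    std::size_t row0 = 0;
    for (std::size_t g = 0; g < group_rows.size(); ++g) {
        for (std::size_t m = 0; m < group_rows[g]; ++m)
            for (std::size_t n = 0; n < N; ++n) {
                const std::size_t idx = (row0 + m) * N + n;
                if (expected[idx] != actual[idx]) {
                    out = {g, m, n, expected[idx], actual[idx]};
                    return true;
                }
            }
        row0 += group_rows[g];
    }
    return false;
}
```

For example, filling the source with ones and the weights of group `g` with `g + 1` makes every destination element of group `g` equal `K * (g + 1)`, so a wrong value also hints at which group's weights were read.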

Does --buffer-prefix work?

mzhukova added 30 commits December 16, 2025 12:42

api, common: new grouped mem encoding

f2358bf

tests: api tests for grouped mem

185d0ed

examples: draft of moe w grouped gemm (WIP)

ad9a9b2

common: extend util for grouped encoding

9eaa1c6

examples: fixup example

e20c7e4

cpu: add reference grouped gemm impl with limited support

ab1ce87

common: hacky bypass of matmul checks (TODO: update properly)

dc3b01e

cpu: minor adjustment

5b93da4

tests: benchdnn: support grouped mem creation

88edc26

tests: benchdnn: support --grouped in parser

1862f13

tests: benchdnn: update number of dims required in case of grouped gemm

24f70d5

tests: benchdnn: update create_md to handle grouped src, dst

6d0d42c

tests: benchdnn: add fill_group_data

6155038

tests: benchdnn: add ref cpu matmul with plain memory

ca013e0

tests: benchdnn: w/a for comparison function

5eb6499

tests: benchdnn: (wip) create mems for grouped gemm

c676403

cpu: matmul: fix build warning

8fec0e6

examples: update grouped gemm/moe example to support gpu eng

77a58e7

tests: gtests: allow gpu execution

c57e38e

examples: update mem creation for gpu eng

3fa428b

common: extend out storage access macros for multi buffer

361fa27

gpu: add ref ocl impl for grouped gemm with limited support

3990d27

cpu: move cpu ref grouped gemm to ref impl

875f424

common, cpu, gpu, tests: support row-wise f32 src scales for grouped …

8e9af4b

…gemm

gpu: early exit in case of reshape for grouped

b389b29

gpu: allow bias in ref gpu ocl for grouped gemm

d922238

tests: benchdnn: simple bias support

7520f04

tests: benchdnn: simple inputs for grouped gemm

24076f9

all: grand api migration

90fb497

tests: benchdnn: (wip) hacky support of int/grouped scales

74a381b

umar456 added 8 commits December 30, 2025 10:14

WIP: simple test

3291d3f

tests: Enable grouped memory in the write_to_dnnl_memory func

ba3ce85

tests: Enable print_mem and get_groupe_mem for grouped memory

50e69a1

xe: grp_gemm: Enable microkernel implementation for grouped gemm

4981683

Only supports power of 2 shapes and f16xf16:f32 types

xe: grp_gemm: Update ugemm layouts

ffb2aa6

xe: grp_gemm: Enable large grf and enable catalog strategies

04fefd1

xe: grp_gemm: Add support for f32 and bf16 dt for all inputs

898c908

xe: grp_gemm: Correct grouped GEMM launch config calculation

bda20dc

- Adjust global work size calculation for accurate group partitioning
- Correct m_all calculation to divide by num_groups for per-group M

github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Jan 7, 2026

mzhukova reviewed Jan 7, 2026

mzhukova force-pushed the mzhukova/main/poc-grouped-mem branch 2 times, most recently from be4fed9 to 0ea335d Compare January 10, 2026 01:55

umar456 commented Jan 7, 2026

mzhukova Jan 7, 2026

umar456 Jan 7, 2026 • edited

mzhukova Jan 7, 2026

umar456 Jan 7, 2026

Simonsays095 Jan 7, 2026

3 participants
