XeGPU RFC update: Add mem_desc and operations for share local memory access #1092
base: main
Conversation
docs/rfcs/XeGPU.md
Outdated
> ## XeGPU operations to access shared local memory
> Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can carry an attribute describing its memory layout, for example a blocked layout or the original non-blocked row-major layout (a.k.a. linear layout).
> Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use load_matrix and store_matrix to move the matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's load_matrix and store_matrix work at the workgroup level only. They use xegpu.layout to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
Since we're talking about WG-level here, I think #1033 should be merged before this one.
LGTM
docs/rfcs/XeGPU.md
Outdated
> Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (a 1D int8 memref with an empty layout) and creates a structured representation of the shared local memory. The resulting matrix_desc carries the relevant information, including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads.
> When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout.
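As an editorial aside (not part of the RFC text above), here is a minimal C++ sketch of what a blocked layout could mean for addressing, assuming blocks are laid out in row-major order and elements within a block are also row-major; the function name and plain-integer interface are hypothetical.

```cpp
// Hypothetical illustration: map a logical (row, col) coordinate of a matrix
// with @block = [blockRows, blockCols] to a linear element offset in SLM,
// assuming row-major block order and row-major layout inside each block.
#include <array>
#include <cassert>
#include <cstdint>

using std::int64_t;

int64_t blockedOffset(int64_t row, int64_t col,
                      std::array<int64_t, 2> shape,   // e.g. {256, 32}
                      std::array<int64_t, 2> block) { // e.g. {16, 16}
  assert(shape[0] % block[0] == 0 && shape[1] % block[1] == 0);
  int64_t blocksPerRow = shape[1] / block[1];
  int64_t blockRow = row / block[0], blockCol = col / block[1];
  int64_t inRow = row % block[0], inCol = col % block[1];
  int64_t blockId = blockRow * blocksPerRow + blockCol; // row-major over blocks
  int64_t blockSize = block[0] * block[1];              // elements per block
  return blockId * blockSize + inRow * block[1] + inCol;
}
```

Under this assumption, a 16x16 f16 block occupies 256 contiguous elements in SLM, which is why a blocked layout lowers naturally to 1D block loads.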
What is the purpose of making the memref operand optional?
Removed. It was there to stay consistent with the earlier definition.
cc @dchigarev
docs/rfcs/XeGPU.md
Outdated
> #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] }
> %at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
> %a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16>
%1->%at?
> The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads.
> It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse transpose with load at the subgroup level, potentially missing the most optimized code sequence.
Is it correct that the "transpose + convert_layout" pattern should be generated by higher-level dialects?
Yes.
> **Subgroup to Lane distribution**
> This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while store_matrix with vector input represents chunked loads.
> For simplicity, the lane layout assignment pass is omitted

This, together with the lack of examples, makes the offset calculation unclear.
I only see the `+ %lane_id` difference relative to the unrolling snippet for matrix stores. Does it only apply to a [1, 16] lane_layout? What about lane_data? It seems that the offset will have to be calculated similarly to the wg-to-sg distribution (which requires code changes to the layoutAttr), where we delinearize the laneId and pick a proper lane_data multiple. Is that the intent?
Matrix loads simply received @subgroupBlockIO, so is the rule simply the presence of @block in the mem_desc?
If it is not blocked, do we do the same offset calculation as for store_matrix?
> Does it only apply to a [1, 16] lane_layout? What about lane_data?

It should apply to any valid lane_layout + lane_data combination, e.g. lane_layout = [2, 8] and lane_data = [2, 1].

> It seems that the offset will have to be calculated similarly to the wg-to-sg distribution (which requires code changes to the layoutAttr), where we delinearize the laneId and pick a proper lane_data multiple.

Yes.
For an n-dimensional coordinate space, where each dimension i has a corresponding offset, the distributed coordinates are defined as follows.
First, compute the distribution parameters:
lane_distribution_unit_size[i] = lane_layout[i] * lane_data[i]   // a single distribution unit along dim i
dist_num[i] = sg_data[i] / lane_distribution_unit_size[i]        // how many distribution rounds along dim i
Then, for each iteration round = 0 .. dist_num[i]-1, the distributed offset along dimension i is:
dist_offset[i] = base_offset[i] + (lane_id[i] * lane_data[i] + round * lane_distribution_unit_size[i]) % sg_data[i]   // % sg_data[i] is needed to support the reduction case, where sg_data[i] is the same as lane_data[i]
The resulting distributed coordinates are the combinations of the possible offsets from each dimension.
Please write this as a utility function of the layout attributes. See the issue: https://github.com/intel-innersource/frameworks.ai.mlir.mlir-extensions/issues/1324
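To make the formula above concrete (and as a starting point for the requested utility), here is a minimal C++ sketch; the free-standing functions and plain-integer parameters are placeholders, not the actual layout-attribute API.

```cpp
// Hypothetical sketch of the distribution formula above, for illustration only.
#include <array>
#include <cstdint>
#include <vector>

using std::int64_t;

// Start offsets owned by one lane along a single dimension i.
std::vector<int64_t> distributedOffsetsAlongDim(int64_t baseOffset, int64_t laneId,
                                                int64_t laneLayout, int64_t laneData,
                                                int64_t sgData) {
  int64_t unit = laneLayout * laneData;  // a single distribution unit along dim i
  int64_t distNum = sgData / unit;       // number of distribution rounds along dim i
  std::vector<int64_t> offsets;
  for (int64_t round = 0; round < distNum; ++round)
    // % sgData supports the reduction case where sgData equals laneData.
    offsets.push_back(baseOffset + (laneId * laneData + round * unit) % sgData);
  return offsets;
}

// The distributed coordinates are the Cartesian product of the per-dimension
// offsets; shown here for the 2D case.
std::vector<std::array<int64_t, 2>>
distributedCoords2D(std::array<int64_t, 2> base, std::array<int64_t, 2> laneId,
                    std::array<int64_t, 2> laneLayout,
                    std::array<int64_t, 2> laneData,
                    std::array<int64_t, 2> sgData) {
  auto rows = distributedOffsetsAlongDim(base[0], laneId[0], laneLayout[0],
                                         laneData[0], sgData[0]);
  auto cols = distributedOffsetsAlongDim(base[1], laneId[1], laneLayout[1],
                                         laneData[1], sgData[1]);
  std::vector<std::array<int64_t, 2>> coords;
  for (int64_t r : rows)
    for (int64_t c : cols)
      coords.push_back({r, c});
  return coords;
}
```

For example, with lane_layout = [2, 8], lane_data = [2, 1], sg_data = [8, 16], and base offset (0, 0), lane (1, 3) gets start offsets {2, 6} along dim 0 (each covering lane_data[0] = 2 consecutive rows) and {3, 11} along dim 1, i.e. four distributed coordinates.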
> Matrix loads simply received @subgroupBlockIO, so is the rule simply the presence of @block in the mem_desc?

Not sure what the question is. @subgroupBlockIO should be set before the sg-to-wi distribution stage: when the user sets inst_data, they should set the @subgroupBlockIO and @block attributes accordingly, so they express the intent to use the subgroupBlockIO instruction. When the @subgroupBlockIO attribute is present, we don't need to distribute the coordinate.

> If it is not blocked, do we do the same offset calculation as for store_matrix?

As long as it is not marked with @subgroupBlockIO, we need to distribute the coordinate. This is not related to the @block attribute associated with the mem_desc, since regular load/store instructions can access a blocked matrix in SLM as long as its mem_desc is marked with the block parameters.
> @subgroupBlockIO should be set before the sg-to-wi distribution stage: when the user sets inst_data, they should set the @subgroupBlockIO and @block attributes accordingly

Oh, then I got confused by the RFC, as @subgroupBlockIO only appears in the example IR of the Subgroup to Lane distribution section, and @subgroupBlockIO is introduced under the lane-level attributes.
> @subgroupBlockIO should be set before the sg-to-wi distribution stage

The verification for subgroup_block_io contradicts this statement. We have:
return emitError() << "subgroup_block_io " "are only allowed when result is a 1D VectorType.";
But the example from the RFC suggests that a lane can load more than one element:
%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] @subgroupBlockIO : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16>
so the sg-level shape can be 2D (i.e., the input of sg-to-wi) and be amended with @subgroupBlockIO. So we allow 2D shapes in the verification and shift the 1D vector check to the lowering?
Yes, we should allow 2D.