
@Jianhui-Li (Contributor):

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

@Jianhui-Li Jianhui-Li changed the title Add matrix_desc and operations XeGPU RFC update: Add matrix_desc and operations for share local memory Jul 8, 2025
@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory XeGPU RFC update: Add matrix_desc and operations for share local memory access Jul 8, 2025

## XeGPU operations to access shared local memory
Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can carry an attribute describing its memory layout, for example a blocked layout or the original non-blocked row-major layout (aka linear layout).
Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use `load_matrix` and `store_matrix` to move matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's `load_matrix` and `store_matrix` work at the workgroup level only, using `xegpu.layout` to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
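A minimal sketch of this workgroup-level flow, using the later `mem_desc` naming; the creation op, the `layout` attribute placement, and the SLM address space are assumptions modeled on the examples quoted later in this thread, not the RFC's definitive syntax:

```mlir
// Hypothetical sketch; xegpu.create_mem_desc and the {layout = ...} attribute
// placement are assumptions, and address space 3 is assumed to denote SLM.
#wg = {sg_layout = [8, 4], sg_data = [32, 8]}

// View a raw SLM buffer (1D i8 memref) as a row-major 256x32 f16 matrix.
%md = xegpu.create_mem_desc %slm : memref<16384xi8, 3> -> mem_desc<256x32xf16>

// Workgroup-level store/load of the entire matrix; #wg describes how the
// matrix decomposes into per-subgroup fragments (8x4 subgroups, each owning
// a 32x8 fragment).
xegpu.store_matrix %v, %md[0, 0] {layout = #wg}
    : vector<256x32xf16>, mem_desc<256x32xf16>
%m = xegpu.load_matrix %md[0, 0] {layout = #wg}
    : mem_desc<256x32xf16> -> vector<256x32xf16>
```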
A reviewer (Contributor) commented:
Since we're talking about WG-level here, I think #1033 should be merged before this one.

@chencha3 (Contributor) left a comment:
LGTM


Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (a 1D int8 memref with an empty layout) and creates a structured representation of the shared local memory. The resulting matrix_desc carries the shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads.
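A hedged sketch of the two attributes; the type syntax follows the `mem_desc<256x32xf16, @block=[16, 16]>` form quoted later in this thread, while the creation op name and the particular strides shown are assumptions:

```mlir
// Blocked layout: the 256x32 matrix is stored as 16x16 blocks in SLM,
// enabling lowering of loads to 1D block loads.
%blocked = xegpu.create_mem_desc %slm0 : memref<16384xi8, 3>
    -> mem_desc<256x32xf16, @block=[16, 16]>

// Strided layout: logical strides per dimension, e.g. a 256x32 view stored
// column-major over the same buffer, typically served by chunked accesses.
%strided = xegpu.create_mem_desc %slm1 : memref<16384xi8, 3>
    -> mem_desc<256x32xf16, @strides=[1, 256]>
```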

When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout.
A reviewer (Contributor) commented:
What is the purpose of making memref operand optional?

@Jianhui-Li (Contributor, author) replied:
Removed. It was there to stay consistent with the earlier definition.

@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory access XeGPU RFC update: Add mem_desc and operations for share local memory access Aug 15, 2025
@Garra1980 (Contributor) commented:

cc @dchigarev

#dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] }
%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
%a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16>
A reviewer (Contributor) commented:
%1->%at?


The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads.
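A hedged sketch of what the transformed sequence could look like, assuming the op spellings above and a strided `mem_desc` view to realize the transpose; the concrete layouts and the strided-view approach are illustrative assumptions, not the RFC's prescribed lowering:

```mlir
// Each subgroup stores its small 4x64 fragment of %at (32x256) into SLM ...
#st = {sg_layout = [8, 4], sg_data = [4, 64]}
%md = xegpu.create_mem_desc %slm : memref<16384xi8, 3> -> mem_desc<32x256xf16>
xegpu.store_matrix %at, %md[0, 0] {layout = #st}
    : vector<32x256xf16>, mem_desc<32x256xf16>

// ... then (after a workgroup barrier, omitted here) reads its transposed
// 32x8 fragment back through a 256x32 strided view of the same buffer.
#ld = {sg_layout = [8, 4], sg_data = [32, 8]}
%mdt = xegpu.create_mem_desc %slm : memref<16384xi8, 3>
    -> mem_desc<256x32xf16, @strides=[1, 256]>
%a = xegpu.load_matrix %mdt[0, 0] {layout = #ld}
    : mem_desc<256x32xf16, @strides=[1, 256]> -> vector<256x32xf16>
```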

It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse transpose with load at the subgroup level, potentially missing the most optimized code sequence.
A reviewer (Contributor) commented:
Is it correct that the "transpose + convert_layout" pattern should be generated by higher-level dialects?

@Jianhui-Li (Contributor, author) replied:
Yes.


**Subgroup to Lane distribution**

This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while `store_matrix` with a vector input represents chunked stores.
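For reference, the lane-level forms look roughly as follows; the first op is the RFC's own 1D block-load example (also quoted later in this thread), while the chunked-store form and its per-lane offset are assumptions based on the discussion below:

```mlir
// 1D block load: marked @subgroupBlockIO, each lane receives a small 1D vector.
%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] @subgroupBlockIO
    : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16>

// Chunked store (no @subgroupBlockIO): the per-lane offset includes %lane_id,
// derived from lane_layout/lane_data as described in the comments below.
xegpu.store_matrix %frag, %mb[%off_y, %off_x + %lane_id]
    : vector<8xf16>, mem_desc<256x32xf16>
```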
A reviewer (Contributor) commented:
> For simplicity, the lane layout assignment pass is omitted

This, and the lack of examples, makes the offset calculation hard to follow.
I only see the `+ %lane_id` difference compared to the unrolling snippet for matrix stores. Does it only apply to a [1, 16] lane_layout? What about lane_data? It seems the offset will have to be calculated similarly to the wg-to-sg distribution (which requires code changes to the layoutAttr), where we delinearize the laneId and pick a proper lane_data multiple. Is that the intent?

Matrix loads simply receive @subgroupBlockIO, so is the rule simply the presence of @block in the mem_desc?
If it is not blocked, do we do the same offset calculation as for store_matrix?

@Jianhui-Li (Contributor, author) replied on Oct 24, 2025:

> Does it only apply to [1, 16] lane_layout? What about lane_data?

It should apply to any valid lane_layout + lane_data combination, e.g. lane_layout = [2, 8] and lane_data = [2, 1].

> It seems that the offset will have to be calculated similarly to the wg-to-sg distribution (requires code changes to the layoutAttr), where we delinearize the laneId and pick a proper lane_data multiple.

Yes.

For an n-dimensional coordinate space, where each dimension i has a corresponding offset, the distributed coordinates are defined as follows.

First, compute the distribution parameters:

lane_distribution_unit_size[i] = lane_layout[i] * lane_data[i]   // a single distribution unit along dim i
dist_num[i] = sg_data[i] / lane_distribution_unit_size[i]        // number of distribution rounds along dim i

Then, for each round = 0 .. dist_num[i] - 1, the distributed offset along dimension i is:

dist_offset[i] = base_offset[i] + (lane_id[i] * lane_data[i] + round * lane_distribution_unit_size[i]) % sg_data[i]
// the % sg_data[i] wrap is needed to support the reduction case, where sg_data[i] is the same as lane_data[i]

The resulting distributed coordinates are the combinations of the possible offsets from each dimension.
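As a worked illustration (numbers chosen here, not from the RFC): with sg_data = [32, 32], lane_layout = [1, 16], and lane_data = [1, 1], we get lane_distribution_unit_size = [1, 16] and dist_num = [32, 2]; for lane_id = [0, 5], dimension 0 yields offsets base_offset[0] + 0..31 and dimension 1 yields base_offset[1] + {5, 21}, so that lane covers a 32x2 set of elements.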

Please write this as a utility function of the layout attributes. See the issue: https://github.com/intel-innersource/frameworks.ai.mlir.mlir-extensions/issues/1324

@Jianhui-Li (Contributor, author) replied:
> Matrix loads simply receive @subgroupBlockIO, so is the rule simply the presence of @block in the mem_desc?

Not sure what the question is. @subgroupBlockIO should be set before the sg-to-wi distribution stage: when users set inst_data, they should set the @subgroupBlockIO and @block attributes accordingly, expressing the intent to use the subgroup block IO instruction. When the @subgroupBlockIO attribute is present, we don't need to distribute the coordinates.

> If it is not blocked, do we do the same offset calculation as for store_matrix?

As long as it is not marked with @subgroupBlockIO, we need to distribute the coordinates. This is not related to the block attribute associated with the mem_desc: regular load/store instructions can access a blocked matrix in SLM as long as its mem_desc is marked with the block parameters.

A reviewer (Contributor) commented:
> @subgroupBlockIO should be set before the sg-to-wi distribution stage: when users set inst_data, they should set the @subgroupBlockIO and @block attributes accordingly

Oh, then I got confused by the RFC, as @subgroupBlockIO only appears in the example IR of the Subgroup to Lane distribution section, while @subgroupBlockIO itself is introduced under the Lane-level attributes.

A reviewer (Contributor) commented:
> @subgroupBlockIO should be set before the sg-to-wi distribution stage

The verification for subgroupBlockIO contradicts this statement. We have:

  return emitError() << "subgroup_block_io "
                       "are only allowed when result is a 1D VectorType.";

But the example from the RFC suggests that a lane can load more than one element:

%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] @subgroupBlockIO : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16>

so the sg-level shape can be 2D (i.e., the input of sg-to-wi) and be amended with @subgroupBlockIO. So should we allow 2D shapes in the verification and shift the 1D-vector check to the lowering?

@Jianhui-Li (Contributor, author) replied:
Yes, we should allow 2D.
