XeGPU RFC update: Add mem_desc and operations for share local memory access #1092
base: main
Conversation
docs/rfcs/XeGPU.md
Outdated
> ## XeGPU operations to access shared local memory
> Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can carry an attribute describing its memory layout, for example a blocked layout or the original non-blocked row-major layout (a.k.a. linear layout).
> Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use load_matrix and store_matrix to move the matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's load_matrix and store_matrix work at the workgroup level only. They use xegpu.layout to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
Since we're talking about WG-level here, I think #1033 should be merged before this one.
LGTM
docs/rfcs/XeGPU.md
Outdated
> Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (a 1D int8 memref with an empty layout) and creates a structured representation of the shared local memory. The resulting matrix_desc carries the relevant information, including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads.
> When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout.
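As an editorial aside (not part of the RFC text above), here is a minimal C++ sketch of what a blocked layout could mean for addressing, assuming blocks are laid out in row-major order and elements within a block are also row-major; the function name and plain-integer interface are hypothetical.

```cpp
// Hypothetical illustration: map a logical (row, col) coordinate of a matrix
// with @block = [blockRows, blockCols] to a linear element offset in SLM,
// assuming row-major block order and row-major layout inside each block.
#include <array>
#include <cassert>
#include <cstdint>

using std::int64_t;

int64_t blockedOffset(int64_t row, int64_t col,
                      std::array<int64_t, 2> shape,   // e.g. {256, 32}
                      std::array<int64_t, 2> block) { // e.g. {16, 16}
  assert(shape[0] % block[0] == 0 && shape[1] % block[1] == 0);
  int64_t blocksPerRow = shape[1] / block[1];
  int64_t blockRow = row / block[0], blockCol = col / block[1];
  int64_t inRow = row % block[0], inCol = col % block[1];
  int64_t blockId = blockRow * blocksPerRow + blockCol; // row-major over blocks
  int64_t blockSize = block[0] * block[1];              // elements per block
  return blockId * blockSize + inRow * block[1] + inCol;
}
```

Under this assumption, a 16x16 f16 block occupies 256 contiguous elements in SLM, which is why a blocked layout lowers naturally to 1D block loads.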
What is the purpose of making the memref operand optional?
Removed. It was there to stay consistent with the earlier definition.
cc @dchigarev
docs/rfcs/XeGPU.md
Outdated
> #dpas_wg = {sg_layout = [8, 4], sg_data= [32, 32], order=[1, 0] }
> %at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
> %a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16>
%1->%at?
> The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads.
> It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse transpose with load at the subgroup level, potentially missing the most optimized code sequence.
Is it correct that the "transpose + convert_layout" pattern should be generated by higher-level dialects?
Yes.
> **Subgroup to Lane distribution**
> This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while store_matrix with vector input represents chunked loads.
> For simplicity, the lane layout assignment pass is omitted

This, together with the lack of examples, makes the offset calculation unclear.
I only see the `+ %lane_id` difference relative to the unrolling snippet for matrix stores. Does it only apply to a [1, 16] lane_layout? What about lane_data? It seems that the offset will have to be calculated similarly to the wg-to-sg distribution (which requires code changes to the layoutAttr), where we delinearize the laneId and pick a proper lane_data multiple. Is that the intent?
Matrix loads simply received @subgroupBlockIO, so is the rule simply the presence of @block in the mem_desc?
If it is not blocked, do we do the same offset calculation as for store_matrix?
> Does it only apply to a [1, 16] lane_layout? What about lane_data?

It should apply to any valid lane_layout + lane_data combination, e.g. lane_layout = [2, 8] and lane_data = [2, 1].

> It seems that the offset will have to be calculated similarly to the wg-to-sg distribution (which requires code changes to the layoutAttr), where we delinearize the laneId and pick a proper lane_data multiple.

Yes.
For an n-dimensional coordinate space, where each dimension i has a corresponding offset, the distributed coordinates are defined as follows.
First, compute the distribution parameters:
lane_distribution_unit_size[i] = lane_layout[i] * lane_data[i]   // a single distribution unit along dim i
dist_num[i] = sg_data[i] / lane_distribution_unit_size[i]        // how many distribution rounds along dim i
Then, for each iteration round = 0 .. dist_num[i]-1, the distributed offset along dimension i is:
dist_offset[i] = base_offset[i] + (lane_id[i] * lane_data[i] + round * lane_distribution_unit_size[i]) % sg_data[i]   // % sg_data[i] is needed to support the reduction case, where sg_data[i] is the same as lane_data[i]
The resulting distributed coordinates are the combinations of the possible offsets from each dimension.
Please write this as a utility function of the layout attributes. See the issue: https://github.com/intel-innersource/frameworks.ai.mlir.mlir-extensions/issues/1324
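To make the formula above concrete (and as a starting point for the requested utility), here is a minimal C++ sketch; the free-standing functions and plain-integer parameters are placeholders, not the actual layout-attribute API.

```cpp
// Hypothetical sketch of the distribution formula above, for illustration only.
#include <array>
#include <cstdint>
#include <vector>

using std::int64_t;

// Start offsets owned by one lane along a single dimension i.
std::vector<int64_t> distributedOffsetsAlongDim(int64_t baseOffset, int64_t laneId,
                                                int64_t laneLayout, int64_t laneData,
                                                int64_t sgData) {
  int64_t unit = laneLayout * laneData;  // a single distribution unit along dim i
  int64_t distNum = sgData / unit;       // number of distribution rounds along dim i
  std::vector<int64_t> offsets;
  for (int64_t round = 0; round < distNum; ++round)
    // % sgData supports the reduction case where sgData equals laneData.
    offsets.push_back(baseOffset + (laneId * laneData + round * unit) % sgData);
  return offsets;
}

// The distributed coordinates are the Cartesian product of the per-dimension
// offsets; shown here for the 2D case.
std::vector<std::array<int64_t, 2>>
distributedCoords2D(std::array<int64_t, 2> base, std::array<int64_t, 2> laneId,
                    std::array<int64_t, 2> laneLayout,
                    std::array<int64_t, 2> laneData,
                    std::array<int64_t, 2> sgData) {
  auto rows = distributedOffsetsAlongDim(base[0], laneId[0], laneLayout[0],
                                         laneData[0], sgData[0]);
  auto cols = distributedOffsetsAlongDim(base[1], laneId[1], laneLayout[1],
                                         laneData[1], sgData[1]);
  std::vector<std::array<int64_t, 2>> coords;
  for (int64_t r : rows)
    for (int64_t c : cols)
      coords.push_back({r, c});
  return coords;
}
```

For example, with lane_layout = [2, 8], lane_data = [2, 1], sg_data = [8, 16], and base offset (0, 0), lane (1, 3) gets start offsets {2, 6} along dim 0 (each covering lane_data[0] = 2 consecutive rows) and {3, 11} along dim 1, i.e. four distributed coordinates.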
> Matrix loads simply received @subgroupBlockIO, so is the rule simply the presence of @block in the mem_desc?

Not sure what the question is. @subgroupBlockIO should be set before the sg-to-wi distribution stage: when the user sets inst_data, they should set the @subgroupBlockIO and @block attributes accordingly, so they express the intent to use the subgroupBlockIO instruction. When the @subgroupBlockIO attribute is present, we don't need to distribute the coordinate.

> If it is not blocked, do we do the same offset calculation as for store_matrix?

As long as it is not marked with @subgroupBlockIO, we need to distribute the coordinate. This is not related to the @block attribute associated with the mem_desc, since regular load/store instructions can access a blocked matrix in SLM as long as its mem_desc is marked with the block parameters.
> @subgroupBlockIO should be set before the sg-to-wi distribution stage: when the user sets inst_data, they should set the @subgroupBlockIO and @block attributes accordingly

Oh, then I got confused by the RFC, as @subgroupBlockIO only appears in the example IR of the Subgroup to Lane distribution section, and @subgroupBlockIO is introduced under the lane-level attributes.
> @subgroupBlockIO should be set before the sg-to-wi distribution stage

The verification for subgroup_block_io contradicts this statement. We have:
return emitError() << "subgroup_block_io " "are only allowed when result is a 1D VectorType.";
But the example from the RFC suggests that a lane can load more than one element:
%a_dpas = xegpu.load_matrix %ma[%sg_idy * 32, 0] @subgroupBlockIO : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16>
so the sg-level shape can be 2D (i.e., the input of sg-to-wi) and be amended with @subgroupBlockIO. So we allow 2D shapes in the verification and shift the 1D vector check to the lowering?
Yes, we should allow 2D.