
Commit 978ce9d

update the lowering to xevm
1 parent 40ae990 commit 978ce9d


docs/rfcs/XeGPU.md

Lines changed: 54 additions & 52 deletions
@@ -336,19 +336,19 @@ To streamline programming of shared local memory (SLM) on Intel Xe architecture,

**Background and Motivation**

-On Xe2 GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose, re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes.
+On Xe GPUs, SLM remains accessible for direct use by programmers. However, in tile-based programming — particularly when applying layout transformations such as transpose and re-layout — SLM is more commonly used as a backing store to facilitate structured tile movement across subgroups and lanes.

Prior to the introduction of mem_desc, SLM usage was modeled using the nd_tdesc type, which was originally designed for global memory access. As such, it lacked layout-specific attributes like blocking and stride metadata, which are essential for modeling tiled or transposed views in SLM. Developers were responsible for manually computing physical addresses — a process that became particularly complex when applying transformations such as transpose or blocking as required by chunked load or 1D block load.

This complexity was further compounded by hierarchical distribution, where workgroup-level tiles are subdivided across subgroups, instructions, and individual lanes — each step requiring separate address transformation logic. This made the code error-prone and difficult to optimize.

**Design and Semantics**

-The mem_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the physical address materialization phase maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the mem_desc.
+The mem_desc type addresses these challenges by encoding layout transformations—such as transpose and blocking—as static attributes of the descriptor, and by clearly separating logical and physical address computation. The distribution and unrolling process operates on a conceptual row-major 2D matrix, enabling clean and structured logical access, while the XeVM lowering pass maps these logical coordinates to hardware-compliant SLM addresses, guided by the layout attributes attached to the mem_desc.

This separation simplifies distribution and unrolling passes and enables systematic, robust transformations during compilation. The descriptor encapsulates all necessary layout metadata to generate correct and efficient SLM access patterns — supporting both regular loads and 1D block loads — without requiring the user to write explicit address arithmetic.

-**Basic Usage**
+**Op definition**

To represent a matrix stored in shared local memory (SLM), users must create a mem_desc object. The create_mem_desc op initializes a mem_desc instance with memory layout attributes such as @block and @strides. These attributes define the blocking and striding parameters, which govern physical address computation when accessing SLM. The mem_desc_subview op creates a subview on top of a mem_desc, inheriting all of its layout attributes. The load_matrix and store_matrix ops perform data movement between SLM and vector registers. The xegpu.layout attribute is added to load_matrix and store_matrix to specify the mapping of lanes and registers to fragments of the matrix, guiding tile distribution based on the assumed row-major view of the matrix.
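
The full workgroup-level example sits outside this hunk; purely for orientation, here is a minimal sketch of the op sequence described above. Types and attribute syntax mirror the examples further down in this file; %vec and the other SSA names are placeholders, not lines taken from the RFC.

```mlir
// Back a 32x256 f16 matrix with a 16 KB SLM allocation (32 * 256 * 2 bytes).
%slm = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
// Attach blocking/stride metadata; these attributes drive physical address computation later.
%md = xegpu.create_mem_desc %slm : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
// Move a tile from registers into SLM and read it back.
xegpu.store_matrix %vec, %md[0, 0] : vector<32x256xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
%out = xegpu.load_matrix %md[0, 0] : mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]> -> vector<32x256xf16>
```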

@@ -385,6 +385,7 @@ xegpu.store_matrix vec_a, mem_desc_b[0, 0] : vector<256x128xbf6>, mem_desc<256x1
xegpu.store_matrix %at, %mt[%sg_idy * 8, %sg_idx * 32] : vector<8x32xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
```

+**Lane level attributes**
At the lane level, a load_matrix operation retrieves a single element from the matrix in SLM, with the element address determined by the lane's offset.
If the `vec_len` and `vec_dir` attributes are present, the operation instead retrieves a vector of length `vec_len` along the direction specified by `vec_dir`.
If the `subgroupBlockIO` attribute is present, the load is a cooperative subgroup operation. In this case, the operation consumes a uniform memory descriptor and uniform offsets,
@@ -433,11 +434,11 @@ In this flow:

3. The result is a matrix tile conforming to the #dpas_wg layout, ready for compute instructions such as DPAS.

-**After optimization that targets the transpose-A pattern**
+**Cooperative transpose optimization pass targeting the transpose-A pattern**

The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads.

-It is generally preferred to detect the “transpose + convert_layout” pattern and fuse them earlier in the pipeline, as this affects the blocking strategy for load_matrix and store_matrix (which are the lowered forms of the logical layout conversion and transpose). Early fusion enables better alignment with optimal hardware load instructions.
+It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of the logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse the transpose with the load at the subgroup level, potentially missing the most optimized code sequence.

```mlir
#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], order = [0, 1] } // original layout
@@ -458,13 +459,15 @@ gpu.barrier
In this example, the xegpu.layout is extended to support instruction-level blocking. The basic blocking assumes 16 lanes, and each lane handles 2 f16 elements (32 bits). This basic instruction blocking does not try to block the memory layout. It lowers to instructions like chunked store and load_gather.

```mlir
-#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 32], order = [0, 1] }
+#Load_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 32], order = [0, 1] }
+#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [1, 16], order = [0, 1] }
#dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [1, 32], order = [1, 0] }

-%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
+%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Load_t_wg> -> vector<32x256xf16>
+%at2 = xegpu.convert_layout %at #Coop_t_wg
%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
%mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @strides=[1, 32]>
-xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, mem_desc<32x256xf16, @strides=[1, 32]>
+xegpu.store_matrix %at2, %mt[0, 0] #Coop_t_wg: vector<32x256xf16>, mem_desc<32x256xf16, @strides=[1, 32]>

gpu.barrier

@@ -484,13 +487,15 @@ This pattern demonstrates a more optimized strategy for instruction-level blocki
During lowering, store_matrix is lowered to store_chunk if the matrix has strides, and load_matrix is lowered to a 1D block load if the matrix has a blocked layout.

```mlir
+#Load_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 32], order = [0, 1] }
#Coop_t_wg = { sg_layout = [4, 8], sg_data = [8, 32], inst_data = [8, 16], order = [0, 1] }
#dpas_t_wg = { sg_layout = [8, 4], sg_data = [32, 32], inst_data = [16, 16], order = [1, 0] }

-%at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
+%at = xegpu.load_nd %tdesc : tensor_desc<32x256xf16, #Load_t_wg> -> vector<32x256xf16>
+%at2 = xegpu.convert_layout %at #Coop_t_wg
%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
%mt = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
-xegpu.store_matrix %at, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>
+xegpu.store_matrix %at2, %mt[0, 0] #Coop_t_wg : vector<32x256xf16>, mem_desc<32x256xf16, @block=[16, 16], @strides=[1, 32]>

gpu.barrier
%ma = xegpu.create_mem_desc %m : memref<16384xi8, 3> -> mem_desc<256x32xf16, @block=[16, 16]>

@@ -501,7 +506,7 @@ gpu.barrier

This example illustrates how load_matrix and store_matrix are distributed from workgroup to subgroups. After distribution, the sg_layout and sg_data attributes are removed from the layout specification, leaving only the inst_data attribute.

-The distribution process assumes matrix stored in row-major contiguous layout, and performes indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the final lowering stage (e.g., MaterializeSLMAccess) are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the mem_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward.
+The distribution process assumes the matrix is stored in a row-major contiguous layout and performs indexing using logical coordinates. These logical coordinates are used throughout tile distribution and layout transformations. Only at the XeVM lowering stage are physical offsets computed using memory layout attributes such as @strides and @block. A key property of the mem_desc data type is that logical tile decomposition does not alter the block or stride metadata, making logical address computation straightforward.

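To make the logical-to-physical mapping concrete, here is a small standalone sketch in plain func/arith MLIR (a hypothetical helper, not an op of the XeGPU dialect) of the offset computation implied by @block=[16, 16] and @strides=[1, 32] for the 32x256 store-side matrix. The blocked strides [256, 512, 1, 16] are taken from the comments in the lowering example later in this document.

```mlir
// Hypothetical helper: physical f16-element offset of logical element (r, c)
// in a 32x256 matrix with @strides=[1, 32] and @block=[16, 16].
// Blocked view: [2x16x16x16] with strides [256, 512, 1, 16].
func.func @slm_offset(%r: index, %c: index) -> index {
  %c16  = arith.constant 16 : index
  %c256 = arith.constant 256 : index
  %c512 = arith.constant 512 : index
  %blk_y = arith.divui %r, %c16 : index   // block row
  %in_y  = arith.remui %r, %c16 : index   // row within the block
  %blk_x = arith.divui %c, %c16 : index   // block column
  %in_x  = arith.remui %c, %c16 : index   // column within the block
  %t0 = arith.muli %blk_y, %c256 : index
  %t1 = arith.muli %blk_x, %c512 : index
  %t2 = arith.muli %in_x, %c16 : index
  %s0 = arith.addi %t0, %t1 : index
  %s1 = arith.addi %s0, %in_y : index
  %off = arith.addi %s1, %t2 : index      // blk_y*256 + blk_x*512 + in_y + in_x*16
  return %off : index
}
```

Because tile decomposition only changes the logical (r, c) passed in, the same mapping applies unchanged at every level of distribution.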

```mlir
#load_t_inst = { inst_data = [8, 32] }
@@ -556,6 +561,8 @@ gpu.barrier

**Subgroup to Lane distribution**

+This example illustrates how `load_matrix` and `store_matrix` operations are distributed from subgroup to lane. For simplicity, the lane layout assignment pass is omitted. After distribution, these operations work on 1D vectors or scalars. The lane-level attribute `subgroupBlockIO` is used to represent 1D block loads, while `vec_len` and `vec_dir` indicate chunked loads.
+
```mlir
%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32]
  : memref<4096x4096xf16> -> tensor_desc<8x32xf16>
@@ -581,72 +588,67 @@ gpu.barrier
  : mem_desc<256x32xf16, @block=[16, 16]> -> vector<16xf16>
```

-**MaterializeSLMAccess: Lowering mem_desc to Physical Memory Access**
+**XeGPU lowering to XeVM**

-This step lowers high-level mem_desc operations (store_matrix, load_matrix) into low-level memory operations (store_chunk, load_1d) over shared local memory. It performs full address materialization using the matrix's layout attributes (@strides, @block) and logical lane coordinates.
+This step lowers lane-level mem_desc operations (store_matrix, load_matrix) into XeVM/LLVM operations. At this point, the generated XeVM code performs full address materialization using the matrix's layout attributes (@strides, @block) and the logical lane coordinates.

Key Concepts:
-- Chunked Store: Each thread stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to store_chunk.
+- **Chunked Load/Store**: Each thread loads or stores a small fragment (e.g., 8×1) using the logical offset composed with layout metadata. Lowered to llvm.load/llvm.store with a vector operand.

-- 1D Block Load: A transposed layout (e.g., 256×32) is blocked as 16×16 tiles. Contiguous blocks are loaded using load_1d, which requires computing the physical offset of the first element per 1D block.
-
-- Offset Calculation: Logical per-lane coordinates are transformed into logical block coordinates, then to physical offsets using block size and strides.
+- **1D Block Load/Store:** In a transposed layout (e.g., 256×32), the matrix is blocked into 16×16 tiles. Elements within each block are contiguous in memory, allowing efficient loading via `xevm.blockload`. All lanes use the same uniform block address and cooperatively load a contiguous block, with each lane retrieving multiple elements at a stride equal to the subgroup size. The uniform block address is computed by applying the layout metadata (as a function) to the logical base offset of the tile (see the sketch below).

595600
```mlir
596-
%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32]
601+
// psudo code
602+
//%tdesc_sg = xegpu.create_nd_tdesc %base[%widy * 32 + %sg_idy * 8, %widx * 256 + %sg_idx * 32]
597603
: memref<4096x4096xf16> -> tensor_desc<8x32xf16>
598-
%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<8x32xf16>
599-
%at0 = vector.extract %at[0, 0] : vector<8x32xf16> -> vector<8x16xf16>
600-
%at1 = vector.extract %at[0, 16] : vector<8x32xf16> -> vector<8x16xf16>
601-
602-
// Shared local memory buffer
603-
%m = memref.alloca() {alignment = 1024} : memref<16384xi8, 3>
604+
//%at = xegpu.load_nd %tdesc_sg : tensor_desc<8x32xf16> -> vector<16xf16>
605+
%at0 = vector.extract %at[0] : vector<16xf16> -> vector<8xf16>
606+
%at1 = vector.extract %at[8] : vector<16xf16> -> vector<8xf16>
607+
%m_i8 = llvm.alloca 16384 {alignment = 1024} : !llvm.ptr<i8, 3>
608+
%m = llvm.bitcast %m_i8 : !llvm.ptr<i8, 3> to !llvm.ptr<f16, 3>
604609
605610
// ---------------------- Chunked Store ----------------------
606-
// The transpose is added as we remove the transpose attribute out from chunked load/store and expect an explict data transpose.
607-
// it will be no op after lane distribution since each lane owns same data when [8,1] is transpose to [1, 8]
608-
%at0_t = vector.transpose %at0 : vector<8x16xf16> -> vector<16x8xf16>
609-
610-
// Compute blocked offset vectors for SLM store
611-
%blk_y=sg_idy*8 /16: index
612-
%blk_in_y=sg_idy*8 %16: index
613-
%sg_idx_vec = %sg_idx*32 + [0..15] : vector<16xindex>
614-
%blk_x=%sg_idx_vec /16: vector<16xindex >
615-
%blk_in_x=%sg_idx_vec %16: vector<16xindex >
611+
// Compute blocked offset for each lane
612+
%blk_y = sg_idy*8 / 16: index
613+
%blk_in_y = sg_idy*8 % 16: index
614+
%blk_x = (%sg_idx*32 + %lane_id) / 16: index
615+
%blk_in_x = (%sg_idx*32 + %lane_id) % 16: index
616616
617617
// calculate physic addresses with pre-computed strides of the blocked matrix.
618618
// [32x256, strides=1x32] blocked as [2x16x16x16, strides=256x512x1x16]
-%offset_vec0 = %blk_y * 256+ + %blk_x * 512 + %blk_in_y + %blk_in_x*16
-xegpu.store %at0_t, %m, %offset_vec0 @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex>
+%offset = %blk_y * 256 + %blk_x * 512 + %blk_in_y + %blk_in_x*16
+%addr = %m + %offset : !llvm.ptr<f16, 3>
+llvm.store %at0, %addr : vector<8xf16>, !llvm.ptr<f16, 3>

// Repeat for second tile
-%at1_t = vector.transpose %at1 : vector<8x16xf16> -> vector<16x8xf16>
-%sg_idx_vec2 = %sg_idx*32 + [16..31] : vector<16xindex>
-%blk_x2=%sg_idx_vec2 /16: vector<16xindex >
-%blk_in_x2=%sg_idx_vec2 %16: vector<16xindex >
-%offset_vec1 = %blk_y * 256+ + %blk_x2 * 512 + %blk_in_y+ %blk_in_x2*16
-xegpu.store %at1_t, %m, %offset_vec1: @chunk_size=8: vector<16x8xf16>, memref<8192xf16, 3>, vector<16xindex>
+%blk_x1 = (%sg_idx*32 + 16 + %lane_id) / 16 : index
+%blk_in_x1 = (%sg_idx*32 + 16 + %lane_id) % 16 : index
+%offset1 = %blk_y * 256 + %blk_x1 * 512 + %blk_in_y + %blk_in_x1*16
+%addr1 = %m + %offset1 : !llvm.ptr<f16, 3>
+llvm.store %at1, %addr1 : vector<8xf16>, !llvm.ptr<f16, 3>

gpu.barrier

// ---------------------- Load 1D Block ----------------------
// Compute per-block physical offsets
// pre-computed strides of the blocked matrix: [256x32] blocked as [16x2x16x16, strides=512x256x16x1]
-// sg_idx*32 coord to blocked matrix ccord: sg_idx*32%32/16 (0), sg_idx*32%32%16 (0). %32 due matrix shape[1] is 32
-// sg_idy*32 coord to blocked matrix coord: sg_idy*32/16, sg_idy*32%16 (0)
-// then map to physical addr using stride [2x16x16x16, strides=512x256x16x1], get sg_idy*32/16 *512
-%inst_start_offset0 = mul %sg_idy, 2 * 512
+// [sg_idy*32, sg_idx*32%32=0] coord to blocked matrix coord: [sg_idy*32/16, 0, 0, 0]
+// then map to a physical addr using the blocked layout [16x2x16x16, strides=512x256x16x1],
+// which gives sg_idy*32/16*512 = sg_idy*1024
+%inst_start_offset0 = mul %sg_idy, 1024
%inst_start_offset1 = add %inst_start_offset0, 256
%inst_start_offset2 = add %inst_start_offset0, 512
%inst_start_offset3 = add %inst_start_offset0, 768
+%addr0 = %m + %inst_start_offset0 : !llvm.ptr<f16, 3>
+%addr1 = %m + %inst_start_offset1 : !llvm.ptr<f16, 3>
+%addr2 = %m + %inst_start_offset2 : !llvm.ptr<f16, 3>
+%addr3 = %m + %inst_start_offset3 : !llvm.ptr<f16, 3>

-%a_dpas_0 = xegpu.load_nd %m, %inst_start_offset0 : memref<8192xf16, 3>, index -> vector<256xf16>
-%a_dpas_1 = xegpu.load_nd %m, %inst_start_offset1 : memref<8192xf16, 3>, index -> vector<256xf16>
-%a_dpas_2 = xegpu.load_nd %m, %inst_start_offset2 : memref<8192xf16, 3>, index -> vector<256xf16>
-%a_dpas_3 = xegpu.load_nd %m, %inst_start_offset3 : memref<8192xf16, 3>, index -> vector<256xf16>
+%a_dpas_0 = xevm.blockload %addr0 : !llvm.ptr<f16, 3> -> vector<16xf16>
+%a_dpas_1 = xevm.blockload %addr1 : !llvm.ptr<f16, 3> -> vector<16xf16>
+%a_dpas_2 = xevm.blockload %addr2 : !llvm.ptr<f16, 3> -> vector<16xf16>
+%a_dpas_3 = xevm.blockload %addr3 : !llvm.ptr<f16, 3> -> vector<16xf16>
```

-## XeGPU Attributes to support Work Item Level semantics

**Attribute xegpu.sg_map**
