[LoadStoreOpToLLVM] Improve the 2D block IO lowering for DPAS and DotOp layout. #5425
base: main
Pull Request Overview
This PR improves the 2D block IO lowering for DPAS (Dot Product Accumulate Systolic) and DotOp layouts by extending support from just OperandB to all DPAS operand types (OperandA, OperandB, and OperandC). The implementation refactors the existing if-else structure into a switch statement and adds detailed documentation explaining the data flow and optimization patterns for DPAS operations.
Key changes:
- Extends DPAS layout handling to support OperandA and OperandC in addition to OperandB
- Refactors conditional logic from if-else to switch statement with proper default case handling
- Adds comprehensive inline documentation explaining the three-type system and data flow optimization
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| LoadStoreOpToLLVM.cpp | Extends DPAS operand handling with switch statement and adds detailed documentation of data flow patterns |
| tensor-pointer-load-block-2d.mlir | Updates test expectations to include new shuffle vector and bitcast operations for DPAS operand handling |
05e4534 to 0aa1b3c
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
458d741 to 6705f55
…Op layout. Signed-off-by: Lu,Chengjun <[email protected]>
```cpp
default:
  llvm_unreachable("unexpected OpIdx type.");
```
This should be unnecessary because DpasEncodingAttr::OpIdx is an enum class with three enumerators, each of which has a corresponding case in the switch statement.
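The point of the comment above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `OpIdx` enum here is a hypothetical stand-in for `DpasEncodingAttr::OpIdx` (its enumerator names are assumed from the PR summary), and `describeOperand` is an invented helper.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for DpasEncodingAttr::OpIdx; the real enumerator
// names and values may differ.
enum class OpIdx { OperandA, OperandB, OperandC };

// Every enumerator has its own case and there is no `default:` label, so a
// compiler warning such as -Wswitch can flag a newly added enumerator that
// is not handled here.
std::string describeOperand(OpIdx idx) {
  switch (idx) {
  case OpIdx::OperandA:
    return "A";
  case OpIdx::OperandB:
    return "B";
  case OpIdx::OperandC:
    return "C";
  }
  // Only reached if `idx` holds a value outside the declared enumerators.
  assert(false && "unexpected OpIdx value");
  return "";
}
```

In LLVM-based code the fallback after the switch would typically be `llvm_unreachable`; a plain `assert` is used here only to keep the sketch free of LLVM headers.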
@chengjunlu does the PR improve any benchmark?
I will collect the data.
In the 4kx4kx4k case of the gemm_tensor_of_ptr_benchmark, the change improves the register spilling of the configuration:
There are about 1.5k register spills without the change for the same configuration:
The key difference is that the original code generates extra bitcasts from/to i32 between the load and the dpas, which cannot be optimized away:
With this change, the bitcasts are eliminated. The IGC can generate more efficient code without the useless bitcasts, in which there are
The performance improvement for the 4kx4kx4k case is:
The benchmark runner is still in progress:
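As an aside on the bitcast point in the comment above, the following sketch shows why a bitcast to i32 and back carries no information of its own: packing two 16-bit lanes into a 32-bit value and unpacking them again is an identity on the bits. The function names are hypothetical and the lane width is an assumption chosen for illustration; this does not model what IGC actually emits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret two 16-bit lanes as one 32-bit value. Like a bitcast, this
// copies the bits unchanged and performs no arithmetic conversion.
std::uint32_t packToI32(const std::uint16_t lanes[2]) {
  std::uint32_t packed;
  std::memcpy(&packed, lanes, sizeof(packed));
  return packed;
}

// Reinterpret one 32-bit value as two 16-bit lanes (the inverse bitcast).
void unpackFromI32(std::uint32_t packed, std::uint16_t lanes[2]) {
  std::memcpy(lanes, &packed, sizeof(packed));
}

// Packing followed by unpacking returns the original lanes, which is why
// such a pair is redundant when nothing uses the packed form in between.
bool roundTripIsIdentity(std::uint16_t a, std::uint16_t b) {
  const std::uint16_t in[2] = {a, b};
  std::uint16_t out[2] = {0, 0};
  unpackFromI32(packToI32(in), out);
  return out[0] == a && out[1] == b;
}
```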
In the DPAS layout, three data types are involved in the block load and dot product (DPAS) computation flow:
For non-DPAS layouts, only the first two types are used. The data flow proceeds as follows:
load2DGenXType values.
The tt.dot (DPAS) operation consumes packed operands to perform the dot product.
During optimization, redundant pack/unpack and bitcast operations are removed, resulting in a simplified sequence:
Conceptually, the combination of packedDPASOperandType and shufflevector determines how input data maps to the DPAS computation flow.
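To make the role of the shuffle concrete, here is a small sketch of the element selection that an LLVM-style `shufflevector` performs: each result element is picked by index from the concatenation of two input vectors. The function name `shuffleVector` and the example mask are illustrative assumptions; the actual masks generated by the lowering are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Select elements from the concatenation of `lhs` and `rhs`:
// result[i] = concat(lhs, rhs)[mask[i]].
std::vector<int> shuffleVector(const std::vector<int>& lhs,
                               const std::vector<int>& rhs,
                               const std::vector<std::size_t>& mask) {
  std::vector<int> result;
  result.reserve(mask.size());
  for (std::size_t idx : mask) {
    assert(idx < lhs.size() + rhs.size() && "mask index out of range");
    // Indices below lhs.size() read from lhs; the rest read from rhs.
    result.push_back(idx < lhs.size() ? lhs[idx] : rhs[idx - lhs.size()]);
  }
  return result;
}
```

For example, with inputs `{10, 11, 12, 13}` and `{20, 21, 22, 23}`, the mask `{0, 4, 1, 5}` interleaves the leading elements of the two inputs and yields `{10, 20, 11, 21}`.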