-
Notifications
You must be signed in to change notification settings - Fork 13.3k
[NVPTX] Add TLI hook for load slice cost and implement it #131847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 | ||
; RUN: llc < %s | FileCheck %s | ||
|
||
target triple = "nvptx64-unknown-unknown" | ||
|
||
;; Verify that 64-bit loads are not split into more 32-bit | ||
;; loads. Loads are more expensive than shifts/conversions. | ||
define float @test(ptr %in) { | ||
; | ||
; CHECK-LABEL: test( | ||
; CHECK: { | ||
; CHECK-NEXT: .reg .b32 %r<5>; | ||
; CHECK-NEXT: .reg .f32 %f<8>; | ||
; CHECK-NEXT: .reg .b64 %rd<4>; | ||
; CHECK-EMPTY: | ||
; CHECK-NEXT: // %bb.0: | ||
; CHECK-NEXT: ld.param.u64 %rd1, [test_param_0]; | ||
; CHECK-NEXT: ld.u64 %rd2, [%rd1]; | ||
; CHECK-NEXT: ld.u64 %rd3, [%rd1+8]; | ||
; CHECK-NEXT: cvt.u32.u64 %r1, %rd2; | ||
; CHECK-NEXT: cvt.u32.u64 %r2, %rd3; | ||
; CHECK-NEXT: mov.b32 %f1, %r1; | ||
; CHECK-NEXT: mov.b32 %f2, %r2; | ||
; CHECK-NEXT: add.rn.f32 %f3, %f1, %f2; | ||
; CHECK-NEXT: { .reg .b32 tmp; mov.b64 {tmp, %r3}, %rd2; } | ||
; CHECK-NEXT: { .reg .b32 tmp; mov.b64 {tmp, %r4}, %rd3; } | ||
; CHECK-NEXT: mov.b32 %f4, %r3; | ||
; CHECK-NEXT: mov.b32 %f5, %r4; | ||
Comment on lines
+20
to
+28
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Future cleanup opportunity: We seem to be using too many instructions to do effectively nothing on the SASS level. FP operations accept https://godbolt.org/z/1bb5dMGsa
Or, alternatively, split
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, I completely agree we're producing pretty unsightly PTX here, even if the SASS is efficient. |
||
; CHECK-NEXT: add.rn.f32 %f6, %f4, %f5; | ||
; CHECK-NEXT: add.rn.f32 %f7, %f3, %f6; | ||
; CHECK-NEXT: st.param.f32 [func_retval0], %f7; | ||
; CHECK-NEXT: ret; | ||
%ptr0 = getelementptr inbounds i64, ptr %in, i64 0 | ||
%ptr1 = getelementptr inbounds i64, ptr %in, i64 1 | ||
|
||
%load0 = load i64, ptr %ptr0, align 8 | ||
%load1 = load i64, ptr %ptr1, align 8 | ||
%trunc_lo_0 = trunc i64 %load0 to i32 | ||
%trunc_lo_1 = trunc i64 %load1 to i32 | ||
%float_lo_0 = bitcast i32 %trunc_lo_0 to float | ||
%float_lo_1 = bitcast i32 %trunc_lo_1 to float | ||
%add_lo = fadd float %float_lo_0, %float_lo_1 | ||
|
||
%shift0 = lshr i64 %load0, 32 | ||
%shift1 = lshr i64 %load1, 32 | ||
%trunc_hi_0 = trunc i64 %shift0 to i32 | ||
%trunc_hi_1 = trunc i64 %shift1 to i32 | ||
%float_hi_0 = bitcast i32 %trunc_hi_0 to float | ||
%float_hi_1 = bitcast i32 %trunc_hi_1 to float | ||
%add_hi = fadd float %float_hi_0, %float_hi_1 | ||
|
||
%res = fadd float %add_lo, %add_hi | ||
ret float %res | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit suggestion: "prioritize expensive operations." -> "prefer avoiding expensive ops" or "prefer less expensive ops".
Where does the magic cost scaling factor of 20 come from? I can't match it to the logic this weight function replaces.
It would be prudent to make the new defaults behave close to how they worked before the change. This is in the code path that will likely affect almost everything. We don't want any surprises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 20 is quite arbitrary. In the previous implementation we first compared expensive ops and only checked inexpensive ops if expensive ops were equal. This meant that expensive ops always trumped inexpensive ops. I don't think that previous logic really made sense as there must be some number of inexpensive ops that is more costly than an expensive op. I chose 20 because in practice I don't think there are going to be any cases where there are 20 inexpensive ops, so this should essentially get us the same behavior. I agree this is a bit weird though, I'm happy to try out any alternatives you can suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cost heuristics are prone to cause unexpected surprises, and I do not see a good way to validate the currently selected value without exposing it to the real world use cases. Perhaps we can make that scaling factor tunable as an escape hatch in case the current value produces suboptimal results for some users.