[BACKEND] Perform tree reductions on in-thread values (#9220)
Conversation
Force-pushed from 4c14120 to 74c5682
Force-pushed from 147422e to a649212
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a649212079
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from a649212 to ae06838
💡 Codex Review
Reviewed commit: ae06838ea6
auto acc = useTernary
               ? treeReduceTernary(op.getLoc(), rewriter,
                                   op.getCombineOp(), std::move(vals))
               : treeReduceBinary(op.getLoc(), rewriter,
                                  op.getCombineOp(), std::move(vals));
Preserve sequential order for in-thread reductions
This switch to a tree reduction for per-thread values changes the evaluation order even when the axis never crosses lanes (i.e., only reduceWithinThreads runs). The interpreter’s ReduceOps.generic_reduce performs a left‑to‑right fold along the axis (python/triton/runtime/interpreter.py:950‑975), so custom non‑associative combine functions (e.g., subtraction/division or NaN‑sensitive float ops) will now produce different results on GPU versus the interpreter and versus the previous linear implementation in this single‑thread case. If sequential order is part of the expected semantics for these reductions, consider keeping the linear fold unless the combiner is known to be associative/commutative.
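To make this concrete, here is a minimal illustration (helper names are ours, not the PR's) of how a left-to-right fold and a pairwise tree fold disagree for a non-associative combiner such as subtraction:

```python
from functools import reduce

def linear_fold(f, vals):
    # Left-to-right fold, matching the interpreter's generic_reduce order.
    return reduce(f, vals)

def tree_fold(f, vals):
    # Pairwise binary tree reduction, matching a tree-reduce lowering.
    while len(vals) > 1:
        nxt = [f(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

sub = lambda a, b: a - b
print(linear_fold(sub, [8, 4, 2, 1]))  # ((8-4)-2)-1 = 1
print(tree_fold(sub, [8, 4, 2, 1]))    # (8-4)-(2-1) = 3
```

Both orders agree for associative combiners; they diverge exactly in the non-associative case Codex flags.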
I think Codex is right here; we have an isAssociative check in reduce, so I guess we are supposed to support that use case?
That assumes a layout where the registers are ordered, blocked or similar. The associative way of doing it for a general layout is much more inefficient: you chase the bases in order, and you might not be able to reduce all the registers in one go. We already broke this assumption in the previous lowering, really.
The reduce op assumes the lambda is associative; the lowering would be wrong otherwise. The isAssociative check we have is used to know whether we should rematerialize. It is a kind of ad hoc workaround for the fact that with floating point we break this rule. But without assuming associativity we cannot do any efficient lowering.
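A one-line reminder of why floating point breaks this rule (a standard double-precision example, not specific to this PR):

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0: b + c rounds back to -1e16 at this magnitude
```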
I don't understand your point, Thomas. Do we want to do the lowering assuming just associativity? If so, we cannot perform it efficiently for generic linear layouts, we should just support blocked layouts. Consider otherwise a layout like
ttgl.DistributedLinearLayout(
reg_bases=[[1, 0], [4, 0]],
lane_bases=[[0, 1], [2, 0], [0, 2], [16, 0], [0, 4]],
warp_bases=[[32, 0], [8, 0]],
block_bases=[],
shape=[64, 8],
)
and a reduction over dim=0.
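For readers unfamiliar with linear layouts: each hardware index (register, lane, warp) maps to a tensor coordinate by XOR-ing together the bases selected by its set bits. A minimal sketch (the function name is illustrative), applied to the register bases above:

```python
def ll_apply(bases, idx):
    # XOR together the bases whose bit is set in idx.
    out = [0] * len(bases[0])
    for bit, basis in enumerate(bases):
        if (idx >> bit) & 1:
            out = [o ^ b for o, b in zip(out, basis)]
    return out

reg_bases = [[1, 0], [4, 0]]
# Register 3 selects both bases: [1, 0] ^ [4, 0] = [5, 0], so a thread's
# 4 registers cover rows {0, 1, 4, 5} -- non-contiguous along dim 0,
# which is why an order-preserving reduction is awkward here.
print([ll_apply(reg_bases, r) for r in range(4)])
```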
If we didn't transform it to be a blocked layout and we just assume the operation to be associative (and not commutative), then we would need to (naively)
- First reduce regs[0] and regs[1]
- Then perform 1 shuffle and a register reduction
- Then go to shmem to move the basis [8, 0] to be in the lane bases and perform 1 shuffle and a reduction
- Then another shuffle and a reduction for basis [16, 0]
- Finally go to shmem, to move the basis [32, 0] to the lanes, and another shuffle and a register reduction
The previous lowering therefore assumes that the operation is also commutative on top of associative, so that it can effectively "reorder the bases before performing the reduction", and so do we.
@lezcano I think that it's reasonable to make the assumption of commutativity in addition to associativity.
The only time that we've ever done reductions over non-commutative functions is over axes of length <= 2, and for those your implementation works for arbitrary f(x, y). We did this (a reduction of _take_first over an axis of length 1) as a hack for implementing unsplat before it was added as a TTIR primitive: a7a89c7
I don't understand your point, Thomas. Do we want to do the lowering assuming just associativity?
no I just meant we should assume associativity.
About commutativity: until now we were requiring it, and we have a test here for the non-commutative case:
triton/python/test/unit/language/test_core.py
Line 2598 in bcbcabd
But yeah it sounds like we need to relax this restriction to do efficient codegen for linear layouts.
This is already broken in current codegen right?
we should probably update our doc and this test
This is already broken in current codegen right?
Yep, it was just true if the layout happened to be regular enough (i.e. blocked)
will update the test and docs
apgoucher left a comment
@lezcano Approved, although I'm unsure a priori of the utility of the ternary tree reduction for integer operations. After all, in those cases the binary operation is exactly associative, so InstCombine and ptxas will each have the opportunity to completely rewrite your expression.
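For reference, a ternary tree reduce combines three values per node, which the backend can map onto three-input instructions (e.g. IADD3 on NVIDIA). A sketch under the assumption that the combiner is associative (function names are illustrative, not the PR's code):

```python
def tree_reduce_ternary(f, vals):
    # Each level combines groups of three; a leftover pair becomes a
    # binary node, and a single leftover is carried up unchanged.
    while len(vals) > 1:
        nxt, i = [], 0
        while len(vals) - i >= 3:
            nxt.append(f(f(vals[i], vals[i + 1]), vals[i + 2]))
            i += 3
        if len(vals) - i == 2:
            nxt.append(f(vals[i], vals[i + 1]))
        elif len(vals) - i == 1:
            nxt.append(vals[i])
        vals = nxt
    return vals[0]

print(tree_reduce_ternary(lambda a, b: a + b, list(range(10))))  # 45
```

The depth drops from ceil(log2 n) to roughly ceil(log3 n), though as noted above, for exactly-associative integer ops InstCombine/ptxas may rebalance the expression anyway.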
Fair point. I'll check the generated SASS tomorrow.
We generate ternary trees for suitable integer ops and binary trees for
everything else.
We manually generate `{add,mul}.{f16,f32}x2` ops. This brings a speed-up
to some gluon attention kernels.
stack-info: PR: #9220, branch: lezcano/stack/7
Force-pushed from ae06838 to ce4c938
Force-pushed from d3adbad to 532c5ff
Regarding the vectorisation strategy discussed in #9220 (comment), I went with a different one. The idea is to always vectorise vec[i] and vec[i+1] to avoid packing/unpacking (PRMT/MOVs) in SASS. @peterbell10 @apgoucher this is ready for another review.
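A rough model of that strategy (all names hypothetical; scalar tuples stand in for packed f16x2/f32x2 registers): adjacent elements vec[i], vec[i+1] stay paired through every tree level, so each combine is one lane-wise packed op, and repacking is only needed for the single horizontal step at the end.

```python
def vectorized_tree_reduce(f, vals):
    # Pack adjacent scalars; each tree level combines packs lane-wise,
    # modelling one add.f16x2-style instruction per combine.
    assert len(vals) >= 2
    f2 = lambda p, q: (f(p[0], q[0]), f(p[1], q[1]))  # packed combine
    packs = [(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
    tail = vals[-1] if len(vals) % 2 else None
    while len(packs) > 1:
        nxt = [f2(packs[i], packs[i + 1]) for i in range(0, len(packs) - 1, 2)]
        if len(packs) % 2:
            nxt.append(packs[-1])
        packs = nxt
    acc = f(*packs[0])  # the only horizontal (unpack) step
    return f(acc, tail) if tail is not None else acc

print(vectorized_tree_reduce(lambda a, b: a + b, list(range(8))))  # 28
```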
💡 Codex Review
Reviewed commit: 532c5ffad3
Force-pushed from 532c5ff to 7687a5e
peterbell10 left a comment
LGTM, would be interested to see performance impact.
I will re-run perf before merging, but initially I saw between a 4% and 5.2% speed-up on the attention gluon kernel.
Force-pushed from 7687a5e to 974fc04
💡 Codex Review
Reviewed commit: 974fc04565
case InThreadVectorizeOpKind::MinimumF:
  result = LLVM::MinimumOp::create(builder, loc, vecTy, lhs, rhs);
  break;
case InThreadVectorizeOpKind::MaximumF:
  result = LLVM::MaximumOp::create(builder, loc, vecTy, lhs, rhs);
Preserve NaN-propagating min/max in vectorized reductions
The vectorized combine region always emits LLVM::MinimumOp/LLVM::MaximumOp for arith::MinimumFOp/MaximumFOp. Elsewhere, min/max lowering gates these ops on hwNanPropagationSupported and falls back to a NaN-emulating path when the target doesn’t propagate NaNs (see MinMaxFOpConversion in ElementwiseOpToLLVM.cpp). By bypassing that check here, f16/bf16 reductions on targets without NaN-propagating min/max (e.g. NVIDIA < sm80 or AMD targets where supportMaximumMinimum is false) can drop NaNs, producing different results than the scalar combine path. Consider plumbing the target capability or reusing the same helper for min/max to keep NaN semantics consistent.
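The semantic difference in question, sketched in scalar Python (these helpers just model llvm.minimum versus a non-NaN-propagating hardware min; they are not the PR's code):

```python
import math

def minimum(a, b):
    # llvm.minimum semantics: NaN propagates if either input is NaN.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)

def hw_min_no_nan(a, b):
    # Non-NaN-propagating hw min (llvm.minnum-like): NaN inputs are dropped.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)

print(minimum(float("nan"), 1.0))        # nan
print(hw_min_no_nan(float("nan"), 1.0))  # 1.0 -- the NaN is silently lost
```

In a reduction, one dropped NaN at any tree level changes the final result, which is why the scalar path gates on hwNanPropagationSupported.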
Force-pushed from 974fc04 to 0423232
Stacked PRs:
[BACKEND] Perform tree reductions on in-thread values