[BACKEND] Perform tree reductions on in-thread values (#9220)
Conversation
Force-pushed from 4c14120 to 74c5682
Force-pushed from 147422e to a649212
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a649212079
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed from a649212 to ae06838
💡 Codex Review
Reviewed commit: ae06838ea6
auto acc = useTernary
               ? treeReduceTernary(op.getLoc(), rewriter,
                                   op.getCombineOp(), std::move(vals))
               : treeReduceBinary(op.getLoc(), rewriter,
                                  op.getCombineOp(), std::move(vals));
Preserve sequential order for in-thread reductions
This switch to a tree reduction for per-thread values changes the evaluation order even when the axis never crosses lanes (i.e., only reduceWithinThreads runs). The interpreter’s ReduceOps.generic_reduce performs a left‑to‑right fold along the axis (python/triton/runtime/interpreter.py:950‑975), so custom non‑associative combine functions (e.g., subtraction/division or NaN‑sensitive float ops) will now produce different results on GPU versus the interpreter and versus the previous linear implementation in this single‑thread case. If sequential order is part of the expected semantics for these reductions, consider keeping the linear fold unless the combiner is known to be associative/commutative.
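To make this concrete, here is a minimal illustration (helper names are ours, not the PR's) of how a left-to-right fold and a pairwise tree fold disagree for a non-associative combiner such as subtraction:

```python
from functools import reduce

def linear_fold(f, vals):
    # Left-to-right fold, matching the interpreter's generic_reduce order.
    return reduce(f, vals)

def tree_fold(f, vals):
    # Pairwise binary tree reduction, matching a tree-reduce lowering.
    while len(vals) > 1:
        nxt = [f(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

sub = lambda a, b: a - b
print(linear_fold(sub, [8, 4, 2, 1]))  # ((8-4)-2)-1 = 1
print(tree_fold(sub, [8, 4, 2, 1]))    # (8-4)-(2-1) = 3
```

Both orders agree for associative combiners; they diverge exactly in the non-associative case Codex flags.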
I think Codex is right here; we have an isAssociative check in reduce, so I guess we are supposed to support that use case?
That assumes a layout where the registers are ordered, blocked or similar. The associative way of doing it for a general layout is much more inefficient: you chase the bases in order, and you might not be able to reduce all the registers in one go. We already broke this assumption in the previous lowering, really.
The reduce op assumes the lambda is associative; the lowering would be wrong otherwise. The isAssociative check we have is used to know whether we should rematerialize. It is a kind of ad hoc workaround for the fact that with floating point we break this rule. But without assuming associativity we cannot do any efficient lowering.
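A one-line reminder of why floating point breaks this rule (a standard double-precision example, not specific to this PR):

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0: b + c rounds back to -1e16 at this magnitude
```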
I don't understand your point, Thomas. Do we want to do the lowering assuming just associativity? If so, we cannot perform it efficiently for generic linear layouts, we should just support blocked layouts. Consider otherwise a layout like
ttgl.DistributedLinearLayout(
reg_bases=[[1, 0], [4, 0]],
lane_bases=[[0, 1], [2, 0], [0, 2], [16, 0], [0, 4]],
warp_bases=[[32, 0], [8, 0]],
block_bases=[],
shape=[64, 8],
)
and a reduction over dim=0.
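For readers unfamiliar with linear layouts: each hardware index (register, lane, warp) maps to a tensor coordinate by XOR-ing together the bases selected by its set bits. A minimal sketch (the function name is illustrative), applied to the register bases above:

```python
def ll_apply(bases, idx):
    # XOR together the bases whose bit is set in idx.
    out = [0] * len(bases[0])
    for bit, basis in enumerate(bases):
        if (idx >> bit) & 1:
            out = [o ^ b for o, b in zip(out, basis)]
    return out

reg_bases = [[1, 0], [4, 0]]
# Register 3 selects both bases: [1, 0] ^ [4, 0] = [5, 0], so a thread's
# 4 registers cover rows {0, 1, 4, 5} -- non-contiguous along dim 0,
# which is why an order-preserving reduction is awkward here.
print([ll_apply(reg_bases, r) for r in range(4)])
```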
If we didn't transform it to be a blocked layout and we just assume the operation to be associative (and not commutative), then we would need to (naively)
- First reduce regs[0] and regs[1]
- Then perform 1 shuffle and a register reduction
- Then go to shmem to move the basis [8, 0] to be in the lane bases and perform 1 shuffle and a reduction
- Then another shuffle and a reduction for basis [16, 0]
- Finally go to shmem, to move the basis [32, 0] to the lanes, and another shuffle and a register reduction
The previous lowering therefore assumes that the operation is also commutative on top of associative, so that it can effectively "reorder the bases before performing the reduction", and so do we.
@lezcano I think that it's reasonable to make the assumption of commutativity in addition to associativity.
The only time that we've ever done reductions over non-commutative functions is over axes of length <= 2, and for those your implementation works for arbitrary f(x, y). We did this (a reduction of _take_first over an axis of length 1) as a hack for implementing unsplat before it was added as a TTIR primitive: a7a89c7
I don't understand your point, Thomas. Do we want to do the lowering assuming just associativity?
no I just meant we should assume associativity.
About commutativity: until now we were requiring it, and we have a test here for the non-commutative case:
triton/python/test/unit/language/test_core.py
Line 2598 in bcbcabd
But yeah it sounds like we need to relax this restriction to do efficient codegen for linear layouts.
This is already broken in current codegen right?
we should probably update our doc and this test
This is already broken in current codegen right?
Yep, it was just true if the layout happened to be regular enough (i.e. blocked)
will update the test and docs
apgoucher left a comment
@lezcano Approved, although I'm unsure a priori of the utility of the ternary tree reduction for integer operations. After all, in those cases the binary operation is exactly associative, so InstCombine and ptxas will each have the opportunity to completely rewrite your expression.
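For reference, a ternary tree reduce combines three values per node, which the backend can map onto three-input instructions (e.g. IADD3 on NVIDIA). A sketch under the assumption that the combiner is associative (function names are illustrative, not the PR's code):

```python
def tree_reduce_ternary(f, vals):
    # Each level combines groups of three; a leftover pair becomes a
    # binary node, and a single leftover is carried up unchanged.
    while len(vals) > 1:
        nxt, i = [], 0
        while len(vals) - i >= 3:
            nxt.append(f(f(vals[i], vals[i + 1]), vals[i + 2]))
            i += 3
        if len(vals) - i == 2:
            nxt.append(f(vals[i], vals[i + 1]))
        elif len(vals) - i == 1:
            nxt.append(vals[i])
        vals = nxt
    return vals[0]

print(tree_reduce_ternary(lambda a, b: a + b, list(range(10))))  # 45
```

The depth drops from ceil(log2 n) to roughly ceil(log3 n), though as noted above, for exactly-associative integer ops InstCombine/ptxas may rebalance the expression anyway.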
Fair point. I'll check the generated SASS tomorrow.
We generate ternary trees for suitable integer ops and binary trees for
everything else.
We manually generate `{add,mul}.{f16,f32}x2` ops. This brings a speed-up
to some gluon attention kernels.
stack-info: PR: #9220, branch: lezcano/stack/7
Force-pushed from ae06838 to ce4c938
Force-pushed from d3adbad to 532c5ff
Regarding the vectorisation strategy discussed in #9220 (comment), I went with a different one. The idea is to always vectorise vec[i] and vec[i+1] to avoid packing/unpacking (PRMT/MOVs) in SASS. @peterbell10 @apgoucher this is ready for another review.
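A rough model of that strategy (all names hypothetical; scalar tuples stand in for packed f16x2/f32x2 registers): adjacent elements vec[i], vec[i+1] stay paired through every tree level, so each combine is one lane-wise packed op, and repacking is only needed for the single horizontal step at the end.

```python
def vectorized_tree_reduce(f, vals):
    # Pack adjacent scalars; each tree level combines packs lane-wise,
    # modelling one add.f16x2-style instruction per combine.
    assert len(vals) >= 2
    f2 = lambda p, q: (f(p[0], q[0]), f(p[1], q[1]))  # packed combine
    packs = [(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
    tail = vals[-1] if len(vals) % 2 else None
    while len(packs) > 1:
        nxt = [f2(packs[i], packs[i + 1]) for i in range(0, len(packs) - 1, 2)]
        if len(packs) % 2:
            nxt.append(packs[-1])
        packs = nxt
    acc = f(*packs[0])  # the only horizontal (unpack) step
    return f(acc, tail) if tail is not None else acc

print(vectorized_tree_reduce(lambda a, b: a + b, list(range(8))))  # 28
```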
💡 Codex Review
Reviewed commit: 532c5ffad3
Force-pushed from 532c5ff to 7687a5e
peterbell10 left a comment
LGTM, would be interested to see performance impact.
I will re-run perf before merging, but initially I saw between a 4% and 5.2% speed-up on the attention gluon kernel.
Force-pushed from 7687a5e to 974fc04
💡 Codex Review
Reviewed commit: 974fc04565
case InThreadVectorizeOpKind::MinimumF:
  result = LLVM::MinimumOp::create(builder, loc, vecTy, lhs, rhs);
  break;
case InThreadVectorizeOpKind::MaximumF:
  result = LLVM::MaximumOp::create(builder, loc, vecTy, lhs, rhs);
Preserve NaN-propagating min/max in vectorized reductions
The vectorized combine region always emits LLVM::MinimumOp/LLVM::MaximumOp for arith::MinimumFOp/MaximumFOp. Elsewhere, min/max lowering gates these ops on hwNanPropagationSupported and falls back to a NaN-emulating path when the target doesn’t propagate NaNs (see MinMaxFOpConversion in ElementwiseOpToLLVM.cpp). By bypassing that check here, f16/bf16 reductions on targets without NaN-propagating min/max (e.g. NVIDIA < sm80 or AMD targets where supportMaximumMinimum is false) can drop NaNs, producing different results than the scalar combine path. Consider plumbing the target capability or reusing the same helper for min/max to keep NaN semantics consistent.
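The semantic difference in question, sketched in scalar Python (these helpers just model llvm.minimum versus a non-NaN-propagating hardware min; they are not the PR's code):

```python
import math

def minimum(a, b):
    # llvm.minimum semantics: NaN propagates if either input is NaN.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)

def hw_min_no_nan(a, b):
    # Non-NaN-propagating hw min (llvm.minnum-like): NaN inputs are dropped.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)

print(minimum(float("nan"), 1.0))        # nan
print(hw_min_no_nan(float("nan"), 1.0))  # 1.0 -- the NaN is silently lost
```

In a reduction, one dropped NaN at any tree level changes the final result, which is why the scalar path gates on hwNanPropagationSupported.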
Force-pushed from 974fc04 to 0423232
Stacked PRs:
[BACKEND] Perform tree reductions on in-thread values