[BACKEND] Perform tree reductions on in-thread values#9220

Merged
lezcano merged 1 commit into main from
lezcano/stack/7
Feb 6, 2026

Conversation

@lezcano
Contributor

@lezcano lezcano commented Jan 14, 2026

Stacked PRs:


[BACKEND] Perform tree reductions on in-thread values

We generate ternary trees for suitable integer ops and binary trees for
everything else.

We manually generate {add,mul}.{f16,f32}x2 ops. This brings a speed-up
to some gluon attention kernels.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a649212079

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
@lezcano lezcano marked this pull request as draft January 14, 2026 19:28
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 14, 2026 19:28
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 14, 2026 19:29
@lezcano lezcano marked this pull request as ready for review January 14, 2026 19:29
@lezcano lezcano marked this pull request as draft January 14, 2026 19:49
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 14, 2026 19:49
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 14, 2026 19:49
@lezcano lezcano marked this pull request as ready for review January 14, 2026 19:50

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae06838ea6


Comment on lines +313 to +317
auto acc = useTernary
? treeReduceTernary(op.getLoc(), rewriter,
op.getCombineOp(), std::move(vals))
: treeReduceBinary(op.getLoc(), rewriter,
op.getCombineOp(), std::move(vals));

P2: Preserve sequential order for in-thread reductions

This switch to a tree reduction for per-thread values changes the evaluation order even when the axis never crosses lanes (i.e., only reduceWithinThreads runs). The interpreter’s ReduceOps.generic_reduce performs a left‑to‑right fold along the axis (python/triton/runtime/interpreter.py:950‑975), so custom non‑associative combine functions (e.g., subtraction/division or NaN‑sensitive float ops) will now produce different results on GPU versus the interpreter and versus the previous linear implementation in this single‑thread case. If sequential order is part of the expected semantics for these reductions, consider keeping the linear fold unless the combiner is known to be associative/commutative.


Contributor


I think codex is right here, we have an isAssociative check in reduce so I guess we are supposed to support that use case?

Contributor Author


That assumes a layout where the registers are ordered, blocked or similar. The associativity-only way of doing it for a general layout is much less efficient (you chase the bases in order and you might not be able to reduce all the registers in one go...). We already broke this assumption in the previous lowering, really.

Collaborator


The reduce op assumes the lambda is associative; the lowering would be wrong otherwise. The isAssociative check we have is used to decide whether we should rematerialize; it is an ad hoc workaround for the fact that floating point breaks this rule. Without assuming associativity we cannot do any efficient lowering.

Contributor Author


I don't understand your point, Thomas. Do we want to do the lowering assuming just associativity? If so, we cannot perform it efficiently for generic linear layouts, we should just support blocked layouts. Consider otherwise a layout like

ttgl.DistributedLinearLayout(
    reg_bases=[[1, 0], [4, 0]],
    lane_bases=[[0, 1], [2, 0], [0, 2], [16, 0], [0, 4]],
    warp_bases=[[32, 0], [8, 0]],
    block_bases=[],
    shape=[64, 8],
)

and a reduction over dim=0.
If we didn't transform it to be a blocked layout and we just assume the operation to be associative (and not commutative), then we would need to (naively)

  • First reduce regs[0] and [1]
  • Then perform 1 shuffle and a register reduction
  • Then go to shmem to move the basis [8, 0] to be in the lane bases and perform 1 shuffle and a reduction
  • Then another shuffle and a reduction for basis [16, 0]
  • Finally go to shmem, to move the basis [32, 0] to the lanes, and another shuffle and a register reduction

The previous lowering therefore assumes that the operation is commutative on top of associative, so that it can effectively "reorder the bases before performing the reduction", and this one does the same.
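A small model of the point about commutativity (toy Python, with a hypothetical `perm` standing in for the register order a linear layout induces): a commutative and associative combiner lets the lowering permute the register values freely before reducing, whereas a merely associative one does not.

```python
import operator
from functools import reduce


def reduce_with_reordered_bases(combine, vals, perm):
    # With a commutative + associative combiner, the lowering is free to
    # "reorder the bases": permute the register values before the fold.
    return reduce(combine, [vals[p] for p in perm])


vals = [3, 1, 4, 1]
perm = [2, 0, 3, 1]  # hypothetical register order induced by the layout

# Addition: any permutation gives the same result.
assert reduce_with_reordered_bases(operator.add, vals, perm) == sum(vals)

# Subtraction is associative-breaking and non-commutative: the permutation
# changes the answer, so bases could not be reordered for such a combiner.
assert reduce_with_reordered_bases(operator.sub, vals, perm) != reduce(operator.sub, vals)
```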

Collaborator


@lezcano I think that it's reasonable to make the assumption of commutativity in addition to associativity.

The only time that we've ever done reductions over non-commutative functions is over axes of length <= 2, and for those your implementation works for arbitrary f(x, y). We did this (a reduction of _take_first over an axis of length 1) as a hack for implementing unsplat before it was added as a TTIR primitive: a7a89c7

Collaborator


I don't understand your point, Thomas. Do we want to do the lowering assuming just associativity?

no I just meant we should assume associativity.

About commutativity: until now we were not requiring it, and we have a test here for the non-commutative case:

def get_first_element(a, b):

But yeah it sounds like we need to relax this restriction to do efficient codegen for linear layouts.

This is already broken in current codegen right?

Collaborator


We should probably update our docs and this test.

Contributor Author


This is already broken in current codegen right?

Yep, it was only correct if the layout happened to be regular enough (i.e. blocked).

Contributor Author


will update the test and docs

Collaborator

@apgoucher apgoucher left a comment


@lezcano Approved, although I'm unsure a priori of the utility of the ternary tree reduction for integer operations. After all, in those cases the binary operation is exactly associative, so InstCombine and ptxas will each have the opportunity to completely rewrite your expression.

@lezcano
Contributor Author

lezcano commented Jan 14, 2026

Fair point. I'll check the generated SASS tomorrow.

lezcano added a commit that referenced this pull request Jan 26, 2026
We generate ternary trees for suitable integer ops and binary trees for
everything else.

We manually generate `{add,mul}.{f16,f32}x2` ops. This brings a speed-up
to some gluon attention kernels.

stack-info: PR: #9220, branch: lezcano/stack/7
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
@lezcano lezcano marked this pull request as draft January 28, 2026 10:45
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 28, 2026 10:45
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 28, 2026 10:45
@lezcano lezcano marked this pull request as draft January 29, 2026 18:56
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 29, 2026 18:56
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 29, 2026 18:56
@lezcano lezcano marked this pull request as ready for review January 29, 2026 18:56
@lezcano lezcano requested a review from Jokeren as a code owner January 29, 2026 18:56
@lezcano
Contributor Author

lezcano commented Jan 29, 2026

Regarding the vectorisation strategy discussed in #9220 (comment), I went with a different one.

The idea is to always vectorise vec[i] and vec[i+1] to avoid packing / unpacking (PRMT/MOVs) in SASS.

@peterbell10 @apgoucher this is ready for another review.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 532c5ffad3


Comment thread lib/Analysis/Utility.cpp
@lezcano lezcano marked this pull request as draft January 29, 2026 20:53
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 29, 2026 20:53
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 29, 2026 20:54
@lezcano lezcano marked this pull request as ready for review January 29, 2026 20:54
Contributor

@peterbell10 peterbell10 left a comment


LGTM, would be interested to see performance impact.

Comment thread lib/Analysis/Utility.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
@lezcano
Contributor Author

lezcano commented Jan 29, 2026

LGTM, would be interested to see performance impact.

I will re-run perf before merging, but initially I saw between a 4% and 5.2% speed-up on the gluon attention kernel.

@lezcano lezcano marked this pull request as draft January 29, 2026 22:52
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 29, 2026 22:52
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 29, 2026 22:52
@lezcano lezcano marked this pull request as ready for review January 29, 2026 22:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 974fc04565


Comment thread lib/Analysis/Utility.cpp
Comment on lines +202 to +206
case InThreadVectorizeOpKind::MinimumF:
result = LLVM::MinimumOp::create(builder, loc, vecTy, lhs, rhs);
break;
case InThreadVectorizeOpKind::MaximumF:
result = LLVM::MaximumOp::create(builder, loc, vecTy, lhs, rhs);

P2: Preserve NaN-propagating min/max in vectorized reductions

The vectorized combine region always emits LLVM::MinimumOp/LLVM::MaximumOp for arith::MinimumFOp/MaximumFOp. Elsewhere, min/max lowering gates these ops on hwNanPropagationSupported and falls back to a NaN-emulating path when the target doesn’t propagate NaNs (see MinMaxFOpConversion in ElementwiseOpToLLVM.cpp). By bypassing that check here, f16/bf16 reductions on targets without NaN-propagating min/max (e.g. NVIDIA < sm80 or AMD targets where supportMaximumMinimum is false) can drop NaNs, producing different results than the scalar combine path. Consider plumbing the target capability or reusing the same helper for min/max to keep NaN semantics consistent.
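The semantic gap Codex flags can be illustrated with a toy Python model (this is an illustration of the two behaviors, not Triton's code): `llvm.minimum` propagates NaN from either operand, while a plain compare-and-select lowering silently drops a NaN on one side.

```python
import math


def minimum(a, b):
    # llvm.minimum-style semantics: NaN propagates if either input is NaN.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a < b else b


def min_cmp_select(a, b):
    # A compare-and-select lowering, roughly what a target without
    # NaN-propagating hardware min can end up with: NaN compares false
    # against everything, so a NaN on the left is silently dropped.
    return a if a < b else b


nan = math.nan
assert math.isnan(minimum(nan, 1.0))     # NaN preserved
assert min_cmp_select(nan, 1.0) == 1.0   # NaN dropped
```

A reduction built from the second form can therefore erase NaNs that the scalar combine path would have propagated, which is exactly the inconsistency the review points at.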


@lezcano lezcano marked this pull request as draft January 29, 2026 23:40
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 29, 2026 23:40
@lezcano lezcano changed the base branch from main to lezcano/stack/6 January 29, 2026 23:40
@lezcano lezcano marked this pull request as ready for review January 29, 2026 23:41
@lezcano lezcano marked this pull request as draft January 29, 2026 23:43
@lezcano lezcano changed the base branch from lezcano/stack/6 to main January 29, 2026 23:43