[BACKEND] Implement support for cross-CTA tt.reduce #9221
Conversation
The title of this PR is a bit of a lie. Even though the lowering is now implemented to support cross-CTA reductions, it depends on `convert_layout` supporting them, and it doesn't currently support LinearLayouts. We should generalise this one first and then enable it here. We should also emit the correct cross-CTA barrier from `targetInfo` in the case of cross-CTA memory reuse. In this PR, we take the chance to also generalise the lowering to avoid convert layouts whenever possible.

stack-info: PR: #9221, branch: lezcano/stack/8
Force-pushed 147422e to a649212.
Force-pushed ace4ffc to 555fd41.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 555fd41ce3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed 555fd41 to 7ff455e.
💡 Codex Review
Reviewed commit: 7ff455e951
Force-pushed 7ff455e to c1fa779.
```c++
// The proper way to lower reduce would be to lower it to:
//   reduce_threads / reduce_lanes / convert_layout
// and let the AllocationAnalysis handle the shared memory allocation
// and Membar the barriers.
```
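The staged lowering this comment sketches can be modelled in plain Python. This is a hedged illustration, not Triton's code: `reduce_stage`, the group sizes, and the data layout are made up here to show how a full reduce decomposes into a per-thread step, a per-warp step, and a cross-warp step, with a layout conversion (shared memory) needed only before the last one.

```python
# Model of the staged reduce: 2 warps x 4 lanes x 2 registers per thread,
# holding 16 ints distributed contiguously. All names are illustrative.
from functools import reduce as fold

def reduce_stage(values, group, op):
    """Reduce `values` in chunks of `group` contiguous elements."""
    return [fold(op, values[i:i + group]) for i in range(0, len(values), group)]

data = list(range(16))
add = lambda a, b: a + b

per_thread = reduce_stage(data, 2, add)       # reduce_threads: regs in a thread
per_warp = reduce_stage(per_thread, 4, add)   # reduce_lanes: shuffles in a warp
# convert_layout would go here: the 2 warp partials move through shared memory
total = reduce_stage(per_warp, 2, add)        # final cross-warp reduce
assert total == [sum(range(16))]
```

In this model, only the boundary between `per_warp` and `total` needs memory visible across warps, which is exactly the point where the real lowering would insert a `convert_layout` and let AllocationAnalysis and Membar do their jobs.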
Hmm @ThomasRaoux this is seeming increasingly hacky. Would you be okay with decomposing ReduceOp into ReshapeOp + ReduceOp + ConvertLayoutOp before allocation analysis and just asserting that by the time we get to LLVM lowering the reduce doesn't cross warps or block boundaries? This way we don't add any new ops to the IR but avoid all this mess.
This would be quite nice, actually.
Also note that in many cases when we need to go to shared memory, the lowering is

```
reduce (threads and lanes)
convert_layout
reduce (lanes)
convert_layout
```

so if we run a remove-convert-layout pass after that, that convert_layout could be merged into a different convert_layout.
But yeah, even just doing what Peter suggests would be much neater than what this PR currently does.
oops sorry I missed this comment somehow.
In my opinion this use case is not important enough to create a precedent for doing progressive lowering of TTGIR to LLVM. I think progressive lowering is something we should be careful with, as it can easily lead to leaking abstractions and having many informal levels of IR. Of course we should be pragmatic, but considering we don't even have use cases for CGA-level reductions, it feels like the wrong choice here IMO.
If we are worried about the hacky part here, I would suggest we punt on cross-CTA reductions.
I don't think there are any correctness worries here, it's just a somewhat hacky piece of code. This is not blocking and could be done after the fact.
That being said, note that the proposal from Peter here would be to split a tt.reduce op into tt.reduces and ttg.convert_layouts so there wouldn't be any new ops involved here. It could even be done during the ttgir lowering itself, before the last remove convert layout pass.
> I think progressive lowering is something we should be careful with as it can easily lead to leaking abstractions and having many informal levels of IR.
I don't see how this has anything to do with leaky abstractions? I guess you could consider the ReduceOp not going through shared memory to be an informal level of IR, but in that case why object to adding a real IR op? We already have for example TTGIR -> NVGPU for nvidia ops, so what's wrong with having other ops do something similar.
I find this quite frustrating because splitting the IR is a simple and elegant solution, that combines nicely with all our other infrastructure. But to avoid some seemingly arbitrary rule, we instead are doing something very hacky.
> I don't see how this has anything to do with leaky abstractions? I guess you could consider the ReduceOp not going through shared memory to be an informal level of IR, but in that case why object to adding a real IR op? We already have for example TTGIR -> NVGPU for nvidia ops, so what's wrong with having other ops do something similar.
Yes, what I mean is that the reduce being "legal" at a given stage depends on the lowering rules happening in convert-to-LLVM. NVGPU is one extra level indeed, but I don't think it is a reason to keep adding more of those and going in the direction of progressive lowering with not-super-well-defined rules on what is supported at what stage of the lowering.
> I find this quite frustrating because splitting the IR is a simple and elegant solution, that combines nicely with all our other infrastructure. But to avoid some seemingly arbitrary rule, we instead are doing something very hacky.
Many design rules are arbitrary and I agree this is not ideal.
My suggestion is that we drop or punt on that feature for now. tbh I don't think we should add hacks/complexity for things we don't have a use for.
Force-pushed c1fa779 to 652c4ed.
Force-pushed 3c2a244 to 5af1457.
Force-pushed 5af1457 to 229f05a.
💡 Codex Review
Reviewed commit: 229f05a930
Force-pushed 229f05a to 590654a.
💡 Codex Review
Reviewed commit: fbac5d753c
```c++
auto isReduced = [axis = axis](const LinearLayout &layout) {
  return layout.getOutDimSizes().begin()[axis] == 1;
};
auto regLl = reducedRegLaneLayout(srcTy, axis);

// All the inputs have the same layout so, since we order them from largest
// bitsize to smallest, and the first one is aligned, by induction, they are
// all aligned, so we don't need to align the byte numbers returned here.
unsigned bytesRegToTmp = 0;
while (!isReduced(regLl)) {
  auto tmpLl = getInterLayout(regLl, axis);
  // We take the maximum of the elements and multiply by the total bitwidth.
  // We do this as otherwise it's quite tricky to find the correct
  // BaseOffsets in the lowering.
  int bytes = 0;
  for (auto inputTy : op.getInputTypes()) {
    auto nelem =
        getNumScratchElemsSwizzledCvt(regLl, tmpLl, getBitwidth(inputTy));
    bytes += nelem * (getBitwidth(inputTy) / 8);
  }
  bytesRegToTmp = std::max<unsigned>(bytesRegToTmp, bytes);
  regLl = zeroBasesAlongDimAndReorder(tmpLl, axis, kLane);
}
```
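The sizing logic in the loop above can be summarised as: each reduction stage needs the sum of per-operand scratch bytes, and the allocation must cover the largest stage, i.e. a max-over-stages of a sum-over-operands. A small Python model of that arithmetic (the stage element counts and bitwidths below are made-up inputs standing in for what the layout queries would return):

```python
def scratch_bytes(stage_elems, operand_bitwidths):
    """Max over stages of the summed per-operand scratch footprint.

    stage_elems: scratch element count for each inter-warp stage.
    operand_bitwidths: bit widths of the reduce's operands.
    """
    best = 0
    for nelem in stage_elems:
        # Every operand uses the same element count for a given stage.
        stage = sum(nelem * (bw // 8) for bw in operand_bitwidths)
        best = max(best, stage)
    return best

# Two stages (e.g. cross-warp then cross-CTA), operands of 32 and 16 bits:
# the first stage dominates, so it sizes the allocation.
assert scratch_bytes([64, 8], [32, 16]) == 64 * 4 + 64 * 2
```

Note this model inherits the `bw // 8` truncation that the review comment below flags for sub-byte types; a robust version would round up to at least one byte per element.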
Skip scratch allocation for warp-synchronous reduces
For reductions where the axis has no warp/block bases (warp-synchronous), the lowering never emits convert_layout or shared-memory use; it only does reduceWithinWarps. This loop still computes a non-zero scratch size by calling getNumScratchElemsSwizzledCvt on the identity layout, so AllocationAnalysis will reserve shared memory even though no shared memory is used. That extra allocation can reduce occupancy for common within-warp reductions. Consider returning 0 when no inter-warp reduction is needed (e.g., detect that getInterLayout doesn't move any axis bases or that the reduction is warp-synchronous).
💡 Codex Review
Reviewed commit: 9be4041eb8
```c++
offsets[idx] = offset;
offset += bytesPerOperand[idx];
auto inputTy = op.getInputTypes()[idx];
auto bytes = getNumScratchElemsSwizzledCvt(srcLayout, dstLayout,
                                           getBitwidth(inputTy)) *
             (getBitwidth(inputTy) / 8);
```
Prevent zero-sized smem offsets for sub-byte reductions
In getSmemBaseOffsets, bytes is computed as getBitwidth(inputTy) / 8, which truncates to 0 for sub-byte element types (e.g., i1/fp4). For multi-operand reductions this leaves offset unchanged, so different operands get the same shared-memory base, and the convert_layout writes will overlap and corrupt results. The previous path rounded sub-byte types up to at least 1 byte (via getReduceMemElemTy), so this is a new regression for sub-byte reduces.
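The fix the comment suggests is to round sub-byte element sizes up to a whole byte when advancing the per-operand offsets. A hedged Python sketch of that scheme (`operand_offsets`, `nelem`, and the bitwidth list are illustrative stand-ins for the values computed in getSmemBaseOffsets, not Triton's API):

```python
def operand_offsets(nelem, bitwidths):
    """Shared-memory base offset per operand, never truncating to 0 bytes."""
    offsets, offset = [], 0
    for bw in bitwidths:
        offsets.append(offset)
        # Round up: an i1 or fp4 element still occupies at least one byte,
        # so every operand advances the running offset.
        bytes_per_elem = max(1, (bw + 7) // 8)
        offset += nelem * bytes_per_elem
    return offsets

# A 3-operand reduce with i32, i8, and i1 inputs over 16 scratch elements:
# the i1 operand gets a distinct base instead of aliasing the i8 one.
assert operand_offsets(16, [32, 8, 1]) == [0, 64, 80]
```

With floor division, the i1 operand's footprint would be 0 bytes and the next operand would reuse offset 80's predecessor, which is exactly the overlap the review describes.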