
[BACKEND] Implement support for cross-CTA tt.reduce#9221

Merged
lezcano merged 1 commit into main from lezcano/stack/8 on Feb 6, 2026

Conversation

Contributor

@lezcano lezcano commented Jan 14, 2026

Stacked PRs:


[BACKEND] Implement support for cross-CTA tt.reduce

The title of this PR is a bit of a lie. Even though the lowering is now
implemented to support cross-CTA reductions, it depends on
`convert_layout` supporting them, and `convert_layout` doesn't currently
support LinearLayouts. We should generalise `convert_layout` first and
then enable the lowering here. We should also emit the correct cross-CTA
barrier from `targetInfo` in the case of cross-CTA memory reuse.

In this PR, we take the chance to also generalise the lowering to avoid
convert layouts whenever possible.

lezcano added a commit that referenced this pull request Jan 14, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 555fd41ce3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
Comment thread lib/Tools/LayoutUtils.cpp
@lezcano lezcano marked this pull request as draft January 14, 2026 19:28
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 14, 2026 19:28
lezcano added a commit that referenced this pull request Jan 14, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 14, 2026 19:29
@lezcano lezcano marked this pull request as ready for review January 14, 2026 19:29

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Reviewed commit: 7ff455e951

Comment thread lib/Analysis/Allocation.cpp Outdated
@lezcano lezcano marked this pull request as draft January 14, 2026 19:49
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 14, 2026 19:49
lezcano added a commit that referenced this pull request Jan 14, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 14, 2026 19:50
@lezcano lezcano marked this pull request as ready for review January 14, 2026 19:50
@lezcano lezcano requested a review from ThomasRaoux January 14, 2026 19:51
lezcano added a commit that referenced this pull request Jan 26, 2026
Comment thread lib/Analysis/Allocation.cpp Outdated
Comment thread lib/Analysis/Utility.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
Comment thread lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp Outdated
// The proper way to lower reduce would be to lower it to:
//   reduce_threads / reduce_lanes / convert_layout
// and let AllocationAnalysis handle the shared memory allocation
// and let membar insert the barriers
Contributor

@peterbell10 peterbell10 Jan 27, 2026


Hmm @ThomasRaoux this is seeming increasingly hacky. Would you be okay with decomposing ReduceOp into ReshapeOp + ReduceOp + ConvertLayoutOp before allocation analysis and just asserting that by the time we get to LLVM lowering the reduce doesn't cross warps or block boundaries? This way we don't add any new ops to the IR but avoid all this mess.

Contributor Author

@lezcano lezcano Jan 27, 2026


This would be quite nice, actually.
Also note that in many cases when we need to go through shared memory, the lowering is

reduce (threads and lanes)
convert_layout
reduce (lanes)
convert_layout

so if we run a remove-convert-layout pass after that, the trailing convert_layout could be merged into a different convert_layout.

But yeah, even just doing what Peter suggests would be much neater than what this PR currently does.
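The staged lowering sketched above can be illustrated with a plain C++ simulation (illustrative only — this is not Triton code, and `stagedReduce` is a made-up name; the real lowering operates on layouts and hardware shuffles, not vectors):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Stage 1 reduces within each warp, the convert_layout step regroups the
// per-warp partials, and stage 2 reduces across the warps.
int stagedReduce(const std::vector<std::vector<int>> &warps) {
  // Stage 1: reduce within each warp (on a GPU this would use lane shuffles).
  std::vector<int> partials;
  for (const auto &lanes : warps)
    partials.push_back(std::accumulate(lanes.begin(), lanes.end(), 0));
  // "convert_layout": in the real lowering the partials round-trip through
  // shared memory so that a single warp holds all of them; here they are
  // simply gathered into one vector.
  // Stage 2: reduce the per-warp partials.
  return std::accumulate(partials.begin(), partials.end(), 0);
}
```

The point of the sketch is only the staging: each stage is itself a reduce, and the data movement between stages is exactly a layout conversion.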


Collaborator


Oops, sorry, I missed this comment somehow.
In my opinion this use case is not important enough to set a precedent for progressive lowering of TTGIR to LLVM. I think progressive lowering is something we should be careful with, as it can easily lead to leaking abstractions and having many informal levels of IR. Of course we should be pragmatic, but considering we don't even have use cases for CGA-level reductions, it feels like the wrong choice here IMO.

Collaborator


If we are worried about the hacky part here, I would suggest we punt on cross-CTA reductions.

Contributor Author

@lezcano lezcano Feb 6, 2026


I don't think there are any correctness worries here; it's just a somewhat hacky piece of code. This is not blocking and could be done after the fact.

That being said, note that Peter's proposal here is to split a tt.reduce op into tt.reduces and ttg.convert_layouts, so no new ops would be involved. It could even be done during the ttgir lowering itself, before the last remove-convert-layout pass.

Contributor


> I think progressive lowering is something we should be careful with as it can easily lead to leaking abstractions and having many informal levels of IR.

I don't see how this has anything to do with leaky abstractions. I guess you could consider the ReduceOp not going through shared memory to be an informal level of IR, but in that case why object to adding a real IR op? We already have, for example, TTGIR -> NVGPU for NVIDIA ops, so what's wrong with having other ops do something similar?

I find this quite frustrating because splitting the IR is a simple and elegant solution that combines nicely with all our other infrastructure. But to avoid a seemingly arbitrary rule, we are instead doing something very hacky.

Collaborator


> I don't see how this has anything to do with leaky abstractions? I guess you could consider the ReduceOp not going through shared memory to be an informal level of IR, but in that case why object to adding a real IR op? We already have for example TTGIR -> NVGPU for nvidia ops, so what's wrong with having other ops do something similar.

Yes, what I mean is that whether the reduce is "legal" at a given stage depends on the lowering rules applied in the conversion to LLVM. NVGPU is indeed one extra level, but I don't think that is a reason to keep adding more of them and moving towards progressive lowering without well-defined rules on what is supported at each stage.

> I find this quite frustrating because splitting the IR is a simple and elegant solution, that combines nicely with all our other infrastructure. But to avoid some seemingly arbitrary rule, we instead are doing something very hacky.

Many design rules are arbitrary, and I agree this is not ideal.

My suggestion is that we drop or punt on that feature for now. TBH, I don't think we should add hacks/complexity for things we don't have a use for.

Comment thread lib/Tools/LayoutUtils.cpp Outdated
Comment thread third_party/amd/lib/Analysis/AMDGPUAllocation.cpp Outdated
@lezcano lezcano marked this pull request as draft January 28, 2026 10:45
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 28, 2026 10:45
@lezcano lezcano marked this pull request as draft January 29, 2026 12:20
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 29, 2026 12:20
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 29, 2026 12:20
@lezcano lezcano marked this pull request as ready for review January 29, 2026 12:20
@lezcano lezcano marked this pull request as draft January 29, 2026 12:53
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 29, 2026 12:53
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 29, 2026 12:53
@lezcano lezcano marked this pull request as ready for review January 29, 2026 12:54
@lezcano lezcano marked this pull request as draft January 29, 2026 18:56
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 29, 2026 18:56
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 29, 2026 18:56
@lezcano lezcano marked this pull request as ready for review January 29, 2026 18:56
@lezcano lezcano marked this pull request as draft January 29, 2026 20:53
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 29, 2026 20:53
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 29, 2026 20:54
@lezcano lezcano marked this pull request as ready for review January 29, 2026 20:54

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Reviewed commit: 229f05a930

Comment thread lib/Analysis/Utility.cpp
@lezcano lezcano marked this pull request as draft January 29, 2026 22:52
@lezcano lezcano changed the base branch from lezcano/stack/7 to main January 29, 2026 22:52
@lezcano lezcano changed the base branch from main to lezcano/stack/7 January 29, 2026 22:52
@lezcano lezcano marked this pull request as ready for review January 29, 2026 22:52
@lezcano lezcano marked this pull request as draft January 29, 2026 23:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Reviewed commit: fbac5d753c

Comment thread lib/Analysis/Utility.cpp
Comment on lines +111 to +133
  auto isReduced = [axis = axis](const LinearLayout &layout) {
    return layout.getOutDimSizes().begin()[axis] == 1;
  };
  auto regLl = reducedRegLaneLayout(srcTy, axis);

  // All the inputs have the same layout so, since we order them from largest
  // bitsize to smallest, and the first one is aligned, by induction, they are
  // all aligned, so we don't need to align the byte numbers returned here.
  unsigned bytesRegToTmp = 0;
  while (!isReduced(regLl)) {
    auto tmpLl = getInterLayout(regLl, axis);
    // We take the maximum of the elements and multiply by the total bitwidth.
    // We do this as otherwise it's quite tricky to find the correct
    // BaseOffsets in the lowering.
    int bytes = 0;
    for (auto inputTy : op.getInputTypes()) {
      auto nelem =
          getNumScratchElemsSwizzledCvt(regLl, tmpLl, getBitwidth(inputTy));
      bytes += nelem * (getBitwidth(inputTy) / 8);
    }
    bytesRegToTmp = std::max<unsigned>(bytesRegToTmp, bytes);
    regLl = zeroBasesAlongDimAndReorder(tmpLl, axis, kLane);
  }


P2 Badge Skip scratch allocation for warp-synchronous reduces

For reductions where the axis has no warp/block bases (warp‑synchronous), the lowering never emits convert_layout or shared‑memory use—it only does reduceWithinWarps. This loop still computes a non‑zero scratch size by calling getNumScratchElemsSwizzledCvt on the identity layout, so AllocationAnalysis will reserve shared memory even though no shared memory is used. That extra allocation can reduce occupancy for common within‑warp reductions. Consider returning 0 when no inter‑warp reduction is needed (e.g., detect that getInterLayout doesn’t move any axis bases or that the reduction is warp‑synchronous).



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Reviewed commit: 9be4041eb8

Comment thread lib/Tools/LayoutUtils.cpp
Comment on lines 469 to +473
    offsets[idx] = offset;
    offset += bytesPerOperand[idx];
    auto inputTy = op.getInputTypes()[idx];
    auto bytes = getNumScratchElemsSwizzledCvt(srcLayout, dstLayout,
                                               getBitwidth(inputTy)) *
                 (getBitwidth(inputTy) / 8);


P1 Badge Prevent zero-sized smem offsets for sub-byte reductions

In getSmemBaseOffsets, bytes is computed as getBitwidth(inputTy) / 8, which truncates to 0 for sub‑byte element types (e.g., i1/fp4). For multi‑operand reductions this leaves offset unchanged, so different operands get the same shared‑memory base, and the convert_layout writes will overlap and corrupt results. The previous path rounded sub‑byte types up to at least 1 byte (via getReduceMemElemTy), so this is a new regression for sub‑byte reduces.
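The truncation can be shown in isolation (a hypothetical sketch — `bytesTruncating` and `bytesCeil` are illustrative names, not the actual Triton helpers):

```cpp
#include <cassert>

// With plain integer division, a sub-byte element type such as i1 yields
// bitwidth / 8 == 0, so the per-operand byte count is 0 and consecutive
// operands would be assigned the same shared-memory base offset.
unsigned bytesTruncating(unsigned nelem, unsigned bitwidth) {
  return nelem * (bitwidth / 8); // truncates to 0 for bitwidth < 8
}

// Rounding each element up to at least one byte keeps the offsets distinct,
// mirroring the old behaviour the comment attributes to getReduceMemElemTy.
unsigned bytesCeil(unsigned nelem, unsigned bitwidth) {
  return nelem * ((bitwidth + 7) / 8); // ceiling division: i1 -> 1 byte
}
```

For 16 elements of i1, the truncating form returns 0 bytes while the ceiling form returns 16, which is what keeps multi-operand offsets from colliding.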

