[BACKEND] Implement support for cross-CTA tt.reduce #9221
Conversation
The title of this PR is a bit of a lie. Even though the lowering is now implemented to support cross-CTA reductions, it depends on `convert_layout` supporting them, and it doesn't currently support LinearLayouts. We should generalise this one first and then enable it here. We should also emit the correct cross-CTA barrier from `targetInfo` in the case of cross-CTA memory reuse. In this PR, we take the chance to also generalise the lowering to avoid convert layouts whenever possible.

stack-info: PR: #9221, branch: lezcano/stack/8
Force-pushed 147422e to a649212.
Force-pushed ace4ffc to 555fd41.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 555fd41ce3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Force-pushed 555fd41 to 7ff455e.
💡 Codex Review
Reviewed commit: 7ff455e951
Force-pushed 7ff455e to c1fa779.
```c++
// The proper way to lower reduce would be to lower it to:
//   reduce_threads / reduce_lanes / convert_layout
// and let the AllocationAnalysis handle the shared memory allocation
// and Membar the barriers.
```
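The staged lowering this comment sketches can be modelled in plain Python. This is a hedged illustration, not Triton's code: `reduce_stage`, the group sizes, and the data layout are made up here to show how a full reduce decomposes into a per-thread step, a per-warp step, and a cross-warp step, with a layout conversion (shared memory) needed only before the last one.

```python
# Model of the staged reduce: 2 warps x 4 lanes x 2 registers per thread,
# holding 16 ints distributed contiguously. All names are illustrative.
from functools import reduce as fold

def reduce_stage(values, group, op):
    """Reduce `values` in chunks of `group` contiguous elements."""
    return [fold(op, values[i:i + group]) for i in range(0, len(values), group)]

data = list(range(16))
add = lambda a, b: a + b

per_thread = reduce_stage(data, 2, add)       # reduce_threads: regs in a thread
per_warp = reduce_stage(per_thread, 4, add)   # reduce_lanes: shuffles in a warp
# convert_layout would go here: the 2 warp partials move through shared memory
total = reduce_stage(per_warp, 2, add)        # final cross-warp reduce
assert total == [sum(range(16))]
```

In this model, only the boundary between `per_warp` and `total` needs memory visible across warps, which is exactly the point where the real lowering would insert a `convert_layout` and let AllocationAnalysis and Membar do their jobs.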
Hmm @ThomasRaoux this is seeming increasingly hacky. Would you be okay with decomposing ReduceOp into ReshapeOp + ReduceOp + ConvertLayoutOp before allocation analysis and just asserting that by the time we get to LLVM lowering the reduce doesn't cross warps or block boundaries? This way we don't add any new ops to the IR but avoid all this mess.
This would be quite nice, actually.
Also note that in many cases when we need to go to shared memory, the lowering is

```
reduce (threads and lanes)
convert_layout
reduce (lanes)
convert_layout
```

so if we run a remove-convert-layout pass after that, that convert_layout could be merged into a different convert_layout.
But yeah, even just doing what Peter suggests would be much neater than what this PR currently does.
oops sorry I missed this comment somehow.
In my opinion this use case is not important enough to create a precedent for doing progressive lowering of TTGIR to LLVM. I think progressive lowering is something we should be careful with, as it can easily lead to leaking abstractions and having many informal levels of IR. Of course we should be pragmatic, but considering we don't even have use cases for CGA-level reductions, it feels like the wrong choice here IMO.
If we are worried about the hacky part here, I would suggest we punt on cross-CTA reductions.
I don't think there are any correctness worries here, it's just a somewhat hacky piece of code. This is not blocking and could be done after the fact.
That being said, note that the proposal from Peter here would be to split a tt.reduce op into tt.reduces and ttg.convert_layouts so there wouldn't be any new ops involved here. It could even be done during the ttgir lowering itself, before the last remove convert layout pass.
> I think progressive lowering is something we should be careful with as it can easily lead to leaking abstractions and having many informal levels of IR.
I don't see how this has anything to do with leaky abstractions? I guess you could consider the ReduceOp not going through shared memory to be an informal level of IR, but in that case why object to adding a real IR op? We already have for example TTGIR -> NVGPU for nvidia ops, so what's wrong with having other ops do something similar.
I find this quite frustrating because splitting the IR is a simple and elegant solution, that combines nicely with all our other infrastructure. But to avoid some seemingly arbitrary rule, we instead are doing something very hacky.
> I don't see how this has anything to do with leaky abstractions? I guess you could consider the ReduceOp not going through shared memory to be an informal level of IR, but in that case why object to adding a real IR op? We already have for example TTGIR -> NVGPU for nvidia ops, so what's wrong with having other ops do something similar.
Yes, what I mean is that the reduce being "legal" at a given stage depends on the lowering rules happening in convert-to-LLVM. NVGPU is one extra level indeed, but I don't think it is a reason to keep adding more of those and going in the direction of progressive lowering with not-super-well-defined rules on what is supported at what stage of the lowering.
> I find this quite frustrating because splitting the IR is a simple and elegant solution, that combines nicely with all our other infrastructure. But to avoid some seemingly arbitrary rule, we instead are doing something very hacky.
Many design rules are arbitrary and I agree this is not ideal.
My suggestion is that we drop or punt on that feature for now. tbh I don't think we should add hacks/complexity for things we don't have a use for.
Force-pushed c1fa779 to 652c4ed.
Force-pushed 3c2a244 to 5af1457.
Force-pushed 5af1457 to 229f05a.
💡 Codex Review
Reviewed commit: 229f05a930
Force-pushed 229f05a to 590654a.
💡 Codex Review
Reviewed commit: fbac5d753c
```c++
auto isReduced = [axis = axis](const LinearLayout &layout) {
  return layout.getOutDimSizes().begin()[axis] == 1;
};
auto regLl = reducedRegLaneLayout(srcTy, axis);

// All the inputs have the same layout so, since we order them from largest
// bitsize to smallest, and the first one is aligned, by induction, they are
// all aligned, so we don't need to align the byte numbers returned here.
unsigned bytesRegToTmp = 0;
while (!isReduced(regLl)) {
  auto tmpLl = getInterLayout(regLl, axis);
  // We take the maximum of the elements and multiply by the total bitwidth.
  // We do this as otherwise it's quite tricky to find the correct
  // BaseOffsets in the lowering.
  int bytes = 0;
  for (auto inputTy : op.getInputTypes()) {
    auto nelem =
        getNumScratchElemsSwizzledCvt(regLl, tmpLl, getBitwidth(inputTy));
    bytes += nelem * (getBitwidth(inputTy) / 8);
  }
  bytesRegToTmp = std::max<unsigned>(bytesRegToTmp, bytes);
  regLl = zeroBasesAlongDimAndReorder(tmpLl, axis, kLane);
}
```
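The sizing logic in the loop above can be summarised as: each reduction stage needs the sum of per-operand scratch bytes, and the allocation must cover the largest stage, i.e. a max-over-stages of a sum-over-operands. A small Python model of that arithmetic (the stage element counts and bitwidths below are made-up inputs standing in for what the layout queries would return):

```python
def scratch_bytes(stage_elems, operand_bitwidths):
    """Max over stages of the summed per-operand scratch footprint.

    stage_elems: scratch element count for each inter-warp stage.
    operand_bitwidths: bit widths of the reduce's operands.
    """
    best = 0
    for nelem in stage_elems:
        # Every operand uses the same element count for a given stage.
        stage = sum(nelem * (bw // 8) for bw in operand_bitwidths)
        best = max(best, stage)
    return best

# Two stages (e.g. cross-warp then cross-CTA), operands of 32 and 16 bits:
# the first stage dominates, so it sizes the allocation.
assert scratch_bytes([64, 8], [32, 16]) == 64 * 4 + 64 * 2
```

Note this model inherits the `bw // 8` truncation that the review comment below flags for sub-byte types; a robust version would round up to at least one byte per element.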
Skip scratch allocation for warp-synchronous reduces
For reductions where the axis has no warp/block bases (warp-synchronous), the lowering never emits convert_layout or shared-memory use; it only does reduceWithinWarps. This loop still computes a non-zero scratch size by calling getNumScratchElemsSwizzledCvt on the identity layout, so AllocationAnalysis will reserve shared memory even though no shared memory is used. That extra allocation can reduce occupancy for common within-warp reductions. Consider returning 0 when no inter-warp reduction is needed (e.g., detect that getInterLayout doesn't move any axis bases or that the reduction is warp-synchronous).
💡 Codex Review
Reviewed commit: 9be4041eb8
```c++
offsets[idx] = offset;
offset += bytesPerOperand[idx];
auto inputTy = op.getInputTypes()[idx];
auto bytes = getNumScratchElemsSwizzledCvt(srcLayout, dstLayout,
                                           getBitwidth(inputTy)) *
             (getBitwidth(inputTy) / 8);
```
Prevent zero-sized smem offsets for sub-byte reductions
In getSmemBaseOffsets, bytes is computed as getBitwidth(inputTy) / 8, which truncates to 0 for sub-byte element types (e.g., i1/fp4). For multi-operand reductions this leaves offset unchanged, so different operands get the same shared-memory base, and the convert_layout writes will overlap and corrupt results. The previous path rounded sub-byte types up to at least 1 byte (via getReduceMemElemTy), so this is a new regression for sub-byte reduces.
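The fix the comment suggests is to round sub-byte element sizes up to a whole byte when advancing the per-operand offsets. A hedged Python sketch of that scheme (`operand_offsets`, `nelem`, and the bitwidth list are illustrative stand-ins for the values computed in getSmemBaseOffsets, not Triton's API):

```python
def operand_offsets(nelem, bitwidths):
    """Shared-memory base offset per operand, never truncating to 0 bytes."""
    offsets, offset = [], 0
    for bw in bitwidths:
        offsets.append(offset)
        # Round up: an i1 or fp4 element still occupies at least one byte,
        # so every operand advances the running offset.
        bytes_per_elem = max(1, (bw + 7) // 8)
        offset += nelem * bytes_per_elem
    return offsets

# A 3-operand reduce with i32, i8, and i1 inputs over 16 scratch elements:
# the i1 operand gets a distinct base instead of aliasing the i8 one.
assert operand_offsets(16, [32, 8, 1]) == [0, 64, 80]
```

With floor division, the i1 operand's footprint would be 0 bytes and the next operand would reuse offset 80's predecessor, which is exactly the overlap the review describes.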