
[BACKEND] Support generic multi-cta convert_layouts #9317

Merged
lezcano merged 1 commit into main from lezcano/stack/9 on Feb 9, 2026

Conversation

Contributor

@lezcano lezcano commented Jan 28, 2026

Stacked PRs:


[BACKEND] Support generic multi-cta convert_layouts

We generalise the swizzling algorithm to work with blocks and generalise
most of the memory lowerings to support layouts with blocks.

We remove the legacy lowering.

The generic swizzling algorithm for blocks might be fine, but we didn't
try to be super clever. There might be some perf left on the table. We
can look into this at a later point if it becomes relevant.

We also activate multi-CTA reductions in the process and test both
features there.

TODO: Add some funky tests that just test the `convert_layout`, not the
`convert_layout` within the reduction.
TODO: Check how to perform multi-CTA barriers on AMD and perhaps merge
cluster barriers into ttg.barrier, predicate broadcasting blocks, etc.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review: automated review suggestions. Reviewed commit: 68f76bf7fe

Comment thread lib/Conversion/TritonGPUToLLVM/Utility.cpp Outdated
@lezcano lezcano requested a review from ThomasRaoux January 28, 2026 12:17
@chatgpt-codex-connector Bot left a comment

💡 Codex Review: automated review suggestions. Reviewed commit: 977774ba5b

Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM.cpp Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.cpp Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.cpp Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM.cpp Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM.cpp Outdated
Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM.cpp Outdated
Comment thread lib/Tools/LayoutUtils.cpp
      llvm::SetVector<uint32_t> tileBases;
    - for (auto bases : llvm::make_second_range(addrLayout.getBases())) {
    + auto addrNamedBases = addrLayout.flattenOuts().getBases();
    + for (auto bases : llvm::make_second_range(addrNamedBases)) {
Contributor

What does flattening the out dims do here? nAdditive could be different for the offset and the block, so does this return the smaller one somehow?

Contributor Author

This treats the block as an extension of the offset, which makes sense here: we treat the block id as being concatenated onto the memory address.
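
As a hand-wavy illustration (a hypothetical helper, not the actual LinearLayout API), flattening the output dims folds the (offset, block) pair into one flat address in which the block id supplies the high bits:

    // Hypothetical sketch, not the LinearLayout API: with the outputs
    // flattened, the block id behaves like extra address bits concatenated
    // above the shared-memory offset, so both act as one output.
    #include <cstdint>

    uint32_t flattenedAddr(uint32_t offset, uint32_t block,
                           uint32_t log2ShmemSize) {
      return (block << log2ShmemSize) | offset; // block bits above offset bits
    }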

Contributor

I don't see how that answers the question? The block isn't calculated as an extension of the offset, so why should nAdditive be the same for both?

Contributor Author

@lezcano lezcano Feb 6, 2026

It allows you to do this:

    auto idxAndBlock =
        reps.apply({{kReg, i}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}});
    auto regIdxI8 = idxAndBlock[0].second * (bitwidth / 8);
    Value offset = b.xor_(regBaseI8, b.i32_val(regIdxI8));
    Value ctaOffset = b.i32_val(0);
    if (useBlockId) {
      ctaOffset = b.xor_(targetCtaId, b.i32_val(idxAndBlock[1].second));
    }
    offset = applyPadding(loc, rewriter, offset, paddingShifts);
    for (int j = 0; j < nAdditive; j += elemsPerVec) {
      // all these constants will go as immediate values to LDS/STS
      auto idxAndBlockAdd =
          reps.apply({{kReg, j}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}});
      auto regIdxAddI8 = idxAndBlockAdd[0].second * (bitwidth / 8);
      // `actionAdditiveStrides` forces `regIdxAddI8` and `offset` to be bitwise
      // disjoint, so we can calculate their padding contributions separately.
      regIdxAddI8 = applyPadding(regIdxAddI8, paddingShifts);
      Value innerOffset = b.add(offset, b.i32_val(regIdxAddI8));
      std::optional<Value> innerCtaOffset;
      if (useBlockId) {
        innerCtaOffset = b.add(ctaOffset, b.i32_val(idxAndBlockAdd[1].second));
      }

just lift that into an add. This optimisation may not be super relevant for blocks (we could just as well write b.xor_ in the inner loop), but it certainly won't hurt?

Contributor

Yes, I understand that. I specifically asked: why do you flatten the out dims? As your code snippet demonstrates, we effectively have two different sublayouts that compute two different outputs, each of which may have a different nAdditive value. So what effect does flattening the outputs have?

Contributor Author

The addition algorithm says:
We allow changing this xor into an addition if the output bits of a given register are disjoint from those of the lanes/warps/blocks.

The output bits here are defined, via that flattening, as the union of the shmem address and the block address.

You might then think this is suboptimal, since a register could be prevented from being converted into an addition because part of its block is touched by another register. But that is not the case: there is no swizzling at the block level, so flattening here is optimal for the part of the optimisation we care about, namely the offsets.
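
For intuition, here is a minimal standalone sketch of the bit-level fact the rewrite relies on: when two offsets occupy disjoint bits, xor and add agree (no carries occur), so the swizzle xor can be lowered as an addition and the constant folded into the LDS/STS immediate.

    // Minimal illustration of the xor-to-add precondition: if two values
    // share no set bits, x ^ y == x + y because no carries are generated.
    #include <cassert>
    #include <cstdint>

    int main() {
      uint32_t base = 0b01100000; // bits contributed by lane/warp/block
      uint32_t reg = 0b00001010;  // register bits, disjoint from `base`
      assert((base & reg) == 0);            // disjointness precondition
      assert((base ^ reg) == (base + reg)); // hence the xor can become an add
      return 0;
    }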

Contributor

Okay, so IIUC it's the minimum of the two nAdditive values, and you expect the block offsets never to be the minimum, so it should have no effect in normal cases. Fair enough.

FYI the quote "We allow changing this xor into an addition..." is IMO a much better explanation than the comments currently in the code. You should consider adding it.

Contributor Author

I will add the comment but not replace the existing one. The issue is that the comment at the top is a bit more general, as it also includes the ideas needed to support this in the ldmatrix/stmatrix lowerings.

Comment thread lib/Tools/GenericSwizzling.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp Outdated
@chatgpt-codex-connector Bot left a comment

💡 Codex Review: automated review suggestions. Reviewed commit: de660ef4bf

Comment thread third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/Allocation.cpp
@chatgpt-codex-connector Bot left a comment

💡 Codex Review: automated review suggestions. Reviewed commit: 714562d9d6

Comment thread lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
@chatgpt-codex-connector Bot left a comment

💡 Codex Review: automated review suggestions. Reviewed commit: 57a329daf4

Comment thread lib/Analysis/Membar.cpp
lezcano added a commit that referenced this pull request Feb 2, 2026
lezcano added a commit that referenced this pull request Feb 3, 2026
lezcano added a commit that referenced this pull request Feb 4, 2026
Comment on lines +271 to +272
auto kReg = str_attr("register");
auto kLane = str_attr("lane");
Collaborator

nit: why all the changes to auto?

Contributor Author

Just for consistency. I changed them while addressing #9317 (comment)

@lezcano lezcano marked this pull request as draft February 5, 2026 21:02
Contributor

@peterbell10 peterbell10 left a comment

There are still unaddressed comments

Comment thread lib/Analysis/Utility.cpp
auto comp = dstLayout.invertAndCompose(srcLayout);
return parentTrivial && comp.isTrivialOver(dim) &&
       srcLayout.getFreeVariableMasks()[dim] == 0 &&
       dstLayout.getFreeVariableMasks()[dim] == 0;
Contributor

@lezcano how come we need to check the free variables here?

I came across a problem arising from multicasting in a matmul, where a cluster-level sync is inserted even though there is no data movement across blocks at all.

If the source and dest "block" bases are identical, but the layout has a basis {0, 0} in it, then this function will return false for both isCvtDimSync(block) and isCvtDimSync(warp)? However, no data moves across the block level at all. In what situations can we ignore the free variables?

To make the question concrete, the following is the case I came across; should we expect the function to return true or false?

Thanks a lot in advance!

TEST(Analysis, isCvtDimSync) {
  MLIRContext ctx;
  auto S = [&](StringRef str) { return StringAttr::get(&ctx, str); };

  auto srcLayout = triton::LinearLayout(
      {{S("register"), {}},
       {S("lane"), {{0, 1}, {1, 0}, {2, 0}, {4, 0}, {8, 0}}},
       {S("warp"), {{16, 0}, {32, 0}}},
       {S("block"), {{0, 0}, {64, 0}}}},
      {S("dim0"), S("dim1")});

  auto dstLayout = triton::LinearLayout(
      {{S("register"), {{0, 1}}},
       {S("lane"), {{1, 0}, {2, 0}, {4, 0}, {8, 0}, {32, 0}}},
       {S("warp"), {{0, 0}, {16, 0}}},
       {S("block"), {{0, 0}, {64, 0}}}},
      {S("dim0"), S("dim1")});

  EXPECT_TRUE(isCvtDimSync(srcLayout, dstLayout, S("block")));
}

Contributor Author

Yeah, this is not exactly correct.
The issue here is a bit subtle and depends on our implementation:

When we have src[warp] == dst[warp] we can choose to lower the conversion via shfl.sync. The issue is that sometimes this is not optimal, and it's better to just go through shared memory even when we could use shfl.sync.
Then, say that src[warp] == [[64], [0]]. We use isCvtDimSync to figure out whether we need to insert a warp sync or a CTA sync. So the fix here is to remove

         srcLayout.getFreeVariableMasks()[dim] == 0 &&
         dstLayout.getFreeVariableMasks()[dim] == 0;

from this function, and then use a different one that takes free variables into account to choose the correct synchronisation in the convert_layout lowering. Can you send a PR with a lit test?
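
For readers following along, here is a minimal standalone sketch (hypothetical helper, not the Triton API) of what a free variable is: an input bit whose basis is all zeros, so flipping it never changes the output, i.e. the data is broadcast across that bit.

    // Hypothetical sketch, not the Triton API: a dimension's free-variable
    // mask has a bit set for every input bit whose basis is zero, meaning
    // that bit does not affect the output (the data is broadcast over it).
    #include <cstdint>
    #include <vector>

    uint32_t freeVariableMask(const std::vector<uint32_t> &flatBases) {
      uint32_t mask = 0;
      for (size_t i = 0; i < flatBases.size(); ++i)
        if (flatBases[i] == 0) // zero basis: flipping this bit changes nothing
          mask |= 1u << i;
      return mask;
    }

    // For the "block" bases {{0, 0}, {64, 0}} above, the first basis is zero,
    // so the mask is nonzero and the current isCvtDimSync conservatively says
    // the dims are not in sync, even though no data crosses blocks.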

Contributor
Thank you very much for the insightful comment. I will make a PR this week.
