[BACKEND] Support generic multi-cta convert_layouts#9317
Conversation
Force-pushed 9a21e07 to 68f76bf
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68f76bf7fe
Force-pushed 68f76bf to 977774b
💡 Codex Review — Reviewed commit: 977774ba5b
  llvm::SetVector<uint32_t> tileBases;
- for (auto bases : llvm::make_second_range(addrLayout.getBases())) {
+ auto addrNamedBases = addrLayout.flattenOuts().getBases();
+ for (auto bases : llvm::make_second_range(addrNamedBases)) {
What does flattening the out dim do here? nAdditive could be different for the offset and the block, so does this return the smaller one somehow?
This treats blocks as an extension of the offset, which makes sense here. We treat the block as a concatenation of the memory address.
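To illustrate the "block as a concatenation of the memory address" idea, here is a hypothetical sketch (not the actual Triton code; `OFFSET_BITS`, `flat_address`, and `split_address` are made-up names): the block id is treated as extra high bits placed above the per-block shared-memory offset, giving one flat address space.

```python
# Hypothetical sketch: the block id becomes the high bits above the
# shared-memory offset, so (block, offset) is one flat address.
OFFSET_BITS = 16  # assumed width of the per-block shmem offset


def flat_address(block: int, offset: int) -> int:
    """Concatenate the block id above the offset bits."""
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (block << OFFSET_BITS) | offset


def split_address(addr: int) -> tuple[int, int]:
    """Inverse: recover (block, offset) from the flat address."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)
```

Under this view, flattening the out dims just means reasoning about the concatenated bits as a single output.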
I don't see how that answers the question? The block isn't calculated as an extension of the offset, so why should nAdditive be the same for both?
It allows you to do this:

auto idxAndBlock =
    reps.apply({{kReg, i}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}});
auto regIdxI8 = idxAndBlock[0].second * (bitwidth / 8);
Value offset = b.xor_(regBaseI8, b.i32_val(regIdxI8));
Value ctaOffset = b.i32_val(0);
if (useBlockId) {
  ctaOffset = b.xor_(targetCtaId, b.i32_val(idxAndBlock[1].second));
}
offset = applyPadding(loc, rewriter, offset, paddingShifts);
for (int j = 0; j < nAdditive; j += elemsPerVec) {
  // all these constants will go as immediate values to LDS/STS
  auto idxAndBlockAdd =
      reps.apply({{kReg, j}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}});
  auto regIdxAddI8 = idxAndBlockAdd[0].second * (bitwidth / 8);
  // `actionAdditiveStrides` forces `regIdxAddI8` and `offset` to be bitwise
  // disjoint, so we can calculate their padding contributions separately.
  regIdxAddI8 = applyPadding(regIdxAddI8, paddingShifts);
  Value innerOffset = b.add(offset, b.i32_val(regIdxAddI8));
  std::optional<Value> innerCtaOffset;
  if (useBlockId) {
    innerCtaOffset = b.add(ctaOffset, b.i32_val(idxAndBlockAdd[1].second));
  }
}

We just lift that into an add. I think it may be an optimisation that is not super relevant for blocks, and we could very well write b.xor_ in the inner loop, but it certainly won't hurt?
Yes, I understand that. I specifically asked: why do you flatten the out dims? As your code snippet demonstrates, we effectively have two different sublayouts that compute two different outputs, each of which may have a different nAdditive value. So what effect does flattening the outputs have?
The addition algorithm says:
We allow to change this xor into an addition if the output bits of this given register are disjoint to those from lanes/warps/blocks.
The output bits here are defined, by that flattening, as the union of the shmem address + the block address.
You could then think that this is not optimal, as you could be stopping a register from being converted into an addition because a part of its block is touched by another register, but this is not even the case, as there is no swizzling at a block level, so flattening here is optimal for the part of the optimisation that we care about, which is the offsets.
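The bit-arithmetic fact behind that xor-to-add rewrite can be sketched independently of the Triton code: when two values have disjoint set bits, there are no carries, so xor, or, and add all coincide — which is what lets the lowering emit the register index as an immediate add.

```python
def bits_disjoint(a: int, b: int) -> bool:
    """True iff no bit position is set in both a and b."""
    return a & b == 0


# Disjoint bits: no carry is possible, so xor == or == add.
a, b = 0b1010_0000, 0b0000_0101
assert bits_disjoint(a, b)
assert a ^ b == a | b == a + b

# Overlapping bits: carries appear and the equivalence breaks.
c, d = 0b1100, 0b0100
assert c ^ d != c + d
```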
Okay, so IIUC it's the minimum of the nAdditive of the two, and you expect the block offsets to never be the minimum of the two, so it should have no effect in normal cases. Fair enough.
FYI the quote "We allow to change this xor into an addition..." is IMO a much better explanation than the comments currently in the code. You should consider adding it.
I will add the comment but not replace it. The issue is that the comment at the top is a bit more general, as it also includes the ideas needed to support this in the ldmatrix/stmatrix lowerings.
Force-pushed 977774b to 988c0b3
Force-pushed 988c0b3 to de660ef
💡 Codex Review — Reviewed commit: de660ef4bf
💡 Codex Review — Reviewed commit: 714562d9d6
Force-pushed 714562d to 1c818e8
Force-pushed 1c818e8 to cb2fc7a
Force-pushed cb2fc7a to 57a329d
💡 Codex Review — Reviewed commit: 57a329daf4
We generalise the swizzling algorithm to work with blocks and generalise most of the memory lowerings to support layouts with blocks. We remove the legacy lowering.

The generic swizzling algorithm for blocks might be fine, but we didn't try to be super clever; there might be some perf left on the table. We can look into this at a later point if it becomes relevant.

We also activate multi-cta reductions in the process and test both there.

TODO: Add some funky tests that just test the `convert_layout`, not the `convert_layout` within the reduction.
TODO: Check how to perform multiCTA barriers in AMD and perhaps merge cluster barriers into ttg.barrier, predicate broadcasting blocks, etc.

stack-info: PR: #9317, branch: lezcano/stack/9
auto kReg = str_attr("register");
auto kLane = str_attr("lane");
nit: why all the changes to auto?
just for consistency. I changed them when I was addressing #9317 (comment)
Force-pushed 57a329d to a8d2f7c
peterbell10
left a comment
There are still unaddressed comments
auto comp = dstLayout.invertAndCompose(srcLayout);
return parentTrivial && comp.isTrivialOver(dim) &&
       srcLayout.getFreeVariableMasks()[dim] == 0 &&
       dstLayout.getFreeVariableMasks()[dim] == 0;
@lezcano how come we need to check the free variables here?
I came across a problem arising from multi-casting of matmul, where a cluster-level sync is inserted even though there is no data movement across blocks at all.
If the source and dest "block" bases are identical, but the layout has a basis {0, 0} in it, then this function will return false for both isCvtDimSync(block) and isCvtDimSync(warp). However, no data moves across the block level at all. In what situations can we ignore the free variables?
To make the question concrete, the following is the case I came across; should we expect the function to return true or false?
Thanks a lot in advance!
TEST(Analysis, isCvtDimSync) {
  MLIRContext ctx;
  auto S = [&](StringRef str) { return StringAttr::get(&ctx, str); };
  auto srcLayout = triton::LinearLayout(
      {{S("register"), {}},
       {S("lane"), {{0, 1}, {1, 0}, {2, 0}, {4, 0}, {8, 0}}},
       {S("warp"), {{16, 0}, {32, 0}}},
       {S("block"), {{0, 0}, {64, 0}}}},
      {S("dim0"), S("dim1")});
  auto dstLayout = triton::LinearLayout(
      {{S("register"), {{0, 1}}},
       {S("lane"), {{1, 0}, {2, 0}, {4, 0}, {8, 0}, {32, 0}}},
       {S("warp"), {{0, 0}, {16, 0}}},
       {S("block"), {{0, 0}, {64, 0}}}},
      {S("dim0"), S("dim1")});
  EXPECT_TRUE(isCvtDimSync(srcLayout, dstLayout, S("block")));
}
Yeah, this is not exactly correct.
The issue here is a bit subtle and dependent on our implementation:
When we have src[warp] == dst[warp] we can choose to lower it via shfl.sync. The issue is that sometimes this is not optimal and it's better to just go through shared memory, even if we could do it via shfl.sync.
Then, say, that src[warp] == [[64], [0]]. We use isCvtDimSync to figure out whether we need to insert a warp sync or a CTA sync. So the fix here is to remove
srcLayout.getFreeVariableMasks()[dim] == 0 &&
dstLayout.getFreeVariableMasks()[dim] == 0;
from this function, and then use a different one that takes into account free variables to set the correct synchronisation in the convert_layout lowering. Can you send a PR with a lit test?
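A toy sketch of what a free-variable mask means here (a hypothetical helper, not the actual LinearLayout API): an input basis vector that is all zeros contributes nothing to the output, so flipping that hardware index bit never moves any data — it is a "free" (broadcast) bit, and the mask collects those bit positions.

```python
def free_variable_mask(bases: list[list[int]]) -> int:
    """Bit i of the result is set iff the i-th input basis vector is all
    zeros, i.e. flipping that input bit never changes the output."""
    mask = 0
    for i, basis in enumerate(bases):
        if all(b == 0 for b in basis):
            mask |= 1 << i
    return mask


# The "block" bases from the test above: the first basis is {0, 0},
# so block bit 0 is free (a broadcast across blocks).
block_bases = [[0, 0], [64, 0]]
assert free_variable_mask(block_bases) == 0b01

# A dimension with no zero bases has an empty mask.
src_warp_bases = [[16, 0], [32, 0]]
assert free_variable_mask(src_warp_bases) == 0
```

This is why a basis {0, 0} trips the `getFreeVariableMasks()[dim] == 0` check even when no data actually crosses the block level.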
Thank you very much for the insightful comment. I will make a PR this week.
Stacked PRs:
[BACKEND] Support generic multi-cta convert_layouts