
[Membar] Membar pass for clusters #9318

Merged
lezcano merged 1 commit into main from lezcano/stack/10
Feb 9, 2026

Conversation

@lezcano
Contributor

@lezcano lezcano commented Jan 28, 2026

Stacked PRs:


[Membar] Membar pass for clusters

The main invariant here is that:

Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass

In other words, in Gluon, the user is in charge of manually
synchronising the buffers they declare.

For now, we always emit a full cluster barrier. We can improve this in
the future by emitting mbarriers that just synchronise subsets of the
CTAs. For that we would need to be a bit more clever, as we would need
to allocate some mbarriers, but the Allocation pass has already run...

We add a number of test cases with comments on which of them are
expected and which can be improved.

lezcano added a commit that referenced this pull request Jan 28, 2026
The main invariant here is that:

Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass

In other words, in Gluon, the user is in charge of manually
synchronising the buffers they declare.

For now, we always emit a full cluster barrier. We can improve this in
the future by emitting `mbarrier`s that just synchronise subsets of the
CTAs. For that we would need to be a bit more clever, as we would need
to allocate some `mbarrier`s, but the Allocation pass has already run...

We add a number of test cases with comments on which of them are
expected and which can be improved.

stack-info: PR: #9318, branch: lezcano/stack/10

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a39baac47d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp Outdated
@lezcano lezcano marked this pull request as draft January 28, 2026 12:15
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 28, 2026 12:15
lezcano added a commit that referenced this pull request Jan 28, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 28, 2026 12:15
@lezcano lezcano marked this pull request as ready for review January 28, 2026 12:15
@lezcano lezcano requested a review from ThomasRaoux January 28, 2026 12:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6387320248


Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp Outdated
Collaborator

@ThomasRaoux ThomasRaoux left a comment


Looks good! Few nits

@@ -0,0 +1,567 @@
// RUN: triton-opt %s -split-input-file --triton-nvidia-tma-lowering --allocate-shared-memory -test-print-membar | FileCheck --dump-input=fail --dump-input-context=30 %s
Collaborator


super nit: having the IR after allocate-shared-memory would make it a bit more clear which buffers overlap

Contributor Author


The examples are self-explanatory and most of them are commented as to what they want to achieve, so I'd say we leave it like this if that's alright?

Collaborator


right, my point is that if allocation changes then we would be testing the wrong thing.

Contributor Author


Will do in a follow-up

Comment thread include/triton/Dialect/TritonNvidiaGPU/Transforms/Passes.td Outdated
Comment thread include/triton/Analysis/Allocation.h Outdated
@lezcano lezcano marked this pull request as draft January 28, 2026 19:18
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 28, 2026 19:18
lezcano added a commit that referenced this pull request Jan 28, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 28, 2026 19:18
@lezcano lezcano marked this pull request as ready for review January 28, 2026 19:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8d7f7fd48


Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp
@lezcano lezcano marked this pull request as draft January 29, 2026 09:46
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 09:46
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 18:56
lezcano added a commit that referenced this pull request Jan 29, 2026
Codex rightly identified that we were not considering the offsets of
functions in our membar analysis at #9318 (comment)

Codex then went on and fixed it and added a regression test.

stack-info: PR: #9327, branch: lezcano/stack/11
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 18:56
@lezcano lezcano marked this pull request as ready for review January 29, 2026 18:56
@lezcano lezcano marked this pull request as draft January 29, 2026 20:53
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 20:53
lezcano added a commit that referenced this pull request Jan 29, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 20:54
@lezcano lezcano marked this pull request as ready for review January 29, 2026 20:54
@lezcano
Contributor Author

lezcano commented Jan 29, 2026

@ThomasRaoux ready for another round of reviews

@lezcano lezcano marked this pull request as draft January 29, 2026 22:52
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 22:52
lezcano added a commit that referenced this pull request Jan 29, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 22:52
@lezcano lezcano marked this pull request as ready for review January 29, 2026 22:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34c1ab65b3


Comment on lines +56 to +59
    auto bufferId = lhsSlice.getBufferId();
    return bufferId != Allocation::InvalidBufferId &&
           bufferId == rhsSlice.getBufferId() &&
           allocation->isExplicitBuffer(bufferId);


P1: Guard slice filter against foreign buffer IDs

The slice filter assumes every bufferId belongs to the current function’s Allocation, but blockInfo->join(curBlockInfo) merges slices from callees too. When a call is followed by another shared‑memory op, those callee slices can become the LHS in isIntersected, and allocation->isExplicitBuffer(bufferId) will hit bufferSet.at(bufferId) for an ID that doesn’t exist in the caller’s allocation (e.g., callee has more buffers than caller), leading to an out‑of‑range abort during barrier insertion. This shows up specifically with call ops plus later aliasing; consider checking membership before isExplicitBuffer or tagging slices with their allocation and skipping the filter for foreign IDs.

Useful? React with 👍 / 👎.

@lezcano lezcano marked this pull request as draft January 29, 2026 23:40
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 23:40
lezcano added a commit that referenced this pull request Jan 29, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 23:40
@lezcano lezcano marked this pull request as ready for review January 29, 2026 23:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e91f49d71


  TestMembar.cpp
  )
- target_link_libraries(TritonTestAnalysis PUBLIC MLIRPass TritonAnalysis)
+ target_link_libraries(TritonTestAnalysis PUBLIC MLIRPass TritonAnalysis TritonNvidiaGPUTransforms)


P2: Link NVIDIA allocation helper into test pass target

TestMembar.cpp now calls triton::nvidia_gpu::getNvidiaAllocationAnalysisScratchSizeFn, which is defined in the TritonNVIDIAGPUToLLVM library, but TritonTestAnalysis only links TritonAnalysis and TritonNvidiaGPUTransforms. In static builds (or when --no-undefined is enforced), this introduces an unresolved symbol at link time or a missing symbol at runtime. Please add TritonNVIDIAGPUToLLVM (or another library that exports the helper) to the TritonTestAnalysis link list.

Useful? React with 👍 / 👎.

Collaborator

@ThomasRaoux ThomasRaoux left a comment


LGTM


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a36d342636


Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp