
[Membar] Membar pass for clusters #9318

Merged
lezcano merged 1 commit into main from lezcano/stack/10
Feb 9, 2026

Conversation

@lezcano
Contributor

@lezcano lezcano commented Jan 28, 2026

Stacked PRs:


[Membar] Membar pass for clusters

The main invariant here is that:

Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass

In other words, in Gluon, the user is in charge of manually
synchronising the buffers they declare.

For now, we always emit a full cluster barrier. We can improve this in
the future by emitting mbarriers that just synchronise subsets of the
CTAs. For that we would need to be a bit more clever, as we would need
to allocate some mbarriers, but the Allocation pass has already run...

We add a number of test cases with comments on which of them are
expected and which can be improved.

lezcano added a commit that referenced this pull request Jan 28, 2026
The main invariant here is that:

Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass

In other words, in Gluon, the user is in charge of manually
synchronising the buffers they declare.

For now, we always emit a full cluster barrier. We can improve this in
the future by emitting `mbarrier`s that just synchronise subsets of the
CTAs. For that we would need to be a bit more clever, as we would need
to allocate some `mbarrier`s, but the Allocation pass has already run...

We add a number of test cases with comments on which of them are
expected and which can be improved.

stack-info: PR: #9318, branch: lezcano/stack/10

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a39baac47d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp Outdated
@lezcano lezcano marked this pull request as draft January 28, 2026 12:15
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 28, 2026 12:15
lezcano added a commit that referenced this pull request Jan 28, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 28, 2026 12:15
@lezcano lezcano marked this pull request as ready for review January 28, 2026 12:15
@lezcano lezcano requested a review from ThomasRaoux January 28, 2026 12:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6387320248


Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp Outdated
Collaborator

@ThomasRaoux ThomasRaoux left a comment


Looks good! Few nits

@@ -0,0 +1,567 @@
// RUN: triton-opt %s -split-input-file --triton-nvidia-tma-lowering --allocate-shared-memory -test-print-membar | FileCheck --dump-input=fail --dump-input-context=30 %s
Collaborator


super nit: having the IR after allocate-shared-memory would make it a bit more clear which buffers overlap

Contributor Author


The examples are self-explanatory and most of them are commented as to what they want to achieve, so I'd say we leave it like this if that's alright?

Collaborator


right, my point is that if allocation changes then we would be testing the wrong thing.

Contributor Author


Will do in a follow-up

Comment thread include/triton/Dialect/TritonNvidiaGPU/Transforms/Passes.td Outdated
Comment thread include/triton/Analysis/Allocation.h Outdated
@lezcano lezcano marked this pull request as draft January 28, 2026 19:18
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 28, 2026 19:18
lezcano added a commit that referenced this pull request Jan 28, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 28, 2026 19:18
@lezcano lezcano marked this pull request as ready for review January 28, 2026 19:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8d7f7fd48


Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp
@lezcano lezcano marked this pull request as draft January 29, 2026 09:46
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 09:46
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 18:56
lezcano added a commit that referenced this pull request Jan 29, 2026
Codex rightly identified that we were not considering the offsets of
functions in our membar analysis at #9318 (comment)

Codex then went on and fixed it and added a regression test.

stack-info: PR: #9327, branch: lezcano/stack/11
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 18:56
@lezcano lezcano marked this pull request as ready for review January 29, 2026 18:56
@lezcano lezcano marked this pull request as draft January 29, 2026 20:53
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 20:53
lezcano added a commit that referenced this pull request Jan 29, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 20:54
@lezcano lezcano marked this pull request as ready for review January 29, 2026 20:54
@lezcano
Contributor Author

lezcano commented Jan 29, 2026

@ThomasRaoux ready for another round of reviews

@lezcano lezcano marked this pull request as draft January 29, 2026 22:52
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 22:52
lezcano added a commit that referenced this pull request Jan 29, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 22:52
@lezcano lezcano marked this pull request as ready for review January 29, 2026 22:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34c1ab65b3


Comment on lines +56 to +59
    auto bufferId = lhsSlice.getBufferId();
    return bufferId != Allocation::InvalidBufferId &&
           bufferId == rhsSlice.getBufferId() &&
           allocation->isExplicitBuffer(bufferId);


P1: Guard slice filter against foreign buffer IDs

The slice filter assumes every bufferId belongs to the current function’s Allocation, but blockInfo->join(curBlockInfo) merges slices from callees too. When a call is followed by another shared‑memory op, those callee slices can become the LHS in isIntersected, and allocation->isExplicitBuffer(bufferId) will hit bufferSet.at(bufferId) for an ID that doesn’t exist in the caller’s allocation (e.g., callee has more buffers than caller), leading to an out‑of‑range abort during barrier insertion. This shows up specifically with call ops plus later aliasing; consider checking membership before isExplicitBuffer or tagging slices with their allocation and skipping the filter for foreign IDs.

Useful? React with 👍 / 👎.

@lezcano lezcano marked this pull request as draft January 29, 2026 23:40
@lezcano lezcano changed the base branch from lezcano/stack/9 to main January 29, 2026 23:40
lezcano added a commit that referenced this pull request Jan 29, 2026
@lezcano lezcano changed the base branch from main to lezcano/stack/9 January 29, 2026 23:40
@lezcano lezcano marked this pull request as ready for review January 29, 2026 23:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e91f49d71


  TestMembar.cpp
  )
- target_link_libraries(TritonTestAnalysis PUBLIC MLIRPass TritonAnalysis)
+ target_link_libraries(TritonTestAnalysis PUBLIC MLIRPass TritonAnalysis TritonNvidiaGPUTransforms)


P2: Link NVIDIA allocation helper into test pass target

TestMembar.cpp now calls triton::nvidia_gpu::getNvidiaAllocationAnalysisScratchSizeFn, which is defined in the TritonNVIDIAGPUToLLVM library, but TritonTestAnalysis only links TritonAnalysis and TritonNvidiaGPUTransforms. In static builds (or when --no-undefined is enforced), this introduces an unresolved symbol at link time or a missing symbol at runtime. Please add TritonNVIDIAGPUToLLVM (or another library that exports the helper) to the TritonTestAnalysis link list.

Useful? React with 👍 / 👎.

Collaborator

@ThomasRaoux ThomasRaoux left a comment


LGTM


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a36d342636


Comment thread lib/Dialect/TritonNvidiaGPU/Transforms/ClusterBarrierInsertion.cpp