Conversation
The main invariant here is that: Membar for CTAs only synchronises CTAs when their buffers did not alias in the ttgir, but they alias after the Allocation pass In other words, in Gluon, the user is in charge of manually synchronising the bufferes they declare. For now, we always emit a full cluster barrier. We can improve this in the future by emitting `mbarrier`s that just synchronise subsets of the CTAs. For that we would need to be a bit more clever, as we would need to allocate some `mbarrier`s but the Allocation pass has already run... We add a number of test cases with comments of which of them are expected and which can be improved. stack-info: PR: #9318, branch: lezcano/stack/10
9a21e07 to
68f76bf
Compare
c56e6a2 to
a39baac
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a39baac47d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The main invariant here is that: Membar for CTAs only synchronises CTAs when their buffers did not alias in the ttgir, but they alias after the Allocation pass In other words, in Gluon, the user is in charge of manually synchronising the bufferes they declare. For now, we always emit a full cluster barrier. We can improve this in the future by emitting `mbarrier`s that just synchronise subsets of the CTAs. For that we would need to be a bit more clever, as we would need to allocate some `mbarrier`s but the Allocation pass has already run... We add a number of test cases with comments of which of them are expected and which can be improved. stack-info: PR: #9318, branch: lezcano/stack/10
a39baac to
6387320
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6387320248
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
ThomasRaoux
left a comment
There was a problem hiding this comment.
Looks good! Few nits
| @@ -0,0 +1,567 @@ | |||
| // RUN: triton-opt %s -split-input-file --triton-nvidia-tma-lowering --allocate-shared-memory -test-print-membar | FileCheck --dump-input=fail --dump-input-context=30 %s | |||
There was a problem hiding this comment.
super nit: having the IR after allocate-shared-memory would make it a bit more clear which buffers overlap
There was a problem hiding this comment.
The examples are self-explanatory and most of them are commented as to what they want to achieve, so I'd say we leave it like this if that's alright?
There was a problem hiding this comment.
right, my point is that if allocation changes then we would be testing the wrong thing.
There was a problem hiding this comment.
Will do in a follow-up
The main invariant here is that: Membar for CTAs only synchronises CTAs when their buffers did not alias in the ttgir, but they alias after the Allocation pass In other words, in Gluon, the user is in charge of manually synchronising the bufferes they declare. For now, we always emit a full cluster barrier. We can improve this in the future by emitting `mbarrier`s that just synchronise subsets of the CTAs. For that we would need to be a bit more clever, as we would need to allocate some `mbarrier`s but the Allocation pass has already run... We add a number of test cases with comments of which of them are expected and which can be improved. stack-info: PR: #9318, branch: lezcano/stack/10
6387320 to
e8d7f7f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8d7f7fd48
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex rightly identified that we were not considering the offsets of functions in our membar analysis at #9318 (comment) Codex then went on and fixed it and added a regression test. stack-info: PR: #9327, branch: lezcano/stack/11
c21fb8e to
d9f9b90
Compare
Codex rightly identified that we were not considering the offsets of functions in our membar analysis at #9318 (comment) Codex then went on and fixed it and added a regression test. stack-info: PR: #9327, branch: lezcano/stack/11
d9f9b90 to
b16bd70
Compare
|
@ThomasRaoux ready for another round of reviews |
Codex rightly identified that we were not considering the offsets of functions in our membar analysis at #9318 (comment) Codex then went on and fixed it and added a regression test. stack-info: PR: #9327, branch: lezcano/stack/11
b16bd70 to
34c1ab6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34c1ab65b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| auto bufferId = lhsSlice.getBufferId(); | ||
| return bufferId != Allocation::InvalidBufferId && | ||
| bufferId == rhsSlice.getBufferId() && | ||
| allocation->isExplicitBuffer(bufferId); |
There was a problem hiding this comment.
Guard slice filter against foreign buffer IDs
The slice filter assumes every bufferId belongs to the current function’s Allocation, but blockInfo->join(curBlockInfo) merges slices from callees too. When a call is followed by another shared‑memory op, those callee slices can become the LHS in isIntersected, and allocation->isExplicitBuffer(bufferId) will hit bufferSet.at(bufferId) for an ID that doesn’t exist in the caller’s allocation (e.g., callee has more buffers than caller), leading to an out‑of‑range abort during barrier insertion. This shows up specifically with call ops plus later aliasing; consider checking membership before isExplicitBuffer or tagging slices with their allocation and skipping the filter for foreign IDs.
Useful? React with 👍 / 👎.
Codex rightly identified that we were not considering the offsets of functions in our membar analysis at #9318 (comment) Codex then went on and fixed it and added a regression test. stack-info: PR: #9327, branch: lezcano/stack/11
34c1ab6 to
1c239d1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e91f49d71
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| TestMembar.cpp | ||
| ) | ||
| target_link_libraries(TritonTestAnalysis PUBLIC MLIRPass TritonAnalysis) | ||
| target_link_libraries(TritonTestAnalysis PUBLIC MLIRPass TritonAnalysis TritonNvidiaGPUTransforms) |
There was a problem hiding this comment.
Link NVIDIA allocation helper into test pass target
TestMembar.cpp now calls triton::nvidia_gpu::getNvidiaAllocationAnalysisScratchSizeFn, which is defined in the TritonNVIDIAGPUToLLVM library, but TritonTestAnalysis only links TritonAnalysis and TritonNvidiaGPUTransforms. In static builds (or when --no-undefined is enforced), this introduces an unresolved symbol at link time or a missing symbol at runtime. Please add TritonNVIDIAGPUToLLVM (or another library that exports the helper) to the TritonTestAnalysis link list.
Useful? React with 👍 / 👎.
| @@ -0,0 +1,567 @@ | |||
| // RUN: triton-opt %s -split-input-file --triton-nvidia-tma-lowering --allocate-shared-memory -test-print-membar | FileCheck --dump-input=fail --dump-input-context=30 %s | |||
There was a problem hiding this comment.
right, my point is that if allocation changes then we would be testing the wrong thing.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a36d342636
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The main invariant here is that: Membar for CTAs only synchronises CTAs when their buffers did not alias in the ttgir, but they alias after the Allocation pass In other words, in Gluon, the user is in charge of manually synchronising the bufferes they declare. For now, we always emit a full cluster barrier. We can improve this in the future by emitting `mbarrier`s that just synchronise subsets of the CTAs. For that we would need to be a bit more clever, as we would need to allocate some `mbarrier`s but the Allocation pass has already run... We add a number of test cases with comments of which of them are expected and which can be improved. stack-info: PR: #9318, branch: lezcano/stack/10
Stacked PRs:
[Membar] Membar pass for clusters
The main invariant here is that:
Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass
In other words, in Gluon, the user is in charge of manually
synchronising the bufferes they declare.
For now, we always emit a full cluster barrier. We can improve this in
the future by emitting
mbarriers that just synchronise subsets of theCTAs. For that we would need to be a bit more clever, as we would need
to allocate some
mbarriers but the Allocation pass has already run...We add a number of test cases with comments of which of them are
expected and which can be improved.