[BACKEND] Support generic multi-cta convert_layouts#9317
Conversation
Force-pushed 9a21e07 to 68f76bf
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68f76bf7fe
Force-pushed 68f76bf to 977774b
💡 Codex Review — Reviewed commit: 977774ba5b
  llvm::SetVector<uint32_t> tileBases;
- for (auto bases : llvm::make_second_range(addrLayout.getBases())) {
+ auto addrNamedBases = addrLayout.flattenOuts().getBases();
+ for (auto bases : llvm::make_second_range(addrNamedBases)) {
What does flattening the out dim do here? nAdditive could be different for the offset and the block, so does this return the smaller one somehow?
This treats blocks as an extension of the offset, which makes sense here. We treat the block as a concatenation of the memory address.
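To illustrate the "block as a concatenation of the memory address" idea, here is a hypothetical sketch (not the actual Triton code; `OFFSET_BITS`, `flat_address`, and `split_address` are made-up names): the block id is treated as extra high bits placed above the per-block shared-memory offset, giving one flat address space.

```python
# Hypothetical sketch: the block id becomes the high bits above the
# shared-memory offset, so (block, offset) is one flat address.
OFFSET_BITS = 16  # assumed width of the per-block shmem offset


def flat_address(block: int, offset: int) -> int:
    """Concatenate the block id above the offset bits."""
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (block << OFFSET_BITS) | offset


def split_address(addr: int) -> tuple[int, int]:
    """Inverse: recover (block, offset) from the flat address."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)
```

Under this view, flattening the out dims just means reasoning about the concatenated bits as a single output.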
I don't see how that answers the question? The block isn't calculated as an extension of the offset, so why should nAdditive be the same for both?
It allows you to do this:

auto idxAndBlock =
    reps.apply({{kReg, i}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}});
auto regIdxI8 = idxAndBlock[0].second * (bitwidth / 8);
Value offset = b.xor_(regBaseI8, b.i32_val(regIdxI8));
Value ctaOffset = b.i32_val(0);
if (useBlockId) {
  ctaOffset = b.xor_(targetCtaId, b.i32_val(idxAndBlock[1].second));
}
offset = applyPadding(loc, rewriter, offset, paddingShifts);
for (int j = 0; j < nAdditive; j += elemsPerVec) {
  // all these constants will go as immediate values to LDS/STS
  auto idxAndBlockAdd =
      reps.apply({{kReg, j}, {kLane, 0}, {kWarp, 0}, {kBlock, 0}});
  auto regIdxAddI8 = idxAndBlockAdd[0].second * (bitwidth / 8);
  // `actionAdditiveStrides` forces `regIdxAddI8` and `offset` to be bitwise
  // disjoint, so we can calculate their padding contributions separately.
  regIdxAddI8 = applyPadding(regIdxAddI8, paddingShifts);
  Value innerOffset = b.add(offset, b.i32_val(regIdxAddI8));
  std::optional<Value> innerCtaOffset;
  if (useBlockId) {
    innerCtaOffset = b.add(ctaOffset, b.i32_val(idxAndBlockAdd[1].second));
  }
}

We just lift that into an add. I think it may be an optimisation that is not super relevant for blocks, and we could very well write b.xor_ in the inner loop, but it certainly won't hurt?
Yes, I understand that. I specifically asked: why do you flatten the out dims? As your code snippet demonstrates, we effectively have two different sublayouts that compute two different outputs, each of which may have a different nAdditive value. So what effect does flattening the outputs have?
The addition algorithm says:
We allow to change this xor into an addition if the output bits of this given register are disjoint to those from lanes/warps/blocks.
The output bits here are defined, by that flattening, as the union of the shmem address + the block address.
You could then think that this is not optimal, as you could be stopping a register from being converted into an addition because a part of its block is touched by another register, but this is not even the case, as there is no swizzling at a block level, so flattening here is optimal for the part of the optimisation that we care about, which is the offsets.
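The bit-arithmetic fact behind that xor-to-add rewrite can be sketched independently of the Triton code: when two values have disjoint set bits, there are no carries, so xor, or, and add all coincide — which is what lets the lowering emit the register index as an immediate add.

```python
def bits_disjoint(a: int, b: int) -> bool:
    """True iff no bit position is set in both a and b."""
    return a & b == 0


# Disjoint bits: no carry is possible, so xor == or == add.
a, b = 0b1010_0000, 0b0000_0101
assert bits_disjoint(a, b)
assert a ^ b == a | b == a + b

# Overlapping bits: carries appear and the equivalence breaks.
c, d = 0b1100, 0b0100
assert c ^ d != c + d
```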
Okay, so IIUC it's the minimum of the nAdditive of the two, and you expect the block offsets to never be the minimum of the two, so it should have no effect in normal cases. Fair enough.
FYI the quote "We allow to change this xor into an addition..." is IMO a much better explanation than the comments currently in the code. You should consider adding it.
I will add the comment but not replace it. The issue is that the comment at the top is a bit more general, as it also includes the ideas needed to support this in the ldmatrix/stmatrix lowerings.
Force-pushed 977774b to 988c0b3
Force-pushed 988c0b3 to de660ef
💡 Codex Review — Reviewed commit: de660ef4bf
💡 Codex Review — Reviewed commit: 714562d9d6
Force-pushed 714562d to 1c818e8
Force-pushed 1c818e8 to cb2fc7a
Force-pushed cb2fc7a to 57a329d
💡 Codex Review — Reviewed commit: 57a329daf4
We generalise the swizzling algorithm to work with blocks and generalise most of the memory lowerings to support layouts with blocks. We remove the legacy lowering.

The generic swizzling algorithm for blocks might be fine, but we didn't try to be super clever; there might be some perf left on the table. We can look into this at a later point if it becomes relevant.

We also activate multi-cta reductions in the process and test both there.

TODO: Add some funky tests that just test the `convert_layout`, not the `convert_layout` within the reduction.
TODO: Check how to perform multiCTA barriers in AMD and perhaps merge cluster barriers into ttg.barrier, predicate broadcasting blocks, etc.

stack-info: PR: #9317, branch: lezcano/stack/9
auto kReg = str_attr("register");
auto kLane = str_attr("lane");
nit: why all the changes to auto?
just for consistency. I changed them when I was addressing #9317 (comment)
Force-pushed 57a329d to a8d2f7c
peterbell10
left a comment
There are still unaddressed comments
auto comp = dstLayout.invertAndCompose(srcLayout);
return parentTrivial && comp.isTrivialOver(dim) &&
       srcLayout.getFreeVariableMasks()[dim] == 0 &&
       dstLayout.getFreeVariableMasks()[dim] == 0;
@lezcano how come we need to check the free variables here?
I came across a problem arising from multi-casting of matmul, where a cluster-level sync is inserted even though there is no data movement across blocks at all.
If the source and dest "block" bases are identical, but the layout has a basis {0, 0} in it, then this function will return false for both isCvtDimSync(block) and isCvtDimSync(warp). However, no data moves across the block level at all. In what situations can we ignore the free variables?
To make the question concrete, the following is the case I came across; should we expect the function to return true or false?
Thanks a lot in advance!
TEST(Analysis, isCvtDimSync) {
  MLIRContext ctx;
  auto S = [&](StringRef str) { return StringAttr::get(&ctx, str); };
  auto srcLayout = triton::LinearLayout(
      {{S("register"), {}},
       {S("lane"), {{0, 1}, {1, 0}, {2, 0}, {4, 0}, {8, 0}}},
       {S("warp"), {{16, 0}, {32, 0}}},
       {S("block"), {{0, 0}, {64, 0}}}},
      {S("dim0"), S("dim1")});
  auto dstLayout = triton::LinearLayout(
      {{S("register"), {{0, 1}}},
       {S("lane"), {{1, 0}, {2, 0}, {4, 0}, {8, 0}, {32, 0}}},
       {S("warp"), {{0, 0}, {16, 0}}},
       {S("block"), {{0, 0}, {64, 0}}}},
      {S("dim0"), S("dim1")});
  EXPECT_TRUE(isCvtDimSync(srcLayout, dstLayout, S("block")));
}
Yeah, this is not exactly correct.
The issue here is a bit subtle and dependent on our implementation:
When we have src[warp] == dst[warp] we can choose to lower it via shfl.sync. The issue is that sometimes this is not optimal and it's better to just go through shared memory, even if we could do it via shfl.sync.
Then, say, that src[warp] == [[64], [0]]. We use isCvtDimSync to figure out whether we need to insert a warp sync or a CTA sync. So the fix here is to remove
srcLayout.getFreeVariableMasks()[dim] == 0 &&
dstLayout.getFreeVariableMasks()[dim] == 0;
from this function, and then use a different one that takes into account free variables to set the correct synchronisation in the convert_layout lowering. Can you send a PR with a lit test?
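A toy sketch of what a free-variable mask means here (a hypothetical helper, not the actual LinearLayout API): an input basis vector that is all zeros contributes nothing to the output, so flipping that hardware index bit never moves any data — it is a "free" (broadcast) bit, and the mask collects those bit positions.

```python
def free_variable_mask(bases: list[list[int]]) -> int:
    """Bit i of the result is set iff the i-th input basis vector is all
    zeros, i.e. flipping that input bit never changes the output."""
    mask = 0
    for i, basis in enumerate(bases):
        if all(b == 0 for b in basis):
            mask |= 1 << i
    return mask


# The "block" bases from the test above: the first basis is {0, 0},
# so block bit 0 is free (a broadcast across blocks).
block_bases = [[0, 0], [64, 0]]
assert free_variable_mask(block_bases) == 0b01

# A dimension with no zero bases has an empty mask.
src_warp_bases = [[16, 0], [32, 0]]
assert free_variable_mask(src_warp_bases) == 0
```

This is why a basis {0, 0} trips the `getFreeVariableMasks()[dim] == 0` check even when no data actually crosses the block level.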
Thank you very much for the insightful comment. I will make a PR this week.
Stacked PRs:
[BACKEND] Support generic multi-cta convert_layouts