
[GPU][Codegen] Support unique per-lane load option when prod(threads) < subgroupsize #23657

Merged: efric merged 2 commits into main from users/efric/splitlaneloads on Mar 26, 2026

Conversation

@efric (Member) commented on Mar 5, 2026:

When prod(thread) < subgroupSize in MMASingleSubgroupLayout, there is implied broadcasting: multiple threads map to the same index in the thread layout and get the same data.

This patch adds an optional physicalLanesPerThread parameter (default 1) to populateCanonicalOffsetsSizesAndStrides, enabling callers to opt into element splitting so that broadcast lanes load disjoint slices rather than duplicate data. Existing callsites are unaffected since they use the default. physicalLanesPerThread is deliberately a caller-provided parameter rather than one derived from MMASingleSubgroupLayout, keeping the layout struct a pure hardware description that does not encode downstream splitting policy.
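As a rough illustration of what the new parameter does, here is a minimal standalone C++ sketch of the splitting arithmetic. The function name, the LaneSlice struct, and the contiguous-lane grouping are illustrative assumptions, not the actual IREE signature:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: with physicalLanesPerThread = P, the innermost element
// dimension of size E is narrowed to E / P per lane, and each lane in a
// broadcast group is offset into a disjoint slice.
struct LaneSlice {
  int64_t offset; // first element this lane loads within the logical tile
  int64_t size;   // number of unique elements this lane loads
};

static LaneSlice splitElementDim(int64_t laneId, int64_t elementSize,
                                 int64_t physicalLanesPerThread) {
  assert(elementSize % physicalLanesPerThread == 0);
  int64_t size = elementSize / physicalLanesPerThread;
  // Assumes broadcast lanes are contiguous (thread stride 1 on the split dim).
  int64_t laneInGroup = laneId % physicalLanesPerThread;
  return {laneInGroup * size, size};
}
```

With elementSize = 16 and physicalLanesPerThread = 2, lane 0 maps to offset 0 and lane 1 to offset 8, each loading 8 unique elements.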

For a concrete example, consider a layout with subgroupSize = 64 and outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}.

Here prod(thread) = 32, so physicalLanesPerThread = 64 / 32 = 2. The thread assignment looks like:

t0  t1     t16 t17    t32 t33    t48 t49
t2  t3     t18 t19    t34 t35    t50 t51
t4  t5     t20 t21    t36 t37    t52 t53
t6  t7     t22 t23    t38 t39    t54 t55
t8  t9     t24 t25    t40 t41    t56 t57
t10 t11    t26 t27    t42 t43    t58 t59
t12 t13    t28 t29    t44 t45    t60 t61
t14 t15    t30 t31    t46 t47    t62 t63

Without physicalLanesPerThread, each pair (e.g., t0 and t1) loads the same 16 elements. With physicalLanesPerThread = 2, each loads 8 unique elements instead.
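To make this concrete, here is a small standalone C++ check (illustrative only; it hard-codes the example layout and assumes contiguous broadcast pairs) that reproduces the table above and the disjoint 8-element slices:

```cpp
#include <cstdio>

int main() {
  // Layout from the example: thread = {8, 4}, tstrides = {2, 16},
  // element = {1, 16}, subgroupSize = 64, physicalLanesPerThread = 2.
  const int split = 16 / 2; // 8 unique elements per physical lane
  for (int lane = 0; lane < 4; ++lane) {
    // Invert the lane numbering from the table: lane = 2*row + 16*col + pair.
    int pair = lane % 2;       // position within the broadcast pair
    int row = (lane / 2) % 8;  // thread dim 0, tstride 2
    int col = lane / 16;       // thread dim 1, tstride 16
    int offset = pair * split; // disjoint 8-element slice per lane
    printf("t%d -> row %d, col %d, elements [%d, %d)\n", lane, row, col,
           offset, offset + split);
  }
  return 0;
}
```

This prints t0 and t1 at the same (row 0, col 0) but with elements [0, 8) and [8, 16) respectively: the pair shares a layout position while splitting the 16 elements it previously duplicated.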

The only existing hardware intrinsics with this broadcast property are the WMMAR3 LHS and RHS operands (thread = {16, 1} or {1, 16}, subgroupSize = 32) and, for the F16/BF16 variants, the accumulator as well. This mechanism is intended for virtual intrinsics such as VDMFMA (#23677), where broadcast lanes will be assigned disjoint K-slices.

Assisted by: Claude

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
@efric force-pushed the users/efric/splitlaneloads branch 2 times, most recently from cd7b094 to ea5cc11, on March 5, 2026
@efric changed the title from "[do not review] split lane loads prototype" to "[GPU][Codegen] Support unique per-lane load option when prod(threads) < subgroupsize" on Mar 8, 2026
@efric marked this pull request as ready for review on March 9, 2026
@efric requested a review from kuhar on March 9, 2026
@Muzammiluddin-Syed-ECE (Contributor) commented:

Oh interesting! So this almost feels like we've added another dimension to the MMASingleSubgroupLayout.

broadcastFactor=2 with

outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}

is almost like

outer = {1, 1, 1}, thread = {8, 4, 2}, tstrides = {2, 16, 1}, element = {1, 8, 1}

in distribution of data to threads.

The only difference is that this broadcastFactor approach requires tstrides[-1] to be 1, i.e. you can only perform element splitting across contiguous lanes. The broadcastFactor framing also probably plays more nicely with existing infra. Cool stuff!
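As a quick sanity check of this equivalence, here is an illustrative standalone C++ enumeration (using tstrides = {2, 16, 1} so the lane numbering matches the table in the PR description):

```cpp
#include <cstdio>

int main() {
  // thread = {8, 4, 2}, tstrides = {2, 16, 1}: walking the 3D thread
  // coordinates reproduces the lane table from the PR description.
  for (int i = 0; i < 8; ++i)       // dim 0, stride 2
    for (int j = 0; j < 4; ++j)     // dim 1, stride 16
      for (int k = 0; k < 2; ++k) { // broadcast dim, stride 1
        int lane = 2 * i + 16 * j + 1 * k;
        // With element = {1, 8, 1}, lane (i, j, k) owns the k-th
        // 8-element half of the 16 elements at (i, j).
        if (j == 0 && i < 2)
          printf("(%d, %d, %d) -> t%d, half %d\n", i, j, k, lane, k);
      }
  return 0;
}
```

The first four lines of output map (0,0,0) to t0 with half 0 and (0,0,1) to t1 with half 1, matching the pairing and the disjoint 8-element slices from the 2D layout plus physicalLanesPerThread = 2.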

@krzysz00 (Contributor) commented:

One other note: I think for RDNA3 the requirement (IIRC, I'll need to check this) is that the broadcast happens across wave halves, so what we might want for the default assignment is that t0 and (say) t32, t1 and t33, and so on (in your subgroupSize-64, 32-thread example) get the same values.

And so the broadcast factor could be named something like threadsPerLogicalThread, saying that (in the VDMFMA case) two threads load the values that are being loaded onto one "thread". Maybe these aren't quite the right semantics, but it's a thought.

I'm partly thinking of gfx1250's layout for MXFP4 here.

@efric (Member, Author) commented on Mar 26, 2026:

> One other note: I think for RDNA3 the requirement (IIRC, I'll need to check this) is that the broadcast happens across wave halves, so what we might want for the default assignment is that t0 and (say) t32, t1 and t33, and so on (in your subgroupSize-64, 32-thread example) get the same values.
>
> And so the broadcast factor could be named something like threadsPerLogicalThread, saying that (in the VDMFMA case) two threads load the values that are being loaded onto one "thread". Maybe these aren't quite the right semantics, but it's a thought.
>
> I'm partly thinking of gfx1250's layout for MXFP4 here.

Good point; I renamed it to physicalLanesPerThread to avoid using "thread" twice.

@krzysz00 (Contributor) left a review comment:

Yeah, sure, this seems like a sensible representation for what you're trying to do, let's go for it

@efric merged commit 9f0b79d into main on Mar 26, 2026, with 63 of 65 checks passed
@efric deleted the users/efric/splitlaneloads branch on March 26, 2026