
[GPU][Codegen] Support unique per-lane load option when prod(threads) < subgroupsize #23657

Merged: efric merged 2 commits into main from users/efric/splitlaneloads on Mar 26, 2026

Conversation

@efric (Member) commented on Mar 5, 2026:

When prod(thread) < subgroupSize in MMASingleSubgroupLayout, there is implied broadcasting: multiple threads map to the same index in the thread layout and get the same data.

This patch adds an optional physicalLanesPerThread parameter (default 1) to populateCanonicalOffsetsSizesAndStrides, enabling callers to opt into element splitting so that broadcast lanes load disjoint slices rather than duplicate data. Existing callsites are unaffected since they use the default. physicalLanesPerThread is deliberately a caller-provided parameter rather than one derived from MMASingleSubgroupLayout, keeping the layout struct a pure hardware description that does not encode downstream splitting policy.
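As a rough illustration of what the new parameter does, here is a minimal standalone C++ sketch of the splitting arithmetic. The function name, the LaneSlice struct, and the contiguous-lane grouping are illustrative assumptions, not the actual IREE signature:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: with physicalLanesPerThread = P, the innermost element
// dimension of size E is narrowed to E / P per lane, and each lane in a
// broadcast group is offset into a disjoint slice.
struct LaneSlice {
  int64_t offset; // first element this lane loads within the logical tile
  int64_t size;   // number of unique elements this lane loads
};

static LaneSlice splitElementDim(int64_t laneId, int64_t elementSize,
                                 int64_t physicalLanesPerThread) {
  assert(elementSize % physicalLanesPerThread == 0);
  int64_t size = elementSize / physicalLanesPerThread;
  // Assumes broadcast lanes are contiguous (thread stride 1 on the split dim).
  int64_t laneInGroup = laneId % physicalLanesPerThread;
  return {laneInGroup * size, size};
}
```

With elementSize = 16 and physicalLanesPerThread = 2, lane 0 maps to offset 0 and lane 1 to offset 8, each loading 8 unique elements.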

For a concrete example, consider a layout with subgroupSize = 64 and outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}.

Here prod(thread) = 32, so physicalLanesPerThread = 64 / 32 = 2. The thread assignment looks like:

t0  t1     t16 t17    t32 t33    t48 t49
t2  t3     t18 t19    t34 t35    t50 t51
t4  t5     t20 t21    t36 t37    t52 t53
t6  t7     t22 t23    t38 t39    t54 t55
t8  t9     t24 t25    t40 t41    t56 t57
t10 t11    t26 t27    t42 t43    t58 t59
t12 t13    t28 t29    t44 t45    t60 t61
t14 t15    t30 t31    t46 t47    t62 t63

Without physicalLanesPerThread, each pair (e.g., t0 and t1) loads the same 16 elements. With physicalLanesPerThread = 2, each loads 8 unique elements instead.
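To make this concrete, here is a small standalone C++ check (illustrative only; it hard-codes the example layout and assumes contiguous broadcast pairs) that reproduces the table above and the disjoint 8-element slices:

```cpp
#include <cstdio>

int main() {
  // Layout from the example: thread = {8, 4}, tstrides = {2, 16},
  // element = {1, 16}, subgroupSize = 64, physicalLanesPerThread = 2.
  const int split = 16 / 2; // 8 unique elements per physical lane
  for (int lane = 0; lane < 4; ++lane) {
    // Invert the lane numbering from the table: lane = 2*row + 16*col + pair.
    int pair = lane % 2;       // position within the broadcast pair
    int row = (lane / 2) % 8;  // thread dim 0, tstride 2
    int col = lane / 16;       // thread dim 1, tstride 16
    int offset = pair * split; // disjoint 8-element slice per lane
    printf("t%d -> row %d, col %d, elements [%d, %d)\n", lane, row, col,
           offset, offset + split);
  }
  return 0;
}
```

This prints t0 and t1 at the same (row 0, col 0) but with elements [0, 8) and [8, 16) respectively: the pair shares a layout position while splitting the 16 elements it previously duplicated.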

The only existing hardware intrinsics with this broadcast property are the WMMAR3 LHS and RHS operands (thread = {16, 1} or {1, 16}, subgroupSize = 32) and, for the F16/BF16 variants, the accumulator as well. This mechanism is intended for virtual intrinsics such as VDMFMA (#23677), where broadcast lanes will be assigned disjoint K-slices.

Assisted by: Claude

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
@efric force-pushed the users/efric/splitlaneloads branch 2 times, most recently from cd7b094 to ea5cc11, on March 5, 2026
@efric changed the title from "[do not review] split lane loads prototype" to "[GPU][Codegen] Support unique per-lane load option when prod(threads) < subgroupsize" on Mar 8, 2026
@efric marked this pull request as ready for review on March 9, 2026
@efric requested a review from kuhar on March 9, 2026
@Muzammiluddin-Syed-ECE (Contributor) commented:

Oh interesting! So this almost feels like we've added another dimension to the MMASingleSubgroupLayout.

broadcastFactor=2 with

outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}

is almost like

outer = {1, 1, 1}, thread = {8, 4, 2}, tstrides = {2, 16, 1}, element = {1, 8, 1}

in distribution of data to threads.

The only difference is that this broadcastFactor approach requires tstrides[-1] to be 1, i.e. you can only perform element splitting across contiguous lanes. The broadcastFactor framing also probably plays more nicely with existing infra. Cool stuff!
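As a quick sanity check of this equivalence, here is an illustrative standalone C++ enumeration (using tstrides = {2, 16, 1} so the lane numbering matches the table in the PR description):

```cpp
#include <cstdio>

int main() {
  // thread = {8, 4, 2}, tstrides = {2, 16, 1}: walking the 3D thread
  // coordinates reproduces the lane table from the PR description.
  for (int i = 0; i < 8; ++i)       // dim 0, stride 2
    for (int j = 0; j < 4; ++j)     // dim 1, stride 16
      for (int k = 0; k < 2; ++k) { // broadcast dim, stride 1
        int lane = 2 * i + 16 * j + 1 * k;
        // With element = {1, 8, 1}, lane (i, j, k) owns the k-th
        // 8-element half of the 16 elements at (i, j).
        if (j == 0 && i < 2)
          printf("(%d, %d, %d) -> t%d, half %d\n", i, j, k, lane, k);
      }
  return 0;
}
```

The first four lines of output map (0,0,0) to t0 with half 0 and (0,0,1) to t1 with half 1, matching the pairing and the disjoint 8-element slices from the 2D layout plus physicalLanesPerThread = 2.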

@krzysz00 (Contributor) commented:

One other note: I think for RDNA3 the requirement (IIRC, I'll need to check this) is that the broadcast happens across wave halves, so what we might want for the default assignment is that t0 and (say) t32, t1 and t33, and so on (in your subgroupSize-64, 32-thread example) get the same values.

And so the broadcast factor could be named something like threadsPerLogicalThread, saying that (in the VDMFMA case) two threads load the values that are being loaded onto one "thread". Maybe these aren't quite the right semantics, but it's a thought.

I'm partly thinking of gfx1250's layout for MXFP4 here.

@efric (Member, Author) commented on Mar 26, 2026:

> One other note: I think for RDNA3 the requirement (IIRC, I'll need to check this) is that the broadcast happens across wave halves, so what we might want for the default assignment is that t0 and (say) t32, t1 and t33, and so on (in your subgroupSize-64, 32-thread example) get the same values.
>
> And so the broadcast factor could be named something like threadsPerLogicalThread, saying that (in the VDMFMA case) two threads load the values that are being loaded onto one "thread". Maybe these aren't quite the right semantics, but it's a thought.
>
> I'm partly thinking of gfx1250's layout for MXFP4 here.

Good point; I renamed it to physicalLanesPerThread to avoid using "thread" twice.

@krzysz00 (Contributor) left a review comment:

Yeah, sure, this seems like a sensible representation for what you're trying to do, let's go for it

@efric merged commit 9f0b79d into main on Mar 26, 2026, with 63 of 65 checks passed
@efric deleted the users/efric/splitlaneloads branch on March 26, 2026