[GPU][Codegen] Support unique per-lane load option when prod(threads) < subgroupsize#23657
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Eric Feng <Eric.Feng@amd.com>
Oh interesting! So this almost feels like we've added another dimension to the MMASingleSubgroupLayout broadcastFactor=2 with is almost like in distribution of data to threads. The only difference is that this broadcastFactor requires tstrides[-1] to be 1. Like you can only perform element splitting across contiguous lanes. Also the broadcastFactor way probably plays more nicely with existing infra. Cool stuff
One other note - I think for RDNA3, the requirement (IIRC, I'll need to check this) is that the broadcast happens to wave halves - so what we might want to have for the default assignment is that t0 and (say) t32, t1 and t33, and so on (in your subgroup 64 and 32 threads example) get the same values. And so the broadcast factor could be named something like I'm partly thinking of gfx1250's layout for MXFP4 here.
Signed-off-by: Eric Feng <Eric.Feng@amd.com>
good point; i renamed it to
krzysz00 left a comment
Yeah, sure, this seems like a sensible representation for what you're trying to do, let's go for it
When `prod(thread) < subgroupSize` in `MMASingleSubgroupLayout`, there is implied broadcasting: multiple threads map to the same index in the thread layout and get the same data.

This patch adds an optional `physicalLanesPerThread` parameter (default 1) to `populateCanonicalOffsetsSizesAndStrides`, enabling callers to opt into element splitting so that broadcast lanes load disjoint slices rather than duplicate data. Existing callsites are unaffected since they use the default.

`physicalLanesPerThread` is deliberately a caller-provided parameter rather than derived from `MMASingleSubgroupLayout`, keeping the layout struct a pure hardware description without encoding downstream splitting policy.

For a concrete example, consider a layout with `subgroupSize = 64` and `outer = {1, 1}, thread = {8, 4}, tstrides = {2, 16}, element = {1, 16}`.

Here `prod(thread) = 32`, so `physicalLanesPerThread = 2`, and each index in the thread layout is shared by two lanes.

Without `physicalLanesPerThread`, each pair (e.g., t0 and t1) loads the same 16 elements. With `physicalLanesPerThread = 2`, each loads 8 unique elements instead.

The only existing hardware intrinsics that have this broadcast property naturally are the WMMAR3 LHS and RHS operands (`thread = {16, 1}` or `{1, 16}`, `subgroupSize = 32`) and, for the F16/BF16 variants, the accumulator as well. This mechanism is intended for virtual intrinsics such as VDMFMA (#23677), where broadcast lanes will be assigned disjoint K-slices.

Assisted by: Claude
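For readers who want to check the arithmetic in the example, the following is a rough Python model of it, not the code in this patch. It assumes each lane's thread-layout index is obtained per dimension as `(lane // tstride) % thread`, that the lanes sharing an index are distinguished by `lane % lanesPerThread`, and that the split is taken along the innermost `element` dimension. All helper names (`thread_index`, `lanes_per_thread`, `inner_element_range`) are hypothetical and exist only for illustration.

```python
from math import prod

# Layout from the example in the description.
SUBGROUP_SIZE = 64
THREAD = (8, 4)
TSTRIDES = (2, 16)
ELEMENT = (1, 16)


def thread_index(lane):
    """Thread-layout index of a physical lane (assumed mapping)."""
    return tuple((lane // stride) % count
                 for count, stride in zip(THREAD, TSTRIDES))


def lanes_per_thread():
    """Number of physical lanes that share one thread-layout index."""
    return SUBGROUP_SIZE // prod(THREAD)


def inner_element_range(lane, split):
    """Half-open range of innermost elements that a lane loads.

    With split=False every lane loads the full element tile (broadcast).
    With split=True the tile is divided evenly among the lanes that share
    an index; picking the slice by `lane % lanes_per_thread()` is an
    assumption that fits this example, where the smallest tstride is 2.
    """
    inner = ELEMENT[-1]
    if not split:
        return (0, inner)
    n = lanes_per_thread()
    chunk = inner // n
    sub_lane = lane % n
    return (sub_lane * chunk, (sub_lane + 1) * chunk)


if __name__ == "__main__":
    for lane in (0, 1, 2, 16):
        print(lane, thread_index(lane),
              inner_element_range(lane, split=False),
              inner_element_range(lane, split=True))
```

Under these assumptions, lanes 0 and 1 share one index and, with splitting enabled, load the first and second halves of the 16-element tile. A different pairing, such as the wave-half assignment raised in the review discussion, would only change how `thread_index` and the sub-lane are computed.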