[FPSAN] Load TCGen shared operands directly by jeffniu-openai · Pull Request #10473 · triton-lang/triton

jeffniu-openai · 2026-06-04T06:54:55Z

PR description written by Codex

Summary

Load shared-memory TCGen MMA operands directly instead of round-tripping through global scratch.

Stack

Merge bottom-up:

Centralize ordinary dot K verification in DotOpInterface, share MMAv2 warp distribution between matmul acceleration and FPSan, and simplify FPSan tile selection using TTGIR's power-of-two shape invariant. Preserve bounded emulation tiles and existing i8 decomposition eligibility.
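To illustrate why a power-of-two shape invariant can simplify tile selection, here is a minimal Python sketch. The names `select_tile`, `num_tiles`, and `MAX_TILE` are hypothetical and do not come from the PR. The sketch only shows that when both the dimension and the bound are powers of two, the clamped tile always divides the dimension evenly, so no remainder handling is needed.

```python
MAX_TILE = 32  # hypothetical upper bound on an emulation tile (assumption)


def is_pow2(n: int) -> bool:
    """Return True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def select_tile(dim: int, max_tile: int = MAX_TILE) -> int:
    """Clamp the tile to the bound; relies on both values being powers of two."""
    assert is_pow2(dim) and is_pow2(max_tile)
    return min(dim, max_tile)


def num_tiles(dim: int, max_tile: int = MAX_TILE) -> int:
    """Number of tiles along one dimension; the division is always exact."""
    tile = select_tile(dim, max_tile)
    assert dim % tile == 0  # guaranteed by the power-of-two invariant
    return dim // tile
```

For example, `num_tiles(128)` covers a dimension of 128 with four tiles of 32, while a dimension of 8 is covered by a single tile of 8.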

ThomasRaoux · 2026-06-05T15:44:10Z

  let assemblyFormat = "attr-dict `:` type($result)";
 }

+def TTI_ExperimentalLocalGatherOp


Why can't we just slice and use local_load?

Keep partitioned shared gather and scatter unsupported until their multiple-base addressing is implemented. Pass target CTA IDs through the shared-memory target API and let NVIDIA select local versus cluster accesses. Add forced cross-CTA gather and scatter correctness coverage by swapping the CGA-distributed column bit. Validate with the compiler build, conversion lit tests, focused Gluon gather/scatter tests, and pre-commit.

jeffniu-openai · 2026-06-07T01:49:03Z

Slicing is too restrictive to express the partial subviews needed to load the tiny subtiles of the MMA. tti.experimental_local_gather has extra N-D dynamic offsets that apply to the logical dimensions. Doing it another way requires generalizing memdesc_reshape, but I can look into it.

jeffniu-openai · 2026-06-07T06:43:12Z

#10526 would make the extra op unnecessary.

PR description written by Codex

## Summary

- Add multi-CTA support to `ttg.local_gather` and `ttg.local_scatter`.

## Stack

Merge bottom-up:

- [ ] 👉 #10472
- [ ] #10473
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548

jeffniu-openai · 2026-06-09T23:14:40Z

Discussed this offline with @lezcano @ThomasRaoux and @peterbell10

- The K dimension is swizzled, and the emulation subtile can be smaller than the NVMMA swizzling tile, so we cannot currently subslice along it.
- What the PR needs is to offset along M and N. We can slice along K using local_gather because the elements are indexed separately, so we can pass them through the inverse layout to get the right elements.
- I tried unrolling the loop to do that, but it regressed compile time by 3-5x and slightly regressed runtime, because it generates a lot of code and is challenging for the register allocator.
- We dropped "[TritonGPU] Enable reshape of memdesc subviews" (#10526), as reshaping subslices is broadly forbidden in the subslice model.
- In general, supporting subslicing (dynamic or static) into the NVMMA swizzling tile requires either keeping track of the whole subslice chain or changing shared layouts to map from logical tensor to hardware, rather than the other way around.
- Both require a lot of work, and the former is not ideal because it would mean subtile loops have to be unrolled.

I'll move forward with the tti.experimental_local_gather op for now.
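The point about slicing along a swizzled K dimension can be sketched with a toy model in Python. Everything here is an assumption for illustration: the sizes, the XOR swizzle in `phys_offset`, and the helper names are hypothetical and are not the NVMMA layout or the Triton API. `phys_offset` stands in for the inverse layout (logical index to physical offset). The sketch shows that a K subtile narrower than the swizzle period cannot be read as a plain base-plus-offset subview, but a per-element gather through the mapping returns the right elements.

```python
ROWS, COLS, SWZ = 4, 8, 4  # toy tile; SWZ is the swizzle period (assumption)


def phys_offset(row: int, k: int) -> int:
    """Toy XOR swizzle: permute the K index within each row."""
    return row * COLS + (k ^ (row % SWZ))


def store_swizzled(logical):
    """Write a logical ROWS x COLS tile into flat, swizzled memory."""
    mem = [None] * (ROWS * COLS)
    for r in range(ROWS):
        for k in range(COLS):
            mem[phys_offset(r, k)] = logical[r][k]
    return mem


def gather_k_subtile(mem, k0: int, width: int):
    """Gather logical columns [k0, k0 + width), one element at a time."""
    return [[mem[phys_offset(r, k0 + j)] for j in range(width)]
            for r in range(ROWS)]


def naive_slice(mem, k0: int, width: int):
    """What a plain base+offset subview would read if it ignored the swizzle."""
    return [[mem[r * COLS + k0 + j] for j in range(width)]
            for r in range(ROWS)]


logical = [[r * COLS + k for k in range(COLS)] for r in range(ROWS)]
mem = store_swizzled(logical)
expected = [row[2:4] for row in logical]
print(gather_k_subtile(mem, 2, 2) == expected)  # gather matches the logical slice
print(naive_slice(mem, 2, 2) == expected)       # naive subview does not
```

In this toy model the gather works for any `k0` and `width`, because each element is addressed independently. The naive subview only matches for rows whose swizzle term happens to be zero.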

Context: #10473 landed after #10572 removed the old sliced-layout helpers, so the restacked #10527 called getSingleDimSliceEncoding and expandAllSlicedDims and failed during compilation on every CI GPU job.

Implementation: derive the K-axis range type with getSlicedTensorType and expand it with reshapeAndBroadcast, matching the replacement API introduced by #10572 while preserving the direct result layout.

Validation:

- ninja -C build/cmake.macosx-11.0-arm64-cpython-3.12 triton-opt -j8
- lit -v test/TritonGPU/nvidia-fpsan.mlir test/TritonGPU/fpsan.mlir

PR description written by Codex

Load shared MMA operands directly into their result layouts and reuse existing scale shadows, avoiding redundant layout conversions and scale snapshots, and also load the accumulator directly into its MMA layout.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [ ] #10527 (this PR)
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Reland of increased fpsan test coverage now that fpsan is faster.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [ ] #10532 (this PR)
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Optimize i8 decomposition by reordering the dots and eagerly combining partial results into the accumulator on the fly to minimize register pressure, and include a basic subtiling heuristic determined experimentally.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [ ] #10533 (this PR)
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
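As a rough illustration of eager accumulation, the Python sketch below assumes that the decomposition splits wider unsigned integer operands into 8-bit limbs and computes one small dot product per limb pair. That is an assumption for illustration only. The PR text does not spell out the scheme, and the names `dot_i8_decomposed` and `dot_reference` are hypothetical. Each partial dot is shifted and folded into a single accumulator as soon as it is produced, so only one partial result is live at a time.

```python
MASK32 = 0xFFFFFFFF  # results are kept modulo 2**32
LIMBS = 2            # 16-bit operands split into two 8-bit limbs (assumption)


def split_limbs(values):
    """Return one list per limb index, each holding 8-bit limbs."""
    return [[(v >> (8 * i)) & 0xFF for v in values] for i in range(LIMBS)]


def dot_i8_decomposed(a, b):
    """Dot product of 16-bit vectors built from 8-bit limb dots."""
    a_limbs, b_limbs = split_limbs(a), split_limbs(b)
    acc = 0
    for i in range(LIMBS):
        for j in range(LIMBS):
            # One small dot over 8-bit limbs.
            partial = sum(p * q for p, q in zip(a_limbs[i], b_limbs[j]))
            # Fold it into the accumulator right away instead of
            # keeping all partial results live until the end.
            acc = (acc + (partial << (8 * (i + j)))) & MASK32
    return acc


def dot_reference(a, b):
    """Direct dot product modulo 2**32 for comparison."""
    return sum(x * y for x, y in zip(a, b)) & MASK32
```

Because addition modulo 2**32 is associative and commutative, the order in which the limb dots are folded in does not change the result, which is what makes reordering them safe in this model.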

PR description written by Codex

Fix FPSAN TMEM emulation for initialized scratch synchronization, predicated stores, reduction loads, and scale-copy reinterpret views.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [ ] #10542 (this PR)
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Remove redundant payload sign clears before masked multiplies so NVPTX cannot fold them to `abs.f32`, which quiets signaling NaNs. Add a reduced one-warp regression that fails bitwise on PR #10542’s base and passes with the fix.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [x] #10542
- [ ] #10548 (this PR)
- [ ] #10561
- [ ] #10559
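The difference between a bitwise sign clear and a NaN-quieting absolute value can be sketched on raw binary32 bit patterns. This Python sketch uses only integer operations and assumes the usual IEEE 754 convention that the top mantissa bit is the quiet bit. `quieting_abs` is a hypothetical model of the behavior the description attributes to `abs.f32`, not the PTX instruction itself.

```python
SIGN = 0x80000000   # sign bit
EXP = 0x7F800000    # exponent field
MANT = 0x007FFFFF   # mantissa field
QUIET = 0x00400000  # top mantissa bit: set for quiet NaNs (assumed convention)


def is_nan(bits: int) -> bool:
    return (bits & EXP) == EXP and (bits & MANT) != 0


def is_signaling(bits: int) -> bool:
    return is_nan(bits) and (bits & QUIET) == 0


def clear_sign_bitwise(bits: int) -> int:
    """Integer mask: drops the sign and keeps every payload bit."""
    return bits & ~SIGN & 0xFFFFFFFF


def quieting_abs(bits: int) -> int:
    """Model of an absolute value that also quiets NaNs (assumption)."""
    out = bits & ~SIGN & 0xFFFFFFFF
    return out | QUIET if is_nan(out) else out


snan = 0xFFA00001  # negative signaling NaN with a nonzero payload
print(hex(clear_sign_bitwise(snan)))  # 0x7fa00001, still signaling
print(hex(quieting_abs(snan)))        # 0x7fe00001, payload bit changed
```

In this model the two operations agree on every non-NaN input and differ only in the quiet bit of NaNs, which is enough to break a bitwise comparison of payloads.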

jeffniu-openai requested a review from ptillet as a code owner June 4, 2026 06:54

jeffniu-openai mentioned this pull request Jun 4, 2026

[DNR][Do not merge] All FPSAN changes branch #10467

Closed

jeffniu-openai added 11 commits June 5, 2026 06:53

Accelerate FPSan MMA emulation with i8 decomposition

45e2e4b

Test FPSan TCGen MMA in warp partitions

a784db0

Support multi-CTA local gather and scatter

4dd75e4

Simplify multi-CTA gather and scatter lowering

80bf7cc

Preserve explicit cluster gather codegen

1ffc0a1

Apply pre-commit formatting

6dc53ac

Add instrumentation local gather for FPSan

4869637

Simplify instrumentation local gather

2c9405f

Apply pre-commit formatting

b1d02d7

Apply post-restack formatting

8cc70b2

jeffniu-openai force-pushed the jeffniu/local-gather-scatter-multicta branch from 0e4806e to 6dc53ac June 5, 2026 08:49

jeffniu-openai requested review from Jokeren, antiagainst, peterbell10 and zhanglx13 as code owners June 5, 2026 08:49

jeffniu-openai force-pushed the jeffniu/tti-experimental-local-gather branch from 70309a6 to 8cc70b2 June 5, 2026 08:49

ThomasRaoux reviewed Jun 5, 2026

peterbell10 reviewed Jun 5, 2026

Comment thread include/triton/Dialect/TritonInstrument/IR/TritonInstrumentOps.td

jeffniu-openai added 9 commits June 5, 2026 19:48

Merge remote-tracking branch 'refs/remotes/github/main' into jeffniu/…

5194768

…local-gather-scatter-multicta

[NVIDIA] Minimize multi-CTA shared dispatch

68037d4

[NVIDIA] Trim multi-CTA gather changes

3d0f65f

[NVIDIA] Restore multi-CTA lowering coverage

d098a0e

[NVIDIA] Simplify multi-CTA runtime test setup

96e29f0

[NVIDIA] Use nullable values for distributed shared memory

1e55b3c

merge

d9bd4b7

Merge branch 'jeffniu/local-gather-scatter-multicta' of https://githu…

2396557

…b.com/triton-lang/triton into jeffniu/local-gather-scatter-multicta

jeffniu-openai added 5 commits June 6, 2026 16:10

cleanup

ac28f5d

[NVIDIA] Always map distributed shared accesses

ac6263d

[GPUToLLVM] Lookup local address outputs by name

32f914d

[NVIDIA] Relax local gather barrier check

a54b8e9

Merge branch 'jeffniu/local-gather-scatter-multicta' into jeffniu/tti…

87bbf02

…-experimental-local-gather # Conflicts: # include/triton/Conversion/TritonGPUToLLVM/Utility.h # lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp # lib/Conversion/TritonGPUToLLVM/Utility.cpp

This was referenced Jun 7, 2026

[NVIDIA] Support multi-CTA local gather and scatter #10472

Merged

[FPSAN] Optimize shared MMA operand loads #10527

Merged

This was referenced Jun 8, 2026

[FPSAN] Broaden FPSan MMA dtype and minimum-shape coverage #10532

Merged

[FPSAN] Optimize i8 decomposition #10533

Merged

[FPSAN] Fix TMEM emulation correctness #10542

Merged

[FPSAN] Preserve NaN payload bits in encoding #10548

Merged

Base automatically changed from jeffniu/local-gather-scatter-multicta to main June 9, 2026 06:45

jeffniu-openai requested review from CRobeck and fywkevin as code owners June 9, 2026 06:45

Merge triton-lang/triton main into jeffniu/tti-experimental-local-gather

967ee8f

This was referenced Jun 10, 2026

[GSan][FPSan] Fix sanitizer correctness and performance #10561

Draft

[NVIDIA] Support local gather from multi-CTA subslices #10559

Open

pawelszczerbuk approved these changes Jun 11, 2026

jeffniu-openai merged commit ca6c8ca into main Jun 11, 2026
10 checks passed

jeffniu-openai deleted the jeffniu/tti-experimental-local-gather branch June 11, 2026 21:45

jeffniu-openai merged 26 commits into main from jeffniu/tti-experimental-local-gather

4 participants
