[NVIDIA] Support multi-CTA local gather and scatter by jeffniu-openai · Pull Request #10472 · triton-lang/triton

jeffniu-openai · 2026-06-04T06:54:53Z

PR description written by Codex

Summary

Add multi-CTA support to ttg.local_gather and ttg.local_scatter.
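For readers unfamiliar with the two ops, here is a minimal pure-Python sketch of the indexing they perform on a 2D buffer. It assumes `torch.gather`-style semantics along one axis, which this PR does not spell out; the helper names are hypothetical and this models only the data movement, not the actual lowering.

```python
def local_gather(src, idx, axis):
    """Gather from a 2D buffer (assumed torch.gather-style semantics).

    axis == 0: out[i][j] = src[idx[i][j]][j]
    axis == 1: out[i][j] = src[i][idx[i][j]]
    """
    return [
        [src[k][j] if axis == 0 else src[i][k] for j, k in enumerate(row)]
        for i, row in enumerate(idx)
    ]


def local_scatter(dst, values, idx, axis):
    """Inverse data movement: write values[i][j] to the indexed slot of a copy of dst."""
    out = [row[:] for row in dst]
    for i, row in enumerate(idx):
        for j, k in enumerate(row):
            if axis == 0:
                out[k][j] = values[i][j]
            else:
                out[i][k] = values[i][j]
    return out
```

In the multi-CTA case, the buffer being indexed is split across the CTAs of a cluster, so a single index may resolve to memory owned by a different CTA.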

Stack

Merge bottom-up:

Centralize ordinary dot K verification in DotOpInterface, share MMAv2 warp distribution between matmul acceleration and FPSan, and simplify FPSan tile selection using TTGIR's power-of-two shape invariant. Preserve bounded emulation tiles and existing i8 decomposition eligibility.


Keep partitioned shared gather and scatter unsupported until their multiple-base addressing is implemented. Pass target CTA IDs through the shared-memory target API and let NVIDIA select local versus cluster accesses. Add forced cross-CTA gather and scatter correctness coverage by swapping the CGA-distributed column bit. Validate with the compiler build, conversion lit tests, focused Gluon gather/scatter tests, and pre-commit.
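The forced cross-CTA test idea can be illustrated with a small model. This sketch assumes a two-CTA cluster in which one bit of the column index selects the owning CTA, and it reads "swapping" as flipping that bit. The bit position, the helper names, and the "local"/"cluster" labels are all hypothetical and are not taken from the Triton sources.

```python
CTA_BIT = 2  # hypothetical: bit 2 of the global column index selects the owning CTA


def split_column(col):
    """Return (owner_cta, local_col) for a global column index."""
    owner = (col >> CTA_BIT) & 1
    low = col & ((1 << CTA_BIT) - 1)   # bits below the CTA bit
    high = col >> (CTA_BIT + 1)        # bits above the CTA bit
    return owner, (high << CTA_BIT) | low


def gather_columns(shared, my_cta, cols):
    """Gather from per-CTA buffers and record which kind of access each index needs."""
    values, kinds = [], []
    for col in cols:
        owner, local = split_column(col)
        values.append(shared[owner][local])
        kinds.append("local" if owner == my_cta else "cluster")
    return values, kinds


def force_cross_cta(cols):
    """Flip the CTA-selecting bit so each index lands in the other CTA's buffer."""
    return [col ^ (1 << CTA_BIT) for col in cols]
```

With eight columns split as `shared = [[0, 1, 2, 3], [4, 5, 6, 7]]`, CTA 0 gathering columns `[0, 3, 1]` touches only its own buffer. After `force_cross_cta` the same request becomes `[4, 7, 5]` and every access is a cluster access, so a test built this way cannot pass by accidentally reading local memory.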


PR description written by Codex

## Summary

- Load shared-memory TCGen MMA operands directly instead of round-tripping through global scratch.

## Stack

Merge bottom-up:

- [x] #10472
- [ ] #10473 (this PR)
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Load shared MMA operands directly into their result layouts and reuse existing scale shadows, avoiding redundant layout conversions and scale snapshots. Also load the accumulator directly into its MMA layout.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [ ] #10527 (this PR)
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Reland the increased FPSan test coverage now that FPSan is faster.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [ ] #10532 (this PR)
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Optimize i8 decomposition by reordering the dots and eagerly combining into the accumulator on the fly to minimize register pressure. Include a basic subtiling heuristic that was determined experimentally.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [ ] #10533 (this PR)
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Fix FPSAN TMEM emulation for initialized scratch synchronization, predicated stores, reduction loads, and scale-copy reinterpret views.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [ ] #10542 (this PR)
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Remove redundant payload sign clears before masked multiplies so NVPTX cannot fold them to `abs.f32`, which quiets signaling NaNs. Add a reduced one-warp regression that fails bitwise on PR #10542's base and passes with the fix.

## Stack

Merge bottom-up:

- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [x] #10542
- [ ] #10548 (this PR)
- [ ] #10561
- [ ] #10559
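To see why folding a sign clear into a floating-point `abs` can change bits, the sketch below works on raw binary32 bit patterns. It assumes the common convention that mantissa bit 22 is the quiet bit. `abs_that_quiets` is a hypothetical model of the behavior described above, not the actual semantics of any PTX instruction.

```python
SIGN_BIT = 0x8000_0000
EXP_MASK = 0x7F80_0000
MANT_MASK = 0x007F_FFFF
QUIET_BIT = 0x0040_0000  # assumption: top mantissa bit marks a quiet NaN


def is_nan(bits):
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0


def is_signaling_nan(bits):
    return is_nan(bits) and not (bits & QUIET_BIT)


def clear_sign_bitwise(bits):
    """Integer mask: only the sign bit changes, so the NaN payload survives."""
    return bits & ~SIGN_BIT & 0xFFFF_FFFF


def abs_that_quiets(bits):
    """Hypothetical model of an abs that also quiets signaling NaNs."""
    cleared = clear_sign_bitwise(bits)
    return cleared | QUIET_BIT if is_nan(cleared) else cleared
```

For a negative signaling NaN such as `0xFF800001`, the integer mask yields `0x7F800001`, while the quieting model yields `0x7FC00001`. A bitwise comparison of results would therefore catch the difference.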

jeffniu-openai requested a review from ptillet as a code owner June 4, 2026 06:54

jeffniu-openai mentioned this pull request Jun 4, 2026

[DNR][Do not merge] All FPSAN changes branch #10467

Closed

lezcano reviewed Jun 4, 2026


jeffniu-openai force-pushed the jeffniu/i8-decomp branch from c3a9efb to c4e37c0 Compare June 5, 2026 06:50

jeffniu-openai requested review from Jokeren, antiagainst, peterbell10 and zhanglx13 as code owners June 5, 2026 06:50

jeffniu-openai added 3 commits June 5, 2026 06:53

Accelerate FPSan MMA emulation with i8 decomposition

45e2e4b

Test FPSan TCGen MMA in warp partitions

a784db0

jeffniu-openai force-pushed the jeffniu/i8-decomp branch from c4e37c0 to 333348b Compare June 5, 2026 06:55

jeffniu-openai added 4 commits June 5, 2026 08:31

Support multi-CTA local gather and scatter

4dd75e4

Simplify multi-CTA gather and scatter lowering

80bf7cc

Preserve explicit cluster gather codegen

1ffc0a1

Apply pre-commit formatting

6dc53ac

jeffniu-openai force-pushed the jeffniu/local-gather-scatter-multicta branch from 0e4806e to 6dc53ac Compare June 5, 2026 08:49

Base automatically changed from jeffniu/i8-decomp to main June 5, 2026 09:12

jeffniu-openai added 7 commits June 5, 2026 19:48

Merge remote-tracking branch 'refs/remotes/github/main' into jeffniu/local-gather-scatter-multicta

5194768

[NVIDIA] Minimize multi-CTA shared dispatch

68037d4

[NVIDIA] Trim multi-CTA gather changes

3d0f65f

[NVIDIA] Restore multi-CTA lowering coverage

d098a0e

[NVIDIA] Simplify multi-CTA runtime test setup

96e29f0

[NVIDIA] Use nullable values for distributed shared memory

1e55b3c

jeffniu-openai requested review from CRobeck and fywkevin as code owners June 6, 2026 23:04

jeffniu-openai added 3 commits June 6, 2026 16:06

merge

d9bd4b7

Merge branch 'jeffniu/local-gather-scatter-multicta' of https://github.com/triton-lang/triton into jeffniu/local-gather-scatter-multicta

2396557

cleanup

ac28f5d

jeffniu-openai added 3 commits June 6, 2026 23:26

[NVIDIA] Always map distributed shared accesses

ac6263d

[GPUToLLVM] Lookup local address outputs by name

32f914d

[NVIDIA] Relax local gather barrier check

a54b8e9

This was referenced Jun 7, 2026

[FPSAN] Load TCGen shared operands directly #10473

Merged

[FPSAN] Optimize shared MMA operand loads #10527

Merged

[FPSAN] Broaden FPSan MMA dtype and minimum-shape coverage #10532

Merged

[FPSAN] Optimize i8 decomposition #10533

Merged

lezcano approved these changes Jun 8, 2026


This was referenced Jun 8, 2026

[FPSAN] Fix TMEM emulation correctness #10542

Merged

[FPSAN] Preserve NaN payload bits in encoding #10548

Merged

jeffniu-openai merged commit bb67bbd into main Jun 9, 2026
10 checks passed

jeffniu-openai deleted the jeffniu/local-gather-scatter-multicta branch June 9, 2026 06:45

This was referenced Jun 10, 2026

[GSan][FPSan] Fix sanitizer correctness and performance #10561

Draft

[NVIDIA] Support local gather from multi-CTA subslices #10559

Open


jeffniu-openai merged 20 commits into main from jeffniu/local-gather-scatter-multicta


2 participants
