
[NVIDIA] Support multi-CTA local gather and scatter #10472

Merged
jeffniu-openai merged 20 commits into main from jeffniu/local-gather-scatter-multicta
Jun 9, 2026

Conversation

Comment thread lib/Conversion/TritonGPUToLLVM/Utility.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/Utility.cpp Outdated
Comment thread lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp
Comment thread lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp
Comment thread lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp Outdated
Centralize ordinary dot K verification in DotOpInterface, share MMAv2 warp distribution between matmul acceleration and FPSan, and simplify FPSan tile selection using TTGIR's power-of-two shape invariant. Preserve bounded emulation tiles and existing i8 decomposition eligibility.
@jeffniu-openai jeffniu-openai force-pushed the jeffniu/local-gather-scatter-multicta branch from 0e4806e to 6dc53ac Compare June 5, 2026 08:49
Base automatically changed from jeffniu/i8-decomp to main June 5, 2026 09:12
Keep partitioned shared gather and scatter unsupported until their multiple-base addressing is implemented. Pass target CTA IDs through the shared-memory target API and let NVIDIA select local versus cluster accesses.

Add forced cross-CTA gather and scatter correctness coverage by swapping the CGA-distributed column bit. Validate with the compiler build, conversion lit tests, focused Gluon gather/scatter tests, and pre-commit.
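The idea of forcing cross-CTA traffic by swapping the CGA-distributed column bit can be illustrated with a small host-side sketch. This is not the test from the PR: it assumes two CTAs, a single row of eight columns, and that bit 2 of the column index is the bit the layout distributes across CTAs. `CTA_BIT`, `owner_cta`, and `flip_cta_bit` are hypothetical names chosen for illustration.

```python
# Hypothetical model: 8 columns split across 2 CTAs by one column bit.
NUM_COLS = 8
CTA_BIT = 2  # assumption: this column bit selects the owning CTA


def owner_cta(col: int) -> int:
    """CTA that owns a column under the assumed layout."""
    return (col >> CTA_BIT) & 1


def flip_cta_bit(col: int) -> int:
    """Swap the CTA-distributed bit so the index lands in the other CTA."""
    return col ^ (1 << CTA_BIT)


# Source row as it would sit in (distributed) shared memory.
src_row = [10 * c for c in range(NUM_COLS)]

# Gather indices: every column reads its counterpart in the other CTA.
indices = [flip_cta_bit(c) for c in range(NUM_COLS)]
gathered = [src_row[i] for i in indices]

# Scatter with the same indices: every column writes into the other CTA.
scattered = [None] * NUM_COLS
for c, i in enumerate(indices):
    scattered[i] = src_row[c]

# Every access crosses the CTA boundary.
all_cross_cta = all(owner_cta(c) != owner_cta(i) for c, i in enumerate(indices))
```

Because flipping one bit is its own inverse, the gather and the scatter produce the same permutation here, which makes the expected result easy to compute on the host.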
@jeffniu-openai jeffniu-openai merged commit bb67bbd into main Jun 9, 2026
10 checks passed
@jeffniu-openai jeffniu-openai deleted the jeffniu/local-gather-scatter-multicta branch June 9, 2026 06:45
jeffniu-openai added a commit that referenced this pull request Jun 11, 2026
PR description written by Codex

## Summary
- Load shared-memory TCGen MMA operands directly instead of
round-tripping through global scratch.

## Stack
Merge bottom-up:
- [x] #10472
- [ ] #10473 (this PR)
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Load shared MMA operands directly into their result layouts and reuse
existing scale shadows, avoiding redundant layout conversions and scale
snapshots. Also load the accumulator directly into its MMA layout.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [ ] #10527 (this PR)
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Reland the increased FPSan test coverage now that FPSan is faster.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [ ] #10532 (this PR)
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Optimize i8 decomposition by reordering the dots and eagerly combining
them into the accumulator on the fly to minimize register pressure, and
include a basic, experimentally determined subtiling heuristic.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [ ] #10533 (this PR)
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Fix FPSan TMEM emulation for initialized scratch synchronization,
predicated stores, reduction loads, and scale-copy reinterpret views.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [ ] #10542 (this PR)
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Remove redundant payload sign clears before masked multiplies so NVPTX
cannot fold them to `abs.f32`, which quiets signaling NaNs. Add a
reduced one-warp regression that fails bitwise on PR #10542’s base and
passes with the fix.
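Why a sign clear folded into `abs.f32` can change results bitwise is easier to see on raw binary32 bit patterns. The sketch below is a host-side model only, not the code from this PR. It assumes the common encoding where the top mantissa bit (bit 22) is the quiet bit, and it models the quieting behaviour described above as setting that bit on NaN inputs. `clear_sign_bitwise` and `abs_quieting_model` are hypothetical names.

```python
SIGN_MASK = 0x8000_0000
EXP_MASK = 0x7F80_0000
MANT_MASK = 0x007F_FFFF
QUIET_BIT = 0x0040_0000  # assumption: top mantissa bit marks a quiet NaN


def is_nan(bits: int) -> bool:
    """True if the 32-bit pattern encodes a NaN (all-ones exponent, nonzero mantissa)."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0


def clear_sign_bitwise(bits: int) -> int:
    """Integer AND that clears only the sign bit; the payload is untouched."""
    return bits & ~SIGN_MASK & 0xFFFF_FFFF


def abs_quieting_model(bits: int) -> int:
    """Model of a floating-point abs that quiets signaling NaNs (assumed behaviour)."""
    cleared = bits & ~SIGN_MASK & 0xFFFF_FFFF
    return cleared | QUIET_BIT if is_nan(cleared) else cleared


snan = 0xFFA0_0001      # negative signaling NaN with a payload
neg_one = 0xBF80_0000   # -1.0

masked = clear_sign_bitwise(snan)     # payload and signaling state preserved
folded = abs_quieting_model(snan)     # quiet bit set, so the bits differ
```

For ordinary values such as `neg_one` the two functions agree, so the difference only shows up in a bitwise comparison of NaN payloads, which matches the kind of regression described above.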

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [x] #10542
- [ ] #10548 (this PR)
- [ ] #10561
- [ ] #10559


2 participants