[FPSAN] Optimize shared MMA operand loads by jeffniu-openai · Pull Request #10527 · triton-lang/triton

jeffniu-openai · 2026-06-07T04:34:31Z

PR description written by Codex

Load shared MMA operands directly into their result layouts and reuse existing scale shadows, which avoids redundant layout conversions and scale snapshots. The accumulator is also loaded directly into its MMA layout.

Stack

Merge bottom-up:

PR description written by Codex

## Summary
- Add multi-CTA support to `ttg.local_gather` and `ttg.local_scatter`.

## Stack
Merge bottom-up:
- [ ] 👉 #10472
- [ ] #10473
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548

PR description written by Codex

## Summary
- Load shared-memory TCGen MMA operands directly instead of round-tripping through global scratch.

## Stack
Merge bottom-up:
- [x] #10472
- [ ] #10473 (this PR)
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

Context: #10473 landed after #10572 removed the old sliced-layout helpers, so the restacked #10527 called `getSingleDimSliceEncoding` and `expandAllSlicedDims` and failed during compilation on every CI GPU job.

Implementation: derive the K-axis range type with `getSlicedTensorType` and expand it with `reshapeAndBroadcast`, matching the replacement API introduced by #10572 while preserving the direct result layout.

Validation:
- `ninja -C build/cmake.macosx-11.0-arm64-cpython-3.12 triton-opt -j8`
- `lit -v test/TritonGPU/nvidia-fpsan.mlir test/TritonGPU/fpsan.mlir`

PR description written by Codex

Reland the increased fpsan test coverage now that fpsan is faster.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [ ] #10532 (this PR)
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Optimize i8 decomposition by reordering the dots and eagerly combining their results into the accumulator on the fly to minimize register pressure. Also include a basic subtiling heuristic that was determined experimentally.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [ ] #10533 (this PR)
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Fix FPSAN TMEM emulation for initialized scratch synchronization, predicated stores, reduction loads, and scale-copy reinterpret views.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [ ] #10542 (this PR)
- [ ] #10548
- [ ] #10561
- [ ] #10559

PR description written by Codex

Remove redundant payload sign clears before masked multiplies so NVPTX cannot fold them to `abs.f32`, which quiets signaling NaNs. Add a reduced one-warp regression that fails bitwise on PR #10542’s base and passes with the fix.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [x] #10542
- [ ] #10548 (this PR)
- [ ] #10561
- [ ] #10559
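The signaling-NaN concern in the #10548 description can be illustrated at the bit level. The sketch below is not taken from the PR: the helper names are hypothetical, and `quieting_abs` is only an assumed model of a floating-point `abs` that quiets NaNs (the exact bits a real `abs.f32` produces are not stated here). It works on raw IEEE 754 binary32 bit patterns, where bit 22 is the quiet bit.

```python
# IEEE 754 binary32 field masks.
SIGN_MASK = 0x80000000
EXP_MASK = 0x7F800000
MANT_MASK = 0x007FFFFF
QUIET_BIT = 0x00400000  # most significant mantissa bit


def is_nan(bits: int) -> bool:
    """True if the 32-bit pattern encodes any NaN."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0


def is_signaling_nan(bits: int) -> bool:
    """True if the pattern is a NaN with the quiet bit cleared."""
    return is_nan(bits) and (bits & QUIET_BIT) == 0


def clear_sign_bitwise(bits: int) -> int:
    """Integer sign clear: every payload bit is preserved."""
    return bits & ~SIGN_MASK & 0xFFFFFFFF


def quieting_abs(bits: int) -> int:
    """ASSUMED model of a float abs that quiets NaNs (hypothetical helper)."""
    out = bits & ~SIGN_MASK & 0xFFFFFFFF
    if is_nan(out):
        out |= QUIET_BIT  # signaling NaN becomes a quiet NaN
    return out


snan = 0xFFA00001  # negative signaling NaN with a nonzero payload
print(hex(clear_sign_bitwise(snan)))  # sign cleared, payload intact
print(hex(quieting_abs(snan)))        # sign cleared, quiet bit set
```

Under this model the two results differ in exactly one bit, which is the kind of discrepancy a bitwise regression test would detect.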

jeffniu-openai requested a review from ptillet as a code owner June 7, 2026 04:34

This was referenced Jun 10, 2026

[GSan][FPSan] Fix sanitizer correctness and performance #10561

Draft

[NVIDIA] Support local gather from multi-CTA subslices #10559

Open

pawelszczerbuk approved these changes Jun 11, 2026


Base automatically changed from jeffniu/tti-experimental-local-gather to main June 11, 2026 21:45

jeffniu-openai force-pushed the jeffniu/fpsan-direct-result-layout branch 2 times, most recently from 3e07ffe to ceff22c Compare June 11, 2026 21:57

[FPSAN] Optimize shared MMA operand loads

ebb2c60

jeffniu-openai force-pushed the jeffniu/fpsan-direct-result-layout branch from 2f74a21 to ebb2c60 Compare June 11, 2026 23:24

jeffniu-openai merged commit e4ee981 into main Jun 12, 2026
19 of 20 checks passed

jeffniu-openai deleted the jeffniu/fpsan-direct-result-layout branch June 12, 2026 01:01

jeffniu-openai merged 1 commit into main from jeffniu/fpsan-direct-result-layout


2 participants
