[FPSAN] Optimize shared MMA operand loads #10527

Merged: jeffniu-openai merged 1 commit into main from jeffniu/fpsan-direct-result-layout on Jun 12, 2026

jeffniu-openai (Collaborator) commented Jun 7, 2026

jeffniu-openai added a commit that referenced this pull request Jun 11, 2026
PR description written by Codex

## Summary
- Load shared-memory TCGen MMA operands directly instead of
round-tripping through global scratch.

## Stack
Merge bottom-up:
- [x] #10472
- [ ] #10473 (this PR)
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
Base automatically changed from jeffniu/tti-experimental-local-gather to main June 11, 2026 21:45
jeffniu-openai force-pushed the jeffniu/fpsan-direct-result-layout branch 2 times, most recently from 3e07ffe to ceff22c, June 11, 2026 21:57
jeffniu-openai added a commit that referenced this pull request Jun 11, 2026
Context: #10473 landed after #10572 removed the old sliced-layout helpers,
so the restacked #10527 still called the removed getSingleDimSliceEncoding
and expandAllSlicedDims helpers and failed to compile on every CI GPU job.

Implementation: derive the K-axis range type with getSlicedTensorType and
expand it with reshapeAndBroadcast, matching the replacement API introduced
by #10572 while preserving the direct result layout.

Validation:
- ninja -C build/cmake.macosx-11.0-arm64-cpython-3.12 triton-opt -j8
- lit -v test/TritonGPU/nvidia-fpsan.mlir test/TritonGPU/fpsan.mlir
jeffniu-openai force-pushed the jeffniu/fpsan-direct-result-layout branch from 2f74a21 to ebb2c60, June 11, 2026 23:24
jeffniu-openai merged commit e4ee981 into main Jun 12, 2026
19 of 20 checks passed
jeffniu-openai deleted the jeffniu/fpsan-direct-result-layout branch June 12, 2026 01:01
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Reland of increased fpsan test coverage now that fpsan is faster

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [ ] #10532 (this PR)
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Optimize i8 decomposition by reordering the dots and eagerly combining
partial results into the accumulator on the fly to minimize register
pressure, and include a basic subtiling heuristic determined experimentally.
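
As a rough illustration of the idea, and not the code in this PR, the sketch below assumes that "i8 decomposition" means splitting wider integer operands into 8-bit limbs and computing one dot per limb pair. The names `limbs` and `dot_i8_decomposed` are hypothetical, operands are treated as unsigned 32-bit values, and the actual dot ordering and subtiling heuristic are not shown in this thread. Each partial dot is folded into the accumulator as soon as it is produced, so only one partial result is live at a time.

```python
MASK32 = 0xFFFF_FFFF


def limbs(x: int, n: int = 4) -> list[int]:
    """Split an unsigned 32-bit value into n little-endian 8-bit limbs."""
    return [(x >> (8 * i)) & 0xFF for i in range(n)]


def dot_i8_decomposed(a: list[int], b: list[int]) -> int:
    """Dot product of two u32 vectors modulo 2**32 using only 8-bit limb dots.

    Hypothetical helper for illustration; not an API from Triton.
    """
    a_limbs = [limbs(x) for x in a]
    b_limbs = [limbs(x) for x in b]
    acc = 0
    for i in range(4):
        # Limb pairs with i + j >= 4 are multiples of 2**32 and vanish.
        for j in range(4 - i):
            partial = sum(al[i] * bl[j] for al, bl in zip(a_limbs, b_limbs))
            # Combine eagerly instead of keeping every partial dot alive.
            acc = (acc + (partial << (8 * (i + j)))) & MASK32
    return acc


a = [0xDEADBEEF, 0x12345678, 7]
b = [0xCAFEBABE, 0x9ABCDEF0, 0xFFFFFFFF]
reference = sum(x * y for x, y in zip(a, b)) & MASK32
```

Here `dot_i8_decomposed(a, b)` matches `reference`, the direct dot product reduced modulo 2**32.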

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [ ] #10533 (this PR)
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Fix FPSAN TMEM emulation for initialized scratch synchronization,
predicated stores, reduction loads, and scale-copy reinterpret views.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [ ] #10542 (this PR)
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Remove redundant payload sign clears before masked multiplies so NVPTX
cannot fold them to `abs.f32`, which quiets signaling NaNs. Add a
reduced one-warp regression that fails bitwise on PR #10542’s base and
passes with the fix.
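
To show why such a fold matters for a bitwise comparison, here is a small sketch working on raw IEEE-754 binary32 bit patterns. It is a model under stated assumptions, not the FPSAN code: `clear_sign_quieting_abs` is a hypothetical stand-in for a floating-point abs that quiets NaNs, as the description says `abs.f32` does, while `clear_sign_integer` is a plain integer mask.

```python
QUIET_BIT = 0x0040_0000   # top mantissa bit of a binary32 value
EXP_MASK = 0x7F80_0000
MANT_MASK = 0x007F_FFFF


def is_nan(bits: int) -> bool:
    """True if the 32-bit pattern encodes any NaN."""
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0


def is_snan(bits: int) -> bool:
    """True if the pattern is a signaling NaN (quiet bit clear)."""
    return is_nan(bits) and not (bits & QUIET_BIT)


def clear_sign_integer(bits: int) -> int:
    """Integer sign clear: only bit 31 changes, the payload is untouched."""
    return bits & 0x7FFF_FFFF


def clear_sign_quieting_abs(bits: int) -> int:
    """Assumed model of a float abs that also quiets signaling NaNs."""
    out = bits & 0x7FFF_FFFF
    return out | QUIET_BIT if is_nan(out) else out


snan = 0xFFA0_1234  # negative signaling NaN carrying a payload
masked = clear_sign_integer(snan)
folded = clear_sign_quieting_abs(snan)
```

In this model `masked` is `0x7FA01234` and is still signaling, while `folded` is `0x7FE01234`. The two differ in a single bit, which is enough to fail a bitwise regression test.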

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [x] #10542
- [ ] #10548 (this PR)
- [ ] #10561
- [ ] #10559