
[FPSAN] Load TCGen shared operands directly#10473

Merged
jeffniu-openai merged 26 commits into
mainfrom
jeffniu/tti-experimental-local-gather
Jun 11, 2026

@jeffniu-openai

@jeffniu-openai jeffniu-openai commented Jun 4, 2026

Collaborator

@jeffniu-openai jeffniu-openai force-pushed the jeffniu/local-gather-scatter-multicta branch from 0e4806e to 6dc53ac Compare June 5, 2026 08:49
@jeffniu-openai jeffniu-openai force-pushed the jeffniu/tti-experimental-local-gather branch from 70309a6 to 8cc70b2 Compare June 5, 2026 08:49
let assemblyFormat = "attr-dict `:` type($result)";
}

def TTI_ExperimentalLocalGatherOp

Collaborator

Why can't we just slice and use local_load?

Comment thread include/triton/Dialect/TritonInstrument/IR/TritonInstrumentOps.td
Keep partitioned shared gather and scatter unsupported until their multiple-base addressing is implemented. Pass target CTA IDs through the shared-memory target API and let NVIDIA select local versus cluster accesses.

Add forced cross-CTA gather and scatter correctness coverage by swapping the CGA-distributed column bit. Validate with the compiler build, conversion lit tests, focused Gluon gather/scatter tests, and pre-commit.
…-experimental-local-gather

# Conflicts:
#	include/triton/Conversion/TritonGPUToLLVM/Utility.h
#	lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp
#	lib/Conversion/TritonGPUToLLVM/Utility.cpp
@jeffniu-openai

Collaborator Author

Slice is too restricted to perform the partial subviews needed to load the tiny subtiles of the MMA. `tti.experimental_local_gather` has extra ND dynamic offsets that apply to the logical dimensions. Doing it another way requires generalizing `memdesc_reshape`, but I can look into it.
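
To make the "extra ND dynamic offsets on the logical dimensions" concrete, here is a toy 2D model in plain Python. It is only a sketch: the function name `gather_with_offsets`, its parameters, and the gather-along-one-axis convention are assumptions for illustration, not the actual semantics or signature of `tti.experimental_local_gather`.

```python
def gather_with_offsets(src, indices, offsets, axis=0):
    """Toy model (hypothetical, not the real op).

    For axis=0: out[i][j] = src[off_r + indices[i][j]][off_c + j]
    For axis=1: out[i][j] = src[off_r + i][off_c + indices[i][j]]
    The offsets are runtime values applied to logical coordinates.
    """
    off_r, off_c = offsets
    out = []
    for i, row in enumerate(indices):
        out_row = []
        for j, idx in enumerate(row):
            # The gathered axis takes its coordinate from the index tensor,
            # the other axis keeps its own position in the result.
            r, c = (idx, j) if axis == 0 else (i, idx)
            out_row.append(src[off_r + r][off_c + c])
        out.append(out_row)
    return out


# An 8x8 logical buffer whose value encodes its own (row, col) coordinate.
src = [[10 * r + c for c in range(8)] for r in range(8)]
indices = [[0, 1], [2, 3]]

# Read a 2x2 subtile whose origin (4, 2) is only known at runtime.
tile = gather_with_offsets(src, indices, offsets=(4, 2), axis=0)
print(tile)
```

In this model the subtile origin is data rather than part of the type, which is the property a static slice cannot express.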

@jeffniu-openai

Collaborator Author

#10526 would make it possible to avoid adding an extra op.

jeffniu-openai added a commit that referenced this pull request Jun 9, 2026
PR description written by Codex

## Summary
- Add multi-CTA support to `ttg.local_gather` and `ttg.local_scatter`.

## Stack
Merge bottom-up:
- [ ] 👉 #10472
- [ ] #10473
- [ ] #10527
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
Base automatically changed from jeffniu/local-gather-scatter-multicta to main June 9, 2026 06:45
@jeffniu-openai

Collaborator Author

Discussed this offline with @lezcano @ThomasRaoux and @peterbell10

  • The K dimension is swizzled, and the emulation subtile can be smaller than the NVMMA swizzling tile, so we cannot currently subslice along it.
  • What the PR needs is to offset along M and N. We can slice along K using local_gather because the elements are indexed separately, so we can pass them through the inverse layout to get the right elements.
  • I tried unrolling the loop to do that, but it regressed compile time by 3-5x and slightly regressed runtime, because it generates a lot of code and is challenging for the register allocator.
  • We dropped "[TritonGPU] Enable reshape of memdesc subviews" (#10526), as reshaping subslices is broadly forbidden in the subslice model.
  • In general, supporting subslicing (dynamic or static) into the NVMMA swizzling tile requires either keeping track of the whole subslice chain or changing shared layouts to map from logical tensor to hardware, rather than the other way around.
  • Both require a lot of work, and the former is not ideal because it would mean subtile loops have to be unrolled.

I'll move forward with the tti.experimental_local_gather op for now.
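
The swizzling point above can be illustrated with a toy model. This is a sketch under loud assumptions: the XOR pattern, the `SWIZZLE` width, and all function names are hypothetical and much simpler than the real NVMMA shared layouts. It only shows why a physical sub-window narrower than the swizzle tile does not hold a contiguous logical K range, while per-element indexing through the layout mapping does recover it.

```python
SWIZZLE = 4  # hypothetical: the XOR pattern spans 4 columns and repeats every 4 rows


def physical_col(row, col):
    # Toy swizzle: XOR the column with the row's position inside the tile.
    # XOR is its own inverse, so the same function maps both directions.
    return col ^ (row % SWIZZLE)


def store_swizzled(logical):
    """Write a logical [rows][K] tile into its swizzled physical placement."""
    rows, cols = len(logical), len(logical[0])
    phys = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for k in range(cols):
            phys[r][physical_col(r, k)] = logical[r][k]
    return phys


def gather_k_slice(phys, k_start, k_len):
    """Read logical columns [k_start, k_start + k_len) one element at a time."""
    return [
        [phys[r][physical_col(r, k)] for k in range(k_start, k_start + k_len)]
        for r in range(len(phys))
    ]


logical = [[10 * r + k for k in range(8)] for r in range(4)]
phys = store_swizzled(logical)

# Naive physical subslice of 2 columns, narrower than the swizzle tile.
naive = [row[0:2] for row in phys]
# Element-wise gather through the layout mapping.
gathered = gather_k_slice(phys, 0, 2)

print(naive)     # mixes elements from other logical columns
print(gathered)  # matches the logical K range [0, 2)
```

In this model, the naive window for row 2 contains logical columns 2 and 3 instead of 0 and 1, which is the kind of mismatch that rules out a plain subslice along K.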

@jeffniu-openai jeffniu-openai merged commit ca6c8ca into main Jun 11, 2026
10 checks passed
@jeffniu-openai jeffniu-openai deleted the jeffniu/tti-experimental-local-gather branch June 11, 2026 21:45
jeffniu-openai added a commit that referenced this pull request Jun 11, 2026
Context: #10473 landed after #10572 removed the old sliced-layout helpers,
so the restacked #10527 called getSingleDimSliceEncoding and
expandAllSlicedDims and failed during compilation on every CI GPU job.

Implementation: derive the K-axis range type with getSlicedTensorType and
expand it with reshapeAndBroadcast, matching the replacement API introduced
by #10572 while preserving the direct result layout.

Validation:
- ninja -C build/cmake.macosx-11.0-arm64-cpython-3.12 triton-opt -j8
- lit -v test/TritonGPU/nvidia-fpsan.mlir test/TritonGPU/fpsan.mlir
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Load shared MMA operands directly into their result layouts and reuse
existing scale shadows, avoiding redundant layout conversions and scale
snapshots, and also load accumulator directly into its MMA layout.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [ ] #10527 (this PR)
- [ ] #10532
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Reland of increased fpsan test coverage now that fpsan is faster

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [ ] #10532 (this PR)
- [ ] #10533
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Optimize i8 decomposition by reordering the dots and eagerly combining
into the accumulator on the fly to minimize register pressure, and
include a basic subtiling heuristic determined experimentally.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [ ] #10533 (this PR)
- [ ] #10542
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Fix FPSAN TMEM emulation for initialized scratch synchronization,
predicated stores, reduction loads, and scale-copy reinterpret views.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [ ] #10542 (this PR)
- [ ] #10548
- [ ] #10561
- [ ] #10559
jeffniu-openai added a commit that referenced this pull request Jun 12, 2026
PR description written by Codex

Remove redundant payload sign clears before masked multiplies so NVPTX
cannot fold them to `abs.f32`, which quiets signaling NaNs. Add a
reduced one-warp regression that fails bitwise on PR #10542’s base and
passes with the fix.

## Stack
Merge bottom-up:
- [x] #10472
- [x] #10473
- [x] #10527
- [x] #10532
- [x] #10533
- [x] #10542
- [ ] #10548 (this PR)
- [ ] #10561
- [ ] #10559
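
The signaling-NaN concern in the #10548 description can be sketched at the bit level. The snippet below is a model, not a reproduction of the NVPTX behavior: it assumes the usual IEEE 754 binary32 convention in which the top mantissa bit marks a quiet NaN, and `abs_quieting_model` is a hypothetical stand-in for a floating-point `abs` that quiets signaling NaNs, as the description says `abs.f32` does. All names are illustrative.

```python
SIGN_CLEAR = 0x7FFFFFFF  # keeps everything except the sign bit
EXP_MASK = 0x7F800000    # binary32 exponent field
MANT_MASK = 0x007FFFFF   # binary32 mantissa field
QUIET_BIT = 0x00400000   # top mantissa bit: set for quiet NaNs (assumed convention)


def is_nan(bits):
    return (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0


def is_signaling(bits):
    return is_nan(bits) and (bits & QUIET_BIT) == 0


def clear_sign_integer(bits):
    # Pure bit manipulation: the NaN payload is untouched.
    return bits & SIGN_CLEAR


def abs_quieting_model(bits):
    # Hypothetical model of a float abs that quiets signaling NaNs.
    bits &= SIGN_CLEAR
    return bits | QUIET_BIT if is_nan(bits) else bits


snan = 0xFFA12345  # negative signaling NaN carrying a payload

masked = clear_sign_integer(snan)
folded = abs_quieting_model(snan)

print(hex(masked), is_signaling(masked))
print(hex(folded), is_signaling(folded))
```

Under this model the two results differ in exactly one bit, which is enough to make a bitwise comparison fail even though both values are NaNs.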
4 participants