Skip to content

Draft: Pre-allocator generic affine fusion#7045

Open
bgrady-tt wants to merge 58 commits intomainfrom
bgrady/preallocate-affine-fusion
Open

Draft: Pre-allocator generic affine fusion#7045
bgrady-tt wants to merge 58 commits intomainfrom
bgrady/preallocate-affine-fusion

Conversation

@bgrady-tt
Copy link
Contributor

@bgrady-tt bgrady-tt commented Feb 13, 2026

Note: This PR does not enable affine fusion or scalrep passes by default. Currently only a few lit tests exercise these passes. This is due to lack of support in downstream passes to handle intermediate scratchpad allocations properly.

Big New Passes:

  • GenericAffineLoopFusion : Iteratively fuses producer/consumer generic op pairs if correct until no more successful fusions are possible. This pass doesn't eliminate intermediate CBs/operands; it just focuses on correct loop fusion.
  • GenericAffineScalarReplacement : Replaces intermediate operands and CBs (only used to pass intermediate results in a single generic op) and eliminates (forwards) matching store->load pairs using affine scalrep utilities. Uses d2m.scratch_allocate for allocating all intermediates (cannot be lowered as-is).

This PRs impact to default pipeline is:

  • Moves outer loop generation even earlier in the pipeline (immediately post-bufferization)
  • Introduces a new form of the d2m.generic that has affine blocking loops with symbolic bounds (get_block_factor()).
  • A new pass LowerToExplicitForm converts this affine form with unresolved blocking factors to the fully explicit generic form that has hardened block factors and scf.for loops. This conversion happens immediately post-allocator

Reorder the apply interchange pass to run after bufferization but before
allocation, enabling interchange decisions to be visible to the allocator.
Replace hardcoded arith.constant loop bounds and block factor values
with d2m.get_block_factor ops, making the pass reference the parent
generic op's block factors symbolically rather than materializing them
as integer constants.
Replace scf.for loop generation with affine.for loops using
symbol-based upper bounds from GetBlockFactorOp. This enables
downstream affine analysis and transformation passes to operate
on the generated loop nest.
…Load/Store ops

RemoteLoadOp now implements AffineReadOpInterface and AffineMapAccessInterface,
and RemoteStoreOp implements AffineWriteOpInterface and AffineMapAccessInterface.
This enables affine analysis passes (e.g. affine scalar replacement) to reason
about these ops' memory access patterns. getAffineMap() looks up the indexing
map from the parent GenericOp for the associated operand.
…ants

Replace affine.for loops marked with d2m.outer_loop (generated by
D2MGenerateOuterLoops) with scf.for loops that use arith.constant
bounds derived from the GenericOp's block_factors attribute. Also
replaces all d2m.get_block_factor ops with their corresponding
constant values.
Rename the "d2m.outer_loop" attribute to "d2m.blocking_loop" across all
D2M passes and tests. Additionally, change the attribute from a unit
attribute to an integer attribute that carries the associated block
factor index (0 = outermost loop, incrementing inward).
Replace OpRewritePattern/applyPatternsGreedily with a simple module pass
using while(changed) iteration to avoid greedy rewriter erasure assertions
in GenericOp's SSACFG regions. Replace canFuseLoops/fuseLoops with manual
per-level body cloning since RemoteLoad/Store's AffineReadOp/WriteOp
interfaces crash affine dependence analysis. Keep shared intermediate as
fused input to satisfy GenericOp's single-output verifier constraint.
…stRegisterAccess

Use the d2m.blocking_loop attribute on scf.for loops to directly obtain
induction variables instead of creating IterIndexOp. This is more
correct since IterIndexOp::fold was incorrectly folding iter_index(dim)
to the constant dim value. When a blocking loop is absent (unit loop
optimized away), constant 0 is returned.
… ops

Skip GenericAffineLoopFusion for generics containing block_mask,
packer_mask_reset, tile_tilize_block, tile_untilize_block,
write_row_mask_tile, and write_col_mask_tile ops.
Ensure affine loop fusion only considers GenericOps in unified form
(single region with ThreadType::Unified). This prevents the pass from
attempting to fuse ops that have already been split or are in an
unexpected form.
…bset

When producer and consumer have equal loop depth, the producer becomes
the subset and its ops were incorrectly placed after the consumer's ops
in the fused loop body. Fix by inserting subset ops at the start of
each loop level when the subset is the producer, preserving data-flow
order.
bgrady-tt and others added 24 commits February 13, 2026 21:01
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Drop generic affine compatibility-form conversion from scalar replacement, delete dead GenericAffineUtils code and build wiring, and keep the scalar-replacement test expectations aligned with the new flow.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep d2m.block_offset explicit by removing constant-like folding assumptions, and add a canonicalize regression test to prevent future erasure. Include related D2M pipeline/test expectation updates needed for current affine fusion configuration behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use deterministic prime placeholders for temporary block_offset
rewriting in affine utilities, restore d2m.block_offset after
transforms, re-enable fusion lit coverage, and add scalar-replacement
roundtrip checks.
@bgrady-tt bgrady-tt force-pushed the bgrady/preallocate-affine-fusion branch from 6886071 to c65faf6 Compare February 13, 2026 21:07
@codecov-commenter
Copy link

codecov-commenter commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 91.25000% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.40%. Comparing base (7ae39e3) to head (1891e18).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
.../D2M/Transforms/GenericAffineScalarReplacement.cpp 88.78% 24 Missing ⚠️
...Dialect/D2M/Transforms/GenericAffineLoopFusion.cpp 91.13% 21 Missing ⚠️
lib/Dialect/D2M/IR/D2MGenericRegionOps.cpp 57.14% 9 Missing ⚠️
lib/Dialect/D2M/IR/D2MOps.cpp 85.18% 4 Missing ⚠️
lib/Dialect/D2M/Transforms/GenericAffineUtils.cpp 94.59% 2 Missing ⚠️
...Dialect/D2M/Transforms/InsertDstRegisterAccess.cpp 88.88% 2 Missing ⚠️
lib/Dialect/D2M/Transforms/LowerToExplicitForm.cpp 98.83% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7045      +/-   ##
==========================================
+ Coverage   69.27%   69.40%   +0.13%     
==========================================
  Files         384      388       +4     
  Lines       67207    67852     +645     
==========================================
+ Hits        46555    47092     +537     
- Misses      20652    20760     +108     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// only works on top-level modules (doesn't run module has a parent op).
ttmetal::TTIRToTTMetalPipelineOptions ttmetalOptions;
ttmetalOptions.ttnnMode = true;
ttmetalOptions.enableAffineLoopFusionAndScalarReplacement = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force affine fusion disabled until feature is more mature.

// Get the CB operand and find which block argument it corresponds to.
Value cb = remoteLoad.getCb();
if (auto blockArg = mlir::dyn_cast<BlockArgument>(cb)) {
if (auto blockArg = mlir::dyn_cast_or_null<BlockArgument>(cb)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor cleanup to unsafe dyn_cast here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants