Draft: Pre-allocator generic affine fusion by bgrady-tt · Pull Request #7045 · tenstorrent/tt-mlir

bgrady-tt · 2026-02-13T20:35:57Z

Note: This PR does not enable the affine fusion or scalar replacement (scalrep) passes by default. Currently only a few lit tests exercise these passes, because downstream passes do not yet handle intermediate scratchpad allocations properly.

Big New Passes:

GenericAffineLoopFusion: Iteratively fuses producer/consumer generic op pairs, whenever the fusion is correct, until no further fusion succeeds. This pass doesn't eliminate intermediate CBs/operands; it focuses only on correct loop fusion.
GenericAffineScalarReplacement: Replaces intermediate operands and CBs (those used only to pass intermediate results within a single generic op) and eliminates (forwards) matching store->load pairs using the affine scalrep utilities. Uses d2m.scratch_allocate to allocate all intermediates (cannot be lowered as-is).
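To make the store->load forwarding idea concrete, here is a minimal standalone sketch. It does not use MLIR or the PR's code: `Op`, `Kind`, and `forwardStoresToLoads` are hypothetical names, and a plain integer index stands in for an affine access map.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Toy IR: an op either stores a value id into (buffer, index) or loads
// from (buffer, index) and defines a new value id. All names are
// hypothetical; this is not the D2M dialect.
enum class Kind { Store, Load };

struct Op {
  Kind kind;
  int buffer; // id of the intermediate buffer being accessed
  int index;  // access index, standing in for an affine access map
  int value;  // stored value id (Store) or defined value id (Load)
};

// Walks one straight-line block and returns, for every load that reads a
// location written earlier in the block, the value id it can be replaced
// with. Once all loads of a buffer are replaced, the buffer is dead.
std::map<int, int> forwardStoresToLoads(const std::vector<Op> &ops) {
  std::map<std::pair<int, int>, int> lastStore; // (buffer, index) -> value
  std::map<int, int> replacement;               // load value -> stored value
  for (const Op &op : ops) {
    std::pair<int, int> key = std::make_pair(op.buffer, op.index);
    if (op.kind == Kind::Store) {
      lastStore[key] = op.value; // a later store overrides an earlier one
    } else {
      std::map<std::pair<int, int>, int>::const_iterator it =
          lastStore.find(key);
      if (it != lastStore.end())
        replacement[op.value] = it->second;
    }
  }
  return replacement;
}
```

For a block that stores value 10 into buffer 0 and then loads it back as value 11, the function maps 11 to 10; a load from a buffer that was never written in the block is left alone.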

This PR's impact on the default pipeline:

Moves outer loop generation even earlier in the pipeline (immediately post-bufferization).
Introduces a new form of d2m.generic that has affine blocking loops with symbolic bounds (get_block_factor()).
Adds a new pass, LowerToExplicitForm, that converts this affine form with unresolved blocking factors to the fully explicit generic form that has hardened block factors and scf.for loops. This conversion happens immediately post-allocator.

Reorder the apply interchange pass to run after bufferization but before allocation, enabling interchange decisions to be visible to the allocator.

Replace hardcoded arith.constant loop bounds and block factor values with d2m.get_block_factor ops, making the pass reference the parent generic op's block factors symbolically rather than materializing them as integer constants.

Replace scf.for loop generation with affine.for loops using symbol-based upper bounds from GetBlockFactorOp. This enables downstream affine analysis and transformation passes to operate on the generated loop nest.

…Load/Store ops

RemoteLoadOp now implements AffineReadOpInterface and AffineMapAccessInterface, and RemoteStoreOp implements AffineWriteOpInterface and AffineMapAccessInterface. This enables affine analysis passes (e.g. affine scalar replacement) to reason about these ops' memory access patterns. getAffineMap() looks up the indexing map from the parent GenericOp for the associated operand.

…ants

Replace affine.for loops marked with d2m.outer_loop (generated by D2MGenerateOuterLoops) with scf.for loops that use arith.constant bounds derived from the GenericOp's block_factors attribute. Also replaces all d2m.get_block_factor ops with their corresponding constant values.

Rename the "d2m.outer_loop" attribute to "d2m.blocking_loop" across all D2M passes and tests. Additionally, change the attribute from a unit attribute to an integer attribute that carries the associated block factor index (0 = outermost loop, incrementing inward).
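As a rough illustration of how an integer-valued loop attribute can select the block factor that becomes a constant loop bound, consider the standalone sketch below. It is not the pass's code: `explicitBounds` and its parameters are hypothetical, and plain vectors stand in for the loop nest and the block_factors attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// blockingLoopIndices[i] is the block factor index carried by the i-th
// blocking loop (0 = outermost). blockFactors holds the resolved factors
// of the parent op. Both are hypothetical stand-ins for IR attributes.
std::vector<long> explicitBounds(const std::vector<int> &blockingLoopIndices,
                                 const std::vector<long> &blockFactors) {
  std::vector<long> bounds;
  bounds.reserve(blockingLoopIndices.size());
  for (std::size_t i = 0; i < blockingLoopIndices.size(); ++i) {
    int idx = blockingLoopIndices[i];
    // The index must name an existing block factor.
    assert(idx >= 0 &&
           static_cast<std::size_t>(idx) < blockFactors.size());
    bounds.push_back(blockFactors[static_cast<std::size_t>(idx)]);
  }
  return bounds;
}
```

Because each loop carries its own index, the lookup still works when a loop in the middle of the nest has been removed: loops tagged 0 and 2 over block factors {4, 1, 8} get bounds 4 and 8.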

Replace OpRewritePattern/applyPatternsGreedily with a simple module pass using while(changed) iteration to avoid greedy rewriter erasure assertions in GenericOp's SSACFG regions. Replace canFuseLoops/fuseLoops with manual per-level body cloning since RemoteLoad/Store's AffineReadOp/WriteOp interfaces crash affine dependence analysis. Keep shared intermediate as fused input to satisfy GenericOp's single-output verifier constraint.
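The while(changed) structure can be sketched without MLIR as follows. This is only an assumed shape of such a driver, not the pass itself: `ToyGeneric`, `fuseOnePair`, and `fuseToFixedPoint` are hypothetical, buffers are plain integer ids, and the legality check is reduced to "the producer's output has exactly one reader".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a generic op: it writes one buffer and reads several.
struct ToyGeneric {
  std::string name;
  int output;
  std::vector<int> inputs;
};

static bool readsBuffer(const ToyGeneric &op, int buffer) {
  for (std::size_t i = 0; i < op.inputs.size(); ++i)
    if (op.inputs[i] == buffer)
      return true;
  return false;
}

static int countReaders(const std::vector<ToyGeneric> &ops, int buffer) {
  int n = 0;
  for (std::size_t i = 0; i < ops.size(); ++i)
    if (readsBuffer(ops[i], buffer))
      ++n;
  return n;
}

// Attempts a single producer/consumer fusion; returns true on success.
bool fuseOnePair(std::vector<ToyGeneric> &ops) {
  for (std::size_t p = 0; p < ops.size(); ++p) {
    // Simplified legality: only fuse when there is exactly one reader.
    if (countReaders(ops, ops[p].output) != 1)
      continue;
    for (std::size_t c = p + 1; c < ops.size(); ++c) {
      if (!readsBuffer(ops[c], ops[p].output))
        continue;
      ToyGeneric fused;
      fused.name = ops[p].name + "+" + ops[c].name;
      fused.output = ops[c].output; // the fused op keeps a single output
      fused.inputs = ops[p].inputs;
      // The shared intermediate stays in the fused input list.
      fused.inputs.insert(fused.inputs.end(), ops[c].inputs.begin(),
                          ops[c].inputs.end());
      ops[c] = fused;
      ops.erase(ops.begin() + static_cast<std::ptrdiff_t>(p));
      return true;
    }
  }
  return false;
}

// Repeats single fusions until none applies; returns the fusion count.
int fuseToFixedPoint(std::vector<ToyGeneric> &ops) {
  int count = 0;
  bool changed = true;
  while (changed) {
    changed = fuseOnePair(ops);
    if (changed)
      ++count;
  }
  return count;
}
```

Each successful fusion shrinks the op list by one, so the loop terminates. On a chain a -> b -> c, the driver performs two fusions and leaves a single op named "a+b+c".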

…stRegisterAccess

Use the d2m.blocking_loop attribute on scf.for loops to directly obtain induction variables instead of creating IterIndexOp. This is more correct since IterIndexOp::fold was incorrectly folding iter_index(dim) to the constant dim value. When a blocking loop is absent (unit loop optimized away), constant 0 is returned.

… ops

Skip GenericAffineLoopFusion for generics containing block_mask, packer_mask_reset, tile_tilize_block, tile_untilize_block, write_row_mask_tile, and write_col_mask_tile ops.

Ensure affine loop fusion only considers GenericOps in unified form (single region with ThreadType::Unified). This prevents the pass from attempting to fuse ops that have already been split or are in an unexpected form.

…bset

When producer and consumer have equal loop depth, the producer becomes the subset, and its ops were incorrectly placed after the consumer's ops in the fused loop body. Fix by inserting subset ops at the start of each loop level when the subset is the producer, preserving data-flow order.
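The ordering rule in this fix can be shown with a small standalone sketch. It models each loop level's body as a list of op names; `Body` and `mergeLevels` are hypothetical and only illustrate the front-versus-back insertion choice, not the actual cloning code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Ops that live directly in one loop level's body, in execution order.
typedef std::vector<std::string> Body;

// Merges the subset nest's per-level bodies into the superset nest.
// When the subset is the producer, its ops go first at every level so
// that values are stored before the consumer loads them.
std::vector<Body> mergeLevels(std::vector<Body> superset,
                              const std::vector<Body> &subset,
                              bool subsetIsProducer) {
  assert(subset.size() <= superset.size());
  for (std::size_t level = 0; level < subset.size(); ++level) {
    Body &dst = superset[level];
    const Body &src = subset[level];
    if (subsetIsProducer)
      dst.insert(dst.begin(), src.begin(), src.end());
    else
      dst.insert(dst.end(), src.begin(), src.end());
  }
  return superset;
}
```

With a producer body of {load_in, exp, store_tmp} and a consumer body of {load_tmp, add, store_out} at the same level, passing `subsetIsProducer = true` yields store_tmp before load_tmp; appending instead would reproduce the bug described above.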

Drop generic affine compatibility-form conversion from scalar replacement, delete dead GenericAffineUtils code and build wiring, and keep the scalar-replacement test expectations aligned with the new flow. Co-authored-by: Cursor <cursoragent@cursor.com>

Keep d2m.block_offset explicit by removing constant-like folding assumptions, and add a canonicalize regression test to prevent future erasure. Include related D2M pipeline/test expectation updates needed for current affine fusion configuration behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

Use deterministic prime placeholders for temporary block_offset rewriting in affine utilities, restore d2m.block_offset after transforms, re-enable fusion lit coverage, and add scalar-replacement roundtrip checks.

codecov-commenter · 2026-02-13T21:59:00Z

Codecov Report

❌ Patch coverage is 91.25000% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.40%. Comparing base (7ae39e3) to head (1891e18).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
.../D2M/Transforms/GenericAffineScalarReplacement.cpp	88.78%	24 Missing ⚠️
...Dialect/D2M/Transforms/GenericAffineLoopFusion.cpp	91.13%	21 Missing ⚠️
lib/Dialect/D2M/IR/D2MGenericRegionOps.cpp	57.14%	9 Missing ⚠️
lib/Dialect/D2M/IR/D2MOps.cpp	85.18%	4 Missing ⚠️
lib/Dialect/D2M/Transforms/GenericAffineUtils.cpp	94.59%	2 Missing ⚠️
...Dialect/D2M/Transforms/InsertDstRegisterAccess.cpp	88.88%	2 Missing ⚠️
lib/Dialect/D2M/Transforms/LowerToExplicitForm.cpp	98.83%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7045      +/-   ##
==========================================
+ Coverage   69.27%   69.40%   +0.13%     
==========================================
  Files         384      388       +4     
  Lines       67207    67852     +645     
==========================================
+ Hits        46555    47092     +537     
- Misses      20652    20760     +108

bgrady-tt · 2026-02-13T22:26:33Z

lib/Dialect/TTNN/Pipelines/TTNNPipelines.cpp

  // only works on top-level modules (doesn't run module has a parent op).
  ttmetal::TTIRToTTMetalPipelineOptions ttmetalOptions;
  ttmetalOptions.ttnnMode = true;
+  ttmetalOptions.enableAffineLoopFusionAndScalarReplacement = false;


Force affine fusion off in the TTNN pipeline until the feature is more mature.

bgrady-tt · 2026-02-13T22:27:03Z

lib/Dialect/D2M/Transforms/ScheduleDMA.cpp

      // Get the CB operand and find which block argument it corresponds to.
      Value cb = remoteLoad.getCb();
-      if (auto blockArg = mlir::dyn_cast<BlockArgument>(cb)) {
+      if (auto blockArg = mlir::dyn_cast_or_null<BlockArgument>(cb)) {


Just a minor cleanup of an unsafe dyn_cast here.
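For readers unfamiliar with the distinction: LLVM's dyn_cast asserts that its argument is non-null, while dyn_cast_or_null accepts a null argument and returns null. The standalone analogue below uses dynamic_cast rather than LLVM's casting machinery; the class names and the helpers `castIfKind` and `castIfKindOrNull` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Toy value hierarchy; not the MLIR classes of the same name.
struct Value {
  virtual ~Value() {}
};
struct BlockArgument : Value {
  int argNumber;
  BlockArgument() : argNumber(0) {}
};
struct OpResult : Value {};

// Analogue of dyn_cast: the caller must guarantee a non-null input.
template <typename To> To *castIfKind(Value *v) {
  assert(v && "cast on a null value");
  return dynamic_cast<To *>(v);
}

// Analogue of dyn_cast_or_null: a null input yields a null result.
template <typename To> To *castIfKindOrNull(Value *v) {
  return v ? dynamic_cast<To *>(v) : NULL;
}
```

The null-tolerant form lets a single `if` handle both "no value" and "value of another kind", which is the pattern the changed line relies on.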

bgrady-tt requested review from a team, nsmithtt and sdjordjevicTT as code owners February 13, 2026 20:35

bgrady-tt requested review from azecevicTT, jserbedzijaTT, mtopalovicTT, phizalev-TT, svuckovicTT, vwellsTT, wenbinlyuTT and xanderchin February 13, 2026 20:35

bgrady-tt added 16 commits February 13, 2026 21:00

Move D2MGenericApplyInterchange before D2MAllocate in middleend pipeline

dbc8d99

Move D2MGenerateOuterLoops to run after loop interchange in middleend pipeline

a0d9d55

Add D2M GetBlockFactorOp index operation

208478a

Generate affine.for loops instead of scf.for in GenerateOuterLoops pass

b64e7f5

Add affine scalar replacement pass before D2M allocate in middleend pipeline

f0a06b6

add comment for scalrep pass

f32a6ad

fix implementation of getAffineMap for remote load and store

68c6503

Add SkipOpAffineLoopFusionTrait to prevent fusion of masking/tilizing ops

6ec005d

Add isUnifiedForm() check to GenericAffineLoopFusion

635799b

bgrady-tt and others added 24 commits February 13, 2026 21:01

cleanup GenericAffineScalarReplacement pass

49fe82e

refactor GenericAffineLoopFusion pass

efc504d

add global option for enabling/disabling generic affine loop fusion and scalrep

d8c6528

small refactor

b9e6773

remove redundant check for matching loads and stores

9012727

add BlockOffset op and refactor fusion tests to expect it

3841199

BlockIndex -> BlockOffset lowering

595df39

lower explicit form affine maps

8831596

Co-authored-by: Cursor <cursoragent@cursor.com>

more refactors

29eb331

refactor generic affine fusion helpers

605c4d4

Co-authored-by: Cursor <cursoragent@cursor.com>

fix scratch allocation lowering

817d73a

Co-authored-by: Cursor <cursoragent@cursor.com>

fix lowering of block_offset

a0d64a2

canonicalizer pass post scalrep

e65593a

add null checks

aed4abc

disable affine fusion for TTNN pipeline

279ad51

fix test post block offset fix

dfe6a3b

disable fusion and scalrep for DMA only generics

f8177c6

disable fusion passes by default

786a67a

enable fusion and scalrep passes explicitly for certain tests

c00c12a

fix tests after block_offset refactor

8982cbf

mark failing fusion tests as unsupported for now

98400a8

bridge block_offset for affine fusion and scalrep

c65faf6

bgrady-tt force-pushed the bgrady/preallocate-affine-fusion branch from 6886071 to c65faf6 Compare February 13, 2026 21:07

add patch file to this branch for CI

b95b936

remove temporary hack to allow CBs to be passed as intermediates

1891e18

bgrady-tt wants to merge 58 commits into main from bgrady/preallocate-affine-fusion
