@brnorris03 brnorris03 commented Nov 16, 2025

Ticket

N/A

Summary

New pass d2m-insert-dst-register-gc implements graph coloring register allocation for DST (destination registers). Reduces DST usage by reusing slices for non-interfering values.
Note that while this patch appears large, more than 80% of it is tests.

Motivation

Current InsertDstRegisterAccess uses sequential allocation - each value gets a new DST slice. Graph coloring identifies values with non-overlapping lifetimes and assigns them to the same slice, reducing DST pressure.

Example: Diamond pattern (4 values, 3 interference edges) needs only 2 slices instead of 4 (50% reduction).

     dst0
    /    \
  dst1   dst2
    \    /
     dst3

Interference edges: dst0 - dst1, dst1 - dst2, dst2 - dst3. No edges between dst0 - dst2 or dst1 - dst3 (early releases enable reuse). Graph coloring assigns: dst0 and dst2 → slice 0, dst1 and dst3 → slice 1.
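The slice assignment above can be reproduced with a small first-fit coloring over the three listed edges. This is an illustrative sketch only: `colorFirstFit` and `colorDiamondExample` are hypothetical names, and the edge-list representation is an assumption rather than the data structure used in the patch.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not from the patch): first-fit coloring over an
// undirected interference graph given as an edge list. Returns one slice
// index per value, visiting values in index order.
std::vector<int> colorFirstFit(int numValues,
                               const std::vector<std::pair<int, int>> &edges) {
  std::vector<std::vector<int>> adj(numValues);
  for (const auto &e : edges) {
    assert(e.first >= 0 && e.first < numValues);
    assert(e.second >= 0 && e.second < numValues);
    adj[e.first].push_back(e.second);
    adj[e.second].push_back(e.first);
  }
  std::vector<int> slice(numValues, -1);
  for (int v = 0; v < numValues; ++v) {
    // Mark slices already taken by colored neighbors.
    std::vector<bool> used(numValues, false);
    for (int n : adj[v]) {
      if (slice[n] >= 0) {
        used[slice[n]] = true;
      }
    }
    // Pick the lowest free slice.
    int c = 0;
    while (used[c]) {
      ++c;
    }
    slice[v] = c;
  }
  return slice;
}

// Edges listed above: dst0-dst1, dst1-dst2, dst2-dst3.
std::vector<int> colorDiamondExample() {
  return colorFirstFit(4, {{0, 1}, {1, 2}, {2, 3}});
}
```

For these edges the sketch yields slice 0 for dst0 and dst2 and slice 1 for dst1 and dst3, matching the assignment described above.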

Changes

Overview of design

The design defines simple abstract interfaces, DstAnalysis and ColoringStrategy, which allow multiple implementations independent of any specific pass definition.

                          DstAnalysis
                             ^   ^
                   inherits /     \ inherits
                           /       \
               DstAnalysisBasic   DstAnalysisGraphColoring
                                      | composes
                                      v
                              ColoringStrategy
                                 ^       ^
                     implements /         \ implements
                               /           \
               ChaitinBriggsColoring   GreedyColoring
                              |            |
                              | uses       | uses
                              v            v
                           InterferenceGraph utils
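A rough C++ sketch of the relationships in the diagram follows. The class names come from the diagram, but every signature here is an assumption: the real analysis operates on a `funcOp`, whereas this sketch takes a plain adjacency-list graph so that it compiles without MLIR headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Assumed stand-in for the interference graph: adjacency lists by node index.
using InterferenceGraph = std::vector<std::vector<int>>;

// Strategy interface (signature is hypothetical).
struct ColoringStrategy {
  virtual ~ColoringStrategy() = default;
  virtual std::vector<int> color(const InterferenceGraph &g) const = 0;
};

// Sequential first-available-color assignment.
struct GreedyColoring : ColoringStrategy {
  std::vector<int> color(const InterferenceGraph &g) const override {
    std::vector<int> colors(g.size(), -1);
    for (std::size_t v = 0; v < g.size(); ++v) {
      std::vector<bool> used(g.size(), false);
      for (int n : g[v]) {
        if (colors[n] >= 0) {
          used[colors[n]] = true;
        }
      }
      int c = 0;
      while (used[c]) {
        ++c;
      }
      colors[v] = c;
    }
    return colors;
  }
};

// Analysis interface (signature is hypothetical).
struct DstAnalysis {
  virtual ~DstAnalysis() = default;
  virtual int requiredSlices(const InterferenceGraph &g) const = 0;
};

// Sequential allocation: one slice per value.
struct DstAnalysisBasic : DstAnalysis {
  int requiredSlices(const InterferenceGraph &g) const override {
    return static_cast<int>(g.size());
  }
};

// Composes a ColoringStrategy, as in the diagram.
class DstAnalysisGraphColoring : public DstAnalysis {
public:
  explicit DstAnalysisGraphColoring(std::unique_ptr<ColoringStrategy> s)
      : strategy(std::move(s)) {
    assert(strategy && "a coloring strategy is required");
  }
  int requiredSlices(const InterferenceGraph &g) const override {
    int maxColor = -1;
    for (int c : strategy->color(g)) {
      maxColor = std::max(maxColor, c);
    }
    return maxColor + 1;
  }

private:
  std::unique_ptr<ColoringStrategy> strategy;
};
```

Keeping the strategy behind a pointer lets the same analysis class serve both the greedy and the Chaitin-Briggs variants without the pass knowing which one it holds.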

The analysis and transformation passes use the above strategies:

D2MDstRequirementAnalysisPass
    ├─ reads option 'strategy'
    ├─ instantiates DstAnalysis implementation
    │     ├─ basic  -> DstAnalysisBasic
    │     ├─ greedy -> DstAnalysisGraphColoring + GreedyColoring
    │     └─ graph-coloring -> DstAnalysisGraphColoring + ChaitinBriggsColoring
    └─ invokes analysis->analyze(funcOp) to report required slices

D2MInsertDstRegisterGCPass
    ├─ invokes DstCapacityAnalysis(funcOp) for capacity limit
    ├─ instantiates runtime strategy based on option
    │     ├─ basic  -> DstAnalysisBasic (equivalent to D2MInsertDstRegisterAccess)
    │     ├─ greedy -> DstAnalysisGraphColoring + GreedyColoring
    │     └─ graph-coloring -> DstAnalysisGraphColoring + ChaitinBriggsColoring
    ├─ performs pre-check via selected DstAnalysis->analyze(funcOp)
    └─ on success, uses DstAnalysisGraphColoring + chosen ColoringStrategy
       to allocate slices and rewrite operations
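The option handling and capacity pre-check in the trees above might look roughly like the following. The strategy strings are taken from the trees; the enum names, `parseStrategy`, and `preCheck` are hypothetical and only illustrate the control flow.

```cpp
#include <cassert>
#include <string>

// Hypothetical enums; the option values mirror the strings listed above.
enum class Strategy { Basic, Greedy, GraphColoring, Unknown };
enum class PreCheckResult { Success, UnknownStrategy, ExceedsCapacity };

Strategy parseStrategy(const std::string &name) {
  if (name == "basic") {
    return Strategy::Basic;
  }
  if (name == "greedy") {
    return Strategy::Greedy;
  }
  if (name == "graph-coloring") {
    return Strategy::GraphColoring;
  }
  return Strategy::Unknown;
}

// Assumed shape of the pre-check: the analysis reports the slices it needs,
// and the pass proceeds only if that fits in the DST capacity.
PreCheckResult preCheck(const std::string &strategyName, int requiredSlices,
                        int capacity) {
  assert(requiredSlices >= 0 && capacity >= 0);
  if (parseStrategy(strategyName) == Strategy::Unknown) {
    return PreCheckResult::UnknownStrategy;
  }
  return requiredSlices <= capacity ? PreCheckResult::Success
                                    : PreCheckResult::ExceedsCapacity;
}
```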

New Operations

d2m.release_dst - Marks end of DST value lifetime:

%dst = d2m.acquire_dst() : memref<1x!ttcore.tile<32x32, f32>, #dst>
// ... use dst ...
d2m.release_dst %dst : memref<1x!ttcore.tile<32x32, f32>, #dst>

Enables precise liveness analysis and early reuse. Verifiers ensure pairing with acquire_dst and prevent use-after-release.
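To make the pairing rules concrete, here is a toy checker over a linear sequence of events. It is not the MLIR verifier from the patch: the `Event` model, the function name, and the exact diagnostics are assumptions, and treating an unreleased acquire as an error is likewise an assumption about what "pairing" means.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical event model: each DST value is identified by an integer.
enum class Kind { Acquire, Use, Release };
struct Event {
  Kind kind;
  int value;
};

// Returns an empty string when the sequence is well-formed, otherwise a
// short diagnostic describing the first violation found.
std::string verifyDstLifetimes(const std::vector<Event> &events) {
  std::set<int> live;
  std::set<int> released;
  for (const Event &e : events) {
    switch (e.kind) {
    case Kind::Acquire:
      if (live.count(e.value) || released.count(e.value)) {
        return "value acquired twice";
      }
      live.insert(e.value);
      break;
    case Kind::Use:
      if (released.count(e.value)) {
        return "use after release";
      }
      if (!live.count(e.value)) {
        return "use before acquire";
      }
      break;
    case Kind::Release:
      if (released.count(e.value)) {
        return "double release";
      }
      if (!live.count(e.value)) {
        return "release without acquire";
      }
      live.erase(e.value);
      released.insert(e.value);
      break;
    }
  }
  return live.empty() ? "" : "acquire without release";
}
```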

Pass Implementation

D2MInsertDstRegisterGCPass (lib/Dialect/D2M/Transforms/InsertDstRegisterGC.cpp):

  • Collects affine.load operations for DST-consuming ops (via OperandLoadStoreRegisterOpInterface)
  • Builds interference graph using SSA liveness analysis
  • Applies graph coloring to assign DST slices
  • Generates L1↔DST data copy loops
  • Inserts acquire_dst/release_dst pairs
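The interference-graph step can be illustrated with live ranges expressed as half-open intervals over program points. This is a simplification of SSA liveness analysis, and `LiveRange` and `buildInterferenceEdges` are hypothetical names chosen for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed model: a value is live over the half-open interval [start, end),
// where end is the program point of its release.
struct LiveRange {
  int start;
  int end;
};

// Two values interfere when their live ranges overlap.
std::vector<std::pair<int, int>>
buildInterferenceEdges(const std::vector<LiveRange> &ranges) {
  std::vector<std::pair<int, int>> edges;
  for (std::size_t a = 0; a < ranges.size(); ++a) {
    assert(ranges[a].start <= ranges[a].end);
    for (std::size_t b = a + 1; b < ranges.size(); ++b) {
      if (ranges[a].start < ranges[b].end && ranges[b].start < ranges[a].end) {
        edges.push_back({static_cast<int>(a), static_cast<int>(b)});
      }
    }
  }
  return edges;
}
```

With ranges such as [0,2), [1,4), [3,6), [5,7), each value overlaps only its successor, which gives the same three edges as the diamond example in the Motivation section.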

Coloring Strategies

Two algorithms implemented in GraphColoringStrategy.{h,cpp}:

  1. Chaitin-Briggs (default): Simplification-based coloring
  2. Greedy: Sequential first-available-color assignment
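For readers unfamiliar with simplification-based coloring, the following is a compact textbook-style sketch of the simplify and select phases with Briggs-style optimistic pushing. It is not the implementation in GraphColoringStrategy.cpp; the function name and the adjacency-list input (assumed symmetric, with no self-loops or duplicate edges) are assumptions.

```cpp
#include <cassert>
#include <vector>

// Returns one color in [0, k) per node, or -1 for nodes that could not be
// colored with k colors (spill candidates).
std::vector<int> chaitinBriggsColor(const std::vector<std::vector<int>> &adj,
                                    int k) {
  assert(k > 0);
  const int n = static_cast<int>(adj.size());
  std::vector<int> degree(n);
  std::vector<bool> removed(n, false);
  for (int v = 0; v < n; ++v) {
    degree[v] = static_cast<int>(adj[v].size());
  }

  // Simplify: repeatedly remove a node and push it on the stack.
  std::vector<int> stack;
  for (int step = 0; step < n; ++step) {
    int pick = -1;
    // Prefer a node with fewer than k remaining neighbors.
    for (int v = 0; v < n && pick < 0; ++v) {
      if (!removed[v] && degree[v] < k) {
        pick = v;
      }
    }
    // Otherwise push the highest-degree node optimistically.
    if (pick < 0) {
      for (int v = 0; v < n; ++v) {
        if (!removed[v] && (pick < 0 || degree[v] > degree[pick])) {
          pick = v;
        }
      }
    }
    removed[pick] = true;
    stack.push_back(pick);
    for (int nb : adj[pick]) {
      if (!removed[nb]) {
        --degree[nb];
      }
    }
  }

  // Select: pop nodes and give each the lowest color unused by neighbors.
  std::vector<int> color(n, -1);
  while (!stack.empty()) {
    const int v = stack.back();
    stack.pop_back();
    std::vector<bool> used(k, false);
    for (int nb : adj[v]) {
      if (color[nb] >= 0) {
        used[color[nb]] = true;
      }
    }
    for (int c = 0; c < k; ++c) {
      if (!used[c]) {
        color[v] = c;
        break;
      }
    }
  }
  return color;
}
```

On a 4-cycle with k = 2 no node has degree below k, so the optimistic push is what allows a valid 2-coloring to be found; on a triangle with k = 2 one node is left at -1.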

Strategy selection via pass option:

ttmlir-opt --d2m-insert-dst-register-gc="coloring-strategy=greedy" input.mlir

Pipeline Integration

Not yet integrated into main pipeline. Current d2m-insert-dst-register-access performs multiple unrelated transformations (most notably linalg to affine conversion). These need refactoring into separate passes that would be applied before this pass.

Future Work

  1. Keep accumulators in DST during reduction loops: For operations like matmul that accumulate results (e.g., C += A * B in a loop), the current pass loads the accumulator from L1 and writes it back on every iteration, which is inefficient. The existing D2MInsertDstRegisterAccess pass instead keeps the accumulator in DST for the entire reduction loop and writes back to L1 only once at the end. This requires splitting the loop into three phases: (1) load initial values into DST, (2) compute with the accumulator staying in DST, (3) write final results back to L1. This optimization could be implemented as a separate pass that runs before register allocation. (high priority)
  2. Precise loop dependence analysis using MLIR affine utilities (medium priority)
  3. Spill code generation for insufficient DST capacity (high priority)
  4. PBQP-based allocation for cost modeling (may be overkill)
  5. Move coalescing to eliminate redundant copies (medium priority)
  6. Performance benchmarking on real workloads (medium priority)
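The accumulator item (1) can be illustrated with a scalar model that counts L1 traffic. `L1Tile` and both functions are hypothetical stand-ins; a local `float` plays the role of a DST register, and the real transformation works on tiles and affine loops rather than scalars.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an accumulator tile in L1 that counts accesses.
struct L1Tile {
  float value = 0.0f;
  int loads = 0;
  int stores = 0;
  float load() {
    ++loads;
    return value;
  }
  void store(float v) {
    ++stores;
    value = v;
  }
};

// Behavior described for the current pass: the accumulator makes a round
// trip through L1 on every iteration of the reduction loop.
void accumulatePerIteration(L1Tile &c, const std::vector<float> &a,
                            const std::vector<float> &b) {
  assert(a.size() == b.size());
  for (std::size_t k = 0; k < a.size(); ++k) {
    float acc = c.load();
    acc += a[k] * b[k];
    c.store(acc);
  }
}

// Three-phase form: the accumulator stays in a register for the whole loop.
void accumulateInDst(L1Tile &c, const std::vector<float> &a,
                     const std::vector<float> &b) {
  assert(a.size() == b.size());
  float acc = c.load(); // (1) load the initial value once
  for (std::size_t k = 0; k < a.size(); ++k) {
    acc += a[k] * b[k]; // (2) compute without touching L1
  }
  c.store(acc); // (3) write the final result back once
}
```

Both forms compute the same result; the second performs one load and one store regardless of the trip count.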

@vmilosevic vmilosevic left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

codecov-commenter commented Nov 16, 2025

Codecov Report

❌ Patch coverage is 80.78406% with 299 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.77%. Comparing base (becbe0a) to head (8a9827c).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/Dialect/D2M/Utils/TileMatmulUtils.cpp | 18.77% | 173 Missing ⚠️ |
| lib/Dialect/D2M/Transforms/InsertDstRegisterGC.cpp | 79.57% | 106 Missing ⚠️ |
| lib/Dialect/D2M/Analysis/DstAnalysisPass.cpp | 78.12% | 7 Missing ⚠️ |
| lib/Dialect/D2M/IR/D2MOps.cpp | 87.87% | 4 Missing ⚠️ |
| ...b/Dialect/D2M/Transforms/GraphColoringStrategy.cpp | 98.17% | 3 Missing ⚠️ |
| lib/Conversion/D2MToTTKernel/D2MToTTKernel.cpp | 94.59% | 2 Missing ⚠️ |
| lib/Dialect/D2M/Analysis/DstAnalysisBasic.cpp | 94.59% | 2 Missing ⚠️ |
| lib/Dialect/D2M/Analysis/DstCapacityAnalysis.cpp | 95.65% | 1 Missing ⚠️ |
| ...Dialect/D2M/Transforms/InsertDstRegisterAccess.cpp | 97.36% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5879      +/-   ##
==========================================
+ Coverage   69.34%   69.77%   +0.42%     
==========================================
  Files         334      347      +13     
  Lines       50999    52483    +1484     
==========================================
+ Hits        35367    36621    +1254     
- Misses      15632    15862     +230     


@brnorris03 brnorris03 force-pushed the bnorris/dst-gc-allocator branch 7 times, most recently from 47ae5d9 to 4fd7dc6 Compare November 16, 2025 07:32

…erAccess pass so that it can be used prior to other DST allocation passes


@brnorris03 brnorris03 force-pushed the bnorris/dst-gc-allocator branch from 9f22936 to 62fbc68 Compare November 18, 2025 06:11

@brnorris03 brnorris03 force-pushed the bnorris/dst-gc-allocator branch from 32e8d2b to 16d3134 Compare November 18, 2025 07:05
…oring.

- InsertDstRegisterAccess.cpp: Emit diagnostic when encountering unconverted
linalg.generic operations, per LLVM guidelines on error reporting.
- LinalgToAffine.cpp: Make d2m.linalg_root attribute conditional to avoid
leaking internal pass state into IR.
- Passes.td: Add mark-root-loops option to control attribute emission.
- TTMetalPipelines.cpp: Configure pipeline to emit markers for pass
coordination.
- Tests: Add diagnostic verification test and comprehensive option
combination coverage.

Issues Addressed
- Silent pass failures violate error reporting guidelines in LLVM Programmer's
Manual § "Writing an LLVM Pass".
- Internal marker attributes in IR violate canonical form requirements in
LLVM Programmer's Manual § "The PassManager".

Root Cause: The D2MAllocate pass performed liveness analysis BEFORE
inserting stream operations, so the custom liveness extension couldn't
account for newly inserted streams that reference existing buffers.

Solution Implemented: Split D2MAllocate into two phases:
Phase 1: Allocation and stream insertion (no deallocs)
Phase 2: Re-run liveness analysis on complete IR, then insert deallocs

brnorris03 and others added 16 commits November 23, 2025 16:27
### Ticket
#5057
#5931

### Problem description
As described in the issue.

### What's changed
- use `get_current_system_desc`, passing the input tensor's device if
available
- on invocation, `JitFunction` checks the `SYSTEM_DESC_PATH` env var; if
it is not set, it runs `_query_and_save_system_desc` to obtain its own
system desc instead of erroring out
- add helpers `_get_dispatch_core_type` and `_get_cluster_type` to derive
the `DispatchCoreType` from the cluster type
- in `test/ttnn-jit/conftest.py`, set `DispatchCoreType` to `WORKER` for
p150 and to `ETH` otherwise for device init

### Checklist
- [ ] New/Existing tests provide coverage for changes

#3915 and #5899 introduce specific
`permute . reshape . permute -> reshape` patterns with hard-coded
permutation values.

This PR generalizes the approach such that:
```
This pattern fuses the sequence: PermuteOp -> ReshapeOp -> PermuteOp into a
single ReshapeOp when the following conditions are met:
Original shape: [A_1, A_2,.., A_k]
permute(p_1, p_2,..., p_k)    -> [A_1', A_2',.., A_k']
reshape([A_1', A_2',.., A_k']) -> [B_1, B_2,.., B_k]
permute(p_1', p_2',..., p_k') -> [B_1', B_2',.., B_k']

where:
- k is the rank of the input tensor;
- (p_1, p_2,..., p_k) and (p_1', p_2',..., p_k') are
  permutations of {0, 1, ..., k-1};
- B_i = (A_r', A_r+1',..., A_r+l') where 1 <= r <= k, l >= 0 and r + l <= k,
  for each 1 <= i <= k;
- flatten([B_1', B_2',.., B_k']) = [A_1, A_2,.., A_k].

The result of this sequence is identical to the following reshape:
reshape([A_1, A_2,.., A_k]) -> [B_1', B_2',.., B_k']
```
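One concrete instance of the equivalence can be checked numerically. The sketch below is not the rewrite pattern itself: `rowMajorStrides`, `permute`, and `fusedMatchesUnfusedExample` are hypothetical helpers, the shapes and permutations are chosen for illustration, and row-major layout is assumed (so a reshape leaves the flat data unchanged).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

std::vector<int64_t> rowMajorStrides(const Shape &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * shape[i + 1];
  }
  return strides;
}

// Permute row-major data: output dimension i takes input dimension perm[i].
std::vector<int> permute(const std::vector<int> &data, const Shape &shape,
                         const std::vector<int> &perm, Shape &outShape) {
  assert(perm.size() == shape.size());
  const std::vector<int64_t> inStrides = rowMajorStrides(shape);
  outShape.resize(shape.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    outShape[i] = shape[perm[i]];
  }
  const std::vector<int64_t> outStrides = rowMajorStrides(outShape);
  std::vector<int> out(data.size());
  for (std::size_t linear = 0; linear < out.size(); ++linear) {
    int64_t rest = static_cast<int64_t>(linear);
    int64_t src = 0;
    for (std::size_t i = 0; i < perm.size(); ++i) {
      const int64_t coord = rest / outStrides[i];
      rest %= outStrides[i];
      src += coord * inStrides[perm[i]];
    }
    out[linear] = data[src];
  }
  return out;
}

// [1,2,3,4] -permute(0,3,1,2)-> [1,4,2,3] -reshape-> [1,4,6,1]
//           -permute(0,2,3,1)-> [1,6,1,4]
bool fusedMatchesUnfusedExample() {
  const Shape shape = {1, 2, 3, 4};
  std::vector<int> data(24);
  for (int i = 0; i < 24; ++i) {
    data[i] = i;
  }
  Shape s1;
  Shape s2;
  const std::vector<int> t = permute(data, shape, {0, 3, 1, 2}, s1);
  const Shape reshaped = {1, 4, 6, 1}; // reshape keeps the flat data as-is
  const std::vector<int> unfused = permute(t, reshaped, {0, 2, 3, 1}, s2);
  // A direct reshape [1,2,3,4] -> [1,6,1,4] also keeps the flat data as-is,
  // so the fused result is simply the original data.
  return s1 == Shape{1, 4, 2, 3} && s2 == Shape{1, 6, 1, 4} && unfused == data;
}
```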

Special credits to @mvasiljevicTT for scrutinizing the algorithm during
initial development, which revealed some edge cases that weren't
previously covered.

@brnorris03 brnorris03 force-pushed the bnorris/dst-gc-allocator branch from 5d0f7fa to 571c936 Compare November 25, 2025 22:18
@vmilosevic vmilosevic left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

llvm::SmallVector<int64_t, 4> tripCounts; // Trip count for each loop.
int64_t totalIterations; // Product of all trip counts.

LoopContext() : totalIterations(1) {}

⚠️ cppcoreguidelines-use-default-member-init ⚠️
use default member initializer for totalIterations

Suggested change
LoopContext() : totalIterations(1) {}
LoopContext() : {}
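
Note that the suggested line as rendered, `LoopContext() : {}`, is not valid C++, because a colon must be followed by at least one member initializer. A sketch of what the check is presumably asking for is shown below; `std::vector` replaces `llvm::SmallVector` so the example compiles without LLVM headers, and `addLoop` is a hypothetical helper added only to exercise the fields.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct LoopContext {
  std::vector<int64_t> tripCounts; // Trip count for each loop.
  int64_t totalIterations = 1;     // Product of all trip counts.

  // With the default member initializer above, the constructor no longer
  // needs a member-initializer list.
  LoopContext() = default;

  // Hypothetical helper, not part of the reviewed code.
  void addLoop(int64_t tripCount) {
    assert(tripCount >= 0);
    tripCounts.push_back(tripCount);
    totalIterations *= tripCount;
  }
};
```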

@brnorris03 brnorris03 force-pushed the bnorris/dst-gc-allocator branch from db490ee to 6ab65ad Compare November 26, 2025 02:26