-
Notifications
You must be signed in to change notification settings - Fork 77
[D2M] Implement graph coloring-based DST register allocation #5879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
vmilosevic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5879 +/- ##
==========================================
+ Coverage 69.34% 69.77% +0.42%
==========================================
Files 334 347 +13
Lines 50999 52483 +1484
==========================================
+ Hits 35367 36621 +1254
- Misses 15632 15862 +230 ☔ View full report in Codecov by Sentry. |
47ae5d9 to
4fd7dc6
Compare
vmilosevic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
…erAccess pass so that it can be used prior to other DST allocation passes
vmilosevic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
vmilosevic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
9f22936 to
62fbc68
Compare
vmilosevic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
32e8d2b to
16d3134
Compare
…g-to-affine-from-dst-pass
…stRegister tests (all pass)
…oring. - InsertDstRegisterAccess.cpp: Emit diagnostic when encountering unconverted linalg.generic operations, per LLVM guidelines on error reporting. - LinalgToAffine.cpp: Make d2m.linalg_root attribute conditional to avoid leaking internal pass state into IR. - Passes.td: Add mark-root-loops option to control attribute emission. - TTMetalPipelines.cpp: Configure pipeline to emit markers for pass coordination. - Tests: Add diagnostic verification test and comprehensive option combination coverage. Issues Addressed - Silent pass failures violate error reporting guidelines in LLVM Programmer's Manual § "Writing an LLVM Pass". - Internal marker attributes in IR violate canonical form requirements in LLVM Programmer's Manual § "The PassManager".
Root Cause: The D2MAllocate pass performed liveness analysis BEFORE inserting stream operations, so the custom liveness extension couldn't account for newly inserted streams that reference existing buffers. Solution Implemented: Split D2MAllocate into two phases: Phase 1: Allocation and stream insertion (no deallocs) Phase 2: Re-run liveness analysis on complete IR, then insert deallocs
…g-to-affine-from-dst-pass
…y loops in a future pr
### Ticket #5057 #5931 ### Problem description as descrbied in issue ### What's changed - use `get_current_system_desc` , passing input tensor device if available - on invocation, the `JitFunction` will check `SYSTEM_DESC_PATH` env var, if not set, it will run `_query_and_save_system_desc` to get it's own system desc instead of just error'ing out. - helper `_get_dispatch_core_type` and `_get_cluster_type` to get `DispatchCoreType` based off what the cluster is - in `test/ttnn-jit/conftest.py` set `DispatchCoreType` to `WORKER` for p150, if not set `ETH` for device init ### Checklist - [ ] New/Existing tests provide coverage for changes
#3915 and #5899 introduce specific `permute . reshape . permute -> reshape` patterns with hard-coded values of permutation. This PR generalizes the approach such that: ``` This pattern fuses the sequence: PermuteOp -> ReshapeOp -> PermuteOp into a single ReshapeOp when the following conditions are met: Original shape: [A_1, A_2,.., A_k] permute(p_1, p_2,..., p_k) -> [A_1', A_2',.., A_k'] reshape([A_1', A_2',.., A_k']) -> [B_1, B_2,.., B_k] permute(p_1', p_2',..., p_k') -> [B_1', B_2',.., B_k'] where: - k is the rank of the input tensor; - (p_1, p_2,..., p_k) and (p_1', p_2',..., p_k') are permutations of {0, 1, ..., k-1}; - B_i = (A_r', A_r+1',..., A_r+l') where 1 <= r <= k, l >= 0 and r + l <= k, for each 1 <= i <= k; - flatten([B_1', B_2',.., B_k']) = [A_1, A_2,.., A_k]. The result of this sequence is identical to the following reshape: reshape([A_1, A_2,.., A_k]) -> [B_1', B_2',.., B_k'] ``` Special credits to @mvasiljevicTT for scrutinizing an algorithm during initial development, which revealed some edge cases that weren't previously covered.
5d0f7fa to
571c936
Compare
vmilosevic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang-Tidy found issue(s) with the introduced code (1/1)
| llvm::SmallVector<int64_t, 4> tripCounts; // Trip count for each loop. | ||
| int64_t totalIterations; // Product of all trip counts. | ||
|
|
||
| LoopContext() : totalIterations(1) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use default member initializer for totalIterations
| LoopContext() : totalIterations(1) {} | |
| LoopContext() : {} |
…mlir into bnorris/dst-gc-allocator
db490ee to
6ab65ad
Compare
…mlir into bnorris/dst-gc-allocator
Ticket
N/A
Summary
New pass
d2m-insert-dst-register-gcimplements graph coloring register allocation for DST (destination registers). Reduces DST usage by reusing slices for non-interfering values.Note that while this patch appears large, more than 80% of it is tests.
Motivation
Current
InsertDstRegisterAccessuses sequential allocation - each value gets a new DST slice. Graph coloring identifies values with non-overlapping lifetimes and assigns them to the same slice, reducing DST pressure.Example: Diamond pattern (4 values, 2 interference edges) needs only 2 slices instead of 4 (50% reduction).
Interference edges:
dst0 - dst1,dst1 - dst2,dst2 - dst3. No edges betweendst0 - dst2ordst1 - dst3(early releases enable reuse). Graph coloring assigns:dst0anddst2→ slice 0,dst1anddst3→ slice 1.Changes
Overview of design
The design includes the defintion of simple abstract interfaces for
DstAnalysisandColoringStrategywhich enable multiple implementations outside of any specific pass defiition.The analysis and transformation passes use the above strategies:
New Operations
d2m.release_dst- Marks end of DST value lifetime:Enables precise liveness analysis and early reuse. Verifiers ensure pairing with
acquire_dstand prevent use-after-release.Pass Implementation
D2MInsertDstRegisterGCPass(lib/Dialect/D2M/Transforms/InsertDstRegisterGC.cpp):affine.loadoperations for DST-consuming ops (viaOperandLoadStoreRegisterOpInterface)acquire_dst/release_dstpairsColoring Strategies
Two algorithms implemented in
GraphColoringStrategy.{h,cpp}:Strategy selection via pass option:
ttmlir-opt --d2m-insert-dst-register-gc="coloring-strategy=greedy" input.mlirPipeline Integration
Not yet integrated into main pipeline. Current
d2m-insert-dst-register-accessperforms multiple unrelated transformations (most notably linalg to affine conversion). These need refactoring into separate passes that would be applied before this pass.Future Work
C += A * Bin a loop), the current pass loads the accumulator from L1 memory and writes it back on every iteration. This is inefficient. A better approach (used by the existing D2MInsertDSTAccess pass) keeps the accumulator in fast DST registers throughout the entire reduction loop and only writes back to L1 once at the end. This requires splitting the loop into three phases: (1) load initial values into DST, (2) compute with accumulator staying in DST, (3) write final results back to L1. This optimization could be implemented as a separate pass that runs before register allocation. (high priority)