[TTL] DST optimization: separate pack_tile loop#229
Draft
brnorris03 wants to merge 16 commits into main
1. Allocate DST registers for inputs (block arguments) based on liveness.
2. Allocate DST registers for outputs starting after inputs.

This ensures outputs get indices >= inputs_footprint, so they map to DST[inputs_footprint + tile_index] in multi-tile compute cases.
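The two allocation steps above can be sketched roughly as follows. This is a minimal illustration, not the allocator in this PR: `LiveRange`, `InputAllocation`, `allocateInputs`, and `outputDstIndex` are hypothetical names, liveness is modeled as simple op-position intervals, and inputs are assumed to arrive sorted by first use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical live range of one input (block argument), in op positions.
struct LiveRange {
  int firstUse;
  int lastUse;
};

struct InputAllocation {
  std::vector<int> regs;  // DST index assigned to each input
  int footprint = 0;      // number of DST registers the inputs occupy
};

// Step 1: greedy liveness-based assignment. An input reuses the lowest DST
// register whose previous occupant is already dead, else takes a new one.
// Assumption: `inputs` is sorted by firstUse.
InputAllocation allocateInputs(const std::vector<LiveRange>& inputs) {
  InputAllocation alloc;
  std::vector<int> busyUntil;  // busyUntil[r] = lastUse of the input held in r
  int prevFirstUse = 0;
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    const LiveRange& in = inputs[i];
    assert(in.firstUse <= in.lastUse);
    assert(i == 0 || prevFirstUse <= in.firstUse);
    prevFirstUse = in.firstUse;

    std::size_t reg = 0;
    while (reg < busyUntil.size() && busyUntil[reg] >= in.firstUse) ++reg;
    if (reg == busyUntil.size()) {
      busyUntil.push_back(in.lastUse);  // no free register: grow the footprint
    } else {
      busyUntil[reg] = in.lastUse;      // reuse a register that is now dead
    }
    alloc.regs.push_back(static_cast<int>(reg));
  }
  alloc.footprint = static_cast<int>(busyUntil.size());
  return alloc;
}

// Step 2: outputs start right after the inputs, so output tile `tileIndex`
// lands in DST[inputs_footprint + tile_index].
int outputDstIndex(const InputAllocation& alloc, int tileIndex) {
  return alloc.footprint + tileIndex;
}
```

With three inputs whose ranges are [0, 2], [1, 3], and [3, 5], the third input reuses register 0, the inputs footprint is 2, and output tile 3 maps to DST[5].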
zoecarver reviewed Jan 13, 2026
// Special binary ops with non-standard lowering
// Max uses 2-arg in-place form (TTLTileMaxToTTKernel template)
TTL_BINARY_TILE_OP_SPECIAL(Max, MaxTileOp, BinaryMaxTileInitOp, BinaryMaxTileOp)
TTL_BINARY_TILE_OP(Max, MaxTileOp, BinaryMaxTileInitOp, BinaryMaxTileOp)
Is this intentional? Seems like this should at least be its own PR with tests.
brnorris03 (Author)
Certainly -- I think it should have been done back when we switched to the binary op (since max was no longer special at that point).
Problem
Two related bugs prevent multi-tile (e.g., 2x2) CB shapes from working correctly:
1. Pack_tile interleaving bug (compute side): When lowering `ttl.compute` for multi-tile blocks, pack_tile ops are emitted interleaved with math ops inside the tile loops. This is incorrect because pack_tile reads from DST registers that should contain all computed tiles first.
2. Data movement threads are not copying the correct blocks that compute expects. Tensor slices created for multi-tile CBs don't match the CB's block shape. (closes [ttl] Incorrect ttl.copy lowering generates individual tile transfers instead of block transfers #138)
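To make the second bug concrete, here is a small sketch of a slice whose shape matches the CB's block shape, so one transfer covers the whole block instead of one tile at a time. It is an illustration only: `TileSlice`, `blockSlice`, and `tileCount` are hypothetical names, coordinates are in tile units, and row-major block indexing is an assumption rather than something the PR states.

```cpp
#include <cassert>

// Hypothetical slice descriptor, all fields in tile units.
struct TileSlice {
  int rowStart;
  int colStart;
  int rows;
  int cols;
};

// Slice for block (blockRow, blockCol) of a tensor tiled into CB-sized blocks.
// The slice shape equals the CB block shape, so a single block transfer moves
// exactly the tiles that the compute side expects.
TileSlice blockSlice(int blockRow, int blockCol, int blockRows, int blockCols) {
  assert(blockRows > 0 && blockCols > 0);
  assert(blockRow >= 0 && blockCol >= 0);
  return TileSlice{blockRow * blockRows, blockCol * blockCols, blockRows,
                   blockCols};
}

// Number of tiles covered by one block transfer.
int tileCount(const TileSlice& slice) { return slice.rows * slice.cols; }
```

For a 2x2 CB block shape, block (1, 0) starts at tile row 2, tile column 0 and covers 4 tiles in a single slice.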
What changed
- `ttl.compute` lowering is split into two separate loop nests (compute phase + pack phase)
- `tile_regs_commit` and `tile_regs_wait` synchronization between the loops
- `BodyPhases` categorization to separate compute ops from pack ops
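The resulting schedule can be modeled with a small trace generator. This is a sketch of the ordering only, not the MLIR lowering in this PR: `Phase`, `BodyOp`, and `emitTwoPhase` are hypothetical names, ops are represented as strings, and `math_op` in the usage note is a placeholder for whatever math ops the body contains.

```cpp
#include <string>
#include <vector>

// Hypothetical classification of a body op; the real pass inspects MLIR ops.
enum class Phase { Compute, Pack };

struct BodyOp {
  std::string name;
  Phase phase;
};

// Appends every op of the requested phase, once per tile, to the trace.
static void emitLoop(const std::vector<BodyOp>& body, Phase phase,
                     int numTiles, std::vector<std::string>& trace) {
  for (int tile = 0; tile < numTiles; ++tile) {
    for (const BodyOp& op : body) {
      if (op.phase == phase) {
        trace.push_back(op.name + "(" + std::to_string(tile) + ")");
      }
    }
  }
}

// Emits a textual trace of the two-phase schedule: all compute ops first,
// then the commit/wait synchronization, then all pack ops.
std::vector<std::string> emitTwoPhase(const std::vector<BodyOp>& body,
                                      int numTiles) {
  std::vector<std::string> trace;
  emitLoop(body, Phase::Compute, numTiles, trace);  // loop nest 1: math only
  trace.push_back("tile_regs_commit");              // DST contents are final
  trace.push_back("tile_regs_wait");                // packer may now read DST
  emitLoop(body, Phase::Pack, numTiles, trace);     // loop nest 2: packing only
  return trace;
}
```

For a body of one `math_op` and one `pack_tile` over two tiles, the trace is `math_op(0)`, `math_op(1)`, `tile_regs_commit`, `tile_regs_wait`, `pack_tile(0)`, `pack_tile(1)`, so no pack runs before every tile has been computed.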