[TTL] DST optimization: separate pack_tile loop #229

Draft
brnorris03 wants to merge 16 commits into main from bnorris/fix-multitile-mlir

brnorris03 commented Jan 13, 2026

Problem

Two related bugs prevent multi-tile (e.g., 2x2) CB shapes from working correctly:

  1. Pack_tile interleaving bug (compute side): When lowering ttl.compute for multi-tile blocks, pack_tile ops are emitted interleaved with the math ops inside the tile loops. This is incorrect because pack_tile reads from DST registers, which should hold all computed tiles before packing begins.

  2. Block mismatch bug (data movement side): Data movement threads do not copy the blocks that compute expects. Tensor slices created for multi-tile CBs don't match the CB's block shape. (Closes #138: [ttl] Incorrect ttl.copy lowering generates individual tile transfers instead of block transfers.)
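
To make the second problem concrete, here is a minimal Python sketch of which tiles a single multi-tile CB block should cover. It is not code from this PR: `block_tile_coords` is a hypothetical helper, and row-major tile indexing is an assumption made only for illustration.

```python
from itertools import product


def block_tile_coords(block_index, block_shape):
    """Tile coordinates covered by one CB block.

    Hypothetical helper, assuming row-major tile indexing.
    """
    (bi, bj), (rows, cols) = block_index, block_shape
    return [
        (bi * rows + r, bj * cols + c)
        for r, c in product(range(rows), range(cols))
    ]


# A 2x2 block at block index (1, 0) spans four tiles. A slice that matches
# the CB's block shape moves all four in one block transfer instead of
# four separate single-tile transfers.
coords = block_tile_coords((1, 0), (2, 2))
print(coords)
```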

What changed

  • Split ttl.compute lowering into two separate loop nests (compute phase + pack phase)
  • Added tile_regs_commit and tile_regs_wait synchronization between loops
  • Implemented BodyPhases categorization to separate compute ops from pack ops
  • Computed DST indices dynamically for outputs only (inputs reuse DST registers)
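
The ordering that the list above describes can be sketched in Python as a toy model of the emitted op sequence. This is an illustration under assumptions, not the actual lowering: `split_body`, `two_phase_schedule`, and the op name `math_op` are hypothetical. Only `pack_tile`, `tile_regs_commit`, and `tile_regs_wait` are names taken from this PR.

```python
def split_body(body):
    """Partition one tile body's op names into (compute, pack) lists.

    Hypothetical stand-in for the BodyPhases categorization.
    """
    compute = [op for op in body if op != "pack_tile"]
    pack = [op for op in body if op == "pack_tile"]
    return compute, pack


def two_phase_schedule(body, num_tiles):
    """Op order with a compute loop followed by a separate pack loop."""
    compute, pack = split_body(body)
    # Compute phase: every tile's math ops write their DST slot first.
    ops = [f"{op}(dst={i})" for i in range(num_tiles) for op in compute]
    # Synchronization between the two loop nests.
    ops += ["tile_regs_commit", "tile_regs_wait"]
    # Pack phase: pack_tile reads DST only after all tiles are computed.
    ops += [f"{op}(dst={i})" for i in range(num_tiles) for op in pack]
    return ops


# "math_op" is a placeholder for whatever math op the body contains.
schedule = two_phase_schedule(["math_op", "pack_tile"], num_tiles=4)
for op in schedule:
    print(op)
```

In this model a 2x2 block (four tiles) yields four math ops, then the commit/wait pair, then four pack_tile ops, rather than alternating math and pack per tile.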

// Special binary ops with non-standard lowering
// Max uses 2-arg in-place form (TTLTileMaxToTTKernel template)
- TTL_BINARY_TILE_OP_SPECIAL(Max, MaxTileOp, BinaryMaxTileInitOp, BinaryMaxTileOp)
+ TTL_BINARY_TILE_OP(Max, MaxTileOp, BinaryMaxTileInitOp, BinaryMaxTileOp)
Contributor

Is this intentional? Seems like this should at least be its own PR with tests.

Contributor Author

Certainly -- I think it should have been done back when switching to the binary op (since Max was no longer special at that point).

brnorris03 changed the title from "[TTL] Fix multitile block lowering" to "[TTL] DST optimization: separate pack_tile loop" on Jan 18, 2026
Development

Successfully merging this pull request may close these issues.

[ttl] Incorrect ttl.copy lowering generates individual tile transfers instead of block transfers