[ttl] Make TRID DMA wait lowering selectable (default: global barriers)#267

Open
shutovilyaep wants to merge 2 commits into tenstorrent:main from shutovilyaep:feat/lower_copy_wait

Conversation

@shutovilyaep shutovilyaep commented Jan 23, 2026

What?

Adds a pass option to choose how ttl.copy / ttl.wait are lowered to TTKernel DMA ops:

  • Default (unchanged): emit global ttkernel.noc_async_{read,write}_barrier() and no TRID setup.
  • Opt-in (use-trid-barriers=1): emit TRID-aware barriers ttkernel.noc_async_{read,write}_barrier_with_trid(trid, noc) and *_set_trid in the copy lowering. Each copy is assigned a TRID (0..15); the transfer handle is lowered to an i32 TRID value and waits emit barriers keyed by that TRID. A post-conversion cleanup (DeduplicateConsecutiveTridBarriers) merges consecutive TRID barriers that target the same TRID and NOC.

The option is plumbed through convert-ttl-to-ttkernel and ttl-to-ttkernel-pipeline. TRID-focused lit tests explicitly enable use-trid-barriers=1; Python lit tests keep using the default and continue to validate global barrier behavior.
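
The semantic difference between the two lowerings can be modeled with a short Python sketch (illustrative only: `NocModel` and its methods are invented for this example and are not the PR's code). A global barrier drains every outstanding transfer, while a TRID-scoped barrier drains only transfers tagged with the given transaction ID:

```python
# Minimal model of DMA waits: a global barrier waits on all outstanding
# transfers; a TRID-scoped barrier waits only on transfers tagged with a
# specific transaction ID (TRID). Hypothetical model, not tt-metal code.

class NocModel:
    def __init__(self):
        self.outstanding = []  # list of (trid, desc) for in-flight transfers

    def issue(self, trid, desc):
        self.outstanding.append((trid, desc))

    def global_barrier(self):
        # Global barrier: drain everything that is in flight.
        drained = list(self.outstanding)
        self.outstanding.clear()
        return drained

    def barrier_with_trid(self, trid):
        # TRID barrier: drain only transfers whose TRID matches.
        drained = [t for t in self.outstanding if t[0] == trid]
        self.outstanding = [t for t in self.outstanding if t[0] != trid]
        return drained

noc = NocModel()
noc.issue(0, "copy A")
noc.issue(1, "copy B")
# Waiting on TRID 0 leaves copy B in flight; a global barrier would not.
drained = noc.barrier_with_trid(0)
```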

Why?

  • Issue: tenstorrent/tt-lang#87 — lower ttl.copy/ttl.wait to TRID-specific TTKernel noc ops. Hardware supports TRID-scoped barriers so a wait can target only the transfers issued by a specific copy; the default path remains global barriers for compatibility.
  • Reviewer request: implement as a pass option so callers can choose the lowering while the default stays the same as on main.
  • CI: Python lit tests expect global noc_async_{read,write}_barrier(); switching the default to TRID barriers broke them. This PR keeps the default as global barriers and gates TRID behavior behind the option.

How?

  • Pass: convert-ttl-to-ttkernel gains option use-trid-barriers (default false). When false, copy/wait lowering emits global barriers only; when true, copy lowering allocates a TRID per copy, emits noc_async_*_set_trid before the tile read/write loop, and replaces the copy result with the TRID (i32); wait lowering emits noc_async_*_barrier_with_trid(trid, noc).
  • Pipeline: ttl-to-ttkernel-pipeline accepts use-trid-barriers and forwards it to the pass.
  • Cleanup: TTKernel cleanup patterns add DeduplicateConsecutiveTridBarriers for TRID barrier ops so consecutive barriers with the same TRID/NOC are merged.
  • Tests: TRID conversion tests (trid_barriers.mlir, dma_single_core.mlir, loopback_dram_copy.mlir) and relevant TTL-to-Cpp tests run with use-trid-barriers=1. trid_barriers.mlir uses tile-grid tensor types (tensor<1x1x!ttcore.tile<32x32,f32>>) so copy lowering's getTileGridShapeFromValue is valid (it expects TileType element type). Python lit tests are unchanged and run with the default (global barriers).
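
The effect of the DeduplicateConsecutiveTridBarriers cleanup can be illustrated with a simplified Python model (a sketch of the rewrite's effect, not the actual C++ pattern): adjacent barrier ops that share the same kind, TRID, and NOC collapse into one, while any intervening op breaks the run.

```python
def dedup_consecutive_trid_barriers(ops):
    """Collapse runs of adjacent barrier ops that share (kind, trid, noc).

    Each op is modeled as a tuple like ("read_barrier_with_trid", trid, noc);
    any other op breaks a run, mirroring that the rewrite only merges
    *consecutive* barriers. Hypothetical model, not the PR's C++ code.
    """
    out = []
    for op in ops:
        if out and op == out[-1] and "barrier_with_trid" in op[0]:
            continue  # redundant: duplicates the immediately preceding barrier
        out.append(op)
    return out

ops = [
    ("read_barrier_with_trid", 3, 0),
    ("read_barrier_with_trid", 3, 0),  # same TRID/NOC as previous: merged away
    ("read_barrier_with_trid", 4, 0),  # different TRID: kept
    ("other_op",),
    ("read_barrier_with_trid", 3, 0),  # not consecutive with the first: kept
]
```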

How to Test?

# Default: global barriers (same as main)
ttlang-opt --convert-ttl-to-ttkernel %s | FileCheck ...

# TRID mode
ttlang-opt --convert-ttl-to-ttkernel="use-trid-barriers=1" %s | FileCheck ...

# Pipeline
ttlang-opt --ttl-to-ttkernel-pipeline="use-trid-barriers=1" %s -o %t.mlir
  • llvm-lit test/ttlang/Conversion/TTLToTTKernel/ — conversion tests (default + TRID where used).
  • llvm-lit test/ttlang/Translate/TTLToCpp/ — translate tests that use the pipeline with use-trid-barriers=1.
  • llvm-lit test/python/ — Python lit tests (default lowering, no change).

Checklist

  • Self-reviewed (style, logic)
  • Added/updated tests; TRID tests gated behind use-trid-barriers=1, default path unchanged
  • PR is focused (pass option + pipeline plumbing + cleanup + tests)
  • Default behavior matches main (global barriers only)
  • No scope creep (pass/pipeline option and related tests only)

@shutovilyaep shutovilyaep marked this pull request as ready for review January 26, 2026 13:47
@shutovilyaep shutovilyaep requested a review from a team as a code owner January 26, 2026 13:47

Comment received from @brnorris03:

It would be great if you can implement this as a pass option so we can choose between different lowerings (there will probably be more optimizations later), keeping the default the same as what's in main now.

Add a convert-ttl-to-ttkernel pass option (use-trid-barriers) and plumb it through ttl-to-ttkernel-pipeline so callers can choose between legacy global barriers and TRID-aware barriers.

Keep the default on legacy global barriers to match mainline codegen.

Also group ttlang-translate static archives on ELF linkers to avoid link-order dependent failures.
@shutovilyaep shutovilyaep changed the title from "TTL: Lower async DMA waits to TRID barriers" to "[ttl] Make TRID DMA wait lowering selectable (default: global barriers)" Jan 30, 2026
shutovilyaep added a commit to shutovilyaep/tt-lang that referenced this pull request Jan 30, 2026
…getTileGridShapeFromValue

The test used tensor<32x32xf32> (element type f32). Copy lowering calls
getTileGridShapeFromValue() which asserts the tensor has TileType element
type. Use tensor<1x1x!ttcore.tile<32x32,f32>> like other DMA tests to fix
CI crash (SIGABRT) in TTLToTTKernel conversion.

Attempt to fix CI failure in PR tenstorrent#267 / #1222.
Update TRID-focused conversion and translation lit tests to explicitly enable TRID barrier lowering so the default (global barrier) path remains stable.
@shutovilyaep

/codeowners ping

@brnorris03 brnorris03 left a comment

Looks great, thank you! The only more significant issue I see is the lack of runtime tests. I think the best approach for now is to parameterize (some of) the test/me2e tests with the new option, what do you think? I can help with more concrete suggestions on how to do that if you agree.

Some general questions, mainly stemming from my lack of deep knowledge of the low-level semantics of the metal ops.

  1. Is the TRID value semantically meaningful, or does it just need to be unique per copy? I am guessing order doesn't matter? As defined, the generated TRIDs could be nondeterministic (but still correctly unique) due to parallel pattern application.

  2. With the new ops requiring explicit NOC, I see that NOC 0 is always used -- is this appropriate or something that needs to be generalized (perhaps later PR)?

Again, thank you for contributing this!!

Comment on lines 73 to +82
patterns.add<DeduplicateConsecutiveBarriers<NocAsyncReadBarrierOp>>(
    patterns.getContext());
patterns.add<DeduplicateConsecutiveBarriers<NocAsyncWriteBarrierOp>>(
    patterns.getContext());
patterns
    .add<DeduplicateConsecutiveTridBarriers<NocAsyncReadBarrierWithTridOp>>(
        patterns.getContext());
patterns
    .add<DeduplicateConsecutiveTridBarriers<NocAsyncWriteBarrierWithTridOp>>(
        patterns.getContext());

Probably doesn't matter that much, but could make the relevant patterns conditional on the option that enables TRID?

Comment on lines 21 to +25

let options = [
  Option<"useTridBarriers", "use-trid-barriers", "bool", "false",
         "Use TRID-aware DMA waits (barrier_with_trid) instead of global barriers.">,
];

Thank you for adding the option! Not asking you to do this in the PR but it would be interesting to profile the different approaches with a small set of representative benchmarks and set the default based on that (perhaps add a short TODO to that effect here if you agree?).

Comment on lines +575 to +581
class TridAllocator {
public:
  uint32_t allocateTrid() { return nextTrid++ & 0xF; }

private:
  uint32_t nextTrid = 0;
};

There is wrapping at 16 TRIDs, but what happens if the 0th, etc are still not completed at that point? Is there any way to check/detect TRID overflow? Maybe add a TODO for future improvement to make this more robust.
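
One way to make the wrap-around failure mode concrete is a checked-allocator sketch in Python (hypothetical; `CheckedTridAllocator` is invented for this comment and the PR's TridAllocator does not track in-flight TRIDs): reuse of a TRID whose transfers have not yet been retired is detected at allocation time.

```python
class CheckedTridAllocator:
    """Round-robin allocator over 16 TRIDs (0..15), like `nextTrid++ & 0xF`,
    extended with an in-flight set so TRID reuse before completion is
    detectable. Illustrative sketch, not the PR's implementation."""

    NUM_TRIDS = 16

    def __init__(self):
        self.next_trid = 0
        self.in_flight = set()

    def allocate(self):
        trid = self.next_trid & 0xF
        if trid in self.in_flight:
            # Wrapped around onto a TRID that is still pending: a later
            # barrier on this TRID would also wait on unrelated transfers.
            raise RuntimeError(f"TRID {trid} reused while still in flight")
        self.next_trid += 1
        self.in_flight.add(trid)
        return trid

    def complete(self, trid):
        # Called when all transfers tagged with `trid` have been retired.
        self.in_flight.discard(trid)
```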
