[ttl] Make TRID DMA wait lowering selectable (default: global barriers)#267
[ttl] Make TRID DMA wait lowering selectable (default: global barriers)#267shutovilyaep wants to merge 2 commits intotenstorrent:mainfrom
Conversation
|
Comment received from @brnorris03:
|
ca8a242 to
d754a28
Compare
Add a convert-ttl-to-ttkernel pass option (use-trid-barriers) and plumb it through ttl-to-ttkernel-pipeline so callers can choose between legacy global barriers and TRID-aware barriers. Keep the default on legacy global barriers to match mainline codegen. Also group ttlang-translate static archives on ELF linkers to avoid link-order dependent failures.
d754a28 to
42dbe50
Compare
…getTileGridShapeFromValue The test used tensor<32x32xf32> (element type f32). Copy lowering calls getTileGridShapeFromValue() which asserts the tensor has TileType element type. Use tensor<1x1x!ttcore.tile<32x32,f32>> like other DMA tests to fix CI crash (SIGABRT) in TTLToTTKernel conversion. Attempt to fix CI failure in PR tenstorrent#267 / #1222.
Update TRID-focused conversion and translation lit tests to explicitly enable TRID barrier lowering so the default (global barrier) path remains stable.
fbe3c1d to
cc01d5b
Compare
|
/codeowners ping |
brnorris03
left a comment
There was a problem hiding this comment.
Looks great, thank you! The only more significant issue I see is the lack of runtime tests, I think the best approach for now is to parameterize (some of) the test/me2e tests with the new option, what do you think? I can help with more concrete suggestions on how to do that if you agree.dd
Some general questions, mainly stemming from my lack of deep knowledge of the low-level semantics of the metal ops.
-
Is the TRID value semantically meaningful, or just needs to be unique per copy? I am guessing order doesn't matter? As defined the generated TRIDs could be nondeterministic (but correctly unique) due to parallel pattern application.
-
With the new ops requiring explicit NOC, I see that NOC 0 is always used -- is this appropriate or something that needs to be generalized (perhaps later PR)?
Again, thank you for contributing this!!
| patterns.add<DeduplicateConsecutiveBarriers<NocAsyncReadBarrierOp>>( | ||
| patterns.getContext()); | ||
| patterns.add<DeduplicateConsecutiveBarriers<NocAsyncWriteBarrierOp>>( | ||
| patterns.getContext()); | ||
| patterns | ||
| .add<DeduplicateConsecutiveTridBarriers<NocAsyncReadBarrierWithTridOp>>( | ||
| patterns.getContext()); | ||
| patterns | ||
| .add<DeduplicateConsecutiveTridBarriers<NocAsyncWriteBarrierWithTridOp>>( | ||
| patterns.getContext()); |
There was a problem hiding this comment.
Probably doesn't matter that much, but could make the relevant patterns conditional on the option that enables TRID?
|
|
||
| let options = [ | ||
| Option<"useTridBarriers", "use-trid-barriers", "bool", "false", | ||
| "Use TRID-aware DMA waits (barrier_with_trid) instead of global barriers.">, | ||
| ]; |
There was a problem hiding this comment.
Thank you for adding the option! Not asking you to do this in the PR but it would be interesting to profile the different approaches with a small set of representative benchmarks and set the default based on that (perhaps add a short TODO to that effect here if you agree?).
| class TridAllocator { | ||
| public: | ||
| uint32_t allocateTrid() { return nextTrid++ & 0xF; } | ||
|
|
||
| private: | ||
| uint32_t nextTrid = 0; | ||
| }; |
There was a problem hiding this comment.
There is wrapping at 16 TRIDs, but what happens if the 0th, etc are still not completed at that point? Is there any way to check/detect TRID overflow? Maybe add a TODO for future improvement to make this more robust.
What?
Adds a pass option to choose how
ttl.copy/ttl.waitare lowered to TTKernel DMA ops:ttkernel.noc_async_{read,write}_barrier()and no TRID setup.use-trid-barriers=1): emit TRID-aware barriersttkernel.noc_async_{read,write}_barrier_with_trid(trid, noc)and*_set_tridin the copy lowering. Each copy is assigned a TRID (0..15); the transfer handle is lowered to an i32 TRID value and waits emit barriers keyed by that TRID. A post-conversion cleanup (DeduplicateConsecutiveTridBarriers) merges consecutive TRID barriers that target the same TRID and NOC.The option is plumbed through
convert-ttl-to-ttkernelandttl-to-ttkernel-pipeline. TRID-focused lit tests explicitly enableuse-trid-barriers=1; Python lit tests keep using the default and continue to validate global barrier behavior.Why?
ttl.copy/ttl.waitto TRID-specific TTKernel noc ops. Hardware supports TRID-scoped barriers so a wait can target only the transfers issued by a specific copy; the default path remains global barriers for compatibility.main.noc_async_{read,write}_barrier(); switching the default to TRID barriers broke them. This PR keeps the default as global barriers and gates TRID behavior behind the option.How?
convert-ttl-to-ttkernelgains optionuse-trid-barriers(defaultfalse). Whenfalse, copy/wait lowering emits global barriers only; whentrue, copy lowering allocates a TRID per copy, emitsnoc_async_*_set_tridbefore the tile read/write loop, and replaces the copy result with the TRID (i32); wait lowering emitsnoc_async_*_barrier_with_trid(trid, noc).ttl-to-ttkernel-pipelineacceptsuse-trid-barriersand forwards it to the pass.DeduplicateConsecutiveTridBarriersfor TRID barrier ops so consecutive barriers with the same TRID/NOC are merged.trid_barriers.mlir,dma_single_core.mlir,loopback_dram_copy.mlir) and relevant TTL-to-Cpp tests run withuse-trid-barriers=1.trid_barriers.mliruses tile-grid tensor types (tensor<1x1x!ttcore.tile<32x32,f32>>) so copy lowering'sgetTileGridShapeFromValueis valid (it expects TileType element type). Python lit tests are unchanged and run with the default (global barriers).How to Test?
llvm-lit test/ttlang/Conversion/TTLToTTKernel/— conversion tests (default + TRID where used).llvm-lit test/ttlang/Translate/TTLToCpp/— translate tests that use the pipeline withuse-trid-barriers=1.llvm-lit test/python/— Python lit tests (default lowering, no change).Checklist
use-trid-barriers=1, default path unchangedmain(global barriers only)