Context
PR #1535 introduces lightweight herd cloning during shim-level loop unrolling in loopUnrollFullWithAsyncTokenPreserved (mlir/lib/Util/Dependency.cpp). When unrolling loops that contain air.SegmentOp or air.HerdOp, it creates empty herd shells via OperationState and clones only the channel ops and their transitive dependencies, skipping heavy compute ops (vector, arith, linalg). This avoids O(N × body_size) IR explosion.
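To make the selection rule concrete, here is a minimal toy model of "keep channel ops plus their transitive dependencies". It is not the code from PR #1535 and does not use MLIR; ToyOp, lightweightKeepSet, and the op list in exampleBody are hypothetical names chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for one operation in a herd body (hypothetical, not an MLIR type).
struct ToyOp {
  std::string name;                   // illustrative op name
  std::vector<std::size_t> operands;  // indices of earlier ops this op reads
  bool isChannel;                     // true for channel put/get ops
};

// Collect the indices of all channel ops plus everything they
// transitively depend on, using a simple worklist.
std::set<std::size_t> lightweightKeepSet(const std::vector<ToyOp> &body) {
  std::set<std::size_t> keep;
  std::vector<std::size_t> worklist;
  for (std::size_t i = 0; i < body.size(); ++i)
    if (body[i].isChannel)
      worklist.push_back(i);
  while (!worklist.empty()) {
    std::size_t idx = worklist.back();
    worklist.pop_back();
    if (!keep.insert(idx).second)
      continue;  // already visited
    for (std::size_t dep : body[idx].operands) {
      assert(dep < idx && "operands must refer to earlier ops");
      worklist.push_back(dep);
    }
  }
  return keep;
}

// Ops materialised by a full deep clone of every iteration: N * body_size.
std::size_t fullCloneCount(const std::vector<ToyOp> &body, std::size_t tripCount) {
  return tripCount * body.size();
}

// Ops materialised when only the keep set is cloned per iteration.
std::size_t lightweightCloneCount(const std::vector<ToyOp> &body, std::size_t tripCount) {
  return tripCount * lightweightKeepSet(body).size();
}

// A made-up six-op body: two channel ops, one index feeding a channel op,
// and three compute ops that no channel op depends on.
std::vector<ToyOp> exampleBody() {
  return {
      {"arith.constant", {}, false},   // 0: feeds the channel put
      {"air.channel.put", {0}, true},  // 1
      {"vector.load", {}, false},      // 2
      {"arith.mulf", {2}, false},      // 3
      {"linalg.matmul", {3}, false},   // 4
      {"air.channel.get", {}, true},   // 5
  };
}
```

With this body and a trip count of 8, fullCloneCount gives 48 ops while lightweightCloneCount gives 24, since only ops 0, 1, and 5 survive. The real savings depend on how much of the herd body the channel ops actually reach.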
Limitation
The lightweight unroller only fires when annotateFn is null — the non-tiled unroll path. The tiled path in AIROptimizeShimDMABDs (triggered by non-trivial shim-dma-tile-sizes like 2,2 or 4,4) uses loopUnrollByFactor with an annotateFn callback to tag unrolled iterations. This path still performs full deep-clone unrolling.
The annotateFn guard exists because loopUnrollByFactor is an upstream MLIR utility that doesn't support pluggable clone strategies. Extending lightweight cloning to this path requires either:
Wrapping loopUnrollByFactor to accept a custom clone callback, or
Performing a post-unroll strip (clone-then-strip) specifically for the tiled path. Note that a naive strip crashes due to a cross-region use-after-free (see the T002 postmortem in the PR #1535 discussion).
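As a rough illustration of the first option, the sketch below shows the shape such a wrapper could take: an unroll-by-factor routine that accepts a clone policy alongside the annotate callback. It is a toy over plain structs, not the upstream loopUnrollByFactor signature; SketchOp, ClonePolicy, AnnotateFn, and unrollByFactorSketch are all hypothetical names. It also ignores transitive dependencies, which a real implementation would have to keep.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for an op in the loop body (hypothetical, not an MLIR type).
struct SketchOp {
  std::string name;
  bool isChannel;
  int unrollTag;  // set by the annotate callback; -1 means "untagged"
};

// Hypothetical clone policy: return true if the op should be copied.
using ClonePolicy = std::function<bool(const SketchOp &)>;
// Hypothetical annotate callback: tags a copy with its unrolled iteration.
using AnnotateFn = std::function<void(unsigned, SketchOp &)>;

// Unroll the body `factor` times. Ops rejected by the policy are never
// created, so there is nothing to strip afterwards.
std::vector<SketchOp> unrollByFactorSketch(const std::vector<SketchOp> &body,
                                           unsigned factor,
                                           const ClonePolicy &shouldClone,
                                           const AnnotateFn &annotateFn) {
  assert(factor > 0 && "unroll factor must be positive");
  std::vector<SketchOp> unrolled;
  for (unsigned i = 0; i < factor; ++i) {
    for (const SketchOp &op : body) {
      if (!shouldClone(op))
        continue;  // skipped at clone time
      SketchOp copy = op;
      if (annotateFn)
        annotateFn(i, copy);  // annotation still runs on every kept copy
      unrolled.push_back(copy);
    }
  }
  return unrolled;
}

// Policy matching the current tiled path: copy everything.
bool keepEverything(const SketchOp &) { return true; }

// Policy approximating the lightweight path: copy channel ops only.
bool keepChannelOpsOnly(const SketchOp &op) { return op.isChannel; }
```

The point of the sketch is that annotation and clone selection are independent: the annotate callback can keep tagging iterations while the policy decides what gets materialised. Because skipped ops are never created, this shape sidesteps the erase step that the clone-then-strip option would need.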
Impact
For the default aircc invocation (tile sizes 1,1), this limitation has no effect — the shim BD pass doesn't tile or unroll. For workloads that use explicit tiling (e.g., flash attention with large trip counts and non-default tile sizes), the tiled unroll path still sees the full IR explosion.