Explicit stores, cb lifetime ops, and side-effect-only loops #314
This reworks the lowering pipeline so that stores and cb lifetime ops are explicit from Python all the way down, and loops don't carry dead tensor dataflow.
- **`ttl.store` / `ttl.tile_store` split:** The old `ttl.store` is renamed to `ttl.tile_store` (tile-level, lives inside compute bodies). A new tensor-level `ttl.store` is added, emitted directly by Python's `o.store(result)`. This means the user's store intent is in the IR from the start rather than being reconstructed later from yields.
- **`LowerStoreToCompute` pattern:** `convert-ttl-to-compute` now has a pattern that finds each `ttl.store`, locates the compute op that produces its input, and sinks it as a `ttl.tile_store` into the `ttl.compute` block (this is the same logic we use to fuse eltwise ops).
- **`TTLInsertTileRegsSync`** no longer walks yields, scans for reserves, or synthesizes stores. It just inserts acquire/commit/wait/release around the existing `tile_store` ops. Went from ~250 lines of store inference down to straightforward insertion logic.
- **`ConvertTTLComputeToSCF`** generates `scf.for` without `iter_args` for outputs: no `tensor.insert`, no yielding of updated tensors. Stores are explicit side effects via `tile_store`. This also simplifies `ConvertTTLToTTKernel` cleanup, since there are no `tensor.insert` ops left to tear down.
- **`TTLAssignDST`** now inserts all `copy_tile` ops at the block start (in reverse argument order) via a shared `createCopyTileForArg` utility, instead of inserting at first use with duplicated allocation logic.
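The store-sinking idea can be sketched with a toy IR model. This is a minimal illustration only: the op names mirror the PR (`ttl.store`, `ttl.compute`, `tile_store`), but the classes and the `lower_store_to_compute` function are hypothetical stand-ins, not the real MLIR rewrite pattern.

```python
# Toy sketch of the LowerStoreToCompute idea: find each tensor-level
# store, locate the compute op that produces its input, and sink it
# into that compute block as a tile-level store (same shape of logic
# as eltwise fusion). All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class ComputeOp:            # stand-in for a ttl.compute region
    name: str
    body: list = field(default_factory=list)   # tile-level ops in the block

@dataclass
class StoreOp:              # stand-in for the tensor-level ttl.store
    producer: ComputeOp     # the compute op that defines the stored value
    dest: str

def lower_store_to_compute(module: list) -> None:
    """Sink each tensor-level store into its producing compute block."""
    for op in list(module):
        if isinstance(op, StoreOp):
            op.producer.body.append(("tile_store", op.dest))
            module.remove(op)   # the tensor-level store is now redundant

compute = ComputeOp("eltwise_add", body=[("add", "dst0")])
module = [compute, StoreOp(producer=compute, dest="cb_out")]
lower_store_to_compute(module)
print(compute.body)   # [('add', 'dst0'), ('tile_store', 'cb_out')]
```

Because the store intent is explicit from the frontend on, the pass only has to relocate an existing op rather than infer one from dataflow.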
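The simplified `TTLInsertTileRegsSync` behavior can be sketched as pure insertion. The exact placement of the four sync ops relative to the store is an assumption here (commit/wait before the store, release after, acquire at block start); the list-of-strings IR and the function name are hypothetical.

```python
# Sketch of sync insertion with no yield walking or store inference:
# just bracket each existing tile_store with tile-register sync ops.
# The op ordering shown is an assumption, not taken from the PR.
def insert_tile_regs_sync(body: list) -> list:
    out = ["tile_regs_acquire"]        # assumed: acquire at block start
    for op in body:
        if op.startswith("tile_store"):
            out += ["tile_regs_commit", "tile_regs_wait", op, "tile_regs_release"]
        else:
            out.append(op)
    return out

print(insert_tile_regs_sync(["add(dst0)", "tile_store(cb_out)"]))
```

The point of the PR's change is visible in the shape of this function: it only reads the ops already present, so the ~250 lines of store inference collapse into a single pass over the block.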
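The reverse-argument-order detail in `TTLAssignDST` falls out of inserting at the block start: each new `copy_tile` lands before the previous one, so visiting arguments in reverse leaves the copies in ascending argument order. A tiny sketch (the function name and string-based block are illustrative, not the real pass API):

```python
# Why reverse order: insert-at-front reverses, so reversing the
# argument list first yields copies in forward argument order.
def insert_copy_tiles_at_block_start(block: list, args: list) -> None:
    for arg in reversed(args):                 # visit args in reverse...
        block.insert(0, f"copy_tile({arg})")   # ...insert at block start

block = ["add(dst0, dst1)"]
insert_copy_tiles_at_block_start(block, ["a", "b"])
print(block)   # ['copy_tile(a)', 'copy_tile(b)', 'add(dst0, dst1)']
```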