Commit 0145106

htyu authored and meta-codesync[bot] committed
Replace 1:1 pipeline-to-warpgroup assumption with latency-aware multi-pipeline clustering (#1366)
Summary:
Pull Request resolved: #1366

The design doc assumed each warp group maps to a single hardware pipeline (Limitation #5). This breaks for mixed groups like the epilogue (CUDA+MEM) or compute (CUDA+SFU). This change replaces that assumption with a latency-aware algorithm that uses two signals from the modulo schedule to decide which pipelines should share a warp group:

1. **Separation cost**: barrier overhead (~30 cycles) relative to the cycle gap between cross-pipeline ops. Tightly coupled ops (small gap) stay together; loosely coupled ops (large gap) are separated.
2. **Multi-pipeline makespan**: list scheduling with per-pipeline resource tracking validates that a merged group can execute within II, correctly modeling that different pipelines overlap while data dependencies serialize.

The algorithm is greedy agglomerative clustering: start with one group per active pipeline, merge the highest-coupling pair if the makespan allows, and repeat (see the sketch below). Worked examples show it reproduces hand-tuned results: GEMM stays at 2 groups (MEM↔TC coupling = 0.03, not worth merging), FA Forward produces 3 groups with CUDA+SFU merged (coupling = 0.23), and epilogue chains merge into a single multi-pipeline group.

The partitioning moves from Pass B Step 1 into Pass A as Step 4.7, inside the iterative refinement loop, so it is recomputed when DDG transformations change the schedule. The ScheduleGraph now carries warp group assignments (`modulo.warp_group`, per-node `wg` field), and Pass B Step 1 becomes a thin reader.

Reviewed By: wlei-llvm

Differential Revision: D102487494

fbshipit-source-id: 4d6c49b13346be2d8d1c17314d2f3cfebc641d3b
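For illustration, a minimal sketch of the greedy agglomerative clustering loop described in the commit message. The coupling formula (barrier cycles over barrier cycles plus the smallest cross-pipeline gap), the `min_coupling` cutoff, and all names (`cluster_pipelines`, `coupling`, `fits_in_ii`, `cross_gaps`) are assumptions for exposition, not the committed implementation:

```python
# Hypothetical sketch of latency-aware pipeline clustering; names and the
# coupling formula are illustrative, not the actual implementation.
from itertools import combinations

BARRIER_CYCLES = 30  # approximate cross-warp-group barrier overhead (cycles)


def coupling(group_a, group_b, cross_gaps):
    """Separation cost between two pipeline groups: barrier overhead relative
    to the smallest cycle gap between their cross-pipeline ops (assumed
    formula). Small gap -> high coupling -> worth sharing a warp group."""
    gaps = [gap for (p, q), gap in cross_gaps.items()
            if (p in group_a and q in group_b) or (p in group_b and q in group_a)]
    if not gaps:
        return 0.0
    return BARRIER_CYCLES / (BARRIER_CYCLES + min(gaps))


def cluster_pipelines(pipelines, cross_gaps, fits_in_ii, min_coupling=0.1):
    """Start with one warp group per active pipeline; repeatedly merge the
    highest-coupling pair whose merged group still fits within II.
    `fits_in_ii` stands in for the list-scheduling makespan check, and
    `min_coupling` is a hypothetical "worth merging" threshold."""
    groups = [frozenset([p]) for p in pipelines]
    merged = True
    while merged and len(groups) > 1:
        merged = False
        # Consider candidate merges in decreasing order of coupling.
        for a, b in sorted(combinations(groups, 2),
                           key=lambda pair: coupling(*pair, cross_gaps),
                           reverse=True):
            if coupling(a, b, cross_gaps) < min_coupling:
                break  # remaining pairs are too loosely coupled to merge
            if fits_in_ii(a | b):  # merged group executes within II
                groups.remove(a)
                groups.remove(b)
                groups.append(a | b)
                merged = True
                break
    return groups
```

A made-up GEMM-like call showing the shape of the inputs (the 1000-cycle gap is invented so the coupling lands near the 0.03 quoted above, leaving the two groups separate):

```python
groups = cluster_pipelines(
    ["MEM", "TC"],
    cross_gaps={("MEM", "TC"): 1000},  # large gap -> coupling ~ 0.03
    fits_in_ii=lambda group: True,     # stand-in makespan check
)
assert groups == [frozenset({"MEM"}), frozenset({"TC"})]
```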
1 parent 17cc629 · commit 0145106

1 file changed: 342 additions & 79 deletions
