
Commit cbb02a6

kvbp2k authored and meta-codesync[bot] committed
[AutoWS] Add Swing Modulo Scheduling (SMS) as alternative to Rau's IMS (#1257)
Summary:
Add modulo scheduling for automatic warp-specialization MMA annotation. Enables `TRITON_USE_MODULO_SCHEDULE=sms` to automatically derive pipeline stage assignments that match hand-tuned `attrs=` annotations on FA BWD.

## FA BWD performance (B200, TRITON_USE_META_WS=1 TRITON_USE_META_PARTITION=1)

Run example:

`TRITON_USE_META_WS=1 TRITON_USE_META_PARTITION=1 TRITON_ALWAYS_COMPILE=1 TRITON_USE_MODULO_SCHEDULE=sms python -m pytest python/tutorials/fused-attention-ws-device-tma.py -k "bwd and 128 and 1024 and 16 and 8" -v`

| Shape | Baseline TFLOPS | SMS TFLOPS | Diff |
|---|---|---|---|
| Z=4 H=16 N=2048 D=128 | 409.4 | 409.9 | +0.1% |
| Z=8 H=16 N=1024 D=128 | 324.7 | 323.3 | -0.4% |
| Z=1 H=32 N=4096 D=128 | 471.2 | 472.0 | +0.2% |

## What it does

The modulo scheduling pass runs before the WS pass and sets `tt.autows` annotations on MMA ops. These annotations tell the downstream pipeliner which MMA ops should be grouped into pipeline stages for cross-iteration overlap.

### Scheduling algorithms

Selected via `TRITON_USE_MODULO_SCHEDULE=<algo>`:

| Value | Algorithm | Description |
|-------|-----------|-------------|
| `sms` | Swing Modulo Scheduling | Slack-based ordering, directional placement (Llosa et al., PACT 1996) |
| `exhaustive` | Branch-and-bound | Explores all valid stage assignments with memory feasibility checks |
| `random` | Random sampling | Dependency-aware random stage assignments |
| `1` | Rau's IMS | Critical-path ordering with ejection backtracking (Rau, 1994) |

### Key design decisions

**selfLatency = 1 for all GPU pipelines.** GPU execution units are deeply pipelined — a new instruction can be issued every ~1 cycle. Using completion latency (e.g., 900 cycles for MMA) as `selfLatency` inflated ResMII to 4500 for FA BWD (5 MMAs), causing all schedulers to fail. With `selfLatency=1`, RecMII (data dependencies) correctly drives the schedule.

**Stage assignment via transitive MMA dependency counting.** After the scheduler assigns cycles, the pass derives pipeline stages:

- 0-1 transitive MMA predecessors → stage 0 (prefetchable)
- 2+ transitive MMA predecessors → stage 1 (gated on multiple prior results)

This matches the hand-tuned FA BWD partition exactly:

| MMA | Transitive MMA deps | Stage | Order |
|-----|---------------------|-------|-------|
| qkT = dot(k, qT) | 0 | 0 | 0 |
| dpT = dot(v, do^T) | 0 | 0 | 0 |
| dv += dot(ppT, do) | 1 (qkT) | 0 | 1 |
| dq = dot(dsT^T, k) | 2 (qkT, dpT) | 1 | 0 |
| dk += dot(dsT, qT) | 2 (qkT, dpT) | 1 | 0 |

**Independent MMAs share the same order** within a stage to avoid barrier deadlocks. Annotations are skipped for loops with existing `tt.autows` from Python `attrs=` or when all MMAs land in the same stage.

## Files changed

- `SwingScheduler.cpp/h` — SMS implementation
- `ExhaustiveScheduler.cpp/h` — exhaustive + random search
- `ModuloReservationTable.cpp/h` — Rau's IMS + dispatch logic
- `LatencyModel.cpp` — selfLatency=1 fix, added 8 missing tensor op latencies
- `ModuloSchedulePass.cpp` — dependency-based stage/order assignment, loop filtering
- `DataDependenceGraph.cpp/h` — DDG construction
- `compiler.py` — pass ordering, modulo pass before data partitioning
- `knobs.py` — `TRITON_USE_MODULO_SCHEDULE` as `env_opt_str`
- `GetEnv.hpp` — register env var for C++ access
- `test_modulo_schedule.py` — E2E tests for all 3 algorithms
- `ws_global_instruction_scheduling.md` — design doc with SMS details + benchmarks
- `CMakeLists.txt` — SwingScheduler.cpp, ExhaustiveScheduler.cpp

Pull Request resolved: #1257

Test Plan:
- [x] `pytest python/test/unit/cuda/test_modulo_schedule.py` — all 3 algos pass
- [x] FA BWD correctness: 12/12 tests passed with `TRITON_USE_MODULO_SCHEDULE=sms`
- [x] FA BWD perf matches baseline (±0.5%)

Authored with Claude.

Reviewed By: htyu

Differential Revision: D101116271

Pulled By: kvbp2k

fbshipit-source-id: 345998592443c0279cd0efe8ce08613f64b2856a
1 parent 90fe2f4 commit cbb02a6

17 files changed

Lines changed: 1561 additions & 95 deletions

docs/design/ws_global_instruction_scheduling.md

Lines changed: 114 additions & 0 deletions
@@ -20,6 +20,7 @@ This document is based on the original design in [WS global instruction scheduli

- [Step 1: Compute Minimum Initiation Interval (II)](#step-1-compute-minimum-initiation-interval-ii)
- [Step 2: Modulo Reservation Table Scheduling](#step-2-modulo-reservation-table-scheduling)
- [Background: Rau's Iterative Modulo Scheduling](#background-raus-iterative-modulo-scheduling)
- [Alternative: Swing Modulo Scheduling (SMS)](#alternative-swing-modulo-scheduling-sms)
- [Step 2.5: Compute Cluster IDs from the Modulo Schedule](#step-25-compute-cluster-ids-from-the-modulo-schedule)
- [Step 3: Derive Per-Region Pipeline Depth from the Modulo Schedule](#step-3-derive-per-region-pipeline-depth-from-the-modulo-schedule)
- [Step 4: Handling Resource Pressure (SMEM/TMEM Budget)](#step-4-handling-resource-pressure-smemtmem-budget)
@@ -349,6 +350,8 @@ The algorithm as described has several limitations:

7. **Register allocation is approximate**: Pass B Step 4 estimates register usage from live variable counts but doesn't perform full register allocation. The actual register count is determined by the compiler backend (ptxas), which may differ from the estimate and cause spills that the schedule didn't anticipate.

8. **SMS limitations**: The SMS implementation's simplified ASAP/ALAP computation (no II-dependent recurrence bounds) and BFS ordering (no SCC prioritization) may produce suboptimal schedules for kernels with multiple interacting recurrence circuits, such as FA backward with 5 MMA ops and cross-iteration accumulator/softmax/pointer dependencies. For single-MMA kernels (GEMM), SMS and Rau produce identical schedules.

---

## Inputs
@@ -760,6 +763,117 @@ def modulo_schedule(DDG, latencies, unit_map, MinII):
        II += 1  # Try larger II
```

#### Alternative: Swing Modulo Scheduling (SMS)

Swing Modulo Scheduling (SMS; J. Llosa, A. Gonzalez, E. Ayguade, M. Valero, "Swing Modulo Scheduling: A Lifetime-Sensitive Approach", PACT 1996) avoids backtracking by using slack-based node ordering and directional placement.

**Key differences from Rau's IMS:**

| Property | Rau's IMS | SMS |
|----------|-----------|-----|
| Complexity | Potentially exponential (backtracking) | O(n) per II attempt |
| Node ordering | Critical-path height (bottom-up) | Slack = ALAP - ASAP (tightest first) |
| Placement | Earliest free slot, eject if blocked | Top-down for successors, bottom-up for predecessors |
| Register pressure | Not considered | Reduced by keeping producer-consumer pairs close |

**SMS Algorithm:**

1. **Compute ASAP/ALAP**: Forward/backward relaxation including loop-carried edges (II-dependent: `ASAP[v] >= ASAP[u] + latency - distance * II`), recomputed for each candidate II. Slack = ALAP - ASAP measures scheduling freedom.

2. **Ordering phase (swing)**: Start with the minimum-slack op (most constrained). Then BFS-expand: add its successors (marked top-down) sorted by ascending slack, then its predecessors (marked bottom-up) sorted by ascending slack. This alternation is the "swing" — it keeps producers and consumers adjacent in the schedule.

3. **Scheduling phase**: For each op in swing order:
   - **Top-down** ops: place at the earliest free slot from `earliest` upward (data is ready, issue immediately).
   - **Bottom-up** ops: place at the latest free slot from `latest` downward (defer production, reducing live range and register pressure).
```python
def sms_schedule(DDG, latencies, unit_map, MinII):
    for II in range(MinII, MinII + 11):  # capped at MinII+10
        # Recompute per-II: loop-carried edges depend on II
        asap = compute_ASAP(DDG, latencies, II)
        alap = compute_ALAP(DDG, latencies, asap, II)
        slack = {op: alap[op] - asap[op] for op in DDG.nodes}

        table = ReservationTable(II)
        scheduled = {}

        # Ordering: BFS from min-slack seed
        seed = min(DDG.nodes, key=lambda n: slack[n])
        order = [(seed, True)]  # (node, is_top_down)
        visited = {seed}
        for node, _ in order:
            # Successors → top-down
            for s in sorted(successors(node), key=lambda n: slack[n]):
                if s not in visited:
                    order.append((s, True))
                    visited.add(s)
            # Predecessors → bottom-up
            for p in sorted(predecessors(node), key=lambda n: slack[n]):
                if p not in visited:
                    order.append((p, False))
                    visited.add(p)

        # Placement
        success = True
        for op, top_down in order:
            earliest = compute_earliest(op, scheduled, DDG, latencies, II)
            latest = compute_latest(op, scheduled, DDG, latencies, II)
            if top_down:
                slot = table.find_free(earliest, unit_map[op])
            else:
                slot = table.find_free_reverse(latest, earliest, unit_map[op])
            if slot is None:
                slot = table.find_free(earliest, unit_map[op])  # fallback
            if slot is None:
                success = False
                break
            table.reserve(slot, unit_map[op], op)
            scheduled[op] = slot

        if success:
            return scheduled, II
    return None
```
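The pseudocode treats `compute_ASAP` and the reservation table as black boxes. A minimal, self-contained sketch of the mechanics behind them (hypothetical names and signatures chosen for illustration; not the actual `SwingScheduler.cpp` internals, whose call shapes differ):

```python
class ReservationTable:
    """Modulo reservation table: one issue slot per (cycle mod II, pipeline)."""

    def __init__(self, ii):
        self.ii = ii
        self.used = set()  # occupied (cycle % II, pipeline) slots

    def find_free(self, start, pipe):
        # Scan upward from `start`; only II distinct modulo slots exist,
        # so II probes suffice.
        for c in range(start, start + self.ii):
            if (c % self.ii, pipe) not in self.used:
                return c
        return None

    def find_free_reverse(self, latest, earliest, pipe):
        # Scan downward from `latest` to `earliest` (bottom-up placement).
        for c in range(latest, earliest - 1, -1):
            if (c % self.ii, pipe) not in self.used:
                return c
        return None

    def reserve(self, cycle, pipe, op):
        self.used.add((cycle % self.ii, pipe))


def compute_ASAP(nodes, edges, ii, max_iters=1000):
    """Relax ASAP[v] >= ASAP[u] + latency - distance * II to a fixed point.

    edges: iterable of (u, v, latency, distance); distance > 0 marks
    loop-carried edges, which is why ASAP depends on the candidate II.
    """
    asap = {n: 0 for n in nodes}
    for _ in range(max_iters):  # convergence limit, as in the text
        changed = False
        for u, v, lat, dist in edges:
            bound = asap[u] + lat - dist * ii
            if bound > asap[v]:
                asap[v] = bound
                changed = True
        if not changed:
            break
    return asap
```

For a three-op recurrence `A→B→C→A` with latencies 2, 3, 1 and one loop-carried back edge, `compute_ASAP` at `II = 6` settles at `{A: 0, B: 2, C: 5}`: the back-edge bound `5 + 1 - 6 = 0` is already satisfied, so the relaxation converges.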
**Implementation status:** SMS is available via `TRITON_USE_MODULO_SCHEDULE=sms`. Source: `SwingScheduler.cpp`. The implementation has the following simplifications relative to the paper:

1. **No recurrence-aware ordering.** The paper identifies SCCs, orders them by RecMII contribution, and schedules the most critical recurrence first. The implementation uses simple BFS from the minimum-slack node.

2. **Fallback on placement failure.** When the directional scan finds no free slot, the implementation falls back to `find_free` from `earliest`. The paper would fail at this II and increment.

3. **BFS follows all DDG edges**, including loop-carried ones (distance > 0). The paper's ordering only follows distance-0 edges.

ASAP/ALAP include loop-carried edges and are recomputed per-II: `ASAP[v] >= ASAP[u] + latency - distance * II`, with a convergence limit of 1000 iterations.

**selfLatency model:** All pipelines use `selfLatency = 1` because GPU execution units are deeply pipelined — a new instruction can be issued every ~1 cycle. This makes ResMII negligible (equal to the op count on the busiest pipeline) and lets RecMII (data dependencies) drive the schedule. Without this fix, SMS fails on FA backward (ResMII=4500 from 5 MMAs × 900 selfLatency each).
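The ResMII arithmetic can be sanity-checked with a toy calculation (an illustrative sketch; `res_mii` is a made-up helper name, not the pass's API):

```python
def res_mii(ops, self_latency):
    """ResMII lower bound: total per-iteration issue-slot demand
    on the busiest pipeline. ops: list of (name, pipeline) pairs."""
    demand = {}
    for name, pipe in ops:
        demand[pipe] = demand.get(pipe, 0) + self_latency(name)
    return max(demand.values())


# FA backward: 5 MMA ops, all on the TC pipeline.
mmas = [(f"mma{i}", "TC") for i in range(5)]

# Completion latency as selfLatency: ResMII explodes to 5 * 900 = 4500.
assert res_mii(mmas, lambda op: 900) == 4500

# Issue latency of a deeply pipelined unit: ResMII is just the op count.
assert res_mii(mmas, lambda op: 1) == 5
```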
**Stage assignment (emitMMAAnnotations):** After SMS assigns cycles, the pass derives pipeline stage annotations (`tt.autows`) for MMA ops using transitive MMA dependency counting:

- 0-1 transitive MMA predecessors → stage 0 (can be prefetched)
- 2+ transitive MMA predecessors → stage 1 (gated on multiple prior results)

Within each stage, independent MMAs share the same order (cluster ID) to avoid barrier deadlocks.

Example (FA backward, 5 MMAs):

| MMA | Transitive MMA deps | Stage | Order |
|-----|---------------------|-------|-------|
| qkT = dot(k, qT) | 0 | 0 | 0 |
| dpT = dot(v, do^T) | 0 | 0 | 0 |
| dv += dot(ppT, do) | 1 (qkT) | 0 | 1 |
| dq = dot(dsT^T, k) | 2 (qkT, dpT) | 1 | 0 |
| dk += dot(dsT, qT) | 2 (qkT, dpT) | 1 | 0 |

This matches the hand-tuned annotation partition exactly. Annotations are skipped when all MMAs land in the same stage (e.g., GEMM, FA forward) or when the loop already has `tt.autows` from Python `attrs=`.
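The counting rule can be sketched as follows (illustrative names only; the actual logic lives in `emitMMAAnnotations`):

```python
def transitive_mma_preds(mma, direct_deps):
    """Full transitive set of MMA ops whose results feed `mma`.
    direct_deps: maps each MMA to the MMAs it directly consumes."""
    seen = set()
    stack = list(direct_deps.get(mma, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(direct_deps.get(p, ()))
    return seen


def assign_stage(mma, direct_deps):
    # 0-1 transitive MMA predecessors -> stage 0 (prefetchable);
    # 2+ -> stage 1 (gated on multiple prior results).
    return 0 if len(transitive_mma_preds(mma, direct_deps)) <= 1 else 1


# Direct MMA-to-MMA dependencies from the FA backward table above.
deps = {"dv": {"qkT"}, "dq": {"qkT", "dpT"}, "dk": {"qkT", "dpT"}}
stages = {m: assign_stage(m, deps) for m in ["qkT", "dpT", "dv", "dq", "dk"]}
# stages == {"qkT": 0, "dpT": 0, "dv": 0, "dq": 1, "dk": 1}
```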
FA BWD performance (B200, `TRITON_USE_META_WS=1 TRITON_USE_META_PARTITION=1`):

| Shape | Baseline TFLOPS | SMS TFLOPS | Diff |
|---|---|---|---|
| Z=4 H=16 N=2048 D=128 | 409.4 | 409.9 | +0.1% |
| Z=8 H=16 N=1024 D=128 | 324.7 | 323.3 | -0.4% |
| Z=1 H=32 N=4096 D=128 | 471.2 | 472.0 | +0.2% |
### Step 2.5: Compute Cluster IDs from the Modulo Schedule

After the modulo schedule assigns each op a `(cycle, pipeline)`, compute **cluster IDs** that encode within-stage instruction ordering for the downstream code generator.

include/triton/Tools/Sys/GetEnv.hpp

Lines changed: 2 additions & 1 deletion
@@ -52,7 +52,8 @@ inline const std::set<std::string> CACHE_INVALIDATING_ENV_VARS = {
     "TRITON_DUMP_TLX_BENCHMARK",
     "TRITON_ENABLE_EXPERIMENTAL_CONSAN",
     "TRITON_PASS_PLUGIN_PATH",
-    "TRITON_STRICT_REDUCTION_ORDERING"
+    "TRITON_STRICT_REDUCTION_ORDERING",
+    "TRITON_USE_MODULO_SCHEDULE"
     // clang-format on
 };

python/triton/knobs.py

Lines changed: 1 addition & 1 deletion
@@ -511,7 +511,7 @@ class nvidia_knobs(base_knobs):
     libcuda_path: env_opt_str = env_opt_str("TRITON_LIBCUDA_PATH")
     use_meta_ws: env_bool = env_bool("TRITON_USE_META_WS")
     use_meta_partition: env_bool = env_bool("TRITON_USE_META_PARTITION")
-    use_modulo_schedule: env_bool = env_bool("TRITON_USE_MODULO_SCHEDULE")
+    use_modulo_schedule: env_opt_str = env_opt_str("TRITON_USE_MODULO_SCHEDULE")
     # Force OAI SWP schedule even when using Meta's WS implementation.
     force_trunk_swp_schedule: env_bool = env_bool("TRITON_FORCE_TRUNK_SWP_SCHEDULE")
     dump_ttgir_to_tlx: env_bool = env_bool("TRITON_DUMP_TTGIR_TO_TLX")

test/TritonGPU/modulo-schedule-graph-edge.mlir

Lines changed: 4 additions & 7 deletions
@@ -3,9 +3,8 @@

 //===----------------------------------------------------------------------===//
 // Edge case 0: Single-stage schedule (maxStage=0).
-// MMA-only loop: no TMA copy, no result use. The MMA self-latency (900) is
-// the only thing on the TC pipeline, so II = 900 and the MMA lands at
-// cycle 0, stage 0 — max_stage = 0.
+// MMA-only loop: no TMA copy, no result use. With selfLatency=1,
+// II = 1 (single TC op) and the MMA lands at cycle 0, stage 0.
 //
 // Regression test for Devmate review: tt.num_stages must be set even when
 // maxStage = 0 so downstream pipelining recognises the loop as scheduled.
@@ -18,11 +17,9 @@
 module attributes {"ttg.num-warps" = 4 : i32, ttg.target = "cuda:100"} {

 // Verify the maxStage=0 dump and the loop's tt.num_stages=1 attribute.
-// CHECK: ii = 900, max_stage = 0
+// CHECK: ii = 1, max_stage = 0
 // CHECK: @maxstage_0_mma_only
-// CHECK: tt.modulo_ii = 900 : i32
-// CHECK-SAME: tt.num_stages = 1 : i32
-// CHECK-SAME: tt.scheduled_max_stage = 0 : i32
+// CHECK: tt.num_stages = 1 : i32
 tt.func @maxstage_0_mma_only(
     %a: !ttg.memdesc<128x64xf16, #shared, #smem, mutable>,
     %b: !ttg.memdesc<64x128xf16, #shared, #smem, mutable>,

test/TritonGPU/modulo-schedule-graph.mlir

Lines changed: 14 additions & 14 deletions
@@ -13,37 +13,37 @@

 module attributes {"ttg.num-warps" = 4 : i32, ttg.target = "cuda:100"} {

-// --- Graph structure: II=1038, max_stage=2, trip_count=32 ---
+// --- Graph structure: II=1005, max_stage=1, trip_count=32 ---
+// With selfLatency=1, loads issue every cycle (not every 518 cycles),
+// so II is driven by RecMII (loop-carried dep: MMA→tmem_load→tmem_alloc→MMA).
 // CHECK: [PASS-A] === Inner Loop ScheduleGraph ===
 // CHECK-NEXT: modulo.schedule @loop0 {
-// CHECK-NEXT: ii = 1038, max_stage = 2, prologue_latency = 1038, trip_count = 32
+// CHECK-NEXT: ii = 1005, max_stage = 1, prologue_latency = 703, trip_count = 32
 //
-// --- Nodes: loads+allocs@s0, MMA@s1, tmem_load@s2 with cluster IDs ---
+// --- Nodes: loads+allocs+MMA@s0, tmem_load@s1 ---
 // CHECK: modulo.stage @s0 {
-// CHECK: tt.descriptor_load {pipe: MEM, cycle: 0, cluster: 0, latency: 1218, selfLatency: 518}
-// CHECK: tt.descriptor_load {pipe: MEM, cycle: 518, cluster: 1, latency: 1218, selfLatency: 518}
-// CHECK: ttg.local_alloc {pipe: MEM, cycle: 1036, cluster: 2, latency: 700
-// CHECK: ttg.local_alloc {pipe: MEM, cycle: 1037, cluster: 3, latency: 700
+// CHECK: tt.descriptor_load {pipe: MEM, cycle: 0, cluster: 0, latency: 1218, selfLatency: 1}
+// CHECK: tt.descriptor_load {pipe: MEM, cycle: 1, cluster: 1, latency: 1218, selfLatency: 1}
+// CHECK: ttg.local_alloc {pipe: MEM, cycle: 2, cluster: 2, latency: 700
+// CHECK: ttg.local_alloc {pipe: MEM, cycle: 3, cluster: 3, latency: 700
+// CHECK: ttng.tc_gen5_mma {pipe: TC, cycle: 703, cluster: 4, latency: 900, selfLatency: 1
 // CHECK: }
 // CHECK: modulo.stage @s1 {
-// CHECK: ttng.tc_gen5_mma {pipe: TC, cycle: 1737, cluster: 0, latency: 900, selfLatency: 900
-// CHECK: }
-// CHECK: modulo.stage @s2 {
-// CHECK: ttng.tmem_load {pipe: CUDA, cycle: 2637, cluster: 0, latency: 130, selfLatency: 130
+// CHECK: ttng.tmem_load {pipe: CUDA, cycle: 1603, cluster: 0, latency: 105, selfLatency: 1
 // CHECK: }
 //
 // --- Edges: SSA + loop-carried ---
 // CHECK: edges {
 // CHECK-DAG: N0 -> N1 lat=0 dist=0
 // CHECK-DAG: N0 -> N2 lat=0 dist=0
-// CHECK-DAG: N1 -> N3 lat=518 dist=0
-// CHECK-DAG: N2 -> N4 lat=518 dist=0
+// CHECK-DAG: N1 -> N3 lat=1 dist=0
+// CHECK-DAG: N2 -> N4 lat=1 dist=0
 // CHECK-DAG: N3 -> N6 lat=700 dist=0
 // CHECK-DAG: N4 -> N6 lat=700 dist=0
 // CHECK-DAG: N5 -> N6 lat=0 dist=0
 // CHECK-DAG: N5 -> N7 lat=0 dist=0
 // CHECK-DAG: N6 -> N7 lat=900 dist=0
-// CHECK-DAG: N7 -> N5 lat=130 dist=1
+// CHECK-DAG: N7 -> N5 lat=105 dist=1
 // CHECK: }
 // CHECK: }
 tt.func @test_basic_graph(

test/TritonGPU/modulo-schedule.mlir

Lines changed: 9 additions & 18 deletions
@@ -8,26 +8,17 @@

 module attributes {"ttg.num-warps" = 4 : i32, ttg.target = "cuda:100"} {

-// Verify that the modulo schedule pass annotates ops with loop.stage/loop.cluster
-// and sets tt.modulo_ii on the loop.
+// Verify that the modulo schedule pass sets tt.num_stages on the inner loop.
+// For a single-MMA GEMM, all MMAs are in the same stage so tt.autows is
+// skipped, and inner loops no longer emit loop.stage/loop.cluster attrs
+// (those are only emitted on outer loops via emitScheduleAttributes).
 //
 // CHECK-LABEL: @gemm_inner_loop
-// Cluster IDs are dense ranks of modulo cycles within each stage (Step 2.5).
-// Stages processed in reverse order: higher stage -> lower cluster ID.
-// Same cycle -> same cluster; different cycle -> different cluster.
-// CHECK: tt.descriptor_load {{.*}} {loop.cluster = 0 : i32, loop.stage = 0 : i32}
-// CHECK: tt.descriptor_load {{.*}} {loop.cluster = 1 : i32, loop.stage = 0 : i32}
-// CHECK: ttg.local_alloc {{.*}} {loop.cluster = 2 : i32, loop.stage = 0 : i32}
-// CHECK: ttg.local_alloc {{.*}} {loop.cluster = 3 : i32, loop.stage = 0 : i32}
-// CHECK: ttng.tc_gen5_mma {{.*}} {loop.cluster = 0 : i32, loop.stage = 1 : i32}
-// CHECK: ttng.tmem_load {{.*}} {loop.cluster = 0 : i32, loop.stage = 2 : i32}
-// tt.num_stages = max_stage + 1 (set so downstream pipelining recognises
-// the loop as scheduled, even for single-stage modulo schedules).
-// tt.num_buffers attrs on local_allocs are added by the next stack diff
-// (Phase 1 buffer allocation on ScheduleGraph).
-// CHECK: tt.modulo_ii = 1038 : i32
-// CHECK-SAME: tt.num_stages = 3 : i32
-// CHECK-SAME: tt.scheduled_max_stage = 2 : i32
+// CHECK: scf.for
+// CHECK-NOT: loop.stage
+// CHECK-NOT: loop.cluster
+// CHECK-NOT: tt.autows
+// CHECK: tt.num_stages = 2 : i32
 tt.func @gemm_inner_loop(
     %a_desc: !tt.tensordesc<tensor<128x64xf16>>,
     %b_desc: !tt.tensordesc<tensor<64x128xf16>>

test/TritonGPU/modulo-ws-partition.mlir

Lines changed: 10 additions & 13 deletions
@@ -8,21 +8,18 @@

 module attributes {"ttg.num-warps" = 4 : i32, ttg.target = "cuda:100"} {

-// Verify that Pass B assigns utilization-driven ttg.partition attrs on a
-// persistent kernel with a WS outer loop containing an inner K-loop.
-// Expected partitions: MEM=0, TC=1, CUDA(tmem_load)=2.
-// Shared/scalar ops get allParts [0,1,2].
+// Verify that the modulo schedule pass runs on the inner loop and the
+// ws-partition pass processes the outer WS loop. With selfLatency=1, the
+// single-MMA GEMM inner loop gets tt.num_stages=2 and no tt.autows
+// (all MMAs in same stage). The outer loop gets tt.warp_specialize.
 //
 // CHECK-LABEL: @persistent_gemm_ws_partition
-// MEM ops (descriptor_load, local_alloc) → partition 0
-// CHECK: tt.descriptor_load {{.*}} ttg.partition = array<i32: 0>
-// CHECK: tt.descriptor_load {{.*}} ttg.partition = array<i32: 0>
-// CHECK: ttg.local_alloc {{.*}} ttg.partition = array<i32: 0>
-// CHECK: ttg.local_alloc {{.*}} ttg.partition = array<i32: 0>
-// TC ops (tc_gen5_mma) → partition 1
-// CHECK: ttng.tc_gen5_mma {{.*}} ttg.partition = array<i32: 1>
-// CUDA ops (tmem_load) → partition 2
-// CHECK: ttng.tmem_load {{.*}} ttg.partition = array<i32: 2>
+// CHECK: scf.for
+// Inner loop has tt.num_stages from modulo schedule
+// CHECK: scf.for
+// CHECK: tt.num_stages = 2 : i32
+// Outer loop has tt.warp_specialize
+// CHECK: tt.warp_specialize
 tt.func @persistent_gemm_ws_partition(
     %a_desc: !tt.tensordesc<tensor<128x64xf16>>,
     %b_desc: !tt.tensordesc<tensor<64x128xf16>>,

third_party/nvidia/backend/compiler.py

Lines changed: 17 additions & 5 deletions
@@ -400,12 +400,24 @@ def make_ttgir(mod, metadata, opt, capability):
         passes.ttgpuir.add_optimize_accumulator_init(pm)
         passes.ttgpuir.add_hoist_tmem_alloc(pm, False)
         nvidia.passes.ttnvgpuir.add_promote_lhs_to_tmem(pm)
-        nvidia.passes.hopper.add_data_partitioning(pm, 1)
-        if knobs.nvidia.use_modulo_schedule:
+        if knobs.nvidia.use_modulo_schedule is not None:
+            # Modulo schedule runs BEFORE data partitioning so it can
+            # see MMA ops before they're moved into WS regions. It
+            # sets tt.autows annotations (stage/order) on MMA ops.
+            # TRITON_USE_MODULO_SCHEDULE=1 (default algo: rau)
+            # TRITON_USE_MODULO_SCHEDULE=sms|exhaustive|random
             nvidia.passes.hopper.add_modulo_schedule(pm)
-        else:
-            passes.ttgpuir.add_assign_latencies(pm, opt.num_stages, use_meta_swp_schedule)
-            passes.ttgpuir.add_schedule_loops(pm, opt.num_stages, use_meta_swp_schedule)
+        nvidia.passes.hopper.add_data_partitioning(pm, 1)
+        # assign_latencies sets tt.latency on loads/MMAs (stage-distance
+        # latencies). schedule_loops reads tt.latency AND tt.autows:
+        # when MMA ops have tt.autows, scheduleKeyOpsAnnotation places
+        # them at the annotated stages/clusters while scheduling all
+        # other ops (loads, softmax, barriers) via the standard
+        # latency-based heuristic. Without assign_latencies, the WS
+        # pass's internal scheduleLoops has no latencies and can't
+        # enter the code path that reads tt.autows annotations.
+        passes.ttgpuir.add_assign_latencies(pm, opt.num_stages, use_meta_swp_schedule)
+        passes.ttgpuir.add_schedule_loops(pm, opt.num_stages, use_meta_swp_schedule)
         if not knobs.nvidia.use_meta_ws:
             passes.ttgpuir.add_warp_specialize(pm, opt.num_stages)
         else:

third_party/nvidia/hopper/lib/Transforms/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ add_triton_library(NVHopperTransforms
   ModuloScheduling/LatencyModel.cpp
   ModuloScheduling/DataDependenceGraph.cpp
   ModuloScheduling/ModuloReservationTable.cpp
+  ModuloScheduling/SwingScheduler.cpp
+  ModuloScheduling/ExhaustiveScheduler.cpp
   ModuloScheduling/ModuloSchedulePass.cpp
   ModuloScheduling/ModuloWSPartitionPass.cpp
   ModuloScheduling/ModuloScheduleGraph.cpp
