
[DRAFT][DNR] Update FP16 warp-pipelined GEMM#9978

Draft
jungpark-mlir wants to merge 13 commits into triton-lang:main from jungpark-mlir:f16ref

Conversation

@jungpark-mlir
Contributor

No description provided.

…d loops

When two warp-pipelined loops execute consecutively, ConvertWarpPipeline
previously emitted a full reconverge/re-phase-shift/pre-barrier sequence
between them:

    scf.for { loop 1 }
    cond_barrier(warpLow)     ← post-loop reconverge
    ttg.barrier local         ← pre-barrier for loop 2
    cond_barrier(warpHigh)    ← pre-loop phase shift
    scf.for { loop 2 }

The post-loop reconverge and pre-loop phase shift are complementary
predicates on the same counter-based S_BARRIER, so they cancel out.
The intervening ttg.barrier local is redundant when loop 1's
wrap-around cluster barrier already includes a local fence (i.e. the
dependency analysis determined an LDS read/write hazard exists across
the wrap-around point). In that case, all pending LDS writes are
already resolved before loop 1 yields, and ModuleMembarAnalysis will
not need to insert additional barriers between the loops.

This patch adds a post-processing pass (eliminateRedundantCondBarriers)
that detects this pattern and erases the three redundant ops, reducing
the barrier overhead to:

    scf.for { loop 1 }
    scf.for { loop 2 }
    cond_barrier(warpLow)     ← final reconverge only

The pass runs after all scf.for loops have been converted (patternFor)
but before execute_regions are inlined (patternInline), preserving the
scf.for / cond_barrier adjacency needed for pattern matching.
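
The peephole described above can be modeled as a small list rewrite. This is a toy Python sketch only — the real pass matches MLIR ops in C++, and the string spellings of the ops here are illustrative:

```python
# Toy model of eliminateRedundantCondBarriers: erase the
# reconverge / pre-barrier / phase-shift triple between two
# back-to-back pipelined scf.for ops when the first loop's
# wrap-around cluster barrier already includes a local fence.
# Op names are illustrative stand-ins for the real MLIR ops.

REDUNDANT = ["cond_barrier(warpLow)", "ttg.barrier local", "cond_barrier(warpHigh)"]

def eliminate_redundant_cond_barriers(ops, first_loop_has_local_fence):
    if not first_loop_has_local_fence:
        return ops
    out, i = [], 0
    while i < len(ops):
        if (ops[i] == "scf.for"
                and ops[i + 1:i + 4] == REDUNDANT
                and i + 4 < len(ops) and ops[i + 4] == "scf.for"):
            out.append(ops[i])   # keep loop 1
            i += 4               # skip the three redundant barrier ops
        else:
            out.append(ops[i])
            i += 1
    return out
```

Applied to the sequence sketched above, only the two loops and the final reconverge survive.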

Also updates the f16_gemm_warp_pipeline_gfx1250.py example to use
range() (producing scf.for) instead of static_range() (which unrolls
at the Python level) for the epilogue loop, and wraps its stages in
warp_pipeline_stage annotations so the back-to-back optimization can
apply.

Extend the warp-pipeline infrastructure to handle loops unrolled at the
Python level (e.g. via static_range/ttgl.static_range). Previously,
warp-pipelining only worked with scf.for loops; unrolled loops produce
flat sequences of border markers in the IR, which were silently ignored.

Three main changes:

1. WarpPipeliner: add createFlatPipeline()
   Scans each block for triton.warp_pipeline.border markers outside
   scf.for.  Groups the operations between borders into clusters and
   wraps each in an scf.execute_region with triton.warp_pipeline.stage,
   triton.warp_pipeline.priority, and no_inline attributes — the same
   representation createPipeline() produces for loop bodies.
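
The clustering step can be sketched as splitting a flat op sequence on the border markers (a hypothetical Python model; the real implementation walks MLIR blocks):

```python
# Group the ops between triton.warp_pipeline.border markers into clusters,
# each of which would then be wrapped in an scf.execute_region carrying the
# stage/priority/no_inline attributes. Illustrative sketch only.

def cluster_between_borders(ops, border="triton.warp_pipeline.border"):
    clusters, current = [], []
    for op in ops:
        if op == border:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(op)
    if current:
        clusters.append(current)
    return clusters
```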

2. ConvertWarpPipeline: add processUnrolledPipelineRegions() + emitPipelinedFlat()
   After the existing patternFor converts scf.for loops, this new pass
   walks each function block for contiguous sequences of flat
   scf.execute_region ops (with triton.warp_pipeline.stage).  For each
   sequence it emits the full barrier structure: pre-barrier, phase
   shift (cond_barrier warpHigh), linear dependency analysis for cluster
   barriers (no wrap-around since the sequence is finite), priority
   management (s_setprio), and post-sequence reconverge (cond_barrier
   warpLow).  The execute_regions are then inlined by the existing
   InlineWarpPipelineExecuteRegionPattern.

   Also extends eliminateRedundantCondBarriers() to handle the case
   where a pipelined scf.for is immediately followed by a flat pipeline
   (instead of only scf.for → scf.for).  When the first loop's
   wrap-around barrier includes a local fence, the intervening
   reconverge + pre-barrier + phase-shift are redundant and eliminated.
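
The barrier structure emitted around a flat sequence can be sketched as follows. This is a hypothetical model under stated assumptions: the `boundary_has_hazard` predicate stands in for the linear dependency analysis, and the alternating `s_setprio` values are illustrative, not the actual priority scheme:

```python
# Sketch of emitPipelinedFlat's output shape: pre-barrier, phase shift,
# cluster barriers at hazardous stage boundaries (no wrap-around pair,
# since the sequence is finite), and a final reconverge.

def emit_flat_pipeline(stages, boundary_has_hazard):
    ops = ["ttg.barrier local",          # pre-barrier
           "cond_barrier(warpHigh)"]     # phase shift
    for i, stage in enumerate(stages):
        ops.append(f"s_setprio {i % 2}") # priority management (illustrative)
        ops.append(stage)
        # linear analysis: the last stage has no following boundary
        if i + 1 < len(stages) and boundary_has_hazard(i):
            ops.append("s_barrier")      # cluster barrier at the boundary
    ops.append("cond_barrier(warpLow)")  # post-sequence reconverge
    return ops
```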

3. Gluon frontend: assert warp_pipeline_stage is inside a for loop
   Since the compiler now supports flat border markers, there is a risk
   that users place warp_pipeline_stage outside any loop, which has no
   meaningful pipelining semantics.  A for_loop_depth counter is added
   to GluonSemantic and incremented/decremented in code_generator's
   visit_For (covering both range and static_range).  warp_pipeline_stage
   asserts for_loop_depth > 0 at exit.
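
A minimal model of this guard, assuming names that mirror the description (not the actual Gluon source):

```python
from contextlib import contextmanager

class GluonSemantic:
    """Toy stand-in carrying only the new counter."""
    def __init__(self):
        self.for_loop_depth = 0

@contextmanager
def visit_For(semantic):
    # incremented/decremented around loop body generation,
    # covering both range and static_range
    semantic.for_loop_depth += 1
    try:
        yield
    finally:
        semantic.for_loop_depth -= 1

def warp_pipeline_stage_exit(semantic):
    assert semantic.for_loop_depth > 0, \
        "warp_pipeline_stage must be used inside a for loop"
```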

The f16 GEMM example kernel is updated to use ttgl.static_range for the
epilogue loop, exercising the new flat pipeline path end-to-end.

Lit tests added for both WarpPipeliner (flat_pipeline_example) and
ConvertWarpPipeline (flat_pipeline_backend, back_to_back_for_then_flat).

Factor out the duplicated pre-barrier + phase-shift setup and the
post-pipeline reconverge logic from emitPipelinedFor and
emitPipelinedFlat into shared helpers emitPipelinePrelude and
emitPipelinePostlude. NFC.

Unify the duplicated pairwise dependency analysis from emitPipelinedFor
(circular/wrap-around) and emitPipelinedFlat (linear) into a single
analyzePipelineDependencies function parameterized by `bool circular`.
NFC.
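
The unified walk amounts to checking consecutive stage pairs, plus the wrap-around (last, first) pair only in the circular case. A sketch, with `stages_conflict` standing in for the real LDS read/write hazard check:

```python
# Unified pairwise dependency analysis, parameterized by `circular`.
# Circular mode adds the wrap-around boundary used by pipelined scf.for;
# linear mode (flat sequences) stops at the last consecutive pair.

def analyze_pipeline_dependencies(stages, stages_conflict, circular):
    pairs = list(zip(stages, stages[1:]))
    if circular and len(stages) > 1:
        pairs.append((stages[-1], stages[0]))  # wrap-around boundary
    return [stages_conflict(a, b) for a, b in pairs]
```
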
…rier exists

emitPipelinedFlat unconditionally inserted a new cluster barrier
(s_barrier) at every stage boundary, ignoring pre-existing barrier
ops (e.g., async_wait) between execute_regions. This produced two
barriers at the same boundary.

Mirror the emitPipelinedFor logic: scan between consecutive stages
for existing barrier ops and wrap them with sched_barriers instead
of inserting a new one.
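
The dedup rule can be sketched like this (a toy model; the barrier-op set and `sched_barrier` placement are illustrative):

```python
# If a barrier-like op (e.g. async_wait) already sits between two stages,
# fence it with sched_barriers and reuse it; otherwise insert a fresh
# s_barrier — mirroring the emitPipelinedFor behavior described above.

BARRIER_LIKE = {"s_barrier", "async_wait"}

def place_cluster_barrier(ops_between_stages):
    for i, op in enumerate(ops_between_stages):
        if op in BARRIER_LIKE:
            return (ops_between_stages[:i]
                    + ["sched_barrier", op, "sched_barrier"]
                    + ops_between_stages[i + 1:])
    return ops_between_stages + ["s_barrier"]  # none found: insert a new one
```
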

Add a --tdm_store flag to use tensor_store_from_lds instead of
global_store for accumulator write-back. When enabled, the
accumulator is written to LDS (PaddedSharedLayout) and then DMA'd
to global memory via a TDM async store.

Halves the epilogue LDS footprint and store bandwidth when used with
the TDM store. The accumulator is always fp32 for WMMA; the downcast
happens after the compute loop.

1. TDMUtility: gate the PR triton-lang#9360 store padding adjustment
   behind TRITON_AMDGPU_WA_STORE_PAD=1. When set, it skips the
   tile_dim0 widening and relies on HW pad_enable for
   tensor_store_from_lds.

2. GEMM kernel: relax the tolerance to 1e-3 for --out16 to account
   for FP16 rounding (two independent FP32->FP16 casts can differ
   by 1 ULP, rtol ≈ 2^-10).
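
As a sanity check on that bound, Python's struct module can round through IEEE half precision (format 'e'); an FP16 ULP near 1.0 is 2^-10 ≈ 9.8e-4, so a 1-ULP disagreement fits under rtol = 1e-3:

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half via struct's 'e' format."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

ulp = 2.0 ** -10                 # FP16 ULP near 1.0
a = to_fp16(1.0 + 0.4 * ulp)     # rounds down to 1.0
b = to_fp16(1.0 + 0.6 * ulp)     # rounds up to 1.0 + 2^-10
assert a == 1.0 and b == 1.0 + ulp
assert abs(b - a) <= ulp < 1e-3  # a 1-ULP gap is within the relaxed rtol
```
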
