[AMD] Generalize in-thread tree reduction to support ternary grouping for max/min (#9897)
Summary
- Generalize treeReduceBinary into treeReduce parameterized by arity,
enabling ternary (or higher) tree reductions when the target benefits
from it
- Add getReductionTreeArity(Operation*) to TargetInfoBase (default: 2)
so targets can request wider grouping per combiner op
- AMD override returns 3 for MaximumFOp/MinimumFOp/MaxNumFOp/MinNumFOp,
generating max(max(a,b), c) groups that LLVM folds into
v_maximum3_f32/v_minimum3_f32
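
The grouping described above can be sketched on plain values. This is a minimal illustration, not the code from this change: `treeReduceSketch` and `showShape` are hypothetical names, and the sketch works on a `std::vector` rather than on MLIR values. It assumes each level combines up to `arity` neighbours into one left-nested chain, which yields the max(max(a,b), c) shape when arity is 3.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an arity-parameterized tree reduction.
// Each pass splits `vals` into groups of up to `arity` neighbours and
// folds every group into one value as a left-nested chain.
template <typename T, typename Combine>
T treeReduceSketch(std::vector<T> vals, unsigned arity, Combine combine) {
  assert(arity >= 2 && "a group needs at least two operands");
  assert(!vals.empty() && "nothing to reduce");
  while (vals.size() > 1) {
    std::vector<T> next;
    for (std::size_t i = 0; i < vals.size(); i += arity) {
      std::size_t end = std::min(vals.size(), i + arity);
      T acc = vals[i];
      for (std::size_t j = i + 1; j < end; ++j)
        acc = combine(acc, vals[j]); // arity 3: max(max(a, b), c)
      next.push_back(acc);
    }
    vals = std::move(next);
  }
  return vals.front();
}

// Renders the expression shape produced for a given arity, so the
// grouping is visible without running a real combiner.
std::string showShape(const std::vector<std::string> &names, unsigned arity) {
  return treeReduceSketch(names, arity,
                          [](const std::string &a, const std::string &b) {
                            return "max(" + a + ", " + b + ")";
                          });
}
```

With arity 2, four inputs give the balanced shape max(max(a, b), max(c, d)); with arity 3, three inputs give a single max(max(a, b), c) group, the form the summary says LLVM folds into one ternary instruction.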
Motivation
The balanced binary tree reduction produces an alternating pattern: at
every other level, LLVM's DAG combiner cannot fold the results into
ternary instructions. LLVM matches max(max(a,b), c) → v_maximum3 only
when the inner max has a single use, and the intermediate results that
the binary tree feeds into the next level alternate between foldable
and unfoldable.
With arity=3, every group maps directly to a ternary instruction,
reducing max/min instruction count by ~23% (344 → 264 for a 256×256 f32
reduction). NVIDIA has no max3 equivalent, so the default arity=2
preserves existing behavior.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>