- Compare baseline and hierarchical reduction strategies for a classic GPU primitive.
- When does shared-memory tree reduction beat a simpler global-atomic approach?
- `global_atomic`
- `shared_tree`
- Reduce the same deterministic input values with the same final operator in both variants.
- Sweep problem size across the available scratch budget and dispatch limit.
- Median GPU time by variant and size.
- Speedup of `shared_tree` vs `global_atomic`.
- Effective input-read throughput.
- Hierarchical reduction usually pays off only after the workload is large enough to amortize setup overhead.
- This establishes the pre-subgroup baseline for later reduction work.
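
As a rough illustration of the comparison described above, the following is a minimal CPU-side sketch in Python, not the GPU kernels themselves. It models the access pattern of each variant on the same deterministic input and shows how the three reported metrics could be derived from timing samples. The function names, the workgroup size of 64, the integer-sum operator, and the 4-byte element size are all assumptions made for illustration; none of them come from the experiment.

```python
from statistics import median

WORKGROUP_SIZE = 64  # assumed power of two; the source does not state a size


def make_input(n):
    # Deterministic small integers, so integer addition is exact and
    # both variants must agree bit for bit.
    return [(i * 7 + 3) % 11 for i in range(n)]


def global_atomic(values):
    # Models every invocation issuing one atomic add to a single accumulator.
    acc = 0
    for v in values:
        acc += v  # stands in for an atomic add on global memory
    return acc


def shared_tree(values, workgroup_size=WORKGROUP_SIZE):
    # Models each workgroup reducing its tile in shared memory with a
    # halving stride, then combining partials with the same operator.
    partials = []
    for start in range(0, len(values), workgroup_size):
        tile = values[start:start + workgroup_size]  # slice is a copy
        tile += [0] * (workgroup_size - len(tile))   # pad with the identity
        stride = workgroup_size // 2
        while stride > 0:
            for lane in range(stride):  # one barrier-separated step
                tile[lane] += tile[lane + stride]
            stride //= 2
        partials.append(tile[0])
    return sum(partials)


def speedup(times_atomic_ms, times_tree_ms):
    # Ratio of median times; values above 1 mean shared_tree is faster.
    return median(times_atomic_ms) / median(times_tree_ms)


def read_throughput_gbps(n, time_ms, bytes_per_element=4):
    # Effective input-read throughput: bytes read once, divided by GPU time.
    return n * bytes_per_element / (time_ms * 1e-3) / 1e9


if __name__ == "__main__":
    data = make_input(1000)
    assert global_atomic(data) == shared_tree(data)
```

The timing lists passed to `speedup` would come from repeated GPU timestamp measurements at one problem size. Checking that both variants return the same value before timing them guards against comparing a fast but incorrect kernel.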