- Compare per-workgroup compaction built on shared atomics versus subgroup ballot ranking.
- How much does subgroup ballot reduce the overhead of local compaction across sparsity regimes?
- Variants: shared_atomic_block, subgroup_ballot
- Valid ratios: 5%, 25%, 50%, 75%, 95%
- Compact each workgroup into its own fixed output segment.
- Validate exact counts for both variants and stable ordering only for the subgroup-ballot path.
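The compaction and validation steps above can be modeled on the CPU as a reference for checking GPU output. The following is a minimal sketch, not the benchmark's kernel: the function names `ballot_compact` and `validate`, the subgroup size of 32, and the list-based layout are all assumptions made for illustration. Each subgroup builds a bitmask of valid lanes, and each valid lane takes the number of set bits below it as its rank, which is what makes the ballot path order-preserving.

```python
def ballot_compact(values, valid, subgroup_size=32):
    """CPU model of ballot-ranked compaction for one workgroup (hypothetical helper)."""
    out = []
    for start in range(0, len(values), subgroup_size):
        flags = valid[start:start + subgroup_size]
        # Ballot: bit i of the mask is set when lane i holds a valid element.
        mask = 0
        for lane, flag in enumerate(flags):
            if flag:
                mask |= 1 << lane
        # Offset of this subgroup inside the workgroup's output segment.
        base = len(out)
        out.extend([None] * bin(mask).count("1"))
        for lane, flag in enumerate(flags):
            if flag:
                # Rank: number of valid lanes below this lane.
                rank = bin(mask & ((1 << lane) - 1)).count("1")
                out[base + rank] = values[start + lane]
    return out


def validate(values, valid, compacted, check_order):
    """Check the exact count always; check stable order only when requested."""
    expected = [v for v, f in zip(values, valid) if f]
    if len(compacted) != len(expected):
        return False
    if check_order:
        return compacted == expected
    return True


values = list(range(70))
valid = [v % 3 == 0 for v in values]
result = ballot_compact(values, valid)
print(validate(values, valid, result, check_order=True))
```

In this model the shared-atomic variant would be validated with `check_order=False`, since an atomic append gives no guarantee about the order in which lanes claim output slots.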
- Median GPU time by valid ratio.
- Speedup of subgroup ballot vs shared atomic append.
- Effective payload GB/s using actual valid-count writes.
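The three metrics above reduce to simple arithmetic over the recorded GPU timings. The sketch below shows one way to compute them, under stated assumptions: the helper names are hypothetical, timings are in milliseconds, and "payload" is taken to mean only the bytes actually written for valid elements (input reads are not counted). The timings in the usage lines are placeholders, not measurements.

```python
from statistics import median


def effective_gbps(valid_count, element_bytes, gpu_time_ms):
    """Payload throughput in GB/s, counting only the valid elements written."""
    written_bytes = valid_count * element_bytes
    return written_bytes / (gpu_time_ms * 1e-3) / 1e9


def speedup(atomic_times_ms, ballot_times_ms):
    """Ratio of median shared-atomic time to median subgroup-ballot time."""
    return median(atomic_times_ms) / median(ballot_times_ms)


# Placeholder timings for a single valid ratio (illustrative only).
atomic_ms = [2.0, 4.0, 3.0]
ballot_ms = [1.0, 1.5, 2.0]
print(speedup(atomic_ms, ballot_ms))
print(effective_gbps(1_000_000, 4, median(ballot_ms)))
```

Computing throughput from the actual valid count rather than the input size keeps the figure comparable across sparsity regimes, since a 5% run writes far fewer bytes than a 95% run over the same input.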
- This isolates subgroup compaction mechanics without the added cost of a global scan pipeline.
- Ordering guarantees should be interpreted separately from raw throughput.