- Measure block-scan performance and the effect of batching multiple items per thread.
- What is the best `items_per_thread` tradeoff for the shared-memory scan baseline?
- `items_per_thread_1`, `items_per_thread_4`, `items_per_thread_8`
- Run the same inclusive-scan kernel shape while changing only how many items each thread handles.
- Keep output semantics, correctness rules, and timing policy fixed across the sweep.
- Median GPU time by `items_per_thread`.
- Throughput by configuration.
- Practical default for later scan and compaction work.
- More work per thread is not automatically better: it stops paying off once it reduces occupancy or inflates per-thread local state.
- This is the shared-memory scan baseline that later subgroup variants should be compared against.
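
The sweep above changes only how many items each thread handles, so every configuration has to produce the same inclusive scan. A minimal CPU reference model of that idea is sketched below. It is not the benchmark's kernel: it assumes a sum operator and a blocked layout where each thread owns a contiguous run of items. The names `block_inclusive_scan`, `block_size`, and `summarize` are hypothetical and exist only for illustration.

```python
from itertools import accumulate
from statistics import median


def block_inclusive_scan(data, block_size, items_per_thread):
    """CPU reference model of a blocked inclusive sum scan (illustrative only)."""
    tile = block_size * items_per_thread  # items covered by one block
    out = []
    carry = 0  # running total carried across tiles
    for base in range(0, len(data), tile):
        chunk = data[base:base + tile]
        # Step 1: each "thread" serially scans its own contiguous items.
        per_thread = [
            list(accumulate(chunk[t:t + items_per_thread]))
            for t in range(0, len(chunk), items_per_thread)
        ]
        # Step 2: exclusive scan of the per-thread totals
        # (this models the step a GPU kernel would do in shared memory).
        totals = [p[-1] for p in per_thread]
        offsets = [0] + list(accumulate(totals))[:-1]
        # Step 3: each thread adds its offset plus the carry from earlier tiles.
        for p, off in zip(per_thread, offsets):
            out.extend(v + off + carry for v in p)
        carry += sum(totals)
    return out


def summarize(samples_ms, n_items):
    """Return (median time in ms, throughput in items per second)."""
    med = median(samples_ms)
    return med, n_items / (med / 1000.0)
```

A model like this can serve as the correctness oracle for each `items_per_thread` variant, and `summarize` shows one way to derive the two reported outputs from a list of per-run GPU timings. Using the median rather than the mean keeps a single slow run from skewing the comparison.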