Skip to content

Latest commit

 

History

History
25 lines (19 loc) · 884 Bytes

File metadata and controls

25 lines (19 loc) · 884 Bytes

Experiment 22: Prefix Sum or Scan

1. Focus

  • Measure block-scan performance and the effect of batching multiple items per thread.

2. Question

  • What is the best items_per_thread tradeoff for the shared-memory scan baseline?

3. Variants

  • items_per_thread_1
  • items_per_thread_4
  • items_per_thread_8

4. Method

  • Run the same inclusive-scan kernel shape while changing only how many items each thread handles.
  • Keep output semantics, correctness rules, and timing policy fixed across the sweep.

5. Outputs

  • Median GPU time by items_per_thread.
  • Throughput by configuration.
  • Practical default for later scan and compaction work.

6. Interpretation

  • More work per thread is not automatically better if it reduces occupancy or inflates local state.
  • This is the shared-memory scan baseline that later subgroup variants should be compared against.