University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2
- Yuntian Ke
- Tested on: Windows 11, Intel Core Ultra 9 275HX @ 2.70GHz 32GB, RTX 5070 Ti 30160MB
The CPU result is used as the expected value for the other tests and is compared with the GPU methods.
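As a minimal sketch of what such a serial baseline looks like (the function name `cpuExclusiveScan` is illustrative and not taken from the project source), an exclusive scan writes the running sum of all preceding elements into each output slot:

```cpp
#include <cstddef>
#include <vector>

// Serial exclusive scan: out[i] is the sum of in[0..i-1], so out[0] is 0.
// cpuExclusiveScan is a hypothetical name used only for this illustration.
std::vector<int> cpuExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything before element i
        running += in[i];   // then fold in element i for the next slot
    }
    return out;
}
```

Applied to the first values of the test input shown in the output log (34, 32, 34, 21, ...), this yields 0, 34, 66, 100, ..., which agrees with the logged CPU scan result.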
Use the "Naive" algorithm to do the exclusive scan. At every level
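The naive approach can be sketched on the host as follows. This is a serial simulation under stated assumptions, not the project's kernel: each pass of the outer loop stands in for one kernel launch, each iteration of the inner loop for one GPU thread, and the name `naiveExclusiveScan` is hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-side simulation of the naive scan (illustrative only).
// Every pass reads from one buffer and writes to the other, so no element
// is overwritten before another "thread" has read it.
std::vector<int> naiveExclusiveScan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> a(in), b(n);  // ping-pong buffers
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t k = 0; k < n; ++k) {
            // Elements with a neighbor `offset` slots to the left add it in;
            // the rest are copied through unchanged.
            b[k] = (k >= offset) ? a[k - offset] + a[k] : a[k];
        }
        std::swap(a, b);
    }
    // The passes produce an inclusive scan; shift right by one slot
    // and insert the identity (0) to make it exclusive.
    std::vector<int> out(n, 0);
    for (std::size_t k = 1; k < n; ++k) {
        out[k] = a[k - 1];
    }
    return out;
}
```

Because every pass touches the whole array, the number of passes grows with log2 of the array size, which is where the extra memory traffic of this variant comes from.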
Use the "Work-Efficient" algorithm to do the exclusive scan and stream compaction. The scan treats the array as a balanced binary tree and runs in two phases, Up-Sweep and Down-Sweep. Stream compaction first builds the boolean array, then runs the exclusive scan we just implemented on it, and finally does the Scatter.
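The steps above can be sketched on the host as follows. This is a serial simulation rather than the project's CUDA code: the function names are hypothetical, each inner loop stands in for one kernel launch, and the sketch assumes that compaction removes zeros (which is what the test output suggests) and that non-power-of-two inputs are handled by zero padding.

```cpp
#include <cstddef>
#include <vector>

// Exclusive scan via Up-Sweep / Down-Sweep on a zero-padded copy (illustrative).
std::vector<int> workEfficientScan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    if (n == 0) return {};
    std::size_t padded = 1;
    while (padded < n) padded *= 2;          // round up to a power of two
    std::vector<int> data(padded, 0);
    for (std::size_t i = 0; i < n; ++i) data[i] = in[i];

    // Up-Sweep: build partial sums up the tree.
    for (std::size_t stride = 1; stride < padded; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < padded; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }
    // Down-Sweep: clear the root, then push prefix sums back down.
    data[padded - 1] = 0;
    for (std::size_t stride = padded / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < padded; i += 2 * stride) {
            const int left = data[i - stride];
            data[i - stride] = data[i];      // left child takes the parent's value
            data[i] += left;                 // right child takes parent + old left
        }
    }
    data.resize(n);
    return data;
}

// Stream compaction: bool map -> exclusive scan -> scatter.
std::vector<int> compactNonZero(const std::vector<int>& in) {
    const std::size_t n = in.size();
    if (n == 0) return {};
    std::vector<int> flags(n);
    for (std::size_t i = 0; i < n; ++i) flags[i] = (in[i] != 0) ? 1 : 0;
    const std::vector<int> indices = workEfficientScan(flags);
    // Number of kept elements: last index plus the last flag.
    const std::size_t count = static_cast<std::size_t>(indices[n - 1] + flags[n - 1]);
    std::vector<int> out(count);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) out[static_cast<std::size_t>(indices[i])] = in[i];
    }
    return out;
}
```

The scanned flag array doubles as the destination index for every kept element, so the scatter step needs no further coordination between threads.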
This version simply calls thrust::exclusive_scan(first, last, result).
I implemented this on top of the original Work-Efficient GPU Scan. To enable it, set "ENABLE_SHARED_MEMORY" to 1.
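One common way to structure a shared-memory scan is to process the array in fixed-size tiles. The sketch below is a host-side illustration of that idea and not the project's kernel: the local `tile` buffer stands in for a block's `__shared__` array, the names and the tile size are hypothetical, and the serial `carry` replaces what would be a second scan over the per-block totals on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scans one tile in place (Up-Sweep / Down-Sweep) and returns the tile total.
// The tile length must be a power of two.
int scanTile(std::vector<int>& tile) {
    const std::size_t m = tile.size();
    for (std::size_t stride = 1; stride < m; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < m; i += 2 * stride) {
            tile[i] += tile[i - stride];
        }
    }
    const int total = tile[m - 1];   // sum of the whole tile
    tile[m - 1] = 0;
    for (std::size_t stride = m / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < m; i += 2 * stride) {
            const int left = tile[i - stride];
            tile[i - stride] = tile[i];
            tile[i] += left;
        }
    }
    return total;
}

// Tiled exclusive scan: each element is loaded once and stored once;
// all tree operations happen inside the tile buffer.
std::vector<int> tiledExclusiveScan(const std::vector<int>& in, std::size_t tileSize = 8) {
    assert(tileSize > 0 && (tileSize & (tileSize - 1)) == 0);  // power of two
    const std::size_t n = in.size();
    std::vector<int> out(n);
    int carry = 0;  // sum of all earlier tiles
    for (std::size_t start = 0; start < n; start += tileSize) {
        const std::size_t count = (n - start < tileSize) ? (n - start) : tileSize;
        std::vector<int> tile(tileSize, 0);               // zero-padded "shared" buffer
        for (std::size_t i = 0; i < count; ++i) tile[i] = in[start + i];   // one load
        const int total = scanTile(tile);
        for (std::size_t i = 0; i < count; ++i) out[start + i] = tile[i] + carry;  // one store
        carry += total;
    }
    return out;
}
```

The point of the layout is that the global array is read and written once per element, while the many tree updates stay in the small per-tile buffer.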
Figure 1: Block Size Optimization
I did a rough block size optimization on size
Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan
The data I collected shows that when the array size is smaller than or equal to
Figure 3: Nsight Systems Analysis
Figure 4: Thrust vs Shared Memory GPU Scan
CPU Scan (blue) grows steeply: runtime increases superlinearly as array size doubles. By
- CPU Scan → computation-bound: it’s a serial loop with no parallelism; every element adds cost.
- Naive GPU Scan → memory I/O-bound: each step does redundant reads/writes across the whole array. Even though the GPU is parallel, global memory traffic dominates.
- Work-Efficient GPU Scan → still memory-bound, but with much better use of parallelism: each element is moved fewer times. Computation isn't the limiter here; memory access patterns are.
- Shared-Memory GPU Scan → also memory-bound, but with minimal DRAM traffic (just one load + one store per element). Most work happens in shared memory; runtime is capped by global bandwidth rather than arithmetic.
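To put rough numbers on "redundant reads/writes" versus "moved fewer times", the sketch below counts the additions each algorithm performs on a power-of-two array, using additions as a proxy for global memory accesses. The function names are hypothetical, and the counts follow from the textbook formulations of the two algorithms rather than from measurements of this project.

```cpp
#include <cstddef>

// Additions in the naive scan on n elements (n a power of two):
// the pass with a given offset updates the n - offset elements that have
// a neighbor that far to the left.
std::size_t naiveAdds(std::size_t n) {
    std::size_t adds = 0;
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        adds += n - offset;
    }
    return adds;
}

// Additions in the work-efficient scan: each tree level with a given stride
// touches n / (2 * stride) nodes, once in Up-Sweep and once in Down-Sweep.
std::size_t workEfficientAdds(std::size_t n) {
    std::size_t adds = 0;
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        adds += n / (2 * stride);  // Up-Sweep level
    }
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        adds += n / (2 * stride);  // Down-Sweep level
    }
    return adds;
}
```

For n = 1024 this gives 9217 additions for the naive scan and 2046 for the work-efficient one; the gap widens as n grows because the naive count scales with n log2 n while the work-efficient count scales with n.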
The shared-memory scan is faster because it minimizes global memory traffic (just one load/store per element), does all tree operations in on-chip shared memory, and avoids repeated kernel launches. The original global work-efficient scan was still memory-bound, but with much higher DRAM traffic.
****************
** SCAN TESTS **
****************
[ 34 32 34 21 13 7 8 8 18 4 49 7 43 ... 18 0 ]
==== cpu scan, power-of-two ====
elapsed time: 5.7368ms (std::chrono Measured)
[ 0 34 66 100 121 134 141 149 157 175 179 228 235 ... 205433928 205433946 ]
==== cpu scan, non-power-of-two ====
elapsed time: 3.9875ms (std::chrono Measured)
[ 0 34 66 100 121 134 141 149 157 175 179 228 235 ... 205433850 205433880 ]
passed
==== naive scan, power-of-two ====
elapsed time: 3.39638ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 2.90432ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.775456ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.752288ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.476736ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.419232ms (CUDA Measured)
passed
*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 0 1 2 1 2 0 3 2 2 3 3 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 15.9244ms (std::chrono Measured)
[ 2 1 2 1 2 3 2 2 3 3 2 2 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 15.9825ms (std::chrono Measured)
[ 2 1 2 1 2 3 2 2 3 3 2 2 3 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 36.1649ms (std::chrono Measured)
[ 2 1 2 1 2 3 2 2 3 3 2 2 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1.62368ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.40934ms (CUDA Measured)
passed

