This repository fully implements the reduction optimisation steps from Mark Harris's presentation:
Optimizing Parallel Reduction in CUDA - Mark Harris
make -j
./run
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 1650"
Compute capability = 7.5
Total global memory = 3896 MB
Multi processor Count = 16
Warp size = 32
Max grid size = [2147483647, 65535, 65535]
Max threads per block = 1024
Shared memory per block = 48 kB
Shared memory per multiprocessor = 64 kB
correct_sum: -2097152
reduce_0: 0.638 ms
reduce_1: 0.529 ms
reduce_2: 0.417 ms
reduce_3: 0.249 ms
reduce_4: 0.177 ms
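
For orientation, here is a minimal CUDA sketch of one shared-memory tree reduction step of the kind the presentation describes (sequential addressing), timed with CUDA events. It is not the code from this repository: the names reduce_sketch and reduce_on_device, the block size of 256 and the input of 2^21 elements all equal to -1 are assumptions made for illustration. That input was chosen only because its sum equals the correct_sum value printed above; the repository's actual input is not shown here. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not taken from this repository): each block reduces
// blockDim.x elements in shared memory using sequential addressing and
// writes one partial sum. blockDim.x is assumed to be a power of two.
__global__ void reduce_sketch(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread; pad with 0 past the end of the input.
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Halve the number of active threads each step; active threads stay contiguous.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        out[blockIdx.x] = sdata[0];
    }
}

// Hypothetical host helper: runs the kernel once, optionally reports the
// kernel time in milliseconds, and finishes the sum of the partials on the CPU.
long long reduce_on_device(const std::vector<int> &h_in, float *kernel_ms)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 256;  // assumed block size
    const unsigned int blocks = (n + threads - 1) / threads;

    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduce_sketch<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (kernel_ms != nullptr) {
        *kernel_ms = ms;
    }

    std::vector<int> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    long long sum = 0;
    for (int v : h_out) {
        sum += v;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return sum;
}

int main()
{
    std::vector<int> h_in(1u << 21, -1);  // assumed input: 2^21 elements, all -1
    float ms = 0.0f;
    long long sum = reduce_on_device(h_in, &ms);
    printf("sum: %lld\n", sum);
    printf("reduce_sketch: %.3f ms\n", ms);
    return 0;
}
```

Assuming nvcc is on the PATH, a file holding this sketch can be built with something like nvcc -O2 sketch.cu -o sketch. The timing covers a single kernel launch including any first-launch overhead, so it is not directly comparable to the reduce_0 to reduce_4 figures above.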