-
Notifications
You must be signed in to change notification settings - Fork 90
Description
@gevtushenko and I have run into this curious case: When benchmarking cub::DeviceTransform::Fill to just fill a buffer with values (see source), we sometimes get results like this:
## generate
### [0] NVIDIA GeForce RTX 5090
| T{ct} | Elements{io} | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-------|------------------|---------|------------|--------|------------|--------|----------|--------------|---------|---------|------------|
| I8 | 2^16 = 65536 | 314x | 9.475 us | 13.30% | 2.836 us | 34.84% | 23.108G | 23.108 GB/s | 1.29% | 230528x | 2.169 us |
| I8 | 2^20 = 1048576 | 414x | 9.200 us | 10.08% | 2.683 us | 34.71% | 390.860G | 390.860 GB/s | 21.81% | 235432x | 2.124 us |
| I8 | 2^24 = 16777216 | 262x | 15.828 us | 4.94% | 8.182 us | 1.94% | 2.050T | 2.050 TB/s | 114.41% | 100812x | 4.961 us |
| I8 | 2^28 = 268435456 | 370x | 166.190 us | 0.79% | 158.758 us | 0.73% | 1.691T | 1.691 TB/s | 94.35% | 3338x | 158.649 us |
Notice the 114.41% BWUtil for the third run. Trying to figure out what happens, I run the benchmark under ncu and the memory workload analysis shows:

It looks like the kernel did not write back the data from L2 to GMEM. This may explain why the kernel could finish faster than it would take to do the full write back to GMEM (which is what I guess nvbench assumes).
ncu shows me that smaller runs write back zero bytes from L2 to GMEM, and the larger 2^28 run writes back everything except about 62MB of data, which remains in L2. The experiment was done on an RTX 5090 with 92MB L2.
It seems nvbench should incorporate a mechanism that forces the benchmarked kernel to do the full write back to GMEM, so bandwidth measurements on writes are accurate.