Write-only benchmark exceeds 100% bandwidth

@gevtushenko and I have run into this curious case: When benchmarking `cub::DeviceTransform::Fill` to just fill a buffer with values (see [source](https://github.com/NVIDIA/cccl/blob/07f66bc22347b3ccb85d89e4efe81853b405b136/cub/benchmarks/bench/transform/fill.cu)), we sometimes get results like this:

```
## generate

### [0] NVIDIA GeForce RTX 5090

| T{ct} |   Elements{io}   | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil  | Samples | Batch GPU  |
|-------|------------------|---------|------------|--------|------------|--------|----------|--------------|---------|---------|------------|
|    I8 |     2^16 = 65536 |    314x |   9.475 us | 13.30% |   2.836 us | 34.84% |  23.108G |  23.108 GB/s |   1.29% | 230528x |   2.169 us |
|    I8 |   2^20 = 1048576 |    414x |   9.200 us | 10.08% |   2.683 us | 34.71% | 390.860G | 390.860 GB/s |  21.81% | 235432x |   2.124 us |
|    I8 |  2^24 = 16777216 |    262x |  15.828 us |  4.94% |   8.182 us |  1.94% |   2.050T |   2.050 TB/s | 114.41% | 100812x |   4.961 us |
|    I8 | 2^28 = 268435456 |    370x | 166.190 us |  0.79% | 158.758 us |  0.73% |   1.691T |   1.691 TB/s |  94.35% |   3338x | 158.649 us |
```
Notice the  114.41% BWUtil for the third run. Trying to figure out what happens, I run the benchmark under ncu and the memory workload analysis shows:
<img width="1468" height="424" alt="Image" src="https://github.com/user-attachments/assets/facc58fd-0acf-4359-96e7-7ed8b10637c6" />
It looks like the kernel did not write back the data from L2 to GMEM. This may explain why the kernel could finish faster than it would take to do the full write back to GMEM (which is what I guess nvbench assumes).

ncu shows me that smaller runs write back zero bytes from L2 to GMEM, and the larger 2^28 run writes back everything except about 62MB of data, which remains in L2. The experiment was done on an RTX 5090 with 92MB L2.

<img width="1468" height="424" alt="Image" src="https://github.com/user-attachments/assets/29ce46b8-bf35-4b91-9db7-ef6dcb78149c" />

It seems nvbench should incorporate a mechanism that forces the benchmarked kernel to do the full write back to GMEM, so bandwidth measurements on writes are accurate.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Write-only benchmark exceeds 100% bandwidth #270

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Write-only benchmark exceeds 100% bandwidth #270

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions