
Commit 8af0aa3: Updates
Parent: df7abef


docs/best_practices.md (11 additions, 11 deletions)
@@ -18,13 +18,13 @@ Let’s begin with a simple example for users who are new to NVBench and want to
 ```cpp
 void sequence_bench(nvbench::state& state) {
   auto data = thrust::device_vector<int>(10);
-  state.exec([&data](nvbench::launch& launch) {
+  state.exec([&data](nvbench::launch&) {
     thrust::sequence(data.begin(), data.end());
   });
 }
 NVBENCH_BENCH(sequence_bench);
 ```
-Will this code work as-is? Depending on the build system configuration, compilation may succeed but generate warnings indicating that `launch` is an unused parameter. The code may or may not execute correctly. This often occurs when users, accustomed to a sequential programming mindset, overlook the fact that GPU architectures are highly parallel. Proper use of streams and synchronization is essential for accurately measuring performance in benchmark code.
+Will this code run correctly as written? While it may compile successfully, runtime behavior isn’t guaranteed. This is a common pitfall for developers used to sequential programming, who may overlook the massively parallel nature of GPU architectures. To ensure accurate performance measurement in benchmark code, proper use of streams and synchronization is crucial.

 A common mistake in this context is neglecting stream specification: NVBench requires knowledge of the exact CUDA stream being targeted to correctly trace kernel execution and measure performance. Therefore, users must explicitly provide the stream to be benchmarked. For example, passing the NVBench launch stream ensures correct execution and accurate measurement:

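The stream-passing fix the paragraph above refers to can be sketched as follows. This is a minimal illustration, assuming Thrust's stream-attached execution policy (`thrust::device.on(...)`) and NVBench's `launch.get_stream()` accessor, not necessarily the exact code in the full document:

```cpp
#include <nvbench/nvbench.cuh>

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>

void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    // Launch the algorithm on the stream NVBench is timing, so the
    // measurement covers exactly the kernels this benchmark issues.
    thrust::sequence(thrust::device.on(launch.get_stream()),
                     data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```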
@@ -74,7 +74,7 @@ NVBENCH_BENCH(sequence_bench);
 When the benchmark is executed, results are displayed without issues. However, users, particularly in a multi-GPU environment, may observe that more results are collected than expected:

 ```bash
-user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
+user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
 # Devices

 ## [0] `Quadro RTX 8000`
@@ -106,9 +106,9 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
 # Log

 Run: [1/2] sequence_bench [Device=0]
-Pass: Cold: 0.006150ms GPU, 0.009768ms CPU, 0.50s total GPU, 4.52s total wall, 81312x
+Pass: Cold: 0.006150ms GPU, 0.009768ms CPU, 0.50s total GPU, 4.52s total wall, 81312x
 Run: [2/2] sequence_bench [Device=1]
-Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 63952x
+Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 63952x

 # Benchmark Results

@@ -136,7 +136,7 @@ user@nvbench-test:~/nvbench/build/bin$ export CUDA_VISIBLE_DEVICES=0
 Now, if we rerun:

 ```bash
-user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
+user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
 # Devices

 ## [0] `Quadro RTX 8000`
@@ -155,7 +155,7 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
 # Log

 Run: [1/1] sequence_bench [Device=0]
-Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x
+Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x

 # Benchmark Results

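Besides masking GPUs through the environment, NVBench can restrict execution itself. As a sketch, assuming the `--devices` option listed in NVBench's CLI help (verify against your version's `--help` output):

```bash
# Limit the benchmark to device 0 through NVBench's own CLI rather than
# CUDA_VISIBLE_DEVICES (assumes the --devices option is available).
./sequence_bench --devices 0
```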
@@ -207,9 +207,9 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench -a Num=[10,100000]
 # Log

 Run: [1/2] sequence_bench [Device=0 Num=10]
-Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x
+Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x
 Run: [2/2] sequence_bench [Device=0 Num=100000]
-Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x
+Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x

 # Benchmark Results

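For readers wondering where the `Num` axis swept above comes from: it is declared on the benchmark and read back inside the body. A minimal sketch using NVBench's `add_int64_axis` and `state.get_int64` APIs (the document's actual benchmark body lies outside this hunk):

```cpp
#include <nvbench/nvbench.cuh>

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>

void sequence_bench(nvbench::state& state) {
  // Each run receives one value from the "Num" axis.
  const auto num = state.get_int64("Num");
  auto data = thrust::device_vector<int>(num);
  state.exec([&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()),
                     data.begin(), data.end());
  });
}
// Default axis values; `-a Num=[10,100000]` overrides them at run time.
NVBENCH_BENCH(sequence_bench).add_int64_axis("Num", {10, 100000});
```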
@@ -267,7 +267,7 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_transfor
 NVBench provides a convenient script under `nvbench/scripts` called `nvbench_compare.py`. After copying the JSON files to the scripts folder:

 ```bash
-user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json
+user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json
 ['sequence_ref.json', 'sequence_transform.json']
 # sequence_bench

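For context, the JSON inputs compared above are produced with NVBench's `--json` flag, one file per variant under comparison. A sketch, with file names mirroring the transcript (each command is assumed to run against the corresponding build):

```bash
# Record results of each build to JSON, then copy them next to the script.
./sequence_bench --json sequence_ref.json        # baseline variant
./sequence_bench --json sequence_transform.json  # modified variant
cp sequence_*.json ../../scripts/
```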
@@ -288,7 +288,7 @@ user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequ
 - Failure (diff > min_noise): 0
 ```

-We can see that the performance of the two approaches is essentially the same.
+We can see that the performance of the two approaches is essentially the same.

 Users can also use the JSON files to trace performance regressions in CI.

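As a sketch of that CI idea: record a JSON per commit and fail the job when the comparison reports a regression. The file names here are hypothetical, and grepping the report's failure count is an assumption; adapt it to however your pipeline consumes `nvbench_compare.py` output.

```bash
# Hypothetical CI step: compare the current commit against a stored reference.
./sequence_bench --json current.json
python3 nvbench_compare.py reference.json current.json | tee compare.log
# The report ends with a failure count; treat a nonzero count as a regression.
# (Parsing the summary line this way is an assumption.)
if grep -E 'Failure \(diff > min_noise\): [1-9]' compare.log; then
  echo "Performance regression detected" >&2
  exit 1
fi
```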
