@@ -4,7 +4,7 @@ A basic kernel benchmark can be created with just a few lines of CUDA C++:

 ```cpp
 void my_benchmark(nvbench::state& state) {
-  state.exec([](nvbench::launch& launch) {
+  state.exec([](nvbench::launch& launch) {
     my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
   });
 }
@@ -97,7 +97,7 @@ void benchmark(nvbench::state& state)
   const auto num_inputs = state.get_int64("NumInputs");
   thrust::device_vector<int> data = generate_input(num_inputs);

-  state.exec([&data](nvbench::launch& launch) {
+  state.exec([&data](nvbench::launch& launch) {
     my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(data.begin(), data.end());
   });
 }
@@ -134,7 +134,7 @@ void benchmark(nvbench::state& state)
   const auto quality = state.get_float64("Quality");

   state.exec([&quality](nvbench::launch& launch)
-  {
+  {
     my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(quality);
   });
 }
@@ -153,7 +153,7 @@ void benchmark(nvbench::state& state)
   thrust::device_vector<int> data = generate_input(rng_dist);

   state.exec([&data](nvbench::launch& launch)
-  {
+  {
     my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(data.begin(), data.end());
   });
 }
@@ -182,7 +182,7 @@ void my_benchmark(nvbench::state& state, nvbench::type_list<T>)
   thrust::device_vector<T> data = generate_input<T>();

   state.exec([&data](nvbench::launch& launch)
-  {
+  {
     my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(data.begin(), data.end());
   });
 }
@@ -266,7 +266,6 @@ In general::

 More examples can be found in [examples/throughput.cu](../examples/throughput.cu).

-
 # Skip Uninteresting / Invalid Benchmarks

 Sometimes particular combinations of parameters aren't useful or interesting --
@@ -294,7 +293,7 @@ void my_benchmark(nvbench::state& state, nvbench::type_list<T, U>)
 // Skip benchmarks at compile time -- for example, always skip when T == U
 // (Note that the `type_list` argument defines the same type twice).
 template <typename SameType>
-void my_benchmark(nvbench::state& state,
+void my_benchmark(nvbench::state& state,
                   nvbench::type_list<SameType, SameType>)
 {
   state.skip("T must not be the same type as U.");
@@ -320,6 +319,15 @@ true:
   synchronize internally.
 - `nvbench::exec_tag::timer` requests a timer object that can be used to
   restrict the timed region.
+- `nvbench::exec_tag::no_batch` disables batch measurements. This both disables
+  them during execution to reduce runtime, and prevents their compilation to
+  reduce compile-time and binary size.
+- `nvbench::exec_tag::gpu` is an optional hint that prevents non-GPU benchmarking
+  code from being compiled for a particular benchmark. A runtime error is emitted
+  if the benchmark is defined with `set_is_cpu_only(true)`.
+- `nvbench::exec_tag::no_gpu` is an optional hint that prevents GPU benchmarking
+  code from being compiled for a particular benchmark. A runtime error is emitted
+  if the benchmark does not also define `set_is_cpu_only(true)`.

 Multiple execution tags may be combined using `operator|`, e.g.

@@ -370,7 +378,7 @@ Note that using manual timer mode disables batch measurements.
 void timer_example(nvbench::state& state)
 {
   // Pass the `timer` exec tag to request a timer:
-  state.exec(nvbench::exec_tag::timer,
+  state.exec(nvbench::exec_tag::timer,
     // Lambda now accepts a timer:
     [](nvbench::launch& launch, auto& timer)
     {
@@ -391,6 +399,79 @@ NVBENCH_BENCH(timer_example);
 See [examples/exec_tag_timer.cu](../examples/exec_tag_timer.cu) for a complete
 example.

+## Compilation hints: `nvbench::exec_tag::no_batch`, `gpu`, and `no_gpu`
+
+These execution tags are optional hints that disable the compilation of various
+code paths when they are not needed. They apply only to a single benchmark.
+
+- `nvbench::exec_tag::no_batch` prevents the execution and instantiation of the batch measurement backend.
+- `nvbench::exec_tag::gpu` prevents the instantiation of CPU-only benchmarking backends.
+  - Requires that the benchmark does not define `set_is_cpu_only(true)`.
+  - Optional; this has no effect on runtime measurements, but reduces compile-time and binary size.
+  - Host-side CPU measurements of GPU kernel execution time are still provided.
+- `nvbench::exec_tag::no_gpu` prevents the instantiation of GPU benchmarking backends.
+  - Requires that the benchmark defines `set_is_cpu_only(true)`.
+  - Optional; this has no effect on runtime measurements, but reduces compile-time and binary size.
+  - See also [CPU-only Benchmarks](#cpu-only-benchmarks).
+
+# CPU-only Benchmarks
+
+NVBench provides CPU-only benchmarking facilities that are intended for measuring
+significant CPU workloads. We do not recommend using these features for high-resolution
+CPU benchmarking -- other libraries (such as Google Benchmark) are more appropriate for
+such applications. Examples are provided in [examples/cpu_only.cu](../examples/cpu_only.cu).
+
+Note that NVBench still requires a CUDA compiler and runtime even if a project only contains
+CPU-only benchmarks.
+
+The `is_cpu_only` property of the benchmark toggles between GPU and CPU-only measurements:
+
+```cpp
+void my_cpu_benchmark(nvbench::state &state)
+{
+  state.exec([](nvbench::launch &) { /* workload */ });
+}
+NVBENCH_BENCH(my_cpu_benchmark)
+  .set_is_cpu_only(true); // Mark as CPU-only.
+```
+
+The optional `nvbench::exec_tag::no_gpu` hint may be used to reduce the compilation time and
+binary size of CPU-only benchmarks. An error is emitted at runtime if this tag is used while
+`is_cpu_only` is false.
+
+```cpp
+void my_cpu_benchmark(nvbench::state &state)
+{
+  state.exec(nvbench::exec_tag::no_gpu, // Prevent compilation of GPU backends
+             [](nvbench::launch &) { /* workload */ });
+}
+NVBENCH_BENCH(my_cpu_benchmark)
+  .set_is_cpu_only(true); // Mark as CPU-only.
+```
+
+The `nvbench::exec_tag::timer` execution tag is also supported by CPU-only benchmarks. This
+is useful for benchmarks that require additional per-sample setup/teardown. See the
+[`nvbench::exec_tag::timer`](#explicit-timer-mode-nvbenchexec_tagtimer) section for more
+details.
+
+```cpp
+void my_cpu_benchmark(nvbench::state &state)
+{
+  state.exec(nvbench::exec_tag::no_gpu | // Prevent compilation of GPU backends
+             nvbench::exec_tag::timer,   // Request a timer object
+             [](nvbench::launch &, auto &timer)
+             {
+               // Setup here
+               timer.start();
+               // timed workload
+               timer.stop();
+               // teardown here
+             });
+}
+NVBENCH_BENCH(my_cpu_benchmark)
+  .set_is_cpu_only(true); // Mark as CPU-only.
+```
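The setup / timed region / teardown split can be tried without NVBench using plain C++. The sketch below is only an illustration of the pattern: `manual_timer` and `run_sample` are hypothetical names, built on `std::chrono::steady_clock`, and are not NVBench's timer or its implementation.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a timer object (not NVBench's implementation).
// Only the time between start() and stop() is accumulated.
class manual_timer
{
  using clock_type = std::chrono::steady_clock;

  clock_type::time_point m_start{};
  clock_type::duration m_total{};
  bool m_running{false};

public:
  void start()
  {
    m_start   = clock_type::now();
    m_running = true;
  }

  void stop()
  {
    assert(m_running && "stop() called without a matching start()");
    m_total += clock_type::now() - m_start;
    m_running = false;
  }

  // Accumulated time of all start()/stop() regions, in seconds.
  double seconds() const
  {
    return std::chrono::duration<double>(m_total).count();
  }
};

// Runs one sample. The callable receives the timer and decides which part of
// its body is timed; setup and teardown outside start()/stop() are excluded.
template <typename Workload>
double run_sample(Workload &&workload)
{
  manual_timer timer;
  workload(timer);
  return timer.seconds();
}
```

A callable passed to `run_sample` would do its setup, call `timer.start()`, run the workload, call `timer.stop()`, and then tear down, mirroring the lambda in the example above.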
+
 # Beware: Combinatorial Explosion Is Lurking

 Be very careful of how quickly the configuration space can grow. The following
@@ -403,7 +484,7 @@ using value_types = nvbench::type_list<nvbench::uint8_t,
                                        nvbench::int32_t,
                                        nvbench::float32_t,
                                        nvbench::float64_t>;
-using op_types = nvbench::type_list<thrust::plus<>,
+using op_types = nvbench::type_list<thrust::plus<>,
                                     thrust::multiplies<>,
                                     thrust::maximum<>>;

@@ -418,7 +499,7 @@ NVBENCH_BENCH_TYPES(my_benchmark,

 ```
 960 total configs
-= 4 [T=(U8, I32, F32, F64)]
+= 4 [T=(U8, I32, F32, F64)]
 * 4 [U=(U8, I32, F32, F64)]
 * 4 [V=(U8, I32, F32, F64)]
 * 3 [Op=(plus, multiplies, max)]
@@ -427,8 +508,8 @@ NVBENCH_BENCH_TYPES(my_benchmark,

 For large configuration spaces like this, pruning some of the less useful
 combinations (e.g. `sizeof(init_type) < sizeof(output)`) using the techniques
-described in the "Skip Uninteresting / Invalid Benchmarks" section can help
-immensely with keeping compile / run times manageable.
+described in the [Skip Uninteresting / Invalid Benchmarks](#skip-uninteresting--invalid-benchmarks)
+section can help immensely with keeping compile / run times manageable.

 Splitting a single large configuration space into multiple, more focused
 benchmarks with reduced dimensionality will likely be worth the effort as well.
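
The growth is multiplicative: the number of configurations is the product of the number of values on each axis. A minimal sketch of that arithmetic in plain C++ follows; `count_configs` is a hypothetical helper written for this illustration and is not part of NVBench.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the number of configurations is the product of the
// number of values on each axis.
std::size_t count_configs(const std::vector<std::size_t> &axis_sizes)
{
  std::size_t total = 1;
  for (std::size_t n : axis_sizes)
  {
    assert(n > 0 && "every axis needs at least one value");
    total *= n;
  }
  return total;
}
```

For the four type axes listed above, `count_configs({4, 4, 4, 3})` gives 192, so the stated total of 960 implies a further factor of 5 from the remaining axes. Splitting works in the other direction: two smaller benchmarks add their configuration counts instead of multiplying them.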