
Commit a6df59a

Add support for CPU-only benchmarking.
Fixes #95. CPU-only mode is enabled by setting the `is_cpu_only` property while defining a benchmark, e.g. `NVBENCH_BENCH(foo).set_is_cpu_only(true)`. An optional `nvbench::exec_tag::no_gpu` hint can also be passed to `state.exec` to avoid instantiating GPU benchmarking backends. Note that a CUDA compiler and CUDA runtime are always required, even if all benchmarks in a translation unit are CPU-only. Similarly, a new `nvbench::exec_tag::gpu` hint can be used to avoid compiling CPU-only backends for GPU benchmarks.
1 parent 1efed5f commit a6df59a

16 files changed, with 773 additions and 82 deletions.

README.md

Lines changed: 7 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -25,6 +25,9 @@ features:
2525
* Batch Measurements:
2626
* Executes the benchmark multiple times back-to-back and records total time.
2727
* Reports the average execution time (total time / number of executions).
28+
* [CPU-only Measurements](docs/benchmarks.md#cpu-only-benchmarks)
29+
* Measures the host-side execution time of a non-GPU benchmark.
30+
* Not suitable for microbenchmarking.
2831

2932
# Supported Compilers and Tools
3033

@@ -65,6 +68,7 @@ This repository provides a number of [examples](examples/) that demonstrate
6568
various NVBench features and use cases:
6669
6770
- [Runtime and compile-time parameter sweeps](examples/axes.cu)
71+
- [CPU-only benchmarking](examples/cpu_only.cu)
6872
- [Enums and compile-time-constant-integral parameter axes](examples/enums.cu)
6973
- [Reporting item/sec and byte/sec throughput statistics](examples/throughput.cu)
7074
- [Skipping benchmark configurations](examples/skip.cu)
@@ -171,6 +175,7 @@ testing and parameter tuning of individual kernels. For in-depth analysis of
171175
end-to-end performance of multiple applications, the NVIDIA Nsight tools are
172176
more appropriate.
173177
174-
NVBench is focused on evaluating the performance of CUDA kernels and is not
175-
optimized for CPU microbenchmarks. This may change in the future, but for now,
178+
NVBench is focused on evaluating the performance of CUDA kernels. It also provides
179+
CPU-only benchmarking facilities intended for non-trivial CPU workloads, but is
180+
not optimized for CPU microbenchmarks. This may change in the future, but for now,
176181
consider using Google Benchmark for high resolution CPU benchmarks.

docs/benchmarks.md

Lines changed: 93 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ A basic kernel benchmark can be created with just a few lines of CUDA C++:
44

55
```cpp
66
void my_benchmark(nvbench::state& state) {
7-
state.exec([](nvbench::launch& launch) {
7+
state.exec([](nvbench::launch& launch) {
88
my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
99
});
1010
}
@@ -97,7 +97,7 @@ void benchmark(nvbench::state& state)
9797
const auto num_inputs = state.get_int64("NumInputs");
9898
thrust::device_vector<int> data = generate_input(num_inputs);
9999
100-
state.exec([&data](nvbench::launch& launch) {
100+
state.exec([&data](nvbench::launch& launch) {
101101
my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(data.begin(), data.end());
102102
});
103103
}
@@ -134,7 +134,7 @@ void benchmark(nvbench::state& state)
134134
const auto quality = state.get_float64("Quality");
135135
136136
state.exec([&quality](nvbench::launch& launch)
137-
{
137+
{
138138
my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(quality);
139139
});
140140
}
@@ -153,7 +153,7 @@ void benchmark(nvbench::state& state)
153153
thrust::device_vector<int> data = generate_input(rng_dist);
154154

155155
state.exec([&data](nvbench::launch& launch)
156-
{
156+
{
157157
my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(data.begin(), data.end());
158158
});
159159
}
@@ -182,7 +182,7 @@ void my_benchmark(nvbench::state& state, nvbench::type_list<T>)
182182
thrust::device_vector<T> data = generate_input<T>();
183183
184184
state.exec([&data](nvbench::launch& launch)
185-
{
185+
{
186186
my_kernel<<<blocks, threads, 0, launch.get_stream()>>>(data.begin(), data.end());
187187
});
188188
}
@@ -266,7 +266,6 @@ In general::
266266

267267
More examples can be found in [examples/throughput.cu](../examples/throughput.cu).
268268

269-
270269
# Skip Uninteresting / Invalid Benchmarks
271270

272271
Sometimes particular combinations of parameters aren't useful or interesting —
@@ -294,7 +293,7 @@ void my_benchmark(nvbench::state& state, nvbench::type_list<T, U>)
294293
// Skip benchmarks at compile time -- for example, always skip when T == U
295294
// (Note that the `type_list` argument defines the same type twice).
296295
template <typename SameType>
297-
void my_benchmark(nvbench::state& state,
296+
void my_benchmark(nvbench::state& state,
298297
nvbench::type_list<SameType, SameType>)
299298
{
300299
state.skip("T must not be the same type as U.");
@@ -320,6 +319,15 @@ true:
320319
synchronize internally.
321320
- `nvbench::exec_tag::timer` requests a timer object that can be used to
322321
restrict the timed region.
322+
- `nvbench::exec_tag::no_batch` disables batch measurements. This both disables
323+
them during execution to reduce runtime, and prevents their compilation to
324+
reduce compile-time and binary size.
325+
- `nvbench::exec_tag::gpu` is an optional hint that prevents non-GPU benchmarking
326+
code from being compiled for a particular benchmark. A runtime error is emitted
327+
if the benchmark is defined with `set_is_cpu_only(true)`.
328+
- `nvbench::exec_tag::no_gpu` is an optional hint that prevents GPU benchmarking
329+
code from being compiled for a particular benchmark. A runtime error is emitted
330+
if the benchmark does not also define `set_is_cpu_only(true)`.
323331
324332
Multiple execution tags may be combined using `operator|`, e.g.
325333
@@ -370,7 +378,7 @@ Note that using manual timer mode disables batch measurements.
370378
void timer_example(nvbench::state& state)
371379
{
372380
// Pass the `timer` exec tag to request a timer:
373-
state.exec(nvbench::exec_tag::timer,
381+
state.exec(nvbench::exec_tag::timer,
374382
// Lambda now accepts a timer:
375383
[](nvbench::launch& launch, auto& timer)
376384
{
@@ -391,6 +399,79 @@ NVBENCH_BENCH(timer_example);
391399
See [examples/exec_tag_timer.cu](../examples/exec_tag_timer.cu) for a complete
392400
example.
393401

402+
## Compilation hints: `nvbench::exec_tag::no_batch`, `gpu`, and `no_gpu`
403+
404+
These execution tags are optional hints that disable the compilation of various
405+
code paths when they are not needed. They apply only to a single benchmark.
406+
407+
- `nvbench::exec_tag::no_batch` prevents the execution and instantiation of the batch measurement backend.
408+
- `nvbench::exec_tag::gpu` prevents the instantiation of CPU-only benchmarking backends.
409+
- Requires that the benchmark does not define `set_is_cpu_only(true)`.
410+
- Optional; this has no effect on runtime measurements, but reduces compile-time and binary size.
411+
- Host-side CPU measurements of GPU kernel execution time are still provided.
412+
- `nvbench::exec_tag::no_gpu` prevents the instantiation of GPU benchmarking backends.
413+
- Requires that the benchmark defines `set_is_cpu_only(true)`.
414+
- Optional; this has no effect on runtime measurements, but reduces compile-time and binary size.
415+
- See also [CPU-only Benchmarks](#cpu-only-benchmarks).
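
The consistency rules listed for the `gpu` and `no_gpu` hints can be sketched as a small stand-alone model. This is a minimal sketch, not NVBench's implementation: the `hint` enum, `has`, `is_consistent`, and `validate` are hypothetical names invented for illustration, and reporting the mismatch by throwing `std::runtime_error` is an assumption about how the documented runtime error might surface.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-ins for the exec tags described above. These are NOT
// NVBench's real types; they only model the documented consistency rules.
enum class hint : std::uint8_t
{
  none     = 0,
  timer    = 1 << 0,
  no_batch = 1 << 1,
  gpu      = 1 << 2,
  no_gpu   = 1 << 3
};

// Combine hints with operator|, as the docs do for exec tags.
constexpr hint operator|(hint a, hint b)
{
  return static_cast<hint>(static_cast<std::uint8_t>(a) |
                           static_cast<std::uint8_t>(b));
}

// True when `flag` is present in `set`.
constexpr bool has(hint set, hint flag)
{
  return (static_cast<std::uint8_t>(set) &
          static_cast<std::uint8_t>(flag)) != 0;
}

// `gpu` requires is_cpu_only == false; `no_gpu` requires is_cpu_only == true.
constexpr bool is_consistent(hint tags, bool is_cpu_only)
{
  return !(has(tags, hint::gpu) && is_cpu_only) &&
         !(has(tags, hint::no_gpu) && !is_cpu_only);
}

// Assumption: the mismatch is reported by throwing at run time.
inline void validate(hint tags, bool is_cpu_only)
{
  if (!is_consistent(tags, is_cpu_only))
  {
    throw std::runtime_error("exec tag hint conflicts with is_cpu_only");
  }
}
```

In this model, `validate(hint::no_gpu | hint::timer, true)` passes, while `validate(hint::no_gpu, false)` and `validate(hint::gpu, true)` both throw. Passing both `gpu` and `no_gpu` always fails, because no value of `is_cpu_only` can satisfy both rules.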
416+
417+
# CPU-only Benchmarks
418+
419+
NVBench provides CPU-only benchmarking facilities that are intended for measuring
420+
significant CPU workloads. We do not recommend using these features for high-resolution
421+
CPU benchmarking -- other libraries (such as Google Benchmark) are more appropriate for
422+
such applications. Examples are provided in [examples/cpu_only.cu](../examples/cpu_only.cu).
423+
424+
Note that NVBench still requires a CUDA compiler and runtime even if a project only contains
425+
CPU-only benchmarks.
426+
427+
The `is_cpu_only` property of the benchmark toggles between GPU and CPU-only measurements:
428+
429+
```cpp
430+
void my_cpu_benchmark(nvbench::state &state)
431+
{
432+
state.exec([](nvbench::launch &) { /* workload */ });
433+
}
434+
NVBENCH_BENCH(my_cpu_benchmark)
435+
.set_is_cpu_only(true); // Mark as CPU-only.
436+
```
437+
438+
The optional `nvbench::exec_tag::no_gpu` hint may be used to reduce the compilation time and
439+
binary size of CPU-only benchmarks. An error is emitted at runtime if this tag is used while
440+
`is_cpu_only` is false.
441+
442+
```cpp
443+
void my_cpu_benchmark(nvbench::state &state)
444+
{
445+
state.exec(nvbench::exec_tag::no_gpu, // Prevent compilation of GPU backends
446+
[](nvbench::launch &) { /* workload */ });
447+
}
448+
NVBENCH_BENCH(my_cpu_benchmark)
449+
.set_is_cpu_only(true); // Mark as CPU-only.
450+
```
451+
452+
The `nvbench::exec_tag::timer` execution tag is also supported by CPU-only benchmarks. This
453+
is useful for benchmarks that require additional per-sample setup/teardown. See the
454+
[`nvbench::exec_tag::timer`](#explicit-timer-mode-nvbenchexec_tagtimer) section for more
455+
details.
456+
457+
```cpp
458+
void my_cpu_benchmark(nvbench::state &state)
459+
{
460+
state.exec(nvbench::exec_tag::no_gpu | // Prevent compilation of GPU backends
461+
nvbench::exec_tag::timer, // Request a timer object
462+
[](nvbench::launch &, auto &timer)
463+
{
464+
// Setup here
465+
timer.start();
466+
// timed workload
467+
timer.stop();
468+
// teardown here
469+
});
470+
}
471+
NVBENCH_BENCH(my_cpu_benchmark)
472+
.set_is_cpu_only(true); // Mark as CPU-only.
473+
```
474+
394475
# Beware: Combinatorial Explosion Is Lurking
395476
396477
Be very careful of how quickly the configuration space can grow. The following
@@ -403,7 +484,7 @@ using value_types = nvbench::type_list<nvbench::uint8_t,
403484
nvbench::int32_t,
404485
nvbench::float32_t,
405486
nvbench::float64_t>;
406-
using op_types = nvbench::type_list<thrust::plus<>,
487+
using op_types = nvbench::type_list<thrust::plus<>,
407488
thrust::multiplies<>,
408489
thrust::maximum<>>;
409490
@@ -418,7 +499,7 @@ NVBENCH_BENCH_TYPES(my_benchmark,
418499

419500
```
420501
960 total configs
421-
= 4 [T=(U8, I32, F32, F64)]
502+
= 4 [T=(U8, I32, F32, F64)]
422503
* 4 [U=(U8, I32, F32, F64)]
423504
* 4 [V=(U8, I32, F32, F64)]
424505
* 3 [Op=(plus, multiplies, max)]
@@ -427,8 +508,8 @@ NVBENCH_BENCH_TYPES(my_benchmark,
427508

428509
For large configuration spaces like this, pruning some of the less useful
429510
combinations (e.g. `sizeof(init_type) < sizeof(output)`) using the techniques
430-
described in the "Skip Uninteresting / Invalid Benchmarks" section can help
431-
immensely with keeping compile / run times manageable.
511+
described in the [Skip Uninteresting / Invalid Benchmarks](#skip-uninteresting--invalid-benchmarks)
512+
section can help immensely with keeping compile / run times manageable.
432513

433514
Splitting a single large configuration space into multiple, more focused
434515
benchmarks with reduced dimensionality will likely be worth the effort as well.

examples/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@ set(example_srcs
22
auto_throughput.cu
33
axes.cu
44
custom_criterion.cu
5+
cpu_only.cu
56
enums.cu
67
exec_tag_sync.cu
78
exec_tag_timer.cu

examples/cpu_only.cu

Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,83 @@
1+
/*
2+
* Copyright 2025 NVIDIA Corporation
3+
*
4+
* Licensed under the Apache License, Version 2.0 with the LLVM exception
5+
* (the "License"); you may not use this file except in compliance with
6+
* the License.
7+
*
8+
* You may obtain a copy of the License at
9+
*
10+
* http://llvm.org/foundation/relicensing/LICENSE.txt
11+
*
12+
* Unless required by applicable law or agreed to in writing, software
13+
* distributed under the License is distributed on an "AS IS" BASIS,
14+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
* See the License for the specific language governing permissions and
16+
* limitations under the License.
17+
*/
18+
19+
#include <nvbench/nvbench.cuh>
20+
21+
#include <chrono>
22+
#include <thread>
23+
24+
// Block execution of the current CPU thread for `seconds` seconds.
25+
void sleep_host(double seconds)
26+
{
27+
std::this_thread::sleep_for(
28+
std::chrono::milliseconds(static_cast<nvbench::int64_t>(seconds * 1000)));
29+
}
30+
31+
//=============================================================================
32+
// Simple CPU-only benchmark that sleeps on host for a specified duration.
33+
void simple(nvbench::state &state)
34+
{
35+
const auto duration = state.get_float64("Duration");
36+
37+
state.exec([duration](nvbench::launch &) { sleep_host(duration); });
38+
}
39+
NVBENCH_BENCH(simple)
40+
// 100 -> 500 ms in 100 ms increments.
41+
.add_float64_axis("Duration", nvbench::range(.1, .5, .1))
42+
// Mark as CPU-only.
43+
.set_is_cpu_only(true);
44+
45+
//=============================================================================
46+
// Simple CPU-only benchmark that sleeps on host for a specified duration and
47+
// uses a custom timed region.
48+
void simple_timer(nvbench::state &state)
49+
{
50+
const auto duration = state.get_float64("Duration");
51+
52+
state.exec(nvbench::exec_tag::timer, [duration](nvbench::launch &, auto &timer) {
53+
// Do any setup work before starting the timer here...
54+
timer.start();
55+
56+
// The region of code to be timed:
57+
sleep_host(duration);
58+
59+
timer.stop();
60+
// Any per-run cleanup here...
61+
});
62+
}
63+
NVBENCH_BENCH(simple_timer)
64+
// 100 -> 500 ms in 100 ms increments.
65+
.add_float64_axis("Duration", nvbench::range(.1, .5, .1))
66+
// Mark as CPU-only.
67+
.set_is_cpu_only(true);
68+
69+
//=============================================================================
70+
// Simple CPU-only benchmark that uses the optional `nvbench::exec_tag::no_gpu`
71+
// hint to prevent GPU measurement code from being instantiated. Note that
72+
// `set_is_cpu_only(true)` is still required when using this hint.
73+
void simple_no_gpu(nvbench::state &state)
74+
{
75+
const auto duration = state.get_float64("Duration");
76+
77+
state.exec(nvbench::exec_tag::no_gpu, [duration](nvbench::launch &) { sleep_host(duration); });
78+
}
79+
NVBENCH_BENCH(simple_no_gpu)
80+
// 100 -> 500 ms in 100 ms increments.
81+
.add_float64_axis("Duration", nvbench::range(.1, .5, .1))
82+
// Mark as CPU-only.
83+
.set_is_cpu_only(true);
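
For readers without an NVBench build at hand, the host-side timing that these examples exercise can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions: `to_whole_milliseconds` and `time_host_seconds` are hypothetical helpers that are not part of NVBench, `std::int64_t` stands in for `nvbench::int64_t`, and timing a single call with `std::chrono::steady_clock` is only an illustration of measuring host wall time, not a description of how NVBench collects its samples.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Same conversion as sleep_host in the example file: the cast truncates, so
// any fraction of a millisecond is dropped.
inline std::chrono::milliseconds to_whole_milliseconds(double seconds)
{
  return std::chrono::milliseconds(static_cast<std::int64_t>(seconds * 1000));
}

// Block the current CPU thread for `seconds` seconds (millisecond granularity).
inline void sleep_host(double seconds)
{
  std::this_thread::sleep_for(to_whole_milliseconds(seconds));
}

// Hypothetical helper: wall-clock seconds spent inside `workload`, measured
// with a monotonic clock.
template <typename Workload>
double time_host_seconds(Workload &&workload)
{
  const auto start = std::chrono::steady_clock::now();
  workload();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Because of the truncation, a request shorter than one millisecond becomes a zero-length sleep, which is one reason a sleep-based workload like this is only meaningful at the 100 ms scale used by the `Duration` axis.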

nvbench/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@ set(srcs
2525

2626
detail/entropy_criterion.cxx
2727
detail/measure_cold.cu
28+
detail/measure_cpu_only.cxx
2829
detail/measure_hot.cu
2930
detail/state_generator.cxx
3031
detail/stdrel_criterion.cxx

nvbench/benchmark_base.cuh

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -159,6 +159,16 @@ struct benchmark_base
159159
}
160160
/// @}
161161

162+
/// If true, the benchmark measurements only record CPU time and assume no GPU work is performed.
163+
/// @{
164+
[[nodiscard]] bool get_is_cpu_only() const { return m_is_cpu_only; }
165+
benchmark_base &set_is_cpu_only(bool is_cpu_only)
166+
{
167+
m_is_cpu_only = is_cpu_only;
168+
return *this;
169+
}
170+
/// @}
171+
162172
/// If true, the benchmark is only run once, skipping all warmup runs and only
163173
/// executing a single non-batched measurement. This is intended for use with
164174
/// external profiling tools. @{
@@ -263,6 +273,7 @@ protected:
263273

264274
optional_ref<nvbench::printer_base> m_printer;
265275

276+
bool m_is_cpu_only{false};
266277
bool m_run_once{false};
267278
bool m_disable_blocking_kernel{false};
268279

nvbench/benchmark_base.cxx

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -34,12 +34,18 @@ std::unique_ptr<benchmark_base> benchmark_base::clone() const
3434
result->m_axes = m_axes;
3535
result->m_devices = m_devices;
3636

37-
result->m_min_samples = m_min_samples;
38-
result->m_criterion_params = m_criterion_params;
37+
result->m_printer = m_printer;
38+
39+
result->m_is_cpu_only = m_is_cpu_only;
40+
result->m_run_once = m_run_once;
41+
result->m_disable_blocking_kernel = m_disable_blocking_kernel;
42+
43+
result->m_min_samples = m_min_samples;
3944

4045
result->m_skip_time = m_skip_time;
4146
result->m_timeout = m_timeout;
4247

48+
result->m_criterion_params = m_criterion_params;
4349
result->m_stopping_criterion = m_stopping_criterion;
4450

4551
return result;
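
The hunk above adds assignments for `m_printer`, `m_is_cpu_only`, `m_run_once`, and `m_disable_blocking_kernel` to `clone()`. Why each member needs its own assignment can be shown with a reduced model. This is a sketch only: `flags_model` is a hypothetical type that keeps two of the flags and is not NVBench's `benchmark_base`.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, heavily reduced model of a benchmark's configuration flags.
struct flags_model
{
  bool get_is_cpu_only() const { return m_is_cpu_only; }
  flags_model &set_is_cpu_only(bool v)
  {
    m_is_cpu_only = v;
    return *this; // Returning *this allows setters to be chained.
  }

  bool get_run_once() const { return m_run_once; }
  flags_model &set_run_once(bool v)
  {
    m_run_once = v;
    return *this;
  }

  // Member-by-member copy into a fresh object. Any member left out here would
  // silently fall back to its default value in the clone.
  std::unique_ptr<flags_model> clone() const
  {
    auto result           = std::make_unique<flags_model>();
    result->m_is_cpu_only = m_is_cpu_only;
    result->m_run_once    = m_run_once;
    return result;
  }

private:
  bool m_is_cpu_only{false};
  bool m_run_once{false};
};
```

With this model, calling `set_is_cpu_only(true).set_run_once(true)` and then `clone()` yields a copy with both flags set. Deleting either assignment inside `clone()` would make the copy report `false` for that flag without any compile-time warning.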

nvbench/benchmark_manager.cxx

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,10 @@ void benchmark_manager::initialize()
4040
const auto& mgr = device_manager::get();
4141
for (auto& bench : m_benchmarks)
4242
{
43-
bench->set_devices(mgr.get_devices());
43+
if (!bench->get_is_cpu_only())
44+
{
45+
bench->set_devices(mgr.get_devices());
46+
}
4447
}
4548
}
4649
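
The guard added to `benchmark_manager::initialize()` can be illustrated with a stand-alone model. This is a sketch under assumptions: `device_model` and `assign_devices` are hypothetical names, devices are represented as plain integer ids, and an empty list is used here to mean "no GPU assigned". What NVBench's default device list contains is not shown in the hunk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a benchmark: a CPU-only flag plus a list of device ids.
struct device_model
{
  bool is_cpu_only{false};
  std::vector<int> devices; // Empty in this model means "no GPU assigned".
};

// Mirrors the guard in the hunk above: only benchmarks that are not CPU-only
// receive the full device list.
inline void assign_devices(std::vector<device_model> &benchmarks,
                           const std::vector<int> &all_devices)
{
  for (auto &bench : benchmarks)
  {
    if (!bench.is_cpu_only)
    {
      bench.devices = all_devices;
    }
  }
}
```

In this model, a CPU-only entry keeps whatever device list it already had, so it is never scheduled once per GPU, while a GPU entry is assigned every available device.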
