# NVBench Best Practices

NVBench is a **small yet actively developed benchmarking library** for CUDA GPU workloads. The [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) is the ideal starting point, providing detailed guidance for getting hands-on experience, from installation to framework usage. It links to the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md), which covers all essential features and usage instructions, and to [code examples](https://github.com/NVIDIA/nvbench/tree/main/examples) that demonstrate how to integrate and apply various NVBench features in a user's codebase.

This document is **not intended to replace** the detailed [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md) or the CLI guides ([CLI help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) and [CLI axis help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md)). All examples are for demonstration purposes and are **not intended as recommendations for best practices in real-world scenarios**.

## NVBench

* Provides GPU-aware features: runtime customization, throughput calculations, parameter sweeps, etc.
* Produces machine-readable output (JSON, CSV) suitable for regression tracking and CI pipelines, as sketched below.
* The natural choice for benchmarking GPU-accelerated code.
* Also supports benchmarking plain CPU implementations.
* Python support is on the roadmap.
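
As a quick illustration of the machine-readable output: any NVBench executable can write results to a file straight from the command line. A minimal sketch, assuming a hypothetical benchmark binary named `my_bench`:

```bash
# Write full results as JSON (useful for regression tracking and CI):
./my_bench --json results.json

# Write a tabular summary as CSV:
./my_bench --csv results.csv
```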

## Benchmarking GPU Workloads with NVBench

Let's begin with a simple example for users who are new to NVBench and want to learn the basics of benchmarking GPU code. Consider measuring the performance of `thrust::sequence`. Similar to `std::iota`, suppose we have an input array of 10 elements that we want `thrust::sequence` to populate with the values 0 through 9. The following example demonstrates this approach:

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    // Note: `launch` is never used here.
    thrust::sequence(data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

Will this code work as-is? Depending on the build system configuration, compilation may succeed but generate a warning that `launch` is an unused parameter, and the code may or may not execute correctly. This often happens when users accustomed to a sequential programming mindset overlook the fact that GPU architectures are highly parallel: proper use of streams and synchronization is essential for accurately measuring performance in benchmark code.

A common mistake in this context is neglecting to specify the stream: NVBench needs to know the exact CUDA stream being targeted to correctly trace kernel execution and measure performance. Therefore, users must explicitly provide the stream to be benchmarked. For example, passing the NVBench launch stream ensures correct execution and accurate measurement:

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

By explicitly specifying `launch.get_stream()`, NVBench can correctly target the kernels executed on that stream. After recompiling, the unused-parameter warning is resolved and the build completes successfully. However, at runtime the code may hang, for example:

```bash
######################################################################
##################### Possible Deadlock Detected #####################
######################################################################

Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.

This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:

state.exec(<KernelLauncher>); // Deadlock
state.exec(nvbench::exec_tag::sync, <KernelLauncher>); // Safe
```

The runtime log indicates a deadlock, and NVBench terminated the run to avoid wasting time on a measurement that cannot be trusted. As the log explains, the issue arises from implicit synchronization within the benchmarked code, in this case the `thrust::sequence` call: unless an asynchronous execution policy is explicitly specified, `thrust` synchronizes internally. Therefore, users must pass `nvbench::exec_tag::sync` to ensure correct benchmarking. Omitting it produces **no** build-time error but can cause runtime hangs.

Now we can fix the code:

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```
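
For completeness, a benchmark executable also needs an entry point. A minimal sketch of a full translation unit, assuming the `NVBENCH_MAIN` macro from `nvbench/main.cuh` (projects using NVBench's CMake helpers may instead get an entry point by linking the `nvbench::main` target):

```cpp
#include <nvbench/main.cuh>
#include <nvbench/nvbench.cuh>

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sequence.h>

void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);

NVBENCH_MAIN  // expands to a main() that parses the CLI and runs the benchmarks
```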

When the benchmark is executed, results are displayed without issues. However, users, particularly in multi-GPU environments, may observe that more results are collected than expected:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

## [1] `NVIDIA RTX A400`
* SM Version: 860 (PTX Version: 860)
* Number of SMs: 6
* SM Default Clock Rate: 1762 MHz
* Global Memory: 2801 MiB Free / 3769 MiB Total
* Global Memory Bus Peak: 96 GB/sec (64-bit DDR @6001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 1024 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/2] sequence_bench [Device=0]
Pass: Cold: 0.006150ms GPU, 0.009768ms CPU, 0.50s total GPU, 4.52s total wall, 81312x
Run: [2/2] sequence_bench [Device=1]
Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 63952x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Samples | CPU Time | Noise  | GPU Time | Noise  |
|---------|----------|--------|----------|--------|
|  81312x | 9.768 us | 13.55% | 6.150 us | 20.16% |

### [1] NVIDIA RTX A400

| Samples | CPU Time  | Noise   | GPU Time | Noise   |
|---------|-----------|---------|----------|---------|
|  63952x | 13.864 us | 432.95% | 7.819 us | 447.95% |
```

By default, NVBench runs benchmarks on all available GPUs unless told otherwise. On multi-GPU systems, this can unnecessarily increase runtime and resource usage. To target a specific GPU, saving both time and resources, set the `CUDA_VISIBLE_DEVICES` environment variable. In our case, we target the **Quadro RTX 8000**:

```bash
user@nvbench-test:~/nvbench/build/bin$ export CUDA_VISIBLE_DEVICES=0
```

Now, if we rerun:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/1] sequence_bench [Device=0]
Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Samples | CPU Time | Noise | GPU Time | Noise  |
|---------|----------|-------|----------|--------|
|  79920x | 9.850 us | 9.62% | 6.257 us | 13.32% |
```
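
Alternatively, device selection can be done from the NVBench CLI itself. A sketch, assuming the `--devices` option described in the [CLI help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) (check the linked doc for the exact syntax in your version):

```bash
# Restrict this run to device 0 without touching the environment:
./sequence_bench --devices 0
```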

## Benchmarking Multiple Problem Sizes

Benchmarking a single problem size is usually **insufficient for real-world problems**; in most cases, we want to run the same kernel on different problem sizes. NVBench provides the **axis** feature for this. For example, to test input sizes from `10` to `1000000`:

```cpp
void sequence_bench(nvbench::state& state) {
  auto const n = state.get_int64("Num");
  auto data = thrust::device_vector<int>(n);

  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench)
  .add_int64_axis("Num", std::vector<nvbench::int64_t>{10, 100, 1000, 1000000});
```
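
Since problem sizes often span orders of magnitude, power-of-two axes are also common. A sketch using the `add_int64_power_of_two_axis` helper and `nvbench::range` from the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md):

```cpp
NVBENCH_BENCH(sequence_bench)
  // Sweeps 2^10, 2^12, ..., 2^20; results are reported as powers of two.
  .add_int64_power_of_two_axis("Num", nvbench::range(10, 20, 2));
```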

**Axes are a powerful tool** provided by NVBench. Users may want to test only certain sizes, and NVBench provides a **flexible CLI** that lets them change benchmark parameters **without recompiling the code**:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench -a Num=[10,100000]
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/2] sequence_bench [Device=0 Num=10]
Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x
Run: [2/2] sequence_bench [Device=0 Num=100000]
Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Num    | Samples | CPU Time  | Noise | GPU Time | Noise  |
|--------|---------|-----------|-------|----------|--------|
| 10     | 79152x  | 9.948 us  | 9.63% | 6.318 us | 13.73% |
| 100000 | 75936x  | 10.193 us | 9.62% | 6.586 us | 12.86% |
```

For more details about **CLI axis control**, see the [CLI axis help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md).
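
For instance, in addition to explicit value lists like `-a Num=[10,100000]` above, the CLI axis help describes ranged specifications. A sketch (see the linked doc for the exact syntax in your version):

```bash
# Ranged specification, start:stop:stride
./sequence_bench -a Num=[10:50:10]   # 10, 20, 30, 40, 50
```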

## Comparing Algorithms Using NVBench

Once benchmarks are set up, a major use case is comparing the performance of different algorithms. For example, the same sequence algorithm can be written manually with `thrust::transform`, and we can compare that manual version against `thrust::sequence`.

### Step 1: Record Reference Performance

Record the `thrust::sequence` benchmark in a JSON file for post-processing:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_ref.json
```

### Step 2: Update the Code to Use `thrust::transform`

```cpp
void sequence_bench(nvbench::state& state) {
  auto const n = state.get_int64("Num");
  auto data = thrust::device_vector<int>(n);

  state.exec(nvbench::exec_tag::sync, [&data, n](nvbench::launch& launch) {
    thrust::transform(
      thrust::device.on(launch.get_stream()),
      thrust::counting_iterator<int>(0),
      thrust::counting_iterator<int>(n),
      data.begin(),
      cuda::std::identity{}  // from <cuda/std/functional>
    );
  });
}
```

### Step 3: Run the Transform Benchmark and Save JSON

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_transform.json
```

### Step 4: Compare the Results

NVBench ships a convenient comparison script, `nvbench_compare.py`, under `nvbench/scripts`. After copying the JSON files into the scripts folder:

```bash
user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json
['sequence_ref.json', 'sequence_transform.json']
# sequence_bench

## [0] Quadro RTX 8000

| Num     | Ref Time  | Ref Noise | Cmp Time  | Cmp Noise | Diff      | %Diff  | Status |
|---------|-----------|-----------|-----------|-----------|-----------|--------|--------|
| 10      | 6.288 us  | 13.70%    | 6.301 us  | 14.38%    | 0.013 us  | 0.20%  | SAME   |
| 100     | 6.331 us  | 13.74%    | 6.350 us  | 15.15%    | 0.019 us  | 0.31%  | SAME   |
| 1000    | 6.548 us  | 13.29%    | 6.504 us  | 13.95%    | -0.043 us | -0.66% | SAME   |
| 1000000 | 12.528 us | 7.56%     | 12.507 us | 8.41%     | -0.021 us | -0.17% | SAME   |

# Summary

- Total Matches: 4
- Pass    (diff <= min_noise): 4
- Unknown (infinite noise):    0
- Failure (diff > min_noise):  0
```

We can see that the performance of the two approaches is essentially the same.

Beyond one-off comparisons, the same JSON files can be used to trace performance regressions in CI: record a baseline once, then compare each new build against it.
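
A minimal sketch of such a CI check, following the steps above (file names are hypothetical):

```bash
# In CI: benchmark the current build, then compare against the stored baseline.
./sequence_bench --json current.json
./nvbench_compare.py baseline.json current.json
```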

---

For more information on how to use NVBench in your projects, please check the [NVBench repository](https://github.com/NVIDIA/nvbench). Feel free to raise questions or feature requests via GitHub [issues](https://github.com/NVIDIA/nvbench/issues) or [discussions](https://github.com/NVIDIA/nvbench/discussions), and enjoy benchmarking with NVBench!