Commit bc09838

1.2 release

Parent: 4052bf9

20 files changed: +761 / -179 lines

CHANGELOG.md

Lines changed: 12 additions & 0 deletions

@@ -1,5 +1,17 @@
 # Changelog
 
+## [1.2.0] - 2025-07-21
+
+### Added
+- Multicast reductions benchmarks
+- Option to specify iteration count (-i/--iterations)
+- Option to repeat a testcase for a specified number of iterations (-c/--repeat)
+- Option to repeat a testcase for a specified number of seconds (-d/--duration)
+- CUDA Stream Ordered Memory Allocator was added as a new allocator option (-a cudapool)
+
+### Changed
+- Caching multicast allocations is now much faster, thanks to a multicast-specific memory pool
+
 ## [1.1.0] - 2025-05-22
 
 ### Added

CMakeLists.txt

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 #[[
-SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 SPDX-License-Identifier: Apache-2.0
 
 Licensed under the Apache License, Version 2.0 (the "License");

README.md

Lines changed: 43 additions & 0 deletions

@@ -174,6 +174,28 @@ The default is 5 samples per rack.
 
 When `--richOutput` is enabled, all sample measurements will be shown. Otherwise, only the median value across the samples is reported.
 
+## --iterations
+The `--iterations` option controls how many copy operations are performed within each measurement, not including the initial warmup iteration.
+
+The default is 16.
+
+Controlled with the `-i` or `--iterations` option.
+
+## --repeat
+The `--repeat` option controls how many times each testcase is executed in a single run. By default, each testcase is run once. You can use this option to repeat the same testcase multiple times.
+
+Controlled with the `-c` or `--repeat` option.
+
+For example, `-c 10` will run each selected testcase 10 times in a row.
+
+## --duration
+
+The `--duration` option allows you to specify how long (in seconds) each testcase should be repeated. The testcase will be executed repeatedly until the specified duration has elapsed.
+
+Controlled with the `-d` or `--duration` option.
+
+**Note:** You cannot specify both `--duration` and `--repeat` at the same time; only one of these options can be used per run. If neither is specified, each testcase will run once (the default).
+
 # Heatmap plotter
 
 `plot_heatmaps.py` included in the `nvloom_cli` directory produces heatmaps for each testcase of a given `nvloom_cli` output.
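The three options documented in this hunk combine with the existing suite selection. A hedged usage sketch (suite names taken from the CLI's own suite list; how the binary is launched is unchanged from the rest of the README): passing `-s pairwise -i 32 -c 10` to `nvloom_cli` runs every pairwise testcase 10 times in a row with 32 copy iterations per measurement, while `-s multicast -d 60` re-runs each multicast testcase until 60 seconds have elapsed. Remember that `-c` and `-d` are mutually exclusive.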
@@ -351,3 +373,24 @@ Bandwidth of the "continuous arrow" copy is reported, however it causes `NUM_GPU
 Multicast_all_to_all measures bandwidth of every single GPU broadcasting to every single GPU at the same time. In essence, it's `NUM_GPU` of `multicast_one_to_all` running simultaneously.
 
 Sum of all "continuous arrow" bandwidth is reported.
+
+### Multicast_one_to_all_red: multimem.red
+
+Each measurement in multicast_one_to_all_red performs an addition of data from a regular "device" buffer on the source GPU to a multicast allocation that's allocated on all GPUs in the job. The multimem.red PTX instruction is used for this reduction. For more information, see [Data Movement and Conversion Instructions: multimem.ld_reduce, multimem.st, multimem.red](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem).
+
+![Diagram of multicast multimem.red traffic pattern](docs/multicast_red.png)
+
+### Multicast_all_to_all_red: multimem.red
+
+Multicast_all_to_all_red measures bandwidth of every single GPU adding data (reducing) to every single GPU at the same time. In essence, it's `NUM_GPU` of `multicast_one_to_all_red` running simultaneously.
+
+### Multicast_all_to_one_red: multimem.ld_reduce
+
+Each measurement in multicast_all_to_one_red performs a sum of data residing on all GPUs and saves the result to local memory. The multimem.ld_reduce PTX instruction is used for this reduction. For more information, see [Data Movement and Conversion Instructions: multimem.ld_reduce, multimem.st, multimem.red](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem).
+
+![Diagram of multicast multimem.ld_reduce traffic pattern](docs/multicast_ld_reduce.png)
+
+### Multicast_all_to_all_ld_reduce: multimem.ld_reduce
+
+Multicast_all_to_all_ld_reduce measures bandwidth of every single GPU reducing data from every single GPU at the same time. In essence, it's `NUM_GPU` of `multicast_all_to_one_red` running simultaneously.
+

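The reduction testcases above are defined by the two PTX instructions named in their headings. The fragment below is not nvloom code; it is a minimal sketch, assuming a float buffer already mapped through the CUDA multicast (cuMulticast*) driver APIs (the `mc_ptr` parameter is hypothetical) and an sm_90-class GPU, of how multimem.red and multimem.ld_reduce are typically emitted from CUDA C++ inline PTX:

```cuda
#include <cstddef>

// Sketch only: mc_ptr must point into a multicast mapping created with the
// CUDA multicast driver APIs; a plain device pointer will not work.

// multimem.red: add src[i] into the multicast address, so the addition lands
// on the corresponding element on every GPU backing the mapping
// (the multicast_one_to_all_red pattern).
__global__ void multicastRedAdd(float *mc_ptr, const float *src, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        asm volatile("multimem.red.relaxed.sys.global.add.f32 [%0], %1;"
                     :: "l"(mc_ptr + i), "f"(src[i]) : "memory");
    }
}

// multimem.ld_reduce: read the element from every GPU backing the mapping,
// sum the values, and store the result into local memory
// (the multicast_all_to_one_red pattern).
__global__ void multicastLdReduceAdd(float *dst, const float *mc_ptr, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        float v;
        asm volatile("multimem.ld_reduce.relaxed.sys.global.add.f32 %0, [%1];"
                     : "=f"(v) : "l"(mc_ptr + i) : "memory");
        dst[i] = v;
    }
}
```

nvloom's actual kernels, element types, and memory-ordering qualifiers may differ; the sketch only captures the data flow the diagrams describe: multimem.red pushes one GPU's data out to all replicas, and multimem.ld_reduce pulls all replicas in and reduces them locally.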
cli/nvloom_cli.cpp

Lines changed: 66 additions & 11 deletions

@@ -1,5 +1,5 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  * SPDX-License-Identifier: Apache-2.0
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
@@ -24,13 +24,32 @@
 #include <iostream>
 #include <memory>
 
-#define NVLOOM_VERSION "1.1.0"
+#define NVLOOM_VERSION "1.2.0"
 #ifndef GIT_COMMIT
 #define GIT_COMMIT "unknown"
 #endif
 
 bool richOutput = false;
 int gpuToRackSamples = 5;
+int iterations = NvLoom::getDefaultIterationCount();
+
+bool shouldContinue(boost::program_options::variables_map &vm, int iteration, std::chrono::time_point<std::chrono::high_resolution_clock> startTime) {
+    if (vm["repeat"].defaulted() && vm["duration"].defaulted()) {
+        return false;
+    }
+
+    if (!vm["repeat"].defaulted()) {
+        return iteration + 1 < vm["repeat"].as<int>();
+    }
+
+    if (!vm["duration"].defaulted()) {
+        auto duration = std::chrono::duration<double>(std::chrono::high_resolution_clock::now() - startTime).count();
+        return duration < vm["duration"].as<int>();
+    }
+
+    ASSERT(0);
+    return false;
+}
 
 int run_program(int argc, char **argv) {
     boost::program_options::options_description opts("nvloom CLI");
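In the shouldContinue helper added here: with both `--repeat` and `--duration` left at their defaults it returns false, so each testcase runs exactly once; with `--repeat N` it keeps returning true while `iteration + 1 < N`, giving N runs in total; with `--duration S` it keeps returning true until S seconds of wall-clock time have elapsed since the loop started. The trailing ASSERT(0) is a defensive guard; the three branches already cover every combination of the two options.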
@@ -40,6 +59,8 @@ int run_program(int argc, char **argv) {
 
     bool listTestcases = false;
     int bufferSizeInMiB = 512;
+    int repeat = 1;
+    int duration = -1;
 
     std::string suitesOptionDescription("Suite(s) to run (by name): all-to-one, egm, fabric-stress, gpu-to-rack, multicast, pairwise, rack-to-rack");
     opts.add_options()
@@ -49,8 +70,11 @@ int run_program(int argc, char **argv) {
         ("suite,s", boost::program_options::value<std::vector<std::string>>(&suitesToRun)->multitoken(), suitesOptionDescription.c_str())
         ("listTestcases,l", boost::program_options::bool_switch(&listTestcases)->default_value(listTestcases), "List testcases")
         ("richOutput,r", boost::program_options::bool_switch(&richOutput)->default_value(richOutput), "Rich output")
-        ("allocatorStrategy,a", boost::program_options::value<std::string>(&allocatorStrategyString)->default_value("reuse"), "Allocator strategy: choose between unique and reuse")
+        ("allocatorStrategy,a", boost::program_options::value<std::string>(&allocatorStrategyString)->default_value("reuse"), "Allocator strategy: choose between unique, reuse and cudapool")
         ("gpuToRackSamples", boost::program_options::value<int>(&gpuToRackSamples)->default_value(gpuToRackSamples), "Number of per-rack samples to use in gpu_to_rack testcases")
+        ("iterations,i", boost::program_options::value<int>(&iterations)->default_value(iterations), "Number of copy iterations within the testcase to run, not including the warmup iteration")
+        ("repeat,c", boost::program_options::value<int>(&repeat)->default_value(repeat), "Number of times to repeat each testcase")
+        ("duration,d", boost::program_options::value<int>(&duration)->default_value(duration), "Duration of each testcase in seconds")
     ;
 
     boost::program_options::variables_map vm;
@@ -69,6 +93,11 @@
         return 0;
     }
 
+    if (!vm["repeat"].defaulted() && !vm["duration"].defaulted()) {
+        std::cerr << "Cannot specify both repeat and duration\n";
+        return 1;
+    }
+
     OUTPUT << "nvloom_cli " << NVLOOM_VERSION << std::endl;
     OUTPUT << "git commit: " << GIT_COMMIT << std::endl;
 
@@ -77,6 +106,8 @@ int run_program(int argc, char **argv) {
         allocatorStrategy = ALLOCATOR_STRATEGY_REUSE;
     } else if (allocatorStrategyString == "unique") {
         allocatorStrategy = ALLOCATOR_STRATEGY_UNIQUE;
+    } else if (allocatorStrategyString == "cudapool") {
+        allocatorStrategy = ALLOCATOR_STRATEGY_CUDA_POOLS;
     } else {
         std::cerr << "Unknown value for the allocatorStrategy argument: " << allocatorStrategyString << "\n";
         OUTPUT << opts << "\n";
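The new cudapool branch only selects `ALLOCATOR_STRATEGY_CUDA_POOLS`; the allocator itself lives in the nvloom library and is not part of this diff. For reference, a minimal sketch of the CUDA Stream Ordered Memory Allocator API that the `-a cudapool` name refers to (device index, release-threshold choice, and buffer size are illustrative, not nvloom's internals; error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdint>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Keep freed memory cached in the device's default memory pool instead of
    // returning it to the OS, so later cudaMallocAsync calls can reuse it.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    uint64_t threshold = UINT64_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    // Allocation and free are stream-ordered: they execute in order with the
    // other work queued on `stream`, without a device-wide synchronization.
    void *buf = nullptr;
    size_t bytes = 512ull << 20;  // 512 MiB, matching the CLI's default buffer size
    cudaMallocAsync(&buf, bytes, stream);
    cudaMemsetAsync(buf, 0, bytes, stream);
    cudaFreeAsync(buf, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```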
@@ -89,6 +120,8 @@ int run_program(int argc, char **argv) {
 
     OUTPUT << "Buffer size: " << bufferSizeInMiB << " MiB" << std::endl;
 
+    OUTPUT << "Iteration count: " << iterations << std::endl;
+
     auto [testcases, suites] = buildTestcases(allocatorStrategy);
 
     if (listTestcases) {
@@ -130,14 +163,36 @@ int run_program(int argc, char **argv) {
     }
 
     for (auto testcase : testcasesToRunSet) {
-        OUTPUT << "Running " << testcase << std::endl;
-        auto startTime = std::chrono::high_resolution_clock::now();
-        testcases[testcase]->filterRun(bufferSizeInB);
-        clearAllocationPools();
-        auto endTime = std::chrono::high_resolution_clock::now();
-        OUTPUT << "ExecutionTime " << testcase << " " << std::chrono::duration<double>(endTime - startTime).count() << " s" << std::endl;
-        OUTPUT << "Done " << testcase << std::endl;
-        OUTPUT << std::endl;
+        int iterationCount = 0;
+        auto loopStartTime = std::chrono::high_resolution_clock::now();
+        while (true) {
+            std::string testcaseName = testcase;
+            if (!vm["repeat"].defaulted() || !vm["duration"].defaulted()) {
+                testcaseName += "_iter_" + std::to_string(iterationCount);
+            }
+
+            OUTPUT << "Running " << testcaseName << std::endl;
+            auto startTime = std::chrono::high_resolution_clock::now();
+            testcases[testcase]->filterRun(bufferSizeInB);
+            auto endTime = std::chrono::high_resolution_clock::now();
+
+            bool shouldContinueIteration = shouldContinue(vm, iterationCount, loopStartTime);
+            if (!shouldContinueIteration) {
+                // We're only clearing the pools on last iteration of the loop
+                // But we still want to include the time it took to clear the pools in the output
+                clearAllocationPools();
+            }
+
+            OUTPUT << "ExecutionTime " << testcaseName << " " << std::chrono::duration<double>(endTime - startTime).count() << " s" << std::endl;
+            OUTPUT << "Done " << testcaseName << std::endl;
+            OUTPUT << std::endl;
+
+            if (!shouldContinueIteration) {
+                break;
+            }
+
+            iterationCount++;
+        }
     }
 
     return 0;
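A note for log consumers: when `--repeat` or `--duration` is active, every run of a testcase is reported under a per-iteration name of the form `<testcase>_iter_<N>` (starting at `_iter_0`), while a default single run keeps the bare testcase name; in either case the allocation pools are cleared only once, on the final iteration.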
