The default is 5 samples per rack.

When `--richOutput` is enabled, all sample measurements will be shown. Otherwise, only the median value across the samples is reported.

## --iterations

The `--iterations` option controls how many copy operations are performed within each measurement, not including the initial warmup iteration.

The default is 16.

Controlled with the `-i` or `--iterations` option.
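
For intuition, here is a minimal sketch (not nvloom's actual measurement code) of how one untimed warmup copy followed by timed iterations could be structured with CUDA events; `timeCopyMs` and the plain `cudaMemcpyAsync` copy are illustrative stand-ins:

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: one untimed warmup copy, then `iterations` timed copies.
// Returns the average time per copy in milliseconds.
float timeCopyMs(void *dst, const void *src, size_t numBytes, int iterations) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warmup iteration: executed but not included in the measurement.
    cudaMemcpyAsync(dst, src, numBytes, cudaMemcpyDeviceToDevice);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; i++) {   // --iterations, default 16
        cudaMemcpyAsync(dst, src, numBytes, cudaMemcpyDeviceToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsedMs / iterations;
}
```

Averaging over several iterations amortizes per-copy launch overhead, while the warmup copy absorbs one-time setup costs and is excluded from the timing.
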
## --repeat

The `--repeat` option controls how many times each testcase is executed in a single run. By default, each testcase is run once.

Controlled with the `-c` or `--repeat` option.

For example, `-c 10` will run each selected testcase 10 times in a row.
## --duration

The `--duration` option specifies how long (in seconds) each testcase should be repeated. The testcase is executed repeatedly until the specified duration has elapsed.

Controlled with the `-d` or `--duration` option.

**Note:** You cannot specify both `--duration` and `--repeat` at the same time; only one of these options can be used per run. If neither is specified, each testcase will run once (the default).
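
As a rough sketch of these semantics (not nvloom_cli's implementation; `runTestcase` is a hypothetical stand-in), the two options can be thought of as selecting between a count-based and a time-based outer loop:

```cpp
#include <chrono>

void runTestcase();  // hypothetical stand-in for executing one testcase

// Hypothetical sketch of the --repeat / --duration semantics.
void runSelectedTestcase(int repeat, int durationSeconds) {
    if (durationSeconds > 0) {
        // --duration: repeat until the requested wall-clock time has elapsed.
        auto deadline = std::chrono::steady_clock::now()
                      + std::chrono::seconds(durationSeconds);
        while (std::chrono::steady_clock::now() < deadline) {
            runTestcase();
        }
    } else {
        // --repeat: run a fixed number of times (default 1).
        for (int i = 0; i < repeat; i++) {
            runTestcase();
        }
    }
}
```
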
# Heatmap plotter
`plot_heatmaps.py` included in the `nvloom_cli` directory produces heatmaps for each testcase of a given `nvloom_cli` output.

Multicast_all_to_all measures bandwidth of every single GPU broadcasting to every single GPU at the same time. In essence, it's `NUM_GPU` of `multicast_one_to_all` running simultaneously.

Sum of all "continuous arrow" bandwidth is reported.
### Multicast_one_to_all_red: multimem.red

Each measurement in multicast_one_to_all_red performs an addition of data from a regular "device" buffer on the source GPU to a multicast allocation that's allocated on all GPUs in the job. The `multimem.red` PTX instruction is used for this reduction. For more information, see [Data Movement and Conversion Instructions: multimem.ld_reduce, multimem.st, multimem.red](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem).
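
For illustration only (this is not nvloom's kernel), a CUDA kernel can issue `multimem.red` through inline PTX. The sketch below assumes a float payload, a pointer `mcPtr` into a multicast mapping, and an sm_90+ GPU:

```cpp
// Hypothetical sketch: each thread adds one element of a local device buffer
// into the corresponding element of a multicast mapping via multimem.red.
__global__ void multimemRedAdd(float *mcPtr, const float *src, size_t count) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) {
        asm volatile("multimem.red.relaxed.sys.global.add.f32 [%0], %1;"
                     :
                     : "l"(mcPtr + i), "f"(src[i])
                     : "memory");
    }
}
```
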
### Multicast_all_to_all_red: multimem.red

Multicast_all_to_all_red measures bandwidth of every single GPU adding data (reducing) to every single GPU at the same time. In essence, it's `NUM_GPU` of `multicast_one_to_all_red` running simultaneously.
### Multicast_all_to_one_red: multimem.ld_reduce

Each measurement in multicast_all_to_one_red performs a sum of data residing on all GPUs and saves the result to local memory. The `multimem.ld_reduce` PTX instruction is used for this reduction. For more information, see [Data Movement and Conversion Instructions: multimem.ld_reduce, multimem.st, multimem.red](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem).

Multicast_all_to_all_ld_reduce measures bandwidth of every single GPU reducing data from every single GPU at the same time. In essence, it's `NUM_GPU` of `multicast_all_to_one_red` running simultaneously.
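
Similarly, a purely illustrative sketch of the `multimem.ld_reduce` pattern behind multicast_all_to_one_red (again assuming a float payload, a multicast pointer `mcPtr`, and sm_90+; not nvloom's kernel):

```cpp
// Hypothetical sketch: each thread loads the sum of the values held by every
// GPU behind the multicast mapping and stores it to a local destination buffer.
__global__ void multimemLdReduceAdd(float *dst, const float *mcPtr, size_t count) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < count) {
        float sum;
        asm volatile("multimem.ld_reduce.relaxed.sys.global.add.f32 %0, [%1];"
                     : "=f"(sum)
                     : "l"(mcPtr + i)
                     : "memory");
        dst[i] = sum;
    }
}
```
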
("allocatorStrategy,a", boost::program_options::value<std::string>(&allocatorStrategyString)->default_value("reuse"), "Allocator strategy: choose between uniqueand reuse")
73
+
("allocatorStrategy,a", boost::program_options::value<std::string>(&allocatorStrategyString)->default_value("reuse"), "Allocator strategy: choose between unique, reuse and cudapool")
53
74
("gpuToRackSamples", boost::program_options::value<int>(&gpuToRackSamples)->default_value(gpuToRackSamples), "Number of per-rack samples to use in gpu_to_rack testcases")
75
+
("iterations,i", boost::program_options::value<int>(&iterations)->default_value(iterations), "Number of copy iterations within the testcase to run, not including the warmup iteration")
76
+
("repeat,c", boost::program_options::value<int>(&repeat)->default_value(repeat), "Number of times to repeat each testcase")
77
+
("duration,d", boost::program_options::value<int>(&duration)->default_value(duration), "Duration of each testcase in seconds")
54
78
;
55
79
56
80
boost::program_options::variables_map vm;

In `run_program`, the new options are validated after parsing: specifying both `--repeat` and `--duration` is rejected.

```cpp
int run_program(int argc, char **argv) {
    // ...
        return 0;
    }

    if (!vm["repeat"].defaulted() && !vm["duration"].defaulted()) {
        std::cerr << "Cannot specify both repeat and duration\n";
        return 1;  // assumed error return (not shown in the excerpt)
    }
    // ...
}
```