
Commit 1bde292

Hotfix scripts generic (#54)
* locked the version of numpy
* Added description on the INSTALL_DIR path
* Added 1-1 mapping of benchmarks numbers and codename
* updated with generic python_wrapper for starting the benchmarks
* Updated scripts' readme with expected folder structure
* removed project specific slurm args
1 parent c29cb10 commit 1bde292

File tree

7 files changed (+390, -6 lines)

projects/resources/README.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
## Benchmarks and Utilities

This folder contains the Java, Python and CUDA benchmarks, along with the scripts to compute the interconnection matrix of the current architecture.

| **Benchmark #** | **Benchmark Name** | **Extended Description** |
|-----------------|--------------------|--------------------------|
| B1M | VEC | _Compute the sum of differences of squares of two vectors, using multiple GrCUDA kernels. Parallelize the computation on multiple GPUs by computing a chunk of the output on each._ |
| B5M | B&S | _Black-Scholes equation benchmark, executed concurrently on different input vectors._ |
| B6M | ML | _Compute an ensemble of Categorical Naive Bayes and Ridge Regression classifiers. Predictions are aggregated by averaging the class scores after softmax normalization._ |
| B9M | CG | _Compute the conjugate gradient algorithm on a dense symmetric matrix. The matrix-vector multiplications are row-partitioned to scale across multiple GPUs._ |
| B11M | MUL | _Dense matrix-vector multiplication, partitioning the matrix in blocks of rows._ |
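As an illustration of the B1M row above, here is a minimal NumPy sketch of the computation it describes (my own example, not code from this commit); the per-chunk partial sums stand in for the per-GPU portions of the output:

```python
import numpy as np

def b1m_reference(x, y, num_gpus=2):
    # Reference semantics of B1M: sum of the element-wise difference of squares,
    # computed in per-GPU chunks as in the multi-GPU GrCUDA benchmark.
    chunks = np.array_split(np.arange(len(x)), num_gpus)
    partials = [np.sum(x[idx] ** 2 - y[idx] ** 2) for idx in chunks]
    return sum(partials)

x = np.random.rand(160000)  # one of the B1M input sizes used by the wrapper below
y = np.random.rand(160000)
assert np.isclose(b1m_reference(x, y), np.sum(x ** 2 - y ** 2))
```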
Lines changed: 318 additions & 0 deletions
@@ -0,0 +1,318 @@
# Copyright (c) 2025 NECSTLab, Politecnico di Milano. All rights reserved.

# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NECSTLab nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#  * Neither the name of Politecnico di Milano nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.

# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import argparse
import subprocess
import time
import os
from datetime import datetime
from pathlib import Path

from benchmark_result import BenchmarkResult
##############################
##############################
GPU = "GPU_NAME"  # not relevant, it is just used to name the output files
BANDWIDTH_MATRIX = f"{os.getenv('GRCUDA_HOME')}/projects/resources/connection_graph/datasets/connection_graph.csv"

# Benchmark settings;
DEFAULT_NUM_BLOCKS = 640
HEAP_SIZE = 470

benchmarks = [
    "b1m",
    "b5m",
    "b6m",
    "b9m",
    "b11m",
]

# Input sizes tested for each benchmark;
num_elem = {
    "b1m": [160000, 500000, 950000],
    "b5m": [10000, 21000, 35000],
    "b6m": [1000, 1400, 1800],
    "b9m": [2000, 4000, 6000],
    "b11m": [2000, 4000, 6000],
}

# Default number of blocks per kernel, used when -g is not specified;
block_dim_dict = {
    "b1m": 64,
    "b5m": 64,
    "b6m": 64,
    "b9m": 64,
    "b11m": 64,
}

exec_policies = ["async"]
dependency_policies = ["with-const"]  # also available: "no-const"
new_stream_policies = ["always-new"]  # also available: "reuse"
parent_stream_policies = ["multigpu-disjoint"]  # also: "same-as-parent", "disjoint", "multigpu-early-disjoint"
choose_device_policies = ["round-robin", "stream-aware", "min-transfer-size", "minmax-transfer-time"]  # also: "single-gpu", "minmin-transfer-time"
memory_advise = ["none"]
prefetch = ["false"]
stream_attach = [False]
time_computation = [False]
num_gpus = [2]

block_sizes1d_dict = {
    "b1m": 32,
    "b5m": 1024,
    "b6m": 32,
    "b9m": 32,
    "b11m": 256,
}

block_sizes2d_dict = {
    "b1m": 8,
    "b5m": 8,
    "b6m": 8,
    "b9m": 8,
    "b11m": 8,
}
##############################
##############################
# Template of the graalpython command: the placeholders are filled, in order, with the heap size,
# the GrCUDA runtime options, and the arguments forwarded to benchmark_main.py;
GRAALPYTHON_CMD = "graalpython --vm.XX:MaxHeapSize={}G --jvm --polyglot --experimental-options " \
                  "--grcuda.ExecutionPolicy={} --grcuda.DependencyPolicy={} --grcuda.RetrieveNewStreamPolicy={} " \
                  "--grcuda.NumberOfGPUs={} --grcuda.RetrieveParentStreamPolicy={} " \
                  "--grcuda.DeviceSelectionPolicy={} --grcuda.MemAdvisePolicy={} --grcuda.InputPrefetch={} --grcuda.BandwidthMatrix={} {} {} " \
                  "benchmark_main.py -i {} -n {} -g {} --number_of_gpus {} --reinit false --realloc false " \
                  "-b {} --block_size_1d {} --block_size_2d {} --execution_policy {} --dependency_policy {} --new_stream {} " \
                  "--parent_stream {} --device_selection {} --memory_advise_policy {} --prefetch {} --no_cpu_validation {} {} {} {} -o {}"


def execute_grcuda_benchmark(benchmark, size, num_gpus, block_sizes, exec_policy, dependency_policy, new_stream_policy,
                             parent_stream_policy, choose_device_policy, memory_advise, prefetch, num_iter,
                             bandwidth_matrix, time_phases, debug, stream_attach=False,
                             time_computation=False, num_blocks=DEFAULT_NUM_BLOCKS, output_date=None, mock=False):
    if debug:
        # Note: "i" and "tot_benchmarks" are the module-level counters set in the main loop below;
        BenchmarkResult.log_message("#" * 30)
        BenchmarkResult.log_message(f"Benchmark {i + 1}/{tot_benchmarks}")
        BenchmarkResult.log_message(f"benchmark={benchmark}, size={size}, "
                                    f"gpus={num_gpus}, "
                                    f"block-sizes={block_sizes}, "
                                    f"num-blocks={num_blocks}, "
                                    f"exec-policy={exec_policy}, "
                                    f"dependency-policy={dependency_policy}, "
                                    f"new-stream-policy={new_stream_policy}, "
                                    f"parent-stream-policy={parent_stream_policy}, "
                                    f"choose-device-policy={choose_device_policy}, "
                                    f"mem-advise={memory_advise}, "
                                    f"prefetch={prefetch}, "
                                    f"stream-attachment={stream_attach}, "
                                    f"time-computation={time_computation}, "
                                    f"bandwidth-matrix={bandwidth_matrix}, "
                                    f"time-phases={time_phases}")
        BenchmarkResult.log_message("")

    if not output_date:
        output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    file_name = f"{output_date}_{benchmark}_{size}_{num_gpus}_{num_blocks}_{exec_policy}_{dependency_policy}_" \
                f"{new_stream_policy}_{parent_stream_policy}_{choose_device_policy}_" \
                f"{memory_advise}_{prefetch}_{stream_attach}.json"
    # Create the result folder if it doesn't exist;
    output_folder_path = os.path.join(BenchmarkResult.DEFAULT_RES_FOLDER, output_date + "_grcuda")
    if not os.path.exists(output_folder_path):
        if debug:
            BenchmarkResult.log_message(f"creating result folder: {output_folder_path}")
        if not mock:
            Path(output_folder_path).mkdir(parents=True, exist_ok=True)
    output_path = os.path.join(output_folder_path, file_name)
    b1d_size = " ".join([str(b['block_size_1d']) for b in block_sizes])
    b2d_size = " ".join([str(b['block_size_2d']) for b in block_sizes])

    benchmark_cmd = GRAALPYTHON_CMD.format(HEAP_SIZE, exec_policy, dependency_policy, new_stream_policy,
                                           num_gpus, parent_stream_policy, choose_device_policy, memory_advise, prefetch, bandwidth_matrix,
                                           "--grcuda.ForceStreamAttach" if stream_attach else "",
                                           "--grcuda.EnableComputationTimers" if time_computation else "",
                                           num_iter, size, num_blocks, num_gpus, benchmark, b1d_size, b2d_size, exec_policy, dependency_policy,
                                           new_stream_policy, parent_stream_policy, choose_device_policy, memory_advise, prefetch,
                                           "-d" if debug else "",
                                           "-p" if time_phases else "",
                                           "--force_stream_attach" if stream_attach else "",
                                           "--timing" if time_computation else "",
                                           output_path)
    if debug:
        BenchmarkResult.log_message(benchmark_cmd)
        BenchmarkResult.log_message("#" * 30)
        BenchmarkResult.log_message("")
        BenchmarkResult.log_message("")
    if not mock:
        start = time.time()
        result = subprocess.run(benchmark_cmd,
                                shell=True,
                                stdout=None,  # subprocess.STDOUT,
                                cwd=f"{os.getenv('GRCUDA_HOME')}/projects/resources/python/benchmark")
        result.check_returncode()
        end = time.time()
        if debug:
            BenchmarkResult.log_message(f"Benchmark total execution time: {(end - start):.2f} seconds")
##############################
##############################


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Wrap the GrCUDA benchmark to specify additional settings")

    parser.add_argument("-d", "--debug", action="store_true",
                        help="If present, print debug messages")
    parser.add_argument("-c", "--cuda_test", action="store_true",
                        help="If present, run performance tests using CUDA")
    parser.add_argument("-i", "--num_iter", metavar="N", type=int, default=BenchmarkResult.DEFAULT_NUM_ITER,
                        help="Number of times each benchmark is executed")
    parser.add_argument("-g", "--num_blocks", metavar="N", type=int,
                        help="Number of blocks in each kernel, when applicable")
    parser.add_argument("-p", "--time_phases", action="store_true",
                        help="Measure the execution time of each phase of the benchmark;"
                             " note that this introduces overheads, and might influence the total execution time")
    parser.add_argument("-m", "--mock", action="store_true",
                        help="If present, simply print the benchmark CMD without executing it")
    parser.add_argument("--gpus", metavar="N", type=int, nargs="*",
                        help="Specify the maximum number of GPUs to use in the computation")

    # Parse the input arguments;
    args = parser.parse_args()

    debug = args.debug if args.debug else BenchmarkResult.DEFAULT_DEBUG
    num_iter = args.num_iter if args.num_iter else BenchmarkResult.DEFAULT_NUM_ITER
    use_cuda = args.cuda_test
    time_phases = args.time_phases
    num_blocks = args.num_blocks
    mock = args.mock
    gpus = args.gpus

    if gpus is not None:
        num_gpus = gpus

    if debug:
        BenchmarkResult.log_message(f"using block sizes: {block_sizes1d_dict} {block_sizes2d_dict}; using low-level CUDA benchmarks: {use_cuda}")

    def tot_benchmark_count():
        tot = 0
        if use_cuda:
            # Note: "cuda_exec_policies" is not defined in this script;
            # running with --cuda_test requires defining it first;
            for b in benchmarks:
                for e in cuda_exec_policies:
                    if e == "sync":
                        tot += len(num_elem[b]) * len(prefetch) * len(stream_attach)
                    else:
                        tot += len(num_elem[b]) * len(prefetch) * len(num_gpus) * len(stream_attach)
        else:
            for b in benchmarks:
                for e in exec_policies:
                    if e == "sync":
                        tot += len(num_elem[b]) * len(memory_advise) * len(prefetch) * len(stream_attach) * len(time_computation)
                    else:
                        for n in num_gpus:
                            if n == 1:
                                tot += len(num_elem[b]) * len(memory_advise) * len(prefetch) * len(stream_attach) * len(time_computation)
                            else:
                                tot += len(num_elem[b]) * len(dependency_policies) * len(new_stream_policies) * len(parent_stream_policies) * len(choose_device_policies) * len(memory_advise) * len(prefetch) * len(stream_attach) * len(time_computation)
        return tot

    output_date = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")

    # Execute each test;
    i = 0
    tot_benchmarks = tot_benchmark_count()
    for b in benchmarks:
        for n in num_elem[b]:
            for exec_policy in exec_policies:  # GrCUDA benchmarks;
                if exec_policy == "sync":
                    dp = [dependency_policies[0]]
                    nsp = [new_stream_policies[0]]
                    psp = [parent_stream_policies[0]]
                    cdp = [choose_device_policies[0]]
                    ng = [1]
                else:
                    dp = dependency_policies
                    nsp = new_stream_policies
                    psp = parent_stream_policies
                    cdp = choose_device_policies
                    ng = num_gpus
                for num_gpu in ng:
                    # With a single GPU the scheduling policies are irrelevant: test only the first of each list;
                    if exec_policy == "async" and num_gpu == 1:
                        dp = [dependency_policies[0]]
                        nsp = [new_stream_policies[0]]
                        psp = [parent_stream_policies[0]]
                        cdp = [choose_device_policies[0]]
                    else:
                        dp = dependency_policies
                        nsp = new_stream_policies
                        psp = parent_stream_policies
                        cdp = choose_device_policies
                    for m in memory_advise:
                        for p in prefetch:
                            for s in stream_attach:
                                for t in time_computation:
                                    # Select the correct connection graph;
                                    BANDWIDTH_MATRIX = f"{os.getenv('GRCUDA_HOME')}/projects/resources/connection_graph/datasets/connection_graph.csv"
                                    for dependency_policy in dp:
                                        for new_stream_policy in nsp:
                                            for parent_stream_policy in psp:
                                                for choose_device_policy in cdp:
                                                    nb = num_blocks if num_blocks else block_dim_dict[b]
                                                    block_sizes = BenchmarkResult.create_block_size_list([block_sizes1d_dict[b]], [block_sizes2d_dict[b]])
                                                    execute_grcuda_benchmark(b, n, num_gpu, block_sizes,
                                                                             exec_policy, dependency_policy, new_stream_policy, parent_stream_policy, choose_device_policy,
                                                                             m, p, num_iter, BANDWIDTH_MATRIX, time_phases, debug, s, t, nb, output_date=output_date, mock=mock)
                                                    i += 1
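As a sanity check on the sweep size (my arithmetic, assuming the default lists above and no `--gpus` override): every benchmark runs with the async policy on 2 GPUs, so the only multi-valued dimensions are the 3 input sizes and the 4 device-selection policies.

```python
# Estimate of the sweep size, mirroring tot_benchmark_count() for the
# defaults of this script (exec_policies = ["async"], num_gpus = [2]).
n_benchmarks = 5        # b1m, b5m, b6m, b9m, b11m
n_sizes = 3             # len(num_elem[b]) for every benchmark
n_device_policies = 4   # len(choose_device_policies)
print(n_benchmarks * n_sizes * n_device_policies)  # 60 configurations
```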

scripts/README.md

Lines changed: 17 additions & 0 deletions
@@ -15,13 +15,30 @@ The installation proceeds by modifying accordingly `env.sh`, and then running `b
The file `env.sh` contains the env variables that need to be properly set to build and run GrCUDA.

##### Expected folder structure

Users should clone the grcuda repository into its own directory, named `grcuda`.
After you have run the generic setup, the content of `$INSTALL_DIR` (specified in `env.sh`) should be the following (a quick check is sketched after the list):

- grcuda (clone of the grcuda repository)
- graal
- graalvm-ce-java11.22.1.0
- mx
- labsjdk-ce-11.0.15-jvmci-22.1-b01
- graalpython_venv
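An illustrative way to verify this layout after the setup (a sketch of mine, not one of the repository's scripts):

```python
# Hypothetical check: verify that $INSTALL_DIR contains the expected entries.
import os

install_dir = os.environ["INSTALL_DIR"]
expected = ["grcuda", "graal", "graalvm-ce-java11.22.1.0", "mx",
            "labsjdk-ce-11.0.15-jvmci-22.1-b01", "graalpython_venv"]
for entry in expected:
    ok = os.path.isdir(os.path.join(install_dir, entry))
    print(f"{entry}: {'ok' if ok else 'MISSING'}")
```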
### OCI setup

This folder contains the scripts necessary for a full install of GrCUDA on OCI resources, as detailed by the outermost README of this repository.

### run_local

The scripts within this folder are used to run the Java workloads on a simple local multi-GPU machine.
The file `env.sh` contains the env variables that need to be properly set to build and run GrCUDA.

### run_slurm

The scripts within this folder are used to run the Python and Java workloads within a Slurm setup. The file `env.sh` contains the env variables that need to be properly set to build and run GrCUDA.

scripts/generic_setup/env.sh

Lines changed: 8 additions & 1 deletion
@@ -6,7 +6,14 @@
 ## (java workloads) maven installed/configured, e.g.:
 # module load maven/3.8.4

-## setup env
+###################
+###### SETUP ######
+###################
+
+# INSTALL_DIR should point to the install folder of your choice (GrCUDA will be cloned there if not already present),
+# or to the parent folder where grcuda's source files have been unpacked;
+# i.e., if grcuda is placed in /home/X/my_custom_install_dir/grcuda,
+# INSTALL_DIR should point to /home/X/my_custom_install_dir
 export INSTALL_DIR=$HOME/cf25_grcuda ## modify as necessary
 mkdir -p $INSTALL_DIR
