Description
Pav2 is the only test harness I've found that allows me to specify a number of nodes and execute all subsequent jobs on them (thank you). This is achieved as follows:
modes/share.yaml
scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
However, when looking at the results output, it appears that these jobs are launched serially, rather than asynchronously. See below.
Edited pav results output showing launch times.
11:20:24
11:20:19
11:20:16
11:20:12
11:20:03
11:19:53
11:19:43
11:19:34
11:19:30
11:19:27
11:19:24
11:19:20
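To put numbers on the gaps between launches, here is a quick sketch that uses only the timestamps listed above (the function name `launch_gaps` is just for illustration):

```python
from datetime import datetime

# Launch times copied from the edited pav results output above (newest first).
launch_times = [
    "11:20:24", "11:20:19", "11:20:16", "11:20:12", "11:20:03", "11:19:53",
    "11:19:43", "11:19:34", "11:19:30", "11:19:27", "11:19:24", "11:19:20",
]

def launch_gaps(times):
    """Return the seconds between consecutive launches, oldest first."""
    parsed = sorted(datetime.strptime(t, "%H:%M:%S") for t in times)
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]

gaps = launch_gaps(launch_times)
print(gaps)
print(f"min={min(gaps)}s max={max(gaps)}s total={sum(gaps)}s")
```

Every gap is between 3 and 10 seconds, and the 12 launches span 64 seconds in total, which is consistent with each test starting only after the previous one has finished.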
Note that all of these tests use a single rank, so they should be able to run concurrently within the shared allocation when launched with srun using the following extra srun args.
slurm:
  srun_extra:
    - --overlap
One potential issue is overwhelming SLURM. Perhaps adding another key, e.g. max_queue, that limits the number of asynchronous jobs that can be in the srun queue at once would help. Perhaps something like the following.
modes/share.yaml
scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
  max_queue: 250
  slurm:
    srun_extra:
      - --overlap
      - --gres=craynetwork:0
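To illustrate the behaviour I have in mind for max_queue, here is a minimal sketch in plain Python. It is not Pavilion code and does not reflect how Pavilion launches tests internally; `run_bounded`, `srun_command`, and the `-n 1` argument are my own assumptions, chosen only to show a cap on the number of srun launches in flight at once.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# srun flags taken from the proposed config above; "-n 1" is an assumption
# reflecting that every test here is a single rank.
SRUN_ARGS = ["srun", "--overlap", "--gres=craynetwork:0", "-n", "1"]

def srun_command(test_cmd):
    """Prefix a test's command line with the srun arguments (hypothetical helper)."""
    return SRUN_ARGS + list(test_cmd)

def run_bounded(commands, max_queue):
    """Run commands concurrently with at most max_queue in flight at once.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd):
        return subprocess.run(cmd, capture_output=True).returncode

    # The pool size is the cap: further commands wait until a worker frees up.
    with ThreadPoolExecutor(max_workers=max_queue) as pool:
        return list(pool.map(run_one, commands))
```

Inside an allocation this could be called as `run_bounded([srun_command(["./my_test"]) for _ in range(1000)], max_queue=250)`, where `./my_test` is a placeholder. With max_queue: 1 this would reduce to the serial behaviour seen in the results above.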