
share_allocation: max -- asynchronous launching #725

Open
@j-ogas


Pav2 is the only test harness I've found that allows me to specify a number of nodes and execute all subsequent jobs on them (thank you). This is achieved as follows:

modes/share.yaml

scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max

However, the launch times in the results output suggest that these jobs are launched serially rather than asynchronously. See below.

Edited pav results output showing launch times (most recent first):

11:20:24
11:20:19
11:20:16
11:20:12
11:20:03
11:19:53
11:19:43
11:19:34
11:19:30
11:19:27
11:19:24
11:19:20
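To make the spacing explicit, here is a minimal sketch that computes the gap between consecutive launch times listed above. The helper name `launch_gaps` is made up for illustration and is not part of Pavilion. Every launch is separated from the previous one by several seconds, which is consistent with each test starting only after the previous one has finished.

```python
from datetime import datetime

# Launch times copied from the edited results output above.
launch_times = [
    "11:20:24", "11:20:19", "11:20:16", "11:20:12",
    "11:20:03", "11:19:53", "11:19:43", "11:19:34",
    "11:19:30", "11:19:27", "11:19:24", "11:19:20",
]


def launch_gaps(times):
    """Return the seconds between consecutive launches, oldest first."""
    parsed = sorted(datetime.strptime(t, "%H:%M:%S") for t in times)
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]


gaps = launch_gaps(launch_times)
print(gaps)       # [4, 3, 3, 4, 9, 10, 10, 9, 4, 3, 5]
print(sum(gaps))  # 64 seconds between the first and last launch
```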

Note that each of these tests uses a single rank, so it should be possible to launch them concurrently with srun using the following extra srun arguments.

  slurm:
    srun_extra:
      - --overlap

One potential issue is overwhelming SLURM. Adding another key, e.g. max_queue, that limits how many asynchronous jobs can sit in the srun queue at once might help. Perhaps something like the following.

modes/share.yaml

scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
  max_queue: 250
  slurm:
    srun_extra:
      - --overlap
      - --gres=craynetwork:0
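As a rough illustration of what a max_queue limit could mean in practice, the sketch below launches commands concurrently while keeping at most max_queue of them in flight. This is not Pavilion's implementation. The function `launch_all` is hypothetical, and the commands are short Python sleeps standing in for single-rank tests so the example runs without a SLURM allocation.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor


def launch_all(commands, max_queue):
    """Run every command, keeping at most max_queue in flight at once.

    Returns (return_codes, peak), where peak is the highest number of
    commands that were running at the same time.
    """
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def run_one(cmd):
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        try:
            return subprocess.run(cmd, check=False).returncode
        finally:
            with lock:
                state["running"] -= 1

    # The pool size is the cap: no more than max_queue workers exist,
    # so no more than max_queue commands run simultaneously.
    with ThreadPoolExecutor(max_workers=max_queue) as pool:
        return_codes = list(pool.map(run_one, commands))
    return return_codes, state["peak"]


# Stand-ins for single-rank tests. Inside a real allocation each entry
# might instead look like ["srun", "--overlap", "-n", "1", "./test"],
# where "./test" is a placeholder (an assumption, not taken from Pavilion).
commands = [[sys.executable, "-c", "import time; time.sleep(0.2)"]
            for _ in range(8)]

return_codes, peak = launch_all(commands, max_queue=3)
print(return_codes, peak)
```

With a cap of 3, the eight stand-in commands run in overlapping batches rather than one after another, and the reported peak never exceeds the cap.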
