
run scaling studies in parallel on consistent node sets #151

Open
@reidpr

Description


We discussed this in the Pavilion training on 2/12/2020 and 2/13/2020.

We would like to do a performance scaling study with multiple power-of-2 tests, with each set of tests at a given scale using the same nodes, running in parallel as much as possible. I'll give a small example, but we'd like this to work with arbitrary scales and repetition counts.

Suppose we have a machine with 9 nodes. We want to run a scaling study with 3 repetitions at scales of 1, 2, 4, and 8 nodes, each run in its own independent Slurm job, so there are 12 jobs; I'll name each job by its scale and a letter. Each run takes one time unit. The table below shows one possible sequence of which job is running on which node, with "X" meaning unrelated jobs.

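| Time | cn1 | cn2 | cn3 | cn4 | cn5 | cn6 | cn7 | cn8 | cn9 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 1    | 8a  | 8a  | 8a  | 8a  | 8a  | 8a  | 8a  | 8a  | 1a  |
| 2    | 8b  | 8b  | 8b  | 8b  | 8b  | 8b  | 8b  | 8b  | 1b  |
| 3    | 4a  | 4a  | 4a  | 4a  | X   | X   | X   | X   | X   |
| 4    | 8c  | 8c  | 8c  | 8c  | 8c  | 8c  | 8c  | 8c  | X   |
| 5    | 4b  | 4b  | 4b  | 4b  | 2a  | 2a  | X   | X   | 1c  |
| 6    | 4c  | 4c  | 4c  | 4c  | 2b  | 2b  | X   | X   | X   |
| 7    | X   | X   | X   | X   | 2c  | 2c  | X   | X   | X   |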
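
To make the request more concrete, here is a rough sketch of how the node pinning in the table above might be expressed outside Pavilion, by building one `sbatch` command per job with a fixed `--nodelist`. The function name `pinned_sbatch_commands`, the script name `run_test.sh`, and the scale-to-node mapping are assumptions for illustration only; this is not Pavilion's API, and it leaves the question of when each job starts entirely to Slurm.

```python
import string


def pinned_sbatch_commands(scale_nodes, repetitions, script="run_test.sh"):
    """Build one sbatch command per (scale, repetition), pinned to fixed nodes.

    scale_nodes maps a scale (node count) to the list of node names that
    every run at that scale must use. The mapping and the script name are
    hypothetical inputs for this sketch.
    """
    if repetitions > len(string.ascii_lowercase):
        raise ValueError("this sketch only names up to 26 repetitions per scale")
    commands = {}
    for scale, nodes in sorted(scale_nodes.items(), reverse=True):
        if len(nodes) != scale:
            raise ValueError(f"scale {scale} needs exactly {scale} nodes")
        for rep in range(repetitions):
            # Job names follow the issue's convention: scale plus a letter.
            name = f"{scale}{string.ascii_lowercase[rep]}"
            commands[name] = [
                "sbatch",
                f"--job-name={name}",
                f"--nodes={scale}",
                f"--nodelist={','.join(nodes)}",  # same nodes for every repetition
                "--exclusive",
                script,
            ]
    return commands


# Node sets taken from the table above.
scale_nodes = {
    8: [f"cn{i}" for i in range(1, 9)],
    4: ["cn1", "cn2", "cn3", "cn4"],
    2: ["cn5", "cn6"],
    1: ["cn9"],
}
commands = pinned_sbatch_commands(scale_nodes, repetitions=3)
print(" ".join(commands["4b"]))
# prints: sbatch --job-name=4b --nodes=4 --nodelist=cn1,cn2,cn3,cn4 --exclusive run_test.sh
```

Because the 1-node and 2-node sets do not overlap the 4-node set, and the 1-node set does not overlap the 8-node set, the scheduler is free to run those jobs side by side, which is the parallelism the table illustrates.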

The following table shows an invalid sequence, because runs at the same scale do not stay on the same nodes:

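| Time | cn1 | cn2 | cn3 | cn4 | cn5 | cn6 | cn7 | cn8 | cn9 |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 1    | 8a  | 8a  | 8a  | 8a  | 8a  | 8a  | 8a  | 8a  | 1a  |
| 2    | 1b  | 8b  | 8b  | 8b  | 8b  | 8b  | 8b  | 8b  | 8b  |
| 3    | 4a  | 4a  | 4a  | 4a  | 4b  | 4b  | 4b  | 4b  | X   |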
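
The rule that separates the two tables above can be stated as a small check: every job at a given scale must occupy exactly the same set of nodes. Below is a minimal sketch of that check, assuming the schedule is given as whitespace-separated text with a header row, job names of the form scale plus letter, and "X" for unrelated jobs. The names `parse_schedule` and `inconsistent_scales` are made up for this example.

```python
def parse_schedule(table):
    """Turn a whitespace-separated schedule into a list of {node: job} rows."""
    lines = table.strip().splitlines()
    nodes = lines[0].split()[1:]  # drop the leading "Time" column
    return [dict(zip(nodes, line.split()[1:])) for line in lines[1:]]


def inconsistent_scales(schedule):
    """Return {scale: node sets} for every scale that used more than one node set."""
    seen = {}
    for row in schedule:
        nodes_by_job = {}
        for node, job in row.items():
            if job != "X":  # "X" marks an unrelated job
                nodes_by_job.setdefault(job, set()).add(node)
        for job, nodes in nodes_by_job.items():
            scale = int(job[:-1])  # "8a" -> 8
            seen.setdefault(scale, set()).add(frozenset(nodes))
    return {scale: sets for scale, sets in seen.items() if len(sets) > 1}


VALID = """
Time cn1 cn2 cn3 cn4 cn5 cn6 cn7 cn8 cn9
1 8a 8a 8a 8a 8a 8a 8a 8a 1a
2 8b 8b 8b 8b 8b 8b 8b 8b 1b
3 4a 4a 4a 4a X X X X X
4 8c 8c 8c 8c 8c 8c 8c 8c X
5 4b 4b 4b 4b 2a 2a X X 1c
6 4c 4c 4c 4c 2b 2b X X X
7 X X X X 2c 2c X X X
"""

INVALID = """
Time cn1 cn2 cn3 cn4 cn5 cn6 cn7 cn8 cn9
1 8a 8a 8a 8a 8a 8a 8a 8a 1a
2 1b 8b 8b 8b 8b 8b 8b 8b 8b
3 4a 4a 4a 4a 4b 4b 4b 4b X
"""

valid_problems = inconsistent_scales(parse_schedule(VALID))
invalid_problems = inconsistent_scales(parse_schedule(INVALID))
print(sorted(valid_problems))    # prints: []
print(sorted(invalid_problems))  # prints: [1, 4, 8]
```

In the invalid table all three scales are flagged: the 1-node and 8-node runs swap nodes between time 1 and time 2, and the two 4-node runs at time 3 sit on different halves of the machine.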

Thanks for the training and your hard work on Pavilion 2. Let me know what additional information you need.
