Description
We discussed this in the Pavilion training on 2/12/2020 and 2/13/2020.
We would like to do a performance scaling study across power-of-2 node counts, where every test at a given scale runs on the same set of nodes and the tests run in parallel as much as possible. I'll give a small example, but we'd like this to work with arbitrary scales and repetition counts.
Suppose we have a machine with 9 nodes. We want to run a scaling study with 3 repetitions at scales of 1, 2, 4, and 8 nodes, each run in an independent Slurm job, so there are 12 jobs; I'll name each job by its scale and a letter. Each run takes one time unit. The table below shows one possible sequence of which job is running on which node, with "X" meaning unrelated jobs.
Time | cn1 | cn2 | cn3 | cn4 | cn5 | cn6 | cn7 | cn8 | cn9 |
---|---|---|---|---|---|---|---|---|---|
1 | 8a | 8a | 8a | 8a | 8a | 8a | 8a | 8a | 1a |
2 | 8b | 8b | 8b | 8b | 8b | 8b | 8b | 8b | 1b |
3 | 4a | 4a | 4a | 4a | X | X | X | X | X |
4 | 8c | 8c | 8c | 8c | 8c | 8c | 8c | 8c | X |
5 | 4b | 4b | 4b | 4b | 2a | 2a | X | X | 1c |
6 | 4c | 4c | 4c | 4c | 2b | 2b | X | X | X |
7 | X | X | X | X | 2c | 2c | X | X | X |
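To make the node pinning concrete, here is a minimal sketch of the plain `sbatch` invocations that would give the layout above. This is not Pavilion's API; it only illustrates the constraint using Slurm's `--job-name`, `--nodes`, and `--nodelist` options. The `FIRST_NODE` mapping, the `cn` prefix handling, and the script name `scaling_test.sh` are assumptions made up for this example.

```python
import string

# Hypothetical fixed assignment, matching the table above:
# scale (node count) -> number of the first node in its contiguous block.
FIRST_NODE = {8: 1, 4: 1, 2: 5, 1: 9}


def node_range(first, count, prefix="cn"):
    """Format a contiguous block of nodes as a Slurm hostlist expression."""
    if count == 1:
        return f"{prefix}{first}"
    return f"{prefix}[{first}-{first + count - 1}]"


def sbatch_commands(reps=3, script="scaling_test.sh"):
    """Build one sbatch argument list per (scale, repetition) pair.

    Every repetition at a given scale gets the same --nodelist, which is
    the property the scaling study needs.
    """
    commands = []
    for scale, first in FIRST_NODE.items():
        nodelist = node_range(first, scale)
        for letter in string.ascii_lowercase[:reps]:
            commands.append([
                "sbatch",
                f"--job-name={scale}{letter}",
                f"--nodes={scale}",
                f"--nodelist={nodelist}",
                script,
            ])
    return commands


if __name__ == "__main__":
    for cmd in sbatch_commands():
        print(" ".join(cmd))
```

The scheduler is still free to decide when each job runs; only the node set per scale is fixed. The argument lists are meant to be passed to something like `subprocess.run` without a shell, so the brackets in the hostlist need no quoting.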
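The following table shows an invalid sequence, because runs of the same size change which nodes they get: 8b shifts from cn1-cn8 to cn2-cn9, 1b moves from cn9 to cn1, and 4b runs on cn5-cn8 instead of cn1-cn4.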
Time | cn1 | cn2 | cn3 | cn4 | cn5 | cn6 | cn7 | cn8 | cn9 |
---|---|---|---|---|---|---|---|---|---|
1 | 8a | 8a | 8a | 8a | 8a | 8a | 8a | 8a | 1a |
2 | 1b | 8b | 8b | 8b | 8b | 8b | 8b | 8b | 8b |
3 | 4a | 4a | 4a | 4a | 4b | 4b | 4b | 4b | X |
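The validity rule can also be stated as a small check. This is only a sketch of the rule as I understand it, not anything Pavilion provides: it assumes job names are a scale followed by a single letter, and that "X" marks unrelated jobs. The function name and the whitespace-separated encoding of the two tables are made up for the example.

```python
from collections import defaultdict

# The two tables from this issue, one row per time step, one column per node.
VALID = """\
8a 8a 8a 8a 8a 8a 8a 8a 1a
8b 8b 8b 8b 8b 8b 8b 8b 1b
4a 4a 4a 4a X X X X X
8c 8c 8c 8c 8c 8c 8c 8c X
4b 4b 4b 4b 2a 2a X X 1c
4c 4c 4c 4c 2b 2b X X X
X X X X 2c 2c X X X
"""

INVALID = """\
8a 8a 8a 8a 8a 8a 8a 8a 1a
1b 8b 8b 8b 8b 8b 8b 8b 8b
4a 4a 4a 4a 4b 4b 4b 4b X
"""


def inconsistent_scales(table):
    """Return the sorted scales whose runs did not all use the same nodes."""
    # Collect the set of node indices each job occupied.
    nodes_by_job = defaultdict(set)
    for row in table.strip().splitlines():
        for node_index, job in enumerate(row.split(), start=1):
            if job != "X":
                nodes_by_job[job].add(node_index)

    # Group the node sets by scale ("8a" -> 8) and flag scales with >1 set.
    node_sets_by_scale = defaultdict(set)
    for job, nodes in nodes_by_job.items():
        node_sets_by_scale[int(job[:-1])].add(frozenset(nodes))
    return sorted(s for s, sets in node_sets_by_scale.items() if len(sets) > 1)


if __name__ == "__main__":
    print(inconsistent_scales(VALID))
    print(inconsistent_scales(INVALID))
```

For the first table the check finds no inconsistent scales; for the second it flags scales 1, 4, and 8.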
Thanks for the training and your hard work on Pavilion 2. Let me know what additional information you need.