Commit c40e92c

Merge pull request #205 from NVIDIA/am/pydantic-test-scenario
Pydantic for Test Scenario
2 parents 02b8289 + f056569 commit c40e92c

30 files changed, +1013 -519 lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions
@@ -84,3 +84,4 @@ jobs:
 cloudai --help
 cloudai --mode verify-systems --tests-dir conf/common/test --system-config conf/common/system
 cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test
+cloudai --mode verify-test-scenarios --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario

README.md

Lines changed: 9 additions & 0 deletions
@@ -127,6 +127,15 @@ cloudai\
 ```
 `--tests-dir` can be a file or a directory to verify all configs in the directory.

+Verify if test scenarios are valid:
+```bash
+cloudai --mode verify-test-scenarios \
+    --system-config conf/common/system/example_slurm_cluster.toml \
+    --tests-dir conf/common/test \
+    --test-scenario conf/common/test_scenario
+```
+`--test-scenario` can be a file or a directory to verify all configs in the directory.
+
 ## Contributing
 Feel free to contribute to the CloudAI project. Your contributions are highly appreciated.

USER_GUIDE.md

Lines changed: 41 additions & 30 deletions
@@ -123,20 +123,24 @@ Test Scenario uses Test description from the previous step. Below is the `myconf
 ```toml
 name = "nccl-test"

-[Tests.1]
-name = "nccl_test_all_reduce_single_node"
-time_limit = "00:20:00"
-
-[Tests.2]
-name = "nccl_test_all_reduce_single_node"
-time_limit = "00:20:00"
-[Tests.2.dependencies]
-start_post_comp = { name = "Tests.1", time = 0 }
+[[Tests]]
+id = "Tests.1"
+test_name = "nccl_test_all_reduce_single_node"
+time_limit = "00:20:00"
+
+[[Tests]]
+id = "Tests.2"
+test_name = "nccl_test_all_reduce_single_node"
+time_limit = "00:20:00"
+[[Tests.dependencies]]
+type = "start_post_comp"
+id = "Tests.1"
+time = 0
 ```

 Notes on the test scenario:
-1. `name` is a mandatory filed. Other fields describe arbitrary number of tests and their dependencies.
-1. The `name` of the tests should be found in the test schema files. Node lists and time limits are optional.
+1. `id` is a mandatory field and must be unique for each test.
+1. The `test_name` specifies test definition from one of the Test TOML files. Node lists and time limits are optional.
 1. If needed, `nodes` should be described as a list of node names as shown in a Slurm system. Alternatively, if groups are defined in the system schema, you can ask CloudAI to allocate a specific number of nodes from a specified partition and group. For example `nodes = ['PARTITION:GROUP:16']`: 16 nodes are allocated from a group `GROUP`, from a partition `PARTITION`.
 1. There are three types of dependencies: `start_post_comp`, `start_post_init` and `end_post_comp`.
 1. `start_post_comp` means that the current test should be started after a specific delay of the completion of the depending test.
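
The `nodes` note in this hunk describes two forms of entry: a plain node name, or a `PARTITION:GROUP:COUNT` request. As a rough illustration only, the sketch below shows one way such an entry could be told apart and split. `GroupAllocation` and `parse_node_spec` are hypothetical names, not CloudAI's API, and the rule "exactly three colon-separated parts means a group request" is an assumption.

```python
from typing import NamedTuple, Union


class GroupAllocation(NamedTuple):
    """Hypothetical container for a 'PARTITION:GROUP:COUNT' node request."""

    partition: str
    group: str
    count: int


def parse_node_spec(spec: str) -> Union[GroupAllocation, str]:
    """Return a GroupAllocation for 'PARTITION:GROUP:N', else the plain node name."""
    parts = spec.split(":")
    if len(parts) != 3:
        # Assumption: anything else is a literal node name.
        return spec
    partition, group, count = parts
    if not count.isdigit() or int(count) < 1:
        raise ValueError(f"node count must be a positive integer, got {count!r}")
    return GroupAllocation(partition, group, int(count))


print(parse_node_spec("PARTITION:GROUP:16"))
print(parse_node_spec("node01"))
```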
@@ -243,27 +247,34 @@ cache_docker_images_locally = true

 ## Describing a Test Scenario in the Test Scenario Schema
 A test scenario is a set of tests with specific dependencies between them. A test scenario is described in a TOML schema file. This is an example of a test scenario file:
-```
+```toml
 name = "nccl-test"

-[Tests.1]
-name = "nccl_test_all_reduce"
-num_nodes = "2"
-time_limit = "00:20:00"
-
-[Tests.2]
-name = "nccl_test_all_gather"
-num_nodes = "2"
-time_limit = "00:20:00"
-[Tests.2.dependencies]
-start_post_comp = { name = "Tests.1", time = 0 }
-
-[Tests.3]
-name = "nccl_test_reduce_scatter"
-num_nodes = "2"
-time_limit = "00:20:00"
-[Tests.3.dependencies]
-start_post_comp = { name = "Tests.2", time = 0 }
+[[Tests]]
+id = "Tests.1"
+test_name = "nccl_test_all_reduce"
+num_nodes = "2"
+time_limit = "00:20:00"
+
+[[Tests]]
+id = "Tests.2"
+test_name = "nccl_test_all_gather"
+num_nodes = "2"
+time_limit = "00:20:00"
+[[Tests.dependencies]]
+type = "start_post_comp"
+id = "Tests.1"
+time = 0
+
+[[Tests]]
+id = "Tests.3"
+test_name = "nccl_test_reduce_scatter"
+num_nodes = "2"
+time_limit = "00:20:00"
+[[Tests.dependencies]]
+type = "start_post_comp"
+id = "Tests.2"
+time = 0
 ```

 The `name` field is the test scenario name, which can be any unique identifier for the scenario. Each test has a section name, following the convention `Tests.1`, `Tests.2`, etc., with an increasing index. The `name` of a test should be specified in this section and must correspond to an entry in the test schema. If a test in a test scenario is not present in the test schema, CloudAI will not be able to identify it.

conf/common/test_scenario/chakra_replay.toml

Lines changed: 4 additions & 5 deletions
@@ -15,8 +15,7 @@
 # limitations under the License.

 name = "chakra_replay"
-
-[Tests]
-[Tests.1]
-name = "chakra_replay"
-num_nodes = "2"
+[[Tests]]
+id = "Tests.1"
+test_name = "chakra_replay"
+num_nodes = "2"
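
The same mechanical change (keyed `[Tests.N]` tables become `[[Tests]]` entries with `id` and `test_name`) recurs across the scenario files in this commit. The sketch below expresses that mapping on already-parsed data, as a way to read the diff rather than as a tool shipped with CloudAI. `migrate_scenario` is a hypothetical helper, and the default of `0` for a missing `time` is an assumption.

```python
def migrate_scenario(old: dict) -> dict:
    """Convert a parsed old-style scenario ([Tests.N] tables) to the new layout."""
    new_tests = []
    for key, test in old.get("Tests", {}).items():
        # [Tests.1] parses to {"Tests": {"1": {...}}}; the key becomes part of the id.
        entry = {"id": f"Tests.{key}", "test_name": test["name"]}
        # Copy remaining fields (num_nodes, time_limit, ...) unchanged.
        entry.update({k: v for k, v in test.items() if k not in ("name", "dependencies")})
        # Old form: start_post_comp = { name = "Tests.1", time = 0 }
        deps = [
            {"type": dep_type, "id": dep["name"], "time": dep.get("time", 0)}
            for dep_type, dep in test.get("dependencies", {}).items()
        ]
        if deps:
            entry["dependencies"] = deps
        new_tests.append(entry)
    top_level = {k: v for k, v in old.items() if k != "Tests"}
    return {**top_level, "Tests": new_tests}


# Parsed form of the old chakra_replay.toml shown in the hunk above.
old_chakra = {"name": "chakra_replay", "Tests": {"1": {"name": "chakra_replay", "num_nodes": "2"}}}
print(migrate_scenario(old_chakra))
```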
