
Commit 953f04b

Merge pull request #145 from NVIDIA/am/schema
Introduce Pydantic to verify Test schema
2 parents: c3542c7 + 7dcf8a7


51 files changed: +906, -1125 lines

README.md

Lines changed: 11 additions & 8 deletions
````diff
@@ -63,13 +63,12 @@ CloudAI supports five modes: install, dry-run, run, generate-report, and uninsta
 * Use the generate-report mode to generate reports under the test directories alongside the raw data.
 * Use the uninstall mode to remove installed test templates.
 
-To install test templates, run CloudAI CLI in install mode.
+To install test prerequisites, run CloudAI CLI in install mode.
 Please make sure to use the correct system configuration file that corresponds to your current setup for installation and experiments.
 ```bash
 cloudai\
 --mode install\
 --system-config conf/common/system/example_slurm_cluster.toml\
---test-templates-dir conf/common/test_template\
 --tests-dir conf/common/test
 ```
 
@@ -78,7 +77,6 @@ To simulate running experiments without execution, use the dry-run mode:
 cloudai\
 --mode dry-run\
 --system-config conf/common/system/example_slurm_cluster.toml\
---test-templates-dir conf/common/test_template\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
@@ -88,7 +86,6 @@ To run experiments, execute CloudAI CLI in run mode:
 cloudai\
 --mode run\
 --system-config conf/common/system/example_slurm_cluster.toml\
---test-templates-dir conf/common/test_template\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
@@ -98,19 +95,17 @@ To generate reports, execute CloudAI CLI in generate-report mode:
 cloudai\
 --mode generate-report\
 --system-config conf/common/system/example_slurm_cluster.toml\
---test-templates-dir conf/common/test_template\
 --tests-dir conf/common/test\
 --output-dir /path/to/output_directory
 ```
 In the generate-report mode, use the --output-dir argument to specify a subdirectory under the result directory.
 This subdirectory is usually named with a timestamp for unique identification.
 
-To uninstall test templates, run CloudAI CLI in uninstall mode:
+To uninstall test prerequisites, run CloudAI CLI in uninstall mode:
 ```bash
 cloudai\
 --mode uninstall\
 --system-config conf/common/system/example_slurm_cluster.toml\
---test-templates-dir conf/common/test_template\
 --tests-dir conf/common/test
 ```
 
@@ -119,11 +114,19 @@ Verify if system configs are valid:
 cloudai\
 --mode verify-systems\
 --tests-dir conf/common/test\
---test-templates-dir conf/common/test_template\
 --system-config conf/common/system
 ```
 `--system-config` can be a file or a directory to verify all configs in the directory.
 
+Verify if test configs are valid:
+```bash
+cloudai\
+--mode verify-tests\
+--system-config conf/common/system/example_slurm_cluster.toml\
+--tests-dir conf/common/test
+```
+`--tests-dir` can be a file or a directory to verify all configs in the directory.
+
 ## Contributing
 Feel free to contribute to the CloudAI project. Your contributions are highly appreciated.
````

USER_GUIDE.md

Lines changed: 24 additions & 50 deletions
````diff
@@ -1,8 +1,5 @@
 # CloudAI User Guide
-This is a CloudAI user guide to help users use CloudAI, covering topics such as adding a new test template and downloading datasets for running NeMo-launcher.
-
-## Adding a New Test Template
-CloudAI allows users to package workloads as test templates to facilitate the automation of running experiments. This method involves packaging workloads as docker images, which is one of several approaches you can take with CloudAI. Users can run workloads using test templates. However, since docker images are not part of the CloudAI distribution, users must build their own docker image. This guide describes how to build a docker image and then run experiments.
+This is a CloudAI user guide to help users use CloudAI, covering topics such as adding new tests and downloading datasets for running NeMo-launcher.
 
 #### Step 1: Create a Docker Image
 1. **Set Up the GitLab Repository**
@@ -49,49 +46,30 @@ CloudAI allows users to package workloads as test templates to facilitate the au
 
 #### Step 2: Prepare configuration files
 CloudAI is fully configurable via set of TOML configuration files. You can find examples of these files under `conf/common`. In this guide, we will use the following configuration files:
-1. `myconfig/test_templates/nccl_template.toml` - Describes the test template configuration.
 1. `myconfig/system.toml` - Describes the system configuration.
 1. `myconfig/tests/nccl_test.toml` - Describes the test to run.
 1. `myconfig/scenario.toml` - Describes the test scenario configuration.
 
 
-#### Step 3: Test Template
-Test template config describes all arguments of a test. Let's create a test template file for the NCCL test. You can find more examples of test templates under `conf/common/test_template/`. Our example will be small for demonstration purposes. Below is the `myconfig/test_templates/nccl_template.toml` file:
-```toml
-name = "NcclTest"
+#### Step 3: Test definition
+A test definition is a Pydantic model that describes the arguments of a test. Such models should inherit from the `TestDefinition` class:
+```py
+class MyTestCmdArgs(CmdArgs):
+    an_arg: str
+    docker_image_url: str = "nvcr.io/nvidia/pytorch:24.02-py3"
 
-[cmd_args]
-[cmd_args.docker_image_url]
-type = "str"
-default = "nvcr.io/nvidia/pytorch:24.02-py3"
-
-[cmd_args.subtest_name]
-type = "preset"
-values = ["all_reduce_perf_mpi"]
-default = "all_reduce_perf_mpi"
-
-[cmd_args.ngpus]
-type = "int"
-default = "1"
-
-[cmd_args.minbytes]
-type = "str"
-default = "32M"
-
-[cmd_args.maxbytes]
-type = "str"
-default = "32M"
-
-[cmd_args.iters]
-type = "int"
-default = "20"
-
-[cmd_args.warmup_iters]
-type = "int"
-default = "5"
+class MyTestDefinition(TestDefinition):
+    cmd_args: MyTestCmdArgs
 ```
 Notice that `cmd_args.docker_image_url` uses `nvcr.io/nvidia/pytorch:24.02-py3`, but you can use Docker image from Step 1.
 
+A custom test definition should be registered to handle the relevant Test Configs. For this, the `Registry()` object is used:
+```py
+Registry().add_test_definition("MyTest", MyTestDefinition)
+Registry().add_test_template("MyTest", MyTest)
+```
+Relevant Test Configs should specify `test_template_name = "MyTest"` to use the custom test definition.
+
````
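Pydantic validates field types and defaults when the model is constructed. The sketch below imitates that behavior using only the standard library, so it runs without Pydantic installed; `MyTestCmdArgs` mirrors the class from the diff above but is a stand-in, not CloudAI's implementation:

```python
# Dependency-free sketch of what a Pydantic CmdArgs model gives you:
# typed fields, a default value, and an error on bad input. CloudAI's real
# classes derive from pydantic.BaseModel; this dataclass only mimics them.
from dataclasses import dataclass, fields

@dataclass
class MyTestCmdArgs:
    an_arg: str
    docker_image_url: str = "nvcr.io/nvidia/pytorch:24.02-py3"

    def __post_init__(self):
        # Mimic Pydantic's type check: every declared field must hold a str.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), str):
                raise TypeError(f"{f.name} must be a str")

args = MyTestCmdArgs(an_arg="hello")
print(args.docker_image_url)  # the declared default is applied

try:
    MyTestCmdArgs(an_arg=123)  # wrong type is rejected at construction time
except TypeError as exc:
    print("rejected:", exc)
```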
````diff
 #### Step 3: System Config
 System config describes the system configuration. You can find more examples of system configs under `conf/common/system/`. Our example will be small for demonstration purposes. Below is the `myconfig/system.toml` file:
 ```toml
@@ -119,12 +97,11 @@ Once all configs are ready, it is time to install test requirements. It is done
 ```bash
 cloudai --mode install \
 --system-config myconfig/system.toml \
---test-templates-dir myconfig/test_templates/ \
 --tests-dir myconfig/tests/
 ```
 
 #### Step 5: Test Configuration
-Test Config describes a particular test configuration to be run. It is based on Test Template and will be used in Test Sceanrio. Below is the `myconfig/tests/nccl_test.toml` file:
+Test Config describes a particular test configuration to be run. It is based on a Test definition and will be used in a Test Scenario. Below is the `myconfig/tests/nccl_test.toml` file; the definition is based on the built-in `NcclTest` definition:
 ```toml
 name = "nccl_test_all_reduce_single_node"
 description = "all_reduce"
@@ -174,7 +151,6 @@ To generate NCCL test commands without actual execution, use the `dry-run` mode.
 cloudai --mode dry-run \
 --test-scenario myconfig/scenario.toml \
 --system-config myconfig/system.toml \
---test-templates-dir myconfig/test_templates/ \
 --tests-dir myconfig/tests/
 ```
 
@@ -183,7 +159,6 @@ You can run NCCL test experiments with the following command. Whenever you run C
 cloudai --mode run \
 --test-scenario myconfig/scenario.toml \
 --system-config myconfig/system.toml \
---test-templates-dir myconfig/test_templates/ \
 --tests-dir myconfig/tests/
 ```
 
@@ -193,7 +168,6 @@ Once the test scenario is completed, you can generate reports using the followin
 cloudai --mode generate-report \
 --test-scenario myconfig/scenario.toml \
 --system-config myconfig/system.toml \
---test-templates-dir myconfig/test_templates/ \
 --tests-dir myconfig/tests/ \
 --output-dir results/2024-06-18_17-40-13/
 ```
@@ -257,14 +231,14 @@ cache_docker_images_locally = true
 ### Field Descriptions
 - **name**: Specifies the name of the system. Users can choose any name that is convenient for them.
 - **scheduler**: Indicates the type of system. It should be one of the supported types, currently `slurm` or `standalone`. `slurm` refers to a system with the Slurm scheduler, while `standalone` refers to a single-node system without any slave nodes.
-- **install_path**: Specifies the path where test templates are installed. Docker images are downloaded to this path if the user chooses to cache Docker images.
+- **install_path**: Specifies the path where test prerequisites are installed. Docker images are downloaded to this path if the user chooses to cache Docker images.
 - **output_path**: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
 - **default_partition**: Specifies the default partition where jobs are scheduled.
 - **partitions**: Describes the available partitions and nodes within those partitions.
 - **groups**: Within the same partition, users can define groups of nodes. This is a logical grouping that does not overlap between groups. The group concept can be used to allocate nodes from specific groups in a test scenario schema.
 - **mpi**: Indicates the Process Management Interface (PMI) implementation to be used for inter-process communication.
 - **gpus_per_node** and **ntasks_per_node**: These are Slurm arguments passed to the `sbatch` script and `srun`.
-- **cache_docker_images_locally**: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to `true`, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test template is run. This approach saves network bandwidth but requires more disk capacity. If set to `false`, CloudAI will allow Slurm to download the Docker images as needed when they are not cached locally by Slurm.
+- **cache_docker_images_locally**: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to `true`, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test is run. This approach saves network bandwidth but requires more disk capacity. If set to `false`, CloudAI will allow Slurm to download the Docker images as needed when they are not cached locally by Slurm.
 - **global_env_vars**: Lists all global environment variables that will be applied globally whenever tests are run.
 
 ## Describing a Test Scenario in the Test Scenario Schema
@@ -306,7 +280,7 @@ Dependencies of a test can be described as a subsection of the test. The depende
 
 
 ## Downloading and Installing the NeMo Dataset (The Pile Dataset)
-This section describes how you can download the NeMo datasets on your server. The install mode of CloudAI handles the installation of all test templates, but downloading and installing datasets is not the responsibility of the install mode. This is because any large datasets should be installed globally by the administrator and shared with multiple users, even if a user does not use CloudAI. For CloudAI users, we provide a detailed guide about downloading and installing the NeMo datasets in this section. To understand the datasets available in the NeMo framework, you can refer to the Data Preparation section of [the document](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/baichuan2/dataprep.html). According to the document, you can download and use the Pile dataset. The document also provides detailed instructions on how to download these datasets for various platforms. Let's assume that we have a Slurm cluster.
+This section describes how you can download the NeMo datasets on your server. The install mode of CloudAI handles the installation of all test prerequisites, but downloading and installing datasets is not the responsibility of the install mode. This is because any large datasets should be installed globally by the administrator and shared with multiple users, even if a user does not use CloudAI. For CloudAI users, we provide a detailed guide about downloading and installing the NeMo datasets in this section. To understand the datasets available in the NeMo framework, you can refer to the Data Preparation section of [the document](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/baichuan2/dataprep.html). According to the document, you can download and use the Pile dataset. The document also provides detailed instructions on how to download these datasets for various platforms. Let's assume that we have a Slurm cluster.
 
 You can download the datasets with the following command:
 ```bash
@@ -364,7 +338,7 @@ You can update the fields to adjust the behavior. For example, you can update th
 3. Replace "training.model.tokenizer.model=TOKENIZER_MODEL" with "training.model.tokenizer.model=YOUR_TOKENIZER_PATH" (the tokenizer should be a .model file) in conf/common/test/llama.toml.
 
 ## Troubleshooting
-In this section, we will guide you through identifying the root cause of issues, determining whether they stem from system infrastructure or a bug in CloudAI. Users should closely follow the USER_GUIDE.md and README.md for installation, adding test templates, tests, and test scenarios.
+In this section, we will guide you through identifying the root cause of issues, determining whether they stem from system infrastructure or a bug in CloudAI. Users should closely follow the USER_GUIDE.md and README.md for installation, tests, and test scenarios.
 
 ### Identifying the Root Cause
 If you encounter issues running a command, start by reading the error message to understand the root cause. We strive to make our error messages and exception messages as readable and interpretable as possible.
@@ -373,7 +347,7 @@ If you encounter issues running a command, start by reading the error message to
 To determine whether an issue is due to system infrastructure or a CloudAI bug, follow these steps:
 
 1. **Check stdout Messages**
-If CloudAI fails to run a test template successfully, it will be indicated in the stdout messages that a test has failed.
+If CloudAI fails to run a test successfully, it will be indicated in the stdout messages that a test has failed.
 
 2. **Review Log Files**
 - Navigate to the output directory and review `debug.log`, stdout, and stderr files.
@@ -392,8 +366,8 @@ To determine whether an issue is due to system infrastructure or a CloudAI bug,
 
 If the problem persists, please report the issue at [https://github.com/NVIDIA/cloudai/issues/new/choose](https://github.com/NVIDIA/cloudai/issues/new/choose). When you report an issue, ensure it is reproducible. Follow the issue template and provide any necessary details, such as the hash commit used, system settings, any changes in the schema files, and the command.
 
-### Test Template-Specific Troubleshooting Guides
-In addition to the general troubleshooting steps, this section provides specific troubleshooting guides for each test template used in CloudAI. These guides help you identify and resolve issues unique to each template.
+### Test-Specific Troubleshooting Guides
+In addition to the general troubleshooting steps, this section provides specific troubleshooting guides for each test used in CloudAI. These guides help you identify and resolve issues unique to each test.
 
 #### NeMo Launcher
 * If your run is not successful, please review the stderr and stdout files generated under the results directory. Within the output directory, locate the run directory, and under the run directory, you will find stderr files like log-nemo-megatron-run_[job_id].err. Please review these files for any meaningful error messages.
````

conf/common/test/chakra_replay.toml

Lines changed: 4 additions & 1 deletion
````diff
@@ -17,7 +17,10 @@
 name = "chakra_replay"
 description = "chakra_replay"
 test_template_name = "ChakraReplay"
-extra_cmd_args = "--reuse-tensors"
 
 [cmd_args]
 "trace_path" = "TRACE_PATH"
+docker_image_url = "DOCKER_IMAGE"
+
+[extra_cmd_args]
+"--reuse-tensors" = ""
````

conf/common/test/gpt.toml

Lines changed: 2 additions & 0 deletions
````diff
@@ -17,3 +17,5 @@
 name = "gpt"
 description = "gpt"
 test_template_name = "NeMoLauncher"
+
+[cmd_args]
````

conf/common/test/llama.toml

Lines changed: 25 additions & 7 deletions
````diff
@@ -17,13 +17,31 @@
 name = "llama"
 description = "Llama2 70b"
 test_template_name = "NeMoLauncher"
+
+[cmd_args]
+[cmd_args.training]
+values = "llama/llama2_70b"
+[cmd_args.training.trainer]
+max_steps = "120"
+[cmd_args.training.model]
+global_batch_size = "256"
+pipeline_model_parallel_size = "2"
+
 # FIXME : ~training.model.position_embedding_type was added in the extra_cmd_args in order to fix a bug from NeMo repository (https://github.com/NVIDIA/NeMo).
 # the commit that should fix this issue in NeMo is : 5b296e8af832c67d361fdfb80a165db3affaf76a.
 # Once the new release of NeMoLauncher includes this commit (check by downloading the corresponding container and look inside /opt for this commit), ~training.model.position_embedding_type should be removed from the extra args
-extra_cmd_args = "~training.model.position_embedding_type +training.model.fsdp=True ~training.model.optim.bucket_cap_mb ~training.model.optim.overlap_grad_sync ~training.model.optim.overlap_param_sync ~training.model.optim.contiguous_grad_buffer training.model.virtual_pipeline_model_parallel_size=null training.model.megatron_amp_O2=False training.model.activations_checkpoint_num_layers=null training.model.gradient_accumulation_fusion=False training.model.use_cpu_initialization=True training.model.optim.name=fused_adam training.model.tokenizer.model=TOKENIZER_MODEL training.exp_manager.create_wandb_logger=False"
-
-[cmd_args]
-"training" = "llama/llama2_70b"
-"training.trainer.max_steps" = "120"
-"training.model.global_batch_size" = "256"
-"training.model.pipeline_model_parallel_size" = "1"
+[extra_cmd_args]
+"~training.model.position_embedding_type" = ""
+"+training.model.fsdp" = "True"
+"~training.model.optim.bucket_cap_mb" = ""
+"~training.model.optim.overlap_grad_sync" = ""
+"~training.model.optim.overlap_param_sync" = ""
+"~training.model.optim.contiguous_grad_buffer" = ""
+"training.model.virtual_pipeline_model_parallel_size" = "null"
+"training.model.megatron_amp_O2" = "False"
+"training.model.activations_checkpoint_num_layers" = "null"
+"training.model.gradient_accumulation_fusion" = "False"
+"training.model.use_cpu_initialization" = "True"
+"training.model.optim.name" = "fused_adam"
+"training.model.tokenizer.model" = "TOKENIZER_MODEL"
+"training.exp_manager.create_wandb_logger" = "False"
````
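The structured `[extra_cmd_args]` table replaces the old free-form string, with an empty value denoting a bare flag. One plausible way such a table could be flattened back into a single command-line fragment is sketched below. This is a hypothetical helper, not CloudAI's renderer; note that the key/value separator can differ per workload (NeMo overrides use `=`, NCCL flags take their value after a space):

```python
# Hypothetical helper: flatten an [extra_cmd_args] table into the old
# single-string form. An empty value means a bare flag; non-empty values
# are joined to their key with a separator. Not CloudAI code.
def render_extra_cmd_args(table: dict[str, str], sep: str = "=") -> str:
    parts = [key if value == "" else f"{key}{sep}{value}" for key, value in table.items()]
    return " ".join(parts)

# NeMo-style overrides (key=value), as in llama.toml above:
print(render_extra_cmd_args({
    "~training.model.position_embedding_type": "",
    "+training.model.fsdp": "True",
}))  # ~training.model.position_embedding_type +training.model.fsdp=True

# NCCL-style flags take their value after a space:
print(render_extra_cmd_args({"--stepfactor": "2"}, sep=" "))  # --stepfactor 2
```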

conf/common/test/nccl_test_all_gather.toml

Lines changed: 3 additions & 1 deletion
````diff
@@ -17,7 +17,6 @@
 name = "nccl_test_all_gather"
 description = "all_gather"
 test_template_name = "NcclTest"
-extra_cmd_args = "--stepfactor 2"
 
 [cmd_args]
 "subtest_name" = "all_gather_perf_mpi"
@@ -27,5 +26,8 @@ extra_cmd_args = "--stepfactor 2"
 "iters" = "100"
 "warmup_iters" = "50"
 
+[extra_cmd_args]
+"--stepfactor" = "2"
+
 [extra_env_vars]
 "NCCL_TEST_SPLIT_MASK" = "0x7"
````

conf/common/test/nccl_test_all_reduce.toml

Lines changed: 3 additions & 1 deletion
````diff
@@ -17,7 +17,6 @@
 name = "nccl_test_all_reduce"
 description = "all_reduce"
 test_template_name = "NcclTest"
-extra_cmd_args = "--stepfactor 2"
 
 [cmd_args]
 "subtest_name" = "all_reduce_perf_mpi"
@@ -26,3 +25,6 @@ extra_cmd_args = "--stepfactor 2"
 "maxbytes" = "16G"
 "iters" = "100"
 "warmup_iters" = "50"
+
+[extra_cmd_args]
+"--stepfactor" = "2"
````

conf/common/test/nccl_test_alltoall.toml

Lines changed: 3 additions & 1 deletion
````diff
@@ -17,7 +17,6 @@
 name = "nccl_test_alltoall"
 description = "alltoall"
 test_template_name = "NcclTest"
-extra_cmd_args = "--stepfactor 2"
 
 [cmd_args]
 "subtest_name" = "alltoall_perf_mpi"
@@ -26,3 +25,6 @@ extra_cmd_args = "--stepfactor 2"
 "maxbytes" = "4G"
 "iters" = "100"
 "warmup_iters" = "50"
+
+[extra_cmd_args]
+"--stepfactor" = "2"
````

conf/common/test/nccl_test_reduce_scatter.toml

Lines changed: 3 additions & 1 deletion
````diff
@@ -17,7 +17,6 @@
 name = "nccl_test_reduce_scatter"
 description = "reduce_scatter"
 test_template_name = "NcclTest"
-extra_cmd_args = "--stepfactor 2"
 
 [cmd_args]
 "subtest_name" = "reduce_scatter_perf_mpi"
@@ -27,5 +26,8 @@ extra_cmd_args = "--stepfactor 2"
 "iters" = "100"
 "warmup_iters" = "50"
 
+[extra_cmd_args]
+"--stepfactor" = "2"
+
 [extra_env_vars]
 "NCCL_TEST_SPLIT_MASK" = "0x7"
````
