# CloudAI User Guide

This is a CloudAI user guide covering topics such as adding new tests and downloading datasets for running NeMo-launcher.

#### Step 1: Create a Docker Image

1. **Set Up the GitLab Repository**
#### Step 2: Prepare configuration files

CloudAI is fully configurable via a set of TOML configuration files. You can find examples of these files under `conf/common`. In this guide, we will use the following configuration files:

1. `myconfig/system.toml` - Describes the system configuration.
1. `myconfig/tests/nccl_test.toml` - Describes the test to run.
1. `myconfig/scenario.toml` - Describes the test scenario configuration.
#### Step 3: Test definition

A test definition is a Pydantic model that describes the arguments of a test. Such models should inherit from the `TestDefinition` class. Relevant test configs should specify `test_template_name = "MyTest"` to use the custom test definition.
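As a sketch of this pattern — note that everything except the `TestDefinition` base class name is hypothetical, and a minimal stand-in base is defined locally so the example is self-contained rather than importing from CloudAI:

```python
from pydantic import BaseModel

# Stand-in for CloudAI's `TestDefinition` base class, defined locally so this
# sketch is self-contained; in a real setup you would import the real class
# from CloudAI instead.
class TestDefinition(BaseModel):
    name: str
    description: str = ""

# Hypothetical command-line arguments for the custom test.
class MyTestCmdArgs(BaseModel):
    num_iterations: int = 100
    message_size: str = "8M"

# The custom test definition inherits from `TestDefinition`; test configs
# would then reference it with `test_template_name = "MyTest"`.
class MyTestDefinition(TestDefinition):
    cmd_args: MyTestCmdArgs = MyTestCmdArgs()

td = MyTestDefinition(name="MyTest", description="demo")
print(td.cmd_args.num_iterations)  # prints 100
```

Because the model is a Pydantic class, invalid argument types in a test config fail fast with a validation error instead of surfacing later at run time.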
#### Step 3: System Config

System config describes the system configuration. You can find more examples of system configs under `conf/common/system/`. Our example will be small for demonstration purposes. Below is the `myconfig/system.toml` file:
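A minimal sketch of such a file, built from the fields documented later in this guide. The exact nesting of `partitions`, `groups`, and `global_env_vars` is an assumption here, so consult the examples under `conf/common/system/` for the authoritative schema:

```toml
name = "my-cluster"
scheduler = "slurm"

install_path = "./install"
output_path = "./results"
default_partition = "main"

mpi = "pmix"
gpus_per_node = 8
ntasks_per_node = 8
cache_docker_images_locally = true

# Assumed nesting; real configs under conf/common/system/ may differ.
[global_env_vars]
NCCL_DEBUG = "INFO"

[[partitions]]
name = "main"
nodes = ["node-[001-004]"]

[[partitions.groups]]
name = "group1"
nodes = ["node-[001-002]"]
```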
Once all configs are ready, it is time to install test requirements:

```bash
cloudai --mode install \
    --system-config myconfig/system.toml \
    --tests-dir myconfig/tests/
```
#### Step 5: Test Configuration

Test Config describes a particular test configuration to be run. It is based on a test definition and will be used in a test scenario. Below is the `myconfig/tests/nccl_test.toml` file, which is based on the built-in `NcclTest` definition:

```toml
name = "nccl_test_all_reduce_single_node"
description = "all_reduce"
```
To generate NCCL test commands without actual execution, use the `dry-run` mode:

```bash
cloudai --mode dry-run \
    --test-scenario myconfig/scenario.toml \
    --system-config myconfig/system.toml \
    --tests-dir myconfig/tests/
```

You can run NCCL test experiments with the following command:

```bash
cloudai --mode run \
    --test-scenario myconfig/scenario.toml \
    --system-config myconfig/system.toml \
    --tests-dir myconfig/tests/
```
Once the test scenario is completed, you can generate reports using the following command.

- **name**: Specifies the name of the system. Users can choose any name that is convenient for them.
- **scheduler**: Indicates the type of system. It should be one of the supported types, currently `slurm` or `standalone`. `slurm` refers to a system with the Slurm scheduler, while `standalone` refers to a single-node system without any worker nodes.
- **install_path**: Specifies the path where test prerequisites are installed. Docker images are downloaded to this path if the user chooses to cache Docker images.
- **output_path**: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
- **default_partition**: Specifies the default partition where jobs are scheduled.
- **partitions**: Describes the available partitions and nodes within those partitions.
- **groups**: Within the same partition, users can define groups of nodes. This is a logical grouping that does not overlap between groups. The group concept can be used to allocate nodes from specific groups in a test scenario schema.
- **mpi**: Indicates the Process Management Interface (PMI) implementation to be used for inter-process communication.
- **gpus_per_node** and **ntasks_per_node**: These are Slurm arguments passed to the `sbatch` script and `srun`.
- **cache_docker_images_locally**: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to `true`, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test is run. This approach saves network bandwidth but requires more disk capacity. If set to `false`, CloudAI will allow Slurm to download the Docker images as needed when they are not cached locally by Slurm.
- **global_env_vars**: Lists all global environment variables that will be applied globally whenever tests are run.
## Describing a Test Scenario in the Test Scenario Schema

Dependencies of a test can be described as a subsection of the test.
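As an illustration only — all field names below are hypothetical, not the verified CloudAI scenario schema, so check the example scenario files shipped with CloudAI for the real layout — a scenario with a dependency might look like:

```toml
name = "nccl-scenario"

[Tests.1]
name = "nccl_test_all_reduce_single_node"
time_limit = "00:20:00"

[Tests.2]
name = "nccl_test_all_reduce_single_node"

# Hypothetical dependency subsection: start Tests.2 only after Tests.1 ends.
[Tests.2.dependencies]
end_post_comp = { name = "Tests.1" }
```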
## Downloading and Installing the NeMo Dataset (The Pile Dataset)

This section describes how you can download the NeMo datasets on your server. The install mode of CloudAI handles the installation of all test prerequisites, but downloading and installing datasets is not the responsibility of the install mode. This is because any large datasets should be installed globally by the administrator and shared with multiple users, even if a user does not use CloudAI. For CloudAI users, this section provides a detailed guide to downloading and installing the NeMo datasets. To understand the datasets available in the NeMo framework, refer to the Data Preparation section of [the NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/baichuan2/dataprep.html). According to that document, you can download and use the Pile dataset; it also provides detailed instructions on how to download these datasets for various platforms. Let's assume that we have a Slurm cluster.

You can download the datasets with the following command.
You can update the fields to adjust the behavior.

3. Replace `training.model.tokenizer.model=TOKENIZER_MODEL` with `training.model.tokenizer.model=YOUR_TOKENIZER_PATH` (the tokenizer should be a `.model` file) in `conf/common/test/llama.toml`.
## Troubleshooting

In this section, we will guide you through identifying the root cause of issues and determining whether they stem from system infrastructure or a bug in CloudAI. Users should closely follow USER_GUIDE.md and README.md for installation, tests, and test scenarios.
### Identifying the Root Cause

If you encounter issues running a command, start by reading the error message to understand the root cause. We strive to make our error messages and exception messages as readable and interpretable as possible.

To determine whether an issue is due to system infrastructure or a CloudAI bug, follow these steps:
1. **Check stdout Messages**

   If CloudAI fails to run a test successfully, it will be indicated in the stdout messages that a test has failed.

2. **Review Log Files**

   - Navigate to the output directory and review `debug.log`, stdout, and stderr files.

If the problem persists, please report the issue at [https://github.com/NVIDIA/cloudai/issues/new/choose](https://github.com/NVIDIA/cloudai/issues/new/choose). When you report an issue, ensure it is reproducible. Follow the issue template and provide any necessary details, such as the commit hash used, system settings, any changes in the schema files, and the command.

### Test-Specific Troubleshooting Guides

In addition to the general troubleshooting steps, this section provides specific troubleshooting guides for each test used in CloudAI. These guides help you identify and resolve issues unique to each test.
#### NeMo Launcher

* If your run is not successful, please review the stderr and stdout files generated under the results directory. Within the output directory, locate the run directory; under the run directory, you will find stderr files like `log-nemo-megatron-run_[job_id].err`. Please review these files for any meaningful error messages.

`conf/common/test/llama.toml`:

```toml
name = "llama"
description = "Llama2 70b"
test_template_name = "NeMoLauncher"

[cmd_args]

[cmd_args.training]
values = "llama/llama2_70b"

[cmd_args.training.trainer]
max_steps = "120"

[cmd_args.training.model]
global_batch_size = "256"
pipeline_model_parallel_size = "2"

# FIXME: ~training.model.position_embedding_type was added in the extra_cmd_args
# in order to fix a bug from the NeMo repository (https://github.com/NVIDIA/NeMo).
# The commit that should fix this issue in NeMo is 5b296e8af832c67d361fdfb80a165db3affaf76a.
# Once the new release of NeMoLauncher includes this commit (check by downloading the
# corresponding container and looking inside /opt for this commit),
# ~training.model.position_embedding_type should be removed from the extra args.
```