Commit 18b5b9a

Merge pull request #743 from NVIDIA/am/doc
Update documentation
2 parents 0b321e9 + 335384a commit 18b5b9a

File tree

14 files changed: +388, −417 lines


README.md

Lines changed: 46 additions & 32 deletions
````diff
@@ -29,76 +29,90 @@ These schemas enable CloudAI to be flexible and compatible with different system
 ## Support matrix
 |Test|Slurm|Kubernetes|RunAI|Standalone|
 |---|---|---|---|---|
+|AI Dynamo|||||
+|BashCmd|||||
 |ChakraReplay|||||
-|GPT|||||
-|Grok|||||
+|DDLB|||||
+|DeepEP|||||
+|JaxToolbox workloads (DEPRECATED)|||||
+|MegatronRun|||||
 |NCCL|||||
-|NeMo Launcher|||||
-|NeMo Run|||||
-|Nemotron|||||
+|NeMo v1.0 aka NemoLauncher (DEPRECATED)|||||
+|NeMo v2.0 (aka NemoRun)|||||
+|NIXL benchmark|||||
+|NIXL kvbench|||||
+|NIXL CTPerf|||||
 |Sleep|||||
-|UCC|||||
 |SlurmContainer|||||
-|MegatronRun (experimental)|||||
+|Triton Inference|||||
+|UCC|||||
+
+*deprecated means that workload support exists, but we are not actively maintaining it anymore and newer configurations might not work.
 
 For more detailed information, please refer to the [official documentation](https://nvidia.github.io/cloudai/workloads/index.html).
 
 ## CloudAI Modes Usage Examples
 
-CloudAI supports five modes:
-- [install](#install) - Use the install mode to install all test templates in the specified installation path
-- [dry-run](#dry-run) - Use the dry-run mode to simulate running experiments without actually executing them. This is useful for verifying configurations and testing experiment setups
-- [run](#run) - Use the run mode to run experiments
-- [generate-report](#generate-report) - Use the generate-report mode to generate reports under the test directories alongside the raw data
-- [uninstall](#uninstall) - Use the uninstall mode to remove installed test templates
-
-### install
-
-To install test prerequisites, run CloudAI CLI in install mode. For more details, please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html).
+### run
+This mode runs workloads. It automatically installs prerequisites if they are not met.
 
-Please make sure to use the correct system configuration file that corresponds to your current setup for installation and experiments.
 ```bash
-cloudai install \
+cloudai run \
   --system-config conf/common/system/example_slurm_cluster.toml \
   --tests-dir conf/common/test \
   --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 ### dry-run
-To simulate running experiments without execution, use the dry-run mode:
+This mode simulates running experiments without actually executing them. This is useful for verifying configurations and testing experiment setups.
+
 ```bash
 cloudai dry-run \
   --system-config conf/common/system/example_slurm_cluster.toml \
   --tests-dir conf/common/test \
   --test-scenario conf/common/test_scenario/sleep.toml
 ```
-### run
-To run experiments, execute CloudAI CLI in run mode:
-```bash
-cloudai run \
-  --system-config conf/common/system/example_slurm_cluster.toml \
-  --tests-dir conf/common/test \
-  --test-scenario conf/common/test_scenario/sleep.toml
-```
+
 ### generate-report
-To generate reports, execute CloudAI CLI in generate-report mode:
+This mode generates reports under the scenario directory. It automatically runs as part of the `run` mode after experiments are completed.
+
 ```bash
 cloudai generate-report \
   --system-config conf/common/system/example_slurm_cluster.toml \
   --tests-dir conf/common/test \
   --test-scenario conf/common/test_scenario/sleep.toml \
   --result-dir /path/to/result_directory
 ```
-In the generate-report mode, use the --result-dir argument to specify a subdirectory under the output directory.
-This subdirectory is usually named with a timestamp for unique identification.
+
+### install
+This mode installs test prerequisites. For more details, please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html). It automatically runs as part of the `run` mode if prerequisites are not met.
+
+```bash
+cloudai install \
+  --system-config conf/common/system/example_slurm_cluster.toml \
+  --tests-dir conf/common/test \
+  --test-scenario conf/common/test_scenario/sleep.toml
+```
+
 ### uninstall
-To uninstall test prerequisites, run CloudAI CLI in uninstall mode:
+The opposite of the install mode, this mode removes installed test prerequisites.
+
 ```bash
 cloudai uninstall \
   --system-config conf/common/system/example_slurm_cluster.toml \
   --tests-dir conf/common/test \
   --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
+### list
+This mode lists internal components available within CloudAI.
+```bash
+cloudai list <component_type>
+```
+
 ### verify-configs
+This mode verifies the correctness of system, test and test scenario configuration files.
+
 ```bash
 # verify all at once
 cloudai verify-configs conf
````

doc/USER_GUIDE.md

Lines changed: 56 additions & 106 deletions
````diff
@@ -55,14 +55,24 @@ CloudAI is fully configurable via set of TOML configuration files. You can find
 Test definition is a Pydantic model that describes the arguments of a test. Such models should inherit from the `TestDefinition` class:
 ```py
 class MyTestCmdArgs(CmdArgs):
-    an_arg: str
+    an_arg: str | list[str]
     docker_image_url: str = "nvcr.io/nvidia/pytorch:24.02-py3"
 
 class MyTestDefinition(TestDefinition):
     cmd_args: MyTestCmdArgs
 ```
 Notice that `cmd_args.docker_image_url` uses `nvcr.io/nvidia/pytorch:24.02-py3`, but you can use the Docker image from Step 1.
 
+`an_arg` has a mixed type of `str | list[str]`, so in a TOML config it can be defined as either:
+```toml
+an_arg = "a single string"
+```
+or
+```toml
+an_arg = ["list", "of", "strings"]
+```
+When a list is used, CloudAI will automatically generate multiple test cases, one for each value in the list.
+
 A custom test definition should be registered to handle relevant Test Configs. For this, the `Registry()` object is used:
 ```py
 Registry().add_test_definition("MyTest", MyTestDefinition)
````
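The list-expansion behavior described in the hunk above (one test case per value of a list-typed argument) can be sketched roughly as follows; the helper below is a hypothetical illustration, not part of CloudAI's API:

```python
# Hypothetical sketch of list-valued argument expansion: each list-typed
# cmd_arg fans out into one test case per value (a cartesian product when
# several args are lists). Not CloudAI's actual implementation.
from itertools import product


def expand_cmd_args(cmd_args: dict) -> list[dict]:
    """Expand list-valued fields into one flat argument dict per combination."""
    keys = list(cmd_args)
    per_key = [v if isinstance(v, list) else [v] for v in cmd_args.values()]
    return [dict(zip(keys, combo)) for combo in product(*per_key)]


cases = expand_cmd_args({"an_arg": ["a", "b"], "docker_image_url": "img"})
# two cases are generated, one per value of an_arg
```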
````diff
@@ -90,15 +100,7 @@ name = "partition_1"
 ```
 Replace `<YOUR PARTITION NAME>` with the name of the partition you want to use. You can find the partition name by running `sinfo` on the cluster.
 
-## Step 5: Install Test Requirements
-Once all configs are ready, it is time to install test requirements. It is done once so that you can run multiple experiments without reinstalling the requirements. This step requires the system config file from step 3.
-```bash
-cloudai install \
-  --system-config myconfig/system.toml \
-  --tests-dir myconfig/tests/
-```
-
-## Step 6: Test Configuration
+## Step 5: Test Configuration
 Test Configuration describes a particular test configuration to be run. It is based on a Test definition and will be used in a Test Scenario. Below is the `myconfig/tests/nccl_test.toml` file; the definition is based on the built-in `NcclTest` definition:
 ```toml
 name = "nccl_test_all_reduce_single_node"
````
````diff
@@ -116,7 +118,8 @@ stepfactor = 2
 ```
 You can find more examples under `conf/common/test`. In a test schema file, you can adjust arguments as shown above. In the `cmd_args` section, you can provide different values other than the default values for each argument. In `extra_cmd_args`, you can provide additional arguments that will be appended after the NCCL test command. You can specify additional environment variables in the `extra_env_vars` section.
 
-## Step 7: Run Experiments
+
+## Step 6: Run Experiments
 Test Scenario uses the Test description from step 5. Below is the `myconfig/scenario.toml` file:
 ```toml
 name = "nccl-test"
````
````diff
@@ -165,7 +168,46 @@ cloudai run \
   --tests-dir myconfig/tests/
 ```
 
-## Step 8: Generate Reports
+### Test-in-Scenario
+One can override some args or even fully define a workload inside a scenario file:
+```toml
+name = "nccl-test"
+
+[[Tests]]
+id = "allreduce.in.scenario"
+num_nodes = 1
+time_limit = "00:20:00"
+
+name = "nccl_test_all_reduce_single_node"
+description = "all_reduce"
+test_template_name = "NcclTest"
+
+[Tests.cmd_args]
+subtest_name = "all_reduce_perf_mpi"
+ngpus = 1
+minbytes = "8M"
+maxbytes = "16G"
+iters = 5
+warmup_iters = 3
+stepfactor = 2
+
+[[Tests]]
+id = "allreduce.override"
+num_nodes = 1
+test_name = "nccl_test_all_reduce_single_node"
+time_limit = "00:20:00"
+
+[Tests.cmd_args]
+stepfactor = 4
+```
+
+`allreduce.in.scenario` fully defines a workload; in this case `test_name` must not be set, while `name`, `description` and `test_template_name` must be set.
+
+`allreduce.override` overrides only the `stepfactor` arg from the test defined in the tests dir.
+
+If a scenario contains only fully defined tests, the `--tests-dir` arg is not required.
+
+## Step 7: Generate Reports
 Once the test scenario is completed, you can generate reports using the following command:
 ```bash
 cloudai generate-report \
````
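The Test-in-Scenario override described in the hunk above replaces only the args given in the scenario while keeping the remaining `cmd_args` from the tests dir. A minimal sketch of that merge (a hypothetical helper for illustration, not CloudAI's code):

```python
# Hypothetical sketch of the Test-in-Scenario override: scenario-level
# cmd_args take precedence over the values defined in the tests dir.
# Not CloudAI's actual implementation.
def merge_cmd_args(from_tests_dir: dict, from_scenario: dict) -> dict:
    """Return a copy of the tests-dir args with scenario overrides applied."""
    merged = dict(from_tests_dir)
    merged.update(from_scenario)
    return merged


base = {"subtest_name": "all_reduce_perf_mpi", "stepfactor": 2, "iters": 5}
merged = merge_cmd_args(base, {"stepfactor": 4})
# stepfactor comes from the scenario; everything else keeps its tests-dir value
```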
````diff
@@ -221,7 +263,7 @@ CUDA_VISIBLE_DEVICES = "0,1,2,3,4,5,6,7"
 
 ## Field Descriptions
 - **name**: Specifies the name of the system. Users can choose any name that is convenient for them.
-- **scheduler**: Indicates the type of system. It should be one of the supported types, currently `slurm` or `standalone`. `slurm` refers to a system with the Slurm scheduler, while `standalone` refers to a single-node system without any slave nodes.
+- **scheduler**: Indicates the type of system. It should be one of the supported types, currently `slurm` or `standalone`. `slurm` refers to a system with the Slurm scheduler, while `standalone` refers to a single-node system without any slave nodes. Other values are possible depending on the available schedulers supported by CloudAI.
 - **install_path**: Specifies the path where test prerequisites are installed. Docker images are downloaded to this path if the user chooses to cache Docker images.
 - **output_path**: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
 - **default_partition**: Specifies the default partition where jobs are scheduled.
````
````diff
@@ -334,71 +376,6 @@ Replace `<your-api-token-here>` with your actual token.
 Both the endpoint and token must be valid for the HTTP Data Repository to function correctly. If either is missing or incorrect, data will not be posted.
 
 
-# Downloading and Installing the NeMo Dataset (The Pile Dataset)
-This section describes how you can download the NeMo datasets on your server. The install mode of CloudAI handles the installation of all test prerequisites, but downloading and installing datasets is not the responsibility of the install mode. This is because any large datasets should be installed globally by the administrator and shared with multiple users, even if a user does not use CloudAI.
-
-For CloudAI users, we provide a detailed guide about downloading and installing the NeMo datasets in this section. By default, the NeMo launcher uses mock datasets for testing purposes. If you want to run tests using real datasets, you must download the datasets and update the test `.toml` files accordingly to locate the datasets and provide appropriate prefixes.
-
-To understand the datasets available in the NeMo framework, you can refer to the Data Preparation section of [the document](https://docs.nvidia.com/launchpad/ai/base-command-nemo/latest/bc-nemo-step-02.html#use-bignlp-to-download-and-prepare-the-pile-dataset). According to the document, you can download and use the Pile dataset. The document also provides detailed instructions on how to download these datasets for various platforms.
-
-Let’s assume that we have a Slurm cluster.
-
-You can download the datasets with the following command:
-```bash
-$ git clone https://github.com/NVIDIA/NeMo-Framework-Launcher.git
-$ cd NeMo-Framework-Launcher
-$ python3 launcher_scripts/main.py \
-    container=nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11 \
-    stages=["data_preparation"] \
-    launcher_scripts_path=$PWD/launcher_scripts \
-    base_results_dir=$PWD/result \
-    env_vars.TRANSFORMERS_OFFLINE=0 \
-    data_dir=directory_path_to_download_dataset \
-    data_preparation.run.time_limit="96:00:00"
-```
-
-Once you submit a NeMo job with the data preparation stage, you should be able to find data downloading jobs with the squeue command. If this command does not work, please review the log files under $PWD/result. If you want to download the full Pile dataset, you should have at least 1TB of space in the directory to download the dataset because the Pile dataset size is 800GB.
-By default, NeMo will look at the configuration file under conf/config.yaml:
-```
-defaults:
-  - data_preparation: baichuan2/download_baichuan2_pile
-
-stages:
-  - data_preparation
-```
-
-As the data_preparation field points to baichuan2/download_baichuan2_pile, it will read the YAML file:
-```
-run:
-  name: download_baichuan2_pile
-  results_dir: ${base_results_dir}/${.name}
-  time_limit: "4:00:00"
-  dependency: "singleton"
-  node_array_size: 30
-  array: ${..file_numbers}
-  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.
-
-dataset: pile
-download_the_pile: True # Whether to download the pile dataset from the internet.
-the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/" # Source URL to download The Pile dataset from.
-file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
-preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
-download_tokenizer_url: "https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/tokenizer.model"
-tokenizer_library: "sentencepiece"
-tokenizer_save_dir: ${data_dir}/baichuan2
-tokenizer_model: ${.tokenizer_save_dir}/baichuan2_tokenizer.model
-rm_downloaded: False # Extract script will remove downloaded zst after extraction
-rm_extracted: False # Preprocess script will remove extracted files after preproc.
-```
-
-You can update the fields to adjust the behavior. For example, you can update the file_numbers field to adjust the number of dataset files to download. This will allow you to save disk space.
-
-## Note: For running the NeMo Llama model, it is important to follow these additional steps:
-1. Go to [🤗 Hugging Face](https://huggingface.co/docs/transformers/en/model_doc/llama).
-2. Follow the instructions on how to download the tokenizer.
-3. Replace `TOKENIZER_MODEL` in `training.model.tokenizer.model=TOKENIZER_MODEL` with your path (the tokenizer should be a `.model` file) in `conf/common/test/llama.toml`.
-
 # Using Test Hooks in CloudAI
 
 A test hook in CloudAI is a specialized test that runs either before or after each main test in a scenario, providing flexibility to prepare the environment or clean up resources. Hooks are defined as pre-test or post-test and referenced in the test scenario’s TOML file using `pre_test` and `post_test` fields.
````
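The pre/post hook ordering described above can be sketched as follows (a hypothetical illustration of the control flow, not CloudAI's implementation):

```python
# Hypothetical sketch of the hook ordering described above: every
# pre_test hook runs before the main test, every post_test hook after.
# Not CloudAI's actual implementation.
from typing import Callable, Sequence


def run_with_hooks(test: Callable[[], None],
                   pre_test: Sequence[Callable[[], None]] = (),
                   post_test: Sequence[Callable[[], None]] = ()) -> None:
    """Run pre-test hooks, then the main test, then post-test hooks."""
    for hook in pre_test:
        hook()
    test()
    for hook in post_test:
        hook()


events: list[str] = []
run_with_hooks(lambda: events.append("main"),
               pre_test=[lambda: events.append("pre")],
               post_test=[lambda: events.append("post")])
# events == ["pre", "main", "post"]
```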
````diff
@@ -629,7 +606,7 @@ extra_args: list[str] = []
 Fields with `None` value are not passed to `nsys` command.
 
 # Troubleshooting
-In this section, we will guide you through identifying the root cause of issues, determining whether they stem from system infrastructure or a bug in CloudAI. Users should closely follow the USER_GUIDE.md and README.md for installation, tests, and test scenarios.
+In this section, we will guide you through identifying the root cause of issues, determining whether they stem from system infrastructure or a bug in CloudAI.
 
 ## Identifying the Root Cause
 If you encounter issues running a command, start by reading the error message to understand the root cause. We strive to make our error messages and exception messages as readable and interpretable as possible.
````
````diff
@@ -656,30 +633,3 @@ To determine whether an issue is due to system infrastructure or a CloudAI bug,
 - Execute the command manually to debug further
 
 If the problem persists, please report the issue at [https://github.com/NVIDIA/cloudai/issues/new/choose](https://github.com/NVIDIA/cloudai/issues/new/choose). When you report an issue, ensure it is reproducible. Follow the issue template and provide any necessary details, such as the commit hash used, system settings, any changes in the schema files, and the command.
-
-## Test Specific Troubleshooting Guides
-In addition to the general troubleshooting steps, this section provides specific troubleshooting guides for each test used in CloudAI. These guides help you identify and resolve issues unique to each template.
-
-### NeMo Launcher
-* If your run is not successful, please review the stderr and stdout files generated under the results directory. Within the output directory, locate the run directory, and under the run directory, you will find stderr files like log-nemo-megatron-run_[job_id].err. Please review these files for any meaningful error messages
-* Trying the CloudAI-generated NeMo launcher command can be helpful as well. You can find the executed command in your stdout and in your log file (debug.log) in your current working directory. Review and run the command, and you can modify the arguments to troubleshoot the issue
-
-### JaxToolbox (Grok)
-#### Troubleshooting Steps
-If an error occurs, follow these steps sequentially:
-
-1. **Read the Error Messages**:
-   Begin by reading the error messages printed by CloudAI. We strive to make our error messages clear and informative, so they are a good starting point for troubleshooting
-
-2. **Review `profile_stderr.txt`**: JaxToolbox operates in two stages: the profiling phase and the actual run phase. We follow the PGLE workflow as described in the [PGLE workflow documentation](https://github.com/google/paxml?tab=readme-ov-file#run-pgle-workflow-on-gpu). All stderr and stdout messages from the profiling phase are stored in `profile_stderr.txt`. If the profiling stage fails, you should find relevant error messages in this file. Attempt to understand the cause of the error from these messages.
-
-3. **Check the Actual Run Phase**:
-   If the profiling stage completes successfully, CloudAI moves on to the actual run phase. The actual run generates stdout and stderr messages in separate files for each rank. Review these files to diagnose any issues during this phase.
-
-#### Common Errors
-**DEADLINE_EXCEEDED**:
-- When running JaxToolbox on multiple nodes, the nodes must be able to communicate to execute a training job collaboratively. The DEADLINE_EXCEEDED error indicates a failure in the connection during the initialization stage. Potential causes include:
-  - Hostname resolution failure by the slave nodes
-  - The port opened by the master node is not accessible by other nodes
-  - Network interface malfunctions
-  - Significant time gap in the initialization phase among nodes. If one node starts early while others are still loading the Docker image, this error can occur. This can happen when a Docker image is not locally cached, and all nodes try to download it from a remote registry without sufficient network bandwidth. The resulting difference in initialization times can lead to a timeout on some nodes
````
