Commit e3451a4

Merge pull request #710 from NVIDIA/am/bug-4657975
Update docs
2 parents f3725a7 + 1a64f7f commit e3451a4

4 files changed

+15
-21
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -141,7 +141,7 @@ cloudai uninstall\
141141
--tests-dir conf/common/test\
142142
--test-scenario conf/common/test_scenario/sleep.toml
143143
```
144-
Verify TOML configs:
144+
### verify-configs
145145
```bash
146146
# verify all at once
147147
cloudai verify-configs conf

doc/DEV.md

Lines changed: 0 additions & 9 deletions
@@ -35,15 +35,6 @@ We use [import-linter](https://github.com/seddonym/import-linter) to ensure no c
3535

3636
`Registry` object is a singleton that holds implementation mappings. Users can register their own implementations to the registry or replace the default implementations.
3737

38-
## Runners
39-
TBD
40-
41-
## Installers
42-
TBD
43-
44-
## Systems
45-
TBD
46-
4738
## Cache
4839
Some prerequisites can be installed: docker images, git repos with executable scripts, etc. All such "installables" are kept under System's `install_path`.
4940

doc/USER_GUIDE.md

Lines changed: 9 additions & 9 deletions
@@ -70,7 +70,7 @@ Registry().add_test_template("MyTest", MyTest)
7070
```
7171
Relevant Test Configs should specify `test_template_name = MyTest` to use the custom test definition.
7272

73-
## Step 3: System Configuration
73+
## Step 4: System Configuration
7474
System configuration describes the system configuration. You can find more examples of system configs under `conf/common/system/`. Our example will be small for demonstration purposes. Below is the `myconfig/system.toml` file:
7575
```toml
7676
name = "my-cluster"
@@ -90,15 +90,15 @@ name = "partition_1"
9090
```
9191
Replace `<YOUR PARTITION NAME>` with the name of the partition you want to use. You can find the partition name by running `sinfo` on the cluster.
9292

93-
## Step 4: Install Test Requirements
93+
## Step 5: Install Test Requirements
9494
Once all configs are ready, it is time to install test requirements. It is done once so that you can run multiple experiments without reinstalling the requirements. This step requires the system config file from the step 3.
9595
```bash
9696
cloudai install \
9797
--system-config myconfig/system.toml \
9898
--tests-dir myconfig/tests/
9999
```
100100

101-
## Step 5: Test Configuration
101+
## Step 6: Test Configuration
102102
Test Configuration describes a particular test configuration to be run. It is based on Test definition and will be used in Test Sceanrio. Below is the `myconfig/tests/nccl_test.toml` file, definition is based on built-in `NcclTest` definition:
103103
```toml
104104
name = "nccl_test_all_reduce_single_node"
@@ -116,7 +116,7 @@ extra_cmd_args = "--stepfactor 2"
116116
```
117117
You can find more examples under `conf/common/test`. In a test schema file, you can adjust arguments as shown above. In the `cmd_args` section, you can provide different values other than the default values for each argument. In `extra_cmd_args`, you can provide additional arguments that will be appended after the NCCL test command. You can specify additional environment variables in the `extra_env_vars` section.
118118

119-
## Step 6: Run Experiments
119+
## Step 7: Run Experiments
120120
Test Scenario uses Test description from step 5. Below is the `myconfig/scenario.toml` file:
121121
```toml
122122
name = "nccl-test"
@@ -147,7 +147,7 @@ Notes on the test scenario:
147147
All dependencies are described as a pair of the depending test name and a delay. The name should be taken from the test name as set in the test scenario. The delay is described in the number of seconds.
148148

149149

150-
To generate NCCL test commands without actual execution, use the `dry-run` mode. You can review `debug.log` (or other file specifued with `--log-file`) to see the generated commands from CloudAI. Please note that group node allocations are not currently supported in the `dry-run` mode.
150+
To generate NCCL test commands without actual execution, use the `dry-run` mode. You can review `debug.log` (or other file specified with `--log-file`) to see the generated commands from CloudAI. Please note that group node allocations are not currently supported in the `dry-run` mode.
151151
```bash
152152
cloudai dry-run \
153153
--test-scenario myconfig/scenario.toml \
@@ -163,7 +163,7 @@ cloudai run \
163163
--tests-dir myconfig/tests/
164164
```
165165

166-
## Step 7: Generate Reports
166+
## Step 8: Generate Reports
167167
Once the test scenario is completed, you can generate reports using the following command:
168168
```bash
169169
cloudai generate-report \
@@ -392,9 +392,9 @@ rm_extracted: False # Preprocess script will remove extracted files after prepro
392392
You can update the fields to adjust the behavior. For example, you can update the file_numbers field to adjust the number of dataset files to download. This will allow you to save disk space.
393393

394394
## Note: For running Nemo Llama model, it is important to follow these additional steps:
395-
1. Go to https://huggingface.co/docs/transformers/en/model_doc/llama.
396-
2. Follow the instructions under 'Usage Tips' on how to download the tokenizer.
397-
3. Replace "training.model.tokenizer.model=TOKENIZER_MODEL" with "training.model.tokenizer.model=YOUR_TOKENIZER_PATH" (the tokenizer should be a .model file) in conf/common/test/llama.toml.
395+
1. Go to [🤗 Hugging Face](https://huggingface.co/docs/transformers/en/model_doc/llama).
396+
2. Follow the instructions on how to download the tokenizer.
397+
3. Replace `TOKENIZER_MODEL` in `training.model.tokenizer.model=TOKENIZER_MODEL` with your path (the tokenizer should be a `.model` file) in `conf/common/test/llama.toml`.
398398

399399

400400
# Using Test Hooks in CloudAI

doc/index.md

Lines changed: 5 additions & 2 deletions
@@ -135,8 +135,9 @@ cloudai generate-report\
135135
--test-scenario conf/common/test_scenario/sleep.toml\
136136
--result-dir /path/to/result_directory
137137
```
138-
In the generate-report mode, use the --result-dir argument to specify a subdirectory under the output directory.
138+
In the generate-report mode, use the `--result-dir` argument to specify a subdirectory under the output directory.
139139
This subdirectory is usually named with a timestamp for unique identification.
140+
140141
(uninstall)=
141142
### uninstall
142143
To uninstall test prerequisites, run CloudAI CLI in uninstall mode:
@@ -146,7 +147,9 @@ cloudai uninstall\
146147
--tests-dir conf/common/test\
147148
--test-scenario conf/common/test_scenario/sleep.toml
148149
```
149-
Verify TOML configs:
150+
151+
(verify-configs)=
152+
### verify-configs
150153
```bash
151154
# verify all at once
152155
cloudai verify-configs conf
