
Commit dc6b76b

Merge branch 'main' into m-bridge
2 parents ae12297 + fbf9891 commit dc6b76b


55 files changed: +2130 −109 lines

README.md

Lines changed: 7 additions & 7 deletions

@@ -11,22 +11,22 @@ cd cloudai
 uv run cloudai --help
 ```

-Please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html) for details on setting up workloads' requirements.
+For details on setting up workloads' requirements, please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html).

 For details and `pip`-based installation, please refer to the [documentation](https://nvidia.github.io/cloudai/#get-started).

 ## Key Concepts

-CloudAI operates on four main schemas:
+CloudAI operates on three main schemas:

-- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables.
-- **Test Schema**: An instance of a test template with custom arguments and environment variables.
-- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario.
+- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables
+- **Test Schema**: An instance of a test template with custom arguments and environment variables
+- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario

 These schemas enable CloudAI to be flexible and compatible with different systems and configurations.

-## Support matrix
+## Support Matrix
 |Test|Slurm|Kubernetes|RunAI|Standalone|
 |---|---|---|---|---|
 |AI Dynamo|||||
@@ -47,7 +47,7 @@ These schemas enable CloudAI to be flexible and compatible with different system
 |Triton Inference|||||
 |UCC|||||

-*deprecated means that a workload support exists, but we are not maintaining it actively anymore and newer configurations might not work.
+Note: Deprecated means that workload support exists, but we are no longer actively maintaining it, and newer configurations might not work.

 For more detailed information, please refer to the [official documentation](https://nvidia.github.io/cloudai/workloads/index.html).

conf/common/test/osu_test.toml

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name = "osu_test"
+test_template_name = "OSUBench"
+description = "OSU Benchmark example"
+
+[cmd_args]
+"docker_image_url" = "nvcr.io#nvidia/pytorch:25.06-py3"
+"benchmarks_dir" = "/opt/hpcx/ompi/tests/osu-micro-benchmarks"
+"benchmark" = "osu_allreduce"
+"iterations" = 10
+"message_size" = "1024"
Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "osu_test_scenario"
+job_status_check = true
+
+[[Tests]]
+id = "Tests.1"
+test_name = "osu_test"
+num_nodes = "2"
+time_limit = "00:20:00"
Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "aiconfigurator_disagg_demo"
+description = "Aiconfigurator disaggregated predictor demo"
+test_template_name = "Aiconfigurator"
+
+[cmd_args]
+model_name = "LLAMA3.1_70B"
+system = "h100_sxm"
+# backend and version use defaults
+isl = 3000
+osl = 150
+
+[cmd_args.disagg]
+p_tp = 4
+p_pp = 1
+p_dp = 1
+p_bs = 1
+p_workers = 1
+
+d_tp = 4
+d_pp = 1
+d_dp = 1
+d_bs = 256
+d_workers = 1
+
+prefill_correction_scale = 1.0
+decode_correction_scale = 1.0
Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "dse_aiconfigurator_disagg_demo_Llama3.1_70B"
+description = "Aiconfigurator disaggregated predictor DSE sweeps"
+test_template_name = "Aiconfigurator"
+agent_metrics = [
+    "ttft_ms",
+    "tpot_ms",
+    "tokens_per_s_per_gpu",
+    "tokens_per_s_per_user",
+]
+agent_reward_function = "ai_dynamo_log_scale"
+
+[cmd_args]
+model_name = "LLAMA3.1_70B"
+system = "h200_sxm"
+# backend and version use defaults
+isl = 4000
+osl = 500
+
+[cmd_args.disagg]
+p_tp = [1]
+p_pp = [1]
+p_dp = [1]
+p_bs = 1
+p_workers = [1, 2]
+
+d_tp = [1]
+d_pp = [1]
+d_dp = [1]
+d_bs = [8]
+d_workers = [1, 2]
+
+gemm_quant_mode = "fp8_block"
+moe_quant_mode = "fp8"
+kvcache_quant_mode = "fp8"
+fmha_quant_mode = "fp8"
+comm_quant_mode = "half"
+prefill_correction_scale = 1.0
+decode_correction_scale = 1.0
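In this DSE config, list-valued fields such as `p_workers = [1, 2]` and `d_workers = [1, 2]` describe candidate values to sweep over. Assuming the sweep space is the cartesian product of the listed candidates (an assumption about how such sweeps compose, not a statement about CloudAI's internals), its size is just the product of the list lengths:

```python
from itertools import product

# List-valued fields from the [cmd_args.disagg] table above; scalar fields are fixed.
sweep = {
    "p_tp": [1], "p_pp": [1], "p_dp": [1], "p_workers": [1, 2],
    "d_tp": [1], "d_pp": [1], "d_dp": [1], "d_bs": [8], "d_workers": [1, 2],
}

# Cartesian product over the candidate values of every swept field.
keys = list(sweep)
candidates = [dict(zip(keys, combo)) for combo in product(*sweep.values())]

print(len(candidates))  # 4: only p_workers and d_workers have two options each
```

Keeping most lists at a single value, as this demo does, keeps the space small (four candidates here) while still exercising the sweep machinery.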
Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "aiconfigurator-disagg-scenario"
+
+[[Tests]]
+id = "aiconfigurator.disagg.demo"
+time_limit = "00:05:00"
+test_name = "dse_aiconfigurator_disagg_demo_Llama3.1_70B"
+num_nodes = 1
+agent_metrics = [
+    "ttft_ms",
+    "tpot_ms",
+    "tokens_per_s_per_gpu",
+    "tokens_per_s_per_user",
+]

doc/DEV.md

Lines changed: 2 additions & 1 deletion

@@ -36,7 +36,8 @@ We use [import-linter](https://github.com/seddonym/import-linter) to ensure no c
 `Registry` object is a singleton that holds implementation mappings. Users can register their own implementations to the registry or replace the default implementations.

 ## Cache
-Some prerequisites can be installed: docker images, git repos with executable scripts, etc. All such "installables" are kept under System's `install_path`.
+Some prerequisites can be installed, for example Docker images and git repos with executable scripts.
+All such "installables" are kept under the System's `install_path`.

 Installables are shared among all tests. So if any number of tests use the same installable, it is installed only once for a particular System TOML.
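The install-once behavior described in the Cache section amounts to deduplicating installables by identity under `install_path`. A minimal sketch of that idea (names and structure are illustrative assumptions, not CloudAI's actual code):

```python
# Tracks which installables have already been set up under a given install_path.
installed: set[str] = set()

def ensure_installed(installable: str, install_path: str) -> bool:
    """Install once per unique (install_path, installable) pair.

    Returns True if installation work was done, False if it was already cached.
    """
    key = f"{install_path}/{installable}"
    if key in installed:
        return False  # already present under install_path; shared across tests
    # (real code would pull the Docker image / clone the repo here)
    installed.add(key)
    return True

# Two tests referencing the same installable trigger only one install.
print(ensure_installed("docker://nvcr.io#nvidia/pytorch:25.06-py3", "/opt/cloudai/install"))  # True
print(ensure_installed("docker://nvcr.io#nvidia/pytorch:25.06-py3", "/opt/cloudai/install"))  # False
```

The install path and image reference above are placeholders; the point is only that the cache key makes an installable shared across all tests in a System TOML.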

doc/USER_GUIDE.md

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 # CloudAI User Guide
-This is a CloudAI user guide to help users use CloudAI, covering topics such as adding new tests and downloading datasets for running NeMo-launcher.
+This guide helps users get started with CloudAI, covering topics such as adding new tests and downloading datasets for running NeMo-launcher.

 ## Step 1: Create a Docker Image
 1. **Set Up the GitLab Repository:**

doc/index.md

Lines changed: 8 additions & 8 deletions

@@ -9,10 +9,10 @@ cd cloudai
 uv run cloudai --help
 ```

-**Note**: instructions for setting up access for `enroot` are available [installation guide](./workloads_requirements_installation.rst).
+**Note**: For instructions on setting up access for `enroot`, see the [installation guide](./workloads_requirements_installation.rst).

-### `pip`-based installation
-See required Python version in the `.python-version` file, please ensure you have it installed (see how a custom python version [can be installed](#install-custom-python-version)). Follow these steps:
+### `pip`-based Installation
+Check the required Python version in the `.python-version` file and make sure you have it installed (for installing a custom version, see [Install Custom Python Version](#install-custom-python-version)). Follow these steps:
 ```bash
 git clone git@github.com:NVIDIA/cloudai.git
 cd cloudai
@@ -22,7 +22,7 @@ pip install -e .
 ```

 (install-custom-python-version)=
-### Install custom python version
+### Install Custom Python Version
 If your system python version is not supported, you can install a custom version using the [uv](https://docs.astral.sh/uv/getting-started/installation/) tool:
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
@@ -34,11 +34,11 @@ source .venv/bin/activate


 ## Key Concepts
-CloudAI operates on four main schemas:
+CloudAI operates on three main schemas:

-- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables.
-- **Test Schema**: An instance of a test template with custom arguments and environment variables.
-- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario.
+- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables
+- **Test Schema**: An instance of a test template with custom arguments and environment variables
+- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario

 These schemas enable CloudAI to be flexible and compatible with different systems and configurations.

doc/reporting.md

Lines changed: 13 additions & 9 deletions

@@ -3,42 +3,46 @@ This document describes the reporting system in CloudAI.

 ## Overview
-CloudAI has two reporting levels: per-test (per each case in a test scenario) and per-scenario (per each test scenario). All reports are generated after the test scenario is completed as part of main CloudAI process. For Slurm this means that login node is used to generate reports.
+CloudAI has two reporting levels:
+- per-test (one report per case in a test scenario)
+- per-scenario (one report per test scenario)
+
+All reports are generated after the test scenario completes, as part of the main CloudAI process. For Slurm this means the login node is used to generate reports.

 Per-test reports are linked to a particular workload type (e.g. `NcclTest`). All per-test reports are implemented as part of the `per_test` scenario report and can be enabled/disabled via a single configuration option; see the [Enable, disable and configure reports](enable-disable-and-configure-reports) section.

-To list all available reports, one can use `cloudai list-reports` command. Use verbose output to also print report configurations.
+To list all available reports, users can run the `cloudai list-reports` command. Use verbose output to also print report configurations.

-## Notes and general flow
+## Notes and General Flow
 1. All reports should be registered via `Registry()` (`.add_report()` or `.add_scenario_report()`).
 1. Scenario reports are configurable via the system config (Slurm-only for now) and the scenario config.
 1. Configuration in a scenario config has the highest priority. Next, the system config is checked. Then it defaults to the report config from the registry.
 1. The report is then generated (or not) according to this final config.
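The precedence described in the flow (scenario config first, then system config, then the registry default) can be sketched as a first-non-None lookup. This is an illustrative sketch of the precedence rule only, not CloudAI's actual implementation, and the dict-based configs are placeholders for the real Pydantic models:

```python
from typing import Optional

def resolve_report_config(
    scenario_cfg: Optional[dict],
    system_cfg: Optional[dict],
    registry_default: dict,
) -> dict:
    """Return the effective report config: scenario > system > registry default."""
    for cfg in (scenario_cfg, system_cfg):
        if cfg is not None:
            return cfg
    return registry_default

# The scenario config wins when present; otherwise fall back in priority order.
print(resolve_report_config({"enable": False}, {"enable": True}, {"enable": True}))  # {'enable': False}
print(resolve_report_config(None, None, {"enable": True}))                           # {'enable': True}
```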
 (enable-disable-and-configure-reports)=
-## Enable, disable and configure reports
-**NOTE** Only scenario-level reports can be configured today.
+## Enable, Disable and Configure Reports
+**NOTE**: Only scenario-level reports can be configured.

-To enable or disable a report, one needs to do it via System configuration:
+To enable or disable a report, users do so via the system configuration:
 ```toml
 [reports]
 per_test = { enable = false }
 status = { enable = true }
 ```

-## Report registration
+## Report Registration
 Report registration is done via the `Registry` class:

 ```python
 Registry().add_scenario_report("per_test", PerTestReporter, ReportConfig(enable=True))
 ```

-## Report configuration implementation
-Each report can define its own configuration which is constructed and passed as an argument to `Registry.add_scenario_report` method. `reports` field is parsed during TOMLs reading and respective Pydantic model is created.
+## Report Configuration Implementation
+Each report can define its own configuration, which is constructed and passed as an argument to the `Registry.add_scenario_report` method. The `reports` field is parsed while TOMLs are read, and the respective Pydantic model is created.

 For example, we can define a custom report configuration:
