
Commit dc6b76b

Merge branch 'main' into m-bridge
2 parents ae12297 + fbf9891 commit dc6b76b


55 files changed: +2130 −109 lines

README.md

Lines changed: 7 additions & 7 deletions

@@ -11,22 +11,22 @@ cd cloudai
 uv run cloudai --help
 ```

-Please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html) for details on setting up workloads' requirements.
+For details on setting up workloads' requirements, please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html).

 For details and `pip`-based installation, please refer to the [documentation](https://nvidia.github.io/cloudai/#get-started).

 ## Key Concepts

-CloudAI operates on four main schemas:
+CloudAI operates on three main schemas:

-- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables.
-- **Test Schema**: An instance of a test template with custom arguments and environment variables.
-- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario.
+- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables
+- **Test Schema**: An instance of a test template with custom arguments and environment variables
+- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario

 These schemas enable CloudAI to be flexible and compatible with different systems and configurations.

-## Support matrix
+## Support Matrix
 |Test|Slurm|Kubernetes|RunAI|Standalone|
 |---|---|---|---|---|
 |AI Dynamo|||||
@@ -47,7 +47,7 @@ These schemas enable CloudAI to be flexible and compatible with different system
 |Triton Inference|||||
 |UCC|||||

-*deprecated means that a workload support exists, but we are not maintaining it actively anymore and newer configurations might not work.
+Note: Deprecated means that workload support exists, but we are no longer actively maintaining it, and newer configurations might not work.

 For more detailed information, please refer to the [official documentation](https://nvidia.github.io/cloudai/workloads/index.html).

conf/common/test/osu_test.toml

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name = "osu_test"
+test_template_name = "OSUBench"
+description = "OSU Benchmark example"
+
+[cmd_args]
+"docker_image_url" = "nvcr.io#nvidia/pytorch:25.06-py3"
+"benchmarks_dir" = "/opt/hpcx/ompi/tests/osu-micro-benchmarks"
+"benchmark" = "osu_allreduce"
+"iterations" = 10
+"message_size" = "1024"
Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "osu_test_scenario"
+job_status_check = true
+
+[[Tests]]
+id = "Tests.1"
+test_name = "osu_test"
+num_nodes = "2"
+time_limit = "00:20:00"
Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "aiconfigurator_disagg_demo"
+description = "Aiconfigurator disaggregated predictor demo"
+test_template_name = "Aiconfigurator"
+
+[cmd_args]
+model_name = "LLAMA3.1_70B"
+system = "h100_sxm"
+# backend and version use defaults
+isl = 3000
+osl = 150
+
+[cmd_args.disagg]
+p_tp = 4
+p_pp = 1
+p_dp = 1
+p_bs = 1
+p_workers = 1
+
+d_tp = 4
+d_pp = 1
+d_dp = 1
+d_bs = 256
+d_workers = 1
+
+prefill_correction_scale = 1.0
+decode_correction_scale = 1.0
Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "dse_aiconfigurator_disagg_demo_Llama3.1_70B"
+description = "Aiconfigurator disaggregated predictor DSE sweeps"
+test_template_name = "Aiconfigurator"
+agent_metrics = [
+    "ttft_ms",
+    "tpot_ms",
+    "tokens_per_s_per_gpu",
+    "tokens_per_s_per_user",
+]
+agent_reward_function = "ai_dynamo_log_scale"
+
+[cmd_args]
+model_name = "LLAMA3.1_70B"
+system = "h200_sxm"
+# backend and version use defaults
+isl = 4000
+osl = 500
+
+[cmd_args.disagg]
+p_tp = [1]
+p_pp = [1]
+p_dp = [1]
+p_bs = 1
+p_workers = [1, 2]
+
+d_tp = [1]
+d_pp = [1]
+d_dp = [1]
+d_bs = [8]
+d_workers = [1, 2]
+
+gemm_quant_mode = "fp8_block"
+moe_quant_mode = "fp8"
+kvcache_quant_mode = "fp8"
+fmha_quant_mode = "fp8"
+comm_quant_mode = "half"
+prefill_correction_scale = 1.0
+decode_correction_scale = 1.0
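In this DSE config, list-valued fields such as `p_workers = [1, 2]` and `d_workers = [1, 2]` describe candidate values to sweep over. Assuming the sweep space is the cartesian product of the listed candidates (an assumption about how such sweeps compose, not a statement about CloudAI's internals), its size is just the product of the list lengths:

```python
from itertools import product

# List-valued fields from the [cmd_args.disagg] table above; scalar fields are fixed.
sweep = {
    "p_tp": [1], "p_pp": [1], "p_dp": [1], "p_workers": [1, 2],
    "d_tp": [1], "d_pp": [1], "d_dp": [1], "d_bs": [8], "d_workers": [1, 2],
}

# Cartesian product over the candidate values of every swept field.
keys = list(sweep)
candidates = [dict(zip(keys, combo)) for combo in product(*sweep.values())]

print(len(candidates))  # 4: only p_workers and d_workers have two options each
```

Keeping most lists at a single value, as this demo does, keeps the space small (four candidates here) while still exercising the sweep machinery.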
Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
+# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# (standard Apache-2.0 license header omitted; identical to osu_test.toml above)
+
+name = "aiconfigurator-disagg-scenario"
+
+[[Tests]]
+id = "aiconfigurator.disagg.demo"
+time_limit = "00:05:00"
+test_name = "dse_aiconfigurator_disagg_demo_Llama3.1_70B"
+num_nodes = 1
+agent_metrics = [
+    "ttft_ms",
+    "tpot_ms",
+    "tokens_per_s_per_gpu",
+    "tokens_per_s_per_user",
+]

doc/DEV.md

Lines changed: 2 additions & 1 deletion

@@ -36,7 +36,8 @@ We use [import-linter](https://github.com/seddonym/import-linter) to ensure no c
 `Registry` object is a singleton that holds implementation mappings. Users can register their own implementations to the registry or replace the default implementations.

 ## Cache
-Some prerequisites can be installed: docker images, git repos with executable scripts, etc. All such "installables" are kept under System's `install_path`.
+Some prerequisites can be installed, for example Docker images and git repos with executable scripts.
+All such "installables" are kept under the System's `install_path`.

 Installables are shared among all tests. So if any number of tests use the same installable, it is installed only once for a particular System TOML.
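The install-once behavior described in the Cache section amounts to deduplicating installables by identity under `install_path`. A minimal sketch of that idea (names and structure are illustrative assumptions, not CloudAI's actual code):

```python
# Tracks which installables have already been set up under a given install_path.
installed: set[str] = set()

def ensure_installed(installable: str, install_path: str) -> bool:
    """Install once per unique (install_path, installable) pair.

    Returns True if installation work was done, False if it was already cached.
    """
    key = f"{install_path}/{installable}"
    if key in installed:
        return False  # already present under install_path; shared across tests
    # (real code would pull the Docker image / clone the repo here)
    installed.add(key)
    return True

# Two tests referencing the same installable trigger only one install.
print(ensure_installed("docker://nvcr.io#nvidia/pytorch:25.06-py3", "/opt/cloudai/install"))  # True
print(ensure_installed("docker://nvcr.io#nvidia/pytorch:25.06-py3", "/opt/cloudai/install"))  # False
```

The install path and image reference above are placeholders; the point is only that the cache key makes an installable shared across all tests in a System TOML.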

doc/USER_GUIDE.md

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 # CloudAI User Guide
-This is a CloudAI user guide to help users use CloudAI, covering topics such as adding new tests and downloading datasets for running NeMo-launcher.
+This guide helps users get started with CloudAI, covering topics such as adding new tests and downloading datasets for running NeMo-launcher.

 ## Step 1: Create a Docker Image
 1. **Set Up the GitLab Repository:**

doc/index.md

Lines changed: 8 additions & 8 deletions

@@ -9,10 +9,10 @@ cd cloudai
 uv run cloudai --help
 ```

-**Note**: instructions for setting up access for `enroot` are available [installation guide](./workloads_requirements_installation.rst).
+**Note**: For instructions on setting up access for `enroot`, see the [installation guide](./workloads_requirements_installation.rst).

-### `pip`-based installation
-See required Python version in the `.python-version` file, please ensure you have it installed (see how a custom python version [can be installed](#install-custom-python-version)). Follow these steps:
+### `pip`-based Installation
+Check the required Python version in the `.python-version` file and make sure you have it installed (for installing a custom version, see [Install Custom Python Version](#install-custom-python-version)). Follow these steps:
 ```bash
 git clone git@github.com:NVIDIA/cloudai.git
 cd cloudai
@@ -22,7 +22,7 @@ pip install -e .
 ```

 (install-custom-python-version)=
-### Install custom python version
+### Install Custom Python Version
 If your system python version is not supported, you can install a custom version using the [uv](https://docs.astral.sh/uv/getting-started/installation/) tool:
 ```bash
 curl -LsSf https://astral.sh/uv/install.sh | sh
@@ -34,11 +34,11 @@ source .venv/bin/activate


 ## Key Concepts
-CloudAI operates on four main schemas:
+CloudAI operates on three main schemas:

-- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables.
-- **Test Schema**: An instance of a test template with custom arguments and environment variables.
-- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario.
+- **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables
+- **Test Schema**: An instance of a test template with custom arguments and environment variables
+- **Test Scenario Schema**: A set of tests with dependencies and additional descriptions about the test scenario

 These schemas enable CloudAI to be flexible and compatible with different systems and configurations.

doc/reporting.md

Lines changed: 13 additions & 9 deletions

@@ -3,42 +3,46 @@ This document describes the reporting system in CloudAI.

 ## Overview
-CloudAI has two reporting levels: per-test (per each case in a test scenario) and per-scenario (per each test scenario). All reports are generated after the test scenario is completed as part of main CloudAI process. For Slurm this means that login node is used to generate reports.
+CloudAI has two reporting levels:
+- per-test (one report per case in a test scenario)
+- per-scenario (one report per test scenario)
+
+All reports are generated after the test scenario completes, as part of the main CloudAI process. For Slurm this means the login node is used to generate reports.

 Per-test reports are linked to a particular workload type (e.g. `NcclTest`). All per-test reports are implemented as part of the `per_test` scenario report and can be enabled/disabled via a single configuration option; see the [Enable, disable and configure reports](enable-disable-and-configure-reports) section.

-To list all available reports, one can use `cloudai list-reports` command. Use verbose output to also print report configurations.
+To list all available reports, users can run the `cloudai list-reports` command. Use verbose output to also print report configurations.

-## Notes and general flow
+## Notes and General Flow
 1. All reports should be registered via `Registry()` (`.add_report()` or `.add_scenario_report()`).
 1. Scenario reports are configurable via the system config (Slurm-only for now) and the scenario config.
 1. Configuration in a scenario config has the highest priority. Next, the system config is checked. Then it defaults to the report config from the registry.
 1. The report is then generated (or not) according to this final config.
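The precedence described in the flow (scenario config first, then system config, then the registry default) can be sketched as a first-non-None lookup. This is an illustrative sketch of the precedence rule only, not CloudAI's actual implementation, and the dict-based configs are placeholders for the real Pydantic models:

```python
from typing import Optional

def resolve_report_config(
    scenario_cfg: Optional[dict],
    system_cfg: Optional[dict],
    registry_default: dict,
) -> dict:
    """Return the effective report config: scenario > system > registry default."""
    for cfg in (scenario_cfg, system_cfg):
        if cfg is not None:
            return cfg
    return registry_default

# The scenario config wins when present; otherwise fall back in priority order.
print(resolve_report_config({"enable": False}, {"enable": True}, {"enable": True}))  # {'enable': False}
print(resolve_report_config(None, None, {"enable": True}))                           # {'enable': True}
```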
 (enable-disable-and-configure-reports)=
-## Enable, disable and configure reports
-**NOTE** Only scenario-level reports can be configured today.
+## Enable, Disable and Configure Reports
+**NOTE**: Only scenario-level reports can be configured.

-To enable or disable a report, one needs to do it via System configuration:
+To enable or disable a report, users do so via the system configuration:
 ```toml
 [reports]
 per_test = { enable = false }
 status = { enable = true }
 ```

-## Report registration
+## Report Registration
 Report registration is done via the `Registry` class:

 ```python
 Registry().add_scenario_report("per_test", PerTestReporter, ReportConfig(enable=True))
 ```

-## Report configuration implementation
-Each report can define its own configuration which is constructed and passed as an argument to `Registry.add_scenario_report` method. `reports` field is parsed during TOMLs reading and respective Pydantic model is created.
+## Report Configuration Implementation
+Each report can define its own configuration, which is constructed and passed as an argument to the `Registry.add_scenario_report` method. The `reports` field is parsed while TOMLs are read, and the respective Pydantic model is created.

 For example, we can define a custom report configuration:
