Commit ae17cf1

Add RunAI scheduler support and enable NCCL tests submission (#436)
1 parent ef5bb3a commit ae17cf1

32 files changed: +2400 -21 lines changed

README.md

Lines changed: 13 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -16,19 +16,19 @@ These schemas enable CloudAI to be flexible and compatible with different system
1616

1717

1818
## Support matrix
19-
|Test|Slurm|Kubernetes (experimental)|Standalone|
20-
|---|---|---|---|
21-
|ChakraReplay||||
22-
|GPT||||
23-
|Grok||||
24-
|NCCL||||
25-
|NeMo Launcher||||
26-
|NeMo Run||||
27-
|Nemotron||||
28-
|Sleep||||
29-
|UCC||||
30-
|SlurmContainer||||
31-
|MegatronRun (experimental)||||
19+
|Test|Slurm|Kubernetes|RunAI|Standalone|
20+
|---|---|---|---|---|
21+
|ChakraReplay|||||
22+
|GPT|||||
23+
|Grok|||||
24+
|NCCL|||||
25+
|NeMo Launcher|||||
26+
|NeMo Run|||||
27+
|Nemotron|||||
28+
|Sleep|||||
29+
|UCC|||||
30+
|SlurmContainer|||||
31+
|MegatronRun (experimental)|||||
3232

3333

3434
## Set Up Access to the Private NGC Registry

USER_GUIDE.md

Lines changed: 23 additions & 2 deletions
@@ -89,7 +89,7 @@ ntasks_per_node = 8
8989
[partitions.<YOUR PARTITION NAME>]
9090
name = "<YOUR PARTITION NAME>"
9191
```
92-
Replace `<YOUR PARTITION NAME>` with the name of the partition you want to use. You can find the partition name by running `sinfo` on the cluster.
92+
Replace `<YOUR PARTITION NAME>` with the name of the partition you want to use. You can find the partition name by running `sinfo` on the cluster.
9393

9494
#### Step 4: Install Test Requirements
9595
Once all configs are ready, install the test requirements. This is done once, so you can run multiple experiments without reinstalling them. This step requires the system config file from step 3.
@@ -237,12 +237,33 @@ cache_docker_images_locally = true
237237
- **output_path**: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
238238
- **default_partition**: Specifies the default partition where jobs are scheduled.
239239
- **partitions**: Describes the available partitions and nodes within those partitions.
240-
- **[optional] groups**: Within the same partition, users can define groups of nodes. The group concept can be used to allocate nodes from specific groups in a test scenario schema. For instance, this feature is useful for specifying topology awareness. Groups represent a logical partitioning of nodes, and users are responsible for ensuring that groups do not overlap.
240+
- **[optional] groups**: Within the same partition, users can define groups of nodes. The group concept can be used to allocate nodes from specific groups in a test scenario schema. For instance, this feature is useful for specifying topology awareness. Groups represent a logical partitioning of nodes, and users are responsible for ensuring that groups do not overlap.
241241
- **mpi**: Indicates the Process Management Interface (PMI) implementation to be used for inter-process communication.
242242
- **gpus_per_node** and **ntasks_per_node**: These are Slurm arguments passed to the `sbatch` script and `srun`.
243243
- **cache_docker_images_locally**: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to `true`, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test is run. This approach saves network bandwidth but requires more disk capacity. If set to `false`, CloudAI will allow Slurm to download the Docker images as needed when they are not cached locally by Slurm.
244244
- **global_env_vars**: Lists the environment variables that are applied to every test run.
245245

246+
## Describing a System for RunAI Scheduler
247+
When using RunAI as the scheduler, you need to specify additional fields in the system schema TOML file. Below is the list of required fields and how to set them:
248+
249+
```toml
250+
name = "runai-cluster"
251+
scheduler = "runai"
252+
253+
install_path = "./install"
254+
output_path = "./results"
255+
256+
base_url = "http://runai.example.com" # The URL of your RunAI system, typically the same as used for the web interface.
257+
user_email = "your_email" # The email address used to log into the RunAI system.
258+
app_id = "your_app_id" # Obtained by creating an application in the RunAI web interface.
259+
app_secret = "your_app_secret" # Obtained together with the app_id.
260+
project_id = "your_project_id" # Project ID assigned or created in the RunAI system (usually an integer).
261+
cluster_id = "your_cluster_id" # Cluster ID in UUID format (e.g., a69928cc-ccaa-48be-bda9-482440f4d855).
262+
```
263+
* After logging into the RunAI web interface, navigate to Access → Applications and create a new application to obtain `app_id` and `app_secret`.
264+
* Use your assigned project and cluster IDs. Contact your administrator if they are not available.
265+
* All other fields follow the same semantics as in the Slurm system schema (e.g., `install_path`, `output_path`).
266+
246267
## Describing a Test Scenario in the Test Scenario Schema
247268
A test scenario is a set of tests with specific dependencies between them. A test scenario is described in a TOML schema file. This is an example of a test scenario file:
248269
```toml
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
1+
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
2+
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
name = "example-runai-cluster"
18+
scheduler = "runai"
19+
20+
install_path = "./install_dir"
21+
output_path = "./results"
22+
monitor_interval = 1
23+
24+
base_url = "http://runai.example.com"
25+
user_email = "your_email"
26+
app_id = "your_app_id"
27+
app_secret = "your_app_secret"
28+
project_id = "your_project_id"
29+
cluster_id = "your_cluster_id"
30+
31+
[global_env_vars]
32+
NCCL_IB_GID_INDEX = "3"
33+
NCCL_IB_TIMEOUT = "20"
34+
NCCL_IB_QPS_PER_CONNECTION = "4"

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ dependencies = [
2525
"kubernetes==30.1.0",
2626
"pydantic==2.8.2",
2727
"jinja2==3.1.6",
28+
"websockets==15.0.1",
2829
]
2930
[project.scripts]
3031
cloudai = "cloudai.__main__:main"

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -4,4 +4,5 @@ tbparse==0.0.8
44
toml==0.10.2
55
kubernetes==30.1.0
66
pydantic==2.8.2
7-
jinja2==3.1.6
7+
jinja2==3.1.6
8+
websockets==15.0.1

src/cloudai/__init__.py

Lines changed: 18 additions & 2 deletions
@@ -48,15 +48,18 @@
4848
from ._core.test_template_strategy import TestTemplateStrategy
4949
from .installer.kubernetes_installer import KubernetesInstaller
5050
from .installer.lsf_installer import LSFInstaller
51+
from .installer.runai_installer import RunAIInstaller
5152
from .installer.slurm_installer import SlurmInstaller
5253
from .installer.standalone_installer import StandaloneInstaller
5354
from .parser import Parser
5455
from .runner.kubernetes.kubernetes_runner import KubernetesRunner
5556
from .runner.lsf.lsf_runner import LSFRunner
57+
from .runner.runai.runai_runner import RunAIRunner
5658
from .runner.slurm.slurm_runner import SlurmRunner
5759
from .runner.standalone.standalone_runner import StandaloneRunner
5860
from .systems.kubernetes.kubernetes_system import KubernetesSystem
5961
from .systems.lsf.lsf_system import LSFSystem
62+
from .systems.runai.runai_system import RunAISystem
6063
from .systems.slurm.slurm_system import SlurmSystem
6164
from .systems.standalone_system import StandaloneSystem
6265
from .workloads.chakra_replay import (
@@ -91,6 +94,7 @@
9194
NcclTestJobStatusRetrievalStrategy,
9295
NcclTestKubernetesJsonGenStrategy,
9396
NcclTestPerformanceReportGenerationStrategy,
97+
NcclTestRunAIJsonGenStrategy,
9498
NcclTestSlurmCommandGenStrategy,
9599
)
96100
from .workloads.nemo_launcher import (
@@ -126,6 +130,7 @@
126130
Registry().add_runner("kubernetes", KubernetesRunner)
127131
Registry().add_runner("standalone", StandaloneRunner)
128132
Registry().add_runner("lsf", LSFRunner)
133+
Registry().add_runner("runai", RunAIRunner)
129134

130135
Registry().add_strategy(
131136
CommandGenStrategy, [StandaloneSystem], [SleepTestDefinition], SleepStandaloneCommandGenStrategy
@@ -134,6 +139,7 @@
134139
Registry().add_strategy(CommandGenStrategy, [SlurmSystem], [SleepTestDefinition], SleepSlurmCommandGenStrategy)
135140
Registry().add_strategy(JsonGenStrategy, [KubernetesSystem], [SleepTestDefinition], SleepKubernetesJsonGenStrategy)
136141
Registry().add_strategy(JsonGenStrategy, [KubernetesSystem], [NCCLTestDefinition], NcclTestKubernetesJsonGenStrategy)
142+
Registry().add_strategy(JsonGenStrategy, [RunAISystem], [NCCLTestDefinition], NcclTestRunAIJsonGenStrategy)
137143
Registry().add_strategy(GradingStrategy, [SlurmSystem], [NCCLTestDefinition], NcclTestGradingStrategy)
138144

139145
Registry().add_strategy(
@@ -164,6 +170,7 @@
164170
[GPTTestDefinition, GrokTestDefinition, NemotronTestDefinition],
165171
JaxToolboxSlurmCommandGenStrategy,
166172
)
173+
167174
Registry().add_strategy(
168175
JobIdRetrievalStrategy,
169176
[SlurmSystem],
@@ -184,8 +191,8 @@
184191
Registry().add_strategy(
185192
JobIdRetrievalStrategy, [StandaloneSystem], [SleepTestDefinition], StandaloneJobIdRetrievalStrategy
186193
)
187-
188194
Registry().add_strategy(JobIdRetrievalStrategy, [LSFSystem], [SleepTestDefinition], LSFJobIdRetrievalStrategy)
195+
189196
Registry().add_strategy(
190197
JobStatusRetrievalStrategy,
191198
[KubernetesSystem],
@@ -221,10 +228,16 @@
221228
Registry().add_strategy(
222229
JobStatusRetrievalStrategy, [StandaloneSystem], [SleepTestDefinition], DefaultJobStatusRetrievalStrategy
223230
)
224-
225231
Registry().add_strategy(
226232
JobStatusRetrievalStrategy, [LSFSystem], [SleepTestDefinition], DefaultJobStatusRetrievalStrategy
227233
)
234+
Registry().add_strategy(
235+
JobStatusRetrievalStrategy,
236+
[RunAISystem],
237+
[NCCLTestDefinition],
238+
DefaultJobStatusRetrievalStrategy,
239+
)
240+
228241
Registry().add_strategy(CommandGenStrategy, [SlurmSystem], [UCCTestDefinition], UCCTestSlurmCommandGenStrategy)
229242

230243
Registry().add_strategy(GradingStrategy, [SlurmSystem], [ChakraReplayTestDefinition], ChakraReplayGradingStrategy)
@@ -239,11 +252,13 @@
239252
Registry().add_installer("standalone", StandaloneInstaller)
240253
Registry().add_installer("kubernetes", KubernetesInstaller)
241254
Registry().add_installer("lsf", LSFInstaller)
255+
Registry().add_installer("runai", RunAIInstaller)
242256

243257
Registry().add_system("slurm", SlurmSystem)
244258
Registry().add_system("standalone", StandaloneSystem)
245259
Registry().add_system("kubernetes", KubernetesSystem)
246260
Registry().add_system("lsf", LSFSystem)
261+
Registry().add_system("runai", RunAISystem)
247262

248263
Registry().add_test_definition("UCCTest", UCCTestDefinition)
249264
Registry().add_test_definition("NcclTest", NCCLTestDefinition)
@@ -298,6 +313,7 @@
298313
"PythonExecutable",
299314
"ReportGenerationStrategy",
300315
"Reporter",
316+
"RunAISystem",
301317
"Runner",
302318
"System",
303319
"SystemConfigParsingError",
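The `add_strategy` calls in this file associate a strategy class with a combination of strategy interface, system type, and test definition type. The sketch below illustrates that lookup idea only. `MiniRegistry` and its `get_strategy` method are hypothetical, the empty classes are stand-ins that reuse names from the diff, and the real `Registry` in CloudAI may store and resolve entries differently.

```python
from typing import Dict, List, Tuple


# Empty stand-ins that reuse names from the diff; they carry no real behavior.
class JsonGenStrategy: ...
class KubernetesSystem: ...
class RunAISystem: ...
class NCCLTestDefinition: ...
class NcclTestKubernetesJsonGenStrategy(JsonGenStrategy): ...
class NcclTestRunAIJsonGenStrategy(JsonGenStrategy): ...


class MiniRegistry:
    """Hypothetical registry keyed by (interface, system type, test definition type)."""

    def __init__(self) -> None:
        self.strategies: Dict[Tuple[type, type, type], type] = {}

    def add_strategy(
        self, interface: type, system_types: List[type], definition_types: List[type], strategy: type
    ) -> None:
        # Register the strategy for every system/definition combination.
        for system_type in system_types:
            for definition_type in definition_types:
                key = (interface, system_type, definition_type)
                if key in self.strategies:
                    raise ValueError(f"Duplicate strategy for {key}")
                self.strategies[key] = strategy

    def get_strategy(self, interface: type, system_type: type, definition_type: type) -> type:
        return self.strategies[(interface, system_type, definition_type)]


registry = MiniRegistry()
registry.add_strategy(JsonGenStrategy, [KubernetesSystem], [NCCLTestDefinition], NcclTestKubernetesJsonGenStrategy)
registry.add_strategy(JsonGenStrategy, [RunAISystem], [NCCLTestDefinition], NcclTestRunAIJsonGenStrategy)

chosen = registry.get_strategy(JsonGenStrategy, RunAISystem, NCCLTestDefinition)
print(chosen.__name__)
```

Keying on the system type is what lets the same NCCL test definition produce a Kubernetes job spec on one system and a RunAI job spec on another.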
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
1+
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
2+
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
import logging
18+
19+
from cloudai import BaseInstaller, Installable, InstallStatusResult
20+
from cloudai.systems.runai.runai_system import RunAISystem
21+
22+
23+
class RunAIInstaller(BaseInstaller):
24+
"""Installer for RunAI systems."""
25+
26+
def __init__(self, system: RunAISystem):
27+
"""Initialize the RunAIInstaller with a system object."""
28+
super().__init__(system)
29+
30+
def _check_prerequisites(self) -> InstallStatusResult:
31+
logging.info("Checking prerequisites for RunAI installation.")
32+
return InstallStatusResult(True)
33+
34+
def install_one(self, item: Installable) -> InstallStatusResult:
35+
logging.info(f"Installing {item} for RunAI.")
36+
return InstallStatusResult(True)
37+
38+
def uninstall_one(self, item: Installable) -> InstallStatusResult:
39+
logging.info(f"Uninstalling {item} for RunAI.")
40+
return InstallStatusResult(True)
41+
42+
def is_installed_one(self, item: Installable) -> InstallStatusResult:
43+
logging.info(f"Checking if {item} is installed for RunAI.")
44+
return InstallStatusResult(True)
45+
46+
def mark_as_installed_one(self, item: Installable) -> InstallStatusResult:
47+
logging.info(f"Marking {item} as installed for RunAI.")
48+
return InstallStatusResult(True)
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
2+
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
2+
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
from dataclasses import dataclass
18+
19+
from cloudai import BaseJob
20+
from cloudai.systems.runai.runai_training import ActualPhase
21+
22+
23+
@dataclass
24+
class RunAIJob(BaseJob):
25+
"""A job class for execution on a RunAI system."""
26+
27+
status: ActualPhase
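`RunAIJob` extends a base job dataclass with one extra field. The sketch below shows that pattern in isolation, under stated assumptions: `SketchBaseJob` assumes the base class carries `test_run` and `id` (inferred from how `RunAIJob` is constructed in the runner), `Phase` is a hypothetical stand-in for `ActualPhase` whose member names are invented here, and `is_finished` is an illustrative helper that does not appear in the diff.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Hypothetical stand-in for ActualPhase; these member names are assumptions."""

    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


@dataclass
class SketchBaseJob:
    """Stand-in for BaseJob, assumed to carry the test run and a job id."""

    test_run: object
    id: str


@dataclass
class SketchRunAIJob(SketchBaseJob):
    """Adds the scheduler-reported phase on top of the base fields."""

    status: Phase

    def is_finished(self) -> bool:
        # Treat completed and failed as terminal phases (an assumption).
        return self.status in (Phase.COMPLETED, Phase.FAILED)


job = SketchRunAIJob(test_run=None, id="wl-1", status=Phase.PENDING)
print(job.is_finished())
job.status = Phase.COMPLETED
print(job.is_finished())
```

Because none of the base fields has a default value, the subclass can add another required field without violating the dataclass rule that non-default fields must precede default ones.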
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
2+
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
import logging
18+
from typing import cast
19+
20+
from cloudai import BaseJob, BaseRunner, TestRun
21+
from cloudai.systems.runai.runai_system import RunAISystem
22+
23+
from .runai_job import RunAIJob
24+
25+
26+
class RunAIRunner(BaseRunner):
27+
"""Class to manage and execute workloads using the RunAI platform."""
28+
29+
def _submit_test(self, tr: TestRun) -> RunAIJob:
30+
logging.info(f"Running test: {tr.name}")
31+
tr.output_path = self.get_job_output_path(tr)
32+
job_spec = tr.test.test_template.gen_json(tr)
33+
logging.debug(f"Generated JSON for test {tr.name}: {job_spec}")
34+
35+
if self.mode == "run":
36+
runai_system = cast(RunAISystem, self.system)
37+
training = runai_system.create_training(job_spec)
38+
job = RunAIJob(test_run=tr, id=training.workload_id, status=training.actual_phase)
39+
logging.info(f"Submitted RunAI job: {job.id}")
40+
return job
41+
else:
42+
raise RuntimeError("Invalid mode for submitting a test.")
43+
44+
async def job_completion_callback(self, job: BaseJob) -> None:
45+
runai_system = cast(RunAISystem, self.system)
46+
job = cast(RunAIJob, job)
47+
workload_id = str(job.id)
48+
runai_system.get_workload_events(workload_id, job.test_run.output_path / "events.txt")
49+
await runai_system.store_logs(workload_id, job.test_run.output_path / "stdout.txt")
50+
51+
def kill_job(self, job: BaseJob) -> None:
52+
runai_system = cast(RunAISystem, self.system)
53+
job = cast(RunAIJob, job)
54+
runai_system.delete_training(str(job.id))
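The runner above submits a training, collects events and logs when the job completes, and deletes the training when the job is killed. A rough, self-contained sketch of that call sequence follows. `FakeRunAISystem` and `FakeTraining` are hypothetical in-memory stand-ins: the method names mirror the calls made by `RunAIRunner`, but their bodies, the `"wl-1"` identifier, and the `"Pending"` phase are invented, and nothing here contacts a RunAI cluster.

```python
import asyncio
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FakeTraining:
    workload_id: str
    actual_phase: str


class FakeRunAISystem:
    """In-memory stand-in; method names mirror the calls made by RunAIRunner."""

    def __init__(self) -> None:
        self.deleted: list[str] = []

    def create_training(self, job_spec: dict) -> FakeTraining:
        return FakeTraining(workload_id="wl-1", actual_phase="Pending")

    def get_workload_events(self, workload_id: str, path: Path) -> None:
        path.write_text(f"events for {workload_id}\n")

    async def store_logs(self, workload_id: str, path: Path) -> None:
        await asyncio.sleep(0)  # placeholder for streaming logs over the network
        path.write_text(f"logs for {workload_id}\n")

    def delete_training(self, workload_id: str) -> None:
        self.deleted.append(workload_id)


async def run_lifecycle(system: FakeRunAISystem, output_dir: Path) -> list[str]:
    # Submit, as _submit_test does in "run" mode.
    training = system.create_training({"name": "nccl-test"})
    # Collect artifacts, as job_completion_callback does.
    system.get_workload_events(training.workload_id, output_dir / "events.txt")
    await system.store_logs(training.workload_id, output_dir / "stdout.txt")
    # Clean up; this stands in for kill_job, which the runner exposes separately.
    system.delete_training(training.workload_id)
    return sorted(p.name for p in output_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    system = FakeRunAISystem()
    files = asyncio.run(run_lifecycle(system, Path(tmp)))
    print(files)
```

Only `store_logs` is awaited, matching the runner, where event retrieval is a synchronous call and log storage is a coroutine.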
