
Commit 8e108eb

Merge pull request #735 from NVIDIA/am/hf-model-installable
Add new installable type: HF model
2 parents 4e9c340 + 3227e21 commit 8e108eb

17 files changed: +419 −47 lines changed

README.md

Lines changed: 4 additions & 21 deletions

@@ -3,7 +3,6 @@
 CloudAI benchmark framework aims to develop an industry standard benchmark focused on grading Data Center (DC) scale AI systems in the Cloud. The primary motivation is to provide automated benchmarking on various systems.
 
 ## Get Started
-**Note**: instructions for setting up access for `enroot` are available [here](#set-up-access-to-the-private-ngc-registry).
 
 Using `uv` tool allows users to run CloudAI without manually managing required Python versions and dependencies.
 ```bash
@@ -12,9 +11,12 @@ cd cloudai
 uv run cloudai --help
 ```
 
+Please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html) for details on setting up workloads' requirements.
+
 For details and `pip`-based installation, please refer to the [documentation](https://nvidia.github.io/cloudai/#get-started).
 
 ## Key Concepts
+
 CloudAI operates on four main schemas:
 
 - **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables.
@@ -41,25 +43,6 @@ These schemas enable CloudAI to be flexible and compatible with different system
 
 For more detailed information, please refer to the [official documentation](https://nvidia.github.io/cloudai/workloads/index.html).
 
-## Details
-### Set Up Access to the Private NGC Registry
-First, ensure you have access to the Docker repository. Follow the following steps:
-
-1. **Sign In**: Go to [NVIDIA NGC](https://ngc.nvidia.com/signin) and sign in with your credentials.
-2. **Generate API Key**:
-   - On the top right corner, click on the dropdown menu next to your profile
-   - Select "Setup"
-   - In the "Setup" section, find "Keys/Secrets"
-   - Click "Generate API Key" and confirm when prompted. A new API key will be presented
-   - **Note**: Save this API key locally as you will not be able to view it again on NGC
-
-Next, set up your enroot credentials. Ensure you have the correct credentials under `~/.config/enroot/.credentials`:
-```
-machine nvcr.io login $oauthtoken password <api-key>
-```
-Replace `<api-key>` with your respective credentials. Keep `$oauthtoken` as is.
-
 ## CloudAI Modes Usage Examples
 
 CloudAI supports five modes:
@@ -71,7 +54,7 @@ CloudAI supports five modes:
 
 ### install
 
-To install test prerequisites, run CloudAI CLI in install mode.
+To install test prerequisites, run CloudAI CLI in install mode. For more details, please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html).
 
 Please make sure to use the correct system configuration file that corresponds to your current setup for installation and experiments.
 ```bash
conf/experimental/ai_dynamo/test/vllm.toml

Lines changed: 1 addition & 0 deletions

@@ -22,6 +22,7 @@ test_template_name = "AIDynamo"
 docker_image_url = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1.post1"
 
 [cmd_args.dynamo]
+model = "Qwen/Qwen3-0.6B"
 backend = "vllm"
 
 [cmd_args.dynamo.prefill_worker]
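Downstream, this `model` value is what becomes an HF-model prerequisite. As a rough, hypothetical sketch only (the actual AIDynamo workload definition is among the 17 changed files but is not expanded in this view), a workload could surface the field to the installer like so:

```python
# Hypothetical sketch: mapping a test's `model` field to an HFModel
# prerequisite. The real AIDynamo definition is not shown in this diff.
from cloudai.core import HFModel, Installable


def dynamo_installables(model: str) -> list[Installable]:
    # HFModel compares and hashes by model_name alone (see the
    # installables.py hunk below), so several tests referencing the same
    # model collapse into a single download.
    return [HFModel(model_name=model)]
```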

doc/index.md

Lines changed: 7 additions & 20 deletions

@@ -3,14 +3,14 @@
 CloudAI benchmark framework aims to develop an industry standard benchmark focused on grading Data Center (DC) scale AI systems in the Cloud. The primary motivation is to provide automated benchmarking on various systems.
 
 ## Get Started
-**Note**: instructions for setting up access for `enroot` are available [here](set-up-access-to-the-private-ngc-registry).
-
 ```bash
 git clone git@github.com:NVIDIA/cloudai.git
 cd cloudai
 uv run cloudai --help
 ```
 
+**Note**: instructions for setting up access for `enroot` are available in the [installation guide](./workloads_requirements_installation.rst).
+
 ### `pip`-based installation
 See required Python version in the `.python-version` file, please ensure you have it installed (see how a custom python version [can be installed](#install-custom-python-version)). Follow these steps:
 ```bash
@@ -58,24 +58,7 @@ These schemas enable CloudAI to be flexible and compatible with different system
 |SlurmContainer|||||
 |MegatronRun (experimental)|||||
 
-## Details
-(set-up-access-to-the-private-ngc-registry)=
-### Set Up Access to the Private NGC Registry
-First, ensure you have access to the Docker repository. Follow the following steps:
 
-1. **Sign In**: Go to [NVIDIA NGC](https://ngc.nvidia.com/signin) and sign in with your credentials.
-2. **Generate API Key**:
-   - On the top right corner, click on the dropdown menu next to your profile
-   - Select "Setup"
-   - In the "Setup" section, find "Keys/Secrets"
-   - Click "Generate API Key" and confirm when prompted. A new API key will be presented
-   - **Note**: Save this API key locally as you will not be able to view it again on NGC
-
-Next, set up your enroot credentials. Ensure you have the correct credentials under `~/.config/enroot/.credentials`:
-```
-machine nvcr.io login $oauthtoken password <api-key>
-```
-Replace `<api-key>` with your respective credentials. Keep `$oauthtoken` as is.
 
 ## CloudAI Modes Usage Examples
 
@@ -89,7 +72,7 @@ CloudAI supports five modes:
 (install)=
 ### install
 
-To install test prerequisites, run CloudAI CLI in install mode.
+To install test prerequisites, run CloudAI CLI in install mode. For more details, please refer to the [installation guide](./workloads_requirements_installation.rst).
 
 Please make sure to use the correct system configuration file that corresponds to your current setup for installation and experiments.
 ```bash
@@ -98,6 +81,7 @@ cloudai install\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 (dry-run)=
 ### dry-run
 To simulate running experiments without execution, use the dry-run mode:
@@ -107,6 +91,7 @@ cloudai dry-run\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 (run)=
 ### run
 To run experiments, execute CloudAI CLI in run mode:
@@ -116,6 +101,7 @@ cloudai run\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 (generate-report)=
 ### generate-report
 To generate reports, execute CloudAI CLI in generate-report mode:
@@ -161,4 +147,5 @@ workloads/index
 DEV
 reporting
 USER_GUIDE
+workloads_requirements_installation
 ```
doc/workloads_requirements_installation.rst

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+Workloads requirements installation
+===================================
+
+CloudAI workloads can define multiple "installables" as prerequisites: a container image, a git repository, an HF model, etc.
+
+
+Set Up Access to the Private NGC Registry
+-----------------------------------------
+
+First, ensure you have access to the Docker repository. Follow these steps:
+
+1. **Sign In**: Go to `NGC signin`_ and sign in with your credentials.
+2. **Generate API Key**:
+   - On the top right corner, click on the dropdown menu next to your profile
+   - Select "Setup"
+   - In the "Setup" section, find "Keys/Secrets"
+   - Click "Generate API Key" and confirm when prompted. A new API key will be presented
+   - **Note**: Save this API key locally as you will not be able to view it again on NGC
+
+.. _NGC signin: https://ngc.nvidia.com/signin
+
+Next, set up your enroot credentials. Ensure you have the correct credentials under ``~/.config/enroot/.credentials``:
+
+.. code-block:: text
+
+   machine nvcr.io login $oauthtoken password <api-key>
+
+Replace ``<api-key>`` with your respective credentials. Keep ``$oauthtoken`` as is.
+
+
+🤗 Hugging Face models
+----------------------
+
+Some workloads require Hugging Face models. CloudAI will download the models from Hugging Face and cache them in the location specified by the System's ``hf_home_path`` field. By default, it is set to ``<INSTALL_DIR>/huggingface``, but any other location can be specified. When Slurm is used, this location will be mounted into the container.
+
+
+Authentication with Hugging Face
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As of now, CloudAI doesn't handle authentication with Hugging Face, so it is up to the user to enable it in the shell where CloudAI is run. One might need to run the following command:
+
+.. code-block:: bash
+
+   uv run hf auth login
+
+Once done, all Hugging Face models will be downloaded using the existing authentication.
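For non-interactive environments (CI runners, batch jobs), an alternative to the interactive `hf auth login` is to supply a token before invoking CloudAI. This is standard `huggingface_hub` behavior rather than anything this PR adds; a sketch, assuming a token is exported in an `HF_TOKEN` environment variable:

```python
# Sketch: programmatic Hugging Face auth for non-interactive shells.
# HF_TOKEN is a placeholder environment variable you must set yourself;
# huggingface_hub also picks it up automatically without an explicit login().
import os

from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])
```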

pyproject.toml

Lines changed: 2 additions & 0 deletions

@@ -28,6 +28,7 @@ dependencies = [
     "websockets~=15.0.1",
     "rich~=14.2",
     "click~=8.3",
+    "huggingface-hub~=1.1.7",
 ]
 requires-python = ">=3.10"
 scripts = { cloudai = "cloudai.cli:main" }
@@ -124,6 +125,7 @@ root_package = "cloudai"
 name = "Util modules are leaf dependencies"
 type = "forbidden"
 forbidden_modules = ["cloudai.systems", "cloudai.workloads", "cloudai.cli"]
+allow_indirect_imports = true
 source_modules = ["cloudai.util"]
 
 [tool.vulture]

src/cloudai/_core/installables.py

Lines changed: 24 additions & 0 deletions

@@ -165,3 +165,27 @@ def __eq__(self, other: object) -> bool:
 
     def __hash__(self) -> int:
         return hash(self.src)
+
+
+@dataclass
+class HFModel(Installable):
+    """HuggingFace Model object."""
+
+    model_name: str
+    _installed_path: Path | None = field(default=None, repr=False)
+
+    @property
+    def installed_path(self) -> Path:
+        if self._installed_path:
+            return self._installed_path
+        return Path("hub") / self.model_name
+
+    @installed_path.setter
+    def installed_path(self, value: Path) -> None:
+        self._installed_path = value
+
+    def __eq__(self, other: object) -> bool:
+        return isinstance(other, HFModel) and other.model_name == self.model_name
+
+    def __hash__(self) -> int:
+        return hash(self.model_name)
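A quick illustration of the semantics defined by this hunk; the model name and paths below are examples, not defaults:

```python
# Illustrative behavior of HFModel as defined in the hunk above.
from pathlib import Path

from cloudai.core import HFModel

m = HFModel(model_name="Qwen/Qwen3-0.6B")

# Until an installer sets it, installed_path falls back to a hub-relative path:
assert m.installed_path == Path("hub") / "Qwen/Qwen3-0.6B"

# The installer pins the real location once the model is available:
m.installed_path = Path("/install/huggingface")
assert m.installed_path == Path("/install/huggingface")

# Equality and hashing key on model_name only, so duplicates dedupe in sets:
assert len({m, HFModel(model_name="Qwen/Qwen3-0.6B")}) == 1
```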

src/cloudai/cli/handlers.py

Lines changed: 7 additions & 3 deletions

@@ -47,6 +47,10 @@
 from cloudai.util import prepare_output_dir
 
 
+def _log_installation_dirs(prefix: str, system: System) -> None:
+    logging.info(f"{prefix} '{system.install_path.absolute()}'. HF cache is {system.hf_home_path.absolute()}.")
+
+
 def handle_install_and_uninstall(args: argparse.Namespace) -> int:
     """
     Manage the installation or uninstallation process for CloudAI.
@@ -69,12 +73,12 @@ def handle_install_and_uninstall(args: argparse.Namespace) -> int:
     if args.mode == "install":
         all_installed = installer.is_installed(installables)
         if all_installed:
-            logging.info(f"CloudAI is already installed into '{system.install_path}'.")
+            _log_installation_dirs("CloudAI is already installed into", system)
         else:
             logging.info("Not all components are ready")
             result = installer.install(installables)
             if result.success:
-                logging.info(f"CloudAI is successfully installed into '{system.install_path.absolute()}'.")
+                _log_installation_dirs("CloudAI is successfully installed into", system)
             else:
                 logging.error(result.message)
                 rc = 1
@@ -284,7 +288,7 @@ def handle_dry_run_and_run(args: argparse.Namespace) -> int:
 
     result = installer.install(installables)
     if result.success:
-        logging.info(f"CloudAI is successfully installed into '{system.install_path.absolute()}'.")
+        _log_installation_dirs("CloudAI is successfully installed into", system)
     else:
         logging.error("Failed to install workloads components.")
         logging.error(result.message)

src/cloudai/core.py

Lines changed: 2 additions & 1 deletion

@@ -32,7 +32,7 @@
 )
 from ._core.grader import Grader
 from ._core.grading_strategy import GradingStrategy
-from ._core.installables import DockerImage, File, GitRepo, Installable, PythonExecutable
+from ._core.installables import DockerImage, File, GitRepo, HFModel, Installable, PythonExecutable
 from ._core.job_status_result import JobStatusResult
 from ._core.json_gen_strategy import JsonGenStrategy
 from ._core.registry import Registry
@@ -65,6 +65,7 @@
     "Grader",
     "GradingStrategy",
     "GridSearchAgent",
+    "HFModel",
     "InstallStatusResult",
     "Installable",
     "JobIdRetrievalError",

src/cloudai/systems/slurm/slurm_installer.py

Lines changed: 12 additions & 0 deletions

@@ -25,10 +25,12 @@
     DockerImage,
     File,
     GitRepo,
+    HFModel,
     Installable,
     InstallStatusResult,
     PythonExecutable,
 )
+from cloudai.util.hf_model_manager import HFModelManager
 
 from .docker_image_cache_manager import DockerImageCacheManager, DockerImageCacheResult
 from .slurm_system import SlurmSystem
@@ -50,6 +52,7 @@ def __init__(self, system: SlurmSystem):
         super().__init__(system)
         self.system = system
         self.docker_image_cache_manager = DockerImageCacheManager(system)
+        self.hf_model_downloader = HFModelManager(system.hf_home_path)
 
     def _check_prerequisites(self) -> InstallStatusResult:
         base_prerequisites_result = super()._check_prerequisites()
@@ -98,6 +101,8 @@ def install_one(self, item: Installable) -> InstallStatusResult:
             item.installed_path = self.system.install_path / item.src.name
             shutil.copyfile(item.src, item.installed_path, follow_symlinks=False)
             return InstallStatusResult(True)
+        elif isinstance(item, HFModel):
+            return self.hf_model_downloader.download_model(item)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
@@ -117,6 +122,8 @@ def uninstall_one(self, item: Installable) -> InstallStatusResult:
                 return InstallStatusResult(True)
             logging.debug(f"File {item.installed_path} does not exist.")
             return InstallStatusResult(True)
+        elif isinstance(item, HFModel):
+            return self.hf_model_downloader.remove_model(item)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
@@ -141,6 +148,8 @@ def is_installed_one(self, item: Installable) -> InstallStatusResult:
                 item.installed_path = self.system.install_path / item.src.name
                 return InstallStatusResult(True)
             return InstallStatusResult(False, f"File {item.installed_path} does not exist")
+        elif isinstance(item, HFModel):
+            return self.hf_model_downloader.is_model_downloaded(item)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
@@ -159,6 +168,9 @@ def mark_as_installed_one(self, item: Installable) -> InstallStatusResult:
         elif isinstance(item, File):
             item.installed_path = self.system.install_path / item.src.name
             return InstallStatusResult(True)
+        elif isinstance(item, HFModel):
+            item.installed_path = self.system.hf_home_path  # fake path is OK here as the whole HF home will be mounted
+            return InstallStatusResult(True)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
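`HFModelManager` itself (in `cloudai/util/hf_model_manager.py`) is part of this commit but not expanded in this view. Below is a minimal sketch of the interface the installer relies on, assuming it wraps `huggingface_hub.snapshot_download` and the standard `models--<org>--<name>` hub cache layout; both are assumptions inferred from the call sites, not confirmed by the diff:

```python
# Hypothetical sketch of the HFModelManager interface used above; the real
# implementation in src/cloudai/util/hf_model_manager.py is not shown here.
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download

from cloudai.core import HFModel, InstallStatusResult


class HFModelManager:
    def __init__(self, hf_home_path: Path) -> None:
        # hf_home_path mirrors HF_HOME; the hub cache sits under 'hub/',
        # matching HFModel.installed_path's default of Path('hub') / model_name.
        self.cache_dir = hf_home_path / "hub"

    def _model_dir(self, item: HFModel) -> Path:
        # Standard HF hub cache layout: models--<org>--<name> (assumption).
        return self.cache_dir / ("models--" + item.model_name.replace("/", "--"))

    def download_model(self, item: HFModel) -> InstallStatusResult:
        try:
            snapshot_download(repo_id=item.model_name, cache_dir=str(self.cache_dir))
        except Exception as err:  # auth, network, or missing-repo failures
            return InstallStatusResult(False, f"Failed to download {item.model_name}: {err}")
        return InstallStatusResult(True)

    def is_model_downloaded(self, item: HFModel) -> InstallStatusResult:
        if self._model_dir(item).exists():
            return InstallStatusResult(True)
        return InstallStatusResult(False, f"Model {item.model_name} is not in {self.cache_dir}")

    def remove_model(self, item: HFModel) -> InstallStatusResult:
        shutil.rmtree(self._model_dir(item), ignore_errors=True)
        return InstallStatusResult(True)
```

If the real manager tracks downloads differently, only these internals change; the installer call sites above stay the same.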
