
Commit 8e108eb

Merge pull request #735 from NVIDIA/am/hf-model-installable
Add new installable type: HF model
2 parents 4e9c340 + 3227e21 commit 8e108eb

17 files changed: +419 −47 lines changed

README.md

Lines changed: 4 additions & 21 deletions

@@ -3,7 +3,6 @@
 CloudAI benchmark framework aims to develop an industry standard benchmark focused on grading Data Center (DC) scale AI systems in the Cloud. The primary motivation is to provide automated benchmarking on various systems.
 
 ## Get Started
-**Note**: instructions for setting up access for `enroot` are available [here](#set-up-access-to-the-private-ngc-registry).
 
 Using `uv` tool allows users to run CloudAI without manually managing required Python versions and dependencies.
 ```bash
@@ -12,9 +11,12 @@ cd cloudai
 uv run cloudai --help
 ```
 
+Please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html) for details on setting up workloads' requirements.
+
 For details and `pip`-based installation, please refer to the [documentation](https://nvidia.github.io/cloudai/#get-started).
 
 ## Key Concepts
+
 CloudAI operates on four main schemas:
 
 - **System Schema**: Describes the system, including the scheduler type, node list, and global environment variables.
@@ -41,25 +43,6 @@ These schemas enable CloudAI to be flexible and compatible with different system
 
 For more detailed information, please refer to the [official documentation](https://nvidia.github.io/cloudai/workloads/index.html).
 
-## Details
-### Set Up Access to the Private NGC Registry
-First, ensure you have access to the Docker repository. Follow the following steps:
-
-1. **Sign In**: Go to [NVIDIA NGC](https://ngc.nvidia.com/signin) and sign in with your credentials.
-2. **Generate API Key**:
-   - On the top right corner, click on the dropdown menu next to your profile
-   - Select "Setup"
-   - In the "Setup" section, find "Keys/Secrets"
-   - Click "Generate API Key" and confirm when prompted. A new API key will be presented
-   - **Note**: Save this API key locally as you will not be able to view it again on NGC
-
-Next, set up your enroot credentials. Ensure you have the correct credentials under `~/.config/enroot/.credentials`:
-```
-machine nvcr.io login $oauthtoken password <api-key>
-```
-Replace `<api-key>` with your respective credentials. Keep `$oauthtoken` as is.
-
 ## CloudAI Modes Usage Examples
 
 CloudAI supports five modes:
@@ -71,7 +54,7 @@ CloudAI supports five modes:
 
 ### install
 
-To install test prerequisites, run CloudAI CLI in install mode.
+To install test prerequisites, run CloudAI CLI in install mode. For more details, please refer to the [installation guide](https://nvidia.github.io/cloudai/workloads_requirements_installation.html).
 
 Please make sure to use the correct system configuration file that corresponds to your current setup for installation and experiments.
 ```bash
conf/experimental/ai_dynamo/test/vllm.toml

Lines changed: 1 addition & 0 deletions

@@ -22,6 +22,7 @@ test_template_name = "AIDynamo"
 docker_image_url = "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1.post1"
 
 [cmd_args.dynamo]
+model = "Qwen/Qwen3-0.6B"
 backend = "vllm"
 
 [cmd_args.dynamo.prefill_worker]
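Downstream, this `model` value is what becomes an HF-model prerequisite. As a rough, hypothetical sketch only (the actual AIDynamo workload definition is among the 17 changed files but is not expanded in this view), a workload could surface the field to the installer like so:

```python
# Hypothetical sketch: mapping a test's `model` field to an HFModel
# prerequisite. The real AIDynamo definition is not shown in this diff.
from cloudai.core import HFModel, Installable


def dynamo_installables(model: str) -> list[Installable]:
    # HFModel compares and hashes by model_name alone (see the
    # installables.py hunk below), so several tests referencing the same
    # model collapse into a single download.
    return [HFModel(model_name=model)]
```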

doc/index.md

Lines changed: 7 additions & 20 deletions

@@ -3,14 +3,14 @@
 CloudAI benchmark framework aims to develop an industry standard benchmark focused on grading Data Center (DC) scale AI systems in the Cloud. The primary motivation is to provide automated benchmarking on various systems.
 
 ## Get Started
-**Note**: instructions for setting up access for `enroot` are available [here](set-up-access-to-the-private-ngc-registry).
-
 ```bash
 git clone git@github.com:NVIDIA/cloudai.git
 cd cloudai
 uv run cloudai --help
 ```
 
+**Note**: instructions for setting up access for `enroot` are available in the [installation guide](./workloads_requirements_installation.rst).
+
 ### `pip`-based installation
 See required Python version in the `.python-version` file, please ensure you have it installed (see how a custom python version [can be installed](#install-custom-python-version)). Follow these steps:
 ```bash
@@ -58,24 +58,7 @@ These schemas enable CloudAI to be flexible and compatible with different system
 |SlurmContainer|||||
 |MegatronRun (experimental)|||||
 
-## Details
-(set-up-access-to-the-private-ngc-registry)=
-### Set Up Access to the Private NGC Registry
-First, ensure you have access to the Docker repository. Follow the following steps:
 
-1. **Sign In**: Go to [NVIDIA NGC](https://ngc.nvidia.com/signin) and sign in with your credentials.
-2. **Generate API Key**:
-   - On the top right corner, click on the dropdown menu next to your profile
-   - Select "Setup"
-   - In the "Setup" section, find "Keys/Secrets"
-   - Click "Generate API Key" and confirm when prompted. A new API key will be presented
-   - **Note**: Save this API key locally as you will not be able to view it again on NGC
-
-Next, set up your enroot credentials. Ensure you have the correct credentials under `~/.config/enroot/.credentials`:
-```
-machine nvcr.io login $oauthtoken password <api-key>
-```
-Replace `<api-key>` with your respective credentials. Keep `$oauthtoken` as is.
 
 ## CloudAI Modes Usage Examples
 
@@ -89,7 +72,7 @@ CloudAI supports five modes:
 (install)=
 ### install
 
-To install test prerequisites, run CloudAI CLI in install mode.
+To install test prerequisites, run CloudAI CLI in install mode. For more details, please refer to the [installation guide](./workloads_requirements_installation.rst).
 
 Please make sure to use the correct system configuration file that corresponds to your current setup for installation and experiments.
 ```bash
@@ -98,6 +81,7 @@ cloudai install\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 (dry-run)=
 ### dry-run
 To simulate running experiments without execution, use the dry-run mode:
@@ -107,6 +91,7 @@ cloudai dry-run\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 (run)=
 ### run
 To run experiments, execute CloudAI CLI in run mode:
@@ -116,6 +101,7 @@ cloudai run\
 --tests-dir conf/common/test\
 --test-scenario conf/common/test_scenario/sleep.toml
 ```
+
 (generate-report)=
 ### generate-report
 To generate reports, execute CloudAI CLI in generate-report mode:
@@ -161,4 +147,5 @@ workloads/index
 DEV
 reporting
 USER_GUIDE
+workloads_requirements_installation
 ```
doc/workloads_requirements_installation.rst

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+Workloads requirements installation
+===================================
+
+CloudAI workloads can define multiple "installables" as prerequisites: a container image, a git repository, an HF model, etc.
+
+
+Set Up Access to the Private NGC Registry
+-----------------------------------------
+
+First, ensure you have access to the Docker repository. Follow these steps:
+
+1. **Sign In**: Go to `NGC signin`_ and sign in with your credentials.
+2. **Generate API Key**:
+   - On the top right corner, click on the dropdown menu next to your profile
+   - Select "Setup"
+   - In the "Setup" section, find "Keys/Secrets"
+   - Click "Generate API Key" and confirm when prompted. A new API key will be presented
+   - **Note**: Save this API key locally as you will not be able to view it again on NGC
+
+.. _NGC signin: https://ngc.nvidia.com/signin
+
+Next, set up your enroot credentials. Ensure you have the correct credentials under ``~/.config/enroot/.credentials``:
+
+.. code-block:: text
+
+   machine nvcr.io login $oauthtoken password <api-key>
+
+Replace ``<api-key>`` with your respective credentials. Keep ``$oauthtoken`` as is.
+
+
+🤗 Hugging Face models
+----------------------
+
+Some workloads require Hugging Face models. CloudAI will download the models from Hugging Face and cache them in the location specified by the System's ``hf_home_path`` field. By default, it is set to ``<INSTALL_DIR>/huggingface``, but any other location can be specified. When Slurm is used, this location will be mounted into the container.
+
+
+Authentication with Hugging Face
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As of now, CloudAI doesn't handle authentication with Hugging Face, so it is up to the user to enable it in the shell where CloudAI is run. One might need to run the following command:
+
+.. code-block:: bash
+
+   uv run hf auth login
+
+Once done, all Hugging Face models will be downloaded using the existing authentication.
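For non-interactive environments (CI runners, batch jobs), an alternative to the interactive `hf auth login` is to supply a token before invoking CloudAI. This is standard `huggingface_hub` behavior rather than anything this PR adds; a sketch, assuming a token is exported in an `HF_TOKEN` environment variable:

```python
# Sketch: programmatic Hugging Face auth for non-interactive shells.
# HF_TOKEN is a placeholder environment variable you must set yourself;
# huggingface_hub also picks it up automatically without an explicit login().
import os

from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])
```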

pyproject.toml

Lines changed: 2 additions & 0 deletions

@@ -28,6 +28,7 @@ dependencies = [
     "websockets~=15.0.1",
     "rich~=14.2",
     "click~=8.3",
+    "huggingface-hub~=1.1.7",
 ]
 requires-python = ">=3.10"
 scripts = { cloudai = "cloudai.cli:main" }
@@ -124,6 +125,7 @@ root_package = "cloudai"
 name = "Util modules are leaf dependencies"
 type = "forbidden"
 forbidden_modules = ["cloudai.systems", "cloudai.workloads", "cloudai.cli"]
+allow_indirect_imports = true
 source_modules = ["cloudai.util"]
 
 [tool.vulture]

src/cloudai/_core/installables.py

Lines changed: 24 additions & 0 deletions

@@ -165,3 +165,27 @@ def __eq__(self, other: object) -> bool:
 
     def __hash__(self) -> int:
         return hash(self.src)
+
+
+@dataclass
+class HFModel(Installable):
+    """HuggingFace Model object."""
+
+    model_name: str
+    _installed_path: Path | None = field(default=None, repr=False)
+
+    @property
+    def installed_path(self) -> Path:
+        if self._installed_path:
+            return self._installed_path
+        return Path("hub") / self.model_name
+
+    @installed_path.setter
+    def installed_path(self, value: Path) -> None:
+        self._installed_path = value
+
+    def __eq__(self, other: object) -> bool:
+        return isinstance(other, HFModel) and other.model_name == self.model_name
+
+    def __hash__(self) -> int:
+        return hash(self.model_name)
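A quick illustration of the semantics defined by this hunk; the model name and paths below are examples, not defaults:

```python
# Illustrative behavior of HFModel as defined in the hunk above.
from pathlib import Path

from cloudai.core import HFModel

m = HFModel(model_name="Qwen/Qwen3-0.6B")

# Until an installer sets it, installed_path falls back to a hub-relative path:
assert m.installed_path == Path("hub") / "Qwen/Qwen3-0.6B"

# The installer pins the real location once the model is available:
m.installed_path = Path("/install/huggingface")
assert m.installed_path == Path("/install/huggingface")

# Equality and hashing key on model_name only, so duplicates dedupe in sets:
assert len({m, HFModel(model_name="Qwen/Qwen3-0.6B")}) == 1
```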

src/cloudai/cli/handlers.py

Lines changed: 7 additions & 3 deletions

@@ -47,6 +47,10 @@
 from cloudai.util import prepare_output_dir
 
 
+def _log_installation_dirs(prefix: str, system: System) -> None:
+    logging.info(f"{prefix} '{system.install_path.absolute()}'. HF cache is {system.hf_home_path.absolute()}.")
+
+
 def handle_install_and_uninstall(args: argparse.Namespace) -> int:
     """
     Manage the installation or uninstallation process for CloudAI.
@@ -69,12 +73,12 @@ def handle_install_and_uninstall(args: argparse.Namespace) -> int:
     if args.mode == "install":
         all_installed = installer.is_installed(installables)
         if all_installed:
-            logging.info(f"CloudAI is already installed into '{system.install_path}'.")
+            _log_installation_dirs("CloudAI is already installed into", system)
         else:
             logging.info("Not all components are ready")
             result = installer.install(installables)
             if result.success:
-                logging.info(f"CloudAI is successfully installed into '{system.install_path.absolute()}'.")
+                _log_installation_dirs("CloudAI is successfully installed into", system)
             else:
                 logging.error(result.message)
                 rc = 1
@@ -284,7 +288,7 @@ def handle_dry_run_and_run(args: argparse.Namespace) -> int:
 
     result = installer.install(installables)
     if result.success:
-        logging.info(f"CloudAI is successfully installed into '{system.install_path.absolute()}'.")
+        _log_installation_dirs("CloudAI is successfully installed into", system)
     else:
         logging.error("Failed to install workloads components.")
         logging.error(result.message)

src/cloudai/core.py

Lines changed: 2 additions & 1 deletion

@@ -32,7 +32,7 @@
 )
 from ._core.grader import Grader
 from ._core.grading_strategy import GradingStrategy
-from ._core.installables import DockerImage, File, GitRepo, Installable, PythonExecutable
+from ._core.installables import DockerImage, File, GitRepo, HFModel, Installable, PythonExecutable
 from ._core.job_status_result import JobStatusResult
 from ._core.json_gen_strategy import JsonGenStrategy
 from ._core.registry import Registry
@@ -65,6 +65,7 @@
     "Grader",
     "GradingStrategy",
     "GridSearchAgent",
+    "HFModel",
     "InstallStatusResult",
     "Installable",
     "JobIdRetrievalError",

src/cloudai/systems/slurm/slurm_installer.py

Lines changed: 12 additions & 0 deletions

@@ -25,10 +25,12 @@
     DockerImage,
     File,
     GitRepo,
+    HFModel,
     Installable,
     InstallStatusResult,
     PythonExecutable,
 )
+from cloudai.util.hf_model_manager import HFModelManager
 
 from .docker_image_cache_manager import DockerImageCacheManager, DockerImageCacheResult
 from .slurm_system import SlurmSystem
@@ -50,6 +52,7 @@ def __init__(self, system: SlurmSystem):
         super().__init__(system)
         self.system = system
         self.docker_image_cache_manager = DockerImageCacheManager(system)
+        self.hf_model_downloader = HFModelManager(system.hf_home_path)
 
     def _check_prerequisites(self) -> InstallStatusResult:
         base_prerequisites_result = super()._check_prerequisites()
@@ -98,6 +101,8 @@ def install_one(self, item: Installable) -> InstallStatusResult:
             item.installed_path = self.system.install_path / item.src.name
             shutil.copyfile(item.src, item.installed_path, follow_symlinks=False)
             return InstallStatusResult(True)
+        elif isinstance(item, HFModel):
+            return self.hf_model_downloader.download_model(item)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
@@ -117,6 +122,8 @@ def uninstall_one(self, item: Installable) -> InstallStatusResult:
                 return InstallStatusResult(True)
             logging.debug(f"File {item.installed_path} does not exist.")
             return InstallStatusResult(True)
+        elif isinstance(item, HFModel):
+            return self.hf_model_downloader.remove_model(item)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
@@ -141,6 +148,8 @@ def is_installed_one(self, item: Installable) -> InstallStatusResult:
                 item.installed_path = self.system.install_path / item.src.name
                 return InstallStatusResult(True)
             return InstallStatusResult(False, f"File {item.installed_path} does not exist")
+        elif isinstance(item, HFModel):
+            return self.hf_model_downloader.is_model_downloaded(item)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
@@ -159,6 +168,9 @@ def mark_as_installed_one(self, item: Installable) -> InstallStatusResult:
         elif isinstance(item, File):
             item.installed_path = self.system.install_path / item.src.name
             return InstallStatusResult(True)
+        elif isinstance(item, HFModel):
+            item.installed_path = self.system.hf_home_path  # fake path is OK here as the whole HF home will be mounted
+            return InstallStatusResult(True)
 
         return InstallStatusResult(False, f"Unsupported item type: {type(item)}")
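`HFModelManager` itself (in `cloudai/util/hf_model_manager.py`) is part of this commit but not expanded in this view. Below is a minimal sketch of the interface the installer relies on, assuming it wraps `huggingface_hub.snapshot_download` and the standard `models--<org>--<name>` hub cache layout; both are assumptions inferred from the call sites, not confirmed by the diff:

```python
# Hypothetical sketch of the HFModelManager interface used above; the real
# implementation in src/cloudai/util/hf_model_manager.py is not shown here.
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download

from cloudai.core import HFModel, InstallStatusResult


class HFModelManager:
    def __init__(self, hf_home_path: Path) -> None:
        # hf_home_path mirrors HF_HOME; the hub cache sits under 'hub/',
        # matching HFModel.installed_path's default of Path('hub') / model_name.
        self.cache_dir = hf_home_path / "hub"

    def _model_dir(self, item: HFModel) -> Path:
        # Standard HF hub cache layout: models--<org>--<name> (assumption).
        return self.cache_dir / ("models--" + item.model_name.replace("/", "--"))

    def download_model(self, item: HFModel) -> InstallStatusResult:
        try:
            snapshot_download(repo_id=item.model_name, cache_dir=str(self.cache_dir))
        except Exception as err:  # auth, network, or missing-repo failures
            return InstallStatusResult(False, f"Failed to download {item.model_name}: {err}")
        return InstallStatusResult(True)

    def is_model_downloaded(self, item: HFModel) -> InstallStatusResult:
        if self._model_dir(item).exists():
            return InstallStatusResult(True)
        return InstallStatusResult(False, f"Model {item.model_name} is not in {self.cache_dir}")

    def remove_model(self, item: HFModel) -> InstallStatusResult:
        shutil.rmtree(self._model_dir(item), ignore_errors=True)
        return InstallStatusResult(True)
```

If the real manager tracks downloads differently, only these internals change; the installer call sites above stay the same.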
