Commit ab57540

287 correct deploy the nvidia gpu operator on cce article (#296)
1 parent d3865c7 commit ab57540

File tree

5 files changed: +119 -17 lines changed

docs/blueprints/by-use-case/ai/deploy-the-nvidia-gpu-operator-on-cce.md

Lines changed: 119 additions & 17 deletions
@@ -4,6 +4,9 @@ title: Deploy the NVIDIA GPU Operator on CCE
tags: [nvidia,nvidia-operator,gpu, ai]
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Deploy the NVIDIA GPU Operator on CCE

The [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) is a critical tool for effectively managing GPU resources in Kubernetes clusters. It serves as an abstraction layer over Kubernetes APIs, automating tasks such as dynamic provisioning of GPUs on demand, driver updates, resource allocation and optimization for GPU-intensive workloads, and integration with monitoring tools for comprehensive insights into GPU usage and health, thereby simplifying the deployment and management of GPU-accelerated applications. This guide outlines how to deploy the NVIDIA GPU Operator on a CCE cluster. The process involves preparing GPU nodes, installing the necessary components, configuring the cluster for GPU support, deploying an application that leverages GPUs, and verifying functionality.

@@ -44,7 +47,15 @@ Wait for some minutes until the nodes get provisioned and check if they have suc
New GPU nodes should contain a label with `accelerator` as key and `nvidia*` as value (e.g. **accelerator=nvidia-t4**).
:::
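
As a quick check, you can list the nodes together with the `accelerator` label from the command line (a minimal sketch; the exact label value depends on the GPU flavor you provisioned):

```bash
# Show the value of the "accelerator" label as an extra column for every node
kubectl get nodes -L accelerator

# Or list only the nodes that carry the label
kubectl get nodes -l accelerator -o wide
```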

## Installing the Driver with NVIDIA GPU Plugin

:::important Different Driver Installation Methods - Read Carefully

If your GPU nodes use **Ubuntu** or another **major Linux distribution**, you can bypass installing the **CCE AI Suite** plugin and install the NVIDIA driver directly on the nodes through the **NVIDIA GPU Operator** (skip to [Deploying via Helm](#deploying-via-helm)) and then follow the instructions in the tab **Driver managed by GPU Operator**.

This method is recommended if none of your GPU nodes use specialized distributions like **HCE** or **openEuler**, as it allows the operator to manage the entire driver lifecycle for a more streamlined setup.

:::

### Installation

@@ -56,18 +67,48 @@ From sidebar select *Add-ons* and install the **CCE AI Suite (NVIDIA GPU)**.

### Plugin Configuration

When configuring the CCE AI Suite, you must provide a download link for the NVIDIA driver.

:::caution
The selected driver must be compatible with both the GPU nodes and the NVIDIA GPU Operator; otherwise, the cluster will not be able to allocate GPU resources. It is crucial to **check for the most compatible driver version on the [NVIDIA GPU Operator Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html)** page. You can find and download drivers from the **[NVIDIA Driver Downloads](https://www.nvidia.com/download/index.aspx)** page.
:::

Follow these steps to find and provide the correct driver download link:

1. **Find a Compatible Driver Version**:
   - Navigate to the [NVIDIA GPU Operator Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html) page.
   - Scroll down to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#gpu-operator-component-matrix). This table lists the specific component versions, including the **recommended NVIDIA driver versions**, that are tested and supported. For example, for NVIDIA GPU Operator **v25.3.1** the recommended driver version is **570.158.01**.

   ![image](/img/docs/blueprints/by-use-case/ai/nvidia-operator/driver-version.png)

2. **Get the Driver Download Link**:
   - Go to the official **[NVIDIA Driver Downloads](https://www.nvidia.com/download/index.aspx)** page.
   - Manually search for the driver by entering your GPU's specifications, such as Product Type (e.g., Tesla), Product Series, and Operating System (Linux), based on the node flavor you are using, and click **Find** to search for drivers.

   ![img](/img/docs/blueprints/by-use-case/ai/nvidia-operator/driver-finding.png)

   - On the results page, look for the **driver version** you identified in the previous step. Once you find the correct driver, click **View** to open its download page. Then right-click the **Download** button and copy the link address. This is the direct download link you will provide to the plugin.

   ![img](/img/docs/blueprints/by-use-case/ai/nvidia-operator/driver-download.png)

3. **Configure the Plugin**: Paste the driver download link you obtained in the previous step into the **Path to custom driver** field of the plugin and click **Install**.

![image](/img/docs/blueprints/by-use-case/ai/nvidia-operator/configure-plugin.png)
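
Before pasting the link into the plugin, you can optionally sanity-check that it points at a downloadable driver package (a minimal sketch; the URL below only illustrates NVIDIA's usual Tesla driver URL pattern and must be replaced with the link you actually copied):

```bash
# Replace this with the download link you copied from the NVIDIA driver page
DRIVER_URL="https://us.download.nvidia.com/tesla/570.158.01/NVIDIA-Linux-x86_64-570.158.01.run"

# A HEAD request should return HTTP 200 and a Content-Length for the .run file
curl -sIL "$DRIVER_URL" | head -n 20
```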

:::info
For more information about the **CCE AI Suite (NVIDIA GPU)** plugin, see [CCE AI Suite (NVIDIA GPU)](https://docs.otc.t-systems.com/cloud-container-engine/umn/add-ons/cloud_native_heterogeneous_computing_add-ons/cce_ai_suite_nvidia_gpu.html).
:::
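
After the add-on finishes installing, you can sanity-check it from the command line (a minimal sketch; the exact names of the add-on workloads depend on the CCE AI Suite version, so treat them as assumptions):

```bash
# The add-on components (driver installer, device plugin) run in the kube-system namespace
kubectl get pods -n kube-system -o wide | grep -i nvidia
```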

## NVIDIA GPU Operator

### Deploying via Helm

Create a `values.yaml` file to include the required Helm Chart configuration values based on your setup:

- If you installed the NVIDIA driver using the **CCE AI Suite** (typically for HCE or openEuler nodes), use the configuration under **Driver managed by CCE AI Suite**. This setup informs the GPU Operator that the driver and toolkit are already present on the node.

- If you are using **Ubuntu** or another major Linux distribution and want the GPU Operator to manage the driver installation, use the configuration under **Driver managed by GPU Operator**. This is the recommended approach for a streamlined setup on non-specialized operating systems.

<Tabs>
<TabItem value="plugin" label="Driver managed by CCE AI Suite" default>

```yaml title="values.yaml"
hostPaths:
  driverInstallDir: "/usr/local/nvidia/"

driver:
  enabled: false

toolkit:
  enabled: false
```

:::important
- `hostPaths.driverInstallDir`: The driver installation directory when the driver is managed by the CCE AI Suite differs from the default. **Do not change this value!**
- `driver.enabled`: Driver installation is disabled because the driver is already installed via the CCE AI Suite.
- `toolkit.enabled`: The container toolkit installation is disabled because it is already installed via the CCE AI Suite.
:::

</TabItem>
<TabItem value="gpu-operator" label="Driver managed by GPU Operator">

```yaml title="values.yaml"
driver:
  enabled: true

toolkit:
  enabled: true
```

:::important
- `driver.enabled: true`: Allows the operator to download and install the appropriate NVIDIA driver on the nodes.
- `toolkit.enabled: true`: Allows the operator to install the NVIDIA Container Toolkit, which is required for GPU-aware containers.
:::

</TabItem>
</Tabs>

Now deploy the operator via Helm:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  -f values.yaml \
  --version=v25.3.1
```
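
Once the chart is installed, it can take a few minutes for all operator components to become ready. You can follow the rollout from the command line (a minimal sketch; individual pod names vary between operator versions):

```bash
# Watch the operator, device plugin, container toolkit and validator pods come up
kubectl get pods -n gpu-operator -w

# When everything is running, the GPU nodes should advertise the nvidia.com/gpu resource
kubectl describe nodes | grep -i "nvidia.com/gpu"
```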

### Multi-Instance GPU (MIG) - Optional

[Multi-Instance GPU (MIG)](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/) allows a single physical GPU to be partitioned into multiple smaller, fully isolated GPU instances. Each instance has its own dedicated resources, including memory, cache, and compute cores, making it ideal for running multiple workloads in parallel without interference.

#### Verify MIG Support

:::important
Before configuring MIG, you must first ensure that the chosen GPU hardware supports this feature. MIG is available on GPUs from the **NVIDIA Ampere architecture and newer**.
:::

To verify that your specific GPU model is compatible, consult the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus), which contains an up-to-date list of all supported GPUs.
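
If you already have access to a GPU node, you can also check MIG support directly through the driver (a minimal sketch; it assumes the NVIDIA driver is already installed on the node):

```bash
# "Enabled" or "Disabled" means the GPU supports MIG; "[N/A]" means it does not
nvidia-smi --query-gpu=name,mig.mode.current --format=csv
```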

#### Configure and Deploy with MIG

Set the `mig.strategy` value in your Helm `values.yaml` file. There are two strategies available:

- **single**: This strategy partitions the GPU into homogeneous slices. All GPU instances will be of the same size.
- **mixed**: This strategy allows for a mix of different-sized GPU instances on the same physical GPU, providing more flexibility for varied workloads.

Add the `mig` configuration to your existing `values.yaml`:

```yaml title="values.yaml"
# ... other fields ...
mig:
  strategy: "single" # or "mixed"
```

After applying the changes, upgrade the GPU Operator with the MIG-enabled configuration:

```bash
helm upgrade --install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  -f values.yaml \
  --version=v25.3.1
```

:::info
For more information about configuring **MIG**, refer to [GPU Operator with MIG](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html).
:::
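
After the upgrade, you can confirm that MIG is being applied by looking at the GPU resources the nodes expose (a minimal sketch; with the `single` strategy the devices remain `nvidia.com/gpu`, while the `mixed` strategy typically exposes profile-specific resources such as `nvidia.com/mig-1g.5gb`):

```bash
# List every NVIDIA-provided extended resource advertised by the nodes
kubectl describe nodes | grep -i "nvidia.com/"
```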

## Deploying an application with GPU Support

1. **Create a Pod Manifest**: For example, deploying a CUDA job, as sketched below.
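
The following is a minimal, illustrative manifest for such a CUDA test pod (assumptions: the pod name and the `nvidia/cuda:12.4.1-base-ubuntu22.04` image tag are placeholders, and the cluster already exposes the `nvidia.com/gpu` resource; the article's own manifest may differ):

```bash
# Create a throwaway pod that requests one GPU and prints the devices it can see
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Inspect the output once the pod has completed
kubectl logs cuda-smoke-test
```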
@@ -211,7 +313,7 @@ docker logs Container ID

### Validating Pod Resource Requests

Make sure the nodes that have GPUs are properly decorated with the following, which instructs Kubernetes to schedule the pods only on nodes that have available GPUs.

```yaml
