title: Deploy the NVIDIA GPU Operator on CCE
tags: [nvidia, nvidia-operator, gpu, ai]
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Deploy the NVIDIA GPU Operator on CCE

The [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) is a critical tool for effectively managing GPU resources in Kubernetes clusters. It serves as an abstraction layer over Kubernetes APIs, automating dynamic GPU provisioning, driver updates, and resource allocation and optimization for GPU-intensive workloads, and it integrates with monitoring tools for comprehensive insights into GPU usage and health, thereby simplifying the deployment and management of GPU-accelerated applications. This guide outlines how to deploy the NVIDIA GPU Operator on a CCE cluster. The process involves preparing GPU nodes, installing the necessary components, configuring the cluster for GPU support, deploying an application that leverages GPUs, and verifying functionality.

Wait for some minutes until the nodes get provisioned and check if they have successfully joined the cluster.

:::note
New GPU nodes should contain a label with `accelerator` as key and `nvidia*` as value (e.g. **accelerator=nvidia-t4**).
:::
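
One quick way to confirm the label is present (a sketch, assuming `kubectl` is already configured for the cluster):

```bash
# List the nodes together with the value of their "accelerator" label
kubectl get nodes -L accelerator
```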

## Installing the Driver with NVIDIA GPU Plugin

:::important Different Driver Installation Methods - Read Carefully

If your GPU nodes use **Ubuntu** or another **major Linux distribution**, you can bypass installing the **CCE AI Suite** plugin and install the NVIDIA driver directly on the nodes through the **NVIDIA GPU Operator**: skip ahead to [Deploying via Helm](#deploying-via-helm) and follow the instructions in the **Driver managed by GPU Operator** tab.

This method is recommended if none of your GPU nodes are using specialized distributions like **HCE** or **openEuler**, as it allows the operator to manage the entire driver lifecycle for a more streamlined setup.

:::

### Installation

From the sidebar select *Add-ons* and install the **CCE AI Suite (NVIDIA GPU)**.

### Plugin Configuration

When configuring the CCE AI Suite, you must provide a download link for the NVIDIA driver.

:::caution
The selected driver must be compatible with both the GPU nodes and the NVIDIA GPU Operator; otherwise, the cluster will not be able to allocate GPU resources. It is crucial to **check for the most compatible driver version on the [NVIDIA GPU Operator Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html)** page. You can find and download drivers from the **[NVIDIA Driver](https://www.nvidia.com/download/index.aspx)** page.
:::

Follow these steps to find and provide the correct driver download link:

1. **Find a Compatible Driver Version**:
   - Navigate to the [NVIDIA GPU Operator Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html) page.
   - Scroll down to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html#gpu-operator-component-matrix). This table lists the specific component versions, including the **recommended NVIDIA driver versions**, that are tested and supported. For example, for NVIDIA GPU Operator **v25.3.1** the recommended driver version is **570.158.01**.
   - Go to the official **[NVIDIA Driver](https://www.nvidia.com/download/index.aspx)** page.
   - Manually search for the driver by entering your GPU's specifications, such as Product Type (e.g. Tesla), Product Series, and Operating System (Linux), based on the node flavor you are using, and click **Find** to search for drivers.
   - On the next page, search for the **driver version** you identified in the previous step. Once you find the correct driver, click **View** to open its download page, then right-click the **Download** button and copy the link address. This is the direct download link you will provide to the plugin (see the example format after this list).
1. **Configure the Plugin**: Paste the driver download link you obtained in the previous step into the **Path to custom driver** field of the plugin and click **Install**.
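
As a purely illustrative example (the real URL depends on the driver version and your own search results), a direct data-center driver link usually points at an `NVIDIA-Linux-x86_64-<version>.run` file, and you can sanity-check it before pasting it into the plugin:

```bash
# Hypothetical link for driver 570.158.01 -- always copy the URL from your own search results
DRIVER_URL="https://us.download.nvidia.com/tesla/570.158.01/NVIDIA-Linux-x86_64-570.158.01.run"

# Optional: verify that the link is reachable before configuring the plugin
curl -fsIL "$DRIVER_URL" | head -n 1
```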

:::info
For more information about the **CCE AI Suite (NVIDIA GPU)** plugin, see [CCE AI Suite (NVIDIA GPU)](https://docs.otc.t-systems.com/cloud-container-engine/umn/add-ons/cloud_native_heterogeneous_computing_add-ons/cce_ai_suite_nvidia_gpu.html).
:::

## NVIDIA GPU Operator

### Deploying via Helm

Create a `values.yaml` file to include the required Helm Chart configuration values based on your setup:

- If you installed the NVIDIA driver using the **CCE AI Suite** (typically for HCE or openEuler nodes), use the configuration under **Driver managed by CCE AI Suite**. This setup informs the GPU Operator that the driver and toolkit are already present on the node.

- If you are using **Ubuntu** or another major Linux distribution and want the GPU Operator to manage the driver installation, use the configuration under **Driver managed by GPU Operator**. This is the recommended approach for a streamlined setup on non-specialized operating systems.

<Tabs>
<TabItem value="plugin" label="Driver managed by CCE AI Suite" default>

```yaml title="values.yaml"
hostPaths:
  driverInstallDir: "/usr/local/nvidia/"

driver:
  enabled: false

toolkit:
  enabled: false
```

:::important
- `hostPaths.driverInstallDir`: The driver installation directory when the driver is managed by the CCE AI Suite differs from the default. **Do not change this value!**
- `driver.enabled`: Driver installation is disabled because it's already installed via CCE AI Suite.
- `toolkit.enabled`: The container toolkit installation is disabled because it's already installed via CCE AI Suite.
:::

</TabItem>
<TabItem value="gpu-operator" label="Driver managed by GPU Operator">

```yaml title="values.yaml"
driver:
  enabled: true

toolkit:
  enabled: true
```

:::important
- `driver.enabled: true`: Allows the operator to download and install the appropriate NVIDIA driver on the nodes.
- `toolkit.enabled: true`: Allows the operator to install the NVIDIA container toolkit, which is required for GPU-aware containers.
:::

</TabItem>
</Tabs>
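
With `values.yaml` prepared, deploy the operator from the NVIDIA Helm repository. The following is a sketch: the repository URL is the public NVIDIA NGC Helm repository, and the chart version matches the one used in the MIG upgrade command later in this guide.

```bash
# Add the NVIDIA Helm repository (skip if it is already configured)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace with the values.yaml from the tab you chose
helm upgrade --install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  -f values.yaml \
  --version=v25.3.1
```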

### Multi-Instance GPU (MIG) - Optional

[Multi-Instance GPU (MIG)](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/) allows a single physical GPU to be partitioned into multiple smaller, fully isolated GPU instances. Each instance has its own dedicated resources, including memory, cache, and compute cores, making it ideal for running multiple workloads in parallel without interference.

#### Verify MIG Support

:::important
Before configuring MIG, you must first ensure that the chosen GPU hardware supports this feature. MIG is available on GPUs from the **NVIDIA Ampere architecture and newer**.
:::

To verify that your specific GPU model is compatible, consult the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus), which contains an up-to-date list of all supported GPUs.
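
If you already have shell access to a GPU node with the driver installed, a quick local check is also possible (a sketch; the MIG query fields require a MIG-capable driver):

```bash
# Prints the GPU model and its current MIG mode; "[N/A]" indicates the GPU does not support MIG
nvidia-smi --query-gpu=name,mig.mode.current --format=csv
```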

#### Configure and Deploy with MIG

Set the `mig.strategy` value in your Helm `values.yaml` file. There are two strategies available:

- **single**: This strategy partitions the GPU into homogeneous slices. All GPU instances will be of the same size.
- **mixed**: This strategy allows for a mix of different-sized GPU instances on the same physical GPU, providing more flexibility for varied workloads.

Add the `mig` configuration to your existing `values.yaml`:

```yaml title="values.yaml"
# ... other fields ...
mig:
  strategy: "single" # or "mixed"
```

After applying the changes, upgrade the GPU Operator with the MIG-enabled configuration:

```bash
helm upgrade --install gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  -f values.yaml \
  --version=v25.3.1
```
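
You can then watch the operator roll out the change (a sketch, assuming `kubectl` access; on MIG-capable nodes an `nvidia-mig-manager` pod is expected among the operator components):

```bash
# All operator pods, including the MIG manager on MIG-capable nodes, should reach Running or Completed
kubectl get pods -n gpu-operator
```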

:::info
For more information about configuring **MIG**, refer to [GPU Operator with MIG](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html).
:::
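
Once MIG is active, the slices are advertised as extended resources on the node. As a rough check (the node name is a placeholder; with the `single` strategy the slices still appear as `nvidia.com/gpu`, while `mixed` exposes profile-specific names such as `nvidia.com/mig-1g.5gb`):

```bash
# Inspect the GPU-related capacity and allocatable entries of a MIG-enabled node
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/"
```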

## Deploying an application with GPU Support

1. **Create a Pod Manifest**: For example, deploying a CUDA job.
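
   A minimal sketch of such a manifest, requesting a single GPU through the `nvidia.com/gpu` resource (the sample image tag is illustrative; any CUDA test workload works):

   ```yaml
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: cuda-vectoradd
   spec:
     template:
       spec:
         restartPolicy: OnFailure
         containers:
           - name: cuda-vectoradd
             # Illustrative CUDA sample image; substitute any image containing a CUDA workload
             image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
             resources:
               limits:
                 nvidia.com/gpu: 1 # one full GPU, or one MIG slice when the single strategy is active
   ```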
### Validating Pod Resource Requests

Make sure the nodes that have GPUs are properly decorated with the GPU resources and labels, which instructs Kubernetes to schedule GPU pods only on nodes that can actually provide them.
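
A rough way to confirm this from the command line (a sketch; the node name is a placeholder):

```bash
# Both the accelerator label and the nvidia.com/gpu capacity/allocatable entries should be present
kubectl get node <gpu-node-name> --show-labels | grep accelerator
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
```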