Commit 5850791

feat: nvidia operator first darft (#217)
1 parent fa7594f commit 5850791

4 files changed

+252
-0
lines changed

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
1+
---
2+
id: deploy-the-nvidia-gpu-operator-on-cce
3+
title: Deploy the NVIDIA GPU Operator on CCE
4+
tags: [nvidia, nvidia-operator, gpu, ai]
5+
---
6+
7+
# Deploy the NVIDIA GPU Operator on CCE
8+
9+
The [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) automates the management of the NVIDIA software components that Kubernetes needs to run GPU workloads, such as drivers, the container toolkit, the device plugin, and monitoring. This simplifies deploying and operating GPU-accelerated applications and gives you insight into GPU usage and health. This guide outlines how to deploy the NVIDIA GPU Operator on a CCE cluster: preparing GPU nodes, installing the necessary components, configuring the cluster for GPU support, deploying an application that uses GPUs, and verifying that it works.
10+
11+
## Prerequisites
12+
13+
This blueprint requires:
14+
15+
- Access to the CCE cluster with **kubectl**.
16+
- Helm installed on your system.
17+
18+
## Preparing & Configuring a GPU Node Pool
19+
20+
Go to the *Open Telekom Cloud console* and choose the cluster you want to add the GPU node pool to. In the left sidebar, select *Nodes* and click *Create Node Pool*.
21+
22+
### Node Pool Configuration
23+
24+
Use the following values to configure the newly created GPU Node Pool:
25+
26+
- **Name**: Assign a meaningful name to your GPU node pool, such as `gpu-workers`.
27+
- **Flavor Selection**: Choose a flavor that includes GPU resources, such as `pi2.2xlarge` or another available GPU-accelerated flavor.
28+
- **Annotations**: If required by your cluster's configuration, add any necessary annotations.
29+
- **Taints or Tolerations**: Set taints on the node pool to control pod scheduling. For GPU nodes, you might set a taint like `nvidia.com/gpu=true:NoExecute`; pods that require GPUs must then carry the matching toleration.
30+
31+
![image](/img/docs/blueprints/by-use-case/ai/nvidia-operator/create-node-pool.png)
32+
33+
After creating the node pool, scale it to the desired size.
34+
35+
### Verification
36+
37+
Wait a few minutes for the nodes to be provisioned, then check that they have joined the cluster with the following command:
38+
39+
```bash
kubectl get nodes --show-labels | grep "nvidia"
```
42+
43+
:::info
44+
New GPU nodes should contain a label with `accelerator` as key and `nvidia*` as value (e.g. **accelerator=nvidia-t4**).
45+
:::
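Instead of grepping through every label, you can select on the `accelerator` label key directly. The following is a minimal sketch that assumes the label key described above; the function name `list_gpu_nodes` is hypothetical and only serves to wrap the command.

```bash
# List only the nodes that carry the "accelerator" label key and print
# the label value in an extra column. The function name is illustrative.
list_gpu_nodes() {
  kubectl get nodes -l accelerator -L accelerator
}
```

Calling `list_gpu_nodes` prints one row per GPU node. An empty result suggests that the nodes have not joined yet or are not labeled as expected.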
46+
47+
## Installing the NVIDIA GPU Plugin
48+
49+
### Installation
50+
51+
From the sidebar, select *Add-ons* and install the **CCE AI Suite (NVIDIA GPU)** add-on.
52+
53+
<center>
54+
![image](/img/docs/blueprints/by-use-case/ai/nvidia-operator/install-plugin.png)
55+
</center>
56+
57+
### Plugin Configuration
58+
59+
For more information about the available configuration options, see [CCE AI Suite (NVIDIA GPU)](https://docs.otc.t-systems.com/cloud-container-engine/umn/add-ons/cloud_native_heterogeneous_computing_add-ons/cce_ai_suite_nvidia_gpu.html).
60+
![image](/img/docs/blueprints/by-use-case/ai/nvidia-operator/configure-plugin.png)
61+
62+
:::caution
63+
The selected driver must be compatible with the GPU nodes and supported by the NVIDIA GPU Operator; otherwise, the cluster will not be able to allocate GPU resources. Check the supported drivers at [Platform Support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html).
64+
:::
65+
66+
## Deploying the NVIDIA GPU Operator via Helm
67+
68+
Create a `values.yaml` file to include the required Helm Chart configuration values:
69+
70+
```yaml title="values.yaml"
hostPaths:
  driverInstallDir: "/usr/local/nvidia/"

driver:
  enabled: false

toolkit:
  enabled: false
```
80+
81+
:::important
82+
83+
- `hostPaths.driverInstallDir`: The driver installation directory on CCE differs from the chart default. *Do not change* this value!
- `driver.enabled`: Driver installation is disabled because the driver is already installed by the CCE AI Suite add-on.
- `toolkit.enabled`: Container toolkit installation is disabled because the toolkit is already installed by the CCE AI Suite add-on.
86+
87+
:::
88+
89+
Now deploy the operator with Helm:
90+
91+
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  -f values.yaml \
  --version=v24.9.2
```
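After the installation returns, it can help to confirm that the operator has finished reconciling. The sketch below assumes that the chart created a `ClusterPolicy` resource (`clusterpolicies.nvidia.com`) whose `.status.state` reports `ready` once all components are deployed. The function names `operator_state` and `operator_ready` are hypothetical; verify the field name against your operator version.

```bash
# Print the state reported by the operator's ClusterPolicy resource(s).
# Assumption: the state is exposed at .status.state.
operator_state() {
  kubectl get clusterpolicies.nvidia.com \
    -o jsonpath='{.items[*].status.state}'
}

# Succeed (exit code 0) only when the reported state is exactly "ready".
operator_ready() {
  [ "$(operator_state)" = "ready" ]
}
```

For example, `operator_ready && echo "GPU Operator is ready"` prints the message only when the reported state is `ready`.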
101+
102+
## Deploying an Application with GPU Support
103+
104+
1. **Create a Pod Manifest**: For example, a pod that runs a CUDA vector-addition sample and requests one GPU.
105+
106+
```yaml title="cuda-example.yaml"
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
```
120+
121+
2. **Apply the Manifest**:
122+
123+
```bash
kubectl apply -f cuda-example.yaml
```
126+
127+
### Validation
128+
129+
1. **Check Pod Status**: Ensure the pod is running or has already completed.
130+
131+
```bash
kubectl get pods -n default
```
134+
135+
2. **Verify Logs**: Check logs for GPU activity.
136+
137+
```bash
# The pod name comes from metadata.name in cuda-example.yaml
kubectl logs -f cuda-vectoradd -n default
```
140+
141+
The container's logs should indicate that the operation was successful, e.g.:
142+
143+
```bash
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
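If you want to script this check rather than read the logs by eye, you can search them for the `Test PASSED` line. This is a minimal sketch that assumes the pod name `cuda-vectoradd` and the `default` namespace from the manifest above; the function name `sample_passed` is hypothetical.

```bash
# Succeed (exit code 0) only if the sample pod's logs contain the
# "Test PASSED" line printed by the vector-addition sample.
sample_passed() {
  kubectl logs cuda-vectoradd -n default | grep -q "Test PASSED"
}
```

For example, `sample_passed && echo "GPU sample succeeded"` prints the message only when the line is found.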
151+
152+
:::tip
153+
For more sample workloads, visit [NVIDIA GPU Operator Verification: Running Sample GPU Applications](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#verification-running-sample-gpu-applications).
154+
:::
155+
156+
## Troubleshooting Tips
157+
158+
### Verifying NVIDIA Drivers are Installed on Nodes
159+
160+
Ensuring that the GPU nodes have the correct NVIDIA drivers is a critical first step. SSH into one of your GPU nodes and run:
161+
162+
```bash
# If the add-on version is earlier than 2.0.0, run the following command:
cd /opt/cloud/cce/nvidia/bin && ./nvidia-smi

# If the add-on version is 2.0.0 or later and the driver installation path is changed, run the following command:
cd /usr/local/nvidia/bin && ./nvidia-smi
```
169+
170+
Alternatively, run the command directly inside a container that uses the GPU:
171+
172+
```bash
cd /usr/local/nvidia/bin && ./nvidia-smi
```
175+
176+
This command should display details such as the driver version, GPU utilization, and any active processes. If it fails or shows an outdated driver, this indicates that the node isn’t properly set up.
177+
178+
For more information, see [Verifying the Add-on](https://docs.otc.t-systems.com/cloud-container-engine/umn/add-ons/cloud_native_heterogeneous_computing_add-ons/cce_ai_suite_nvidia_gpu.html#verifying-the-add-on) in the CCE AI Suite documentation.
179+
180+
### Verifying Driver Compatibility
181+
182+
If drivers are missing or incompatible, verify that the CCE AI Suite is correctly installed and configured. Reinstalling or updating the suite might be necessary if the drivers aren’t correctly deployed. Follow the [instructions](https://docs.otc.t-systems.com/cloud-container-engine/umn/faqs/node/node_running/how_do_i_rectify_failures_when_the_nvidia_driver_is_used_to_start_containers_on_gpu_nodes.html).
183+
184+
Additionally, run the following command to check the CUDA version in the container:
185+
186+
```bash
cat /usr/local/cuda/version.txt
```
189+
190+
Then confirm that the NVIDIA driver on the node hosting the container supports the container's CUDA version. The highest CUDA version a driver supports is shown in the header of the `nvidia-smi` output.
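The comparison itself can be scripted. The following sketch assumes you have already read both version numbers yourself (the highest CUDA version supported by the node driver and the CUDA version of the container) and that your `sort` supports version ordering with `-V`, as GNU coreutils does. The function name `cuda_supported` is hypothetical.

```bash
# Succeed (exit code 0) when the container's CUDA version ($2) is not
# newer than the highest CUDA version supported by the node driver ($1).
cuda_supported() {
  # Sort both versions in version order and keep the larger one.
  highest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)"
  # The container is supported if the driver's version is the larger (or equal) one.
  [ "$highest" = "$1" ]
}
```

For example, `cuda_supported 12.4 11.7` succeeds, while `cuda_supported 11.4 12.0` fails because the container expects a newer CUDA version than the driver provides.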
191+
192+
### Reviewing Logs
193+
194+
Check whether the NVIDIA driver is running properly. Log in to the node where the add-on is running and view the driver installation log at the following path:
195+
196+
```bash
/opt/cloud/cce/nvidia/nvidia_installer.log
```
199+
200+
View standard output logs of the NVIDIA container. Filter the container ID by running the following command:
201+
202+
```bash
docker ps -a | grep nvidia
```
205+
206+
View logs by running the following command:
207+
208+
```bash
# Replace <container-id> with the ID returned by the previous command
docker logs <container-id>
```
211+
212+
### Validating Pod Resource Requests
213+
214+
Make sure that pods that need GPUs declare the following resource limit, which instructs Kubernetes to schedule them only on nodes that have available GPUs.
216+
217+
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
222+
223+
:::tip
224+
Ensure that the number of GPUs a pod requests does not exceed what is available on a single node; a pod cannot combine GPUs from several nodes.
225+
:::
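To see how many GPUs each node advertises to the scheduler, you can print the allocatable `nvidia.com/gpu` resource per node. This is a minimal sketch; the function name `gpu_capacity` and the column titles are hypothetical.

```bash
# Print one row per node with its allocatable GPU count. The dot in
# "nvidia.com" is escaped so it is not read as a path separator.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Nodes without GPUs show `<none>` in the `GPUS` column. If a GPU node also shows `<none>`, the device plugin is probably not running on it.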
226+
227+
### Addressing Scheduling Conflicts
228+
229+
- **Resource Overcommitment:**
  - If multiple pods are scheduled with GPU resource requests, ensure that the overall demand does not exceed the cluster’s capacity.
  - Overcommitting resources might lead to scheduling failures.
- **Taints and Tolerations:**
  - GPU nodes may have specific taints (e.g., `nvidia.com/gpu=true:NoExecute`).
  - Verify that your GPU-enabled pods include the proper tolerations so that the scheduler can place the pods on the GPU nodes.
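If the GPU node pool carries the example taint, the sample pod from this guide needs a matching toleration before it can be scheduled. The sketch below assumes the taint `nvidia.com/gpu=true:NoExecute`; the function name `apply_tolerated_sample` and the pod name `cuda-vectoradd-tolerated` are hypothetical.

```bash
# Apply a variant of the sample pod that tolerates the example taint
# nvidia.com/gpu=true:NoExecute. Adjust key, value, and effect to match
# the taint that is actually set on your node pool.
apply_tolerated_sample() {
  kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-tolerated
spec:
  restartPolicy: OnFailure
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "true"
      effect: "NoExecute"
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

After calling `apply_tolerated_sample`, `kubectl describe pod cuda-vectoradd-tolerated` shows the scheduling events. A `FailedScheduling` event that mentions an untolerated taint suggests that the toleration does not match the taint on the nodes.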
235+
236+
### Checking Operator Status
237+
238+
Check the Helm release and the operator's pods. Errors in either might indicate issues that indirectly affect GPU resource allocation:
239+
240+
```bash
helm list -n gpu-operator
kubectl get pods -n gpu-operator
```
244+
245+
### Additional Information
246+
247+
:::info see also
248+
249+
- [How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?](https://docs.otc.t-systems.com/cloud-container-engine/umn/faqs/node/node_running/how_do_i_rectify_failures_when_the_nvidia_driver_is_used_to_start_containers_on_gpu_nodes.html)
250+
- [What Should I Do If an Error Occurs When I Deploy a Service on the GPU Node?](https://docs.otc.t-systems.com/cloud-container-engine/umn/faqs/workload/workload_exception_troubleshooting/what_should_i_do_if_an_error_occurs_when_i_deploy_a_service_on_the_gpu_node.html#cce-faq-00109)
251+
- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.17.4/troubleshooting.html)
252+
:::