# How to configure the Device Plugin to report different GPUs

Volcano v1.9.0 introduces Capacity scheduling capabilities. However, the default Nvidia Device Plugin reports every GPU as the single resource `nvidia.com/gpu` and cannot report different GPU models as separate resources. To address this, three steps are needed:

1. Install a custom Device Plugin
2. Configure DCGM Exporter for Pod-level monitoring
3. Configure Volcano to use the Capacity scheduling plugin

## 1. Install a Custom Device Plugin

### 1.1 Configure GPU Operator and GPU Feature Discovery

Initially, we used the NVIDIA GPU Operator to manage GPU resources uniformly, with GPU Feature Discovery (GFD) and related components already configured. Since the NVIDIA driver is already installed on the hosts and we need a customized Device Plugin, configure the GPU Operator to enable DCGM Exporter while disabling its driver and Device Plugin management.
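
A minimal sketch of that operator setup, assuming the official `nvidia/gpu-operator` Helm chart and its standard enable/disable switches (verify the flag names against your chart version):

```sh
# Keep DCGM Exporter under operator management, but leave the
# host-installed driver and our custom Device Plugin alone.
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set devicePlugin.enabled=false \
  --set dcgmExporter.enabled=true
```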

### 1.2 Install a Custom Device Plugin

Volcano provides queue-based resource capabilities, but to report different types of GPUs as distinct resources, the Device Plugin needs to be adapted.

- Related Issue: [Advertising specific GPU types as separate extended resource · Issue #424 · NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin/issues/424)
- Related Code: [k8s-device-plugin/cmd/nvidia-device-plugin/main.go at eb8fd565c3df0caca59bf0ff2ae918e647f46af3 · NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin/blob/eb8fd565c3df0caca59bf0ff2ae918e647f46af3/cmd/nvidia-device-plugin/main.go#L239)

When installing the Device Plugin via Helm, specify the configuration files:

```sh
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.15.0 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set config.default=other-config \
  --set-file config.map.other-config=other-config.yaml \
  --set-file config.map.p100-config=p100-config.yaml \
  --set-file config.map.v100-config=v100-config.yaml
```
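
With several named configs registered, individual nodes can be pointed at the one matching their GPUs via the config-selection label the plugin watches (the node name below is illustrative; unlabeled nodes fall back to `config.default`):

```sh
# Tell the plugin on this node to load the p100-config entry
kubectl label nodes gpu-node-1 nvidia.com/device-plugin.config=p100-config
```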

Configuration file content:

```yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
resources:
  gpus:
    - pattern: "Tesla V100-SXM2-32GB"
      name: v100
    - pattern: "Tesla P100-PCIE-*"
      name: p100
    - pattern: "NVIDIA GeForce RTX 2080 Ti"
      name: 2080ti
    - pattern: "NVIDIA TITAN Xp"
      name: titan
    - pattern: "Tesla T4"
      name: t4
```
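
The `pattern` fields match against the GPU product name. To see what names your nodes actually report, run this on a GPU host:

```sh
nvidia-smi --query-gpu=name --format=csv,noheader
# Tesla T4
```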

Modify the Nvidia Device Plugin source code along the lines of the issue and code linked above, so that device model names are matched against the `pattern` entries and advertised under the corresponding resource names.

Additionally, because of the Go version on my build machine, I had to modify the Dockerfile and rebuild the image. After modifying and repackaging, replace the DaemonSet image with the new version to report different GPU models as different resources.
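
A rough outline of that rebuild-and-swap, with the registry, tag, and DaemonSet/container names as placeholders for your own (the Helm release's actual object names may differ):

```sh
# Build and push the modified plugin image
docker build -t registry.example.com/k8s-device-plugin:v0.15.0-custom .
docker push registry.example.com/k8s-device-plugin:v0.15.0-custom

# Point the DaemonSet at the custom image; pods roll automatically
kubectl -n nvidia-device-plugin set image daemonset/nvdp-nvidia-device-plugin \
  nvidia-device-plugin-ctr=registry.example.com/k8s-device-plugin:v0.15.0-custom
```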

### 1.3 Clean Up Outdated Device Plugin Resources

Although the new resources are now reported, the old `nvidia.com/gpu` entry does not disappear from the node status:

```sh
kubectl get nodes -ojson | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'
```

Sample output:

```json
{
  "name": "huawei-82",
  "allocatable": {
    "cpu": "80",
    "ephemeral-storage": "846624789946",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "263491632Ki",
    "nvidia.com/gpu": "0",
    "nvidia.com/t4": "2",
    "pods": "110"
  }
}
```

The stale entry lives in the node's `status` subresource, so patch it through the API server. Start `kubectl proxy`:

```sh
kubectl proxy
# Starting to serve on 127.0.0.1:8001
```

Deletion script (note that `/` inside a JSON Patch path must be escaped as `~1`):

```bash
#!/bin/bash

# Check if a node name is provided
if [ -z "$1" ]; then
  echo "Usage: $0 <node-name>"
  exit 1
fi

NODE_NAME=$1

# Prepare the JSON Patch data
PATCH_DATA=$(cat <<EOF
[
  {"op": "remove", "path": "/status/capacity/nvidia.com~1gpu"}
]
EOF
)

# Execute the PATCH request against the node's status subresource
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data "$PATCH_DATA" \
  http://127.0.0.1:8001/api/v1/nodes/$NODE_NAME/status

echo "Patch request sent for node $NODE_NAME"
```

Save the script as `patch_node_gpu.sh`, make it executable, and pass the node name:

```sh
vim patch_node_gpu.sh
chmod +x patch_node_gpu.sh
./patch_node_gpu.sh huawei-82
```
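
Re-check the node to confirm the stale `nvidia.com/gpu` entry is gone:

```sh
kubectl get node huawei-82 -ojson | jq '.status.allocatable'
```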

This completes the first stage: re-reporting GPU resources.
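
At this point a workload can request a specific GPU model directly. A minimal smoke test (image and quantity are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: t4-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/t4: "1" # the renamed resource, not nvidia.com/gpu
```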

## 2. Configure DCGM Exporter for Pod-Level Monitoring

After changing the GPU resource names, we found that DCGM Exporter could no longer attribute GPU usage metrics to Pods. The reason is that DCGM Exporter only maps devices to Pods for resources named exactly `nvidia.com/gpu` or prefixed with `nvidia.com/mig-`.

To address this, modify the resource-name matching logic in DCGM Exporter, repackage the image, and replace it.
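
To spot-check that Pod attribution is back, scrape the exporter directly (service name and namespace below follow GPU Operator defaults; adjust to your deployment):

```sh
# Forward the exporter port locally
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &

# Utilization samples should now carry pod/namespace/container labels
curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```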

## 3. Configure Volcano to Use the Capacity Scheduling Plugin

Volcano provides a guide titled "How to use capacity plugin", but it is not entirely accurate: besides enabling the `capacity` plugin in the scheduler ConfigMap, you also need to add the `reclaim` action to enable elasticity.

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, reclaim" # add reclaim
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity # add this plugin and remove the proportion plugin
      - name: nodeorder
      - name: binpack
```
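
After updating the ConfigMap, restart the scheduler so it picks up the new configuration (recent Volcano versions may also hot-reload the mounted file; the deployment name here assumes a default install):

```sh
kubectl -n volcano-system rollout restart deployment/volcano-scheduler
```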

Additionally, when a Pod requests resources in multiple dimensions (such as CPU, memory, and GPU), make sure the queue's usage in every dimension stays within its Deserved value; once any dimension exceeds it, the queue's workloads become eligible for preemption.
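
For reference, a minimal Queue sketch under the capacity plugin, using one of the renamed resources (queue name and quantities are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
spec:
  reclaimable: true
  # Per-dimension quotas: usage beyond `deserved` in any dimension
  # makes the queue's workloads eligible for reclaim.
  deserved:
    cpu: "32"
    memory: 64Gi
    nvidia.com/t4: "2"
```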