docs/gpu-sharing/README.md: 83 additions & 1 deletion
@@ -18,6 +18,15 @@ GPU sharing is disabled by default. To enable it, add the following flag to the
--set "global.gpuSharing=true"
```

To verify GPU sharing is enabled, check the scheduler configuration:
```
# Check if GPU sharing is enabled in the scheduler config (Config CRD)
kubectl get config -n kai-scheduler kai-config -o jsonpath='{.spec.admission.gpuSharing}'

# Verify the binder component is running
kubectl get pods -n kai-scheduler -l app=binder
```

### Runtime Class Configuration
KAI Scheduler's binder component creates reservation pods that require access to the GPU devices. These pods must run on a container runtime that can provide NVML support. By default, KAI Scheduler uses the `nvidia` Runtime Class, which is typically configured by the NVIDIA device plugin.
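
For reference, a Runtime Class is a plain Kubernetes object; a minimal sketch of what the `nvidia` Runtime Class typically looks like is shown below. In most clusters it is created automatically by the NVIDIA device plugin or GPU Operator rather than by hand, and the exact `handler` value depends on how the NVIDIA container runtime is registered on the nodes, so treat this as an illustrative sketch rather than the exact object in your cluster.

```
# Sketch of a typical "nvidia" RuntimeClass, usually created by the NVIDIA
# device plugin / GPU Operator rather than applied manually.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # assumed handler name; must match the runtime configured on the nodes
```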

@@ -29,6 +38,21 @@ To submit a pod that can share a GPU device, run this command:
kubectl apply -f gpu-sharing.yaml
```

To check the pod status and verify GPU sharing configuration:
```
# Check pod status and view detailed information including events
kubectl describe pod gpu-sharing

# Verify GPU sharing annotation
kubectl get pod gpu-sharing -o jsonpath='{.metadata.annotations.gpu-fraction}'

# Check if reservation pod was created (required for GPU sharing)
kubectl get pods -n kai-resource-reservation

# Check pod logs to verify GPU access (only available when pod is running)
kubectl logs gpu-sharing -c gpu-workload
```

In the gpu-sharing.yaml file, the pod includes a `gpu-fraction` annotation with a value of 0.5, meaning:
* The pod is allowed to consume up to half of a GPU device's memory
* Other pods with a combined request of up to 0.5 of the GPU memory can share this device as well
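
The gpu-sharing.yaml manifest itself is not reproduced in this diff; a minimal sketch of a pod carrying this annotation might look like the following. The queue name, image, and `schedulerName` value are illustrative assumptions; only the annotation key, the queue label key, and the container name follow the commands in this document.

```
# Minimal sketch of a GPU-sharing pod. Queue name, image and schedulerName
# are placeholders/assumptions; the annotation and label keys come from this doc.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
  labels:
    kai.scheduler/queue: test            # placeholder queue name
  annotations:
    gpu-fraction: "0.5"                  # request up to half of the GPU device memory
spec:
  schedulerName: kai-scheduler           # assumed scheduler name
  containers:
    - name: gpu-workload
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["sleep", "infinity"]
```
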
@@ -39,6 +63,22 @@ To submit a pod that requests a specific amount of GPU memory, run this command:
```
kubectl apply -f gpu-memory.yaml
```

To check the pod status and verify GPU memory configuration:
```
# Check pod status and view detailed information including events
kubectl describe pod gpu-sharing

# Verify GPU memory annotation (value in MiB)
kubectl get pod gpu-sharing -o jsonpath='{.metadata.annotations.gpu-memory}'

# Check if reservation pod was created (required for GPU sharing)
kubectl get pods -n kai-resource-reservation

# Check pod logs to verify GPU access (only available when pod is running)
kubectl logs gpu-sharing -c gpu-workload
```

In the gpu-memory.yaml file, the pod includes a `gpu-memory` annotation with a value of 2000 (in MiB), meaning:
* The pod is allowed to consume up to 2000 MiB of the GPU device's memory
* The remaining GPU device memory can be shared with other pods in the cluster
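
As with the previous example, gpu-memory.yaml is not shown in this diff; a minimal sketch of a pod requesting GPU memory by size might look like the following, with the queue name, image, and `schedulerName` again being illustrative assumptions.

```
# Minimal sketch of a pod requesting 2000 MiB of GPU memory. Queue name,
# image and schedulerName are assumptions; the gpu-memory key is from this doc.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
  labels:
    kai.scheduler/queue: test            # placeholder queue name
  annotations:
    gpu-memory: "2000"                   # MiB of GPU device memory
spec:
  schedulerName: kai-scheduler           # assumed scheduler name
  containers:
    - name: gpu-workload
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["sleep", "infinity"]
```
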
@@ -52,8 +92,50 @@ To allocate GPU fraction to a specific container in a multi-container pod:
kubectl apply -f gpu-sharing-non-default-container.yaml
```

To check the pod status and verify container-specific GPU allocation:
```
# Check pod status and view detailed information including events
kubectl describe pod gpu-sharing-non-default

# Verify GPU fraction and container name annotations
kubectl get pod gpu-sharing-non-default -o jsonpath='GPU Fraction: {.metadata.annotations.gpu-fraction}, Container: {.metadata.annotations.gpu-fraction-container-name}{"\n"}'

# Check if reservation pod was created (required for GPU sharing)
kubectl get pods -n kai-resource-reservation

# Check logs for the specific container that received GPU allocation (only available when pod is running)
kubectl logs gpu-sharing-non-default -c gpu-workload
```

In the gpu-sharing-non-default-container.yaml file, the pod includes:
* `gpu-fraction: "0.5"` - Requests half of a GPU device memory
* `gpu-fraction-container-name: "gpu-workload"` - Specifies that the container named "gpu-workload" should receive the GPU allocation instead of the default first container

This is useful for pods with sidecar containers where only one specific container needs GPU access. This works the same for init and regular containers.
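
A minimal sketch of such a multi-container pod is shown below. The sidecar container, images, queue name, and `schedulerName` are illustrative assumptions; the annotation keys, pod name, and `gpu-workload` container name follow the commands above.

```
# Sketch of a multi-container pod where only "gpu-workload" gets the GPU share.
# Sidecar, images, queue name and schedulerName are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-non-default
  labels:
    kai.scheduler/queue: test                      # placeholder queue name
  annotations:
    gpu-fraction: "0.5"
    gpu-fraction-container-name: "gpu-workload"    # route the GPU share to this container
spec:
  schedulerName: kai-scheduler                     # assumed scheduler name
  containers:
    - name: sidecar                                # receives no GPU allocation
      image: busybox                               # placeholder image
      command: ["sleep", "infinity"]
    - name: gpu-workload                           # receives the 0.5 GPU allocation
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      command: ["sleep", "infinity"]
```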

### Troubleshooting

If pods are not being scheduled or GPU sharing is not working as expected, use the following commands to diagnose issues:

```
# Check if GPU sharing is enabled
kubectl get config -n kai-scheduler kai-config -o jsonpath='{.spec.admission.gpuSharing}'

# Verify reservation pods are running
kubectl get pods -n kai-resource-reservation

# Check reservation pod logs
kubectl logs -n kai-resource-reservation <reservation-pod-name>

# Check scheduler logs for GPU sharing related messages
kubectl logs -n kai-scheduler -l app=kai-scheduler --tail=100 | grep -i gpu

# Check binder logs for GPU reservation issues
kubectl logs -n kai-scheduler -l app=binder --tail=100 | grep -i "gpu\|reservation"

# Check pod events for scheduling failures
kubectl describe pod <pod-name>

# Verify pod annotations and queue assignment
kubectl get pod <pod-name> -o jsonpath='Queue: {.metadata.labels.kai\.scheduler/queue}, GPU Fraction: {.metadata.annotations.gpu-fraction}, GPU Memory: {.metadata.annotations.gpu-memory}{"\n"}'
```