Description
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04):
- Kernel Version:
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): KServe (OpenShift AI 3.0)
- GPU Operator Version:
- NIM Operator Version: NIM Operator with the commits from Use v1 DRA APIs #732
- LLM NIM Versions:
- NeMo Service Versions:
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
After upgrading the operator, an existing NIMService with deployment mode RawDeployment fails to reconcile.
Error: message: 'admission webhook "inferenceservice.kserve-webhook-server.validator" denied the request: update rejected: deploymentMode cannot be changed from ''RawDeployment'' to ''Standard'''
Cause: The NIMService was using RawDeployment. After the upgrade, the operator sets the deployment mode annotation to Standard instead of the legacy RawDeployment, and the KServe webhook rejects the change.
Suggested solution options:
- Pass the validation if `Standard` is used in the annotation but `RawDeployment` is in the status.
- Set the annotation to use `RawDeployment` instead of `Standard`, if the NIMService is already using `RawDeployment`.
- Don't set the annotation for any existing NIMService.
Further consideration: The deployment modes Standard and Knative are only applicable in newer versions of KServe. For older versions of KServe, RawDeployment and Serverless should be used.
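The second and third options, combined with the version consideration, could be sketched roughly as below. This is only an illustration, not the operator's actual code: `resolveDeploymentMode`, `legacyToCurrent`, and the `supportsNewNames` flag are hypothetical names, and the pairing of `Serverless` with `Knative` is an assumption (only the `RawDeployment`/`Standard` pairing is confirmed by the webhook error above). The annotation key `serving.kserve.io/deploymentMode` is the one KServe reads.

```go
package main

import "fmt"

// Annotation key that KServe reads to select the deployment mode.
const deploymentModeAnnotation = "serving.kserve.io/deploymentMode"

// legacyToCurrent pairs legacy mode names with their newer equivalents.
// RawDeployment/Standard follows from the webhook error in this issue;
// Serverless/Knative is an assumption made for this sketch.
var legacyToCurrent = map[string]string{
	"RawDeployment": "Standard",
	"Serverless":    "Knative",
}

// resolveDeploymentMode (hypothetical helper) picks the annotation value.
// desired:          the mode the operator would like to set
// existing:         the mode already recorded on the InferenceService, "" if none
// supportsNewNames: whether the installed KServe accepts Standard/Knative
func resolveDeploymentMode(desired, existing string, supportsNewNames bool) string {
	// An existing InferenceService keeps its mode, because the webhook
	// rejects any change to it.
	if existing != "" {
		return existing
	}
	if supportsNewNames {
		// Translate a legacy name to its newer equivalent if there is one.
		if current, ok := legacyToCurrent[desired]; ok {
			return current
		}
		return desired
	}
	// Older KServe: translate a newer name back to its legacy equivalent.
	for legacy, current := range legacyToCurrent {
		if current == desired {
			return legacy
		}
	}
	return desired
}

func main() {
	annotations := map[string]string{}

	// Existing RawDeployment service after the operator upgrade: mode is preserved.
	annotations[deploymentModeAnnotation] = resolveDeploymentMode("Standard", "RawDeployment", true)
	fmt.Println(annotations[deploymentModeAnnotation])

	// New service on an older KServe: falls back to the legacy name.
	fmt.Println(resolveDeploymentMode("Standard", "", false))

	// New service on a newer KServe: uses the newer name.
	fmt.Println(resolveDeploymentMode("RawDeployment", "", true))
}
```

Keeping the existing mode first means an upgraded operator never attempts the transition that the webhook denies, while new NIMServices can still adopt the newer names where KServe supports them.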
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
- Create a NIMService with an older version of the NIM operator and an older version of KServe. Use the deployment mode `RawDeployment`.
- Upgrade the NIM operator to include commits in Use v1 DRA APIs #732.
- Check the status of the NIMService.
4. Information to attach
- Operator pod status:
  - `kubectl get pods -n OPERATOR_NAMESPACE`
  - `kubectl logs <operator-pod> -n OPERATOR_NAMESPACE`
- NIM Cache status:
  - `kubectl get nimcache -A`
  - `kubectl describe nimcache -n <namespace>`
  - `kubectl get events -n <namespace>`
  - `kubectl logs <caching-job> -n <namespace>`
  - `kubectl get pv,pvc -n <namespace>`
- NIM Service status:
  - `kubectl get nimservice -A`
  - `kubectl describe nimservice -n <namespace>`
  - `kubectl get events -n <namespace>`
  - `kubectl logs <nim-service-pod> -n <namespace>`
- If a pod/deployment is in an error state or pending state:
  - `kubectl describe pod -n <namespace> POD_NAME`
- If a pod/deployment is in an error state or pending state:
  - `kubectl logs -n <namespace> POD_NAME --all-containers`
- Output from running `nvidia-smi` from the driver container deployed by the GPU Operator:
  - `kubectl exec DRIVER_POD_NAME -n <GPU_OPERATOR_NAMESPACE> -c nvidia-driver-ctr -- nvidia-smi`