
NIMService deployment mode error after NIM operator upgrade #735

@xieshenzh

Description

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
  • Kernel Version:
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): KServe (OpenShift AI 3.0)
  • GPU Operator Version:
  • NIM Operator Version: NIM Operator built with the commits from Use v1 DRA APIs #732
  • LLM NIM Versions:
  • NeMo Service Versions:

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

After upgrading the operator, an existing NIMService with deployment mode RawDeployment fails to reconcile.

Error: admission webhook "inferenceservice.kserve-webhook-server.validator" denied the request: update rejected: deploymentMode cannot be changed from 'RawDeployment' to 'Standard'

Cause: The NIMService was using RawDeployment. After the operator upgrade, the operator sets the deployment-mode annotation to Standard instead of the legacy RawDeployment mode.
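To see what changed, one can compare the annotation the operator now applies against the mode already recorded in status (a sketch only; it assumes the operator propagates the standard KServe annotation serving.kserve.io/deploymentMode to the generated InferenceService, with <inferenceservice-name> and <namespace> filled in for the affected resource):

  # Show both the deployment-mode annotation and the mode recorded in status
  kubectl get inferenceservice <inferenceservice-name> -n <namespace> -o yaml | grep -i deploymentmode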

Suggested solution options:

  1. Pass the validation if Standard is used in the annotation but RawDeployment is in the status.
  2. Set the annotation to use RawDeployment instead of Standard, if the NIMService is already using RawDeployment.
  3. Don't set the annotation for any existing NIMService.

Further consideration: The deployment modes Standard and Knative are only available in newer versions of KServe. For older versions of KServe, RawDeployment and Serverless should be used.
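Related to option 2 above, a possible manual stop-gap is to pin the annotation on the generated InferenceService back to the mode already recorded in its status, so the webhook sees no change. This is only a sketch under assumptions: the annotation key is the standard KServe serving.kserve.io/deploymentMode, and the operator may simply re-apply Standard on its next reconcile.

  # Hypothetical workaround: re-align the annotation with the recorded RawDeployment mode
  kubectl annotate inferenceservice <inferenceservice-name> -n <namespace> \
    serving.kserve.io/deploymentMode=RawDeployment --overwrite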

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. Create a NIMService with an older version of the NIM Operator and an older version of KServe, using the RawDeployment deployment mode.
  2. Upgrade the NIM Operator to a build that includes the commits from Use v1 DRA APIs #732.
  3. Check the status of the NIMService (for example, with the commands below).
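For step 3, the webhook denial typically shows up in the resource description and in warning events (the exact condition wording may differ from the message quoted above):

  # Inspect the NIMService and the namespace warning events for the denial message
  kubectl describe nimservice <nimservice-name> -n <namespace>
  kubectl get events -n <namespace> --field-selector type=Warning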

4. Information to attach

  • Operator pod status:

    • kubectl get pods -n OPERATOR_NAMESPACE
    • kubectl logs <operator-pod> -n OPERATOR_NAMESPACE
  • NIM Cache status:

    • kubectl get nimcache -A
    • kubectl describe nimcache -n <namespace>
    • kubectl get events -n <namespace>
    • kubectl logs job/<caching-job> -n <namespace>
    • kubectl get pv,pvc -n <namespace>
  • NIM Service status:

    • kubectl get nimservice -A
    • kubectl describe nimservice -n <namespace>
    • kubectl get events -n <namespace>
    • kubectl logs <nim-service-pod> -n <namespace>
  • If a pod/deployment is in an error or pending state: kubectl describe pod -n <namespace> POD_NAME

  • If a pod/deployment is in an error or pending state: kubectl logs -n <namespace> POD_NAME --all-containers

  • Output from running nvidia-smi from the driver container deployed by the GPU Operator: kubectl exec DRIVER_POD_NAME -n <GPU_OPERATOR_NAMESPACE> -c nvidia-driver-ctr -- nvidia-smi
