
NIMService deployment mode error after NIM operator upgrade #735

@xieshenzh

Description

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
  • Kernel Version:
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): KServe (OpenShift AI 3.0)
  • GPU Operator Version:
  • NIM Operator Version: NIM Operator built with the commits from Use v1 DRA APIs #732
  • LLM NIM Versions:
  • NeMo Service Versions:

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

After upgrading the operator, an existing NIMService with deployment mode RawDeployment fails to reconcile.

Error: admission webhook "inferenceservice.kserve-webhook-server.validator" denied the request: update rejected: deploymentMode cannot be changed from 'RawDeployment' to 'Standard'

Cause: The NIMService was using RawDeployment. After the operator upgrade, the operator sets the deployment-mode annotation to Standard instead of the legacy RawDeployment mode.
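To see what changed, one can compare the annotation the operator now applies against the mode already recorded in status (a sketch only; it assumes the operator propagates the standard KServe annotation serving.kserve.io/deploymentMode to the generated InferenceService, with <inferenceservice-name> and <namespace> filled in for the affected resource):

  # Show both the deployment-mode annotation and the mode recorded in status
  kubectl get inferenceservice <inferenceservice-name> -n <namespace> -o yaml | grep -i deploymentmode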

Suggested solution options:

  1. Pass the validation if Standard is used in the annotation but RawDeployment is in the status.
  2. Set the annotation to use RawDeployment instead of Standard, if the NIMService is already using RawDeployment.
  3. Don't set the annotation for any existing NIMService.

Further consideration: The deployment modes Standard and Knative are only available in newer versions of KServe. For older versions of KServe, RawDeployment and Serverless should be used.
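Related to option 2 above, a possible manual stop-gap is to pin the annotation on the generated InferenceService back to the mode already recorded in its status, so the webhook sees no change. This is only a sketch under assumptions: the annotation key is the standard KServe serving.kserve.io/deploymentMode, and the operator may simply re-apply Standard on its next reconcile.

  # Hypothetical workaround: re-align the annotation with the recorded RawDeployment mode
  kubectl annotate inferenceservice <inferenceservice-name> -n <namespace> \
    serving.kserve.io/deploymentMode=RawDeployment --overwrite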

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. Create a NIMService with an older version of the NIM Operator and an older version of KServe, using the RawDeployment deployment mode.
  2. Upgrade the NIM Operator to a build that includes the commits from Use v1 DRA APIs #732.
  3. Check the status of the NIMService (for example, with the commands below).
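For step 3, the webhook denial typically shows up in the resource description and in warning events (the exact condition wording may differ from the message quoted above):

  # Inspect the NIMService and the namespace warning events for the denial message
  kubectl describe nimservice <nimservice-name> -n <namespace>
  kubectl get events -n <namespace> --field-selector type=Warning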

4. Information to attach

  • Operator pod status:

    • kubectl get pods -n OPERATOR_NAMESPACE
    • kubectl logs <operator-pod> -n OPERATOR_NAMESPACE
  • NIM Cache status:

    • kubectl get nimcache -A
    • kubectl describe nimcache -n <namespace>
    • kubectl get events -n <namespace>
    • kubectl logs job/<caching-job> -n <namespace>
    • kubectl get pv,pvc -n <namespace>
  • NIM Service status:

    • kubectl get nimservice -A
    • kubectl describe nimservice -n <namespace>
    • kubectl get events -n <namespace>
    • kubectl logs <nim-service-pod> -n <namespace>
  • If a pod/deployment is in an error or pending state: kubectl describe pod -n <namespace> POD_NAME

  • If a pod/deployment is in an error or pending state: kubectl logs -n <namespace> POD_NAME --all-containers

  • Output from running nvidia-smi from the driver container deployed by the GPU Operator: kubectl exec DRIVER_POD_NAME -n <GPU_OPERATOR_NAMESPACE> -c nvidia-driver-ctr -- nvidia-smi
