fix: skip INFERENCE_SERVICE_NAME injection on pre-existing deployments to avoid pod restart on upgrade (RHOAIENG-59268) by andresllh · Pull Request #1435 · opendatahub-io/kserve

andresllh · 2026-04-22T19:37:55Z

Summary

Fixes a bug where upgrading from RHOAI 3.3.2 to 3.4 caused all RawDeployment InferenceServices in the cluster to restart simultaneously
Root cause: upstream commit be2cb412f (feat: new env var INFERENCE_SERVICE_NAME, kserve/kserve#5013) unconditionally injects INFERENCE_SERVICE_NAME into every pod template on reconcile, changing the pod template hash and triggering a rolling restart of every existing deployment
Fix: before injecting the env var, check if the existing Deployment already has it; if not (pre-upgrade state), skip injection to preserve the pod template hash; new ISVCs always get the env var on creation
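
The decision described in the fix can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Container` and `Deployment` types are simplified stand-ins for the Kubernetes API types, `shouldInject` is a hypothetical name, and the handling of an unmatched container name is an assumption.

```go
package main

import "fmt"

// envKey is the variable named in the PR description.
const envKey = "INFERENCE_SERVICE_NAME"

// Container and Deployment are simplified stand-ins for the Kubernetes
// API types; they are assumptions made for this sketch only.
type Container struct {
	Name string
	Env  map[string]string
}

type Deployment struct {
	Containers []Container
}

// shouldInject (hypothetical name) reports whether the env var may be added
// to the container called name without changing an existing pod template.
// existing is nil when no Deployment exists yet, i.e. for a new ISVC.
func shouldInject(existing *Deployment, name string) bool {
	if existing == nil {
		return true // new ISVC: nothing can restart, so always inject
	}
	for _, c := range existing.Containers {
		if c.Name == name {
			// Present: injecting again keeps the template identical.
			// Absent: pre-upgrade Deployment, so skip to preserve the hash.
			_, ok := c.Env[envKey]
			return ok
		}
	}
	// No container with that name: treated as "skip" here (an assumption).
	return false
}

func main() {
	pre := &Deployment{Containers: []Container{{Name: "kserve-container"}}}
	post := &Deployment{Containers: []Container{{
		Name: "kserve-container",
		Env:  map[string]string{envKey: "my-isvc"},
	}}}

	fmt.Println("new ISVC:    ", shouldInject(nil, "kserve-container"))
	fmt.Println("pre-upgrade: ", shouldInject(pre, "kserve-container"))
	fmt.Println("post-upgrade:", shouldInject(post, "kserve-container"))
}
```

The three calls in `main` correspond to the new, pre-upgrade, and post-upgrade states named in the summary.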

Test plan

Reproduced the bug: swapping to the 3.4 controller against a 3.3.2 ISVC deployment created a new ReplicaSet and injected INFERENCE_SERVICE_NAME
Verified the fix: swapping to the fixed controller against the same pre-upgrade ISVC left the ReplicaSet unchanged and did not inject the env var
Confirmed new ISVCs still receive INFERENCE_SERVICE_NAME on creation
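
The reproduction above yields a new ReplicaSet because a Deployment rolls out whenever its pod template changes, and each ReplicaSet is tied to a hash of that template. The sketch below imitates this with a JSON encoding and an FNV-1a hash. It is a simplified stand-in, not the hashing code Kubernetes uses, and the `PodTemplate` type and image name are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// EnvVar and PodTemplate are hypothetical, heavily reduced versions of the
// real pod template; only the fields needed for the illustration are kept.
type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type PodTemplate struct {
	Image string   `json:"image"`
	Env   []EnvVar `json:"env,omitempty"`
}

// templateHash hashes a serialized template. Any change to the template,
// including one extra env var, produces different input bytes.
func templateHash(t PodTemplate) uint32 {
	b, err := json.Marshal(t)
	if err != nil {
		panic(err)
	}
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

func main() {
	before := PodTemplate{Image: "example/model-server:1.0"}

	// Simulate the unconditional injection performed on reconcile.
	after := before
	after.Env = append(after.Env, EnvVar{Name: "INFERENCE_SERVICE_NAME", Value: "my-isvc"})

	fmt.Printf("before: %08x\n", templateHash(before))
	fmt.Printf("after:  %08x\n", templateHash(after))
	fmt.Println("rollout needed:", templateHash(before) != templateHash(after))
}
```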

Summary by CodeRabbit

Bug Fixes
- Optimized environment-variable injection for InferenceService predictors and transformers to avoid unnecessary pod rollouts during upgrades. The controller now looks up the existing Deployment and skips injecting INFERENCE_SERVICE_NAME when that Deployment does not already carry it, which avoids a simultaneous restart of existing services during operator upgrades. Error handling for deployment lookups was also improved.

coderabbitai · 2026-04-22T19:38:13Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 8607bccb-3855-47d1-ac32-3470e6b0b045

📥 Commits

Reviewing files that changed from the base of the PR and between b12ccb2 and 89a77a1.

📒 Files selected for processing (4)

pkg/controller/v1beta1/inferenceservice/components/component.go
pkg/controller/v1beta1/inferenceservice/components/component_test.go
pkg/controller/v1beta1/inferenceservice/components/predictor.go
pkg/controller/v1beta1/inferenceservice/components/transformer.go

📝 Walkthrough

Walkthrough

The reconcile logic for predictor and transformer pod specs now conditionally injects the INFERENCE_SERVICE_NAME environment variable. The controller first attempts to GET the existing Deployment for the component; if the Deployment exists and its designated container does not yet define the env var (pre-upgrade state), injection is skipped to avoid triggering a restart during upgrades. If the Deployment is NotFound (new ISVC) or the env var is already present, injection proceeds as before. The reconcile now handles non-NotFound GET errors explicitly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Security & Code Quality Issues

Error handling for the Deployment GET depends on telling API errors apart. Distinguish apierrors.IsNotFound(err) from other errors; improper handling of exceptional conditions is CWE-703. Action: ensure non-NotFound errors return/requeue and are explicitly logged with context (component, namespace, error).
Potential race condition between Deployment inspection and subsequent pod changes (time-of-check vs time-of-use). Race conditions are CWE-362. Action: consider reconciling based on observed state changes (requeue on update) or make injection decision idempotent and safe if state changes.
Silent skipping of env var injection may create mixed cluster state where some pods lack the env var. This is a configuration inconsistency risk. Action: document invariant or enforce through admission/validation or explicit reconciliation that ensures all runtime pods meet required env expectations.
Namespace assumption risk: code assumes Deployment lives in isvc.Namespace. If cross-namespace patterns exist, this may be incorrect. Action: validate namespace invariants or resolve resource references explicitly.
Insufficient logging context on skips and errors reduces diagnosability. Action: include isvc name, namespace, component (predictor/transformer), and Deployment name in logs.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main fix: conditional injection of INFERENCE_SERVICE_NAME on pre-existing deployments to prevent pod restarts during upgrades, which directly addresses the changeset's core objective.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/v1beta1/inferenceservice/components/predictor.go`:
- Around line 160-162: The transformer currently unconditionally injects
INFERENCE_SERVICE_NAME causing unnecessary rolling restarts; update the
transformer logic to mirror the predictor behavior by checking the existing
Deployment before calling AddEnvVarToPodSpec: if there's no existing Deployment
(new ISVC) allow injecting, but if an existing Deployment exists only call
AddEnvVarToPodSpec when that Deployment's PodSpec already contains the
INFERENCE_SERVICE_NAME env var. Locate the transformer component code that
invokes AddEnvVarToPodSpec and add the same existence-and-env-presence guard
used in the predictor (check existing Deployment's containers for
INFERENCE_SERVICE_NAME) so injection only happens when safe.
- Around line 163-177: The current logic treats any Get error as "no deployment"
and only checks a single container name, causing incorrect injection decisions;
update p.client.Get(...) error handling to return the error when errGet != nil
and !apierrors.IsNotFound(errGet) (fail closed), and when the Deployment exists
(errGet == nil) compute injectEnvVar by iterating the new pod spec containers
(e.g., podSpec.Containers or the variable used to build the Deployment) and for
each new container find the matching container in
existingDeployment.Spec.Template.Spec.Containers by name and verify
constants.InferenceServiceNameEnvVarKey is present; set injectEnvVar = true only
if every new container has that env var in the existing deployment, otherwise
set injectEnvVar = false and log the skip using p.Log.Info; refer to
constants.PredictorServiceName and constants.InferenceServiceContainerName where
appropriate.
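
The fail-closed handling requested in the second comment can be sketched like this. The `getFunc` lookup and the `errNotFound` sentinel are hypothetical stand-ins for the controller's client call and `apierrors.IsNotFound`; only the control flow is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API server's NotFound status, and
// errTimeout for any other API failure. Both are assumptions of this sketch.
var (
	errNotFound = errors.New("deployment not found")
	errTimeout  = errors.New("request timed out")
)

// getFunc is a hypothetical lookup that reports whether the existing
// Deployment already carries the env var.
type getFunc func(name string) (hasEnv bool, err error)

// decideInjection returns whether to inject, or an error that the caller
// should return from reconcile so the request is retried.
func decideInjection(get getFunc, name string) (bool, error) {
	hasEnv, err := get(name)
	switch {
	case errors.Is(err, errNotFound):
		return true, nil // no Deployment yet: new ISVC, inject
	case err != nil:
		// Fail closed: do not guess when the lookup itself failed.
		return false, fmt.Errorf("get deployment %q: %w", name, err)
	default:
		return hasEnv, nil // inject only if it is already there
	}
}

func main() {
	notFound := func(string) (bool, error) { return false, errNotFound }
	timeout := func(string) (bool, error) { return false, errTimeout }

	inject, err := decideInjection(notFound, "demo-predictor")
	fmt.Println(inject, err)

	inject, err = decideInjection(timeout, "demo-predictor")
	fmt.Println(inject, err)
}
```

Returning the wrapped error instead of defaulting to "inject" means a transient API failure cannot trigger the restart the PR is meant to prevent.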

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: da29cca9-e627-473b-9bd0-722523d798c6

📥 Commits

Reviewing files that changed from the base of the PR and between 5e62bc9 and 9863516.

📒 Files selected for processing (1)

pkg/controller/v1beta1/inferenceservice/components/predictor.go

spolti · 2026-04-23T11:28:41Z

/retest

spolti

Please add some tests as well.

coderabbitai

♻️ Duplicate comments (1)

pkg/controller/v1beta1/inferenceservice/components/predictor.go (1)
161-189: ⚠️ Potential issue | 🟠 Major

Check scope (single container) and inject scope (all containers) still disagree.

The existing-deployment probe only inspects constants.InferenceServiceContainerName (line 173), but the injection on lines 183-188 walks every entry in podSpec.Containers. In collocation mode (predictor + transformer sidecar) the decision is therefore made on the kserve-container alone and then applied to sidecars whose existing state was never examined. Concretely, if the sidecar in the current Deployment doesn't carry INFERENCE_SERVICE_NAME but the kserve-container does, this path sets injectEnvVar = true and mutates the sidecar's env — changing the pod template hash and causing the very upgrade restart this PR is trying to avoid.

Either narrow the injection to constants.InferenceServiceContainerName to match the check, or (preferred) build a per-container map from the existing Deployment and gate each AddEnvVarToPodSpec call against its own container's prior state. A concrete diff was proposed in an earlier review round and still applies.
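
The preferred option, gating each container on its own prior state, might look roughly like the following. The `Container` type is a simplified stand-in for the Kubernetes pod spec, `injectPerContainer` is a hypothetical helper rather than `isvcutils.AddEnvVarToPodSpec`, and the container names are illustrative. The sketch assumes the goal stated in the PR: a container is only touched if its counterpart in the existing Deployment already carries the variable.

```go
package main

import "fmt"

const envKey = "INFERENCE_SERVICE_NAME"

// Container is a simplified stand-in for a pod spec container.
type Container struct {
	Name string
	Env  map[string]string
}

// injectPerContainer sets envKey on each desired container whose counterpart
// in existing already has it. A nil existing slice means there is no
// Deployment yet, so every container receives the variable.
func injectPerContainer(desired, existing []Container, value string) {
	// Record, per container name, whether the env var was already set.
	hadEnv := make(map[string]bool, len(existing))
	for _, c := range existing {
		_, ok := c.Env[envKey]
		hadEnv[c.Name] = ok
	}
	for i := range desired {
		if existing != nil && !hadEnv[desired[i].Name] {
			continue // injecting would change this container's template
		}
		if desired[i].Env == nil {
			desired[i].Env = make(map[string]string)
		}
		desired[i].Env[envKey] = value
	}
}

func main() {
	existing := []Container{
		{Name: "kserve-container", Env: map[string]string{envKey: "my-isvc"}},
		{Name: "transformer-container"}, // sidecar without the env var
	}
	desired := []Container{{Name: "kserve-container"}, {Name: "transformer-container"}}

	injectPerContainer(desired, existing, "my-isvc")
	for _, c := range desired {
		_, ok := c.Env[envKey]
		fmt.Printf("%s injected=%v\n", c.Name, ok)
	}
}
```

With this shape the sidecar in the example is left untouched while the main container keeps its variable, so neither container's spec differs from the existing Deployment.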
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1beta1/inferenceservice/components/predictor.go` around lines
161 - 189, The code decides whether to inject INFERENCE_SERVICE_NAME by
inspecting only constants.InferenceServiceContainerName in existingDeployment
but then applies injection to every container in podSpec.Containers, which can
wrongly mutate sidecars and trigger restarts; fix by building a per-container
map from existingDeployment (map container name -> hasEnv) using
utils.GetEnvVarValue, then in the injection loop call
isvcutils.AddEnvVarToPodSpec only for containers where the map shows the env is
already present (or alternatively restrict injection to
constants.InferenceServiceContainerName to match the check); update references
to existingDeployment, constants.InferenceServiceContainerName, injectEnvVar,
podSpec.Containers, and isvcutils.AddEnvVarToPodSpec accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: f6d1ccc8-a343-488b-b79b-901cb3a52fbb

📥 Commits

Reviewing files that changed from the base of the PR and between 9863516 and b12ccb2.

📒 Files selected for processing (2)

pkg/controller/v1beta1/inferenceservice/components/predictor.go
pkg/controller/v1beta1/inferenceservice/components/transformer.go

andresllh · 2026-04-23T15:42:54Z

/retest-required

andresllh · 2026-04-23T16:15:02Z

/retest

andresllh · 2026-04-23T16:53:54Z

/retest-required

…s to avoid pod restart on upgrade (RHOAIENG-59268)
When upgrading from RHOAI 3.3.2 to 3.4, the new controller unconditionally injected INFERENCE_SERVICE_NAME into all RawDeployment pod templates, causing a rolling restart of every InferenceService in the cluster simultaneously. The fix checks whether the existing Deployment already has the env var. If it does not (pre-upgrade state), injection is skipped to preserve the pod template hash and avoid the restart. New ISVCs always get the env var on creation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>

…n (RHOAIENG-59268)
- Use apierrors.IsNotFound to distinguish missing deployment from real API errors; return error on non-NotFound failures instead of silently injecting
- Apply the same INFERENCE_SERVICE_NAME injection guard to the transformer component to prevent rolling restarts on upgrade
Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lper (RHOAIENG-59268)
Move the upgrade-safety check into shouldInjectInferenceServiceName in component.go so predictor and transformer share the same logic without duplication.
Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…9268)
Covers: new ISVC (no deployment), pre-upgrade (env var absent → skip), post-upgrade (env var present → inject), and unmatched container name.
Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Andres Llausas <allausas@redhat.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

andresllh · 2026-04-23T20:08:05Z

/retest-required

andresllh · 2026-04-23T20:13:14Z

/retest

andresllh · 2026-04-23T20:22:03Z

/retest

andresllh · 2026-04-23T20:30:09Z

/retest-required

andresllh · 2026-04-23T20:31:09Z

/retest

andresllh · 2026-04-24T12:54:26Z

/retest

andresllh · 2026-04-24T12:57:13Z

/retest

andresllh · 2026-04-24T13:08:04Z

/retest

VedantMahabaleshwarkar

/lgtm

openshift-ci · 2026-04-24T13:42:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andresllh, VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [VedantMahabaleshwarkar,andresllh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andresllh · 2026-04-24T14:45:49Z

/group-test

andresllh · 2026-04-24T16:28:14Z

/group-test

andresllh · 2026-04-24T18:42:04Z

/group-test

israel-hdez · 2026-04-24T23:39:30Z

/group-test

rhods-ci-bot · 2026-04-25T01:05:17Z

@andresllh: The following test has Failed:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

Install ORAS (see the ORAS installation guide).
Download artifacts with the following commands:

mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:kserve-group-test-gw2vg

github-project-automation Bot added this to ODH Model Serving Planning Apr 22, 2026

github-project-automation Bot moved this to New/Backlog in ODH Model Serving Planning Apr 22, 2026

openshift-ci Bot added the approved label Apr 22, 2026

coderabbitai Bot reviewed Apr 22, 2026

Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go Outdated

Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go Outdated

spolti reviewed Apr 23, 2026

Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go

spolti reviewed Apr 23, 2026

Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go Outdated

spolti requested changes Apr 23, 2026

coderabbitai Bot reviewed Apr 23, 2026

andresllh requested review from VedantMahabaleshwarkar and spolti April 23, 2026 15:28

andresllh and others added 6 commits April 23, 2026 13:00

chore: fix copyright year in component_test.go

f32cf22

Signed-off-by: Andres Llausas <allausas@redhat.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: gofmt alignment in component_test.go

89a77a1

Signed-off-by: Andres Llausas <allausas@redhat.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

andresllh force-pushed the RHOAIENG-59268-master branch from f1f4c8f to 89a77a1 Compare April 23, 2026 17:02

VedantMahabaleshwarkar approved these changes Apr 24, 2026

openshift-ci Bot assigned VedantMahabaleshwarkar Apr 24, 2026

openshift-ci Bot added the lgtm label Apr 24, 2026
