fix: skip INFERENCE_SERVICE_NAME injection on pre-existing deployments to avoid pod restart on upgrade (RHOAIENG-59268)#1435

Open
andresllh wants to merge 6 commits into opendatahub-io:master from andresllh:RHOAIENG-59268-master

Conversation

@andresllh
Member

@andresllh andresllh commented Apr 22, 2026

Summary

  • Fixes a bug where upgrading from RHOAI 3.3.2 to 3.4 caused all RawDeployment InferenceServices in the cluster to restart simultaneously
  • Root cause: upstream commit be2cb412f (feat: new env var INFERENCE_SERVICE_NAME, kserve/kserve#5013) unconditionally injects INFERENCE_SERVICE_NAME into every pod template on reconcile, which changes the pod template hash and triggers a rolling restart of every existing Deployment
  • Fix: before injecting the env var, check whether the existing Deployment already has it. If it does not (pre-upgrade state), skip injection to preserve the pod template hash. New ISVCs always get the env var on creation
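
The decision described in the Fix bullet can be sketched as follows. This is a simplified, stdlib-only model, not the code in this PR: `EnvVar`, `Container`, `Deployment` and `shouldInject` are hypothetical stand-ins for the Kubernetes types and the real helper, and returning false for an unmatched container name is an assumption of the sketch.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for the Kubernetes core/v1 types.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

// Deployment is a simplified stand-in; a nil pointer means "no Deployment exists yet".
type Deployment struct{ Containers []Container }

const envKey = "INFERENCE_SERVICE_NAME"

// shouldInject reports whether the env var may be added to the named container
// without altering the pod template of an already-running Deployment.
func shouldInject(existing *Deployment, containerName string) bool {
	if existing == nil {
		return true // new ISVC: nothing is running, so injection cannot cause a restart
	}
	for _, c := range existing.Containers {
		if c.Name != containerName {
			continue
		}
		for _, e := range c.Env {
			if e.Name == envKey {
				return true // already present: setting the same value again changes nothing
			}
		}
	}
	// Pre-upgrade Deployment (or, by assumption here, an unmatched container name):
	// skip injection to preserve the pod template hash.
	return false
}

func main() {
	preUpgrade := &Deployment{Containers: []Container{{Name: "kserve-container"}}}
	postUpgrade := &Deployment{Containers: []Container{{
		Name: "kserve-container",
		Env:  []EnvVar{{Name: envKey, Value: "my-isvc"}},
	}}}
	fmt.Println("new ISVC:", shouldInject(nil, "kserve-container"))
	fmt.Println("pre-upgrade:", shouldInject(preUpgrade, "kserve-container"))
	fmt.Println("post-upgrade:", shouldInject(postUpgrade, "kserve-container"))
}
```

The trade-off is that pre-upgrade Deployments keep running without the env var until something else changes their pod template.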

Test plan

  • Reproduced the bug: swapping to the 3.4 controller against a 3.3.2 ISVC deployment created a new ReplicaSet and injected INFERENCE_SERVICE_NAME
  • Verified the fix: swapping to the fixed controller against the same pre-upgrade ISVC left the ReplicaSet unchanged and did not inject the env var
  • Confirmed new ISVCs still receive INFERENCE_SERVICE_NAME on creation
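
The new ReplicaSet seen while reproducing the bug follows from how a Deployment keys its ReplicaSets on a hash of the pod template. The sketch below is a toy illustration only: `PodTemplate` and `templateHash` are hypothetical, and hashing a JSON serialization with FNV-1a merely stands in for the controller's real hashing. It shows that adding one env var yields a different hash.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// EnvVar and PodTemplate are simplified, hypothetical stand-ins for the Kubernetes types.
type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type PodTemplate struct {
	Image string   `json:"image"`
	Env   []EnvVar `json:"env,omitempty"`
}

// templateHash hashes the serialized template, so any change to the template,
// including one extra env var, produces a different value.
func templateHash(t PodTemplate) uint32 {
	b, err := json.Marshal(t)
	if err != nil {
		panic(err)
	}
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

func main() {
	before := PodTemplate{Image: "example/model-server:1.0"}
	after := before
	after.Env = append(after.Env, EnvVar{Name: "INFERENCE_SERVICE_NAME", Value: "my-isvc"})

	fmt.Printf("before=%d after=%d changed=%t\n",
		templateHash(before), templateHash(after), templateHash(before) != templateHash(after))
}
```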

Related

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Optimized environment-variable injection for InferenceService predictors and transformers to avoid unnecessary pod rollouts during upgrades. The controller now inspects the existing Deployment and skips injecting INFERENCE_SERVICE_NAME when the Deployment does not already carry it, reducing service disruption during operator upgrades. Error handling for Deployment lookups was also improved.

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 8607bccb-3855-47d1-ac32-3470e6b0b045

📥 Commits

Reviewing files that changed from the base of the PR and between b12ccb2 and 89a77a1.

📒 Files selected for processing (4)
  • pkg/controller/v1beta1/inferenceservice/components/component.go
  • pkg/controller/v1beta1/inferenceservice/components/component_test.go
  • pkg/controller/v1beta1/inferenceservice/components/predictor.go
  • pkg/controller/v1beta1/inferenceservice/components/transformer.go
📝 Walkthrough

Walkthrough

The reconcile logic for predictor and transformer pod specs now conditionally injects the INFERENCE_SERVICE_NAME environment variable. The controller first attempts to GET the existing Deployment for the component; if the Deployment exists and its designated container does not yet define the env var (pre-upgrade state), injection is skipped to avoid triggering a restart during upgrades. If the Deployment is NotFound (new ISVC) or the env var is already present, injection proceeds as before. Non-NotFound GET errors are now handled explicitly.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Security & Code Quality Issues

  • Error handling of the Deployment GET must distinguish apierrors.IsNotFound(err) from other API errors; treating every error as "no Deployment" is improper handling of exceptional conditions (CWE-703). Action: ensure non-NotFound errors return/requeue and are explicitly logged with context (component, namespace, error).

  • Potential race condition between Deployment inspection and subsequent pod changes (time-of-check vs time-of-use). Race conditions are CWE-362. Action: consider reconciling based on observed state changes (requeue on update) or make injection decision idempotent and safe if state changes.

  • Silent skipping of env var injection may create mixed cluster state where some pods lack the env var. This is a configuration inconsistency risk. Action: document invariant or enforce through admission/validation or explicit reconciliation that ensures all runtime pods meet required env expectations.

  • Namespace assumption risk: code assumes Deployment lives in isvc.Namespace. If cross-namespace patterns exist, this may be incorrect. Action: validate namespace invariants or resolve resource references explicitly.

  • Insufficient logging context on skips and errors reduces diagnosability. Action: include isvc name, namespace, component (predictor/transformer), and Deployment name in logs.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fix: conditional injection of INFERENCE_SERVICE_NAME on pre-existing deployments to prevent pod restarts during upgrades, which directly addresses the changeset's core objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/v1beta1/inferenceservice/components/predictor.go`:
- Around line 160-162: The transformer currently unconditionally injects
INFERENCE_SERVICE_NAME causing unnecessary rolling restarts; update the
transformer logic to mirror the predictor behavior by checking the existing
Deployment before calling AddEnvVarToPodSpec: if there's no existing Deployment
(new ISVC) allow injecting, but if an existing Deployment exists only call
AddEnvVarToPodSpec when that Deployment's PodSpec already contains the
INFERENCE_SERVICE_NAME env var. Locate the transformer component code that
invokes AddEnvVarToPodSpec and add the same existence-and-env-presence guard
used in the predictor (check existing Deployment's containers for
INFERENCE_SERVICE_NAME) so injection only happens when safe.
- Around line 163-177: The current logic treats any Get error as "no deployment"
and only checks a single container name, causing incorrect injection decisions;
update p.client.Get(...) error handling to return the error when errGet != nil
and !apierrors.IsNotFound(errGet) (fail closed), and when the Deployment exists
(errGet == nil) compute injectEnvVar by iterating the new pod spec containers
(e.g., podSpec.Containers or the variable used to build the Deployment) and for
each new container find the matching container in
existingDeployment.Spec.Template.Spec.Containers by name and verify
constants.InferenceServiceNameEnvVarKey is present; set injectEnvVar = true only
if every new container has that env var in the existing deployment, otherwise
set injectEnvVar = false and log the skip using p.Log.Info; refer to
constants.PredictorServiceName and constants.InferenceServiceContainerName where
appropriate.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: da29cca9-e627-473b-9bd0-722523d798c6

📥 Commits

Reviewing files that changed from the base of the PR and between 5e62bc9 and 9863516.

📒 Files selected for processing (1)
  • pkg/controller/v1beta1/inferenceservice/components/predictor.go

Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go Outdated
Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go Outdated
@spolti
Member

spolti commented Apr 23, 2026

/retest

Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go
Comment thread pkg/controller/v1beta1/inferenceservice/components/predictor.go Outdated
Member

@spolti spolti left a comment

Please add some tests as well.


@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
pkg/controller/v1beta1/inferenceservice/components/predictor.go (1)

161-189: ⚠️ Potential issue | 🟠 Major

Check scope (single container) and inject scope (all containers) still disagree.

The existing-deployment probe only inspects constants.InferenceServiceContainerName (line 173), but the injection on lines 183-188 walks every entry in podSpec.Containers. In collocation mode (predictor + transformer sidecar) the decision is therefore made on the kserve-container alone and then applied to sidecars whose existing state was never examined. Concretely, if the sidecar in the current Deployment doesn't carry INFERENCE_SERVICE_NAME but the kserve-container does, this path sets injectEnvVar = true and mutates the sidecar's env — changing the pod template hash and causing the very upgrade restart this PR is trying to avoid.

Either narrow the injection to constants.InferenceServiceContainerName to match the check, or (preferred) build a per-container map from the existing Deployment and gate each AddEnvVarToPodSpec call against its own container's prior state. A concrete diff was proposed in an earlier review round and still applies.
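
The preferred option, gating each container on its own prior state, might look like the sketch below. It is a simplified, stdlib-only model: `Container`, `EnvVar`, `hasEnv` and `injectPerContainer` are hypothetical names, not the helpers used in predictor.go, and skipping containers that are absent from the existing Deployment is an assumption this review does not settle.

```go
package main

import "fmt"

// EnvVar and Container are simplified stand-ins for the Kubernetes core/v1 types.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

const envKey = "INFERENCE_SERVICE_NAME"

// hasEnv reports whether the container defines the given env var.
func hasEnv(c Container, key string) bool {
	for _, e := range c.Env {
		if e.Name == key {
			return true
		}
	}
	return false
}

// injectPerContainer returns a copy of desired in which envKey is added only to
// containers whose same-named counterpart in existing already had it.
func injectPerContainer(existing, desired []Container, value string) []Container {
	had := make(map[string]bool, len(existing))
	for _, c := range existing {
		had[c.Name] = hasEnv(c, envKey)
	}
	out := make([]Container, len(desired))
	for i, c := range desired {
		// Copy the env slice so the caller's desired spec is not mutated.
		out[i] = Container{Name: c.Name, Env: append([]EnvVar(nil), c.Env...)}
		if had[c.Name] && !hasEnv(c, envKey) {
			out[i].Env = append(out[i].Env, EnvVar{Name: envKey, Value: value})
		}
	}
	return out
}

func main() {
	existing := []Container{
		{Name: "kserve-container", Env: []EnvVar{{Name: envKey, Value: "my-isvc"}}},
		{Name: "sidecar"}, // hypothetical sidecar that never had the env var
	}
	desired := []Container{{Name: "kserve-container"}, {Name: "sidecar"}}
	for _, c := range injectPerContainer(existing, desired, "my-isvc") {
		fmt.Printf("%s injected=%t\n", c.Name, hasEnv(c, envKey))
	}
}
```

With this gating the sidecar's env is left untouched, so its part of the pod template matches the running Deployment.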

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/v1beta1/inferenceservice/components/predictor.go` around lines
161 - 189, The code decides whether to inject INFERENCE_SERVICE_NAME by
inspecting only constants.InferenceServiceContainerName in existingDeployment
but then applies injection to every container in podSpec.Containers, which can
wrongly mutate sidecars and trigger restarts; fix by building a per-container
map from existingDeployment (map container name -> hasEnv) using
utils.GetEnvVarValue, then in the injection loop call
isvcutils.AddEnvVarToPodSpec only for containers where the map shows the env is
already present (or alternatively restrict injection to
constants.InferenceServiceContainerName to match the check); update references
to existingDeployment, constants.InferenceServiceContainerName, injectEnvVar,
podSpec.Containers, and isvcutils.AddEnvVarToPodSpec accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: f6d1ccc8-a343-488b-b79b-901cb3a52fbb

📥 Commits

Reviewing files that changed from the base of the PR and between 9863516 and b12ccb2.

📒 Files selected for processing (2)
  • pkg/controller/v1beta1/inferenceservice/components/predictor.go
  • pkg/controller/v1beta1/inferenceservice/components/transformer.go

@andresllh
Member Author

/retest-required

@andresllh
Member Author

/retest

@andresllh
Member Author

/retest-required

andresllh and others added 6 commits April 23, 2026 13:00
…s to avoid pod restart on upgrade (RHOAIENG-59268)

When upgrading from RHOAI 3.3.2 to 3.4, the new controller unconditionally
injected INFERENCE_SERVICE_NAME into all RawDeployment pod templates, causing
a rolling restart of every InferenceService in the cluster simultaneously.

The fix checks whether the existing Deployment already has the env var. If it
does not (pre-upgrade state), injection is skipped to preserve the pod template
hash and avoid the restart. New ISVCs always get the env var on creation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>
…n (RHOAIENG-59268)

- Use apierrors.IsNotFound to distinguish missing deployment from real API errors;
  return error on non-NotFound failures instead of silently injecting
- Apply the same INFERENCE_SERVICE_NAME injection guard to the transformer
  component to prevent rolling restarts on upgrade

Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lper (RHOAIENG-59268)

Move the upgrade-safety check into shouldInjectInferenceServiceName in component.go
so predictor and transformer share the same logic without duplication.

Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…9268)

Covers: new ISVC (no deployment), pre-upgrade (env var absent → skip),
post-upgrade (env var present → inject), and unmatched container name.

Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Andres Llausas <allausas@redhat.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andresllh andresllh force-pushed the RHOAIENG-59268-master branch from f1f4c8f to 89a77a1 Compare April 23, 2026 17:02
@andresllh
Member Author

/retest-required

@andresllh
Member Author

/retest

1 similar comment
@andresllh
Member Author

/retest

@andresllh
Member Author

/retest-required

@andresllh
Member Author

/retest

3 similar comments
@andresllh
Member Author

/retest

@andresllh
Member Author

/retest

@andresllh
Member Author

/retest


@VedantMahabaleshwarkar VedantMahabaleshwarkar left a comment

/lgtm

@openshift-ci

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andresllh, VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [VedantMahabaleshwarkar,andresllh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andresllh
Member Author

/group-test

3 similar comments
@andresllh
Member Author

/group-test

@andresllh
Member Author

/group-test

@israel-hdez

/group-test

@rhods-ci-bot

@andresllh: The following test has failed:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:kserve-group-test-gw2vg

Projects

Status: New/Backlog

5 participants